Writing Style Fingerprint Tool Easily Fooled
Urchin writes "Some of the techniques used by literary detectives and courts of law to identify the authorship of text are easily fooled, say US researchers. They found that non-professional writers could hide their identity from 'stylometric' techniques by writing in the style of novelist Cormac McCarthy. Stylometric methods have been used in a number of high-profile legal cases in recent decades, including the 'Unabomber' trial. 'We would strongly suggest that courts examine their methods of stylometry against the possibility of adversarial attacks,' say the researchers."
Could have told you writing analysis was bogus.... (Score:3, Insightful)
Duh! (Score:4, Insightful)
Re:Could have told you writing analysis was bogus. (Score:5, Insightful)
a common feature of correlations (Score:4, Insightful)
Stylometry is essentially a correlational field. It's not that people inherently must write in unique styles identifiable from a few measurable features; there is no strong genetic causation for handwriting or anything like that, the kind that would mean a style truly identifies an individual or a narrow set of individuals. Rather, all else being equal, people in practice do tend to write in ways that let stylometric features distinguish them. But when all else isn't equal, and people are actively trying to thwart that sort of analysis, they are, unsurprisingly, able to do so in a lot of cases.
I suspect that a lot of forensic analysis runs into this problem: it takes some fact that is empirically true among the general population, but only because the general population is not actively trying to thwart you. The set of robust empirical truths about people, ones that hold up even when a person knows you're trying to use them against them and is actively working to stop you, is much smaller.
Re:a common feature of correlations (Score:4, Insightful)
Re:No surprise (Score:4, Insightful)
The theory goes like this: the chance of a false positive on a partial sample is something like 1 in 50 million. You have 50 million people in the database, so you'd expect roughly one false positive on every search. If you're unlucky enough to live close enough to a crime to have plausibly committed it, you could easily find yourself in court.
You'll then have to defend yourself based on a 1 in 50 million probability to a jury who won't understand the statistics. If you haven't got a solid alibi, it would be a tough thing to do.
There's probably a good Terry Pratchett quote about 1 in a million chances to be used here.
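The arithmetic in that comment is easy to check. A quick sketch, using the comment's hypothetical numbers (a 1-in-50-million false-match rate searched against a 50-million-person database):

```python
# Hypothetical numbers from the comment above: a 1-in-50-million
# false-match rate searched against a database of 50 million people.
false_match_odds = 50_000_000      # i.e. 1 in this many
database_size = 50_000_000

# Expected number of false positives per database-wide search:
# database_size * (1 / false_match_odds).
expected_false_positives = database_size / false_match_odds
print(expected_false_positives)  # 1.0

# Probability of at least one false positive somewhere in the search.
p_match = 1 / false_match_odds
p_at_least_one = 1 - (1 - p_match) ** database_size
print(round(p_at_least_one, 3))  # 0.632 -- more likely than not
```

So even with a per-comparison error rate that sounds astronomically small, a database-wide trawl is more likely than not to surface an innocent match, which is exactly the trap the comment describes.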
Misrepresents forensic linguistics (Score:5, Insightful)
As the article says, "the study only attacked some of the less complex stylometry techniques". In fact, I'm surprised that they even considered lexical density, because that varies greatly within a single author's writing. It's usually high at the beginning of a text, usually (not always) falls off gradually, jumps when the subject changes, and so on. I'm not aware of its being used in forensic linguistics (although it is used in analysing texts to identify, for example, objective divisions within a text).
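That instability is easy to see numerically. A toy sketch below uses windowed type-token ratio (unique words divided by total words per window) as a simple stand-in for lexical density; the function name and window size are illustrative choices, not any forensic standard:

```python
# Rough illustration of within-text variability: type-token ratio
# (unique words / total words) computed over fixed-size windows.
def windowed_ttr(text, window=50):
    words = text.lower().split()
    ratios = []
    # Step through the text one non-overlapping window at a time.
    for start in range(0, max(1, len(words) - window + 1), window):
        chunk = words[start:start + window]
        ratios.append(len(set(chunk)) / len(chunk))
    return ratios

uniform = "a b c d " * 30  # deliberately repetitive text
print(windowed_ttr(uniform, window=20))  # every window scores 4/20 = 0.2
```

Run on real prose instead of the repetitive dummy text, the window scores swing widely across a single document, which is the commenter's point: a measure that fluctuates this much within one author is a shaky fingerprint between authors.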
The sort of thing they used in the Derek Bentley case [wikipedia.org] (which contributed to the partial posthumous pardon) was analysis of his statement, which had
That all pointed to the statement not being Bentley's own words, but rather being the police version of his answers to a series of police questions that had been removed from the statement. One aspect of his original trial was a statement "I did not know he was going to use the gun", which was taken as evidence that he knew his accomplice, Craig, had a gun (and the inconsistency with the denial that he knew this, later in the statement, was taken as evidence that he was lying). Since the linguistic analysis shows that this was probably a reply to a question, it seems more likely that it went something like:
Did you know he was going to use the gun?
No.
Which makes sense because he knew at the time of the interview that Craig had a gun.
Yes, of course this sort of thing can be gamed, but it wasn't credible that Bentley would have been capable of such sophisticated gaming. The important thing as far as this thread is concerned is that forensic linguistics doesn't plug in a single measure, turn a handle and come out with a yes/no answer; it uses a whole range of measures and builds up an overall picture of what probably happened.
Re:Did you RTFA? (Score:5, Insightful)
No, but they knew they were being analyzed, and for what. It's trivial to change my style (well, maybe not in English; I don't tend to have the word pool to draw from) and become someone else if I know in advance that my writing would be used to find me.
You could probably, given time and persistence, sift through the thousands and millions of board messages posted everywhere on the internet and find out who I am on other boards. I haven't tried to hide my identity against comparison of writing styles.
I could see this working if applied to notes and texts written by someone who didn't have any reason to assume it would become the subject of an investigation. I'd deem it utterly worthless, though, when applied to ransom notes and the like.
No information is better than bad information... (Score:5, Insightful)
> I don't think anyone has ever sold writing analysis as a unique identifier. But it can be useful.
One problem with that is the human tendency to be overconfident as to how good these tests are. This happens everywhere. Court, business, whatever.
Say you have some metric at work (e.g. lines of code) that's easy to measure. If it's the only measure management has, it's what they'll use to judge how well you're doing. This applies even if the results are absurd, because they would rather believe they have *some* idea what's going on than accept the fact that they have no idea what's going on.
In summary, sometimes NO information is better than bad information, but people are very reluctant to accept that fact.
Re:Could have told you writing analysis was bogus. (Score:3, Insightful)
It is completely subjective and there is no real hard science to support such tests.
I beg to differ. There's very little that is subjective in stylometry; the subjective part is interpreting the results, not producing them. Take a look at http://en.wikipedia.org/wiki/Stylometry [wikipedia.org] and tell me which of the methods described there you think is "completely subjective".
The main problem with stylometry is not the methods but the data. As TFA describes, changing writing style throws off the results, at least to some extent. Stylometry relies on the fact that old habits die hard, but if someone is aware that the text they are producing might be subjected to stylometric analysis, they can employ various mechanisms to avoid identification, and will probably have a better chance of succeeding than if they were writing casually. However, most texts used in court have been produced casually (letters, emails, text messages) and almost always carry some traits unique to their author. Even when people plagiarize a known author, they always miss some subtlety in his/her style that gives the plagiarism away. These subtle differences in style are usually caught somewhere in the stylometric analysis.
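To make "not subjective in producing the results" concrete, here is a minimal sketch of one mechanical stylometric measurement: comparing relative frequencies of common function words between two texts. The word list and the mean-absolute-difference distance are simplified illustrations of my own, not any specific forensic protocol:

```python
# Toy stylometric comparison: relative frequencies of a few common
# English function words, compared between two texts.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def profile(text):
    """Relative frequency of each tracked function word in the text."""
    words = text.lower().split()
    total = len(words)
    return [words.count(w) / total for w in FUNCTION_WORDS]

def distance(text_a, text_b):
    """Mean absolute difference between two frequency profiles."""
    pa, pb = profile(text_a), profile(text_b)
    return sum(abs(a - b) for a, b in zip(pa, pb)) / len(FUNCTION_WORDS)

sample = "the cat sat on the mat and it was in the hat"
print(distance(sample, sample))  # 0.0 -- identical texts, identical profile
```

The measurement itself is pure counting; any two analysts get the same numbers. The judgment call, exactly as the comment says, comes afterward, when deciding whether a given distance is small enough to attribute two texts to one author.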
It occurs to me now that you may be talking about handwriting analysis, in which case my reply is completely irrelevant and you have completely missed the point of the summary and TFA.
All evidence is tentative (Score:3, Insightful)
Comment removed (Score:3, Insightful)