
Linguistics Identifies Anonymous Users 215

Posted by Soulskill
from the that's-why-i-run-my-emails-through-google-translate-a-few-times dept.
mask.of.sanity writes "Researchers have examined writing styles to identify previously anonymous carders and hackers operating on underground forums. Up to 80 percent of users who wrote at least 5000 words across their posts could be identified using linguistic techniques. Techniques such as stylometric analysis were used to track users who posted across different forums, and could even be used to unveil authors of thesis papers or blogs who had taken to underground networks."

  • Anonymous First Post (Score:5, Informative)

    by Anonymous Coward on Wednesday January 09, 2013 @03:23AM (#42529101)

    Anonymous First Post... you'll never guess who I am

    • by Anonymous Coward on Wednesday January 09, 2013 @03:25AM (#42529121)

      4990.5 more words please.

      • by Anonymous Coward on Wednesday January 09, 2013 @04:09AM (#42529349)
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec in tincidunt nisi. Vivamus quis ligula non lorem feugiat congue ut a ipsum. Vivamus iaculis elementum tellus eget ullamcorper. Nam sed lacus at felis volutpat egestas. Aliquam hendrerit mauris a felis fringilla tristique. Proin commodo eleifend leo suscipit pulvinar. Praesent velit lectus, venenatis ac volutpat vitae, scelerisque sed diam. Integer eu felis quis erat ultricies sodales. Etiam eu turpis massa. In vel velit nec purus tristique vestibulum. Cras eleifend diam ut dolor facilisis convallis. Morbi velit ligula, aliquam vitae ullamcorper et, dapibus sed augue. Nullam euismod urna in purus condimentum suscipit. Fusce dolor magna, dictum quis elementum quis, mollis in sem. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla imperdiet lectus sit amet risus interdum vel congue odio venenatis. Proin lobortis urna ac tortor auctor id porttitor urna auctor. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Integer viverra consequat nisl, ac adipiscing dui feugiat quis. Ut ut tortor urna. Pellentesque velit orci, mollis eu venenatis quis, convallis nec risus. Donec quis enim ac ante placerat accumsan. Fusce ut erat in tortor ullamcorper aliquam. Aenean ut est turpis. Nam ut elit justo. Suspendisse potenti. Praesent et nulla eget sem interdum pellentesque. Nunc sagittis metus sed mauris lacinia consequat. Fusce velit velit, semper at euismod a, euismod vitae enim. Vivamus elementum commodo faucibus. Suspendisse dictum rutrum leo at lobortis. Nam ac lectus id velit hendrerit rutrum vitae at mauris. Integer quis ante ullamcorper dui gravida auctor eu ut lectus. Curabitur laoreet sapien at tortor elementum consectetur. Etiam faucibus tempor sem, sed ultricies felis semper eget. Suspendisse odio lacus, interdum eu rhoncus ut, iaculis vitae enim. Morbi egestas ultricies lorem at tempus. Donec iaculis purus vel tellus cursus elementum. 
Nulla fermentum vulputate lorem sit amet pellentesque. Nunc quam lacus, consectetur et convallis non, pharetra dapibus diam. Maecenas laoreet ornare vehicula. Phasellus vitae odio diam. Ut facilisis nisi eu sapien elementum sit amet molestie arcu consectetur. Nulla in tortor urna, in elementum tellus. Maecenas convallis nunc purus, eget pretium purus. Suspendisse nec nibh ac augue condimentum adipiscing quis et lorem. Integer eget lorem velit. Nullam volutpat metus sit amet ante feugiat ac cursus sem congue. Pellentesque dolor nulla, facilisis id hendrerit eget, commodo eu urna. Donec ut interdum nibh. Sed nunc nisi, commodo non congue vitae, tempus ut ligula. Donec massa dui, viverra eget tempus ut, ornare eu ligula. Proin quis posuere diam. Phasellus at risus quam, id cursus odio. Sed fermentum, tortor eu iaculis sollicitudin, erat augue ornare nisi, eu mattis neque massa ac odio. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Sed varius, orci eget rhoncus egestas, mi nisl mattis sapien, non ultrices nulla elit porta nunc. Praesent mauris lectus, ultrices at interdum quis, euismod accumsan arcu. Pellentesque in dolor libero, vitae tincidunt dui. Nunc rhoncus ante in nulla sagittis ullamcorper. Curabitur velit odio, tempus sit amet lobortis eu, condimentum sit amet massa. Maecenas convallis facilisis arcu, quis accumsan velit tincidunt ut. Vivamus a ante orci, at mattis nibh. Nulla in diam est, vitae semper purus. Donec ut odio augue. Etiam tempor ultricies luctus. Quisque fringilla tincidunt rutrum. Phasellus et justo ut lorem imperdiet semper. Maecenas et justo lectus, ac dictum dui. Morbi sit amet venenatis neque. Donec interdum enim vel velit commodo pulvinar. Aenean nisl erat, bibendum id tincidunt sed, sagittis sagittis mi. Curabitur dui urna, venenatis id placerat nec, consectetur sagittis mi. Phasellus eleifend condimentum lorem et blandit. Pellentesque at lorem nisl, quis ullamcorper nisi. Suspendisse potenti. 
In id orci massa, in hendrerit ligula. Nunc elementum mi in nisl posuere ut tincidunt nibh placerat. Mauris venen
        • by Anonymous Coward on Wednesday January 09, 2013 @04:52AM (#42529537)

          I identified you. You are Cicero.

        • by ignavus (213578)

          tl;dr

      • .5?
    • by mrbester (200927)

      We can narrow it down to someone who is particular about correct capitalisation (and therefore probably spelling, punctuation and grammar) denoting an education and attention to detail not normally seen in forum posts. As this is a more technical forum you most likely program in a language where letter case is of paramount importance and have done so for at least 5 years in a professional position. You probably also write reports indicating a level of seniority.

      That should reduce the number of likely candid

    • Classic - kudos to you for a great laugh. I was thinking though, "this study doesn't help much because it's rare to find places where people write more than a line or two anymore."

      Go back to the old days of Usenet (80s, early 90s) and posts were long, well thought-out, and useful. Look at OLGA, for example, which collected written music in TAB format for guitarists (ha - remember when THAT was the biggest threat to the music industry?). Tons of useful stuff. Hardly anyone does that anymore; it's mostly

      • Classic - kudos to you for a great laugh. I was thinking though, "this study doesn't help much because it's rare to find places where people write more than a line or two anymore." Go back to the old days of Usenet (80s, early 90s) and posts were long, well thought-out, and useful.

        Not particularly... you just hang out in the 'wrong' places (not just websites/forums, but usenet groups as well), both for the length and for the nature of the writing.

        It's been tough to get people to pay attention to the forum

  • by Omnifarious (11933) * <`eric-slash' `at' `omnifarious.org'> on Wednesday January 09, 2013 @03:35AM (#42529167) Homepage Journal

    I worked for a smallish (but not incredibly tiny, maybe 100 employees) company and wrote a letter to the CEO once. We'd been castigated by someone who'd taken over the local office because the company was doing poorly. A number of austerity measures were implemented. I did not find those to be that annoying because I realized it was either that or not have a job. But the castigation didn't sit well with me. We were in trouble because of the decisions of a few bad managers, not the behavior of average employees.

    So I wrote a letter about it. He stripped my name off and presented it in an executive meeting to all the people directly under him. He asked "Why am I getting letters like this?". Everybody who worked in my office immediately knew who it was. I had a distinctive writing voice, and a strong reputation.

    It did not lead to me being fired. I was actually highly respected there. It led to me being encouraged to have an honest sit-down talk with the new manager for our division (the guy who'd made the speech I wasn't happy about). I think we both came away from that meeting a lot happier about the other.

    But that was a strong lesson to me. If I ever really want to be anonymous I'm going to have to purposely work on adopting a completely different writing style. And I will have to keep a wall up between styles and never 'slip'.

    • You can also have someone else write it for you.
      • If you're trying to avoid having two different identities associated when you're having an IRC conversation or something, that could get really tricky.

      • by GNious (953874)

        Write it in a different language, then run it through 5 different translation engines across a dozen languages, ending in whichever is the native language of the recipient.... that should throw them for a loop.

      • I don't know about him, but I try to pick my words very carefully. I've reworded the sentences in this post over a few dozen times already. Handing it off to someone else would make me cringe.
    • by toygeek (473120)

      Alfred, is that you! It's been years, old buddy! How are you?

    • But that was a strong lesson to me. If I ever really want to be anonymous I'm going to have to purposely work on adopting a completely different writing style. And I will have to keep a wall up between styles and never 'slip'.

      Have someone you trust, who is not in the company, rewrite your missive for you. That's probably the safest way.

    • I hand wrote a terrible review for one of my trainers at my last job. She matched me based on my signature on the sign in sheet.

      Kind of dumb of me to figure anything hand written was really anonymous, though.

    • by tnk1 (899206)

      Style may have been important in that case, but your co-workers, having a fuller appreciation of your personality than we would on Slashdot, might also have recognized the concerns raised as unique to you. The fact that you wrote the letter at all might have narrowed the possibilities considerably, since many people either just bitch behind backs or go head-down and tolerate it.

      • Yes, you're right. I was probably one of only 2-5 people who would've written such a letter who worked in that office. So yeah, that probably helped at least as much as style.

    • by Darinbob (1142669)

      I have actually gone back and changed things I wrote before submitting as Anonymous Coward, or on a second account on other forums, because it looked too much like how I write. I've even gone and changed things I submit normally because it felt too much like me. I do find myself making spelling or grammar mistakes that I know are wrong but which just come out when I don't slow down.

      So I think smart people could get around this sort of problem. However a lot of posters today just go ahead and post their f

  • In addition to these metrics, others can be added as well, e.g.: post date, size, tabulation, punctuation, capitalization, regional vocabulary, etc. One can also add frequency-space analysis or naive Bayesian filters, either to increase precision or to probe against other texts. Anyone interested in investing in text-rewriter technology to both detect similarities and automatically rewrite?
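
A rough sketch of how a few of the surface metrics listed above (punctuation, capitalization, common-word frequency) could be turned into a comparable profile. This is not the researchers' method: the word list, the feature set, and the function names are all made up for illustration, and it compares profiles with plain cosine similarity rather than a Bayesian filter.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny list of function words; a real system would use hundreds.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "is", "i"]
PUNCTUATION = ".,;:!?-'\""


def style_vector(text):
    """Return a list of surface-style features for one text."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    counts = Counter(w.lower() for w in words)
    # Relative frequency of each function word.
    features = [counts[w] / n_words for w in FUNCTION_WORDS]
    # Rate of each punctuation mark per character.
    features += [text.count(p) / n_chars for p in PUNCTUATION]
    # Share of words that start with a capital letter.
    features.append(sum(1 for w in words if w[0].isupper()) / n_words)
    # Average word length, scaled down so it does not dominate.
    features.append(sum(len(w) for w in words) / n_words / 10.0)
    return features


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_author(unknown_text, known_texts):
    """known_texts maps a name to sample text; returns (name, similarity)."""
    target = style_vector(unknown_text)
    scored = ((name, cosine(target, style_vector(sample)))
              for name, sample in known_texts.items())
    return max(scored, key=lambda pair: pair[1])
```

Calling `closest_author(post, {"alice": alice_posts, "bob": bob_posts})` returns the best-matching name and its score. With only a few hundred words per author the scores are close to noise, which fits the 5000-word threshold quoted in the summary.
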
  • by kawabago (551139) on Wednesday January 09, 2013 @03:40AM (#42529203)
    I'd be rather surprised if someone else couldn't.
  • by Nossie (753694) <IanHarvie.4Development@Net> on Wednesday January 09, 2013 @03:45AM (#42529233)

    "Leetspeak, an alternative alphabet popular in some forum circles, cannot be translated."

    *sigh* does this mean I must resent people that use this form of communication less?

    I'm not so sure I can stoop so low.

  • by joshamania (32599) <jggramlichNO@SPAMyahoo.com> on Wednesday January 09, 2013 @03:55AM (#42529281) Homepage

    This is so bad I don't know where to begin. There is nothing, ever, that excuses this. For every zodiac crazy serial killer or copyright scofflaw they try to apply this to (and fail) there will be thousands and thousands of people that will be persecuted by organizations and governments for expressing their opinions. While this won't have a big effect in the West for half a generation, oppressive governments are going to be all over this.

    And then, in ten or fifteen years, the youth will have grown with this technology and become accustomed to it...accepting it. Just like facebook has been accepted.

    I'd move to Mars when it's possible but some bureaucrat will analyze everything I've ever written on the interwebz (and I've been mostly not stupid about shit I've written online since 1995 or so) and make some arbitrary decision about how I'm not acceptable because I'm not a huge fan of authority or some such crap.

    Way to go humanity.

    • by famebait (450028)

      Not to mention: Mars will be worse.

    • by aaaaaaargh! (1150173) on Wednesday January 09, 2013 @06:07AM (#42529937)

      Are you serious?

      You write as if some new method had been invented. There is no news in the above article. Authorship identification has been a reliable tool for many decades; a whole branch of linguistics (forensic linguistics) deals with it and similar topics like dialect recognition. Under certain circumstances you can even identify personality traits of the author; check out content analysis software like LIWC [liwc.net] for example.

      And, yes, plenty of serial killers and blackmailers have been captured with the help of these methods.

      • by joshamania (32599)

        See, someone has already drunk the kool-aid. :-) Identify personality traits...sigh. You speak in the language of big brother. So once this method/technique/software gets outside of whatever biolab it is currently sequestered in, how long, you think, before it's used for police phishing expeditions?

        "Hey Bob, I'm bored...ever since they legalized pot I've had nothing to arrest people for for no reason. What's this I hear about linguistics and personality traits?"

        There are MILLIONS of people in prison in t

    • by rmstar (114746)

      This is so bad I don't know where to begin.

      Well, I for one look forward to the mess these methods will cause in academia, where it is likely that they can be used to identify the authors of referee reports.

      • by doom (14564)

        Well, I for one look forward to the mess these methods will cause in academia, where it is likely that they can be used to identify the authors of referee reports.

        It's not needed. There's already a limited pool of "peers" to use for "anonymous" peer review, and by definition they all know each other, and are familiar with each other's patterns of thought. "Oh look, Fred at MIT is hassling us about using linear regression again."

    • Haven't you heard? We can take "thing X" that has confirmed kills of 260 million people, but if we say, "think of the children" then people take to the streets demanding "thing X".

      • by joshamania (32599)

        Negative...we take "thing X" that has confirmed kills of 2 people and confirmed annoyance of 1 busybody and it's "BWAAAAAAAA THINK OF THE CHILDREN!" and then we arrest everyone for being a pedophile.

  • google translate (Score:5, Interesting)

    by sl149q (1537343) on Wednesday January 09, 2013 @04:03AM (#42529321)

    One way to change a bunch of the stylistic cues would be to convert your message to another language and back using Google Translate. Depending on the intermediate language(s), and possibly using different translators, this should neutralize some things.

    • by sdnoob (917382) on Wednesday January 09, 2013 @04:08AM (#42529341)

      using chinese as an intermediary will give you text written by motherboard manual writers. perfect cover.

    • by iktos (166530) *

      I just tried that with a couple of paragraphs: Google Translate returns the exact text including mis-spellings even though it had correctly identified what the mis-spelled words actually should be.
      This suggests that there are language independent methods of "identifying" writers.

    • If you RTFA you'll see that the researchers themselves used Google Translate to convert most of the "bad stuff" into English, because the source text was in Russian. That, right there, makes me question the validity of this research. Also given that they were reading underground forums it's not clear to me how they verified they'd cross-correlated posts correctly. Something to read up on later, I think.
    • It can also alter the meaning of your text. Translation is an inexact art, at best, even for skilled and experienced practitioners - which automatic translators emphatically are *not*.

      This goes times ten if your text includes technical terms, or wording which relies on alternate meanings or connotation. (Things a native reader would either know, or would be reasonably expected to infer from context.) This is why writing in English from non-English speakers (for example) often looks so funny when you enco

  • and could even be used to unveil authors of thesis papers or blogs who had taken to underground networks.

    ... a good reason to do it like zu Guttenberg then... Nobody will tie any of his underground writings to his thesis...

  • by nightgeometry (661444) on Wednesday January 09, 2013 @04:17AM (#42529369) Journal
    Isn't this just the same software that colleges use to detect plagiarism and whether someone else wrote that essay for you? I thought it was in common use in academia.
    • Re:College essays (Score:4, Insightful)

      by ForgedArtificer (1777038) on Wednesday January 09, 2013 @04:35AM (#42529457) Homepage

      Actually, it's the exact opposite.

      Anti-plagiarism software searches for the same content with completely different styles.

      Writer identification involves searching for the same style amongst completely different content.

      • Fair point on plagiarism. The college I do some work for _claims_ to be able to spot when someone else has written your essay for you though. And in fact I thought this did tie into plagiarism - in that the software also aims to identify when writing style changes. Though that was told to me by one of their profs, who, whilst not a complete idiot, was probably only parroting what he was told.
    • by AmiMoJo (196126) *

      On 4chan plagiarism is encouraged. It's called a "meme". In fact copy-pasta is a meme in itself.

  • Pad all communications with cut/paste from various, unrelated news articles and such, fore and aft, randomly alternating how much is padded on each side.

    Or, you can do what I do and use a different font for each letter.

  • Why all the civil-liberties hand-wringing? Just how hard is it to read some of the papers on stylometric analysis to see what markers are used, then write a script that randomises them but preserves the sense of the text? Make it a Firefox plugin so it's done automatically. It's a better solution than using Google Translate to go English to $language, $language to English.

    For extra fun, change your text so its stylometric markers match up with E. L. James, or the leader writer of the Washington Post.
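
A minimal sketch of the "randomise the markers" idea, assuming the markers are just a handful of interchangeable surface forms. The `VARIANTS` table and the function names are hypothetical, and a real stylometric system looks at far more than this, so treat it as a toy rather than a working defence.

```python
import random
import re

# Hypothetical, tiny table of interchangeable forms; illustrative only.
VARIANTS = [
    ("do not", "don't"),
    ("it is", "it's"),
    ("cannot", "can't"),
    ("colour", "color"),
]


def _match_case(original, replacement):
    """Keep a leading capital if the replaced text had one."""
    if original[:1].isupper():
        return replacement[:1].upper() + replacement[1:]
    return replacement


def scramble_markers(text, rng=None):
    """Replace each known form with a randomly chosen equivalent form."""
    if rng is None:
        rng = random.Random()
    for forms in VARIANTS:
        alternatives = "|".join(re.escape(form) for form in forms)
        pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
        text = pattern.sub(
            lambda m: _match_case(m.group(0), rng.choice(forms)), text
        )
    return text
```

Passing a seeded `random.Random` makes the output repeatable, which helps when testing. Text that contains none of the listed forms comes back unchanged, which also shows how little a table this small actually hides.
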
  • The climate change community has a lot of trouble with extremely articulate, anonymous climate deniers, who appear to show up in force and sabotage discussions of climate change on blogs, etc.

    I should imagine that such an algorithm might enable researchers to build profiles over denialist astroturf, and correlate them with known people working for known rightwing think tanks. Employed properly, this might have a massive impact on the rightwing black PR industry.

    • by superwiz (655733)
      Well, at least, that's what they'll claim. And they'll, of course, attempt to use their clout as "the science guys" to claim that their deductions are accurate. They might even get some legitimate linguistics researchers on their side. Grant money buys a lot of consensus.
      • Science doesn't work that way.

        You earn a name for yourself by successfully challenging the status quo. But in order to do that, you need evidence that'll take the scrutiny. So far, there is overwhelming consensus -- bad news for the deniers, because if there is ANY credible evidence refuting AGW, you'd have a million guys all over it.

        Something just tells me that -- at least for the wingnut'o'sphere -- there's nothing "common" about common sense at all.

    • by joshamania (32599)

      But this is the exact kind of evil political use that this stuff is going to be used for. It doesn't make it right because you're using it on Republicans. If anything it makes it MORE wrong because of your acceptability standard...because when they turn around and use it on you they'll have had your prior support.

  • This same story keeps cropping up in various forms, but we've been doing this at least since the 80s or 90s. I don't know why it keeps being rehashed or why people continually seem surprised by it at this point.

    • I suppose the idea that you can bring up authors for a text "out from nowhere" is always a curious concept.
  • "Up to 80 percent of users who wrote at least 5000 words across their posts could be identified using linguistic techniques. Techniques such as stylometric analysis were used to track users who posted across different forums, and could even be used to unveil authors of thesis papers or blogs who had taken to underground networks."

    Not really new. I heard about the techniques a long time ago - in the mid 90s - in the context of an MS-DOS tool which was unintentionally designed to foil the identification methods.

    It was designed for Russian and Belarussian languages (but for English I gather the task should be even easier) and was a byproduct of a Prolog-based system for natural language processing and translation. This particular program made it possible to improve or change writing style, e.g. simplify dry legalese or formalize spoken-like text. I

  • by toutankh (1544253) on Wednesday January 09, 2013 @07:05AM (#42530209)

    After reading TFA I cannot find any convincing experimental validation. I see a lot of "can" and conditional tense (maybe that's the author's style), but nothing on the validation of the approach. Where is the experimental data, including the number of anonymous users correctly and incorrectly identified on forums?

  • They didn't identify 80% of the users, they managed to make a guess in 80% of the cases, which they didn't even bother to try to verify. There's no proof that their technique actually works.

  • This strikes me as akin to a Lie Detector. I think an honest court would side with the accused 100% of the time as even this cannot absolutely prove they were the author.

    Though sadly, a Roberts/Scalia/Thomas Supreme Court would rule against such an individual and for the corporation or state security organs. Dicks.
  • Aren't those cunning linguists clever? The answer always seems to be right on the tip of their tongue. They don't diddle around. They seem to be able to lick any problem.

  • welcome our stylistic overlords
  • There are tiny timing differences as one types. These are quite distinctive between individuals if you collect enough data. It's related to how an individual learns to type: motor memory of word-phrases versus typing a new word for the first time. Even the pattern of common typing errors and recovery is distinctive.
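
To make the timing idea concrete, here is a small sketch assuming you already have key-press events as (character, timestamp in milliseconds) pairs. The event format and function names are assumptions for illustration. It averages the gap for each pair of consecutive keys and compares two such profiles.

```python
from collections import defaultdict


def digraph_profile(events):
    """events: ordered list of (char, time_ms) key presses.

    Returns the mean gap in ms for each pair of consecutive keys.
    """
    gaps = defaultdict(list)
    for (key1, t1), (key2, t2) in zip(events, events[1:]):
        gaps[key1 + key2].append(t2 - t1)
    return {pair: sum(values) / len(values) for pair, values in gaps.items()}


def profile_distance(p, q):
    """Mean absolute difference over key pairs present in both profiles."""
    shared = set(p) & set(q)
    if not shared:
        return float("inf")  # nothing to compare
    return sum(abs(p[pair] - q[pair]) for pair in shared) / len(shared)


sample = [("t", 0), ("h", 100), ("e", 250), ("t", 400), ("h", 520)]
profile = digraph_profile(sample)  # "th" averages the 100 ms and 120 ms gaps
```

A smaller `profile_distance` means more similar typing rhythm. As the comment says, this only becomes distinctive with a lot of data per person.
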
  • by LuSiDe (755770)

    I'm curious how this would apply to the Zodiac case. Oh wait, it doesn't:

    * He used symbols in communication.
    * Voice recognition didn't solve the case.
    * DNA evidence didn't solve the case.
    * Copycats functioned as noise, might've even given him credit.
