Researchers Use Machine-Learning Techniques To De-Anonymize Coders (wired.com) 66

Posted by BeauHD on Sunday August 12, 2018 @12:16PM from the digital-fingerprints dept.

At the DefCon hacking conference on Friday, Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt's former PhD student and now an assistant professor at George Washington University, presented a number of studies they've conducted using machine learning techniques to de-anonymize the authors of code samples. "Their work could be useful in a plagiarism dispute, for instance, but it could also have privacy implications, especially for the thousands of developers who contribute open source code to the world," reports Wired. From the report:

First, the algorithm they designed identifies all the features found in a selection of code samples. That's a lot of different characteristics. Think of every aspect that exists in natural language: There's the words you choose, which way you put them together, sentence length, and so on. Greenstadt and Caliskan then narrowed the features to only include the ones that actually distinguish developers from each other, trimming the list from hundreds of thousands to around 50 or so. The researchers don't rely on low-level features, like how code was formatted. Instead, they create "abstract syntax trees," which reflect code's underlying structure, rather than its arbitrary components. Their technique is akin to prioritizing someone's sentence structure, instead of whether they indent each line in a paragraph.
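To make the "abstract syntax tree" idea concrete, here is a minimal sketch using Python's standard `ast` module. It is not the researchers' pipeline: the function name `ast_node_frequencies` and the choice of node-type frequencies as features are assumptions made purely for illustration. It shows that two samples with very different formatting yield identical structural features.

```python
import ast
from collections import Counter


def ast_node_frequencies(source: str) -> dict[str, float]:
    """Return the relative frequency of each AST node type in `source`.

    Hypothetical feature extractor: whitespace and layout do not survive
    parsing, so only the structure of the code contributes.
    """
    tree = ast.parse(source)
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Same program, formatted two different ways.
sample_a = "for i in range(3):\n    print(i)\n"
sample_b = "for  i  in  range( 3 ) :\n        print( i )\n"

features_a = ast_node_frequencies(sample_a)
features_b = ast_node_frequencies(sample_b)
```

A real system would presumably feed many such features into a classifier and then prune them, as the summary describes; that step is left out here.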

The method also requires examples of someone's work to teach an algorithm to know when it spots another one of their code samples. If a random GitHub account pops up and publishes a code fragment, Greenstadt and Caliskan wouldn't necessarily be able to identify the person behind it, because they only have one sample to work with. (They could possibly tell that it was a developer they hadn't seen before.) Greenstadt and Caliskan, however, don't need your life's work to attribute code to you. It only takes a few short samples.


Satoshi Nakamoto... (Score:5, Interesting)

by chrisvdb ( 149510 ) writes: on Sunday August 12, 2018 @12:34PM (#57111800)

... could be an interesting use case.

- Re: (Score:2)
  
  by hcs_$reboot ( 1536101 ) writes:
  
  What do you do when there is actually a team of devs? SN could be a group of university students whose code is a mix of styles.
  - Re: Satoshi Nakamoto... (Score:2)
    
    by WarJolt ( 990309 ) writes:
    
    You should be able to classify small code segments.
    I seriously doubt it will work well with pure functional programs. Functional programmers tend to converge on similar programs.
- Re: (Score:1)
  
  by Satoshi Nakamoto ( 5487968 ) writes:
  
  Oh shit.
- Re: (Score:2)
  
  by ShanghaiBill ( 739463 ) writes:
  
  Identifying criminal malware authors is the obvious application
  How often do you have the source code to malware?
  - Re: (Score:2)
    
    by Kiwikwi ( 2734467 ) writes:
    
    From TFA:
    it’s possible to de-anonymize a programmer using only their compiled binary code.
- Re:Malware authors (Score:4, Funny)
  
  by Aighearach ( 97333 ) writes: on Sunday August 12, 2018 @04:56PM (#57113006)
  
  is the obvious application
  I just want to know what my old Perl code does. Maybe this can help!
  
- Re:All I've got to say (Score:5, Funny)
  
  by hcs_$reboot ( 1536101 ) writes: on Sunday August 12, 2018 @01:49PM (#57112166)
  
  No need for a complex AI engine. 1) Using line numbers smells of an old dev used to early Basic stuff, 2) Using 10 and 20 confirms 1), 3) "Hello", not even "Hello, world", confirms that the dev has no other experience and is probably a lousy programmer, 4) print / goto shows a total lack of imagination, and 5) posted anonymously, so we're looking at an old degenerated pretending-programmer not really proud of his code, posting anonymously in the hope of getting a desperate funny mod while being actually almost certain to leave an unappreciated lousy post. That was easy.
  
- Re: (Score:2)
  
  by JustAnotherOldGuy ( 4145623 ) writes:
  
  10 print "Hello"
  20 goto 10
  Based on an in-depth analysis of this code and its many unique "signature" elements, I can state with 100% certainty that it was written by a programmer named "Anonymous Coward".
Arms race (Score:1)

by Anonymous Coward writes:

We need a new tool to parse code, create a syntax tree, transform it in ways that do the same tasks but mask the identity of the authors, and re-emit it, anonymized.
Code de-anon tools could be used by regimes such as the Chinese one to find who wrote an anti-censorship tool. Very dangerous to prevent anonymous writing, anonymous code, anonymous anything.
Not to blame the researchers: it will be done if it can be done. But now... to protect.
- Re:Arms race (Score:4, Interesting)
  
  by JustAnotherOldGuy ( 4145623 ) writes: on Sunday August 12, 2018 @02:23PM (#57112332) Journal
  
  We need a new tool to parse code, create a syntax tree, transform it in ways that do the same tasks but mask the identity of the authors, and re-emit it, anonymized.
  Pffft, just copy someone else's code, problem solved. If anything happens it'll get blamed on them.
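The parse / transform / re-emit idea raised in this thread can be sketched in a few lines. This is a toy under stated assumptions, not an existing tool: the names `collect_bound_names`, `Renamer` and `anonymize` are hypothetical, it only renames functions, parameters and assigned variables, and it ignores keyword arguments, attributes, classes and imports. Whether renaming alone would defeat the technique in the article is not something the source says.

```python
import ast


def collect_bound_names(tree: ast.AST) -> dict[str, str]:
    """Map each function name, parameter and assigned variable to v0, v1, ..."""
    mapping: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            name = node.name
        elif isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        mapping.setdefault(name, f"v{len(mapping)}")
    return mapping


class Renamer(ast.NodeTransformer):
    """Rewrite every occurrence of a collected name to its neutral alias."""

    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # descend into parameters and body
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Names never bound in this source (e.g. builtins) are left alone.
        node.id = self.mapping.get(node.id, node.id)
        return node


def anonymize(source: str) -> str:
    """Parse, rename identifiers, and re-emit (requires Python 3.9+ for ast.unparse)."""
    tree = ast.parse(source)
    tree = Renamer(collect_bound_names(tree)).visit(tree)
    return ast.unparse(tree)  # comments and original layout are dropped


original = (
    "def total(items):\n"
    "    acc = 0  # running sum\n"
    "    for x in items:\n"
    "        acc += x\n"
    "    return acc\n"
)
anonymized = anonymize(original)
```

Re-emitting through `ast.unparse` also discards comments and formatting, but the shape of the tree, which is what the article says the researchers look at, is unchanged.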
  
Obvious, Not-so-Obvious and Not Obvious-Oblivious (Score:2)

by ElitistWhiner ( 79961 ) writes:

...there's code that just makes you wonder "how many authors, iterations and algorithms later?".
The latter is the future that'll take AI to sort out evolution.
I'm pretty sure if you applied this to my code (Score:2)

by rsilvergun ( 571051 ) writes:

the police would show up wanting to know where the bodies were buried.
- Re: (Score:2)
  
  by Aighearach ( 97333 ) writes:
  
  Give or take ++
No worries for me (Score:2, Interesting)

by Anonymous Coward writes:

I occasionally contribute to open-source projects, but I do so under my full name anyway. Seeing that they are able to identify authors of compiled code too, it might be interesting to see if they can identify the authors of viruses & malware that have been making the rounds the last decade. Who to sue . . .
Another use case might be the javascript found on web pages. A noscript-like utility could ditch all javascript written by the wrong people - i.e. ad-related or spyware-related stuff. Loose it withou
Have these researchers actually written code? (Score:5, Insightful)

by Solandri ( 704621 ) writes: on Sunday August 12, 2018 @01:06PM (#57111948)

About half the time I code something, I end up grabbing a chunk of code that someone else has written which almost does what I want but not quite, copy/pasting it, and making a few tweaks to it so it'll do what I want.

That's kinda the whole reason software is different from crafting or manufacturing - zero cost of duplication. So there's no point doing duplicate work if someone else has already done it. In fact that's the fundamental rationale underlying open source.

- Re:Have these researchers actually written code? (Score:5, Informative)
  
  by jetkust ( 596906 ) writes: on Sunday August 12, 2018 @01:43PM (#57112138)
  
  None of this contradicts what they are doing. They addressed copy and paste in the article.
  
  From Article:
  Greenstadt and Caliskan have also uncovered a number of interesting insights about the nature of programming. For example, they have found that experienced developers appear easier to identify than novice ones. The more skilled you are, the more unique your work apparently becomes. That might be in part because beginner programmers often copy and paste code solutions from websites like Stack Overflow.
  
  - Re: (Score:2)
    
    by NicknameUnavailable ( 4134147 ) writes:
    
    I've been writing software for over 2 decades and I still routinely copy+paste key components straight off the first StackOverflow result and hit run without testing it. It's just faster and it works 99% of the time, the other 1% takes a few more minutes of tweaking it but it ends up looking largely the same. It's definitely not inexperience. In fact, if I'm picking up some bleeding-edge thing I'll tend to do that less because there aren't preexisting code samples.
    - Re: (Score:2)
      
      by jetkust ( 596906 ) writes:
      
      Yes. Everyone does it. Just depends on what you're working on. I rarely can copy and paste code directly like that, but if I could I would. The more copying and pasting you can actually do the less complicated that project likely is to program. I think it's not as much the skill level of the developer but the difficulty of what is being programmed. So I still see a lot of truth in the findings, as you can argue lower skilled (or less experienced) developers may be working on projects with less difficulty.
      - Re: (Score:1)
        
        by NicknameUnavailable ( 4134147 ) writes:
        
        Getting the right result the first try is more about using the right keywords than the difficulty involved.
- Have these commenters actually read the article? (Score:4, Informative)
  
  by Kiwikwi ( 2734467 ) writes: on Sunday August 12, 2018 @01:59PM (#57112218)
  
  Yeah, you shouldn't need to worry then. From TFA:
  Experienced developers appear easier to identify than novice ones. The more skilled you are, the more unique your work apparently becomes. That might be in part because beginner programmers often copy and paste code solutions from websites like Stack Overflow.
  
- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  so there is also zero cost in replacing you, right?
  copy-paste programmers are not the best programmers. if i need to google and search how something is done, i have not yet fully understood the problem. once i do, coding is the easy part and often takes less time than searching and re-editing 'to make it work'.
- Re: (Score:2)
  
  by AHuxley ( 892839 ) writes:
  
  It's everything around the copy/paste parts that will stand out.
  Comments, style, format, something a university always suggested. . . .
  Even date, font, slang, US vs UK spelling. Useful comments, comments that are always off topic? Unique use of terms to invoke faith, spellings?
  Someone who worked on NSA, GCHQ, mil/contractor code and always has to keep their comments to a set bureaucratic style? Something that feels like billable hours?
When I steal/borrow code (Score:2)

by bobstreo ( 1320787 ) writes:

I'd always add a comment regarding where it came from.
If I wrote it "The Usual Suspects" was listed when/if I had time to add comments.
So now you know, NSA/CIA/RIAA,,, /s
Bad news... (Score:2)

by hcs_$reboot ( 1536101 ) writes:

...for some former MS devs... IE6 and XP coders to be soon uncovered!
Stack Exchange? (Score:2)

by TJHook3r ( 4699685 ) writes:

All code will link back to some page on Stack Exchange - good luck with your profiles!
Recognising style (Score:3)

by Martin S. ( 98249 ) writes: on Sunday August 12, 2018 @04:15PM (#57112832) Journal

Once I've worked with a team for a while, I can generally recognise who coded something from their style.
There are plenty of stylistic elements that distinguish the actual coder, even in shops with tight coding standards. Some favour for loops, some unroll their code, some cram lots of logic on one line, while others aggressively decompose. Some will write very abstract code, others tightly focused on the specific case. Some will use lots of getters and setters, others will favour tell don't ask, some will favour 'do { ... } while()', others will use while loops. Some write very short snappy functions, some longer functions, some use programming domain naming, others favour business domain naming. Some favour arrays, others favour collections.
I've often been approached by colleagues with comments such as 'this looks like your code' and they are usually right, so this is not some special skill I possess. It is absolutely realistic that an algorithm or AI could identify these elements with static analysis and metrics and a sufficient sample.
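A couple of the stylistic signals listed above (for versus while loops, short versus long functions) are easy to measure with static analysis. The following is only an illustrative sketch in Python; the function `style_metrics` and the particular metrics are assumptions chosen for the example, not anything taken from the article.

```python
import ast


def style_metrics(source: str) -> dict[str, float]:
    """Compute a few hand-picked style metrics for one code sample."""
    nodes = list(ast.walk(ast.parse(source)))
    for_loops = sum(isinstance(n, ast.For) for n in nodes)
    while_loops = sum(isinstance(n, ast.While) for n in nodes)
    funcs = [n for n in nodes if isinstance(n, ast.FunctionDef)]
    # end_lineno is available on AST nodes from Python 3.8 onwards.
    lengths = [n.end_lineno - n.lineno + 1 for n in funcs]
    loops = for_loops + while_loops
    return {
        "for_share": for_loops / loops if loops else 0.0,
        "functions": float(len(funcs)),
        "mean_function_lines": sum(lengths) / len(lengths) if lengths else 0.0,
    }


sample = (
    "def show(xs):\n"
    "    for x in xs:\n"
    "        print(x)\n"
    "\n"
    "def countdown(n):\n"
    "    while n:\n"
    "        n -= 1\n"
    "    return n\n"
)
metrics = style_metrics(sample)
```

Comparing such vectors across samples from known colleagues is roughly what "this looks like your code" amounts to, though a handful of metrics like these would be far too coarse on their own.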

- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  People can recognize my perl code, however nobody has been able to understand it. Including myself.
- Re: (Score:1)
  
  by rgmoore ( 133276 ) writes:
  
  The thing I always wonder about this kind of thing is how well it scales. Your coworkers can tell a code snippet is from you because there are only a relative handful of people contributing to your project. But if you trained a program on just your group and then asked it to find your work on GitHub, it would probably find a whole lot of false positives-- other programmers whose style is similar enough to yours that it's fooled.
  This actually shows up in the article. The researchers claim to be 96% accu
I don't need AI to do this. (Score:1)

by devslash0 ( 4203435 ) writes:

I can instantly tell which developer within the company wrote the code I'm reviewing just by looking at it.
