Programming Government Open Source Privacy

Coding Styles Survive Binary Compilation, Could Lead Investigators Back To Programmers (princeton.edu) 164

An anonymous reader writes: Researchers have created an algorithm that can accurately detect code written by different programmers (PDF), even if the code has been compiled into an executable binary. Because of open source coding repositories like GitHub, state agencies can build a database of all developers and their coding styles, and then easily compare the coding style used in "anti-establishment" software to detect the culprit. Despite all the privacy implications this research may have, the algorithm can also be used by security researchers to track down malware authors. We also discussed an earlier phase of this research.
This discussion has been archived. No new comments can be posted.

  • Frist! (Score:2, Insightful)

    by Anonymous Coward

    Going to be lots of false positives on this one.

    • by zr ( 19885 )

      if true, even if not definitive would provide useful leads.

      • Brilliant! These are the same people who cannot find a bad person unless they can stealthily break into cell tower transmissions and social networking sites, and slam one with lethal doses of X-rays.
        • by zr ( 19885 )

          quite a jump there, from analyzing code to irradiating people..

        • by gweihir ( 88907 )

          Actually, as Paris showed (and then showed again), they cannot do so having all those capabilities and knowing who these people were beforehand. I think they simply cannot do it in the first place, no matter what outrageous capabilities these cretins will be given next.

      • by gweihir ( 88907 )

        Actually, it would not. Large false positive probabilities quickly drive a detection method to negative worth, because they waste resources that could have been spent better. Well known to experts, but something non-experts routinely do not comprehend.

      • by vlad30 ( 44644 )
        Like coders don't copy and paste code from various sources
    • by cshark ( 673578 )

      Going to be lots of false positives on this one.

      My thoughts exactly.
      Morons.

      • Re:Frist! (Score:5, Interesting)

        by ShanghaiBill ( 739463 ) on Wednesday December 30, 2015 @02:29PM (#51210465)

        False positives are not a problem if you deal with them rationally. If a woman is murdered, and the DNA matches one in a million, then in a country of 300 million, there will be 300 matches, and 299 false positives. But if only one lives in the same city, and it happens to be her ex-boyfriend, then the DNA match is useful information.
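The arithmetic in this comment can be checked with a short sketch. Python is used purely for illustration; the function names are hypothetical, and the figures are the ones quoted in the comment (a one-in-a-million match rate, a population of 300 million, and a single true source).

```python
def expected_matches(population: int, one_in: int) -> int:
    """Expected number of people who match when 1 in `one_in` people match."""
    return population // one_in


def false_positives(matches: int, true_sources: int = 1) -> int:
    """Matches that cannot be the true source, assuming `true_sources` real ones."""
    return matches - true_sources


matches = expected_matches(300_000_000, 1_000_000)
print(matches)                   # 300
print(false_positives(matches))  # 299
```

On these numbers a match alone picks out the true source with probability 1 in 300, which is why the comment leans on extra information (same city, ex-boyfriend) to make the match useful.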

        • False positives are not a problem if you deal with them rationally. If a woman is murdered, and the DNA matches one in a million, then in a country of 300 million, there will be 300 matches, and 299 false positives. But if only one lives in the same city, and it happens to be her ex-boyfriend, then the DNA match is useful information.

          Except in this case that does not work since locality doesn't matter for the Internet or software. Also many good groups establish various coding standards so many authors will now become one; some individuality may survive based on logic structure, but that would get mitigated quickly by group reviews and code updates in response.

        • by gweihir ( 88907 )

          False positives are not a problem if you deal with them rationally. If a woman is murdered, and the DNA matches one in a million, then in a country of 300 million, there will be 300 matches, and 299 false positives. But if only one lives in the same city, and it happens to be her ex-boyfriend, then the DNA match is useful information.

          Actually, it is not. You wasted the resources to do 300 million DNA tests, when simply looking for the ex-boyfriend would have helped you to narrow it down. With those 300 million DNA tests, you would have spent, say, the effort for 10'000 of them for locating and questioning the ex-boyfriend and administering a DNA test just to him. Hence you come out the effort for about 299.99 million DNA tests short and you still have to investigate the ex-boyfriend. That wasted effort is going to have massive negative

      • by tattood ( 855883 )

        Going to be lots of false positives on this one.

        So I can easily avoid this trap by never hosting any code on Github?

        • ... or by sharing your GitHub log-in details with a half-dozen other people so that each one of you dilutes and poisons the "fingerprint" of the other 5.
    • lol yeah. The blathering about governments is just somebody getting silly and running their mouth about stupid shit. Newsflash, if a person sounds like a conspiracy theorist? They're probably not a good data source.

      This is great technology for figuring out which one of 5 people wrote a particular method/function. And I have no doubt that governments will use this technology to mislead juries into believing it is like a fingerprint, by using the word "fingerprint" nearby the name of their test in sentences,

    • Going to be lots of false positives on this one.

      Doesn't matter.

      It's a bit like blood-group matching. It can't prove you're the guilty person, but a negative match can certainly prove you aren't.

  • by Anonymous Coward

    This must have a lot of false positives. I'd be surprised if this works at all, but sure it would sell some product and get a few grants.

    • by Anonymous Coward

      Yes. We've gone through this a billion times already. Every couple of years someone decides they can tell who a programmer is through the binaries, except they forget how much code is little more than snippets of other code or the like.

      • Re:I doubt this (Score:5, Interesting)

        by Lab Rat Jason ( 2495638 ) on Wednesday December 30, 2015 @10:37AM (#51208863)

        This is why I steal most of my functional code from GitHub in the first place...

        --OR--

        Easy to avoid detection by simply NOT UPLOADING code to GitHub in the first place. The assumption that every dev does this is stupid.

        • I'd think the only time this would work is if a programmer was a contributor to open source projects, then went bad and started writing software designed to commit crimes. Anyone starting as a criminal developer would never have uploaded their code to Github.
          • I'd think the only time this would work is if a programmer was a contributor to open source projects, then went bad and started writing software designed to commit crimes.

            Say someone contributes to open-source projects and then contributes to the Android project. A U.S. appeals court found Android to infringe Oracle's copyright, pending a forthcoming phase of the trial to determine whether API interoperability is a valid rationale for fair use. Does Android count as "software designed to commit crimes" because copyright infringement is a crime?

        • by Zardus ( 464755 )

          I'm speculating, but this could probably be applied without the need of a large github corpus. If you have some set of malware that you know was written by a specific person/group, you could check other pieces of malware to see if the same people wrote them. That'd probably be useful to *somebody*.

        • If you aren't uploading to GitHub, you are an Alchemist, not a Scientist.
          • If you aren't uploading to GitHub, you are an Alchemist, not a Scientist.

            If by "alchemist" you mean "someone practicing obsolete practices worthy of derision", this sounds like you're trying to say GitHub ought to have a monopoly on hosting free software projects, as opposed to SourceForge which shares a parent company with Slashdot. Do you work for GitHub?

          • Alternatively, I'm a professional.

            • Still an Alchemist, just one with a Patron..
              • Most alchemists had patrons. That's how they funded their alchemy.

                The deal usually involved the alchemist agreeing to make gold if the patron provided the workshop, living expenses and money for essential reagents. The alchemist would spend a bit of time doing their alchemy thing, then disappear mysteriously when the patron started to get impatient about the lack of gold production.

              • If by that you mean I turn bytes into dollars, I suppose. I don't get why you think developing proprietary software is that different from OSS, except for the guarantees to the customers, deadlines, QA process, support to do it right.

                I don't use GitHub because I don't contribute to OSS.

          • If you aren't uploading to GitHub, you are an Alchemist, not a Scientist.

            I'm neither an Alchemist nor a Scientist in writing my code. I'm more of an Engineer. An Alchemist does something repetitively and happens upon the same results more by chance than anything else.
            A Scientist does better by making it more predictable, but there is little real structure or design to the work, leading to a lot of errors and lots of rework.
            An Engineer designs, architects, and makes reproducible work with low errors with little rework.

      • Re: (Score:3, Interesting)

        by Anonymous Coward

        As a systems admin I have been called upon at times to automate a few things. Doing so in my situation seemed easiest with vb.net. (Yes I know the actual programmers here are recoiling with horror. Stuff it, the program works, and saves me vast amounts of time)

        In that program somewhere around 5% are lines that I actually coded. Everything else is snippets of code from Microsoft's help files, question/answer sites, and similar opensource programs found online. Unless they are checking for things like t

        • by unrtst ( 777550 )

          In that program somewhere around 5% are lines that I actually coded. Everything else is snippets of code from Microsoft's help files, question/answer sites, and similar opensource programs found online. Unless they are checking for things like the fact that I included no error handling (since I am the only one that uses said program) I fail to see how this would work at all.

          I strongly suspect this is precisely the kind of code that they will most easily be able to associate to individuals.
          Through their deep analysis of public code, I would strongly suspect that they have cached those segments, like any good search engine or data analysis would do. As such, they can diff and cut out any code that has been duplicated from elsewhere (just as they could with raw source code). Anything modified by you would remain.
          Because your coding style is, admittedly, quite different from that

          • Re: (Score:2, Interesting)

            by Gr8Apes ( 679165 )

            As such, they can diff and cut out any code that has been duplicated from elsewhere (just as they could with raw source code). Anything modified by you would remain. Because your coding style is, admittedly, quite different from that in the snippets, it will stand out as if it were glowing.

            The funniest thing about this is how wrong that statement is. I can take myself as an example, I've worked in multiple shops, several with different code formatting practices, not to mention potentially different languages. I generally configure my IDE to whatever code formatting requirements there are, so everything I add gets put into the current format. Naming practices are whatever is in the current codebase. So, essentially, from a source and binary perspective, my code will look like whatever the curr

            • Right, except, you don't make a case that supports your conclusion. You make a good case that the use the idiot in TFS describes is not a good use. But that doesn't harm the capabilities of the technology at all. Your conclusion that it is snake oil mistakes the location and nature of the mistake.

              From what you said, it sounds like each of your past employers could use it to tell if you were the one who wrote a particular function or method, based on the specific ways that you implemented their stated coding

            • I don't even write my own code.

              Simulink writes it for me. Are they also able to back out Simulink model styles as well?

      • by cshark ( 673578 )

        Yes. We've gone through this a billion times already. Every couple of years someone decides they can tell who a programmer is through the binaries, except they forget how much code is little more than snippets of other code or the like.

        Even so. All of this assumed that the blackhat coder is sharing his code or contributing to the open source community to begin with.
        GitHub and repositories like it are not a panacea of coding styles from every coder on earth. The total number of people who contribute is actually very small when you consider the size and scope of the overall community. Furthermore, I've intentionally changed my coding style a dozen times.

        I would challenge these Princeton researchers to make heads or tails of me, honestly.

    • I'd be surprised if this works at all

      It is just like a polygraph machine. It works, it works well, it works under known conditions, and it produces known results. Of course, the things it does are not the things described by the words people use near it, nor are most of the actions it is used to support actually supported by the function of the machine. And yet, the machine is not malfunctioning.

      This is not an investigative tool for law enforcement. It is a useful tool for certain business researchers, and it may prove useful to historians of

    • This must have a lot of false positives.

      True, especially for projects where the maintainers care about style and ensure code in pull requests conforms to project guidelines. Note: this is not about formatting, where to put braces, etc. which is information lost during compilation. I am talking about naming (which may be preserved in debugging symbols), code structures, etc. which may be partially or fully preserved.

      I'd be surprised if this works at all, but sure it would sell some product and get a few gran

  • by Registered Coward v2 ( 447531 ) on Wednesday December 30, 2015 @10:02AM (#51208645)
    People have been analyzing writing styles for a long time to try to identify authors. Expecting your coding style to be obfuscated by compiling it has proven to be as wrong as thinking your identity is shielded if you publish under a pseudonym. If you make your code publicly available you really shouldn't have any expectation of privacy.
    • by Anonymous Coward on Wednesday December 30, 2015 @10:36AM (#51208857)

      I doubt it. Your code once compiled will be very similar to that of most other similarly skilled programmers in that language, unless you go out of your way to obfuscate things, i.e. you are a poor coder. Compilers, libraries, APIs, language versions and proprietary extensions are beyond your coding style. This entire premise assumes there will be no false positives, when in fact they will be the vast majority of hits. So basically, they're casting nets, and claiming success when they get one, ignoring the other thousand. Once you're at the binary, coding style has all but gone (assuming you're not doing assembler, which even then will come down to the same few solutions to a given functional requirement).

      • Do you write:

        String foo()
        {
            String result;
            if (...)
                result = a;
            else if (...)
                result = b;
            else
                result = c;
            return result;
        }

        or

        String foo()
        {
            if (...)
                return a;
            else if (...)
                return b;
            else
                return c;
        }

        Both are common, but compile to different code. Do you code to 'a procedure should live on a page'? How about 'a procedure should have a purpose'? Return errors or throw exceptions? Return values or
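The claim that the two styles "compile to different code" can be illustrated with an analogy. The sketch below uses CPython bytecode as a stand-in for a native binary, which is an assumption made for convenience: a native optimizing compiler may well merge the two forms. The function names and branch conditions are hypothetical.

```python
def single_exit(x):
    # Style 1: assign to a local variable, return once at the end.
    if x < 0:
        result = "a"
    elif x == 0:
        result = "b"
    else:
        result = "c"
    return result


def early_return(x):
    # Style 2: return directly from each branch.
    if x < 0:
        return "a"
    elif x == 0:
        return "b"
    else:
        return "c"


# Same observable behaviour on every branch...
same_behaviour = all(single_exit(v) == early_return(v) for v in (-1, 0, 1))

# ...but the compiled code objects are not identical: only the first style
# carries a local variable slot for `result`.
same_bytecode = single_exit.__code__.co_code == early_return.__code__.co_code

print(same_behaviour, same_bytecode)
print(single_exit.__code__.co_varnames, early_return.__code__.co_varnames)
```

Running `dis.dis(single_exit)` and `dis.dis(early_return)` from the standard library shows the instruction-level difference in more detail.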

    • by Anonymous Coward

      Expecting your coding style to be obfuscated by compiling it has proven to be as wrong as thinking your identity is shielded if you publish under a pseudonym.

      I was going to take the time to format my reply just like an APK post to illustrate a point, but I'm too lazy this morning. Point being, if you simply copy 90% of what you publish, then any fingerprinting is going to most likely end up pointing back to the original author.
      While most people don't do that as a matter of habit when posting comments or writing, the wholescale re-use of code is extremely common especially in open source projects. In fact, that's kind of the point. So once you compile to binary,

    • Everybody knows that whitespace is translated to nop's you insensitive clod!

    • The problem would be proving the code in question was written by someone, due to "coding styles". That sounds legally sketchy as all hell.
      • The problem would be proving the code in question was written by someone, due to "coding styles". That sounds legally sketchy as all hell.

        I agree; but there is a difference between what would hold up in court and using the results to identify who may have written the code and using that to narrow the scope of an investigation or even to prove original authorship. Based on the article, it seems to work on very specific snippets of code written to perform a specific task; whether it would be useful to analyze large swaths of code is another question altogether; especially since such code is likely to have a number of contributors as well as s

  • Heck (Score:4, Funny)

    by vikingpower ( 768921 ) on Wednesday December 30, 2015 @10:02AM (#51208649) Homepage Journal

    gotta change my indentation style and public void( String s1 ) whitespace habit, now the guvnmunt automagically can also get these out of binaries built from my code. O gawd, I'm afraid now.

    • by prefec2 ( 875483 )

      This is only a part of your coding style. In most cases the part where the braces go and the indentation are part of the company or language code style. And it should be identical for everyone in your organisation. They can be enforced by checkstyle (in Java). The style also includes the use of any kind of design pattern. For example, do you implement factories in the same class or in a separate class. Do you use sub interfaces in Java.

      • Re:Heck (Score:4, Interesting)

        by JaredOfEuropa ( 526365 ) on Wednesday December 30, 2015 @10:21AM (#51208767) Journal
        Ideally there is such an enforced coding standard, but I have worked in situations with merged teams or projects where coding styles were rather mixed. From what I could see, cosmetic stuff like braces and indentations caused some annoyance but it didn't really lead to much lost coding time, increased effort in fixing or changing things, or an increase in bugs.

        Anyway, brace placement won't survive compilation so this method is useless for rooting out the K&R traitors.
      • For example, do you implement factories in the same class or in a separate class. Do you use sub interfaces in Java

        That depends on whether I've just learned about them or not. If so, yes. If it's been a while since I learned, I've weaned myself off them and I'm using the next fad^H^H^H uh cool feature.

        Just like everyone else, amiright? ;-)

        So that probably wouldn't help in terms of fingerprinting developers...

    • by Volanin ( 935080 )

      I feel you. This is my whitespace style as well; it gets so much trashing from my colleagues... And now they come to take away its freedom; that in binary, every whitespace is ignored equally. It's a sad day.

    • by Anonymous Coward

      Those are not the things that matter. One example of things that matter: loop structure - initialization, stopping condition; e.g. start from 0 and count up or start from X and count down. have a separate iteration variable.

      Plenty of other higher-order things like function overloading, object inheritance, recursive calls, etc. can be analyzed, and a probability matrix can be created of who is the most likely author from a given list.

      • I'm sure we'll figure out what "style" even makes it into a binary. Whom do they think they're messing with here?

    • by hattig ( 47930 )

      Yeah, they're doing Design Pattern Analysis (or similar) alongside analysing binary metadata; how the software behaves acts as a fingerprint of the coder who wrote it. Strip everything from those naughty binary executables, people!

      Ultimately, I somehow doubt that in a small application the binary can actually distinguish the author that precisely. There are only so many coding behaviour styles and patterns.

  • Fuck that. (Score:3, Insightful)

    by Anonymous Coward on Wednesday December 30, 2015 @10:03AM (#51208655)

    Aren't we being tracked enough as it is?
    Why for fucks sake why?

    My New Year's resolution will be to remove all my code from all public repositories.

    • Re: (Score:2, Interesting)

      by SQLGuru ( 980662 )

      It's versioned.......and cloned.......and forked. Good luck with that.

      I think it's funny (ironic, not ha ha) that many of the people espousing Open Source as being perfect are generally the same ones that have the biggest desire for digital privacy. And because of their push for OSS, they will be some of the first to lose their privacy.

      ** I think OSS has its place, as does closed source. I also have a desire for some privacy but recognize that I have to give up some of that privacy in order to have some

  • by El_Muerte_TDS ( 592157 ) on Wednesday December 30, 2015 @10:08AM (#51208683) Homepage

    Good luck when your programmer pool is a couple of thousand and your samples consist of the obfuscated and underhanded software that malware creators often produce.

    • by Anonymous Coward

      It does not need to be perfect, it even does not need to be good. This will just generate another data point which will be used with many other data points to find, track and control people.
      Certainty is not required, as long as all data combined results in a reasonable likelihood for the intended purpose, as deemed by the overlord wielding this tool.

    • by AHuxley ( 892839 )
      Nations can spend big on their clandestine campus study efforts and over the years can project any nations style they want.
      Did a nation embrace Basic? teach with Ada? early C? Pascal? Have decades of common business oriented language in academia, assembler language, academics who enjoyed lots of free "big iron" access?
      Like hires like, like learns from like.
      Or a large user group of newer Microsoft consumers that are self taught on PC's with newer programming ideas and lots of code reuse?
      With t
  • by Anonymous Coward

    So what happens when someone copies and pastes from 10 different authors to make a project.

  • However, a lot of people have similar enough coding styles, so you may be able to break it down to particular camps of styles. Also many people change their style based on the language they are coding in. Also over time their style may evolve and change.
    In my career I try to keep my mind open: when I see another style of coding, rather than judging it inferior to mine, I try to understand it, and if I like it I will incorporate it into my style.

    But compiling your code, will not hide how you coded it

  • by WarmBoota ( 675361 ) on Wednesday December 30, 2015 @10:21AM (#51208763) Homepage
    Good luck tracking me!! I copy all of my code from Stackoverflow!
    • by Anonymous Coward

      StackSort connects to StackOverflow, searches for 'sort a list', and downloads and runs code snippets until the list is sorted.
      https://xkcd.com/1185/ [xkcd.com]

      (captcha: truisms)

  • Oh really? (Score:5, Insightful)

    by Viol8 ( 599362 ) on Wednesday December 30, 2015 @10:21AM (#51208765) Homepage

    If you RTFA it seems their sample size was 20 programmers. Occasionally they went up to 100 and they're getting something like 60-80% accuracy. BFD.

    Guys - when you've sampled the compiled, optimised binary output (with all debug info stripped) of a million coders all using different compilers on different architectures and are getting at least a 99% accuracy rate, get back to us. In the meantime, I'm sure you'll get some nice marks from your supervisors but I won't be losing any sleep.

    • Don't forget different versions of the same compiler. Eg. gcc-1.0 may have a different binary output than gcc-2.0

    • by PPH ( 736903 )

      getting at least a 99% accuracy rate

      If they are hoping to use this as evidence in a trial, maybe. But to reduce the size of a list of candidate suspects for further investigation, 60 to 80% could be OK.

    • by Kjella ( 173770 )

      Guys - when you've sampled the compiled, optimised binary output (with all debug info stripped) of a million coders all using different compilers on different architectures and are getting at least a 99% accuracy rate, get back to us. In the meantime, I'm sure you'll get some nice marks from your supervisors but I won't be losing any sleep.

      I wouldn't call it totally useless, imagine you found an unknown binary running on some internal server and it turns out to be a custom inside hack job deployed with stolen credentials. Maybe you even know the thief must have physically been in the victim's office. You now have a relatively limited set of suspects, a binary and a lot of source to compare with. If we're talking classified information, industrial espionage or some other really high end material this could be one lead in the investigation.

    • If you RTFA it seems their sample size was 20 programmers.

      Right, so that tells you that they're idiots because they can't do what the idiots speculate, or that the idiots speculating misunderstood the purpose of the tool?

      when you've sampled the compiled... output... of a million coders all using different compilers on different architectures and are getting at least a 99% accuracy rate, get back to [me]

      There are multiple problems in your analysis. First, there are not millions of compilers or architectures. If you take the compilers and architectures that make up 95% of what is used, you've only got a few platforms and a few compilers, not different ones for millions of programmers. This may sound pedantic, but the problem you imply with this pa

  • Just run your app through an obfuscator and it's completely masked. Problem solved.

  • Even their test size seemed to have low accuracy, but I wonder how well this even works over time. I know my code from 5-6 years ago looks nothing like code that I write today.

  • Problem 1.) Who wrote this https://de.wikipedia.org/wiki/... [wikipedia.org] ? Problem 2.) In the movie First Blood, Part II (a.k.a. Rambo II), when the camera pans through the interiors of Marshall Murdock's CIA base building, parts of the code listing of some computer program can be seen scrolling through some of the screens there. Who wrote that code? Hint to Problem 2: The person in question is also a Slashdot member.
  • This technology was used to determine how many coders were used for the Stuxnet attack.
    That was back in 2010.
    Using that, they determined that a team of 20 people were used, indicating a state-sponsored attack of remarkable complexity...
  • Any nontrivial programming exercise involves problem solving. Faced with a particular recurring problem, a programmer will learn methods to solve it. There are many choices. Most programmers, after having learned a small collection of 'good enough' solutions to common problems will continue to use them whenever 'good enough is good enough', the time and effort of relearning seeming unproductive.

    This is no different, conceptually, than in sports when certain sportspeople play in a discernible style. Nobody i

    • I think that, in the case of tennis players, it will be much easier to identify highly discriminating features of players than in the case of computer programmers. This is because imagining that you actually have what it takes to be a great tennis player is much easier than imagining that you have the skills of a great programmer: if you can imagine having the skills of a great programmer, you do in fact have them. So, why not examine the stock example: identifying chess players by their moves?
    • by yacc143 ( 975862 )

      Yes and no. The better people in the industry continue to learn, every day.

      Actually, my current team lead expects us to learn all the time, and is completely willing to take the hit in longer ticket handling times.
      OTOH, my current boss is an outlier in my experience in this industry.

  • Maybe now we can track down the guy that wrote that Volkswagen code? I'll be right there... need to grab my pitchfork.

    • At 32c3 https://www.youtube.com/watch?... [youtube.com] , Daniel Lange and Felix Domke presented their analysis of Volkswagen's "Dieselgate" software. It seems that that one doesn't look like ordinary code at all, but rather like code patterns generated from tables that relate sensory data to engine control parameters. Think of one of the earliest motivations for building computing machines in the first place: To create parameter tables for artillery aiming!
      • It's most likely made in Simulink and compiled to C and then for the target platform. Simulink is used everywhere in automated controls from the automotive up through aircraft.

        That's exactly what this code does, I work with A2L files all the time [youtu.be], it's how we calibrate our engines. It's how everyone calibrates their engines.

  • Did a couple of scripts in quiet time between Xmas and new year. Took the chance to move from perl to python. I use git hub as it is there.

    Think my rating will be 'dufus head'

  • Important part:

    Finally, we do not consider executable binaries that are obfuscated
    to hinder reverse engineering. While simple systems,
    such as packers [2] or encryption stubs that merely restore the
    original executable binary into memory during execution may
    be analyzed by simply recovering the unpacked or decrypted
    executable binary from m

  • Generally, any programming language has an upper limit on the number of commands that are recognized, which cannot be said of spoken/written languages. The only thing that will actually be discovered are the differences in algorithms, not the unique number of programmers to a particular dialect.
  • The place where (and how) you catch nulls is very programmer-specific in my experience and often evades the style-check.

    • by ledow ( 319597 )

      But...

      Say that's a "signature". There are only "n" so many places you can put the null-check that will work properly.

      Say you can list "m" such things. Then, at most, you can categorise every programmer into one of m x n groups (and, in fact, you might find that certain m's and n's go hand-in-hand, etc.).

      So, if you have something like github - that has 11 million users and 30m repos at last count. Let's assume that most of those 11 million users, then, are programmers that commit code. You'd need to find
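One way to put numbers on this argument: if the traits are assumed to be independent and each offers the same number of choices (an assumption; the comment itself notes that traits may go hand in hand, which would shrink the count), then the number of distinguishable groups is at most the number of choices raised to the number of traits. The sketch below asks how many such traits it would take to separate the 11 million users quoted above. The function names are hypothetical.

```python
def max_groups(options_per_trait: int, traits: int) -> int:
    """Upper bound on distinguishable groups, assuming independent traits."""
    return options_per_trait ** traits


def traits_needed(programmers: int, options_per_trait: int) -> int:
    """Smallest number of independent traits that could separate everyone."""
    if options_per_trait < 2:
        raise ValueError("a trait needs at least two options to distinguish anyone")
    count = 0
    while max_groups(options_per_trait, count) < programmers:
        count += 1
    return count


print(traits_needed(11_000_000, 2))  # 24
print(traits_needed(11_000_000, 4))  # 12
```

These are best-case lower bounds: they assume every combination of traits is used by exactly one programmer, which real populations will not come close to.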

    • I don't really believe this. I have only used the pattern
      Allocation
      check for NULL
      error
      continue

      There are a few different things to do in the error, do you goto error handling at the end? Do you start unwinding previous work and exit out? Yes, I have done all of those in different situations, with different coding standards around me, but the basic NULL check is the next thing that happens directly after the allocation. Note that compilers don't care about formatting and whitespace, so if you do

      • by drolli ( 522659 )

        Well, I have seen people placing constructs which "propagate" nulls in data structures... (not that I am a fan of this; I would have liked to make the guy swallow his keyboard). The strategies to do so can be funny.

  • will make any such strategy useless in short order. Source code translators and syntax standardization tools might be another approach.

    Anyway, it's a big yawn, however, some enterprising con artists will sell this to clueless government bureaucrats for big bucks. Bureaucrat will get his bonus. Con artist company will get their money. Win, win. It won't work, of course, but when has that ever mattered in the government world?

  • Cut-and-paste other programmer's code!
  • This study seems to have a high error rate. (70-80% correct, less for big programmer populations)
    It might be useful for de-prioritizing some leads, but seems a bit like a divining rod.

    What is interesting is what it says about what programmers do.
    They continually make choices as to how to implement things.
    The choices are limited by their judgement and bag of tricks.
    What they have seen, what has worked in the past, and what they manage to dream up.

    Perhaps this research is actually creating and comparing invent

    • Did the same team that developed that code also run an accuracy assessment? Was there a "prize" (contract payment) associated with meeting certain accuracy? I remember reading about facial recognition systems which worked well in labs, but fail in the field.
      As soon as developers become aware that they might be identified, I think that they might do things (spoof, run beautify and strip comments) to throw such a system off.
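The "run beautify and strip comments" idea from the comment above can be illustrated at source level. This is a minimal sketch, assuming Python 3.9 or later for `ast.unparse`; it is a stand-in for whatever beautifier a developer might actually use, and it says nothing about how the researchers' binary-level features would respond.

```python
import ast


def normalize(source: str) -> str:
    """Parse the source and print it back in the standard-library layout.

    Comments, spacing, and indentation width are discarded because they
    are not part of the syntax tree; names and control flow are kept.
    """
    return ast.unparse(ast.parse(source))


# Two hypothetical authors writing the same logic with different habits.
author_a = "x=1 # set x\nif x==1 :\n        print( 'one' )\n"
author_b = "x = 1\nif x == 1:\n    print('one')\n"

print(normalize(author_a) == normalize(author_b))  # True
```

Normalization of this kind removes only layout and comments. Choices such as identifier names, control flow, and which constructs get used pass through unchanged.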
  • by mark-t ( 151149 ) <markt&nerdflat,com> on Wednesday December 30, 2015 @11:32AM (#51209257) Journal

    It seems to me like the easiest way to avoid being identified in this regard would be to write code that follows any published general style guidelines or otherwise very common conventions.

    As a side effect, it will make your source code more readable to others, which is beneficial if you are on a programming team.

  • possible upside (Score:4, Interesting)

    by Gravis Zero ( 934156 ) on Wednesday December 30, 2015 @11:37AM (#51209303)

    while i don't think you'll be able to identify an exact person, i do think this technology could be used to identify code that is prone to error and exploitation or even code that is for exploitation.

  • There is no way this could be even close to conclusive, but the moral of the story is - if it is stupid, but a judge will call it probable cause, then it isn't stupid.

    The truth is it doesn't need to be conclusive, it just has to look conclusive to a 60-year-old law professional with no programming experience.
  • As someone who knows a fair amount about compilers and interpreters, I would be highly skeptical of that underlying statement to begin with. Going down this path stretches the bounds of credulity.

    But I think I would also dispute the notion that programmers have unique coding styles in the age of widely accepted standards and practices.
    Practices, I would concede are coding styles.

    Though, even then... it's not like we're talking about something like assembly, where the style you use would reall

  • 01001001 00100000 01100011 01101111 01100100 01100101 00100000 01101001 01101110 00100000 01100010 01101001 01101110 01100001 01110010 01111001
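For anyone who would rather not decode the comment above by hand, a short sketch follows. It assumes each space-separated group is one 8-bit ASCII code.

```python
def decode_binary(text: str) -> str:
    """Turn space-separated 8-bit groups into the characters they encode."""
    return "".join(chr(int(octet, 2)) for octet in text.split())


MESSAGE = (
    "01001001 00100000 01100011 01101111 01100100 01100101 00100000 "
    "01101001 01101110 00100000 01100010 01101001 01101110 01100001 "
    "01110010 01111001"
)

print(decode_binary(MESSAGE))
```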

  • I suspect that the value is not in answering the question "who the hell wrote this - which programmer in Internet land ?" but in identifying a programmer out of a small group of suspects, eg "was this written by the known malware team in Boston, Beijing or Kiev ?". So: it will further narrow the field out of an already small group of suspects.
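The small-group scenario in the comment above amounts to picking the closest match from a short candidate list. The sketch below is a toy illustration and not the researchers' classifier: the feature vectors are invented counts of hypothetical style traits, and cosine similarity is just one simple way to compare them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_candidate(sample, candidates):
    """Return the name of the candidate whose profile best matches the sample."""
    return max(candidates, key=lambda name: cosine(sample, candidates[name]))


# Hypothetical style profiles for the three teams named in the comment.
CANDIDATES = {
    "Boston": [9, 1, 0, 4],
    "Beijing": [1, 8, 3, 0],
    "Kiev": [2, 2, 9, 1],
}
UNKNOWN_SAMPLE = [8, 2, 1, 3]  # invented profile of the binary under study

print(closest_candidate(UNKNOWN_SAMPLE, CANDIDATES))  # Boston
```

A closed candidate list always produces some "closest" answer, even when the true author is not on it, so a match like this narrows a field rather than proving authorship.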

    This has an interesting implication for GPL enforcement. Today if Nasty Corp Inc takes a large chunk of code from GitHub and makes it part of a proprietary product

  • Linus Torvalds has been indicted for creating numerous pieces of malware. "His coding style is unmistakable," prosecutors said, citing numerous code fixes he made after scolding commentaries on other people's coding style.

  • by thermowax ( 179226 ) on Wednesday December 30, 2015 @02:01PM (#51210211)

    ...you run the object code through a permuter like shikata ga nai?

    I suspect the successful detection rate may be a bit lower.

  • I'm pretty sure my coding style has changed significantly over time, from project to project, due to experience, learning from past mistakes, influence from other programmers, etc. Also, I've worked on projects where 10 different programmers have touched the same code. Good luck trying to identify me from any two pieces of code.

