AOL, Netflix and the End of Open Research

An anonymous reader writes "In 2006, heads rolled at AOL after the company released anonymized logs of user searches. With last week's announcement that researchers had been able to learn the identities of users in the scrubbed Netflix dataset, could the days of companies sharing data with academic researchers be numbered? Shortly after the AOL incident, Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.' Will any high tech company ever take this kind of chance again? If not, how will this impact research and the development of future technologies that could have come from the study of real data?"
  • Correlations (Score:5, Insightful)

    by Lachryma ( 949694 ) on Friday November 30, 2007 @03:25PM (#21537371)
    The identities were learned because the users shared their movie preference information with IMDB.

    I don't see this as a problem, yet.

    • Re: (Score:1, Flamebait)

      by ShieldW0lf ( 601553 )
      Do I care that information about me might have been released amongst all the rest? Not at all.

      Do I care that massive companies and governments get to amass all this data and not share it with the rest of us? A great deal.

      There is too much privacy. No one cares about your guilty little sexual encounters, no one cares what the doctor says is going to kill you, and there are truly evil people hiding terrible things while you concern yourself with such trivialities.

      Get over yourself. Stop fighting for secre
    • by jfengel ( 409917 )
      They shared some of their movie preference information with the IMDB, but they may have intended to keep the rest of it private. Some of those private ratings have now slipped out.

      I don't know if anything really important came of it, but it's extremely illustrative: even anonymized data can become known if you can tie it in to a public data source. Movie ratings data may not be very important, or at most slightly embarrassing ("you LIKED Ghost Dad? Ewwww!"), but it could easily have been worse if the data had been
    • The only problem I could see is if they didn't get the users' permission first. If the users give their permission, then where's the problem? If not, what the hell was Netflix thinking? I understand selection bias, but that doesn't change the fact that users care about their privacy.
    • by yali ( 209015 )
      Here's what I wrote (with minor editing) in a comment to the earlier article...

      Suppose that you want to keep your political attitudes private -- for whatever reason, you decided it's nobody else's business. On IMDb, publicly linked to your real identity, you choose to only rate movies with non-political content, which you don't mind anybody knowing your opinion about. On Netflix, you believe that your ratings will be kept private, and you want to take advantage of their recommendations. So you rate all the
  • by omnirealm ( 244599 ) on Friday November 30, 2007 @03:25PM (#21537383) Homepage
    There exist effective techniques that can anonymize the data in order to thwart attempts to correlate identities, while still preserving the statistical properties of the data that make it useful to researchers. They include k-anonymity and l-diversity:

    http://privacy.cs.cmu.edu/people/sweeney/kanonymity.html [cmu.edu]

    http://www.cs.cornell.edu/~dkifer/papers/ldiversity.pdf [cornell.edu]
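    As a rough illustration of the k-anonymity idea (just a minimal sketch with made-up fields, not the implementation described in the links above): coarsen the quasi-identifiers until every combination of them is shared by at least k records.

    ```python
    from collections import Counter

    # Toy records: (zip, age, sex) are quasi-identifiers; 'rating' is the payload.
    records = [
        {"zip": "14850", "age": 23, "sex": "F", "rating": 4},
        {"zip": "14853", "age": 25, "sex": "F", "rating": 2},
        {"zip": "14901", "age": 47, "sex": "M", "rating": 5},
        {"zip": "14905", "age": 44, "sex": "M", "rating": 3},
    ]

    def generalize(rec):
        """Coarsen quasi-identifiers: 3-digit zip prefix, 10-year age band."""
        return (rec["zip"][:3], rec["age"] // 10 * 10, rec["sex"])

    def is_k_anonymous(recs, k):
        """True if every generalized quasi-identifier combination occurs at least k times."""
        counts = Counter(generalize(r) for r in recs)
        return min(counts.values()) >= k

    print(is_k_anonymous(records, 2))  # True for these toy records
    ```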
    • by stranger_to_himself ( 1132241 ) on Friday November 30, 2007 @04:25PM (#21538031) Journal

      From scanning those articles it looks as if they are just methods for defining levels of anonymity in a dataset, rather than providing any effective means of achieving it (please correct me if I'm wrong).

      I can't see how, for example, I could plan a study of small-area (i.e. zip-code-level) variation in the levels of some disease or other, adjusting for, say, age, sex, and ethnicity, without a dataset that included all of these items. How could you make the records less unique without throwing away the data?

      We have to accept that if we want meaningful research to happen, then some amount of data sharing and linking needs to occur. We need to rely, in medicine at least, on ethics committees to represent our best interests when it comes to striking the balance.

      It seems to me that the trend for guarding personal data like it's the family silver is a relatively modern thing. If it continues, then reliable, unbiased medical research, especially disease monitoring and control, will become impossible.

      • So how does this acceptance of the professional responsibility of researchers change when one acknowledges that at any moment homeland security or the like can issue a national security letter to obtain access to the dataset? They could use it to identify potential troublemakers, and moreover to uncover people's secrets to blackmail them with. Or they could uncover minor crimes and selectively enforce the laws on people they suspect of whatever they aren't able to prove. Worse, they could employ these t
        • I understand these arguments, and they're important, but I don't think the answer lies in trying to hide things. The powerful guys will always find a way to get what they want, and it's really just the smaller people with legitimate uses going through the correct procedures that you'll hinder.

          People can't blackmail you with public domain information. And if the data on minor crimes is available on everybody, then you can point to the selective prosecution. Information is a powerful tool in the hands of b

          • Powerful weapons, particularly early in their history, are invariably clumsy and prone to lots of collateral damage.

            I'm not sure society would accept the cost of that damage in exchange for the benefit, even if you claimed it would only be for a transition period. Heck, I'm not sure you could convince me any such transition period would ever end.
    • Yes, seems like there is no reason a technical solution couldn't solve the problem of balancing privacy with data sharing. There is still plenty to be learned if the data sharing were general enough. If researchers knew my age, sex, weight -- do they really need to know my name and address? At the same time, the irony is that if we all released every single detail about ourselves to researchers, the world would be fine -- it's not the researchers that are the bad guys. It is the storing of the data somewher
      • No, I don't think there is a technically feasible way to retain anonymity while providing the type of data wanted by researchers and clickstream corporations.

        The reason is that the researchers and clickstream companies don't just want the raw data of what is occurring on a given network. They want to be able to track the individual web browsing habits of particular users. They don't need to know who "user 123" is, but they need to be able to differentiate "user 123's" web browsing habits from "user 99
  • The Impact (Score:5, Insightful)

    by flaming error ( 1041742 ) on Friday November 30, 2007 @03:27PM (#21537423) Journal
    > how will this impact research and the development
    > of future technologies that could have come from the
    > study of real data?

    It's definitely a hindrance. Kind of like not letting cops search houses without permission.
    • Re: (Score:3, Informative)

      but it's the company's data, not yours. Once they strip out your name and such, your privacy claims are limited. Not that people won't piece things back together using an outside database; this is what happened in the Netflix case. They were able to guess user #3956's name at ANOTHER website. They could probably keep the info off the net-at-large by only letting researchers use their equipment under NDA, so not everybody has this info.

      As far as "legal searching" goes, they already do this... legally, they j
      • Arguably the data is not truly anonymized if a third party can reconnect your name to the data. So claiming that it's not really "your" data, simply because they did some form of obfuscation, is a bit bogus.

        What these companies really should do is just ask people when they sign up for the service "Hey, we might someday want to provide academic researchers with data on our customers' purchase habits. We will do our best to anonymize this data before providing it to the researchers, but if you've provided s
    • So you're saying that this would all be pointless speculation if they let their users choose whether to participate or not?
    • by coaxial ( 28297 )
      You're being sarcastic, but lack of real data is a hindrance. In information retrieval, data sets with real users are hard to get. You need real user data because that's how you evaluate whether your algorithm is any good at helping real people find things. People are noisy. People do dumb things. People aren't optimal! Heuristics for mimicking user behavior can work for something like this, but ultimately you have to test against real user data. Otherwise you're optimizing your system for a user that doesn't ex
  • Opt-in (Score:4, Interesting)

    by chiasmus1 ( 654565 ) on Friday November 30, 2007 @03:28PM (#21537441) Homepage
    There are people who do not really care if their search results are added to the collection that is released. If Google had an opt-in option for data that they were going to release to academic researchers, I would opt-in. I imagine that there are other people who do not care who is looking at their searches. Something that companies might consider if they wanted to release search results is the option for the users to see what information gets released.
    • Re: (Score:3, Insightful)

      by houstonbofh ( 602064 )
      But how many would? There are "Chilling Effects" all over the place. For example, I don't want to share my data because it may not be deleted (Gmail and Facebook), and I don't want you to share my data because I don't know what you will do with it (RIAA), and no one wants to approach the line because lawyers are too damn expensive. I think we need to reinstitute "Trial by Combat" as a defense. Nothing else has stopped frivolous legal shenanigans...
      • I think we need to reinstitute "Trial by Combat" as a defense.
        That'd better be trial-by-combat-no-proxies-allowed. What makes you think that the MPAA wouldn't be able to afford the services of Chuck Norris?
    • Re:Opt-in (Score:5, Insightful)

      by kcwhitta ( 232438 ) on Friday November 30, 2007 @03:38PM (#21537537)
      The problem with opt-in statistical gathering is that it can skew a sample, subtly biasing it. This would invalidate a lot of scientific research.
      • Re: (Score:3, Insightful)

        i.e., it might come as a surprise when researchers discover that NOBODY (who opted in) searches the internet for pornography, music torrents, Paris Hilton...

        Hell, out of Google's top 20 searches, you might get maybe 3 listed?
      • It wouldn't invalidate it completely. Worst case, it means the research is only applicable to the subset of people who agree to participate, rather than the user population as a whole. It may still yield useful insight for that subset, and (if the self-selection bias isn't too bad) possibly the larger user population too. So while not perfect, the opt-in data may still be good enough for some uses.

        Another important note is that the data gathering itself is not opt-in. It's the publishing of "anonymous" v
      • Then for research where it's more important to perform the research in a valid way than it is to protect the users' privacy, they can release the data to that researcher alone. For things such as improving the recommendation of movies, the bias should be okay.
      • Considering that the majority of psychological studies are performed on college freshmen taking a Psych 101 class, the reality is that "getting an unbiased random sample" is an ideal that researchers rarely worry too much about living up to.

        Not that I'm defending this practice, but I do think that a very large sample of Google/AOL users who opted-in would actually be more generalizable than the average study.

        • Considering that the majority of psychological studies are performed on college freshmen taking a Psych 101 class, the reality is that "getting an unbiased random sample" is an ideal that researchers rarely worry too much about living up to.

          That's not really true. It might be true for toy studies for students or pilots where instruments are being tested (I've done a few myself in that context), but all serious psychological studies spend a good deal of time trying to get their sample right.
          • Pssh, you haven't read the studies I have, then. Let me just grab a few sitting on my desk (not counting the ones done on children) and list their participants:

            36 respondents at the U of MN, 107 students at U of MN (journal of consumer research)
            90 U of MD undergrads (j of personality and social psych)
            100 Columbia undergrads, 60 Columbia undergrads, 29 Columbia undergrads (j of personality and social psych)
            114 UCSB undergrads, 16 female UCSB grad students (applied cog psych)
            26 males with normal or corr

            • Okay, I stand corrected. That's quite shocking. I do more neurological/psychiatric stuff I suppose, which is a bit more medical, so the samples have to be more population-representative, and our study staff spend a lot of time sampling electoral registers and randomly knocking on doors. We even worry sometimes when we see samples all recruited from the same town.
    • That's what should happen, not necessarily what would happen here. Search companies could just attempt to assume the rights to searches, since the searching was, after all, done on their servers.
    • by KevMar ( 471257 )
      Naw, make it an opt-out that you have to update every 6 months. We will call it the "do not release my info" registry.

      I can see a lot of value in letting information like this be released, but there should be some rules attached to it.

      First, make it by-request instead of open-access data. People requesting access should sign privacy contracts that only allow them to publish the results as long as the results don't identify anyone.

      I don't mind the research saying "23 can be identified by name that searched for
  • by Rob T Firefly ( 844560 ) on Friday November 30, 2007 @03:32PM (#21537479) Homepage Journal

    Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.'
    Eric, you fool! Have you no concept of the world's tendency toward drama and hilarity? Loudly declaring "this kind of thing could never happen at Google" is like saying "at least it's not raining" or "it's a million-to-one chance" or some other damn fool thing that will prove you wrong nine times out of ten.
    • He knows it's not likely to happen at his company because they are already monetizing this type of data mining research themselves in house and don't want to let anyone else do it. :)
    • Colon: "So it'd only work if it's your actual million-to-one chance."
      Nobby: "I suppose that's right."
      Colon: "So 999,943-to-one, for example--"
      Carrot: "Wouldn't have a hope. No-one ever said 'It's a 999,943-to-one chance but it might just work.'"
    • by 3ryon ( 415000 )
      Eric, you fool! Have you no concept of the world's tendency toward drama and hilarity? Loudly declaring "this kind of thing could never happen at Google" is like saying "at least it's not raining" or "it's a million-to-one chance" or some other damn fool thing that will prove you wrong nine times out of ten.

      I'll never win the lottery, I'll never win the lottery! Do you hear me god? I'll never win the lottery!
  • by Anonymous Coward
    The final question regarding "what research opportunities will be lost" because of data privacy is pretty horrible. It is analogous to "what crime prevention successes will be sacrificed because society was not willing to live as a collective prisoner to the state?" I.e., duh - yes, you can prevent crime by locking everyone up. But there are *more important values* to be achieved by not presuming everyone guilty and locking them up ahead of time. I.e., in the same way, yeah, you could have all kinds of
    • You're probably just trolling, but in case you aren't, seeing the rampant crime that is institutionalized in modern prisons, I think your argument falls flat on its face.

      Liberty doesn't have security as its price. Liberty and security are often correlated, not inversely correlated as you assume.

      As more people are free to do things that don't infringe on others' security, security often goes up as the people who would be breaking security systems for their own benefit have plenty of other "acceptab
    • Nodal networks are interesting things. There's research to be had there, regardless of what a 'node' is. This article is about cleansing real world data in such a way that the 'nodes' can be used for such research regardless of nodal identity. So, yes, real and interesting anonymous data can be gleaned. But so can meta-data associated with a 'node'.

      Just hope that you don't become too AdNoid while your AdNodes are tonsured.

      Cheers,
      Matt
  • by BlowChunx ( 168122 ) on Friday November 30, 2007 @03:42PM (#21537579)
    I love this quote from TFA:
    "Companies do not make money by giving researchers access to data. "

    Wrong! Netflix released data to get a better recommendation system. The better they can pick movies for you, the more you will like their service. The $1 million prize is peanuts compared to the increase in revenue a better system can bring.

    I wonder if anyone has estimated the value of the man hours invested in this contest?
    • Netflix released data to get a better recommendation system.

      Yes, but Netflix has a very good system already. Now Tivo, Blockbuster, inteliflix, Dish, etc. have a well-researched starting point to catch up, and they have more data than the researchers do, thanks to the Netflix data combined with their own in-house data.

      Granted, the winning solution looks way too computationally and data intensive to run on a Tivo box in real time. I guess those units with the regular phone connection could have it processed off-line and receiv

      • No, I don't think that Netflix had a "very good system already". I don't do pattern recognition for a living (my field is CFD), and I had a system that beat Netflix after about 1 month of reading papers and figuring out how to compute the SVD for a large sparse matrix.

        What they *really* need is a good way to filter the errors out of the data that they have. Errors in the data introduce larger errors in your predictions...
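        For what it's worth, here is a minimal sketch of the kind of SVD-based predictor the parent describes (hypothetical toy data and scipy's sparse SVD; not the poster's actual system):

        ```python
        import numpy as np
        from scipy.sparse import csr_matrix
        from scipy.sparse.linalg import svds

        # Toy (user, movie, rating) triples; the real Netflix data is ~100M of these.
        triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
        rows, cols, vals = zip(*triples)
        R = csr_matrix((vals, (rows, cols)), shape=(3, 3))

        # Rank-2 truncated SVD of the sparse ratings matrix.
        U, s, Vt = svds(R, k=2)

        # The predicted rating for user u and movie m is the (u, m) entry of U * diag(s) * Vt.
        def predict(u, m):
            return float(U[u, :] @ np.diag(s) @ Vt[:, m])

        print(round(predict(1, 1), 2))  # estimate for an unseen (user, movie) pair
        ```

        In practice the hard part is exactly what the parent says: cleaning and regularizing the data so that errors in the ratings don't blow up the low-rank estimate.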
  • If companies don't do a thorough enough job of sanitizing statistical data before releasing it, they have to be prepared to deal with the consequences. I'm all for maintaining research access to large volumes of real-world data, but it does need to be obtained through responsible channels.

    All that said, I think an interesting question is: How can we build systems that appropriately compensate companies for access to their data, with strict enforcement of measures designed to thwart misuse of the data? Po
  • So... (Score:4, Funny)

    by thatskinnyguy ( 1129515 ) on Friday November 30, 2007 @03:49PM (#21537631)

    Shortly after the AOL incident, Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.'
    So...
    AOL = evil
    Netflix = evil
    Facebook = evil
    Goolgle != evil

    Thank you, Eric, for giving us the warm and fuzzies that Google is not evil with your two cents.
    • Goolgle != evil

      You've got it all wrong... "Goolgle" is clearly evil, since whoever they are, they're obviously trying to get rich quick on typos made by the most awesome company on the planet (Google).

      Google would never stoop to such measures, another shining example of why "Google" ne "evil".

      • "Google" ne "evil"
        But "Google" == "evil", so google is both evil and not evil at the same time.

        Aw crap, my cat just pooped in some box...
        • My cat crapped in the litter box about half an hour ago. We'd better carefully consider the implications of this statistically correlated data and guard against its improper release and use!

  • you can't just randomly give people untested drugs, you need to try it out on rats first

    so obviously, in the future, rats will use aol and we will get human usage pattern information from that
    • by sm62704 ( 957197 )
      you can't just randomly give people untested drugs, you need to try it out on rats first

      You can if you're the US Military. I can tell you that from experience. Speaking of drugs...
    • What's with the 'in the future' bit? It seems to me that human PC users are rats.

      : Strong preference for an urban environment
      : Operate at night
      : Pink eyes
      : Sexually promiscuous
      : Tunnel vision
      : Socially organised into rat packs.
      : Cautious omnivores, who warn fellow rats of toxins
      : Proud champions of Darwin
      : Those who work in labs have white coats.
      : Worried about drain brains, etc...

      It's all going to turn to custard when humans who venture outdoors have better PC capability on their mobile phones. Those tree
  • If researchers use it irresponsibly, then they can't be trusted with access to it. Way to ruin a good thing guys.
    • Re: (Score:2, Informative)

      This is kinda like saying security researchers are to blame for discovering and publishing weaknesses in software. Responsible citizens just pretend everything is fine and wait for someone really bad to discover the same weaknesses and exploit them. Because it's so much easier chasing down criminals than it is to fix problems in the first place by adopting better security practices. I guess we could just arrest all researchers who publicize uncomfortable truths. What's the number to Adobe's legal department
      • If researchers want better access to data, then they need to play by the rules. The Netflix data was to be used to figure out a better recommendation algorithm, not to be cross-linked to IMDB in an attempt to expose people's identities. I would have hoped that the researchers would have had more respect for people than that.
        • I think you're missing the point of my post. It's one thing to say that researchers should be responsible and "play by the rules." If a researcher's intent is to turn a profit through abuse of statistical data, that's bad. If a researcher's aim is to expose how an unethical person would be able to turn a profit through misuse of the data, that's entirely different (i.e. the difference between security researchers and crackers). We need researchers to point out flaws in data sources, to prevent abuse of the
  • Medical records? (Score:4, Interesting)

    by CheeseTroll ( 696413 ) on Friday November 30, 2007 @04:09PM (#21537837)
    This puts the idea of analyzing "anonymous" electronic medical records in an interesting light. Even without a name, SSN, or other ID that explicitly links a record to a specific person, could researchers cross-reference the data with other databases well enough to identify people via patterns in their health record? I'm guessing yes.

    For the record, it's not my intent to troll, but I do think it's something that future researchers will need to take into account to ensure people's privacy.
    • by guruevi ( 827432 )
      The problem is not cross-referencing patterns, because neither machines nor humans can detect patterns well enough to match up 2 datasets that are related only by their patterns, especially among a large population. So technically, to be totally perfect, you should release the anonymized data of EVERYBODY and filter out cases that are truly unique (which can be done simply with keywords).

      The main issue with AOL's and Netflix's releases is that they released data that was self-referencing to the user, using substitution to replace names with uid'
      • by vrmlguy ( 120854 )

        If AOL/Netflix would've done the same thing and replaced those unique words/names with a generic word, the researchers would've had much more trouble matching up the users.

        The problem is, if you replace words with other words, you're destroying the semantic meaning of the text, and IIRC the winning algorithm used that semantic meaning to assign scores. Specifically, they looked for common words in movie titles; if you rent several movies with "pirate" in the title, it's reasonable that you might be interested in other movies with "pirate" in the title. Now, you could build a hash table that consistently replaced words with meaningless strings (i.e., "pirate" becomes "nhy6m
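        A consistent-replacement table along those lines might look like this (a hypothetical sketch using a keyed hash, so replacements are stable across the whole dataset but can't be reversed with a simple dictionary lookup):

        ```python
        import hashlib
        import hmac

        SECRET_KEY = b"keep-this-key-out-of-the-released-dataset"  # hypothetical key

        def pseudonymize_word(word):
            """Map a word to a stable, meaningless token using a keyed hash."""
            digest = hmac.new(SECRET_KEY, word.lower().encode(), hashlib.sha256)
            return "w_" + digest.hexdigest()[:8]

        def pseudonymize_title(title):
            return " ".join(pseudonymize_word(w) for w in title.split())

        # Repeated words map to the same token in every title, so word
        # co-occurrence survives even though the words themselves are gone.
        print(pseudonymize_title("Pirates of the Caribbean"))
        print(pseudonymize_title("The Pirate Movie"))
        ```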

    • by copdk4 ( 712016 )
      Unlike the data from for-profit companies (AOL, Google, etc.), medical records are not necessarily "owned" by hospitals. In the US, the data is owned by the PATIENT, whereas in the UK it belongs to the NHS. Reference [wikipedia.org]

      So at least in the US, hospitals can get into legal trouble for even disclosing an anonymized dataset without consent from each and every patient (although several hospitals make patients sign a form waiving their rights to ownership).

  • Why depend on Fortune 500 companies to provide large volumes of data to researchers? They provide data composed of alphanumeric character sequences, punctuation, etc., right? There's a better way that provides that plus a more complete representation of the entire character set! Every UNIX-based machine comes with a built-in data generator: /dev/random

    (depending on your machine, your mileage may vary with the quality of the data).
    • That's great - if your goal is to analyze the statistical properties of the RNG. It kinda sucks if your goal is to conduct research or marketing in the real world.
  • This is just the tip of the iceberg. If you live in the US, it's likely that logs of all your web activity are being sold to clickstream companies. The data logs being sold by the ISPs seem to use the exact same sort of inadequate anonymity practices as were used by AOL.

    The problem is that no matter how well the data is cloaked, a user's browsing habits can easily make the anonymity worthless. As has been seen in the case of NetFlix and AOL, it's easy to figure out who a person is simply by looking at anonymized logs. A single visit to a social networking site is often enough to make a good guess. But when a specific anonymized IP address visits the same pages of social networking sites, or edits their profile at a social networking site, or reviews an item at a vendor site, the real identity of that "anonymized" IP address is completely confirmed.

    Simply cloaking an IP address will never provide anonymity. But the companies that purchase your web surfing logs would have no use for logs that weren't attached to a single user. Unless the ISPs were to keep track of and filter out every single vendor site which revealed a user's real name, there would seem to be no safe way to anonymize user logs. Since there are countless numbers of web forums, vendors, and social networking sites, it would seem technically impossible to truly provide any safe level of anonymity for user logs. Selling these logs is just a bad practice that needs to be stopped.

    I can only wonder why the EFF and other organizations haven't made a bigger deal about this. These ISPs are selling all of their users' web logs. I cannot imagine any effective way the ISPs could ever anonymize this data. More info: http://wanderingstan.com/2007-03-19/is_comcast_selling_your_clickstream_audio_transcript [wanderingstan.com] http://arstechnica.com/news.ars/post/20070315-your-isp-may-be-selling-your-web-clicks.html [arstechnica.com]
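    To make the linking problem concrete, here is a sketch with purely hypothetical log lines and URL patterns: one revealing visit is enough to pin a pseudonymous ID to a real account.

    ```python
    import re

    # Hypothetical "anonymized" clickstream: (pseudonymous_id, url_visited).
    clickstream = [
        ("user_3956", "http://news.example.com/article/42"),
        ("user_3956", "http://social.example.com/profile/jane_doe/edit"),
        ("user_3956", "http://shop.example.com/item/991/review"),
        ("user_1007", "http://search.example.com/?q=weather"),
    ]

    # Editing your own profile reveals which account the pseudonym belongs to.
    PROFILE_EDIT = re.compile(r"social\.example\.com/profile/([^/]+)/edit")

    identities = {}
    for pseudo_id, url in clickstream:
        match = PROFILE_EDIT.search(url)
        if match:
            identities[pseudo_id] = match.group(1)

    print(identities)  # {'user_3956': 'jane_doe'}
    ```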
  • People can agree to share some information publicly, like movie or product rankings. That decision will drive down the price of costly marketing studies and will democratize insightful information.
    To balance the protection and sharing of information, more complex social-network infrastructure is required; maybe projects like OpenQabal [java.net] can help.


  • Well Gee Wally, they share our data with everybody damned else.
  • We have a major problem...
    1. Music-DRM protects the RIAA's data and tries to prevent end-users from deriving an unprotected version of the data in the file.

      Personal-data-DRM (anonymization) protects the ISP's or hospital's data and tries to prevent end-users (researchers) from deriving an unprotected (unanonymized) version of the data in the file.
    2. Music-DRM makes things more difficult for legitimate customers who legally purchased the data files (music).

      Personal-data-DRM (anonymization) makes things
