AOL, Netflix and the End of Open Research
An anonymous reader writes "In 2006, heads rolled at AOL after the company released anonymized logs of user searches. With last week's announcement that researchers had been able to learn the identities of users in the scrubbed Netflix dataset, could the days of companies sharing data with academic researchers be numbered? Shortly after the AOL incident, Google's Eric Schmidt called the data release 'a terrible thing,' and assured the public that 'this kind of thing could not happen at Google.' Will any high-tech company ever take this kind of chance again? If not, how will this impact research and the development of future technologies that could have come from the study of real data?"
Correlations (Score:5, Insightful)
I don't see this as a problem, yet.
Re: (Score:1, Flamebait)
Do I care that massive companies and governments get to amass all this data and not share it with the rest of us? A great deal.
There is too much privacy. No one cares about your guilty little sexual encounters, no one cares what the doctor says is going to kill you, and there are truly evil people hiding terrible things while you concern yourself with such trivialities.
Get over yourself. Stop fighting for secre
Re: (Score:1)
Re: (Score:2)
I don't know if anything really important came of it, but it's extremely illustrative: even anonymized data can become known if you can tie it in to a public data source. Movie ratings data may be important, or at most slightly embarrassing ("you LIKED Ghost Dad? Ewwww!") but it could easily have been worse if the data had been
Re: (Score:2)
Re: (Score:2)
Suppose that you want to keep your political attitudes private -- for whatever reason, you decided it's nobody else's business. On IMDb, publicly linked to your real identity, you choose to only rate movies with non-political content, which you don't mind anybody knowing your opinion about. On Netflix, you believe that your ratings will be kept private, and you want to take advantage of their recommendations. So you rate all the
k-anonymity and l-diversity (Score:5, Informative)
http://privacy.cs.cmu.edu/people/sweeney/kanonymity.html [cmu.edu]
http://www.cs.cornell.edu/~dkifer/papers/ldiversity.pdf [cornell.edu]
Re:k-anonymity and l-diversity (Score:4, Insightful)
From scanning those articles it looks as if they are just methods for defining levels of anonymity in a dataset, rather than providing any effective means of achieving it (please correct me if I'm wrong).
I can't see how, for example, if I am planning a study of small-area (i.e., zip-code-level) variation in the levels of some disease or other, while adjusting for, say, age, sex, and ethnicity, I could do so without a dataset that included all of these items. How could you make the records less unique without throwing away the data?
We have to accept that if we want meaningful research to happen, then some amount of data sharing and linking needs to occur. We need to rely, in medicine at least, on ethics committees to represent our best interests when it comes to striking the balance.
It seems to me that the trend for guarding personal data like it's the family silver is a relatively modern thing. If it continues, then reliable, unbiased medical research, especially disease monitoring and control, will become impossible.
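To make the k-anonymity question above concrete, here is a minimal sketch of what checking it might look like (Python; the records and field names are invented to mirror the zip/age/sex/ethnicity example, not drawn from any real dataset):

from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    # Every combination of quasi-identifier values must occur at least
    # k times; otherwise some record is unique enough to link to a person.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"zip": "14850", "age": 31, "sex": "F", "ethnicity": "A", "disease": "flu"},
    {"zip": "14850", "age": 31, "sex": "F", "ethnicity": "A", "disease": "none"},
    {"zip": "14851", "age": 47, "sex": "M", "ethnicity": "B", "disease": "flu"},
]
print(is_k_anonymous(records, ["zip", "age", "sex", "ethnicity"], k=2))  # False

The third record is unique on its quasi-identifiers, which is exactly the commenter's point: reaching k-anonymity means coarsening or dropping those fields, i.e., throwing away data.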
throw in the feds (Score:2)
Re: (Score:2)
I understand these arguments, and they're important, but I don't think the answer lies in trying to hide things. The powerful guys will always find a way to get what they want, and it's really just the smaller people with legitimate uses going through the correct procedures that you'll hinder.
People can't blackmail you with public domain information. And if the data on minor crimes is available on everybody, then you can point to the selective prosecution. Information is a powerful tool in the hands of b
clumsy (Score:2)
I'm not sure society would accept the cost of that damage in exchange for the benefit, even if you claimed it would only be for a transition period. Heck, I'm not sure you could convince me any such transition period would ever end.
Re: (Score:1)
Keeping it anonymous is effectively impossible (Score:2)
The reason is that the researchers and clickstream companies don't just want the raw data of what is occurring on a given network. They want to be able to track the individual web browsing habits of particular users. They don't need to know who "user 123" is, but they need to be able to differentiate "user 123's" web browsing habits from "user 99
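One common way to get that linkable-but-unnamed property is to replace each real identifier with a keyed hash, so "user 123" always maps to the same opaque token without the token itself revealing an identity. A minimal sketch (Python; the key and log entries are hypothetical):

import hashlib
import hmac

SECRET_KEY = b"held-by-the-data-owner"  # never released alongside the dataset

def pseudonymize(user_id: str) -> str:
    # The same input always yields the same token, so one user's
    # browsing habits stay linkable across the whole log.
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:12]

log = [("user123", "/news"), ("user99", "/sports"), ("user123", "/weather")]
print([(pseudonymize(uid), url) for uid, url in log])

As the AOL and Netflix episodes show, though, consistent pseudonyms are exactly what makes re-identification possible once the click patterns themselves become distinctive.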
The Impact (Score:5, Insightful)
> of future technologies that could have come from the
> study of real data?
It's definitely a hindrance. Kind of like not letting cops search houses without permission.
Re: (Score:3, Informative)
As far as "legal searching" goes, they already do this... legally, they j
Re: (Score:2)
What these companies really should do is just ask people when they sign up for the service "Hey, we might someday want to provide academic researchers with data on our customers' purchase habits. We will do our best to anonymize this data before providing it to the researchers, but if you've provided s
Re: (Score:2)
Re: (Score:2)
Opt-in (Score:4, Interesting)
Re: (Score:3, Insightful)
Re: (Score:2)
Re: (Score:1, Offtopic)
Chuck Norris? Isn't he getting a bit long in the tooth? They would probably prefer someone like Chuck "The Iceman" Liddell [wikipedia.org] or some other professional mixed martial arts fighter instead...
Re: (Score:1)
http://www.chucknorrisfacts.com/ [chucknorrisfacts.com]
Re:Opt-in (Score:5, Insightful)
Re: (Score:3, Insightful)
Hell, out of Google's top 20 searches, you might get maybe 3 listed?
Re: (Score:2)
Another important note is that the data gathering itself is not opt-in. It's the publishing of "anonymous" v
Re: (Score:2)
Re: (Score:2)
Not that I'm defending this practice, but I do think that a very large sample of Google/AOL users who opted in would actually be more generalizable than the average study.
Re: (Score:2)
Considering that the majority of psychological studies are performed on college freshmen taking a Psych 101 class, the reality is that "getting an unbiased random sample" is an ideal that researchers rarely worry too much about living up to.
That's not really true. It might be true for toy studies for students or pilots where instruments are being tested (I've done a few myself in that context), but all serious psychological studies spend a good deal of time trying to get their sample right.
Re: (Score:2)
36 respondents at the U of MN, 107 students at U of MN (journal of consumer research)
90 U of MD undergrads (j of personality and social psych)
100 Columbia undergrads, 60 Columbia undergrads, 29 Columbia undergrads (j of personality and social psych)
114 UCSB undergrads, 16 female UCSB grad students (applied cog psych)
26 males with normal or corr
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
I can see a lot of value in letting information like this be released, but there should be some rules attached to it.
First, make it by-request instead of open-access data. People requesting access should sign privacy contracts that only allow them to publish the results as long as the results don't identify anyone.
I don't mind the research saying "23 can be identified by name that searched for
Inviting drama (Score:5, Funny)
Re: (Score:2)
Quoth the unbeatable Pratchett (Score:2)
Nobby: "I suppose that's right."
Colon: "So 999,943-to-one, for example--"
Carrot: "Wouldn't have a hope. No-one ever said 'It's a 999,943-to-one chance but it might just work.'"
Re: (Score:2)
I'll never win the lottery, I'll never win the lottery! Do you hear me god? I'll never win the lottery!
privacy: you can't have your cake, and eat it too (Score:1, Insightful)
locking all people up doesn't prevent all crime (Score:3, Insightful)
Liberty doesn't have security as its price. Liberty and security are often positively correlated, not inversely correlated as you assume.
As more people are free to do things that don't infringe on others' security, security often goes up, as the people who would be breaking security systems for their own benefit have plenty of other "acceptab
Re:privacy: you can't have your cake, and eat it t (Score:1)
Just hope that you don't become too AdNoid while your AdNodes are tonsured.
Cheers,
Matt
research for the sake of? (Score:5, Insightful)
"Companies do not make money by giving researchers access to data. "
Wrong! Netflix released data to get a better recommendation system. The better they can pick movies for you, the more you will like their service. The $1 million prize is peanuts compared to the increase in revenue a better system can bring.
I wonder if anyone has estimated the value of the man hours invested in this contest?
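For reference, Netflix Prize entries were judged by root-mean-square error against a held-out set of ratings, with the grand prize requiring a 10% improvement over Netflix's own Cinematch. A minimal version of the metric (the rating arrays below are made up):

import math

def rmse(predicted, actual):
    # Root-mean-square error: lower is better.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

print(rmse([3.5, 4.0, 2.1], [4, 4, 1]))  # ~0.70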
Re: (Score:2)
Yes, but Netflix has a very good system already. Now TiVo, Blockbuster, Intelliflix, Dish, etc. have a well-researched starting point to catch up, and they have more data than the researchers thanks to the Netflix data combined with their own in-house data.
Granted, the winning solution looks way too computation- and data-intensive to run on a TiVo box in real time. I guess those units with the regular phone connection could have it processed off-line and receiv
Re: (Score:2)
What they *really* need is a good way to filter the errors out of the data that they have. Errors in the data introduce larger errors in your predictions...
Responsibility and rewards. (Score:2, Interesting)
All that said, I think an interesting question is: How can we build systems that appropriately compensate companies for access to their data, with strict enforcement of measures designed to thwart misuse of the data? Po
So... (Score:4, Funny)
AOL = evil
Netflix = evil
Facebook = evil
Goolgle != evil
Thank you Eric for giving us the warm and fuzzies that Google is not evil with your two cents.
Re: (Score:1)
You've got it all wrong... "Goolgle" is clearly evil, since whoever they are, they're obviously trying to get rich quick on typos made by the most awesome company on the planet (Google).
Google would never stoop to such measures, another shining example of why "Google" ne "evil"
Re: (Score:1)
Aw crap, my cat just pooped in some box...
Re: (Score:1)
they have the same problem in pharmaceuticals (Score:2)
So obviously, in the future, rats will use AOL and we will get human usage-pattern information from that.
Re: (Score:2)
You can if you're the US Military. I can tell you that from experience. Speaking of drugs...
Re: (Score:1)
: Strong preference for an urban environment
: Operate at night
: Pink eyes
: Sexually promiscuous
: Tunnel vision
: Socially organised into rat packs.
: Cautious omnivores, who warn fellow rats of toxins
: Proud champions of Darwin
: Those who work in labs have white coats.
: Worried about drain brains, etc...
It's all going to turn to custard when humans who venture outdoors have better PC capability on their mobile phones. Those tree
Researchers are to blame. (Score:2)
Re: (Score:2, Informative)
Re: (Score:2)
Re: (Score:2)
Medical records? (Score:4, Interesting)
For the record, it's not my intent to troll, but I do think it's something that future researchers will need to take into account to ensure people's privacy.
Re: (Score:2)
The main issue with AOL and Netflix is that they released data that was self-referencing to the user, using substitution to replace names with uid'
Re: (Score:2)
If AOL/Netflix had done the same thing and replaced those unique words/names with a generic word, the researchers would've had much more trouble matching up the users.
The problem is, if you replace words with other words, you're destroying the semantic meaning of the text, and IIRC the winning algorithm used that semantic meaning to assign scores. Specifically, they looked for common words in movie titles; if you rent several movies with "pirate" in the title, it's reasonable that you might be interested in other movies with "pirate" in the title. Now, you could build a hash table that consistently replaced words with meaningless strings (i.e., "pirate" becomes "nhy6m
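A minimal sketch of that consistent-replacement idea (Python; the token format is invented):

import secrets

replacement_table = {}

def scrub(title: str) -> str:
    # Each distinct word maps to the same meaningless token everywhere,
    # so "pirate" still clusters with "pirate" after scrubbing.
    tokens = []
    for word in title.lower().split():
        if word not in replacement_table:
            replacement_table[word] = secrets.token_hex(4)
        tokens.append(replacement_table[word])
    return " ".join(tokens)

print(scrub("pirate radio"))
print(scrub("the pirate movie"))  # "pirate" gets the same token in both

Note that word frequencies survive the mapping, so a determined attacker could still match token distributions against the known universe of movie titles; consistent substitution hides the words, not the patterns.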
Re: (Score:1)
So at least in the US, hospitals can get into legal trouble for even disclosing an anonymized dataset without consent from each and every patient (although several hospitals make patients sign a form waiving their rights to ownership).
It has to be said. (Score:1, Offtopic)
(depending on your machine, your mileage may vary with the quality of the data).
Re: (Score:2)
ISP's already sell all your web browsing logs (Score:3, Informative)
The problem is that no matter how well the data is cloaked, a user's browsing habits can easily make the anonymity worthless. As has been seen in the cases of Netflix and AOL, it's easy to figure out who a person is simply by looking at anonymized logs. A single visit to a social networking site is often enough to make a good guess. And when a specific anonymized IP address repeatedly visits the same page of a social networking site, or edits their profile there, or reviews an item at a vendor site, the real identity behind that "anonymized" IP address is completely confirmed.
Simply cloaking an IP address will never provide anonymity. But the companies that purchase your web-surfing logs would have no use for logs that weren't attached to a single user. Unless the ISPs were to keep track of and filter out every single vendor site which revealed a user's real name, there would seem to be no safe way to anonymize user logs. Since there are countless web forums, vendors, and social networking sites, it seems technically impossible to provide any safe level of anonymity for user logs. Selling these logs is just a bad practice that needs to be stopped.
I can only wonder why the EFF and other organizations haven't made a bigger deal about this. These ISPs are selling all of their users' web logs. I cannot imagine any effective way the ISPs could ever anonymize this data. More info: http://wanderingstan.com/2007-03-19/is_comcast_selling_your_clickstream_audio_transcript [wanderingstan.com] http://arstechnica.com/news.ars/post/20070315-your-isp-may-be-selling-your-web-clicks.html [arstechnica.com]
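The linking attack described above reduces to a simple set intersection: if a handful of pages would only ever be visited by one person (their own profile-edit URL, say), any pseudonymous clickstream containing them is unmasked. A toy illustration (Python; all data invented):

logs = {
    "anon-7f3a": {"/news", "/profile/edit?u=jsmith", "/video/42"},
    "anon-c901": {"/sports", "/weather"},
}
known_pages = {"/profile/edit?u=jsmith"}  # publicly tied to a real person

# Any pseudonym whose clickstream contains all the known pages is a match.
matches = [pid for pid, pages in logs.items() if known_pages <= pages]
print(matches)  # ['anon-7f3a'] -- the "anonymized" stream now has a name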
Open Marketing Initiatives (Score:1)
To balance the protection and sharing of information, more complex social network infrastructure is required; maybe projects like OpenQabal [java.net] can help.
Well Gee Wally (Score:1)
Well Gee Wally, they share our data with everybody damned else.
Anonymizing personal data == DRM (Score:2)
Personal-data-DRM (anonymization) protects the ISP's or hospital's data and tries to prevent end-users (researchers) from deriving an unprotected (unanonymized) version of the data in the file.
Personal-data-DRM (anonymization) makes things