An anonymous reader writes "The BBC reports on some changes to the data retention policy at Google in response to pressure from European authorities, but also included in the article is information about why Google claims they need to retain non-anonymised data for so long. Improving services, sure, but preventing fraud? Aiding 'valid legal orders'?"
Reader s0ckratees points to some commentary on the change at Google's official blog. The upshot: IP addresses in Google's logs will be anonymized after nine months, rather than 18 as previously.
by Anonymous Coward
on Tuesday September 09 2008, @09:31AM (#24932649)
Do you realise how pointless it is to anonymise IP addresses after 18 months, or even 9 months? Via simple database associations they can link access to a particular individual and no longer need the IP address for longer-term analysis. Based upon those records and the intervening IP address records, any new access can be tied to the existing database and the individual user.
Deal with it? Perhaps, after someone pointed out your mistake, you can in the future learn from it, instead of telling us to 'deal with it'.
That wasn't a bad spelling or a random typo: you used a completely different word...twice. It made it difficult to read your post.
Sorry, I don't know your accent when I try to read something, nor should I to be able to comprehend it. Spelling is still important. For example, it took me a couple of reads to understand the line "A wrong word but attempted to by phonically correct" (that by was supposed to be 'be'). When you're reading a sentence and see a completely different word than what the word is supposed to be, it does throw people off the mark when trying to comprehend what you wrote, because they try to put the incorrect word i
peon@google ~ $ /bin/su -
Password:
google ~ # psql searchlogs scooby -W
Password:
Welcome to psql 8.0.15, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

searchlogs=> DELETE FROM searchlogs WHERE ts < NOW() - INTERVAL '9 months';
DELETE 551719812875516
searchlogs=> \q
google ~ # logout
peon@google ~ $ logout
Scrape it (Score:3, Funny)
To sparkling shine
So the chin
Hairless, divine
Burma Shave
So if you live in china (Score:4, Interesting)
And the government wants to know who's been searching for things they don't approve of, they have to ask Google for the logs every 9 months rather than every 18 months.
Re:So if you live in china (Score:4, Insightful)
You may not like that Google keeps data, but they have an almost perfect record for keeping it private from others. Or did you not see the fuss they raised over YouTube data, and how even after being ordered to turn over their data, they still fought to reach a compromise that protected user privacy?
As for China, there's a reason Google keeps literally zero servers on Chinese soil. Even data for Chinese nationals is kept out of China, specifically so Google won't have to turn it over.
Short of not keeping data at all, there is pretty much nothing more they can do to protect privacy. But that's never enough for SlashDot...
Re: (Score:2)
That order was from the US government and they were turning the screws, perhaps not that hard but they were turning.
Now I don't think google is the great force of good, but it so far is at least neutral which is more than can be said for most companies its size and scope.
Re:So if you live in china (Score:5, Insightful)
China is the least of my concerns. How about the Justice Department [zdnet.com] or the Department of Homeland Security?
The Europeans might be pressuring Google to reduce its retention periods, but I suspect that Google heard the opposite point-of-view from the government here in the USA.
Frankly I think that none of Google's logs should carry identifying information. If they need to track IPs for some reason, put them in a separate database table that's unconnected to the contents of the search strings. Keeping this information much beyond a week or two seems unreasonable to me.
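For what it's worth, the "separate table" idea above could look something like the following. This is only a rough sketch: the table names (`searches`, `ip_seen`), the per-day bucketing, and the helper names are all made up for illustration and say nothing about how Google actually stores its logs.

```python
import sqlite3

def make_db():
    """Create two deliberately unlinked tables (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    # Search strings carry no IP and no key pointing at the IP table.
    db.execute("CREATE TABLE searches (id INTEGER PRIMARY KEY, query TEXT)")
    # IPs are only counted per day, with nothing pointing back at a query.
    db.execute(
        "CREATE TABLE ip_seen (day TEXT, ip TEXT, hits INTEGER, "
        "PRIMARY KEY (day, ip))"
    )
    return db

def log_request(db, day, ip, query):
    """Record the query and the IP separately, with no shared identifier."""
    db.execute("INSERT INTO searches (query) VALUES (?)", (query,))
    db.execute(
        "INSERT OR IGNORE INTO ip_seen (day, ip, hits) VALUES (?, ?, 0)",
        (day, ip),
    )
    db.execute(
        "UPDATE ip_seen SET hits = hits + 1 WHERE day = ? AND ip = ?",
        (day, ip),
    )
```

Abuse detection can still count hits per address this way, but nothing in the schema says which address typed which query. Insertion order and timing could still leak a link, so this is a starting point rather than a guarantee.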
Re: (Score:3, Insightful)
In fact google clearly state they are only anonymising the users IP address and do not talk about any other long term user records. Even their pr
Re: (Score:3, Insightful)
The Europeans might be pressuring Google to reduce its retention periods, but I suspect that Google heard the opposite point-of-view from the government here in the USA.
Interesting: the Europeans want Google not to keep logs on people, the complete opposite of the European Union, which has no problem keeping logs on people for ever longer periods to see if they are a threat (to them getting voted out). The oppression-loving UK government is interested in unlimited data retention.
http://news.bbc.co.uk/2/hi/europe/4527840.stm [bbc.co.uk]
The European Parliament has approved rules forcing telephone companies to retain call and internet records for use in anti-terror investigations. Records will be kept for up to two years under the new measures.
Re: (Score:2)
Truly, a human rights victory for the ages! Like that place ... uh ... what's it called? Something "square". I think it might be in India. Google's not coming up with anything.
Re: (Score:3, Funny)
Whew! Good thing I'm in America.
Why the question. (Score:2)
but preventing fraud? Aiding 'valid legal orders'?
While I would say IP addresses shouldn't be the only method for these protection they do help.
Wow every site within 123.45.67.x seems to have a virus and malware on it. Oh a new site was scanned its address is in 123.45.67.x lets not publish right away lets put it threw full check. Or say 98.76.54.* always had clean site that were legit. A new site was found Well lets put it threw the quick checks and post it and queue it for full scan for later.
Yes knowing
Re: (Score:2)
What is being anonymised are the IP addresses of people who do a Google search.
Re: (Score:2)
Wow every Arab country seems to hate us and throw planes at us. Oh a new male was seen in an airport. His ethnicity is Arabic lets not let him board the plane right away lets put it threw(sic) full check. Or say whitey always had clean papers that were legit. A new white boy was found Well lets put him threw(sic) the quick checks and let him board and queue it for full scan for later.
Fixed that for you. Hope you see the lunacy of your statement now.
Re: (Score:2)
Your argument has flaws in it.
Checking IP Address is less like profiling an individual but the location. Even before 9/11 there has always been more security from people entering from plane in a different country. A plain from UK would be treated differently then a Plane from Switzerland (even if all the passengers are White Americans).
While they will all be security checked. There are fast tracks open for known friendly location vs. Neutral Locations.
If it is known that a Netblock owner is lax about the c
Re: (Score:2)
Re: (Score:2)
Except for the fact that my account is 10 years old... I am just a bad speller deal with it.
Re: (Score:2)
Re: (Score:2)
Yes, Deal with it. Oh my God! a guy made a misspelling on a message board, actually it was just the wrong word. As spell checkers sometimes switch the words around. It is a problem that I have, Yes I have Dyslexia, and it is a constant battle taking me hours to write up a one page homework assignment for class, or half a day for a simple document for work. I am not going to spend 25 minutes witting a post, on Slashdot to make sure it is perfect. If it is difficult to read it means your mind is to closed
Re: (Score:2)
Re: (Score:2)
The difference between occasional misspelling and wrong words vs. adding characters that do not have a Phonic meaning to them. You method is intentionally trying to officiate your message, A wrong word but attempted to by phonically correct (At least similar to the accent of the writer) isn't trying to officiate the message. 7337 speech or Elite speech was designed particularly as a method of making reading difficult as it required (Just so some internet geeks could feel good about themselfs when they write
Re: (Score:2)
Re: (Score:2)
I am not going to spend 25 minutes witting a post
If only everyone here spent even 5 minutes witting their posts, the quality of humor here would be much improved.
Improving services, sure, but preventing fraud? (Score:5, Insightful)
Improving services, sure, but preventing fraud?
Sure - AdWord fraud. Scrubbing logs quicker means less leeway for click fraud to be discovered.
Re: (Score:2)
Good will? No. Enlightened self-interest? Maybe. (Score:2)
they are scrubbing it out of good will more than anything.
Well, not really. The European advisory group was recommending a 6 month maximum, and Google were at 18. As Microsoft have learned, Europe is not shy about going after megacorps that think they are above its laws, and privacy and data protection issues are hot political topics in various EU countries right now, with a lot of media coverage of leaked data and rising public awareness of the dangers associated with such things.
This was done out of "good will" for the same reasons that industries accept "volunt
Google handing data over... (Score:2)
Google is handing data over to a few 3 letter agencies. BIG SHOCK! OH NO! NSA Reads my email!
Seriously, I put google not handing over such data at somewhere between 0 and -1.
Why not after DHCP lease expires... (Score:2)
Figure out a pseudo average for a DHCP lease... say 72 hours, and anonymize after that?
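A quick sketch of what that could mean in practice, assuming log entries are `(timestamp, ip, query)` tuples and that "anonymize" means zeroing the last octet. Both are assumptions for illustration; the 72 hours is just the guess above, and `truncate_ip` and `scrub` are made-up names.

```python
from datetime import datetime, timedelta
import ipaddress

LEASE = timedelta(hours=72)  # the guess from the post, not a measured average

def truncate_ip(ip):
    """Zero the last octet of an IPv4 address (a weak form of anonymization)."""
    net = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(net.network_address)

def scrub(entries, now):
    """Return entries with the IP truncated once they are older than LEASE."""
    return [
        (ts, truncate_ip(ip) if now - ts > LEASE else ip, query)
        for ts, ip, query in entries
    ]
```

Entries younger than the lease keep the full address for attack and fraud checks; older ones only keep the /24.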
9 months are too long (Score:3, Informative)
Actually, the IP should not be stored at all. Google might want to analyze the IPs to analyze and prevent attacks on its servers and additionally to get location information for its ad services. But there is no need to store it for a longer period -- unless you want to start massive data mining projects, which is exactly what is feared most from a privacy point of view.
So, any good news would be that the IP is not stored at all (except very temporarily).
Re: (Score:2)
-- unless you want to start massive data mining projects, which is exactly what is feared most from a privacy point of view.
Aehmm, I don't know, how you would describe the inner processes of a search engine, but in my book massive data mining is involved. So you've got at least one motive to store massive distributed data, like, say, the IP and search terms from 75% of the Internet's population: test input, Zeitgeist material, or localized ads and/or search results. Sorry, but storing data is a result of computing data and entering data.
Remember: It's not the information that hurts, it's the way you use and react to it.
Re: (Score:2)
-- unless you want to start massive data mining projects, which is exactly what is feared most from a privacy point of view.
Aehmm, I don't know, how you would describe the inner processes of a search engine, but in my book massive data mining is involved.
Agreed, but you do not need to store persistently for months the IP address of the requester. You can store the tags with the requests (such as location info, AS range of the IP). For more personalized computations you can use cookies -- at least with cookies the requester has some sort of control over his information.
Apparently, Google does not require the IP info after 9 months. So, what do they do with the IP address during those 9 months? Why can't they do it more or less immediately and then jus
Re: (Score:2)
9 months are too long
I am sure many women, especially mothers, agree.
Google anonymizing in China/India (Score:2, Informative)
What difference does it make to reduce this 18 months to 9 months log retention period?
Will Google anonymize logs in other countries too?
How about Google China? It respectfully hands over logs to the authorities on demand anytime. Same with Google India [wired.com].
I've just done it. (Score:4, Funny)
Google Web History goes back to April 2005 here (Score:2)
It appears this 18 months, or 9 months as it is now, does not apply to Google Web History when you are logged into your google account. My Web History log goes back to April 2005.
I for one am glad they are not deleting the Web History log at 9 months. It is nice to be able to peruse through my searches done years ago.
Re: (Score:2)
I for one am glad they are not deleting the Web History log at 9 months. It is nice to be able to peruse through my searches done years ago.
We Agree.
--Your Friendly Neighborhood TLA
Google needs its own country? (Score:2)
I always thought Google should buy Sealand or some other country and move its operations there, outside of United States laws. It would do a lot of good if we had a country that didn't have such crap... abuses of the new country's laws or lack of laws notwithstanding.
do the anonymizing yourself (Score:2, Informative)
Tor isn't great for high bandwidth connections, but I think it's just perfect to make sure all of those do-gooder large corporations don't get a choice about anonymizing IP addresses.
http://www.torproject.org/ [torproject.org]
Re: (Score:2)
Re:Just out of interest (Score:5, Insightful)
Salting goes without saying -1 uninsightful
I'm talking about the fact that it's 2008, and that search space could be exhaustively searched in a matter of hours on a desktop machine.
As the poster below me points out, "throw away the salt" is an answer, but it means the logs can only be compared to other logs in the time frame that you were using that salt.
Maybe IPv6 will make anonymized logs more feasible because of the 2^128 search space.
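To make the "throw away the salt" trade-off concrete, here is a minimal sketch using a keyed hash (HMAC-SHA256) with a random key that is discarded on rotation. The class name and the 16-character truncation are arbitrary choices for illustration, not anything a real log pipeline is known to use.

```python
import hashlib
import hmac
import secrets

class RotatingPseudonymizer:
    """Maps IPs to pseudonyms under a secret key that is thrown away on rotation."""

    def __init__(self):
        self._key = secrets.token_bytes(32)

    def rotate(self):
        # The old key is overwritten and never stored anywhere else.
        self._key = secrets.token_bytes(32)

    def pseudonym(self, ip):
        digest = hmac.new(self._key, ip.encode(), hashlib.sha256).hexdigest()
        return digest[:16]
```

Within one key period the same address always gets the same pseudonym, so logs from that period can be correlated. After `rotate()` the mapping changes, and without the old key the 2^32 brute force no longer works. That is exactly the limitation mentioned above: old and new logs can no longer be compared.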
Re: (Score:2)
How do you Anonymize IP logs?
By using Scroogle [scroogle.org].
Note to mods:
I got my karma for this post here [slashdot.org], don't mod me up again for the same information <grin>.
What about Chrome? (Score:2)
I'd like to know if they also commit to anonymizing the client ID that is associated with every Chrome installation and the associated history tied to your account. After all, what's the point of anonymizing the IP data if your Chrome installation is tracking everything anyway? The same company would hold all the same information.
Re: (Score:2)
I should have been way more specific here.
What I'm saying is that IP (v4) addresses are uniquely problematic for being pseudonymized from the perspective of a web master, because of the tiny search space.
You wouldn't choose a 10 digit only password would you?
Say the threat model here is you are running a website and you get subpoenaed.
It would be great to be able to say, "OK here is a list of hashes of IP addresses, that's all I've got, have fun." ...but you can't do that for the reason I said above. If you
Re: (Score:3, Funny)
I'm hypertensive, you insensitive clod.
Re: (Score:3, Insightful)
So I generate a table with 2^32 IP addresses and their MD5 with themselves as the salt. It doesn't enlarge the search space in this situation, and I can then easily do a binary search to find what the original IP was.
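The attack described here is easy to sketch. The version below builds the table for a single network instead of all 2^32 addresses so it runs in a moment; "salted with itself" is assumed to mean hashing the address concatenated with itself, and the function names are made up.

```python
import bisect
import hashlib
import ipaddress

def salted_md5(ip):
    """MD5 of the address 'salted' with itself (assumed: simple concatenation)."""
    return hashlib.md5((ip + ip).encode()).hexdigest()

def build_table(cidr):
    """Sorted list of (digest, ip) pairs for every address in the network."""
    return sorted((salted_md5(str(a)), str(a)) for a in ipaddress.ip_network(cidr))

def reverse(table, digest):
    """Binary-search the table for a digest; return the IP, or None if absent."""
    i = bisect.bisect_left(table, (digest, ""))
    if i < len(table) and table[i][0] == digest:
        return table[i][1]
    return None
```

Building the table for something like `123.45.0.0/16` takes well under a second on a desktop. The full IPv4 space is 65,536 times that amount of work, which is large but entirely feasible, and a salt that can be derived from the input adds nothing.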
Re: (Score:2)
The summary reads "...Google claim they need to retain..." The use of *claim* rather than *claims* suggests that Google is being viewed as something other than a single entity. Am I missing something or was that just a typo?
I'm sure their partners will retain the good stuff.
Re: (Score:2)
British English, dude. At some registers, collectives are viewed as plural, not singular.
Search BBC stories for "Microsoft are" and such. (Whether that somewhat informal register should be used in BBC pieces is another topic entirely...)
Re: (Score:2)
Yes, I was treating Google as a collective noun [wikipedia.org], and yes, I'm British.
(I submitted the article. Amusingly, I appear to have anonymised myself while doing so...)
Re:9 Months (Score:5, Insightful)
considering the amount of data Google processes on a regular basis, a 9 Month backlog isn't that unreasonable.
i'm more concerned about Google not handing my data over to 3rd parties or governments than their retaining records of my searches. as long as they're willing to stand up for the rights of users, they can hold my search data for as long as they need to improve search results, reduce spam, and develop personalized search features.
Re: (Score:3, Insightful)
considering the amount of data Google processes on a regular basis, a 9 Month backlog isn't that unreasonable.
Sure it is. Why? Because they are collecting data continuously and if it takes a long time to process what they've collected, more data is backlogged, and it keeps spiraling out of control. In fact, if it takes more than 24 hours to process 1 day of data, the backlog will increase without limit. The proper thing to do is to apply proper anonymization to the information immediately so you don't
Re:9 Months (Score:5, Insightful)
first off, Google's processing capacity isn't static, it's constantly growing. just because it takes more than 24 hrs to process a certain set of data doesn't mean that the backlog will increase without limit. that isn't a logically sound argument.
if you take that argument and reduce the time frame from 1 day to 1 hour->1 minute->1 millisecond... so on and so forth, you reach the conclusion that if Google is unable to instantaneously process/analyze every piece of data the exact moment it is received or created, then their backlog will increase without limit.
sometimes data needs to be accumulated before it can be processed. for instance, to observe search trends, or to compare e-mails for spam analysis, etc. sometimes logs need to be kept for extended periods of time--that's why they're called logs--or data is retained for repeat analysis.
i don't know what exactly Google retains user data for or what kind of analysis they do, but it's understandable if some data needs to be retained in its original state for certain types of research or analysis. if they were going to release network measurement data to 3rd parties, as that paper you linked to discusses, then, yes, i would expect Google to follow their own anonymization guidelines. but like they've stated in their press release, it's all about finding a balance between protecting user privacy and improving the quality of their services.
perhaps the best thing to do is to give users the option to have their search requests retained for improving personalized search results, and let them enable/disable this feature as it suits them. all other data will simply be processed for a set period of time and then expunged.
if they're not releasing server logs to anyone, anonymization isn't really necessary. though i'm sure they allow users to access their services through anonymous proxies.
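The two sides of the backlog argument above can be put in a few lines. This is a toy model with made-up numbers (units of data per day), assuming arrivals stay constant, which neither poster actually claimed.

```python
def backlog_after(days, arrivals_per_day, capacity_per_day, capacity_growth=0.0):
    """Simulate the unprocessed backlog after a number of days.

    capacity_growth is the fractional daily increase in processing capacity.
    """
    backlog = 0.0
    capacity = capacity_per_day
    for _ in range(days):
        # Whatever can't be processed today carries over; backlog can't go negative.
        backlog = max(0.0, backlog + arrivals_per_day - capacity)
        capacity *= 1.0 + capacity_growth
    return backlog
```

With fixed capacity below the arrival rate, for example `backlog_after(10, 100, 90)`, the backlog grows by the shortfall every day, which is the parent's point. With capacity growing over time, for example `backlog_after(365, 100, 90, 0.01)`, it peaks and then drains, which is the reply's point. Which case applies depends on numbers nobody in this thread has.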
Re: (Score:2, Informative)