You're Very Easy To Track Down, Even When Your Data Has Been 'Anonymized' (technologyreview.com)
An anonymous reader quotes a report from MIT Technology Review: The most common way public agencies protect our identities is anonymization. This involves stripping out obviously identifiable things such as names, phone numbers, email addresses, and so on. Data sets are also altered to be less precise, columns in spreadsheets are removed, and "noise" is introduced to the data. Privacy policies reassure us that this means there's no risk we could be tracked down in the database. However, a new study in Nature Communications suggests this is far from the case. Researchers from Imperial College London and the University of Louvain have created a machine-learning model that estimates exactly how easy individuals are to reidentify from an anonymized data set. You can check your own score here, by entering your zip code, gender, and date of birth.
On average, in the U.S., using those three records, you could be correctly located in an "anonymized" database 81% of the time. Given 15 demographic attributes of someone living in Massachusetts, there's a 99.98% chance you could find that person in any anonymized database. The tool was created by assembling a database of 210 different data sets from five sources, including the U.S. Census. The researchers fed this data into a machine-learning model, which learned which combinations are more nearly unique and which are less so, and then assigns the probability of correct identification.
I do it right away! (Score:1)
Yeah, I'm going to enter my zip code, gender, and date of birth right away!
Re: (Score:3)
The result is bullcrap.
I entered my info, and it said a person matching my DOB, gender, and zipcode has a 73% chance of being me.
How does that make any sense at all? If there is no other person with the same DOB/gender/zipcode, then it is certain to be me. If there is one other person, then it is 50%. Otherwise, it is even less. It is mathematically impossible for it to be 73%.
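One way a figure like 73% could arise, sketched under an assumption the thread does not confirm: the tool may not know how many people share your attributes, so it could report an expectation of 1/k over an uncertain head count k. The distribution below is hypothetical and is not taken from the study.

```python
from fractions import Fraction

def expected_match_probability(count_distribution):
    """Expected chance that a matching record is you.

    count_distribution maps k (people sharing your attributes, including
    you) to the probability that k is the true count. Hypothetical input,
    not the study's model.
    """
    if sum(count_distribution.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    # If k people match, a matching record is you with probability 1/k.
    return sum(p * Fraction(1, k) for k, p in count_distribution.items())

# Hypothetical belief: 50% chance you are alone, 40% chance of one other
# matching person, 10% chance of two others.
belief = {1: Fraction(1, 2), 2: Fraction(2, 5), 3: Fraction(1, 10)}
score = expected_match_probability(belief)
print(score, float(score))
```

With these made-up weights the score is 11/15, or about 73%, even though every individual scenario gives 100%, 50%, or 33%.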
Not really anonymized (Score:2)
Re:Not really anonymized (Score:4)
Sheesh, at least put an offset on the Date Of Birth. We always added/subtracted a random number from 1 to 30 days.
It's easy when you have just one date variable. Once you have two or more it gets gnarly. Say you have a start and end date of something: do you fudge them by the same amount? That's basically the same as not fudging the interval; if it was 279 days before, it's 279 days now. If you want to fudge that too, you have to make sure the constraints still make sense, like the start date falling before the end date and every date that should be inside that interval staying inside it. Making properly anonymous data sets when you have to assume the person has some of the data but isn't supposed to glean the rest is extremely painful.
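A minimal sketch of that trade-off, assuming the 1-to-30-day offset mentioned in the parent comment. The function names and the retry strategy are illustrative assumptions, not anyone's production method.

```python
import random
from datetime import date, timedelta

def random_offset(rng, max_days=30):
    """Random shift of 1..max_days days, forward or backward."""
    return timedelta(days=rng.choice([-1, 1]) * rng.randint(1, max_days))

def shift_together(start, end, rng):
    """Shift both dates by one shared offset; the interval is unchanged."""
    offset = random_offset(rng)
    return start + offset, end + offset

def shift_independently(start, end, rng):
    """Shift each date separately, retrying until start <= end still holds."""
    if start > end:
        raise ValueError("start must not be after end")
    while True:
        new_start = start + random_offset(rng)
        new_end = end + random_offset(rng)
        if new_start <= new_end:
            return new_start, new_end

rng = random.Random(0)
start, end = date(2019, 1, 1), date(2019, 10, 7)  # 279 days apart
s1, e1 = shift_together(start, end, rng)
s2, e2 = shift_independently(start, end, rng)
print((e1 - s1).days)  # 279: the interval leaks through unchanged
print((e2 - s2).days)  # somewhere between 219 and 339
```

The independent version hides the exact interval but needs the ordering check, and every extra date inside the interval would need its own constraint on top of that.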
My experience with this has been with healthcare data, and it's problematic because some parts of your medical history are fairly obvious: if you broke your leg and were limping around with a cast for weeks, everybody knows. Other things you may have told people in confidence. In neither case are they supposed to gain any more information about you from an anonymous set. This is particularly tricky with researchers who have worked with individual cases: we're supposed to deliver medical data for research without them recognizing individuals, but if you make it a total blur then the data is also worthless.
Re: (Score:1)
In a past life I was given an anonymized dataset. The DOB was cleared, as were all the name and address fields, but not "Title". You can identify a few people from their title alone (typically containing strings like "His Excellency the").
Been known for a while (Score:2)
Zip, gender and DOB are unique for like 80% of the population. That's been known for years (and was reported on /. years ago)
Re: (Score:2)
Zip, gender and DOB are unique for like 80% of the population. That's been known for years (and was reported on /. years ago)
The average zipcode has 7800 people. DOB has roughly 365 days/year * 70 years, and gender doubles the possibilities, so that is over 40,000. So, yeah, most would be unique.
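The arithmetic above can be turned into a rough uniqueness estimate. This is a back-of-the-envelope sketch that assumes people are spread uniformly and independently over the combinations, which real populations are not.

```python
people_per_zip = 7800   # figure quoted in the comment above
days_per_year = 365
birth_years = 70
genders = 2

# Number of distinct (date of birth, gender) combinations within one zip code.
combinations = days_per_year * birth_years * genders

# Assumption: uniform, independent assignment. A given person is unique
# when none of the other people in the zip code land on the same combination.
p_unique = (1 - 1 / combinations) ** (people_per_zip - 1)

print(combinations)
print(round(p_unique, 2))
```

Under those assumptions there are 51,100 combinations and a given person is unique roughly 86% of the time, which is in the same range as the figures quoted in this thread.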
Gender: M, F, or X (Score:2)
Gender is fairly evenly distributed.
At least it used to be, and I'm not even talking about sex-selective abortion in one-child jurisdictions [c-fam.org] or female infanticide in cultures that expect parents of a daughter to pay a dowry [theguardian.com].
In January 2019, New York City began to allow a third gender on birth certificates. The "X" for nonbinary can mean one of two things: the child was observed to be intersex (that is, with unclear karyotype or genitalia) or the person transitioned to nonbinary later in life. New York City follows Oregon, DC, the State of Was
Re: (Score:2)
It then said my postcode does not exist but nevertheless it gave me an estimated chance of NaN% - WTF does that mean? Percentage of sodium nitride? Although they told me it is more than 73%. Here is what it said
in an “anonymous” health dataset, that person would be you NaN%
of the time!
Other people have, on average, 79%
chance of being correctly re-identified, making you much more unique than the rest of the UK population.
All this
Re: (Score:1)
Re: (Score:2)
No, it's not called the Birthday Paradox. That's a completely different problem that illustrates completely different points.
Mr. Anonymous Coward, they are coming for you! (Score:4, Funny)
Re: (Score:3, Funny)
*polishes shotgun*
Is that what you kids are calling it these days?
Re: Mr. Anonymous Coward, they are coming for you! (Score:1)
Hah! You want my name? Try "Legion".
Mr. Anonymous Coward is like Florida Man (Score:2)
Re: (Score:2)
"...much less unique." (Score:2)
My Result:
"53% of the time!
If your employer or neighbor finds someone matching your date of birth (xx/xx/xxxx), gender (X), and ZIP code (X) in an “anonymous” health dataset, that person would be you 53%
of the time!
Other people have, on average, 83%
chance of being correctly re-identified, making you much less unique than the rest of the US population."
I'm "much less unique"? Should I be insulted?
Re: "...much less unique." (Score:1)
My Zip contains around 8000 people. Birth date narrows to around 22 candidates, assuming uniform distribution. Age and sex are then nearly sufficient to identify, especially given that my age is 65+.
Re: (Score:2)
... BUT if you add in the commenter's gender, that drops to 11. And then, that's assuming even distribution across ages, which is unlikely. Given that he's basically right on the average life expectancy (75 years) we can cut that number down again. There are probably 3-7 people in his Zip code that match his most basic identifying characteristics. If you throw in one more thing (like car make), he's easily identifiable.
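The first two narrowing steps in this sub-thread are simple division. A quick sketch, assuming birthdays are spread evenly over the year and genders split evenly:

```python
zip_population = 8000   # from the parent comment
days_per_year = 365

# People in the zip code sharing the same day and month of birth, any year.
same_birthday = zip_population / days_per_year

# Assumes an even gender split.
same_birthday_and_gender = same_birthday / 2

print(round(same_birthday), round(same_birthday_and_gender))
```

The further cut to 3-7 people depends on the local age distribution, which the thread does not give, so it is not computed here.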
Re: (Score:2)
I confess I do not understand how it could be any number 50% X 100%. Obviously, 50% would mean that two people share the same three data points. 100% would mean just you. How can you get part of a person who matches the criteria? Don't make me RTFA.
Re: (Score:2)
Slashdot, $deity curse your inability to display 50% < X < 100%.
Thank you ObamaCare! (Score:2)
From the test:
and ZIP code (xxxxx) in an “anonymous” health dataset, that person would be you 67% of the time!
Our state used to regulate health insurance and providers, mandating uniform coverage across the state. Therefore, it was unnecessary for doctors and insurers to break down patients to a per zip code basis. My insurer had no idea where I lived (other than in the little PO Box to which they sent correspondence). Once the ACA went into effect, it superseded state regulations and allowed businesses to demand a home address.
I fooled them. When they sent me a few notices demanding my home address,
My love (Score:1)
ZIP Code Doesn't Exist (Score:1)
They probably should update their database to include all the zip codes, as it says mine does not exist, when in fact it most certainly has existed for over 6 years. It took many of the banks and credit card processors only a year or two to figure that one out.