
You're Very Easy To Track Down, Even When Your Data Has Been 'Anonymized' (technologyreview.com) 53

An anonymous reader quotes a report from MIT Technology Review: The most common way public agencies protect our identities is anonymization. This involves stripping out obviously identifiable things such as names, phone numbers, email addresses, and so on. Data sets are also altered to be less precise, columns in spreadsheets are removed, and "noise" is introduced to the data. Privacy policies reassure us that this means there's no risk we could be tracked down in the database. However, a new study in Nature Communications suggests this is far from the case. Researchers from Imperial College London and the University of Louvain have created a machine-learning model that estimates exactly how easy individuals are to reidentify from an anonymized data set. You can check your own score here, by entering your zip code, gender, and date of birth.

On average, in the U.S., using those three records, you could be correctly located in an "anonymized" database 81% of the time. Given 15 demographic attributes of someone living in Massachusetts, there's a 99.98% chance you could find that person in any anonymized database. The tool was created by assembling a database of 210 different data sets from five sources, including the U.S. Census. The researchers fed this data into a machine-learning model, which learned which combinations are more nearly unique and which are less so, and then assigns the probability of correct identification.

This discussion has been archived. No new comments can be posted.

  • Yeah, I'm going to enter my zip code, gender, and date of birth right away!

    • The result is bullcrap.

      I entered my info, and it said a person matching my DOB, gender, and zipcode has a 73% chance of being me.

      How does that make any sense at all? If there is no other person with the same DOB/gender/zipcode, then it is certain to be me. If there is one other person, then it is 50%. Otherwise, it is even less. It is mathematically impossible for it to be 73%.
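A value between 50% and 100% can arise if the tool reports an average over cases rather than a count in one known population. The sketch below is only an illustration of that idea: it assumes, hypothetically, that the number of other people sharing your attributes follows a Poisson distribution with mean `lam`, and averages the 1/(1+k) chance over that uncertainty. The Poisson assumption, the function name, and the value of `lam` are all made up for the example and are not taken from the study.

```python
import math

def expected_match_probability(lam, max_k=200):
    """Average of 1/(1+k) when k, the number of OTHER people sharing
    the attributes, is Poisson(lam). The Poisson model is an assumption
    for illustration only."""
    total = 0.0
    pmf = math.exp(-lam)          # P(K = 0)
    for k in range(max_k + 1):
        total += pmf / (1 + k)    # if k others match, a match is you 1/(1+k) of the time
        pmf *= lam / (k + 1)      # P(K = k+1) from P(K = k)
    return total

lam = 0.65  # hypothetical mean number of other matching people
estimate = expected_match_probability(lam)
closed_form = (1 - math.exp(-lam)) / lam  # known closed form of the same sum
print(f"{estimate:.3f}")
```

With these assumed numbers the average lands between 0.5 and 1 even though every individual case is exactly 1, 1/2, 1/3, and so on.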

  • Sheesh, at least put an offset on the Date Of Birth. We always added/subtracted a random number from 1 to 30 days.
    • by Kjella ( 173770 ) on Wednesday July 24, 2019 @09:25PM (#58982562) Homepage

      Sheesh, at least put an offset on the Date Of Birth. We always added/subtracted a random number from 1 to 30 days.

      It's easy when you have just one date variable. Once you have two or more it gets gnarly. Let's say you have a start and end date of something. Do you fudge them by the same amount? That's basically the same as not fudging the interval: if it was 279 days before, it's 279 days now. If you want to fudge the interval too, you have to make sure the constraints still make sense, such as the start date falling before the end date, along with all the dates that should fall inside that interval. Making properly anonymous data sets when you have to assume the person has some of the data but isn't supposed to glean the rest is extremely painful.
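The trade-off described above can be sketched in a few lines. This is a minimal illustration, not anyone's production method: the function names, the 30-day bound, and the retry-until-valid strategy are assumptions chosen for the example.

```python
import random
from datetime import date, timedelta

def random_offset(rng, max_days=30):
    # Add or subtract a random number of days from 1 to max_days.
    return timedelta(days=rng.choice([-1, 1]) * rng.randint(1, max_days))

def shift_same(start, end, rng):
    # One shared offset: the dates move, but the interval is unchanged.
    offset = random_offset(rng)
    return start + offset, end + offset

def shift_independent(start, end, rng):
    # Separate offsets blur the interval, so the ordering constraint
    # (start on or before end) has to be re-checked; retry until it holds.
    while True:
        s, e = start + random_offset(rng), end + random_offset(rng)
        if s <= e:
            return s, e

rng = random.Random(0)
start = date(2019, 1, 1)
end = start + timedelta(days=279)

s1, e1 = shift_same(start, end, rng)
s2, e2 = shift_independent(start, end, rng)
print((e1 - s1).days, (e2 - s2).days)
```

The shared offset leaves the 279-day interval intact, which is exactly the leak described; independent offsets hide it but need the extra constraint check, and every additional date that must fall inside the interval adds another check.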

      My experience with this has been with healthcare data, and it's problematic because some parts of your medical history are fairly obvious: if you broke your leg and were limping around with a cast for weeks, everybody knows. Other things you may have told people in confidence. Neither group is supposed to gain any more information about you from an anonymous set. This is particularly tricky with researchers who have worked with individual cases: we're supposed to deliver medical data for research without them recognizing individuals, but if you make it a total blur then the data is also worthless.

    • In a past life I was given an anonymized dataset - the DOB was clear, as were all the name and address fields. But not "Title". You can identify a few people from their title alone (typically containing strings like "His Excellency the").

  • Zip, gender and DOB are unique for like 80% of the population. That's been known for years (and was reported on /. years ago)

    • Zip, gender and DOB are unique for like 80% of the population. That's been known for years (and was reported on /. years ago)

      The average zipcode has 7800 people. DOB has roughly 365 days/year * 70 years, and gender doubles the possibilities, so that is about 51,000 combinations. So, yeah, most would be unique.
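That back-of-the-envelope estimate can be worked through directly. The sketch below assumes birth dates and genders are spread uniformly and independently across the zipcode, which is an idealization; the variable names are made up for the example.

```python
import math

population = 7800        # people in an average zipcode (figure from the comment)
combos = 365 * 70 * 2    # birth dates over ~70 years, times two genders

# Chance that none of the other residents shares your exact combination,
# assuming each person's combination is uniform and independent.
p_unique = (1 - 1 / combos) ** (population - 1)

# Standard exponential approximation of the same quantity.
approx = math.exp(-(population - 1) / combos)

print(combos, round(p_unique, 3))
```

Under those assumptions roughly 86% of residents would have a combination nobody else in the zipcode shares, which is in the same range as the figures quoted in the thread.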

      • I tried the test giving a real UK postcode (not mine) and a false but realistic DoB. In fact it would not allow me to put in the whole postcode which in the UK narrows it to typically half a street (say 50 people), only the broad first part which was for about a fifth of a major city.

        It then said my postcode does not exist but nevertheless it gave me an estimated chance of NaN% - WTF does that mean? Percentage of sodium nitride? Although they told me it is more than 73%. Here is what it said :-

        in an “anonymous” health dataset, that person would be you NaN% of the time! Other people have, on average, 79% chance of being correctly re-identified, making you much more unique than the rest of the UK population.

        All this

    • It's called the Birthday Paradox. https://en.wikipedia.org/wiki/... [wikipedia.org]
  • by Seven Spirals ( 4924941 ) on Wednesday July 24, 2019 @08:25PM (#58982238)
    I've hated you for over a decade, Coward. Now we may finally know who you are. *polishes shotgun* :-)
  • My Result:

    "53% of the time!
    If your employer or neighbor finds someone matching your date of birth (xx/xx/xxxx), gender (X), and ZIP code (X) in an “anonymous” health dataset, that person would be you 53% of the time! Other people have, on average, 83% chance of being correctly re-identified, making you much less unique than the rest of the US population."

    I'm "much less unique"? Should I be insulted?

    • by Anonymous Coward

      My Zip contains around 8000 people. Birth date narrows to around 22 candidates, assuming uniform distribution. Age and sex are then nearly sufficient to identify, especially given that my age is 65+.

    • by q4Fry ( 1322209 )

      I confess I do not understand how it could be any number 50% < X < 100%. Obviously, 50% would mean that two people share the same three data points. 100% would mean just you. How can you get part of a person who matches the criteria? Don't make me RTFA.

  • From the test:

    and ZIP code (xxxxx) in an “anonymous” health dataset, that person would be you 67% of the time!

    Our state used to regulate health insurance and providers, mandating uniform coverage across the state. Therefore, it was unnecessary for doctors and insurers to break down patients to a per zip code basis. My insurer had no idea where I lived (other than in the little PO Box to which they sent correspondence). Once the ACA went into effect, it superseded state regulations and allowed businesses to demand a home address.

    I fooled them. When they sent me a few notices demanding my home address,

  • Hey, I am with you bro :) https://www.myquickcents.com/ [myquickcents.com]
  • They probably should update their database to include all the zip codes, as it says mine does not exist, when in fact it most certainly has existed for over 6 years. It took many of the banks and credit card processors only a year or two to figure that one out.
