IBM's New Differential Privacy Library Works With Just a Single Line of Code (ibm.com)

Friday IBM Research updated its open source "IBM Differential Privacy Library," a suite of new lightweight tools offering "an array of functionality to extract insight and knowledge from data with robust privacy guarantees."

"Most tasks can be run with only a single line of code," brags a new blog post (shared by Slashdot reader IBMResearch), explaining how it works: This year for the first time in its 230-year history the U.S. Census will use differential privacy to keep the responses of its citizens confidential when the data is made available. But how does it work? Differential privacy uses mathematical noise to preserve individuals' privacy and confidentiality while allowing population statistics to be observed.

This concept has a natural extension to machine learning, where we can protect models against privacy attacks, while maintaining overall accuracy. For example, if you want to know my age (32) I can pick a random number out of a hat, say ±7 — you will only learn that I could be between 25 and 39. I've added a little bit of noise to the data to protect my age and the US Census will do something similar.

While the US government built its own differential privacy tool, IBM has been working on its own open source version, and today we are publishing our latest release, v0.3. The IBM Differential Privacy Library boasts a suite of tools for machine learning and data analytics tasks, all with built-in privacy guarantees. Our library is unique among others in giving scientists and developers access to lightweight, user-friendly tools for data analytics and machine learning in a familiar environment... What also sets our library apart is that our machine learning functionality enables organisations to publish and share their data with rigorous guarantees on user privacy like never before...

Also included is a collection of fundamental tools for data exploration and analytics. All the details for getting started with the library can be found at IBM's Github repository.
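For a sense of what that "single line" looks like in practice, here is a minimal sketch in the spirit of the examples in the project's README: a differentially private naive Bayes classifier slotted into an ordinary scikit-learn workflow. Class and parameter names follow the README around the v0.3 release and may have changed since, so treat this as illustrative rather than canonical.

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from diffprivlib.models import GaussianNB  # differentially private drop-in for sklearn's GaussianNB

    # Ordinary scikit-learn data handling
    X, y = datasets.load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # The advertised "one line": train a model with a built-in privacy budget
    clf = GaussianNB(epsilon=1.0)
    clf.fit(X_train, y_train)

    print(clf.score(X_test, y_test))

A practical caveat with any differential privacy toolkit: feature bounds should be chosen independently of the data, since inferring them from the data itself leaks information.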


IBM's New Differential Privacy Library Works With Just a Single Line of Code

  • Lines of code is not a metric. I can condense a 50k line cpp program into one line.
    • Damn, came here to say this, my early obfuscation technique involved blocks of code delimited by semicolons, not Chr$(13)+Chr$(10) (Amstrad heritage!)

      • While agreeing that lines of code is not always a useful metric, it can have a large effect. "rm -rf /" for example, is a single line of shell code.

    • by AmiMoJo ( 196126 )

      I think the implication was that you can add a single (presumably not excessively long) line of code to your app and it just works.

    • by hey! ( 33014 )

      If you want to be pedantic, you need to start with what exactly is being accomplished here.

      About fifteen years ago cryptographers mathematically proved something that I think most people who have had experience with handling sensitive data probably suspected: given unlimited query access to a database it is possible to recover sensitive information about individuals that the database design and security policies are supposed to hide.

      The only way to prevent this is to add statistical noise in a way that (a)

    • It is actually about 50k LOC, too.

      You can't even import and call it with just one line.

      The only thing that is one line is calling the method; same as any other API where you're only using one method.

    • by Pimpy ( 143938 )

      No, but apparently reading isn't either, or you would realize what an idiotic comment this was. They have a library that drops in alongside scikit-learn and numpy which allows one to e.g. create a differential view of a test/train data split using a one line wrapper. That's pretty useful for people who need to expose user data to third parties via an ML model, without having to go through the steps of minimizing the dataset/dataframe in advance, or for people that want to experiment with differential privac

  • ... to let your privacy evaporate into nothing if the code that is supposed to obscure your personal data is provided by some profit greedy mega-corporation that has every incentive to sell your data to the highest bidder.

    The only data that is ever secure is the data that has not been entered, measured, stored or transferred at all.
    • by AmiMoJo ( 196126 )

      The method they describe in the summary only works if you have a single data point that is rarely updated, like the census. The age brackets are +/- 5 years, census is every 10 years, no problem.

      Thing is there are lots of other sources of data about you. Say this year you are in the 35-44 age bracket. Next year you apply for something and you are in the 45-54 bracket. Combining those two data points we know you are 45 years old.
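      A toy sketch of that linkage (hypothetical brackets, nothing to do with IBM's library): two independently published brackets for the same person, one year apart, already pin the age down exactly.

      # Two published age brackets for the same person, one year apart
      this_year = range(35, 45)   # reported bracket now: 35-44
      next_year = range(45, 55)   # reported bracket a year later: 45-54

      # Ages consistent with both reports: in this_year now AND in next_year after turning one year older
      candidates = [a for a in this_year if a + 1 in next_year]
      print(candidates)           # [44] -> 44 now, 45 by next year's report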

    • IBM has no incentive to sell your data, they're not an advertising company and they haven't taken any steps to actually collect anybody's data on a large scale.

      You know you hate The Man. OK. But now consider: What if there is more than one Man?

    • It's true that the only way to fully preserve privacy is not to release data, but differential privacy is the best compromise we have. Having a way to extract knowledge from data with mathematical guarantees on individuals' privacy is important. The census is a good example, because the Census is required by law to collect data, and making it available delivers various benefits to the population.
  • It sounds like they are storing data with noise in a way that generates the same outputs in summation as the original data, but has been altered. Like how a picture with noise could generate the same blurred image as the original. However, reconstructing the most likely original image given an information-reduced version is what the new AIs do so well; see http://thispersondoesnotexist.com
    What if my data is the most likely to an AI?

    • This is a more direct example of what I'm talking about:
      https://blogs.nvidia.com/blog/... [nvidia.com]

    • Differential privacy provides a mathematical guarantee, controlled by the parameter epsilon, on how much can be learned about any individual from the released results. You pick epsilon, the math tells you how much noise you need to add -- and how many data points you need to extract useful information. AI may often do interesting things, sometimes things that humans can't do, but it's not magic and can't change mathematics.
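      As a rough illustration of that trade-off, here is a minimal numpy sketch of the standard Laplace mechanism with made-up numbers (not code from IBM's library): the noise scale grows as epsilon shrinks, and the noisy mean is only useful once there are enough records to drown the noise out.

      import numpy as np

      rng = np.random.default_rng(0)
      true_ages = rng.integers(18, 90, size=10_000)  # a synthetic population

      def dp_mean(values, epsilon, lo=18, hi=90):
          # Laplace-mechanism mean: clip to [lo, hi], then add noise of scale sensitivity/epsilon
          clipped = np.clip(values, lo, hi)
          sensitivity = (hi - lo) / len(clipped)  # one record can move the mean by at most this much
          return clipped.mean() + rng.laplace(0.0, sensitivity / epsilon)

      for eps in (0.01, 0.1, 1.0):
          print(eps, round(dp_mean(true_ages, eps), 2), round(true_ages.mean(), 2))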
      • AI may often do interesting things, sometimes things that humans can't do, but it's not magic and can't change mathematics.

        "Can't change mathematics?" Maybe not AI, but Fortran 66 sure could. Now how did it go, something like this:


        INTEGER I
        CALL SUB(1)
        I = 1 + 1
        WRITE (6,*) I
        END

        SUBROUTINE SUB(N)
        INTEGER N
        N = 2
        RETURN
        END

        ... would give the precise and perfectly accurate answer of: 4.

        Why? Because Fortran was call by reference, so the subroutine call pointed to where the value for 1 had been previously stored, not to a copy of it. The subroutine changed the value of that place to 2. Back in the m

        • What happens if somebody explains to you that math is an abstraction, and Fortran 66 is not an abstraction but a machine made of switches? Would you even notice the lack of agreement in terms?

        • by sjames ( 1099 )

          Alas, I've already posted and there is no moderation option for interesting AND funny, so this virtual mod will have to do :-)

      • by sjames ( 1099 )

        But multiple data sets independently obfuscated can. For example, let's say your date of birth is obfuscated in multiple databases. As I add more such datasets, your obfuscated date of birth will cluster around a small area, the more datasets, the more I can constrain that area.

        It's not nothing, but it's also not a silver bullet.
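        A minimal sketch of that effect, assuming each dataset independently adds fresh Laplace noise to the same true value (this is the sequential-composition case, where the effective privacy budget grows with the number of releases):

        import numpy as np

        rng = np.random.default_rng(1)
        true_age, epsilon, sensitivity = 32, 0.5, 1  # hypothetical person and per-release budget

        for k in (1, 10, 100):
            # k databases, each publishing the same age with its own independent noise draw
            releases = true_age + rng.laplace(0.0, sensitivity / epsilon, size=k)
            print(k, round(releases.mean(), 1))  # the average closes in on 32 as k grows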

        • But multiple data sets independently obfuscated can. For example, let's say your date of birth is obfuscated in multiple databases. As I add more such datasets, your obfuscated date of birth will cluster around a small area, the more datasets, the more I can constrain that area.

          Cite? The papers on differential privacy that I've read appear to prove that combining sets with differential privacy of epsilon cannot produce a data set with differential privacy >epsilon. I'll admit that I haven't read them carefully and worked through the proofs myself, and that I'm going on memory from reading that I did a year or two ago, so it's possible that I'm mistaken... but I don't think so.

          • by sjames ( 1099 )

            It's a matter of mathematics. Plot your age with error bars from multiple datasets. Your actual age will be the intersection of the error bars.

            You may be thinking of the impossibility of teasing out a particular persons information from multiple statistical summations of multiple datasets.

            • So... you don't actually know anything about the mathematics of differential privacy. Got it. (Hint: The whole point is that it's not possible to determine whether a given individual's data is even in the data set, so there's no way to plot the "error bars" for a given individual).
  • The only line of code you need.
  • Looking at the code - definitely has that IBM feel to it.

    Oh - not that it was written by the same folks or something - but that same culture seems to have kept a lot of that same fashion sense, I suppose - something like a love of knots mixed with a love of old-latin-themed math symbols, and sparsely explained documentation trees.

    Microsoft has the same sort of ideals lingering in its code - but it's most seen in its godawful spartan documentation sections.

    Like, it's all a knot you have to untie, in order to

  • How likely am I to get a job if the data comes back that I'm 95-100 years old?
    • As a COBOL programmer? It may be considered a requirement.

    • The broad goal of differential privacy is to preserve the privacy of individuals while allowing population statistics to be accurately observable. So in the context of adding noise to ages, your own age is obfuscated (to prevent things like linkage attacks), but population statistics can still be computed accurately (i.e., the average age of people in the dataset). This concept extends to much more complex "queries", like training a machine learning model.
  • for government form purposes my legal name is between "Nkfsdjhflh" and "Wkjfhulsdj"
  • Uhm, isn't that called a Function or Subroutine in most programming languages? That reduces complex things into one line of code.

  • So if the guarantee is violated, have they specified by name who goes to prison? Everything guaranteed has been violated at some point.
  • Taking only the description: since the supposed noise is used to get an identical positive and negative range around the actual value, the mean of the limits is the original value. I'm 32, I choose 7 as the "noise" and my age is in the range 25 to 39. So, 25 + 39 = 64, and 64 / 2 = 32, my age. Here's hoping the actual noise doesn't always leave the original value at the exact center of the range.
    • So, I enter my age and, e.g., a "privacy" element. When performing a query, my entry has noise applied based on the privacy property I supplied, so that my appearance will fall randomly across results that would show rows with ages between 25 and 39. But all other ages entered have their own noise, such that the set that appears in each of two results (that someone expects to differ by one entry such that the difference of the results gives my age) is actually unknown, and as more result pairs are compared the prob
  • It sounds to me like they are just encrypting the data, and we know what is encrypted can be decrypted many times.
    • Differential privacy (and data privacy, more generally) is commonly mistaken as a sub-topic of security or cryptography, but the two fields have different purposes. With security, you are limiting access to data to authorised people only (i.e. someone with a key or password to access the data). With differential privacy and data privacy, you are looking to publish (statistics of) sensitive data to the public without revealing personal information about the individuals in the dataset. The mathematical guaran
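      (For reference, the formal guarantee being described is the textbook definition of epsilon-differential privacy, not anything specific to IBM's library: a randomised mechanism M is epsilon-differentially private if, for any two datasets D and D' differing in one person's record, and any set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S]. The smaller epsilon is, the less the output can reveal about whether any one person's data was included.)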
  • Perhaps on the color range (as in: indexed colors), or the normal type.

    Now I wonder whether you can still get "the picture", aka leak privacy, if it only does the former, like a GIF, instead of the latter, like a low-quality JPEG.

    Still, ... sounds reasonable... Just add noise that does not distort the statistics, if the precision is still good enough.
    I don't know why I would need a library for that though ...

  • We've seen this movie before. APL, from the 60s, another brilliant IBM invention iirc, was famous for being able to do almost anything in a single line of code, a line of code that no one could understand or debug.
