IBM's New Differential Privacy Library Works With Just a Single Line of Code (ibm.com) 45
Friday IBM Research updated their open source "IBM Differential Privacy Library," a suite of new lightweight tools offering "an array of functionality to extract insight and knowledge from data with robust privacy guarantees."
"Most tasks can be run with only a single line of code," brags a new blog post (shared by Slashdot reader IBMResearch), explaining how it works: This year for the first time in its 230-year history the U.S. Census will use differential privacy to keep the responses of its citizens confidential when the data is made available. But how does it work? Differential privacy uses mathematical noise to preserve individuals' privacy and confidentiality while allowing population statistics to be observed.
This concept has a natural extension to machine learning, where we can protect models against privacy attacks, while maintaining overall accuracy. For example, if you want to know my age (32) I can pick a random number out of a hat, say ±7 — you will only learn that I could be between 25 and 39. I've added a little bit of noise to the data to protect my age and the US Census will do something similar.
While the US government built its own differential privacy tool, IBM has been working on its own open source version and today we are publishing our latest release v0.3. The IBM Differential Privacy Library boasts a suite of tools for machine learning and data analytics tasks, all with built-in privacy guarantees. Our library is unique to others in giving scientists and developers access to lightweight, user-friendly tools for data analytics and machine learning in a familiar environment... What also sets our library apart is our machine learning functionality enables organisations to publish and share their data with rigorous guarantees on user privacy like never before...
Also included is a collection of fundamental tools for data exploration and analytics. All the details for getting started with the library can be found at IBM's GitHub repository.
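As a rough illustration of the "add noise to the answer" idea in the summary, here is a minimal sketch of a Laplace-noised mean in plain Python. It does not use the IBM library's API. The names `laplace_noise` and `dp_mean`, the age bounds of 0 to 100, and the choice of the Laplace mechanism are all assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, epsilon, lower, upper, rng=None):
    # Hypothetical helper, not part of any real library.
    rng = rng or random.Random()
    # Clip each value to the assumed bounds so one record has bounded influence.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Assumption: the record count is public and neighbouring datasets differ
    # in one record, so the mean can move by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    ages = [32, 45, 27, 61, 38] * 200  # made-up data for illustration
    print(dp_mean(ages, epsilon=1.0, lower=0, upper=100))
```

A smaller `epsilon` means more noise and stronger privacy. With many records the noise on the mean is small, which is why population statistics stay useful while individual values stay hidden.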
Single line (Score:2)
Re: (Score:2)
Damn, came here to say this, my early obfuscation technique involved blocks of code delimited by semicolons, not Chr$(13)+Chr$(10) (Amstrad heritage!)
Re: (Score:2)
While agreeing that lines of code is not always a useful metric, it can have a large effect. "rm -rf /" for example, is a single line of shell code.
Re: (Score:2)
I was hampered by the choice my father made (CPC464, greenscreen) in some ways, but benefited in others, it taught me to workaround limitations and succeed as an underdog. Thank you for your reply, it made me smile!
Re: (Score:2)
Non-Amstrad users should note the use of the semicolon after the "HELLO ", this tells Locomotive BASIC to not print the Chr$(13)+Chr$(10), forcing my name to appear on the same line, which could be achieved without using the multiline operator using a space instead - this is someone who had an Amstrad, has a sense of humour, and wanted me to be happy. Thank you again!
Re: (Score:2)
I think the implication was that you can add a single (presumably not excessively long) line of code to your app and it just works.
Just like embedding an SQL SELECT statement then (Score:2)
Same wine, new bottle.
Re: (Score:2)
I think the implication was that you can add a single (presumably not excessively long) line of code to your app and it just works.
And yet, looking at the README there is no way anybody ever thought that about it.
They had to have just not understood what a "line of code" is compared to "a single API call."
Re: (Score:3)
If you want to be pedantic, you need to start with what exactly is being accomplished here.
About fifteen years ago cryptographers mathematically proved something that I think most people who have had experience with handling sensitive data probably suspected: given unlimited query access to a database it is possible to recover sensitive information about individuals that the database design and security policies are supposed to hide.
The only way to prevent this is to add statistical noise in a way that (a)
Re: (Score:2)
It is actually about 50k LOC, too.
You can't even import and call it with just one line.
The only thing that is one line is calling the method; same as any other API where you're only using one method.
Re: (Score:2)
No, but apparently reading isn't either, or you would realize what an idiotic comment this was. They have a library that drops in alongside scikit-learn and numpy which allows one to e.g. create a differential view of a test/train data split using a one line wrapper. That's pretty useful for people who need to expose user data to third parties via an ML model, without having to go through the steps of minimizing the dataset/dataframe in advance, or for people that want to experiment with differential privac
It only takes a single line of code... (Score:2)
The only data that is ever secure is the data that has not been entered, measured, stored or transferred at all.
Re: (Score:2)
The method they describe in the summary only works if you have a single data point that is rarely updated, like the census. The age brackets are +/- 5 years, census is every 10 years, no problem.
Thing is there are lots of other sources of data about you. Say this year you are in the 35-44 age bracket. Next year you apply for something and you are in the 45-54 bracket. Combining those two data points we know you are 45 years old.
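The bracket-combining argument above can be sketched in a few lines of Python. The years 2020 and 2021 and the helper name `birth_years` are made up for illustration, and the sketch treats age as simply the observation year minus the birth year.

```python
def birth_years(bracket, observed_year):
    # Hypothetical helper: all birth years consistent with being inside
    # the (low, high) age bracket in the given year.
    low, high = bracket
    return set(range(observed_year - high, observed_year - low + 1))


# One source reports the 35-44 bracket in 2020, another reports 45-54 in 2021.
candidates = birth_years((35, 44), 2020) & birth_years((45, 54), 2021)
print(sorted(candidates))
```

Each bracket alone leaves ten possible birth years. The intersection leaves one, which is the point of the comment: coarse buckets from different sources can pin down an exact value when combined.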
Re: (Score:2)
IBM has no incentive to sell your data, they're not an advertising company and they haven't taken any steps to actually collect anybody's data on a large scale.
You know you hate The Man. OK. But now consider: What if there is more than one Man?
Re: (Score:1)
Could be vulnerable to AI attacks? (Score:2)
It sounds like they are storing data with noise in a way that generates the same outputs in summation as the original data, but has been altered. Like how a picture with noise could generate the same blurred image as the original. However, reconstructing the most likely original image given an information reduced version is what the new AIs do so well, see http://thispersondoesnotexist.... [thisperson...texist.com]
What if my data is the most likely to an AI?
Re: Could be vulnerable to AI attacks? (Score:2)
This is a more direct example of what I'm talking about:
https://blogs.nvidia.com/blog/... [nvidia.com]
Re: (Score:2)
Re: (Score:2)
AI may often do interesting things, sometimes things that humans can't do, but it's not magic and can't change mathematics.
"Can't change mathematics?" Maybe not AI, but Fortran 66 sure could. Now how did it go, something like this:
      INTEGER I
      CALL SUB(1)
      I = 1 + 1
      WRITE (6,*) I
      END
      SUBROUTINE SUB(N)
      INTEGER N
      N = 2
      RETURN
      END
... would give the precise and perfectly accurate answer of: 4.
Why? Because Fortran was call by reference, so the subroutine call pointed to where the value for 1 had been previously stored and not a copy of it. The subroutine changed the value of that place to 2. Back in the m
Re: (Score:2)
What happens if somebody explains to you that math is an abstraction, and Fortran 66 is not an abstraction but a machine made of switches? Would you even notice the lack of agreement in terms?
Re: (Score:2)
Alas, I've already posted and there is no moderation option for interesting AND funny, so this virtual mod will have to do :-)
Re: (Score:2)
But multiple data sets independently obfuscated can. For example, let's say your date of birth is obfuscated in multiple databases. As I add more such datasets, your obfuscated date of birth will cluster around a small area, the more datasets, the more I can constrain that area.
It's not nothing, but it's also not a silver bullet.
Re: (Score:2)
But multiple data sets independently obfuscated can. For example, let's say your date of birth is obfuscated in multiple databases. As I add more such datasets, your obfuscated date of birth will cluster around a small area, the more datasets, the more I can constrain that area.
Cite? The papers on differential privacy that I've read appear to prove that combining sets with differential privacy of epsilon cannot produce a data set with differential privacy >epsilon. I'll admit that I haven't read them carefully and worked through the proofs myself, and that I'm going on memory from reading that I did a year or two ago, so it's possible that I'm mistaken... but I don't think so.
Re: (Score:2)
It's a matter of mathematics. Plot your age with error bars from multiple datasets. Your actual age will be the intersection of the error bars.
You may be thinking of the impossibility of teasing out a particular person's information from multiple statistical summations of multiple datasets.
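For what it's worth, the disagreement above can be tried out numerically. The sketch below assumes each database publishes the same true age (32, borrowed from the summary) with independent Laplace noise of scale 7. The function names and the seed are made up for illustration. In the usual formulation of sequential composition, k independent epsilon-private releases of the same value are together only guaranteed to be (k * epsilon)-private, so the guarantee weakens as releases pile up.

```python
import random


def laplace_noise(scale, rng):
    # Difference of two independent exponentials gives Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_releases(true_value, scale, count, rng):
    # Hypothetical model: `count` databases each add fresh, independent noise.
    return [true_value + laplace_noise(scale, rng) for _ in range(count)]


def average(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    rng = random.Random(1)
    for count in (1, 100, 10000):
        estimate = average(noisy_releases(32, 7.0, count, rng))
        print(count, round(estimate, 2))
```

Averaging more independent noisy copies pulls the estimate toward the true value, which matches the "clustering" described in the thread. It does not contradict the per-release guarantee; it reflects the privacy budget being spent once per release.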
Re: (Score:2)
10 GOTO 10 (Score:2)
Yeah - it's IBM code. (Score:2)
Looking at the code - definitely has that IBM feel to it.
Oh - not that it was written by the same folks or something - but that same culture seems to have kept a lot of that same fashion sense, I suppose - something like a love of knots mixed with a love of old-latin-themed math symbols, and sparsely explained documentation trees.
Microsoft has the same sort of ideals lingering in its code - but it's most seen in its godawful spartan documentation sections.
Like, it's all a knot you have to untie, in order to
Re: (Score:2)
Re: (Score:1)
Highly not likely (Score:2)
Re: (Score:3)
As a COBOL programmer? It may be considered a requirement.
Re: (Score:1)
name (Score:2)
One line of code... (Score:2)
Uhm, isn't that called a Function or Subroutine in most programming languages? That reduces complex things into one line of code.
"built-in privacy guarantees" (Score:1)
What noise? (Score:2)
Re: What noise? (Score:2)
Window manager in a single line of code (Score:2)
startx.
It just sounds like (Score:2)
Re: (Score:1)
The literal equivalent of pixelating an image. (Score:2)
Perhaps on the color range (as in: indexed colors), or the normal type.
Now I wonder whether you can still get "the picture", aka leak privacy, if it only does the former, like a GIF, instead of the latter, like a low-quality JPEG.
Still, ... sounds reasonable... Just add noise that does not distort the statistics, if the precision is still good enough. ...
I don't know why I would need a library for that though
Hahahah. APL! (Score:1)