


Google Releases VaultGemma, Its First Privacy-Preserving LLM
An anonymous reader quotes a report from Ars Technica: The companies seeking to build larger AI models have been increasingly stymied by a lack of high-quality training data. As tech firms scour the web for more data to feed their models, they could increasingly rely on potentially sensitive user data. A team at Google Research is exploring new techniques to make the resulting large language models (LLMs) less likely to 'memorize' any of that content. LLMs have non-deterministic outputs, meaning you can't exactly predict what they'll say. While the output varies even for identical inputs, models do sometimes regurgitate something from their training data -- if trained with personal data, the output could be a violation of user privacy. In the event copyrighted data makes it into training data (either accidentally or on purpose), its appearance in outputs can cause a different kind of headache for devs. Differential privacy can prevent such memorization by introducing calibrated noise during the training phase.
Adding differential privacy to a model comes with drawbacks in terms of accuracy and compute requirements. Until now, no one had bothered to work out how much it alters the scaling laws of AI models. The team worked from the assumption that model performance would be primarily affected by the noise-batch ratio, which compares the volume of randomized noise to the size of the original training data. By running experiments with varying model sizes and noise-batch ratios, the team established a basic understanding of differential privacy scaling laws, which describe a balance between the compute budget, privacy budget, and data budget. In short, more noise leads to lower-quality outputs unless offset with a higher compute budget (FLOPs) or data budget (tokens). The paper details the scaling laws for private LLMs, which could help developers find an ideal noise-batch ratio to make a model more private. This work has led to a new Google model called VaultGemma, the company's first open-weight model trained with differential privacy to minimize memorization risks. It's built on the older Gemma 2 foundation and sized at 1 billion parameters, and the company says it performs comparably to non-private models of similar size.
It's available now from Hugging Face and Kaggle.
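For readers curious what "calibrated noise during the training phase" looks like in practice, below is a minimal sketch of the general DP-SGD recipe (per-example gradient clipping plus Gaussian noise). It is an illustration of the technique under assumed, made-up constants and names, not Google's actual training code, and the comments only gesture at the noise-batch ratio idea described above.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One illustrative DP-SGD update: clip each example's gradient, sum,
    add Gaussian noise scaled to the clip norm, then average over the batch.

    The noise/batch relationship mirrors the 'noise-batch ratio' from the
    summary: for a fixed noise_multiplier, a larger batch dilutes the injected
    noise, which is why bigger compute/data budgets can offset the noise.
    """
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    summed = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return (summed + noise) / len(per_example_grads)

# Toy usage: 32 per-example gradients of a 4-parameter model.
grads = [np.random.default_rng(i).normal(size=4) for i in range(32)]
print(dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.0))
```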
seems to be a distraction from the root issue? (Score:1)
"Couldn't you just stop training models on copyrighted or privacy-infringing data?"
Google: "lol no"
Re: (Score:1)
The question was never to stop training. The question was how to process LLM queries in a privacy preserving way.
Every lie we tell incurs a debt to the truth. (Score:4, Interesting)
Re: (Score:2)
Did you have a look into the paper?
https://arxiv.org/abs/2501.189... [arxiv.org]
If I steal a car and paint it another color. (Score:5, Insightful)
Is this like dithering of images? Sure you changed the underlying pixels but the general image is still there - and last I checked that doesn't fix copyright violations.
Re: (Score:1)
.. I still stole the car. Same thing here. Is this like dithering of images? Sure you changed the underlying pixels but the general image is still there - and last I checked that doesn't fix copyright violations.
Learning is not a copyright violation.
It doesn't eliminate the problem. (Score:2)
As written, it doesn't eliminate the problem, it only attempts to minimize it. That makes it useless for the intended "privacy preserving" purpose:
- less likely to 'memorize' any of that content
- trained with differential privacy to minimize memorization risks
Pronounce that Mr. Privacy Pants (Score:1)
I have a hard time believing Google wants (Score:2)
Doesn't mean it isn't true, just that it's hard to believe them about it.
Re: (Score:1)
I suspect the real motivation is to obscure where the training data came from rather than protect anyone's privacy. They're trying to remove the /liability/ of using questionable data without having to actually stop scraping it.
non-deterministic outputs mean you can't predict w (Score:1)
Re: (Score:2)
Actually you don't always get the same result (or at least the exact same response to the same question). I've tried this with LLMs running locally (using ollama), making sure to restart the engine from scratch every time, so there is some randomness going on.
According to ChatGPT, Ollama does make use of a random number generator for some reason.
Re: (Score:2)
For a very good reason. Imagine a system without any entropy from temperature. It would be quite boring. Have a standardised environment of hardware and software stack and model, set your temperature to 0 and your seed to a known number, and you will find the output deterministic and reproducible. Change your video card, pytorch version, or whatever, and you might see some drift, but as long as the temperature and seed are held constant the output stays reproducible.
Re: (Score:2)
llama.cpp (used by ollama; they are, let's say, hesitant to give credit) uses random sampling, i.e. a nonzero temperature, by default. ollama wraps it with config files, but by default it should have a temperature around 0.7-0.8, which means a reasonable amount of variation between different outputs.
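If anyone wants to check the temperature/seed point above locally, here is a minimal sketch against Ollama's local REST API on its default port. The model tag is only an example and must already be pulled, and, as the parent notes, exact reproducibility can still drift across hardware and library versions.

```python
import json
import urllib.request

def generate(prompt, model="gemma2:2b", temperature=0.0, seed=42):
    """Call a local Ollama server with a fixed temperature and seed.
    With temperature 0 (greedy decoding) and a pinned seed, repeated calls
    on the same hardware/software stack should return the same text.
    """
    payload = {
        "model": model,          # assumes this model tag has been pulled locally
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "seed": seed},
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    a = generate("Name three prime numbers.")
    b = generate("Name three prime numbers.")
    print(a == b)  # expected True under the assumptions above
```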
No, LLM do NOT have nondeterministic output (Score:4, Interesting)
Stop writing that bullshit.
An LLM is an artificial neural network and therefore equivalent to a mathematical function, and a function produces the same output for the same input. EVERY artificial neural network inherits that property.
The output of an LLM at each step is a probability distribution. If you always choose the most probable token, you always get the same output. Random sampling is a choice you make yourself, not a property inherent to the system.
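To make the parent's point concrete, here is a toy sketch of a single decoding step: the same logits always give the same token under greedy (temperature 0) decoding, and the apparent randomness only appears once you sample. The numbers are made up.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Pick the next token from a logits vector.

    temperature == 0 -> greedy argmax: identical logits always yield the same
    token, so decoding is deterministic. temperature > 0 -> sample from the
    softmax distribution, which is where the variation comes from.
    """
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.5, 0.3, -1.0]  # toy distribution over 4 tokens
print([sample_next_token(logits, 0.0) for _ in range(5)])  # always token 0
print([sample_next_token(logits, 0.8) for _ in range(5)])  # varies run to run
```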
Re: No, LLM do NOT have nondeterministic output (Score:3)
Re: (Score:3)
But you do it voluntarily. If you need deterministic output, set the temperature to 0 (or TopK to 1) and you can compare your results across runs.
I just hate people parroting things like "LLMs are random!" or "Nobody understands how AI works" and thinking it's clever criticism.
Re: (Score:2)
Look up model temperature
allo literally defined it in the post:
If you always choose the most probable token, you always get the same output. Random sampling is a choice...
Re: (Score:3)
You are 100% correct.
For human interactions, the LLM seems random not only because most interfaces set the temperature to a nonzero value, but also because seemingly irrelevant changes such as spacing or punctuation will change the LLM output.
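A quick way to see the spacing/punctuation effect is to look at the token ids the model actually receives. This sketch assumes the third-party tiktoken package is installed and uses one example encoding; any tokenizer shows the same behaviour.

```python
# Superficially trivial edits change the token sequence the model sees.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # just one example encoding
for text in ["What is VaultGemma?", "What is VaultGemma ?", "what is vaultgemma?"]:
    print(repr(text), "->", enc.encode(text))
# The three prompts mean the same thing to a human, but the token id lists
# differ, so the model starts from a different input and can answer differently.
```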
Re: (Score:2)
Good point!
It also seems random because the result may be unexpected. People hear that something about LLMs is random, see a bad result, and associate the randomness with the result rather than with the token sampling. But the result isn't bad because of random sampling; it's bad because of the limitations of the model.
LLM sampling is a quite interesting rabbit hole, though.
Or just... (Score:2)