Google Releases VaultGemma, Its First Privacy-Preserving LLM

An anonymous reader quotes a report from Ars Technica: The companies seeking to build larger AI models have been increasingly stymied by a lack of high-quality training data. As tech firms scour the web for more data to feed their models, they could increasingly rely on potentially sensitive user data. A team at Google Research is exploring new techniques to make the resulting large language models (LLMs) less likely to 'memorize' any of that content. LLMs have non-deterministic outputs, meaning you can't exactly predict what they'll say. While the output varies even for identical inputs, models do sometimes regurgitate something from their training data -- if trained with personal data, the output could violate user privacy. If copyrighted material makes it into the training data (either accidentally or on purpose), its appearance in outputs can cause a different kind of headache for devs. Differential privacy can prevent such memorization by introducing calibrated noise during the training phase.
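To make that last idea concrete, here is a minimal DP-SGD-style training step in PyTorch showing where the calibrated noise enters. This is a sketch only, not Google's actual pipeline; the clipping norm and noise multiplier are placeholder values, and a real implementation would also track the resulting privacy budget.

    import torch

    def dp_sgd_step(model, loss_fn, batch, optimizer,
                    clip_norm=1.0, noise_multiplier=1.1):
        """Illustrative DP-SGD step: clip each example's gradient so no
        single record dominates, then add Gaussian noise calibrated to
        the clipping norm before the weight update."""
        summed = [torch.zeros_like(p) for p in model.parameters()]
        for x, y in batch:                           # per-example gradients
            model.zero_grad()
            loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
            grads = [p.grad.detach().clone() for p in model.parameters()]
            norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
            scale = (clip_norm / (norm + 1e-12)).clamp(max=1.0)
            for acc, g in zip(summed, grads):
                acc += g * scale                     # bounded per-example influence
        for p, acc in zip(model.parameters(), summed):
            noise = torch.normal(0.0, noise_multiplier * clip_norm, size=acc.shape)
            p.grad = (acc + noise) / len(batch)      # noisy averaged gradient
        optimizer.step()

Tying the noise scale to the per-example clipping norm is what makes the privacy guarantee quantifiable: no single training record can shift the update by more than a bounded, noise-masked amount.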

Adding differential privacy to a model comes with drawbacks in terms of accuracy and compute requirements. Until now, no one had bothered to figure out the degree to which that alters the scaling laws of AI models. The team worked from the assumption that model performance would be primarily affected by the noise-batch ratio, which compares the volume of randomized noise to the size of the original training data. By running experiments with varying model sizes and noise-batch ratios, the team established a basic understanding of differential privacy scaling laws, which describe a balance between the compute budget, privacy budget, and data budget. In short, more noise leads to lower-quality outputs unless offset with a higher compute budget (FLOPs) or data budget (tokens). The paper details the scaling laws for private LLMs, which could help developers find an ideal noise-batch ratio to make a model more private.
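A toy calculation (illustrative numbers only, nothing from the paper) shows why a bigger batch can offset stronger noise: the noise added per step is averaged over more examples, so the effective noise-batch ratio falls.

    # Toy numbers only: as the batch grows, the noise added per step is
    # spread over more examples, so the effective noise-batch ratio shrinks.
    def noise_batch_ratio(noise_std: float, batch_size: int) -> float:
        return noise_std / batch_size

    for batch_size in (1_000, 10_000, 100_000):
        print(batch_size, noise_batch_ratio(noise_std=50.0, batch_size=batch_size))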
This work has led to a new Google model called VaultGemma, the company's first open-weight model trained with differential privacy to minimize memorization risks. It's built on the older Gemma 2 foundation, sized at 1 billion parameters, and, Google says, performs comparably to non-private models of similar size.

It's available now from Hugging Face and Kaggle.
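For anyone who wants to try it, loading the model through the usual Hugging Face transformers API should look roughly like the sketch below; the model id is an assumption, so confirm the actual listing, license, and any access gating on the model page.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "google/vaultgemma-1b"  # assumed id -- confirm on Hugging Face
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer("Differential privacy in one sentence:", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))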

Comments:
  • "Couldn't you just stop training models on copyrighted or privacy-infringing data?"

    Google: "lol no"

    • by Anonymous Coward

The question was never whether to stop training. The question was how to process LLM queries in a privacy-preserving way.

  • by Pseudonymous Powers ( 4097097 ) on Tuesday September 16, 2025 @09:08AM (#65662902)
    I bet VaultGemma safeguards your privacy the same way that RBMK reactors don't explode.
  • by Fly Swatter ( 30498 ) on Tuesday September 16, 2025 @09:18AM (#65662924) Homepage
    .. I still stole the car. Same thing here.

    Is this like dithering of images? Sure you changed the underlying pixels but the general image is still there - and last I checked that doesn't fix copyright violations.
    • by xpyr ( 743763 )

      .. I still stole the car. Same thing here. Is this like dithering of images? Sure you changed the underlying pixels but the general image is still there - and last I checked that doesn't fix copyright violations.

      Learning is not a copyright violation.

As it is written, it doesn't eliminate the problem. It only attempts to minimize it. That makes it useless for the stated goal of "privacy preserving":

    - less likely to 'memorize' any of that content

    - trained with differential privacy to minimize memorization risks

  • GPT6vg wasn't sexy enough.
Hard to believe Google actually wants to protect anyone's privacy. User data is their bread and butter.

    Doesn't mean it isn't true, just that it's hard to believe them about it.

    • by Anonymous Coward

      I suspect the real motivation is to obscure where the training data came from rather than protect anyone's privacy. They're trying to remove the /liability/ of using questionable data without having to actually stop scraping it.

  • Am I the only one to cringe when I read "LLMs have non-deterministic outputs, meaning you can't exactly predict what they'll say"? Deterministic doesn't mean that you can predict the outcome. If the process is complex but not random, you always get the same result even though you can't predict it.
    • by JustNiz ( 692889 )

      Actually you don't always get the same result (or at least the exact same response to the same question). I've tried this with LLMs running locally (using ollama), making sure to restart the engine from scratch every time, so there is some randomness going on.
      According to ChatGPT, Ollama does make use of a random number generator for some reason.

      • According to ChatGPT, Ollama does make use of a random number generator for some reason.

For a very good reason. Imagine a system without any entropy from temperature. It would be quite boring. Use a standardised environment of hardware, software stack, and model, set your temperature to 0 and your seed to a known number, and you will find the output deterministic and reproducible. Change your video card, pytorch version, or whatever, and you might see some drift, but as long as the temperature and seed stay fixed, repeated runs will match.

      • by allo ( 1728082 )

llama.cpp (used by ollama, which is, let's say, hesitant to give credit) uses random numbers, i.e. a nonzero temperature by default. Ollama wraps it with config files, but the default temperature should be around 0.7-0.8, which means a fair amount of variation between outputs.
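For anyone who wants to verify this locally, something like the following against a stock Ollama install (assuming its standard /api/generate endpoint; the model name is just an example of whatever you have pulled) should produce repeatable output once the temperature is zeroed and the seed is pinned:

    import requests

    # Zero temperature plus a fixed seed should make repeated runs match;
    # drop the "options" block to fall back to the ~0.7 default and see variation.
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": "gemma2:2b",              # example; use any locally pulled model
        "prompt": "Name three prime numbers.",
        "stream": False,
        "options": {"temperature": 0, "seed": 42},
    })
    print(response.json()["response"])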

  • by allo ( 1728082 ) on Tuesday September 16, 2025 @01:14PM (#65663526)

    Stop writing that bullshit.

An LLM is an artificial neural network and is therefore equivalent to a mathematical function, and a function produces the same output for the same input. EVERY artificial neural network inherits that property.

The output of an LLM at each step is a probability distribution. If you always choose the most probable token, you always get the same output. Random sampling is a choice you make yourself, not a property inherent to the system. (A short sampling sketch follows after this thread.)

In practice we add randomness because it makes for better utility overall. Look up model temperature.
      • by allo ( 1728082 )

But you do it voluntarily. If you need deterministic output, you set temperature to 0 (or TopK to 1) and can compare your results across runs.
I just hate people parroting things like "LLMs are random!" or "Nobody understands how AI works" and thinking it passes for clever criticism.

      • by MobyDisk ( 75490 )

        Look up model temperature

        allo literally defined it in the post:

        If you always choose the most probable token, you always get the same output. Random sampling is a choice...

    • by MobyDisk ( 75490 )

      You are 100% correct.

      For human interactions, the LLM seems random not only because most interfaces set the temperature to a nonzero value, but also because seemingly irrelevant changes such as spacing or punctuation will change the LLM output.

      • by allo ( 1728082 )

        Good point!

It also seems random because the result may be unexpected. People hear that something about LLMs is random, see a bad result, and associate the randomness with the result rather than with the token sampling. But the result isn't bad because of random sampling; it's bad because of the limitations of the model.
LLM sampling is quite an interesting rabbit hole, though.
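To make the greedy-versus-sampling point from this thread concrete, here is a small PyTorch sketch with toy logits: the same per-step probability distribution gives a fixed choice under greedy decoding and a varying one under temperature sampling.

    import torch

    logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # toy per-step scores

    # Greedy decoding: always take the most probable token -- deterministic.
    greedy_token = torch.argmax(logits).item()

    # Temperature sampling: scale the logits, then draw from the distribution.
    temperature = 0.8
    probs = torch.softmax(logits / temperature, dim=-1)
    sampled_token = torch.multinomial(probs, num_samples=1).item()

    print(greedy_token)   # same every run
    print(sampled_token)  # varies run to run unless the RNG seed is fixed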

Or you could just run a locally hosted model, especially with the newer models that can be run on a single rack server.
