Stable Diffusion 'Memorizes' Some Images, Sparking Privacy Concerns (arstechnica.com) 37
An anonymous reader quotes a report from Ars Technica: On Monday, a group of AI researchers from Google, DeepMind, UC Berkeley, Princeton, and ETH Zurich released a paper outlining an adversarial attack that can extract a small percentage of training images from latent diffusion AI image synthesis models like Stable Diffusion. It challenges views that image synthesis models do not memorize their training data and that training data might remain private if not disclosed. Recently, AI image synthesis models have been the subject of intense ethical debate and even legal action. Proponents and opponents of generative AI tools regularly argue over the privacy and copyright implications of these new technologies. Adding fuel to either side of the argument could dramatically affect potential legal regulation of the technology, and as a result, this latest paper, authored by Nicholas Carlini et al., has perked up ears in AI circles.
However, Carlini's results are not as clear-cut as they may first appear. Discovering instances of memorization in Stable Diffusion required 175 million image generations for testing and preexisting knowledge of trained images. Researchers only extracted 94 direct matches and 109 perceptual near-matches out of 350,000 high-probability-of-memorization images they tested (a set of known duplicates in the 160 million-image dataset used to train Stable Diffusion), resulting in a roughly 0.03 percent memorization rate in this particular scenario. Also, the researchers note that the "memorization" they've discovered is approximate since the AI model cannot produce identical byte-for-byte copies of the training images. By definition, Stable Diffusion cannot memorize large amounts of data because the size of the 160,000 million-image training dataset is many orders of magnitude larger than the 2GB Stable Diffusion AI model. That means any memorization that exists in the model is small, rare, and very difficult to accidentally extract.
Still, even when present in very small quantities, the paper appears to show that approximate memorization in latent diffusion models does exist, and that could have implications for data privacy and copyright. The results may one day affect potential image synthesis regulation if the AI models become considered "lossy databases" that can reproduce training data, as one AI pundit speculated. Although considering the 0.03 percent hit rate, they would have to be considered very, very lossy databases -- perhaps to a statistically insignificant degree. [...] Eric Wallace, one of the paper's authors, shared some personal thoughts on the research in a Twitter thread. As stated in the paper, he suggested that AI model-makers should de-duplicate their data to reduce memorization. He also noted that Stable Diffusion's model is small relative to its training set, so larger diffusion models are likely to memorize more. And he advised against applying today's diffusion models to privacy-sensitive domains like medical imagery.
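A quick sanity check of the summary's arithmetic, assuming (as seems likely) that the roughly 0.03 percent figure counts only the 94 direct matches; counting the near-matches as well gives about 0.06 percent:

    # Reproducing the summary's ~0.03 percent memorization rate from its own numbers.
    direct_matches = 94
    near_matches = 109
    images_tested = 350_000

    print(f"direct only:   {direct_matches / images_tested:.4%}")                    # ~0.0269%
    print(f"direct + near: {(direct_matches + near_matches) / images_tested:.4%}")   # ~0.0580%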
Me too! (Score:3)
I'm very sure that there are some images stored in my memory, as well as other types of copyrighted information. Retrieving an image from a human memory right now is impossible (maybe some day a not-yet-invented MRI-like magic machine will manage it!), but retrieving segments of copyrighted works of literature or designs I may have memorized is possible for humans. But that doesn't mean I am stealing or plagiarizing these. They are just things I can draw on to inspire new works.
So I don't see why latent recovery of
Re: (Score:3)
Retrieving an image from a human memory right now is impossible
It's not, really. You can draw it. Or describe it. Or imagine it. Plenty of things.
Re:Me too! (Score:4, Interesting)
I wouldn't be surprised if there are savants that can duplicate images perfectly from memory.
Which makes me ask whether they are walking copyright violations?!?
Re: (Score:2)
No but if they draw it out and then start selling copies, then that's a copyright violation.
Re: (Score:2)
Actually there's legal precedent for "copied" images to remain safe from copyright claims depending on how accurate the copy was. It's possible to claim the work is transformative and therefore fair use (e.g. Marilyn Diptych by Andy Warhol).
But you're right that the one trying to sell it would be the target of lawsuits. Since you can't sue a piece of software, the person instructing it to make the copy would have to be liable. Just like they'd be liable if they took the image to a photocopier.
Re:Me too! (Score:4, Interesting)
There was a guy I knew several decades ago. He was mentally retarded (whatever the word is now) but he could draw anything that he saw for a few seconds. A car just drove by on the street? He could draw it after it went around the corner.
Oddly enough, the one thing that he couldn't draw was writing. If he drew a truck that had "Joe's Plumbing" on the door he just put in a squiggle in that place, even though the rest of the vehicle was perfectly drawn. Same thing if he drew a building or anything else. He never included anything written on them.
Re: (Score:3)
The challenge for copyright is that the person using the AI tool could be completely oblivious that the output was a copyright problem and just roll with it.
If it comes from your own memory, then in all likelihood you know exactly what you are doing, but AI transformations of training data tend not to carry any attribution in the output. That makes it very difficult to use comfortably, since you don't know whether a chunk of something made it through that could cause a problem later.
Re: (Score:2)
175 million image generations
If you throw enough shit at a wall it's gonna line up just like an image of your mom sooner or later.
Re: (Score:2)
Never really looked at the details of the theory for Markov chains. Is that a property?
Re: Its all "memory" (Score:2)
No. This is also using the term Markov chain in a very abstract way, as neural nets are in no way obeying the rules of Markov chains in general. But it's a solid no to your question.
Re: (Score:2)
Thanks. It didn't sound very intuitive to me. I am very well aware that ANNs and Markov chains are different things.
Re: Its all "memory" (Score:2)
Stop claiming everything is a Markov chain. They're not the same thing. This is absolutely not a state machine like a Markov chain other than in the sense that any computational process is a state machine.
Image search (Score:2)
Not all that different than directly searching LAION-V or doing a Google Image search.
Re:Image search (Score:5, Interesting)
Indeed. The paper authors first identified images that were highly duplicated (> 100x) in StableDiffusion's training dataset - not byte-for-byte, but the same content, just slightly modified (cropped, overlaid text, rescaled, etc. - as is common on the internet). They then used the training captions for these images to generate 500 images each, and if the latents of any of the 500 images were similar to any others in that set, then that was a candidate for overtraining and thus extraction.
They also tried it with images that were unique in StableDiffusion's training dataset, but even with over 10k tried, they could not extract any of them.
I think it should be pointed out that if StableDiffusion, in training, sees the same image over and over again, it will learn it the same as it will learn anything else it sees over and over again - the Mona Lisa, an American flag, a zebra, whatever. While each individual image contributes on average a byte or so to the training, with enough duplicates (and/or bias toward a given set of images), you can train to anything.
TL/DR: LAION needs to use a better de-duplication algorithm.
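A rough Python sketch of the candidate-filtering loop described in the comment above - not the paper's actual code. generate_image() and embed() are hypothetical stand-ins for a real diffusion pipeline and a perceptual embedding (e.g. CLIP), stubbed with random vectors here just so the script runs end to end:

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_image(caption: str, seed: int) -> np.ndarray:
        # Stand-in for sampling a diffusion model with this caption and seed.
        return rng.normal(size=512)

    def embed(image: np.ndarray) -> np.ndarray:
        # Stand-in for a perceptual embedding of the generated image (unit norm).
        return image / np.linalg.norm(image)

    def is_memorization_candidate(caption: str, n_samples: int = 500,
                                  threshold: float = 0.95) -> bool:
        # Generate n_samples images for one caption and check whether any two of
        # them are nearly identical in embedding space; if so, the model is probably
        # reproducing a memorized training image rather than sampling freely.
        embs = np.stack([embed(generate_image(caption, seed)) for seed in range(n_samples)])
        sims = embs @ embs.T              # cosine similarities (vectors are unit norm)
        np.fill_diagonal(sims, -1.0)      # ignore self-similarity
        return bool((sims > threshold).any())

    duplicated_captions = ["caption of an image duplicated >100x in the training set"]
    candidates = [c for c in duplicated_captions if is_memorization_candidate(c, n_samples=50)]
    print(candidates)

With the random stubs nothing clusters, so no candidates are flagged; the point is only the shape of the filter, not the result.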
Re: (Score:2)
And an image found via GIS isn't a new image copyrighted by the person who found it. Is an AI training set image extracted this way a new image? Is any image generated by an AI a new image, free of the copyrights of the images in the training set it was created from?
Re: (Score:2)
And an image found via GIS isn't a new image copyrighted by the person who found it. Is an AI training set image extracted this way a new image? Is any image generated by an AI a new image, free of the copyrights of the images in the training set it was created from?
The algorithm does not in general spit back the identical image; in fact it is extremely unlikely to if the training set is wide and comprehensive. In any case, blocking exact copies will not hinder its performance whatsoever. So it’s not like a GIS cropping at all. It is a new image with, yes, some “inspiration” from its training set, in the exact way this works with humans and the exact same copyrighted material, with the exception that the human artist usually says which artists or styl
Re: (Score:2)
Well, Google and other search engines have a legal exception in most places. "AI" does not.
BS (Score:3)
This is BS. If this were possible, it would be compression on a scale unknown to man. Anyone able to compress data even 5 orders of magnitude worse than this would make serious money.
This is like compressing a movie down to the size of a few image frames.
Simply not the same thing at all.
Re: BS (Score:2)
Read the article. It doesn't remember every training image, just a few. And even then I doubt it's anywhere close to pixel perfect.
Re: (Score:1)
That's not how it works. You have to consider the amount of data these models hold.
I can compress an entire movie into one byte. Feed that byte into my decompressor and your movie is ready. (BTW, my decompressor program is 500GB)
Re: BS (Score:2)
Things like ChatGPT have something like 80 gigaparameters, not terabytes.
Re: BS (Score:2)
Which tells you a lot about the actual variety of data we store in images. There are only a very few of the possible images out there that we actually care about. These AIs are great at remembering the full gamut of those images, and terrible at remembering the other random noise.
Very Very Lossy Database (Score:2)
This is BS.
No, it is brilliant: imagine when the current AI spring is over and marketing discovers that they can sell this stuff again as "VVLDB Technology" and everything is new once more.
Re: (Score:1)
The size of all the data involved must be taken into account to comment on the supposed compression ratio. Stable Diffusion's parameter/weights set comes to 4GB afaik. So in compression terms, if that's your data dictionary, and you're only able to use stable diffusion to lossily recreate a handful of images from its training set, that's very poor compression indeed. You can't use this technique to recreate real pictures that weren't in the training set, nothing is for free.
So no, it's not really like compr
Re: BS (Score:2)
You can, and there's active research into doing so, but the data stored has to be more than just the prompt and the seed. You also need additional data to get from the generated image to the image you actually wanted. That data, though, is generally a lot less than you would need to store the whole image.
Re: (Score:2)
This is BS. If this were possible, it would be compression on a scale unknown to man. Anyone able to compress data even 5 orders of magnitude worse than this would make serious money.
Except if you use compression on information, you'd expect to be able to recover 100% of it, not just 0.03% at best, and only provided you know exactly what you're looking for.
The study is still interesting, though: it shows how these algorithms could and should be improved. Sadly it's also going to be used by IP fundamentalists to drive lawyers and judges crazy with endless disingenuous argumentation, and very likely to incite legislators to pass even more onerous and outlandish IP laws than we already have to k
Re: (Score:2)
It's not predictable how well a given image would stay 'intact' in the corpus of other training material.
Further, there was work not too long ago posted here that did exactly this, using an AI model to 'compress' an image. It didn't actually compress the bitmap verbatim, but, roughly described, it would preserve a level of detail equivalent to: "this general area is brown fur, this general area is hay, and shading is from top-right" and then reconstitute a plausible analogous image when asked to 'un
Re: BS (Score:2)
Yeh… turns out these AIs are really good at compression. Because they organise data into the things that occur most commonly in images, associated with the terms used to describe them, they're extremely good at highly compressing lots of things. There's research ongoing into compression algorithms both for images and video that start with an image generated by one of these AIs and store the difference from the actual desired image.
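A toy Python sketch of the general idea described in these two comments - not any specific published codec: store a prompt, a seed, and the residual between the generated image and the real one, then reconstruct by regenerating and adding the residual back. generate() is a hypothetical stand-in for a text-to-image model; here it just derives a pseudo-image from the prompt and seed so the example runs.

    import numpy as np

    def generate(prompt: str, seed: int, shape=(64, 64, 3)) -> np.ndarray:
        # Stand-in for sampling a diffusion model from (prompt, seed).
        rng = np.random.default_rng(abs(hash((prompt, seed))) % (2**32))
        return rng.integers(0, 256, size=shape).astype(np.int16)

    def compress(real_image: np.ndarray, prompt: str, seed: int):
        # The "compressed" form is the prompt, the seed, and a residual image.
        base = generate(prompt, seed, real_image.shape)
        residual = real_image.astype(np.int16) - base
        return prompt, seed, residual

    def decompress(prompt: str, seed: int, residual: np.ndarray) -> np.ndarray:
        base = generate(prompt, seed, residual.shape)
        return (base + residual).clip(0, 255).astype(np.uint8)

    real = np.random.default_rng(1).integers(0, 256, size=(64, 64, 3)).astype(np.uint8)
    p, s, r = compress(real, "a photo of a cat", seed=42)
    assert np.array_equal(decompress(p, s, r), real)   # lossless round trip

In practice the residual would itself be quantized and entropy-coded; the scheme only wins when the generated base image is already close to the target, which is exactly what a well-trained model is supposed to buy you.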
Hands are right (Score:2)
You know that it was memorization and recall because the hands are human.
It's just human learning, with computer fidelity (Score:2)
Every moment your eyes are open you are processing information. Scenes are recorded, items categorized, sense made of the things around us visually.
Do you remember every moment as a frame in your mind? No. You do have some frames though, from major life events, but even they are "blurry", you couldn't draw your mental image of a moment and the mental image itself isn't a perfect "photo".
The key to all of this is the vast amount of visual input we receive (daily even). And our minds categorize everything,
But it didn't (Score:2)
Even in the low quality JPG you can see that they did not manage to faithfully extract the image, the best they could get was similar to an enormously overcompressed JPEG. It's clearly recognizable, but how many of the image's pixels are identical to the original? (Without the actual image extracted, only this overcooked crap from Arse Compressica, there's no way to tell.) TFA says "Researchers only extracted 94 direct matches and 109 perceptual near-matches" but if these are their best examples, then I'd a
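If you did have both the extracted image and the original at the same resolution (which, as the comment notes, we don't), answering the "how many pixels are identical" question is a one-liner with Pillow and numpy; the file names below are hypothetical:

    import numpy as np
    from PIL import Image

    a = np.asarray(Image.open("original.png").convert("RGB"))
    b = np.asarray(Image.open("extracted.png").convert("RGB"))
    assert a.shape == b.shape, "images must have the same dimensions"

    identical = np.all(a == b, axis=-1)   # True where all three channels match exactly
    print(f"{identical.mean():.2%} of pixels are byte-for-byte identical")
    print(f"mean absolute per-channel error: {np.abs(a.astype(int) - b.astype(int)).mean():.1f}")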
"Training Models" is Mass Plagerism (Score:2)
What's the bound on compression? (Score:2)
By definition, Stable Diffusion cannot memorize large amounts of data because the size of the 160,000 million-image training dataset is many orders of magnitude larger than the 2GB Stable Diffusion AI model. That means any memorization that exists in the model is small, rare, and very difficult to accidentally extract.
This is nonsense.
To my knowledge, there's no mathematical proof that you can't compress that many images (I guess 160 billion? Why not say that?) down to 2GB, or to any particular lower bound. It's certainly not "by definition".
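For what it's worth, the back-of-envelope average budget per training image is tiny under either reading of the summary's ambiguous figure, though as the comment says that is an average, not a proof about any individual image:

    # Average model capacity per training image under both readings of the figure.
    model_bytes = 2 * 1024**3   # ~2 GB model, per the summary

    for label, n_images in [("160 million", 160_000_000), ("160 billion", 160_000_000_000)]:
        bytes_per_image = model_bytes / n_images
        print(f"{label}: ~{bytes_per_image:.3f} bytes (~{bytes_per_image * 8:.2f} bits) per image")

    # Roughly a dozen bytes per image at 160 million, or about a tenth of a bit at 160 billion.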