Mark Zuckerberg Gave Meta's Llama Team the OK To Train On Copyright Works, Filing Claims (techcrunch.com) 30
Plaintiffs in Kadrey v. Meta allege that Meta CEO Mark Zuckerberg authorized the team behind the company's Llama AI models to use a dataset of pirated ebooks and articles for training. They further accuse the company of concealing its actions by stripping copyright information and torrenting the data. TechCrunch reports: In newly unredacted documents filed (PDF) with the U.S. District Court for the Northern District of California late Wednesday, plaintiffs in Kadrey v. Meta, who include bestselling authors Sarah Silverman and Ta-Nehisi Coates, recount Meta's testimony from late last year, during which it was revealed that Zuckerberg approved Meta's use of a data set called LibGen for Llama-related training. LibGen, which describes itself as a "links aggregator," provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued a number of times, ordered to shut down, and fined tens of millions of dollars for copyright infringement.
According to Meta's testimony, as relayed by plaintiffs' counsel, Zuckerberg cleared the use of LibGen to train at least one of Meta's Llama models despite concerns within Meta's AI exec team and others at the company. The filing quotes Meta employees as referring to LibGen as a "data set we know to be pirated," and flagging that its use "may undermine [Meta's] negotiating position with regulators." The filing also cites a memo to Meta AI decision-makers noting that after "escalation to MZ," Meta's AI team "[was] approved to use LibGen." (MZ, here, is rather obvious shorthand for "Mark Zuckerberg.")
The details seemingly line up with reporting from The New York Times last April, which suggested that Meta cut corners to gather data for its AI. At one point, Meta was hiring contractors in Africa to aggregate summaries of books and considering buying the publisher Simon & Schuster, according to the Times. But the company's execs determined that it would take too long to negotiate licenses and reasoned that fair use was a solid defense. The filing Wednesday contains new accusations, like that Meta might've tried to conceal its alleged infringement by stripping the LibGen data of attribution.
Meta[stasize] is metastasizing! (Score:2)
The cancer is deep, and spreading!
Shocked! (Score:3)
I'm shocked! Shocked, I tell you!
Well, not that shocked.
Re: (Score:2)
This is how Facebook has always behaved. It's the old principle of "it's better to beg forgiveness than ask permission", carried to ridiculous extremes. They have historically broken both laws and norms... then, when they get caught, they say "mea culpa" - but with the damage already done and not recoverable, which seems to be the intent.
So, unfortunately, your joke/meme doesn't work with Facebook-related news simply because no one could possibly be shocked by their behavior after all this time.
Re: (Score:2)
Where did Meta do that?
Also, "for any purpose" is doing a lot of heavy lifting there - not least of which since you didn't include the word "publicly" in that sentence.
Also, in most cases, copyvio is not a crime. Non-fair use of copyrighted material is generally a civil offense. There are statutes for criminal copyright infringement, but they generally apply to things like bootlegging ri
Re: (Score:2, Informative)
If I downloaded a torrent of copyrighted work, and used that to make a product, then sell that product, I'd expect to be chased after for criminal copyright infringement, since I intended to financially benefit from the willful copyright infringement.
If it's intentional infringement and for large scale commercial gain, it's criminal.
Monetary damages for large corporations should be dropped in favour of the alternative, which is already an option in law, of imprisonment.
Mark Zuckerberg would gladly hand over
Re:2 sets of laws: ones for the rich and (Score:4, Informative)
First, if you download a copy of This Old House, and then you use that information to build a house, that house is NOT a violation of copyright.
Secondly, there are broad, generally accepted exemptions in copyright law for the automated processing of copyrighted information. Literally like 95% of Google's business model would be illegal if not for that.
For an extreme case, look at Google Books. Google mass-scanned books, not just without permission, but explicitly against publisher wishes. Then it put them online, made them searchable, and showed excerpts (up to whole pages at a time). Zero permission - they just went and did this, and they didn't just learn from the books, they reproduced exact content from them.
Guess what? The courts found even that to be a transformative use. Google won.
There have been cases on AI training that have reached completion - for example, the LAION case in Germany. It was upheld.
Contrary to what you may think, copyright doesn't give the holder a dictatorship. For example, you can shout from the rooftops how you absolutely ban anyone from using it for parody.... tough luck, you don't have the right to stop it; you were never granted that right. There are broad classes of exemptions upheld by the courts. The purpose of copyright law is prosocial. It is intended to strike a balance between encouraging the creation of more material, and enabling society to benefit from said material.
Copyright is also based on specific works**. Like with the house, it doesn't matter if the housebuilder learned from a given work - so long as they're not reproducing a specific copyright-protected house, to within the bounds of qualification as a derivative work, they're perfectly fine; the copyright holder of the book they learned from has absolutely no claim against them. Styles are not copyrightable.
** The one oddball carve-out is character copyrights. But it's a pretty narrowly circumscribed exception, and characters are considered to stem from specific works anyway.
Re: (Score:2)
I'd expect to be chased after for criminal copyright infringement
That's very unlikely.
You might be sued in civil court, but the police won't be involved.
since I intended to financially benefit
Your financial benefit is irrelevant. The copying is illegal, not the profit.
I doubt he'd like spending a year in prison.
Very unlikely. Even Kim Dotcom didn't go to prison.
Everywhere All At Once (Score:1)
Where did Meta do that?
They potentially do it anytime you ask their model for a result: it may at any time include portions of copyrighted works, which has been demonstrated pretty often.
That is copyrighted work being published from the website and service they built, which you are accessing.
Re: (Score:1)
They potentially do it anytime you ask their model for a result: it may at any time include portions of copyrighted works, which has been demonstrated pretty often.
Exactly.
Re: (Score:2)
Re: (Score:1)
Portions, as in snippets or relatively short excerpts? You mean just like Google Books does? Sounds like fair use to me.
If I took 100 pages of copyrighted works from 10 books and made my own book with them, that would just be 10 different copyright violations.
Re: (Score:1)
Portions, as in snippets or relatively short excerpts?
No.
I mean entire images except for slight details and a background changed.
Or multiple paragraphs, un-cited.
You mean just like Google Books does?
They acknowledge the original author and the work it came from, even when what's shown is only similar.
Re: (Score:3)
You can make a reasonable argument that AI training is fair use. After all, it's really just a mechanized version of what humans do. Where do writers get their ideas? There are all kinds of answers they'll give you -- real life observation, experience, even just the act of sitting down and writing. But one thing they never say, though they all do it, is that they get their ideas from other writers. Writers are readers first; everything they read goes into their (actual) neural net and comes back out as new stuff. Every
Re: (Score:2, Insightful)
That's an interesting question about 'fair use'. But I think we have an answer. A human is expected to give credit, and to not parrot back as his/her own work the complete copyrighted material. Selected quotes are OK, with the expectation that the human adds value/provides additional relevant content. But generative AI seems to violate both expectations. No credit, and no limits on what is extracted and presented back.
Re: (Score:2)
>> for any purpose
Well, not for *any* purpose.
"Fair use" allows people to use copyrighted works for purposes of criticism and commentary, news reporting, teaching, parody, and research, for example. I'm not qualified to determine whether facebook's use of copyrighted works here is actually fair use, but that is certainly what they are claiming.
Re:2 sets of laws: ones for the rich and (Score:4, Insightful)
"put it through a filter"
Yeah, mate, that's not how AI works.
Since the core of every generator is a detector, reverse the situation. Think of image recognition. You take a picture of your dog and feed it to an image detector. It highlights the dog in a bounding box and labels it "Dog [99.97%]".
Was it trained with that picture? Of course not.
Was it trained with any picture of your dog? Almost certainly not.
Rather, it knows what dogs are visually. It didn't memorize its training data; it used its training data to distill the essence of a dog - what sort of complex arrangement of high- and low-level features distinguishes dogs from other things.
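A minimal sketch of that idea, assuming torchvision's COCO-pretrained Faster R-CNN and a hypothetical photo "my_dog.jpg" that was never part of any training set; the 0.5 score cutoff is an arbitrary illustration:

import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights

# Load a detector pretrained on COCO (which almost certainly never saw your dog).
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

img = read_image("my_dog.jpg")           # hypothetical photo, uint8 CHW tensor
batch = [weights.transforms()(img)]      # model-specific preprocessing

with torch.no_grad():
    det = model(batch)[0]                # dict with "boxes", "labels", "scores"

categories = weights.meta["categories"]  # COCO class names, including "dog"
for box, label, score in zip(det["boxes"], det["labels"], det["scores"]):
    if score.item() > 0.5:               # arbitrary confidence cutoff
        print(f"{categories[label.item()]} [{score.item():.2%}] at {box.tolist()}")

It prints something like "dog [99.97%]" with a bounding box not because your photo is stored in the weights, but because the model generalized what dogs look like.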
That doesn't apply just to image detectors - it applies to *all* DNNs (and all NNs, period). DNNs have the ability to generalize from data, and it is this ability that is desired. That's not to say they don't also have the ability to memorize. But they're not doing that from seeing something once, with a learning rate of 1e-5. There are also fundamental limits to how much DNNs, like everything else in the world, can physically memorize. Generalization is vastly more space-efficient than memorization. AI performance improves with both longer training and larger training data sets because AI performance is measured by how well models generalize, and both of those things improve generalization.
If you train a 10GB video generation model on Youtube's 100 petabytes of compressed video data (perhaps 10 exabytes uncompressed), I don't know how to break it to you, but that 10GB model does not contain 10 exabytes of videos (a 1e9-to-1 compression ratio). That's just not happening. Those videos aren't in there - they're gone. What is in there is what people look like, what animals look like, how people move, how animals move, how people and animals interact..... on and on and on. The generalization of the latent space that is video.
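A quick back-of-envelope version of that size argument in Python (the figures are the commenter's hypotheticals, not real measurements):

# Hypothetical figures from the comment above, not measured values.
model_bytes  = 10 * 10**9    # a 10 GB video-generation model
source_bytes = 10 * 10**18   # ~10 exabytes of uncompressed source video
print(f"required compression ratio: {source_bytes / model_bytes:.0e} to 1")  # ~1e+09 to 1

No compression scheme, lossless or recognizably lossy, gets anywhere near a billion-to-one, which is the commenter's point: the training videos themselves cannot be sitting inside the model.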
Reverse does not hold (Score:1)
Rather, it knows what dogs are visually.
You cannot just simply reverse this argument as you are doing, when there are countless examples of AI outright reproducing copyrighted artworks.
For recognizers that's fine: all the material in the models they store on their servers, which you cannot access, represents an abstract and lossy encoding of many copyrighted works.
For image generation it is not fine to use obviously similar and marginally changed works from artists in images that you are transmitting.
Re: (Score:2)
Just to make myself clear: taking copyrighted work and putting it on the internet for any purpose without permission is a crime
Nope. Copyright violations are torts, not crimes.
Re: (Score:1)
Nope. Copyright violations are torts, not crimes.
Copyright infringers can be sued civilly and in some cases prosecuted criminally for the same infringing act.
The Swamp Exposed (Score:2)
That proven orangeNoser [youtu.be] should be locked up.
Go Zuck. Fuck 95-year copyright very much. (Score:1)
Even Zuck.
Pirate on until copyright is a fit-for-purpose 5 years.
Unfortunately this may be needed (Score:2)
I have no idea what to do about the legal and moral problems. But I would greatly prefer an AI whose knowledge is not limited to non-copyrighted work.
Abolish copyright (Score:2)
At this point, the only logical thing to do is to abolish copyright entirely. If corporations don't have to follow it, why should anybody else? If anything, AI has proven that the romantic idea of a scientific/engineering/artistic genius was just an illusion, most creative work is easily automated. So why should it get special protections? Culture has existed before copyright was a thing, and will exist afterwards. Without IP reform, humanity will end up being slaves to megacorps that can ignore it and then
Abolish Copyright = Give Money to MegaCorps (Score:2)
Re: (Score:2)
Re: (Score:2)
Most people who state "we need to completely abolish copyright" aren't thinking beyond the level of "I want to be able to freely download movies and music without any possible restrictions or repercussions".
Cue the standard ... (Score:1)
plaintiffs in Kadrey v. Meta, who include bestselling authors Sarah Silverman and Ta-Nehisi Coates
And if someone figures out how to remove their stuff from the dataset, nothing of value would be lost ...
Re: (Score:1)