AI Facebook Piracy The Courts

Lawsuit Accuses Meta Of Training AI On Torrented 82TB Dataset Of Pirated Books (hothardware.com) 47

"Meta is involved in a class action lawsuit alleging copyright infringement, a claim the company disputes..." writes the tech news site Hot Hardware.

But the site adds that newly unsealed court documents "reveal that Meta allegedly used a minimum of 81.7TB of illegally torrented data sourced from shadow libraries to train its AI models." Internal emails further show that Meta employees expressed concerns about this practice. Some employees voiced strong ethical objections, with one noting that using content from sites like LibGen, known for distributing copyrighted material, would be unethical. A research engineer with Meta, Nikolay Bashlykov, also noted that "torrenting from a corporate laptop doesn't feel right," highlighting his discomfort surrounding the practice.

Additionally, the documents suggest that these concerns, including discussions about using data from LibGen, reached CEO Mark Zuckerberg, who may have ultimately approved the activity. Furthermore, the documents showed that despite these misgivings, employees discussed using VPNs to mask Meta's IP address to create anonymity, enabling them to download and share torrented data without it being easily traced back to the company's network.

This discussion has been archived. No new comments can be posted.

  • by Finallyjoined!!! ( 1158431 ) on Sunday February 16, 2025 @12:42PM (#65170849)

    Once a crook, always a crook.

  • by TheMiddleRoad ( 1153113 ) on Sunday February 16, 2025 @12:48PM (#65170863)

    Most if not all the models are being trained on stolen data. This is not new. It's just that Meta is so incompetent as to leave a paper trail. Those China models people talk about so much? Trained on this set and more. Then people get "clever" and train one LLM removed. They take a set trained on pirated data and use it to train a non-pirated set. Then the indirectly-trained LLM is used. See! Clean!

    It's a constant train robbery, and we're all getting fucked every which way.

    • Hey at least Meta didn't murder anyone to cover the trail :p

    • But we're also responsible. Everyone using AI for their own benefit is responsible. I take an AI-free approach and don't use AI for that reason. If you use AI or play with it, then you are part of the problem.

    • by ihavesaxwithcollies ( 10441708 ) on Sunday February 16, 2025 @02:51PM (#65171157)

      Most if not all the models are being trained on stolen data. This is not new. It's just that Meta is so incompetent as to leave a paper trail. Those China models people talk about so much? Trained on this set and more. Then people get "clever" and train one LLM removed. They take a set trained on pirated data and use it to train a non-pirated set. Then the indirectly-trained LLM is used. See! Clean!

      It's a constant train robbery, and we're all getting fucked every which way.

      I think that's called data laundering.

    • Perhaps data should be free? Everyone benefits.
      • by allo ( 1728082 )

        Perhaps AI will finally spark the debate we need to have about modernising copyright. Sad that it takes greedy corporations to get things moving.

        • by HiThere ( 15173 )

          ??? The copyright laws as they exist were caused by greedy, rent-seeking, companies. Don't expect a battle over details by companies to improve things. Copyrights should not last for more than 15 years, with one allowed renewal...if you want to pay a substantial fee.

          • by micheas ( 231635 )

            ??? The copyright laws as they exist were caused by greedy, rent-seeking, companies. Don't expect a battle over details by companies to improve things. Copyrights should not last for more than 15 years, with one allowed renewal...if you want to pay a substantial fee.

            And only applied to books, maps, and other things that were expensive to produce. Newspapers were originally not copyrightable.

            I can see the argument that limiting copyright to anything that costs over $100k to produce would be in keeping with the spirit of the original copyright law.

    • No one is getting robbed. AI is a transformative work, and copyright shouldn't apply. Not that 90% of that shouldn't have been in the public domain anyway.
      • by jvkjvk ( 102057 )

        Sorry but copying one database over to another database is not a "transformative work". It's simply infringing on copyright.

    • THIS IS NOT A DRILL. Backup Llama and other LLMs while you still can. THIS IS NOT A DRILL.

  • by Rei ( 128717 ) on Sunday February 16, 2025 @12:53PM (#65170873) Homepage

    The thread in question was titled "Legal Escalations", because the decision process about what they could train on heavily involved Meta's lawyers.

    "torrenting from a corporate laptop doesn't feel right" wasn't expressed as "strong ethical objection" - it was followed by a laughing face emoji.

    This is not a new case. Kadrey v. Meta Platforms, Inc [courtlistener.com] was filed nearly a year ago. Numerous documents, including these accusations, and responses from Meta (such as this [courtlistener.com], this [courtlistener.com], and many more) are not "news" in any way.

  • Short-Lived Suit (Score:3, Interesting)

    by organgtool ( 966989 ) on Sunday February 16, 2025 @01:22PM (#65170927)
    Now that Zuck has kissed Trump's ring, I wonder how long it will be until this lawsuit disappears.
  • by BrendaEM ( 871664 ) on Sunday February 16, 2025 @02:21PM (#65171067) Homepage
    I didn't steal your life's work, the AI did.
  • by Anonymous Coward
    Did they seed after downloading? That would be the major crime!
    • Did they NOT seed after downloading? That would be the major crime.

      And 95-year copyright is the even bigger crime, that deprives us all of the benefits of the creativity copyright was meant to support.
  • The first time Meta does some good for a change and this is how you react?

  • It's amusing to see the outrage about AI being trained on pirated books, while that same outrage conveniently disappears when one discusses piracy in general.

    Based on the comments I've seen on Slashdot over the years, copyright is evil and shouldn't even exist, or else intellectual property is a complete scam, or copyright should (at most) be granted for 5 years. So what's the big deal if companies train AI on that pirated content that everyone lauds?

    AIs want data to be free, too.

    • by HiThere ( 15173 )

      It's one thing to make it publicly available, it's another to resell a concealed version. I know which I consider a worse crime.

      P.S.: Copyright should last about 15 years with one allowed (expensive) renewal.

    • by jvkjvk ( 102057 )

      Yeah, for individuals to use. For personal use.

      That's "piracy in general". Not to screw over artists by creating COMMERCIAL works based on their hard work and PROFITING off of it.

      I hope you can understand the difference, but probably not.

  • Additionally, the documents suggest that these concerns, including discussions about using data from LibGen, reached CEO Mark Zuckerberg, who may have ultimately approved the activity.

    The document "suggests" that Zuckerberg "may have" approved of this? Really?

  • Copyright is all about copying and reselling (at any price) material. I don't understand what the big deal is, as Meta is not reselling the material. It is digesting it and using it, which is the same as any person would do when reading a book. As far as I can see, there is no copyright infringement, as they aren't reselling it.

  • In China, there is one collection in their digital libraries of all the world's books (with Chinese translations.) Everyone has access to it.

    In the West, you can be sued for reading a book.

    If we had changed our copyright laws, we could have prevailed.
