The Courts Books

Lawsuit Says OpenAI Violated US Authors' Copyrights To Train AI Chatbot (reuters.com) 82

Two U.S. authors have filed a proposed class action lawsuit against OpenAI, claiming that the company infringed their copyrights by using their works without permission to train its generative AI system, ChatGPT. The plaintiffs, Massachusetts-based writers Paul Tremblay and Mona Awad, claim the data used to train ChatGPT included thousands of books, including those from illegal "shadow libraries." Reuters reports: The complaint estimated that OpenAI's training data incorporated over 300,000 books, including from illegal "shadow libraries" that offer copyrighted books without permission. Awad is known for novels including "13 Ways of Looking at a Fat Girl" and "Bunny." Tremblay's novels include "The Cabin at the End of the World," which was adapted in the M. Night Shyamalan film "Knock at the Cabin" released in February.

Tremblay and Awad said ChatGPT could generate "very accurate" summaries of their books, indicating that they appeared in its database. The lawsuit seeks an unspecified amount of money damages on behalf of a nationwide class of copyright owners whose works OpenAI allegedly misused.

This discussion has been archived. No new comments can be posted.

  • weak (Score:5, Interesting)

    by Nite_Hawk ( 1304 ) on Friday June 30, 2023 @06:54PM (#63647472) Homepage

    The AI is only a copyright infringement if it is considered a derivative work that is not transformative and/or covered under fair use. In this case, I suspect a reasonable argument could be made that ChatGPT itself is transformative, especially if the best argument they can make is that it can summarize books. If you could trick ChatGPT into reproducing the books outright, that might be a different story.

    • Re:weak (Score:4, Insightful)

      by Baron_Yam ( 643147 ) on Friday June 30, 2023 @07:00PM (#63647484)

      This whole thing about "the AI stole my work" is really annoying.

      The only valid test is this - "if a human consumed the media and then created similar art based on that consumption, would it be in violation of copyright law?"

      Q: Why is it illegal if I use an AI instead of a human brain?

      A: Because someone wants to get paid, and they don't want me breaking into their market.

      • Re: (Score:2, Insightful)

        If I take a book and photocopy it, then make a photocopy of that photocopy, and so on and so forth until the image is grainy and maybe every 20th word can't even be read, is it no longer a copy of that book that violates copyright? That seems to be the claim that AI is trying to make.
        • Re: (Score:3, Insightful)

          by Rockoon ( 1252108 )
          "The model doesn't store it in a way where it can recite the entire book..."

          ...only because it's inefficient to store it like that. The entire goal of the model training enterprise is getting it as close to reciting as possible within the constraints of the model.

          You use enough hidden nodes in the model when training and you do get reciting, aka "overfitting": ... Not sure that "the model fails to recite in spite of our efforts" is a defense.
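
A minimal sketch of the overfitting point above, under loud assumptions: it uses polynomial curve fitting as a stand-in for a neural network, and the data and function names are invented for illustration. A model with one parameter per training point reproduces ("recites") the training values exactly, while a two-parameter line cannot, yet the line does far better on a point it has not seen.

```python
# Toy illustration only: polynomial fitting stands in for network training.
# All data and names here are hypothetical.

def fit_line(xs, ys):
    """Least-squares line: a low-capacity model with 2 parameters."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial: one parameter per training point, so it
    passes through every training value exactly."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def max_abs_error(model, xs, ys):
    """Largest absolute prediction error over a set of points."""
    return max(abs(model(x) - y) for x, y in zip(xs, ys))


# Noisy samples of the underlying trend y = x.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.3, 0.7, 2.4, 2.6, 4.3, 4.8]
# One held-out point on the same trend.
test_x, test_y = [6.0], [6.0]

line = fit_line(train_x, train_y)
memorizer = fit_interpolant(train_x, train_y)

print("train error, line:      ", max_abs_error(line, train_x, train_y))
print("train error, memorizer: ", max_abs_error(memorizer, train_x, train_y))
print("test error, line:       ", max_abs_error(line, test_x, test_y))
print("test error, memorizer:  ", max_abs_error(memorizer, test_x, test_y))
```

The memorizer wins on the training points and loses badly on the held-out one, which is the usual sense in which overfitting counts as a failure of training rather than its goal.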
          • ...only because it's inefficient to store it like that. The entire goal of the model training enterprise is getting it as close to reciting as possible within the constraints of the model.

            The point of training ANNs is to apply knowledge, not reinvent the search index. If you did that, all you'd end up with is a super slow version of Google search that is equally as clueless. Model designers make a deliberate conscious choice not to minimize loss in order to trade remembering for understanding.

            • by narcc ( 412956 )

              Model designers make a deliberate conscious choice not to minimize loss in order to trade remembering for understanding.

              You're not going to get understanding from an ANN. That's not what they do. As for remembering, that's not really what they do either, not in the way we normally understand 'remembering'. I can explain.

              A feed-forward neural network has very little computational power. Imagine a simple network with one input, one output, and two hidden nodes. If you graph the input on one axis and the output on the other, you'll get a wave. How complicated a wave can this network produce? It turns out, not very. (You
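
A minimal sketch of the tiny network described above (one input, two hidden nodes, one output), for anyone who wants to plot the curve. The tanh activation and the hand-picked weights are assumptions, since the comment specifies neither, and the turning-point counter is a hypothetical helper.

```python
import math


def tiny_net(x, params):
    """1 input -> 2 tanh hidden nodes -> 1 linear output."""
    (w1, b1), (w2, b2), (v1, v2, c) = params
    h1 = math.tanh(w1 * x + b1)
    h2 = math.tanh(w2 * x + b2)
    return v1 * h1 + v2 * h2 + c


def count_turning_points(f, lo, hi, steps=1200):
    """Count how often the sampled curve switches between rising and falling."""
    ys = [f(lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    turns = 0
    prev_sign = 0
    for a, b in zip(ys, ys[1:]):
        diff = b - a
        sign = (diff > 0) - (diff < 0)
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                turns += 1
            prev_sign = sign
    return turns


# Hand-picked (hypothetical) weights: two opposed tanh steps form one bump.
bump_params = ((1.0, 2.0), (1.0, -2.0), (1.0, -1.0, 0.0))
# Two identical tanh steps added together: a single smooth ramp.
ramp_params = ((1.0, 0.0), (1.0, 0.0), (1.0, 1.0, 0.0))

bump_turns = count_turning_points(lambda x: tiny_net(x, bump_params), -6.0, 6.0)
ramp_turns = count_turning_points(lambda x: tiny_net(x, ramp_params), -6.0, 6.0)
print("turning points, bump:", bump_turns)
print("turning points, ramp:", ramp_turns)
```

With these particular weights the output is either a single bump or a single ramp; trying other weights is a quick way to see how limited the family of curves is for a network this small.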

              • You're not going to get understanding from an ANN. That's not what they do. As for remembering, that's not really what they do either, not in the way we normally understand 'remembering'. I can explain.

                What if you drop the A? Does the curve fitting no longer describe what networks of neurons actually do?

                This is why 'generalizing' isn't synonymous with 'understanding' and why 'remembering' isn't really an appropriate term. It's just a function. What the numbers represent to us doesn't matter in the slightest.

                Can you suggest an objective means to test whether or not a black box "understands" something?

                Or does this really just boil down to definitions anchored in mysticism where only organically grown brains can "remember" or "understand" anything at all?

                • by narcc ( 412956 )

                  What if you drop the A? Does the curve fitting no longer describe what networks of neurons actually do?

                  While NNs were inspired by a primitive understanding of the brain, they are nothing alike.

                  Can you suggest an objective means to test whether or not a black box "understands" something?

                  No, I can't. In the case of our NN, however, it seems very obvious that nothing even remotely like understanding is possible. The system has no access to the referent. Also ...

                  Or does this really just boil down to definitions anchored in mysticism

                  The mysticism comes from people who either want the technology to be far more than it actually is, or for humans to be far less. (I have no idea why this is so important to some people.) I don't think any further discussion is meaningfu

            • LMs don't "understand" anything. Their developers & AI researchers are very, very emphatic about that.
          • by dynamo ( 6127 )

            Overfitting is an error, not a goal. And using lots and lots of different works is one common way to protect against it. It works similarly to how it would with a person - if you were to grow up only being able to ever read four poems, you'd easily recite them from memory, but if (assuming it were possible for a person) you read hundreds of poems and essays and books and articles every day, you wouldn't be able to remember any of them very well, but you'd have a very good idea of how language works and what

            • Pretty good summary :) I think the word you're reaching for is 'genre': https://en.wikipedia.org/wiki/... [wikipedia.org]

              The main difference with LMs is that they have to rely on humans to do the correcting for them, i.e. they don't understand anything that they're processing (& won't for the foreseeable future) & so if humans don't correct them the models diverge from what humans expect & prefer & become unintelligible &/or incomprehensible to us. Here's an article about how that worked for ChatGPT
        • by dfghjk ( 711126 )

          No it's not. AI makes no copy at all, just like you do not "make a copy" if you memorize a book, nor is memorizing a book a violation of copyright.

          "That seems to be the claim that AI is trying to make."

          AI isn't making a claim at all; two people are making a claim against a company. As usual, all you offer is shit-takes.

          • by vivian ( 156520 )

            Me: please print page 5 of the first edition of gullivers travels
            ChatGPT:
            I apologize for the confusion, but as an AI language model, I don't have direct access to physical books or the ability to print pages. I can provide you with information or answer questions about "Gulliver's Travels" to the best of my knowledge. Is there something specific you would like to know about the book?

            Judge: No copy of the book can be extracted - copyright hasn't been infringed. Case closed.

        • Re:weak (Score:4, Informative)

          by mysidia ( 191772 ) on Friday June 30, 2023 @08:33PM (#63647586)

          so forth until the image is grainy and maybe every 20th word can't even be read, is it no longer a copy of that book

          That would be 95% of the words still legible. This would fail similarity analysis, which weighs strongly towards it being a Copy, because the 20th word is obscured solely to introduce a deliberate dissimilarity --- changes to the content deliberately performed while making a copy to introduce a dissimilarity don't prevent something from being a copy. You would not have created a new work, so if your use wouldn't qualify as fair use for the clean photocopy -- using the grainy photocopy for the same thing is not likely to be fair use, either.

          That seems to be the claim that AI is trying to make.

          No. AI training doesn't do anything like that. Once they are done, the product of training is ultimately a list of nodes and weights; the trained model is minuscule in size compared to the training data sets and doesn't contain the training data, at least not in any perceptible way.

          A book analogy would be extracting the list of all the words from a chapter of a book, and then creating a printed list of words and word pairs, each followed by a count of how many times it appears.

          Of course the training procedures for language models are much more complex, but they aren't anything like a photocopy.
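
A minimal sketch of the book analogy above: counting words and word pairs in a piece of text. The sample sentence and the `ngram_counts` helper are made up for illustration, and this is only the simplified counting scheme from the analogy, not how a large language model is actually trained.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words in the text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


# Hypothetical stand-in for "a chapter of a book".
chapter = "the cat sat on the mat and the cat slept"

word_counts = ngram_counts(chapter, 1)  # single words
pair_counts = ngram_counts(chapter, 2)  # word pairs

print(word_counts.most_common(2))
print(pair_counts.most_common(1))
```

The resulting tables hold counts rather than the running text. For small n they do not, in general, pin down the original word order, though larger n keeps progressively more of it.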

          • by dynamo ( 6127 )

            It's important to add in your simplified explanation that you'd include the frequency of the word pairs also. In ngrams, a primitive form of language prediction model, that would be all you need to store for a value of n=2. But for it to be useful, you'd want a value more like 4 or 5 at least. A value of n=3 would mean in addition to storing the frequency of word pairs, you'd store the frequency of word triplets (three words in sequence). n=4 is four words in sequence and so on. The size of the data store g

            • by narcc ( 412956 )

              for a very simple algorithm that isn't even AI.

              n-gram models are absolutely, without question, 'AI'. What on earth would make you think they weren't?

          • So the photocopier is outputting a list of pixels with different weights. Still say it's the same thing.
        • by jvkjvk ( 102057 )

          No, your analogy is flawed. It is reading a book, then remembering stuff about it. That's it. There is no copyright violation because there were no copies. Where does ChatGPT have a copy of their books? Nowhere.

      • Re: (Score:2, Informative)

        by TheDarkener ( 198348 )

        Humans don't consume media the same way as training AI LLMs. Please stop equating AI (Artificial Intelligence) with HI (Human Intelligence), people.

        • Really? AI ain't intelligent, but otherwise the process seems to be very very similar.

          The primary difference at this point is that AI doesn't know what it's doing and can't apply broader experience to filter the inputs and adjust weights on nodes of the model being built. It has no 'common sense', no larger model of the world to use for sanity checks on whatever model it's building.

        • You don't know that. Please stop pretending there aren't very real similarities.
        • Humans don't consume media the same way as training AI LLMs. Please stop equating AI (Artificial Intelligence) with HI (Human Intelligence), people.

          The truth is closer to "we are more alike than unlike my dear captain"

          "We have shown that transformers with recurrent positional encodings reproduce neural representations found in rodent entorhinal cortex and hippocampus. We then showed these transformers are close mathematical cousins to models of hippocampus that neuroscientists have developed over the last few years. "

          https://openreview.net/pdf?id=... [openreview.net]

          https://www.pnas.org/doi/10.10... [pnas.org]

      • There's an interesting aside to this. We don't grant copyright to anything other than humans, i.e. legally, only a person can create an original work. If a human assimilates a load of books & then writes ones that are very similar, that can be considered original work & copyright-able. But if a machine does the same & *generates* books, what does that count as? It's not a person, it's not even a sentient being, just an automated human-like text generator.

        BTW, I'm all for generative AI being
      • Q: Why is it illegal if I use an AI instead of a human brain?

        You can't copyright a concept, you can only copyright the particular expression of a concept. Since ChatGPT has no concepts, the only thing it does is create expressions, that is why it is illegal if ChatGPT does it but not a human. The human is basing the summary on concepts, he/she did not memorize the whole thing like ChatGPT did.

        • You can't copyright a concept, you can only copyright the particular expression of a concept. Since ChatGPT has no concepts, the only thing it does is create expressions, that is why it is illegal if ChatGPT does it but not a human.

          The whole point of training an LLM is acquisition of conceptual knowledge / generalization / understanding.

          Convenient links to Wikipedia pages describing relevant words:

          https://en.wikipedia.org/wiki/... [wikipedia.org]
          https://en.wikipedia.org/wiki/... [wikipedia.org]
          https://en.wikipedia.org/wiki/... [wikipedia.org]
          https://en.wikipedia.org/wiki/... [wikipedia.org]

          The human is basing the summary on concepts, he/she did not memorize the whole thing like ChatGPT did.

          Not only does ChatGPT do no such thing, this is something model developers explicitly seek to avoid.

          Here is an example chat with a 13B parameter LLM far less capable than GPT-3.

          Q. "A ropform is a fruit that grows

    • by dfghjk ( 711126 )

      " I suspect a reasonable argument could be made that ChatGPT itself is transformative, especially if the best argument they can make is that it can summarize books."

      If you read a book and could summarize it, that would make you a transformative work? No, it MAY mean a summary is a derivative work, yet where's the summary and who produced it?

      A copyright holder does not have a claim on your knowledge; how does it have a claim on AI knowledge?

      • by mysidia ( 191772 )

        yet where's the summary and who produced it?

        The user can ask the system to create a summary. They are under a pretty strict User Agreement when they ask ChatGPT to do something --- an agreement which includes a sweeping indemnity clause essentially requiring the end user to assume all liability for anything that happens.

        "You will defend, indemnify, and hold harmless us... from and against any claims, losses, and expenses (including attorneys' fees) arising from or relating to your use of the Services,

        • by GrahamJ ( 241784 )

          EULAs aren't usually legally binding but either way a ruling against ChatGPT in this matter would be far wider ranging than the one user. I'm sure many authors/groups would be happy to pay that to set a precedent.

    • Copyright is about publication rights. Literally the right to publish copies.

      Training an LLM is not publishing copies of a work.

      Using copyrighted works in training an LLM or generative AI is transformative, as the created object (the LLM/AI) is sufficiently different from the original work as to be unrecognizable as the original.

      Creating derivative works is subtler.

      If I use a photograph of your painting in my book about your paintings, I am creating a derivative work, and should seek permission from yo

    • by ashpool7 ( 18172 )
      I also would think the "transformative work" [cornell.edu] angle is the only one that would work here.
    • First of all, that is not how "AI" works. The way it generates its output is by piecing together things it was fed. The encoding isn't particularly obvious, but it's not "understanding" or "inspiration" that creates something new. It's a collage at best.

      Secondly, if that line of argument succeeds, it will be the end of open access to information. Not diminish it, end it. You won't be allowed to read a headline without first paying for it and signing an NDA that, among other things, will specifically forb
    • The AI is only a copyright infringement if it is considered a derivative work that is not transformative and/or covered under fair use.

      "Transformative" is one of the factors considered under fair use. You can't just say "this is transformative therefore it's fair use," you need to consider all the factors.

    • by stikves ( 127823 )

      This is not too much different than going to a library and studying books.

      At least it should not be in theory. Both machine learning and neuroscience focus on how we humans actually learn stuff. And they drive each other forward: https://www.nature.com/article... [nature.com]. We get to test theories on cognition and human memory by trying them out on artificial models.

      Which means, if you cannot ban reading, you won't be able to ban training either.

      (Again, many of us can memorize entire poem, saying, or some people even

    • I suspect a reasonable argument could be made that ChatGPT itself is transformativ

      I mean, hell, the "T" in "GPT" stands for "Transformer".

  • by evanh ( 627108 ) on Friday June 30, 2023 @07:23PM (#63647522)

    Given the material is from public sources and ChatGPT is for public use, then OpenAI should publicly itemise what training material they use.

  • how do they know it didn't learn how to summarize the books by just ingesting other people's summaries
    • by mysidia ( 191772 )

      That's very likely what happened too. Although there is a chance the text of some books might be online and have gotten scraped, there are presumably more summaries than copies of the book indexed on the public web (and visible to search engines, etc) if the authors have been enforcing their copyrights.

      I somehow doubt OpenAI has been scanning books and targeting dark web data.

  • by bruce_the_moose ( 621423 ) on Friday June 30, 2023 @07:49PM (#63647546)

    What will ultimately defeat our AI overlords will be the intractable copyright regime enforced by a vicious legal system.

    • by GrahamJ ( 241784 )

      That's the plot of the next Terminator flick: a Connor goes back in time and becomes a successful copyright attorney.

  • This is the computer equivalent of reading. And if it can be illegal for a silicon computer it can be illegal for a biological computer.

  • It's free now, but hasn't it collected its knowledge while openly avoiding the ads on ad-supported sites, which I think it may itself carry in the future? It's collecting data from its usage, to be paid for somehow in the future, probably through ads? Just throwing that out there.
  • Generating a summary doesn't mean ChatGPT read the book, it just means that ChatGPT saw enough summaries and reviews of the book to be able to generate yet another summary.

    In fact I'm pretty sure ChatGPT would be unable to generate a summary of a book it has read, because any book is larger than its context window. It cannot remember the beginning of a book by the time it reads the end. To test whether it's familiar with the books, you could check whether it does style replication:

    [Me] Can you help me co

  • AI is too important to humanity. If you're scared of AI .. do you think China, Saudi Arabia, Russia, Iran, North Korea, India etc. are going to freeze AI development because we're too afraid to develop it? Either you start learning how to say "yes, massah" in Russian or figure out how the F to develop and use AI.

    • by narcc ( 412956 )

      You're confusing fantasy with reality again.

    • AI is too important to humanity.

      Really? What has generative AI, chatbots and image gen, achieved, or what will it achieve? I'm not talking about domain-specific AI such as aeronautic or genetic research, which certainly doesn't scrape copyrighted data from the web. LLMs generate low quality content for spam; they are at best a new kind of search engine and a natural scripting language, but that won't change humanity. Try to answer without mentioning the singularity fantasy.

      • First off, assuming your premise were true (which it isn't), progress in one aspect of AI drives investment in it and that will improve domain specific AI too. For example, the need for better gaming PCs funded development of advanced GPU technology used in research, and GPUs have since become cheaper and have allowed faster supercomputers to be built on a lower budget. And btw, chatbots and image gen can be useful for everything you use Google for. Instead of clicking through links you will get the info explained to

  • Authors and lawyers who know nothing about copyright law or generative AI, nor are they willing to do any basic research. The "evidence" they have is weak at best. I'd question the intelligence of any judge that even agrees to hear these morons' arguments.

  • The gist of their complaint is that the AI read their books so it could learn how words are put together to make a coherent whole. Frankly, everyone who's written anything has done the same, starting in childhood and continuing throughout life. So, what's the fundamental difference between the structured training an AI undergoes vs the ad hoc training a human undergoes to learn how to write?

    • by GrahamJ ( 241784 )

      The human part. If a human looks at a drawing and recreates the drawing it's fine. If a computer does it it's a copy.

      I'm not siding with the authors here - I don't think this should be an issue - but that would seem the fundamental difference.

