Lawsuit Says OpenAI Violated US Authors' Copyrights To Train AI Chatbot (reuters.com)
Two U.S. authors have filed a proposed class action lawsuit against OpenAI, claiming that the company infringed their copyrights by using their works without permission to train its generative AI system, ChatGPT. The plaintiffs, Massachusetts-based writers Paul Tremblay and Mona Awad, claim the data used to train ChatGPT included thousands of books, including those from illegal "shadow libraries." Reuters reports: The complaint estimated that OpenAI's training data incorporated over 300,000 books, including from illegal "shadow libraries" that offer copyrighted books without permission. Awad is known for novels including "13 Ways of Looking at a Fat Girl" and "Bunny." Tremblay's novels include "The Cabin at the End of the World," which was adapted into the M. Night Shyamalan film "Knock at the Cabin," released in February.
Tremblay and Awad said ChatGPT could generate "very accurate" summaries of their books, indicating that they appeared in its database. The lawsuit seeks an unspecified amount of money damages on behalf of a nationwide class of copyright owners whose works OpenAI allegedly misused.
weak (Score:5, Interesting)
The AI is a copyright infringement only if it is considered a derivative work that is neither transformative nor covered under fair use. In this case, I suspect a reasonable argument could be made that ChatGPT itself is transformative, especially if the best argument they can make is that it can summarize books. If you could trick ChatGPT into reproducing the books outright, that might be a different story.
Re:weak (Score:4, Insightful)
This whole thing about "the AI stole my work" is really annoying.
The only valid test is this - "if a human consumed the media and then created similar art based on that consumption, would it be in violation of copyright law?"
Q: Why is it illegal if I use an AI instead of a human brain?
A: Because someone wants to get paid, and they don't want me breaking into their market
Re: (Score:2, Insightful)
Re: (Score:3, Insightful)
Use enough hidden nodes in the model when training and you do get reciting, aka "overfitting".
Re: (Score:2)
...only because it's inefficient to store it like that. The entire goal of the model training enterprise is getting it as close to reciting as possible within the constraints of the model.
The point of training ANNs is to apply knowledge, not reinvent the search index. If you did that, all you'd end up with is a super slow version of Google search that is just as clueless. Model designers make a deliberate, conscious choice not to minimize loss, in order to trade remembering for understanding.
Re: (Score:2)
Model designers make a deliberate conscious choice not to minimize loss in order to trade remembering for understanding.
You're not going to get understanding from an ANN. That's not what they do. As for remembering, that's not really what they do either, not in the way we normally understand 'remembering'. I can explain.
A feed-forward neural network has very little computational power. Imagine a simple network with one input, one output, and two hidden nodes. If you graph the input on one axis and the output on the other, you'll get a wave. How complicated a wave can this network produce? It turns out, not very. (You
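The parent's toy network can be sketched in a few lines. Note this is a minimal illustration only: the weights and biases below are made-up values, not from any trained model, chosen just to show that a 1-input / 2-hidden-node / 1-output net computes a simple, smooth curve.

```python
import math

# Hypothetical 1-2-1 feed-forward net with tanh hidden units.
# All weights/biases are arbitrary illustrative values.
w_h = [1.5, -2.0]   # input -> hidden weights
b_h = [0.0, 1.0]    # hidden biases
w_o = [0.8, -0.6]   # hidden -> output weights
b_o = 0.1           # output bias

def net(x):
    # Each hidden node contributes one squashed "bump" to the output.
    hidden = [math.tanh(w * x + b) for w, b in zip(w_h, b_h)]
    return sum(w * h for w, h in zip(w_o, hidden)) + b_o

# Sampling the input axis shows the output is bounded and has at most
# a couple of bends -- the limited expressive power the parent describes.
samples = [net(x / 10) for x in range(-30, 31)]
```

With only two hidden units the output can never be more complicated than a sum of two tanh curves, which is the point: expressiveness scales with hidden-node count.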
Re: (Score:2)
You're not going to get understanding from a ANN. That's not what they do. As for remembering, that's not really what they do either, not in the way we normally understand 'remembering'. I can explain.
What if you drop the A? Does the curve fitting no longer describe what networks of neurons actually do?
This is why 'generalizing' isn't synonymous with 'understanding' and why 'remembering' isn't really an appropriate term. It's just a function. What the numbers represent to us doesn't matter in the slightest.
Can you suggest an objective means to test whether or not a black box "understands" something?
Or does this really just boil down to definitions anchored in mysticism where only organically grown brains can "remember" or "understand" anything at all?
Re: (Score:2)
What if you drop the A? Does the curve fitting no longer describe what networks of neurons actually do?
While NNs were inspired by a primitive understanding of the brain, they are nothing alike.
Can you suggest an objective means to test whether or not a black box "understands" something?
No, I can't. Though in the case of our NN, however, it seems very obvious that nothing even remotely like understanding is possible. The system has no access to the referent. Also ...
Or does this really just boil down to definitions anchored in mysticism
The mysticism comes from people who either want the technology to be far more than it actually is, or for humans to be far less. (I have no idea why this is so important to some people.) I don't think any further discussion is meaningful.
Re: (Score:2)
Re: (Score:2)
Overfitting is an error, not a goal. And using lots and lots of different works is one common way to protect against it. It works similarly to how it would with a person - if you were to grow up only being able to ever read four poems, you'd easily recite them from memory, but if (assuming it were possible for a person) you read hundreds of poems and essays and books and articles every day, you wouldn't be able to remember any of them very well, but you'd have a very good idea of how language works and what
Re: (Score:2)
The main difference with LMs is that they have to rely on humans to do the correcting for them, i.e. they don't understand anything that they're processing (& won't for the foreseeable future) & so if humans don't correct them the models diverge from what humans expect & prefer & become unintelligible &/or incomprehensible to us. Here's an article about how that worked for ChatGPT
Re: (Score:3)
No it's not. AI makes no copy at all, just like you do not "make a copy" if you memorize a book, nor is memorizing a book a violation of copyright.
"That seems to be the claim that AI is trying to make."
AI isn't making a claim at all; two people are making a claim against a company. As usual, all you offer is shit-takes.
Re: (Score:2)
Re: (Score:2)
Which is stupid. It isn't a violation of *copy*right law to read a book that is copyrighted.
Re: (Score:2)
Re: (Score:1)
Copyright violation would occur when the AI outputs a very close copy of the book (there's legalese for it, but it's something like 99% identical). Not when the AI holds a "memorized" version of the book, which is not a copy but has a lot of facts about the book.
If you read and memorize a book- no violation.
If you write parts of it down for personal use - no violation.
If you make copies available to others- you violate copyright. You don't have the right to copy.
And no... just reading it doesn't violate copyright.
Re: (Score:1)
Me: please print page 5 of the first edition of gullivers travels
ChatGPT:
I apologize for the confusion, but as an AI language model, I don't have direct access to physical books or the ability to print pages. I can provide you with information or answer questions about "Gulliver's Travels" to the best of my knowledge. Is there something specific you would like to know about the book?
Judge: No copy of the book can be extracted - copyright hasn't been infringed. Case closed.
Re:weak (Score:4, Informative)
so forth until the image is grainy and maybe every 20th word can't even be read, is it no longer a copy of that book
That would be 95% of the words still legible. This would fail similarity analysis, which weighs strongly towards it being a copy, because the 20th word is obscured solely to introduce a deliberate dissimilarity --- changes deliberately made to the content while copying don't prevent something from being a copy. You would not have created a new work, so if your use wouldn't qualify as fair use for the clean photocopy, using the grainy photocopy for the same thing is not likely to be fair use either.
That seems to be the claim that AI is trying to make.
No.. AI training doesn't do anything like that. Once they are done, the product of training is ultimately a list of nodes and weights -- the trained model is minuscule in size compared to the training data set and doesn't contain the training data, at least not in any perceptible way.
A book analogy would be extracting the list of all the words from a chapter of a book, then creating a printed list of words and word pairs followed by a count of how many times each appears.
Of course the training procedures for language models are much more complex, but they aren't anything like a photocopy.
Re: (Score:2)
It's important to add to your simplified explanation that you'd include the frequency of the word pairs also. In an n-gram model, a primitive form of language prediction, that would be all you need to store for a value of n=2. But for it to be useful, you'd want a value more like 4 or 5 at least. A value of n=3 would mean that in addition to storing the frequency of word pairs, you'd store the frequency of word triplets (three words in sequence); n=4 is four words in sequence, and so on. The size of the data store g
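As a toy illustration of the kind of data store being described (a hypothetical n-gram counter, nothing like what an actual LLM stores, and the sample sentence is made up):

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every n-word sequence in the text -- the frequency
    table an n-gram model stores. Its size grows rapidly with n."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

text = "the cat sat on the mat and the cat slept"
pairs = ngram_counts(text, 2)    # n=2: word pairs with frequencies
triples = ngram_counts(text, 3)  # n=3: word triplets with frequencies

# ("the", "cat") appears twice; most other pairs appear once.
```

Prediction with such a model is just a lookup: given the last n-1 words, pick the most frequent continuation. That's why it behaves nothing like a stored copy of the source text.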
Re: (Score:2)
for a very simple algorithm that isn't even AI.
n-gram models are absolutely, without question, 'AI'. What on earth would make you think they weren't?
Re: (Score:2)
Re: (Score:2)
No, your analogy is flawed. It is reading a book, then remembering stuff about it. That's it. There is no copyright violation because there were no copies. Where does ChatGPT have a copy of their books? Nowhere.
Re: (Score:2)
Re:humans can't make digital copies (Score:4, Insightful)
That's how copying a CD works; that's not how training an AI works.
Re: (Score:2)
> Humans take in digital information but don't have digital storage. Humans can't store a digital copy in their brain that allows digital reproduction.
So an analog computer would be just fine in your view? Those existed, you know.
And I really wonder what the difference is in storage when I can recite 09F9-1102-9D74-E35B-D841-56C5-6356-88C0 as well as any computer program. Sure seems like digital data however my brain might be storing it.
Re: (Score:1)
Re: (Score:2)
It's also possible that the training data didn't include anything about the book. Not the text, not a synopsis, not even a review. Programs like ChatGPT will readily produce something that looks like a summary for anything, real or imaginary.
Re: (Score:2)
Re: (Score:2)
Lacking access to the summaries, we can't determine if they are accurate or not. However, we can try it for ourselves. We won't get the same thing, due to how ChatGPT works, but it's better than nothing. The summaries I got were very light on details and read like advertisements. The Cabin at the End of the World, for example, got the names of the author's main protagonists right, but missed all of the antagonists as well as the plot. The parent's example of "cheating on a book report" seems to fit very we
Re: (Score:2)
Re: (Score:2)
They "fucking authors" have a "fucking lawsuit" that hinges on those summaries being accurate. They aren't exactly the most trustworthy source here.
Try it for yourself. What I got sure as hell didn't look like an accurate summary.
Now go away, creepy stalker.
Re: humans can't make digital copies (Score:2)
Re: (Score:2)
Like I said, try it for yourself. If you still want to claim that "summary" is "highly accurate", I can't help you.
Also, your weird obsession with me is a little disturbing. Seek help.
Re: (Score:2)
Just for fun, a ChatGPT produced summary of a book that does not exist:
Me: write a summary of the book "Mary's Summertime"
ChatGPT: "Mary's Summertime" is a heartwarming coming-of-age story written by local author Sarah Thompson. The novel follows the life of 12-year-old Mary as she navigates through the transitions and challenges of a memorable summer.
Set in the idyllic town of Willow Creek, Mary's summer vacation begins with a sense of longing and uncertainty as her best friend, Emma, leaves for a family trip overseas. Feeling lonely and out of place without Emma, Mary embarks on a journey of self-discovery and friendship.
As the days pass, Mary stumbles upon a hidden community garden tucked away behind an old bookstore. She is captivated by the lush greenery and vibrant flowers, and soon befriends the eccentric garden caretaker, Mrs. Greene. Mrs. Greene becomes a mentor and teacher who introduces Mary to the magic of gardening and helps her develop a newfound sense of purpose and confidence.
Throughout the summer, Mary encounters an array of colorful characters within the garden community, each with their unique stories and life lessons to share. From an elderly couple who reminisce about their youth to a troubled teenager searching for a sense of belonging, these encounters teach Mary the power of connections and the importance of empathy.
As the summer unfolds, Mary's relationships with her family and friends evolve, and she learns to navigate not only the joys and challenges of growing up but also the importance of cherishing the simple moments that make life worthwhile. By summer's end, Mary discovers that she can adapt to change and that even in the absence of her best friend, she can create beautiful memories and find her own place in the world.
"Mary's Summertime" serves as a gentle reminder of the transformative power of nature, the value of human connections, and the resilience of the human spirit. It is a captivating and enchanting tale that will leave readers feeling inspired and nostalgic for their own summer adventures.
Re: (Score:1)
Re: (Score:2)
>How do you know it is nothing like a human?
Because humans aren't a collection of matrix operations?
Re: (Score:2)
How do you know it is nothing like a human?
Because we know how these kinds of programs work and we can identify significant deficiencies that humans do not share.
On an abstract level it's taking in information, processing it, remembering it, making decisions based on it then taking action
That is laughably inaccurate.
Re: (Score:2, Informative)
Humans don't consume media the same way as training AI LLMs. Please stop equating AI (Artificial Intelligence) with HI (Human Intelligence), people.
Re: (Score:2)
Really? AI ain't intelligent, but otherwise the process seems to be very very similar.
The primary difference at this point is that AI doesn't know what it's doing and can't apply broader experience to filter the inputs and adjust weights on nodes of the model being built. It has no 'common sense', no larger model of the world to use for sanity checks on whatever model it's building.
Re: (Score:1)
Re: (Score:2)
You don't know that. Please stop pretending there are notable similarities.
Re: (Score:2)
Re: (Score:2)
Humans don't consume media the same way as training AI LLMs. Please stop equating AI (Artificial Intelligence) with HI (Human Intelligence), people.
The truth is closer to "we are more alike than unlike my dear captain"
"We have shown that transformers with recurrent positional encodings reproduce neural representations found in rodent entorhinal cortex and hippocampus. We then showed these transformers are close mathematical cousins to models of hippocampus that neuroscientists have developed over the last few years. "
https://openreview.net/pdf?id=... [openreview.net]
https://www.pnas.org/doi/10.10... [pnas.org]
Re: (Score:2)
BTW, I'm all for generative AI being
Re: (Score:2)
Q: Why is it illegal if I use an AI instead of a human brain?
You can't copyright a concept; you can only copyright a particular expression of a concept. Since ChatGPT has no concepts, the only thing it does is create expressions. That is why it is illegal if ChatGPT does it but not if a human does: the human is basing the summary on concepts, and did not memorize the whole thing like ChatGPT did.
Re: (Score:2)
You can't copyright a concept, you can only copyright the particular expression of a concept. Since ChatGPT has no concepts, the only thing it does is create expressions, that is why it is illegal if ChatGPT does it but not a human.
The whole point of training an LLM is acquisition of conceptual knowledge / generalization / understanding.
Convenient links to Wikipedia pages describing relevant words:
https://en.wikipedia.org/wiki/... [wikipedia.org]
https://en.wikipedia.org/wiki/... [wikipedia.org]
https://en.wikipedia.org/wiki/... [wikipedia.org]
https://en.wikipedia.org/wiki/... [wikipedia.org]
The human is basing the summary on concepts, he/she did not memorize the whole thing like ChatGPT did.
Not only does ChatGPT do no such thing, it is something model developers explicitly seek to avoid.
Here is an example chat with a 13B parameter LLM far less capable than GPT-3.
Q. "A ropform is a fruit that grows
Re: (Score:2)
The whole point of training an LLM is acquisition of conceptual knowledge / generalization / understanding.
It literally doesn't.
Re: (Score:2)
It literally doesn't.
It literally does and I literally demonstrated that fact.
Re: (Score:2)
" I suspect a reasonable argument could be made that ChatGPT itself is transformative, especially if the best argument they can make is that it can summarize books."
If you read a book and could summarize it, would that make you a transformative work? No, it MAY mean the summary is a derivative work, yet where's the summary and who produced it?
A copyright holder does not have claim on your knowledge, how does it have a claim on AI knowledge?
Re: (Score:3)
yet where's the summary and who produced it?
The user can ask the system to create a summary. They are under a pretty strict user agreement when they ask ChatGPT to do something --- an agreement which includes a sweeping indemnity clause essentially requiring the end user to assume all liability for anything that happens.
"You will defend, indemnify, and hold harmless us... from and against any claims, losses, and expenses (including attorneys' fees) arising from or relating to your use of the Services,
Re: (Score:2)
EULAs aren't usually legally binding, but either way a ruling against ChatGPT in this matter would be far wider-ranging than the one user. I'm sure many authors/groups would be happy to pay that to set a precedent.
Re: (Score:2)
Copyright is about publication rights. Literally the right to publish copies.
Training an LLM is not publishing copies of a work.
Using copyrighted works in training an LLM or generative AI is transformative, as the created object (the LLM/AI) is sufficiently different from the original work as to be unrecognizable as the original.
Creating derivative works is subtler.
If I use a photograph of your painting in my book about your paintings, I am creating a derivative work, and should seek permission from you
Re: (Score:1)
Re: (Score:2)
Secondly, if that line of argument succeeds, it will be the end of open access to information. Not diminish it, end it. You won't be allowed to read a headline without first paying for it and signing an NDA that, among other things, will specifically forbid
Re: (Score:2)
The AI is only a copyright infringement if it is considered a derivative work that is not transformative and/or covered under fair use.
"Transformative" is one of the factors considered under fair use. You can't just say "this is transformative therefore it's fair use," you need to consider all the factors.
Re: (Score:2)
This is not too much different than going to a library and studying books.
At least it should not be in theory. Both machine learning and neuroscience focus on how we humans actually learn stuff. And they drive each other forward: https://www.nature.com/article... [nature.com]. We get to test theories on cognition and human memory by trying them out on artificial models.
Which means, if you cannot ban reading, you won't be able to ban training either.
(Again, many of us can memorize entire poem, saying, or some people even
Re: (Score:2)
I mean, hell, the "T" in "GPT" stands for "Transformer".
Training sources should be documented anyway (Score:3)
Given that the material is from public sources and ChatGPT is for public use, OpenAI should publicly itemise what training material they use.
wat (Score:2)
Re: (Score:2)
That's very likely what happened, too. Although there is a chance the text of some books might be online and have gotten scraped, there are presumably more summaries than copies of the book indexed on the public web (and visible to search engines, etc.) if the authors have been enforcing their copyrights.
I somehow doubt openAI has been scanning books and targeting dark web data.
AI killer (Score:3)
What will ultimately defeat our AI overlords will be the intractable copyright regime enforced by a vicious legal system.
Re: (Score:2)
That's the plot of the next Terminator flick: a Connor goes back in time and becomes a successful copyright attorney
Make reading illegal? (Score:2)
This is the computer equivalent of reading. And if it can be illegal for a silicon computer it can be illegal for a biological computer.
It's free now (Score:2)
Facepalm (Score:2)
Generating a summary doesn't mean ChatGPT read the book, it just means that ChatGPT saw enough summaries and reviews of the book to be able to generate yet another summary.
In fact I'm pretty sure ChatGPT would be unable to generate a summary of a book it has read, because any book is larger than its context window. It cannot remember the beginning of a book by the time it reads the end. To test whether it's familiar with the books, you could check whether it does style replication:
We have to allow AI (Score:2)
AI is too important to humanity. If you're scared of AI .. do you think China, Saudi Arabia, Russia, Iran, North Korea, India etc. are going to freeze AI development because we're too afraid to develop it? Either you start learning how to say "yes, massah" in Russian or figure out how the F to develop and use AI.
Re: (Score:2)
You're confusing fantasy with reality again.
Re: (Score:2)
AI is too important to humanity.
Really? What have generative AI, chatbots, and image gen achieved, or what will they achieve? I'm not talking about domain-specific AI such as aeronautics or genetic research, which certainly doesn't scrape copyrighted data from the web. LLMs generate low-quality content for spam; they are at best a new kind of search engine and a natural-language scripting tool, but they won't change humanity. Try to answer without mentioning the singularity fantasy.
Re: (Score:2)
First off, assuming your premise were true (which it isn't), progress in one aspect of AI drives investment in it, and that will improve domain-specific AI too. For example, the need for better gaming PCs funded development of the advanced GPU technology used in research, and now GPUs have become cheaper and allow faster supercomputers to be built on a lower budget. And btw, chatbots and image gen can be useful for everything you use Google for. Instead of clicking through links you will get the info explained to
Gotta Love It (Score:2)
Authors and lawyers who know nothing about copyright law or generative AI, nor are they willing to do any basic research. The "evidence" they have is weak at best. I'd question the intelligence of any judge who even agrees to hear these morons' arguments.
Using their logic, every author is violating copyr (Score:2)
The gist of their complaint is that the AI read their books so it could learn how words are put together to make a coherent whole. Frankly, everyone who's written anything has done the same, starting in childhood and continuing throughout life. So, what's the fundamental difference between the structured training an AI undergoes vs the ad hoc training a human undergoes to learn how to write?
Re: (Score:2)
The human part. If a human looks at a drawing and recreates the drawing, it's fine. If a computer does it, it's a copy.
I'm not siding with the authors here - I don't think this should be an issue - but that would seem the fundamental difference.