The Intercept, Raw Story, and AlterNet Sue OpenAI and Microsoft (theverge.com) 58
The Intercept, Raw Story, and AlterNet have filed separate lawsuits against OpenAI and Microsoft, alleging copyright infringement and the removal of copyright information while training AI models. The Verge reports: The publications said ChatGPT "at least some of the time" reproduces "verbatim or nearly verbatim copyright-protected works of journalism without providing author, title, copyright or terms of use information contained in those works." According to the plaintiffs, if ChatGPT trained on material that included copyright information, the chatbot "would have learned to communicate that information when providing responses."
Raw Story and AlterNet's lawsuit goes further (PDF), saying OpenAI and Microsoft "had reason to know that ChatGPT would be less popular and generate less revenue if users believed that ChatGPT responses violated third-party copyrights." Both Microsoft and OpenAI offer legal cover to paying customers in case they get sued for violating copyright for using Copilot or ChatGPT Enterprise. The lawsuits say that OpenAI and Microsoft are aware of potential copyright infringement. As evidence, the publications point to how OpenAI offers an opt-out system so website owners can block content from its web crawlers. The New York Times also filed a lawsuit in December against OpenAI, claiming ChatGPT faithfully reproduces journalistic work. OpenAI claims the publication exploited a bug on the chatbot to regurgitate its articles.
Terminal copyright infringement, you say? (Score:3)
Shudder. Of all the pratfalls and foibles in the way of Skynet domination, the one the computer overlords were least prepared for was pesky copyright infringement? Well, great story Bro, but it's no box office draw, for sure now.
Re: (Score:2)
myself.
when i have communicated to others.
i have never felt the need to state prior art
Re: (Score:3)
Well, here's something to think about.
Did you know it's possible to pirate open-source code? We don't usually call it that - usually it goes under terms like "GPL infringement" or other terms, but in the end, it boils down to "copyright infringement" aka piracy.
What does this have to do with anything? Well, programming languages are just languages, and ChatGPT can spit out code. But ever consider the effects?
If you train an LLM on open-source code, and have it generate the code, you need to figure out the c
Self-important clowns (Score:1)
Sure, if you prompt engineer the hell out of it like the NYT did, then you might get it to regurgitate something it saw before on your website. I am 100% sure this is true of human beings like me, too.
Re:Self-important clowns (Score:4, Informative)
Re: (Score:2)
Sure, if you prompt engineer the hell out of it like the NYT did, then you might get it to regurgitate something it saw before on your website.
Re: (Score:2, Interesting)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
And yet, we still have a bunch of search engines and social media sites who have retained copies of that data. Any plaintiff is going to have to show why they were cool with all of that for so long, but *now* they don't like it when someone isn't even republishing their work.
I do note that I don't see anything in the co
Re: (Score:2)
You claim the very act of copying infringes, which is nonsense. The plaintiffs sat there feeding the defendants data for years and even spent considerable sums of money to entice the defendants to make
Re: (Score:2)
> "You claim the very act of copying infringes, which is nonsense."
That is the basis of EULAs: copying the software from disc [even if it's the CD the software came on] into RAM is in fact a copy, and thus is subject to copyright. The LA is a license by which the copyright holder grants permission to perform this copy operation.
As much as I might LIKE your interpretation, it is clearly not the binding precedent from the last 40ish years.
Re: (Score:2)
We have a partial answer in Authors Guild v. Google, Inc., of course. It was a repeated bitchslap to the Authors Guild from the district court (summary judgment in favor of Google), Second Circuit (affirming summary judgment), and the Supremes (denied certiorari).[1] This isn't binding, but it does show that the courts don't agree with your absolutist views about copying.
Oh, I
Re: (Score:1)
Re: (Score:2)
Every search engine does this very same thing
I disagree. One of the four factors evaluated when considering whether something is fair use is "the effect of the use upon the potential market for or value of the copyrighted work." The case of a search engine indexing a page and a LLM using a page as training data differ significantly here.
A search engine enhances the market value of the work it incidentally copies by increasing its visibility (and content creators seem to be on board with this, given all the work that goes into SEO).
In contrast, th
Re: (Score:3, Informative)
Self-replying here because I left out a point I meant to make.
they make money off of serving up ads next to summaries of said data that disincline people to actually go to the source of that data.
And, they pay for it: Google To Pay Wikipedia For Content In Knowledge Panel & Search [seroundtable.com]
Re: (Score:2)
Where the law sits on this I don't pretend to know exactly, but it strikes me that the difference between what the search engine is doing and what the chatbots are doing is the difference between writing a paper with a quote attributed to another author and writing one without.
chatbot - no idea where anything came from
search results with summary - link right to it.
Re: (Score:3, Insightful)
You don't infringe copyright when you make a copy in your memory, you infringe only when you *distribute* that copy.
Re: (Score:1)
I'll say the same thing I did to the other guy...
Without it being in fact a relevant copying operation, EULAs would not exist.
Distribution being the point of enforcement has had more to do with commercial harm.
Re: (Score:2)
Re: (Score:2)
Creating a local copy of a webpage to read it on your PC is likely to be considered fair use.
The factors of the fair use evaluation that seem most different in the case of making a copy for training a LLM are "the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes" and "the effect of the use upon the potential market for or value of the copyrighted work."
Viewing a web page is unlikely to be commercial in nature. I'm also not i
Painfully inept arguments (Score:1)
ChatGPT does not have any independent knowledge of the information provided in its responses.
Knowledge is not subject to copyright.
If ChatGPT was trained on works of journalism that included the original author, title, and copyright information, ChatGPT would have learned to communicate that information when providing responses to users unless Defendants trained it otherwise.
Cute that they believe LLMs are some kind of automated cut-and-paste machines. This fundamentally is not how the technology works. LLMs, like people, are notoriously bad at sourcing knowledge.
When providing responses, ChatGPT gives the impression that it is an all-knowing, "intelligent" source of the information being provided, when in reality, the responses are frequently based on copyrighted works of journalism that ChatGPT simply mimics.
Again, knowledge is not subject to copyright. It doesn't matter how much time and expense a journalist took to surface some bit of knowledge; copyright law only protects works, not information.
Based on the publicly available information described above, thousands of Plaintiffs' copyrighted works were included in Defendants' training sets without the author, title, and copyright information that Plaintiffs conveyed in publishing them.
Copyright law only concerns public performances, copies and preparation of derivative works. Co
Re: (Score:2)
Grandparent is right on target.
Re: (Score:2)
You walked in with the red herring "but copying is copying", which is out of place here as grandparent is addressing specific statements in the complaint that attempt to show a connection with demonstrated "knowledge" and infringement--in part using arguments that factually misrepresent the technology.
Maybe you are the target.
Re: (Score:1)
You were full of crap above. You're full of crap here, too.
Re: (Score:2)
Re: (Score:2)
Quoth SCOTUS:
(a) Article I, § 8, cl. 8, of the Constitution mandates originality as a prerequisite for copyright protection. The constitutional requirement necessitates independent creation plus a modicum of creativity. Since facts do not owe their origin to an act of authorship, they are not original, and thus are not copyrightable. Although a compilation of facts may possess the requisite originality because the author typically chooses which facts to include, in what order to place them, and how to arrange the data so that readers may use them effectively, copyright protection extends only to those components of the work that are original to the author, not to the facts themselves. This fact/expression dichotomy severely limits the scope of protection in fact-based works.
As for stripping the copyright, authorship, etc. data--have you seen the datasets, or are you just guessing like the plaintiffs?
Again, quoting them:
If ChatGPT was trained on works of journalism that included the original author, title, and copyright information, ChatGPT would have learned to communicate that information when providing responses to users unless Defendants trained it otherwise.
Isn't that a daisy? On what possible basis could they make such a claim? Everyone knows that no one really knows how the big LLMs are actually doing what they're doing. If they really knew this, they'd be in possession of a serious break
Re: (Score:2)
The Intercept's counsel, Loevy & Loevy (Chicago), bills themselves as a civil rights practice that also does some IP litigation. They are going to get pwned. The dumb shit they said in their complaint about how LLMs work is borderline sanctionable it's so wrong/misleading.
Re: (Score:2)
For the purposes of copyright the information contained within a work is not subject to copyright. For example as a matter of settled law I can OCR a phone book and create a computer database of every phone number in that book. While the phone book itself is copyrighted the "knowledge" it contains is not.
While your conclusion is correct, your example has a flaw. Creating a database that includes all of the names and phone numbers from a phone book would not be copyright infringement. Your step of scanning the phone book in order to extract the data, however, may be infringing. It's clearly making a copy of a protected work, though you might be able to make a fair use defense, depending on other factors.
Re: (Score:2)
No, this was part of SCOTUS's decision in Feist Publications, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340 (1991)
https://supreme.justia.com/cas... [justia.com]
Re: (Score:2)
Humans need to be taught not to plagiarize. Why can't OpenAI spend 1% of their training time on detecting and preventing whole chunks of copied text? Or on running Turnitin on all the text they generate?
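For what it's worth, the "detect whole chunks of copied text" part can be sketched in a few lines. This is only a rough illustration, not how OpenAI or Turnitin actually do it: it compares word n-grams ("shingles") between a generated response and one known source text. The function names `shingles` and `verbatim_overlap` and the default window of 8 words are assumptions made up for this example.

```python
def shingles(text, n=8):
    """Return the set of n-word windows (shingles) in text, lowercased."""
    words = text.lower().split()
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(generated, source, n=8):
    """Fraction of the generated text's n-word shingles that also occur in source."""
    gen = shingles(generated, n)
    if not gen:
        return 0.0
    return len(gen & shingles(source, n)) / len(gen)


if __name__ == "__main__":
    article = "the quick brown fox jumps over the lazy dog near the quiet river bank"
    response = "as reported, the quick brown fox jumps over the lazy dog near the quiet river bank"
    # A high score flags the response as containing long verbatim runs from the article.
    print(verbatim_overlap(response, article))
```

A real filter would have to check against an index of the whole training corpus rather than one article, which is where the actual cost lies.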
Re: (Score:2)
If... they hand in a copy of an existing article on The New York Times, they're liable for copyright infringement... Why should this be different if they use a LLM?
Because the verbatim reproduction is not on purpose in this case. It's so far from on purpose that it shouldn't even have been possible (say, based on the max known text compressibility).
Re: (Score:2)
Re: (Score:2)
What's all this fuss about? (Score:2)
Re: (Score:2)
Re: (Score:2)
The problem is not that it is used for training. Some people complain about that, too, but the issue here is a different one.
The problem is that OpenAI is (allegedly) publishing copyrighted articles without a licence from the copyright owners.
The LLM is not supposed to reproduce the training data verbatim. That's why OpenAI says that you need to use carefully crafted prompts to get ChatGPT to reprint the exact article, and even that depends on chance.
But who knows what is really going on.
Re: (Score:2)
But a few articles are copy-pasted all over the web in forums to avoid paywalls, so they get to have more copies in the
Interesting legal test (Score:2)
If you post something on your internet website that clearly belongs to you as your original work, can everyone else freely copy it and legally provide it to people without any attribution as though it were their original creation? Seems problematic.
As a step somewhat removed, can everyone else feed your original work into their software product, which then does that same thing? It seems like there would be similar legal problems.
They actually want to own ideas, not text (Score:2)
Do you see what they are doing here? A power grab. It used to be that copyright covered expression while ideas were free to reuse. Now they want to close off any formulation of an idea as copyright infringement. They want to own all possible formulations of an idea. Is that copyright anymore, or is it more like patents or trademarks?
The ridi
Re: (Score:2)
Do you see what they are doing here? A power grab. It used to be that copyright covered expression while ideas were free to reuse. Now they want to close off any formulation of an idea as copyright infringement. They want to own all possible formulations of an idea. Is that copyright anymore, or is it more like patents or trademarks?
This has actually always been the case. There's a reason why clean room techniques are used to reimplement code. If you have seen the original copyrighted work and produce something that is the same or very similar, then it is immediately suspect of copyright infringement. Then the case goes through the courts to argue how different it is; cases on code have gone either way.
That's what has happened here. The AI was fed the original copyright work. It produces something which is the same or very similar. Now
Prove they were the source (Score:2)
What if someone else infringed by posting the stories verbatim elsewhere?
That happens to paywalled stuff from time to time, I'd imagine.