The Intercept, Raw Story, and AlterNet Sue OpenAI and Microsoft (theverge.com) 58

Posted by BeauHD on Wednesday February 28, 2024 @08:25PM from the future-of-journalism dept.

The Intercept, Raw Story, and AlterNet have filed separate lawsuits against OpenAI and Microsoft, alleging copyright infringement and the removal of copyright information while training AI models. The Verge reports: The publications said ChatGPT "at least some of the time" reproduces "verbatim or nearly verbatim copyright-protected works of journalism without providing author, title, copyright or terms of use information contained in those works." According to the plaintiffs, if ChatGPT trained on material that included copyright information, the chatbot "would have learned to communicate that information when providing responses."

Raw Story and AlterNet's lawsuit goes further (PDF), saying OpenAI and Microsoft "had reason to know that ChatGPT would be less popular and generate less revenue if users believed that ChatGPT responses violated third-party copyrights." Both Microsoft and OpenAI offer legal cover to paying customers in case they get sued for violating copyright for using Copilot or ChatGPT Enterprise. The lawsuits say that OpenAI and Microsoft are aware of potential copyright infringement. As evidence, the publications point to how OpenAI offers an opt-out system so website owners can block content from its web crawlers. The New York Times also filed a lawsuit in December against OpenAI, claiming ChatGPT faithfully reproduces journalistic work. OpenAI claims the publication exploited a bug on the chatbot to regurgitate its articles.

Terminal copyright infringement, you say? (Score:3)

by rmdingler ( 1955220 ) writes: on Wednesday February 28, 2024 @08:35PM (#64277508) Journal

Shudder. Of all the pratfalls and foibles in the way of Skynet domination, the one the computer overlords were least prepared for was pesky copyright infringement? Well, great story Bro, but it's no box office draw, for sure now.

- Re: (Score:2)
  
  by LifesABeach ( 234436 ) writes:
  
  myself.
  when i have communicated to others.
  i have never felt the need to state prior art
  - Re: (Score:3)
    
    by tlhIngan ( 30335 ) writes:
    
    Well, here's something to think about.
    Did you know it's possible to pirate open-source code? We don't usually call it that - usually it goes under terms like "GPL infringement" or other terms, but in the end, it boils down to "copyright infringement" aka piracy.
    What does this have to do with anything? Well, programming languages are just languages, and ChatGPT can spit out code. But ever consider the effects?
    If you train an LLM on open-source code, and have it generate the code, you need to figure out the c
Self-important clowns (Score:1)

by ihadafivedigituid ( 8391795 ) writes:

Yeeeeeeah, your 0.000000000000000081% share of the training data is gonna jump right out at everyone, verbatim, because it's soooo good and soooo important. No wonder Glenn Greenwald left The Intercept.

Sure, if you prompt engineer the hell out of it like the NYT did, then you might get it to regurgitate something it saw before on your website. I am 100% sure this is true of human beings like me, too.
- Re:Self-important clowns (Score:4, Informative)
  
  by Lehk228 ( 705449 ) writes: on Wednesday February 28, 2024 @09:55PM (#64277602) Journal
  
  if the AI model is reliably spitting out whole chunks of copyrighted material then yea that's a valid case
  
  - Re: (Score:2)
    
    by ihadafivedigituid ( 8391795 ) writes:
    
    Please re-read what I wrote:
    
    Sure, if you prompt engineer the hell out of it like the NYT did, then you might get it to regurgitate something it saw before on your website.
    - Re: (Score:2, Interesting)
      
      by gabebear ( 251933 ) writes:
      
      So you agree they are obviously infringing copyright? Copying a document into a different format doesn’t magically remove copyright.
      - Re: (Score:2)
        
        by ihadafivedigituid ( 8391795 ) writes:
        
        When the alleged victim of infringement goes to great trouble and expense to encourage the big bad evil infotech companies to copy and widely disseminate their work, they might have some problems with this line of reasoning. See discussion of laches elsewhere in the comments.
  - Re: (Score:2)
    
    by Visarga ( 1071662 ) writes:
    
    Don't know if triggering the AI with a whole paragraph lifted from a copyrighted text is not closer to entrapment than to spontaneous copyright violation.
- Re: (Score:2)
  
  by ihadafivedigituid ( 8391795 ) writes:
  
  Every search engine does this very same thing, and they make money off of serving up ads next to summaries of said data that disincline people to actually go to the source of that data.
  
  And yet, we still have a bunch of search engines and social media sites who have retained copies of that data. Any plaintiff is going to have to show why they were cool with all of that for so long, but *now* they don't like it when someone isn't even republishing their work.
  
  I do note that I don't see anything in the co
  - - Re: (Score:2)
      
      by ihadafivedigituid ( 8391795 ) writes:
      
      Which people "didn't exist", Perry Mason? The plaintiffs had obviously existed or they wouldn't have anything to infringe upon. The defendants existed and have been scraping (aka making copies of) their websites with the knowledge and almost certainly with the active cooperation of the plaintiffs for years.
      
      You claim the very act of copying infringes, which is nonsense. The plaintiffs sat there feeding the defendants data for years and even spent considerable sums of money to entice the defendants to make
      - Re: (Score:2)
        
        by tabrisnet ( 722816 ) writes:
        
        > "You claim the very act of copying infringes, which is nonsense."
        That is the basis of EULAs, that copying the software from disc [even if it's the CD the software came on] into RAM is in fact a copy, and thus is subject to copyright. The LA is a license by which the copyright holder grants permission to perform this copy operation.
        As much as I might LIKE your interpretation, it is clearly not the binding precedent from the last 40ish years.
      - Re: (Score:2)
        
        by ihadafivedigituid ( 8391795 ) writes:
        
        OK, Mr. Pedantic: it infringes, super ultra double secret probation technically if you squint, but is there a fair use defense?
        
        We have a partial answer in Authors Guild v. Google, Inc., of course. It was a repeated bitchslap to the Authors Guild from the district court (summary judgment in favor of Google), Second Circuit (affirming summary judgment), and the Supremes (denied certiorari).[1] This isn't binding, but it does show that the courts don't agree with your absolutist views about copying.
        
        Oh, I
  - Re: (Score:1)
    
    by gabebear ( 251933 ) writes:
    
    Verbatim copying is being litigated. OpenAI’s main defense is that to get GPT-4 to verbatim regurgitate articles a user usually has to violate OpenAI’s terms of service which they say is not common. https://www.searchenginejourna... [searchenginejournal.com]
  - Re: (Score:2)
    
    by cmseagle ( 1195671 ) writes:
    
    Every search engine does this very same thing
    I disagree. One of the four factors evaluated when considering whether something is fair use is "the effect of the use upon the potential market for or value of the copyrighted work." The case of a search engine indexing a page and a LLM using a page as training data differ significantly here.
    A search engine enhances the market value of the work it incidentally copies by increasing its visibility (and content creators seem to be on board with this, given all the work that goes into SEO).
    In contrast, th
    - Re: (Score:3, Informative)
      
      by cmseagle ( 1195671 ) writes:
      
      Self-replying here because I left out a point I meant to make.
      they make money off of serving up ads next to summaries of said data that disincline people to actually go to the source of that data.
      And, they pay for it: Google To Pay Wikipedia For Content In Knowledge Panel & Search [seroundtable.com]
    - Re: (Score:2)
      
      by DarkOx ( 621550 ) writes:
      
      Where the law sits on this I don't pretend to know exactly but it strikes me that what the search engine is doing vs what the chatbots are doing is the difference between writing a paper with a quote from another author and not.
      chatbot - no idea where anything came from
      search results with summary - link right to it.
- Re: (Score:3, Insightful)
  
  by nicubunu ( 242346 ) writes:
  
  You don't infringe copyright when you make a copy in your memory, you infringe only when you *distribute* that copy.
  - Re: (Score:1)
    
    by tabrisnet ( 722816 ) writes:
    
    I'll say the same thing I did to the other guy...
    Without it being in fact a relevant copying operation, EULAs would not exist.
    Distribution being the point of enforcement has had more to do with commercial harm.
- Re: (Score:2)
  
  by Visarga ( 1071662 ) writes:
  
  Why the double standard? Even to read the license of a website we need to create in memory a copy of that page, including potentially copyrighted materials. Internet works by copying stuff around, those are temporary copies, technical copies. LLMs too need to create technical copies of copyrighted materials for training, those are not distributed. The trained model is usually 1000x smaller than the original corpus, so it's impossible to "copy" more than 0.1% of it.
  - Re: (Score:2)
    
    by cmseagle ( 1195671 ) writes:
    
    Creating a local copy of a webpage to read it on your PC is likely to be considered fair use.
    The factors of the fair use evaluation that seem most different in the case of making a copy for training a LLM are "the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes" and "the effect of the use upon the potential market for or value of the copyrighted work."
    Viewing a web page is unlikely to be commercial in nature. I'm also not i
Painfully inept arguments (Score:1)

by Anonymous Coward writes:

ChatGPT does not have any independent knowledge of the information provided in its responses.
Knowledge is not subject to copyright.
If ChatGPT was trained on works of journalism that included the original author, title, and copyright information, ChatGPT would have learned to communicate that information when providing responses to users unless Defendants trained it otherwise.
Cute they believe LLMs are some kind of automated cut and paste machines. This fundamentally is not how the technology works. LLMs like people are notoriously bad at sourcing knowledge.
When providing responses, ChatGPT gives the impression that it is an all-knowing, "intelligent" source of the information being provided, when in reality, the responses are frequently based on copyrighted works of journalism that ChatGPT simply mimics.
Again knowledge is not subject to copyright. It doesn't matter how much time and expense a journalist took to surface some bit of knowledge copyright law only protects works not information.
Based on the publicly available information described above, thousands of Plaintiffs' copyrighted works were included in Defendants' training sets without the author, title, and copyright information that Plaintiffs conveyed in publishing them.
Copyright law only concerns public performances, copies and preparation of derivative works. Co
- - Re: (Score:2)
    
    by ihadafivedigituid ( 8391795 ) writes:
    
    See my reply above to your other badly argued version of this line of thought.
    
    Grandparent is right on target.
    - - Re: (Score:2)
        
        by ihadafivedigituid ( 8391795 ) writes:
        
        No, Johnnie Cochran, I didn't say summaries were the same as copying. Neither did the grandparent.
        
        You walked in with the red herring "but copying is copying", which is out of place here as grandparent is addressing specific statements in the complaint that attempt to show a connection with demonstrated "knowledge" and infringement--in part using arguments that factually misrepresent the technology.
        
        Maybe you are the target.
    - Re: (Score:1)
      
      by Miles_O'Toole ( 5152533 ) writes:
      
      You were full of crap above. You're full of crap here, too.
      - Re: (Score:2)
        
        by ihadafivedigituid ( 8391795 ) writes:
        
        Well, that kind of well-reasoned argument is what will make America great again.
- - - - Re: (Score:2)
        
        by ihadafivedigituid ( 8391795 ) writes:
        
        Dude brought in Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991) correctly.
        
        Quoth SCOTUS:
        (a) Article I, § 8, cl. 8, of the Constitution mandates originality as a prerequisite for copyright protection. The constitutional requirement necessitates independent creation plus a modicum of creativity. Since facts do not owe their origin to an act of authorship, they are not original, and thus are not copyrightable. Although a compilation of facts may possess the requisite originality because the author typically chooses which facts to include, in what order to place them, and how to arrange the data so that readers may use them effectively, copyright protection extends only to those components of the work that are original to the author, not to the facts themselves. This fact/expression dichotomy severely limits the scope of protection in fact-based works.
        As for stripping the copyright, authorship, etc. data--have you seen the datasets, or are you just guessing like the plaintiffs?
        
        Again, quoting them:
        If ChatGPT was trained on works of journalism that included the original author, title, and copyright information, ChatGPT would have learned to communicate that information when providing responses to users unless Defendants trained it otherwise.
        Isn't that a daisy? On what possible basis could they make such a claim? Everyone knows that no one really knows how the big LLMs are actually doing what they're doing. If they really knew this, they'd be in possession of a serious break
        
        Re: (Score:2)
        
        by ihadafivedigituid ( 8391795 ) writes:
        
        Have you been around Slashdot long enough to recall the many infringing things Darl McBride claimed were definitely in the Linux source code? I wouldn't take any allegations as written.
        
        The Intercept's counsel, Loevy & Loevy (Chicago), bills themselves as a civil rights practice that also does some IP litigation. They are going to get pwned. The dumb shit they said in their complaint about how LLMs work is borderline sanctionable it's so wrong/misleading.
    - Re: (Score:2)
      
      by Dragonslicer ( 991472 ) writes:
      
      For the purposes of copyright the information contained within a work is not subject to copyright. For example as a matter of settled law I can OCR a phone book and create a computer database of every phone number in that book. While the phone book itself is copyrighted the "knowledge" it contains is not.
      While your conclusion is correct, your example has a flaw. Creating a database that includes all of the names and phone numbers from a phone book would not be copyright infringement. Your step of scanning the phone book in order to extract the data, however, may be infringing. It's clearly making a copy of a protected work, though you might be able to make a fair use defense, depending on other factors.
      - Re: (Score:2)
        
        by ihadafivedigituid ( 8391795 ) writes:
        
        Your step of scanning the phone book in order to extract the data, however, may be infringing.
        
        No, this was part of SCOTUS's decision in Feist Publications, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340 (1991)
        
        https://supreme.justia.com/cas... [justia.com]
- Re: (Score:2)
  
  by fph il quozientatore ( 971015 ) writes:
  
  If I hire a human to write a newspaper article for me, and they hand in a copy of an existing article on The New York Times, they're liable for copyright infringement, together with the organization that hired them and published the article. Why should this be different if they use a LLM?
  
  Humans need to be taught not to plagiarize. Why can't OpenAI spend 1% of their training time on detecting and preventing whole chunks of copied text? Or on running Turnitin on all the text they generate?
  - Re: (Score:2)
    
    by piojo ( 995934 ) writes:
    
    If... they hand in a copy of an existing article on The New York Times, they're liable for copyright infringement... Why should this be different if they use a LLM?
    Because the verbatim reproduction is not on purpose in this case. It's so far from on purpose that it shouldn't even have been possible (say, based on the max known text compressibility).
    - Re: (Score:2)
      
      by Visarga ( 1071662 ) writes:
      
      We have 7B models trained on 6T tokens, almost 1000:1 compression ratio, there is no SPACE to put that copyrighted data in the model.
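For anyone who wants to check the arithmetic behind the "almost 1000:1" figure, here is a rough back-of-the-envelope sketch. The 7B and 6T numbers come from the comment above; the bytes-per-parameter and bytes-per-token values are assumptions chosen for illustration (16-bit weights, roughly four bytes of text per token), and the function name is hypothetical.

```python
def compression_estimate(num_params, num_tokens,
                         bytes_per_param=2.0, bytes_per_token=4.0):
    """Compare the size of a model with the size of its training corpus.

    bytes_per_param and bytes_per_token are illustrative assumptions,
    not measured values for any particular model or tokenizer.
    """
    model_bytes = num_params * bytes_per_param
    corpus_bytes = num_tokens * bytes_per_token
    return {
        "tokens_per_param": num_tokens / num_params,
        "model_bytes": model_bytes,
        "corpus_bytes": corpus_bytes,
        "byte_ratio": corpus_bytes / model_bytes,
    }


# Figures from the comment: a 7B-parameter model trained on 6T tokens.
estimate = compression_estimate(num_params=7e9, num_tokens=6e12)
print(f"tokens per parameter: {estimate['tokens_per_param']:.0f}")
print(f"corpus bytes per model byte: {estimate['byte_ratio']:.0f}")
```

Under these assumptions the ratio lands in the high hundreds to low thousands, depending on whether you count tokens per parameter or bytes per byte. Note that this is an average over the whole corpus: it does not by itself rule out memorization of individual passages, particularly ones that appear many times in the training data.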
  - Re: (Score:2)
    
    by Visarga ( 1071662 ) writes:
    
    There are two aspects here: training and generation. I agree when generating a LLM should not regurgitate the copyrighted original material verbatim. But training on it while ensuring it doesn't replicate should be ok. Checking for infringement is easy for exact snippets. Can even be n-grams, such as any 10 consecutive words - they can check the LLM never repeats copyrighted n-grams of a certain size using a bloom filter. Very fast and efficient, only need to index copyrighted works once in the bloom filte
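The n-gram check described in the comment above could look something like the following. This is a minimal sketch, not how OpenAI or anyone else actually does it: the `BloomFilter` class, the helper names, the filter size, and the whitespace tokenization are all assumptions made for illustration. It indexes every 10-word window of the protected texts once, then flags any 10-word window in a model output that appears to be in the index.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, small chance of false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(text, n=10):
    """Yield every run of n consecutive words (lowercased, whitespace-split)."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def index_corpus(documents, n=10):
    """Build the filter once over all protected documents."""
    bloom = BloomFilter()
    for doc in documents:
        for gram in ngrams(doc, n):
            bloom.add(gram)
    return bloom


def flagged_ngrams(output, bloom, n=10):
    """Return the n-grams in a generated output that the filter reports as indexed."""
    return [gram for gram in ngrams(output, n) if gram in bloom]
```

A filter like this never misses an indexed n-gram, but it can occasionally flag one that was never indexed, so a hit would need confirming against the real text. It also only catches exact word-for-word runs; a light paraphrase or a change in punctuation or spacing conventions would slip past this naive tokenization.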
What's all this fuss about? (Score:2)

by stevenm86 ( 780116 ) writes:

I don't understand all the fuss about people getting butthurt that their precious data is used for model training. When I was growing up, I was exposed to tons of copyrighted material - textbooks, news articles, magazines, technical papers, artwork, film, music, television, lessons, lectures, all of which left factual and stylistic impressions on my brain. When training / learning off these materials, I did not have to pay any kind of exorbitant licensing fees to access these materials - only the standard
- - Re: (Score:2)
    
    by ihadafivedigituid ( 8391795 ) writes:
    
    The publishers tried squashing the free public libraries with the same arguments. It's worth noting that I used to read a lot of expensive technical books at the bookstore in the 80s and 90s. I took notes and everything. No payment was involved in a fair chunk of my own technical education. The book gestapo left me alone, thank goodness.
- Re: (Score:2)
  
  by Synonymous Homonym ( 1901660 ) writes:
  
  The problem is not that it is used for training. Some people complain about that, too, but the issue here is a different one.
  The problem is that OpenAI is (allegedly) publishing copyrighted articles without a licence from the copyright owners.
  The LLM is not supposed to reproduce the training data verbatim. That's why OpenAI says that you need to use carefully crafted prompts to get ChatGPT to reprint the exact article, and even that depends on chance.
  But who knows what is really going on.
  - Re: (Score:2)
    
    by Visarga ( 1071662 ) writes:
    
    NYT managed to recall a few articles, maybe 100, but the vast majority would not reproduce verbatim, they would be generated based on the prompt like anything else. These models are trained on a dataset so large they can only do one single pass. So they get to show any example once to the model, and its gradients stack on top of those from billions of other examples, a drop in the ocean.
    
    But a few articles are copy-pasted all over the web in forums to avoid paywalls, so they get to have more copies in the
Interesting legal test (Score:2)

by ZipNada ( 10152669 ) writes:

If you post something on your internet website that clearly belongs to you as your original work, can everyone else freely copy it and legally provide it to people without any attribution as though it were their original creation? Seems problematic.
As a step somewhat removed, can everyone else feed your original work into their software product, which then does that same thing? It seems like there would be similar legal problems.
They actually want to own ideas, not text (Score:2)

by Visarga ( 1071662 ) writes:

> if ChatGPT trained on material that included copyright information, the chatbot "would have learned to communicate that information when providing responses."

Do you see what they are doing here? A power grab. It used to be that copyright covered expression while ideas were free to reuse. Now they want to close off any formulation of an idea as copyright infringement. They want to own all possible formulations of an idea. Is that copyright anymore, or is it more like patents or trademarks?

The ridi
- Re: (Score:2)
  
  by lordlod ( 458156 ) writes:
  
  Do you see what they are doing here? A power grab. It used to be that copyright covered expression while ideas were free to reuse. Now they want to close off any formulation of an idea as copyright infringement. They want to own all possible formulations of an idea. Is that copyright anymore, or is it more like patents or trademarks?
  
  This has actually always been the case. There's a reason why clean room techniques are used to reimplement code. If you have seen the original copyright work and produce something that is the same or very similar then it is immediately suspect of copyright infringement. Then the case goes through the courts to argue how different it is, cases on code have gone either way.
  That's what has happened here. The AI was fed the original copyright work. It produces something which is the same or very similar. Now
Prove they were the source (Score:2)

by sixsixtysix ( 1110135 ) writes:

Wouldn't they have to prove that it came directly from their website?
What if someone else infringed by posting the stories verbatim elsewhere?
That happens to paywalled stuff from time to time, I'd imagine.
