AI The Courts

The Intercept, Raw Story, and AlterNet Sue OpenAI and Microsoft (theverge.com) 58

The Intercept, Raw Story, and AlterNet have filed separate lawsuits against OpenAI and Microsoft, alleging copyright infringement and the removal of copyright information while training AI models. The Verge reports: The publications said ChatGPT "at least some of the time" reproduces "verbatim or nearly verbatim copyright-protected works of journalism without providing author, title, copyright or terms of use information contained in those works." According to the plaintiffs, if ChatGPT trained on material that included copyright information, the chatbot "would have learned to communicate that information when providing responses."

Raw Story and AlterNet's lawsuit goes further (PDF), saying OpenAI and Microsoft "had reason to know that ChatGPT would be less popular and generate less revenue if users believed that ChatGPT responses violated third-party copyrights." Both Microsoft and OpenAI offer legal cover to paying customers in case they get sued for violating copyright for using Copilot or ChatGPT Enterprise. The lawsuits say that OpenAI and Microsoft are aware of potential copyright infringement. As evidence, the publications point to how OpenAI offers an opt-out system so website owners can block content from its web crawlers.
The New York Times also filed a lawsuit in December against OpenAI, claiming ChatGPT faithfully reproduces journalistic work. OpenAI claims the publication exploited a bug on the chatbot to regurgitate its articles.
This discussion has been archived. No new comments can be posted.

  • by rmdingler ( 1955220 ) on Wednesday February 28, 2024 @09:35PM (#64277508) Journal

    Shudder. Of all the pratfalls and foibles in the way of Skynet domination, the one the computer overlords were least prepared for was pesky copyright infringement? Well, great story Bro, but it's no box office draw, for sure now.

    • myself.
      when i have communicated to others.
      i have never felt the need to state prior art

      • by tlhIngan ( 30335 )

        Well, here's something to think about.

        Did you know it's possible to pirate open-source code? We don't usually call it that - usually it goes under terms like "GPL infringement" or other terms, but in the end, it boils down to "copyright infringement" aka piracy.

        What does this have to do with anything? Well, programming languages are just languages, and ChatGPT can spit out code. But ever consider the effects?

        If you train an LLM on open-source code, and have it generate the code, you need to figure out the c

  • Yeeeeeeah, your 0.000000000000000081% share of the training data is gonna jump right out at everyone, verbatim, because it's soooo good and soooo important. No wonder Glenn Greenwald left The Intercept.

    Sure, if you prompt engineer the hell out of it like the NYT did, then you might get it to regurgitate something it saw before on your website. I am 100% sure this is true of human beings like me, too.
    • by Lehk228 ( 705449 ) on Wednesday February 28, 2024 @10:55PM (#64277602) Journal
      if the AI model is reliably spitting out whole chunks of copyrighted material then yea that's a valid case
      • Please re-read what I wrote:

        Sure, if you prompt engineer the hell out of it like the NYT did, then you might get it to regurgitate something it saw before on your website.
        • Re: (Score:2, Interesting)

          by gabebear ( 251933 )
          So you agree they are obviously infringing copyright? Copying a document into a different format doesn’t magically remove copyright.
          • When the alleged victim of infringement goes to great trouble and expense to encourage the big bad evil infotech companies to copy and widely disseminate their work, they might have some problems with this line of reasoning. See discussion of laches elsewhere in the comments.
      • Don't know if triggering the AI with a whole paragraph lifted from a copyrighted text is not closer to entrapment than to spontaneous copyright violation.
  • by Anonymous Coward

    ChatGPT does not have any independent knowledge of the information provided in
    its responses.

    Knowledge is not subject to copyright.

    If ChatGPT was trained on works of journalism that included the original author,
    title, and copyright information, ChatGPT would have learned to communicate that information
    when providing responses to users unless Defendants trained it otherwise.

    Cute they believe LLMs are some kind of automated cut and paste machines. This fundamentally is not how the technology works. LLMs like people are notoriously bad at sourcing knowledge.

    When providing responses, ChatGPT gives the impression that it is an all-knowing,
    "intelligent" source of the information being provided, when in reality, the responses are frequently
    based on copyrighted works of journalism that ChatGPT simply mimics.

    Again knowledge is not subject to copyright. It doesn't matter how much time and expense a journalist took to surface some bit of knowledge copyright law only protects works not information.

    Based on the publicly available information described above, thousands of Plaintiffs' copyrighted works were included in Defendants' training sets without the author, title, and copyright information that Plaintiffs conveyed in publishing them.

    Copyright law only concerns public performances, copies and preparation of derivative works. Co

    • If I hire a human to write a newspaper article for me, and they hand in a copy of an existing article on The New York Times, they're liable for copyright infringement, together with the organization that hired them and published the article. Why should this be different if they use a LLM?

      Humans need to be taught not to plagiarize. Why can't OpenAI spend 1% of their training time on detecting and preventing whole chunks of copied text? Or on running Turnitin on all the text they generate?
      • by piojo ( 995934 )

        If... they hand in a copy of an existing article on The New York Times, they're liable for copyright infringement... Why should this be different if they use a LLM?

        Because the verbatim reproduction is not on purpose in this case. It's so far from on purpose that it shouldn't even have been possible (say, based on the max known text compressibility).

        • We have 7B models trained on 6T tokens, almost 1000:1 compression ratio, there is no SPACE to put that copyrighted data in the model.
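For reference, the "almost 1000:1" figure in the comment above can be checked with a line of arithmetic. This is only a sketch: it counts training tokens per model parameter, using the 7B and 6T figures from the comment. A true compression ratio would also depend on bytes per token and bits per parameter, which the comment does not give, so those are left out rather than guessed. The function name is made up for illustration.

```python
def tokens_per_parameter(num_tokens, num_params):
    """Training tokens seen per model parameter (not a byte-level ratio)."""
    return num_tokens / num_params


# Figures taken from the comment: 6T training tokens, 7B parameters.
ratio = tokens_per_parameter(6e12, 7e9)
print(f"{ratio:.0f} tokens per parameter")  # prints "857 tokens per parameter"
```

So the count is roughly 857 tokens per parameter, which is consistent with "almost 1000:1" as a loose order-of-magnitude claim.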
      • There are two aspects here: training and generation. I agree when generating a LLM should not regurgitate the copyrighted original material verbatim. But training on it while ensuring it doesn't replicate should be ok. Checking for infringement is easy for exact snippets. Can even be n-grams, such as any 10 consecutive words - they can check the LLM never repeats copyrighted n-grams of a certain size using a bloom filter. Very fast and efficient, only need to index copyrighted works once in the bloom filte
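The n-gram check described in the comment above can be sketched roughly as follows. This is a minimal illustration, not anything OpenAI is known to use: the `BloomFilter` class, the helper names, the filter size, the hash count, and the whitespace tokenizer are all assumptions. A Bloom filter never misses an indexed n-gram but can occasionally flag one that was never indexed, so any hit would still need a check against the real text.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, small chance of false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text, n=10):
    """Yield every run of n consecutive lowercased, whitespace-split words."""
    words = text.lower().split()
    return (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def index_works(works, n=10):
    """Index the protected works once; the filter is reused for every check."""
    bloom = BloomFilter()
    for work in works:
        for gram in ngrams(work, n):
            bloom.add(gram)
    return bloom


def flagged_ngrams(output, bloom, n=10):
    """Return the n-grams of a generated output that (probably) were indexed."""
    return [gram for gram in ngrams(output, n) if gram in bloom]
```

Usage would look like `bloom = index_works(articles)` once, then `flagged_ngrams(generated_text, bloom)` per response. Note that this only catches exact word-for-word runs; light paraphrase, punctuation changes, or different tokenization would slip past it.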
  • I don't understand all the fuss about people getting butthurt that their precious data is used for model training. When I was growing up, I was exposed to tons of copyrighted material - textbooks, news articles, magazines, technical papers, artwork, film, music, television, lessons, lectures, all of which left factual and stylistic impressions on my brain. When training / learning off these materials, I did not have to pay any kind of exorbitant licensing fees to access these materials - only the standard
    • The problem is not that it is used for training. Some people complain about that, too, but the issue here is a different one.

      The problem is that OpenAI is (allegedly) publishing copyrighted articles without a licence from the copyright owners.

      The LLM is not supposed to reproduce the training data verbatim. That's why OpenAI says that you need to use carefully crafted prompts to get ChatGPT to reprint the exact article, and even that depends on chance.

      But who knows what is really going on.

      • NYT managed to recall a few articles, maybe 100, but the vast majority would not reproduce verbatim, they would be generated based on the prompt like anything else. These models are trained on a dataset so large they can only do one single pass. So they get to show any example once to the model, and its gradients stack on top of those from billions of other examples, a drop in the ocean.

        But a few articles are copy-pasted all over the web in forums to avoid paywalls, so they get to have more copies in the
  • If you post something on your internet website that clearly belongs to you as your original work, can everyone else freely copy it and legally provide it to people without any attribution as though it were their original creation? Seems problematic.

    As a step somewhat removed, can everyone else feed your original work into their software product, which then does that same thing? It seems like there would be similar legal problems.

  • > if ChatGPT trained on material that included copyright information, the chatbot "would have learned to communicate that information when providing responses."

    Do you see what they are doing here? A power grab. It used to be that copyright covered expression while ideas were free to reuse. Now they want to close off any formulation of an idea as copyright infringement. They want to own all possible formulations of an idea. Is that copyright anymore, or is it more like patents or trademarks?

    The ridi
    • by lordlod ( 458156 )

      Do you see what they are doing here? A power grab. It used to be that copyright covered expression while ideas were free to reuse. Now they want to close off any formulation of an idea as copyright infringement. They want to own all possible formulations of an idea. Is that copyright anymore, or is it more like patents or trademarks?

      This has actually always been the case. There's a reason why clean room techniques are used to reimplement code. If you have seen the original copyright work and produce something that is the same or very similar then it is immediately suspect of copyright infringement. Then the case goes through the courts to argue how different it is, cases on code have gone either way.

      That's what has happened here. The AI was fed the original copyright work. It produces something which is the same or very similar. Now

  • Wouldn't they have to prove that it came directly from their website?
    What if someone else infringed by posting the stories verbatim elsewhere?
    That happens to paywalled stuff from time to time, I'd imagine.
