Microsoft The Courts

Major US Newspapers Sue OpenAI, Microsoft For Copyright Infringement (axios.com) 75

Eight prominent U.S. newspapers owned by investment giant Alden Global Capital are suing OpenAI and Microsoft for copyright infringement, in a complaint filed Tuesday in the Southern District of New York. From a report: Until now, the Times was the only major newspaper to take legal action against AI firms for copyright infringement. Many other news publishers, including the Financial Times, the Associated Press and Axel Springer, have instead opted to strike paid deals with AI companies for millions of dollars annually, undermining the Times' argument that it should be compensated billions of dollars in damages.

The lawsuit is being filed on behalf of some of the most prominent regional daily newspapers in the Alden portfolio, including the New York Daily News, Chicago Tribune, Orlando Sentinel, South Florida Sun Sentinel, San Jose Mercury News, Denver Post, Orange County Register and St. Paul Pioneer Press.


  • Makes sense (Score:5, Insightful)

    by chipperdog ( 169552 ) on Tuesday April 30, 2024 @10:25AM (#64435812) Homepage
    I mean, if those AI products use copyrighted data in coming up with their responses without citation, that is a copyright violation. I've always pushed that ChatGPT and others need to produce a bibliography for each response indicating the sources used in coming up with it. Not only for copyright compliance, but also so the response can be verified.
    • AI is using articles written by Stephen Glass and Jayson Blair, though.
      • AI is using articles written by Stephen Glass and Jayson Blair, though.

        You make the exact case for why citations are necessary. When you see bad data was used in the response, you can throw it out.

    • Re:Makes sense (Score:4, Insightful)

      by The-Ixian ( 168184 ) on Tuesday April 30, 2024 @11:04AM (#64435890)

      I guess I really don't see the problem with what data was "ingested", as long as it doesn't reconstitute it whole cloth.

      What if I had a perfect memory and memorized every article and book I ever read and then was somehow able to make a living out of creating reviews or book reports on the subject matter that I read. Is that copyright infringement?

      • If you've ever done a report in school, you learned about citations/attributions in the form of footnotes and bibliographies. You should cite your sources, if not for copyright, to back up the statements you make.
        In your book review example, you would be specifying the book being reviewed, and all the content you create would be your original opinion about the book, maybe with a few quotes in your review, that you would indicate as quotes from the work.
        • Bullshit.

          Cite YOUR fucking sources for your claims that you should cite sources for your own knowledge when used in every day scenarios. Guess what, I don't have to have footnotes when I'm talking to either another geologist, OR the moron that is trying to build a house on an unstable hillside. I just have to say "that's dangerous, permission to build there is DENIED". I don't have to cite sources of papers saying it's stupid to build a house on the edge of an unstable crumbling cliffside.

          • by Khyber ( 864651 )

            "Cite YOUR fucking sources for your claims that you should cite sources for your own knowledge when used in every day scenarios."

            Every fucking customer I deal with demands to know how I know this or that to come to how I repaired or designed their PCB.

            Any customer that isn't asking you to prove your shit is an idiot and is a major cause of why we get tons of hacks that claim they know shit but don't.

      • But what if your family don't like hearing old news stories, they like summaries and mashups. Would it be wrong to share these with your family?
        Hell no!
        And what if, instead of developing perfect recall and sharing it with your family, you just put those stories on a computer and collected millions of dollars in venture capital so the public could enjoy the summaries and mash ups at a price that was practically giving them away. Would that be so wrong?
    • Re:Makes sense (Score:4, Insightful)

      by WaffleMonster ( 969671 ) on Tuesday April 30, 2024 @11:10AM (#64435904)

      I mean, if those AI products use copyrighted data in coming up with their responses without citation, that is a copyright violation.

      Data is not subject to copyright under US law. For example it is settled law anyone can OCR a copyrighted phone book into a database and there isn't anything the copyright holder can do about it. Copyright law in the US protects performances and the (re)production of works and derivatives. It does not limit access to or the use of information.

      I've always pushed that ChatGPT and others need to produce a bibliography for each response indicating the sources used in coming up with it. Not only for copyright compliance, but also so the response can be verified.

      LLMs are influenced by literally everything in their training set.

      • Phone books and cook books are poor examples. The collections as a whole are copyrighted, but individual recipes or phone records are not "copyrightable".
        • Uhhh, that's a perfect example?

          The specific layout of the phone book is copyrightable. The data inside isn't. Just like the specific layout of the "reporting" of "news" is copyrightable, but the "news" information itself is NOT.

          So unless your LLM is putting out exact replicas of the articles written without having to go through extreme measures to get it to put out anything even remotely close to the original sources ( which means something has gone VERY wrong in training I might add ), you aren't doing ANY

    • Re:Makes sense (Score:4, Insightful)

      by HBI ( 10338492 ) on Tuesday April 30, 2024 @11:49AM (#64436032)

      Coming up with an answer using copyrighted materials WITH a citation is still copyright infringement unless you can make a fair use case. Citations are irrelevant for infringement. If it's fair use, you don't need a citation, either. The fair use discussion involves a number of considerations, but if the AI ingested the entire work and is spitting back something based on the entire thing, like, say "Give me the Cliff's Notes version of XXX work", it's going to be a hard sell.

      • According to the article one of the claims really is about citation: "The newspapers also claim OpenAI and Microsoft removed copyright management information, like journalists' names and titles, from their work when the information they reported was cited in answers to queries. "

        So even if the unique generated text doesn't violate copyright, by not generating "copyright management information" pertinent to some facts they could still be infringing some aspect of copyright law? Interesting. Seems to me t

        • According to the article one of the claims really is about citation: "The newspapers also claim OpenAI and Microsoft removed copyright management information, like journalists' names and titles, from their work when the information they reported was cited in answers to queries. "

          So even if the unique generated text doesn't violate copyright, by not generating "copyright management information" pertinent to some facts they could still be infringing some aspect of copyright law? Interesting. Seems to me this practice is ubiquitous within the news industry, otherwise every story would have a bibliography like a scientific paper.

          The copyright management bits are a DMCA thing. They are just throwing it on the wall.

      • Coming up with an answer using copyrighted materials WITH a citation is still copyright infringement unless you can make a fair use case.

        No. You make it sound like the non-copyright case is the exception to the norm. In reality it's the copyright infringement which is largely the exception. Virtually nothing exists in a vacuum, and most ideas are in some way derivative.

    • The news outlets should step cautiously here because a whole lot of what they do is repeat each other without attribution. AP and UPI formalize it to a degree, but I bet actually tracing the provenance of stories would reveal a lot of what you call "copyright violation" (although copyright doesn't actually cover facts/information per se).
      • The news outlets should step cautiously here because a whole lot of what they do is repeat each other without attribution.

        All significant news outlets are owned by the same small handful of large corporations, and they collude to present a unified front to the public.

    • I mean, if those AI products use copyrighted data in coming up with their responses without citation, that is a copyright violation.

      Except that it is not. Using copyrighted data is not a copyright violation (with or without citation).
      Republishing copyrighted materials (or a substantial portion thereof) without a license is a copyright violation.
      (There can be a fair-use exception if it is "transformative" as opposed to "derivative" -but this would be subject to interpretation.)

      I hope that this goes to trial -I want to hear the legal arguments:

      According to the law as written, it is not technically a violation. [republishing is not occur

    • I've always pushed that ChatGPT and others need to produce a bibliography for each response indicating the sources used in coming up with it.

      It's ironic that you said that as an idea rather than actually providing a citable legal precedent as to why or at least a bibliography of where you learnt that idea from.

    • by Roogna ( 9643 )

      So unfortunately it's unlikely to be copyright infringement because users who posted on these sites (/. included) are going to be bound by the TOS of the site they posted to, which almost universally includes clauses saying the site can use the posted information as desired.

  • My brain (Score:2, Redundant)

    by christoban ( 3028573 )

    My squishy, human brain learns the same way, I read a newspaper and absorb the knowledge.

    Then we both answer questions about that knowledge when asked. Nothing copied verbatim.

    The learning method is quite different, but I don't see how it's plagiarism.

    • by Tablizer ( 95088 )

      > My squishy, human brain learns the same way, I read a newspaper and absorb the knowledge.

      Our heads would be taxed and billed if someone found a way. With AI it's a bit more objective to prove borrowing, at least with the current state of the art. In the future, slimebags may find ways to disguise the training source.

      • by EvilSS ( 557649 )

        > My squishy, human brain learns the same way, I read a newspaper and absorb the knowledge.

        Our heads would be taxed and billed if someone found a way.

        College textbook companies would absolutely do this if they could find a way to get away with it.

    • It depends on if the LLM is producing entire passages word for word, which evidence suggests it is.

      In the human world, you might be able to claim (with evidence) that you never read the original work and the issue is parallel development. It's harder for an AI to make a similar claim unless the work was published after training.

      But ultimately it's a question of judgment. If it looks like it's copied, the courts are more likely to determine it's infringement.

      • by EvilSS ( 557649 )

        It depends on if the LLM is producing entire passages word for word, which evidence suggests it is.

        I'm curious how the newspapers are getting them to do this. I suspect they are picking obscure articles so there won't be a lot of training data on it, then seeding the prompt with a few sentences and asking it to generate the next sentence, then re-prompting until they get the exact sentence they are looking for, then moving on to the next one. If so, that's a bit bogus.

        • They will be required to hand over this information during discovery.

          IF this goes to trial, I expect this to be made public. If it is trivial to trigger, the complainants will say so -if it is not, the defense will bring it up on cross.

      • We humans sometimes reproduce phrases we read, too. Where's the evidence LLMs are doing anything more than that?

        IOW, if a human had perfect memory, wouldn't that person sometimes reproduce some phrases unintentionally?

        • That's not a defense for infringement for a human either. Courts might not assess a financial penalty depending on the circumstances, but they are probably going to issue an order to stop infringing activity.

          • I don't think they're even asserting that ChatGPT is plagiarizing like a human would.

            They seem to be claiming that the sort of processing and understanding ChatGPT does/has isn't enough to allow it to answer questions at all without violating copyright. It's like they are prejudiced toward human understanding.

            I think they're just looking for those millions.

    • My squishy, human brain learns the same way, I read a newspaper and absorb the knowledge.

      Then we both answer questions about that knowledge when asked. Nothing copied verbatim.

      The learning method is quite different, but I don't see how it's plagiarism.

      A good non-fiction book often has a long bibliography of references at the back. I don't see why computers should be exempt from the same sort of verification / bibliographic references.

      • A good non-fiction book often has a long bibliography of references at the back. I don't see why computers should be exempt from the same sort of verification / bibliographic references.

        Exactly. Apparently there is a generation of people who never had to do footnotes and bibliographies on their school papers.

        • A good non-fiction book often has a long bibliography of references at the back. I don't see why computers should be exempt from the same sort of verification / bibliographic references.

          Exactly. Apparently there is a generation of people who never had to do footnotes and bibliographies on their school papers.

          I'm amazed that the Slashdot crowd doesn't have more science readers that see books stuffed full of references on a regular basis. That's been standard MO for as long as any of us have been alive in the western world. Use information from a verified source in your own work: Make direct reference to said source. It seems logical, which probably explains why the techbros running the bigger LLMs would avoid it at all costs. Logic seems to have fled that entire area of expertise as fast as it could.

          • this is the time in the history of the world when a significant number of students are plagiarizing works and claiming that it is 'okay'. So, yeah, a whole generation of students who simply don't do footnotes or bibliographies, just as you say. Just like the anonymous coward who is insulting you with his reference to Billy Bob and Bubba.
        • by EvilSS ( 557649 )
          OK so where are your references? I expect them on every comment going forward if you use any knowledge you gained from reading a source at some time in your life and not from your own original research.
          • ChatGPT absolutely does provide extensive references to every one of its sources, annotated very thoroughly in its answers.

            Everyone demanding this seems to have never used it before.

        • ChatGPT answers DO already contain extensive references to its sources.

      • ChatGPT does provide extensive references like that throughout its answers.

        • ChatGPT does provide extensive references like that throughout its answers.

          Having dabbled with it a bit, I can give a firm "sometimes" agreement to this. It'd be nice if every answer came with a clickable "sources" link.

          • I use it a lot and I don't remember anything lately not providing extensive sources, usually several per sentence. But I am sure your experience is valid, too.

    • by r0nc0 ( 566295 )
      And your squishy brain apparently is unable to RTFA as nowhere does it say anything about plagiarism but rather about access to copyrighted material without compensation and without attribution. It also goes on to mention things like diluted trademark infringement and other aspects. The dirty little secret behind all the AI training is that forever and a day now data scientists are hungry for data and will go to any end to get it; they would prefer to not pay for it if possible because they require quite
  • Good first step (Score:1, Redundant)

    by sdinfoserv ( 1793266 )
    Given AI's ever growing need for larger and larger datasets to train, I have always contended that a large portion of that training material has been done with someone else's intellectual property. Like most corporations, AI companies care only about profit and disregard any laws in that pursuit, considering fines as minor operational fees. In this case, since AI companies are feeding off other corporations' IP, this is likely to be a huge battle. If findings and awards include a percentage of revenue, this wil
    • Sounds like shuffling deck chairs on the Titanic. Even if these lawsuits result in huge damages, the AI crowd will just find a way around it. For example, they can just create a big pile of training data that's free from copyright. The AI bots are getting good enough that they need less and less training material to be useful. Then there are the folks who will simply shuck and jive and simply lie about the origin of their training data. Then there are those who might run an LLM based on whatever they feel lik
      • can just create a big pile of training data that's free from copyright.

        IMO hyper-focusing on copyright *status* this way doesn't make sense. If a work is created anywhere where copyright is automatic, it's copyrighted regardless of the licensing or lack thereof. If we focus on making "using copyrighted works" itself wrong OR illegal (OR both) you'd kill off the ability to use works where the creators either gave explicit permission, or implicit permission through use of the appropriate Creative Commons license for instance, to use for training, which doesn't make sense to m

        • Okay, sure. Maybe there are some copyrighted works that could still be used. I wouldn't presume to argue on that lawyerly-point. However, it doesn't change my point that nobody is going to put AI back on the shelf because of concerns over copyright infringement.
  • Anyway, I've got to prepare for that neat eclipse coming Monday, don't miss this one! It promises to be the eclipse of the century!
  • by Tulsa_Time ( 2430696 ) on Tuesday April 30, 2024 @10:56AM (#64435870)

    An AI can read it also.

    The only issue is whether it creates a non-derivative output.

    • But even if you create a piece of work based on your reading, you should be citing the source material used in your analysis. Biggest question that comes into play is what is "common knowledge" and doesn't have to be cited...
    • by darkain ( 749283 )

      Derivative works are still protected under United States copyright law.

    • by Anonymous Coward
      An AI can read it also.

      Did the AI have a license to access it in the first place to read it? Can it then pass a bunch of copies of what it read (not what it output) around to its progeny?

      There are many issues here, and you've missed an obvious one in saying "only."
  • Forget the mild headlines; here's the real title: 'Eight Struggling Newspapers Take a Wild Swing at Tech Titans in a Last-Ditch Effort to Claim Relevance!' For a quarter-century, these papers turned a blind eye as giants like Google and countless other search engines freely scavenged and repurposed their content without a peep. Now, they're making a play with claims so thin, they're practically nonexistent. It's high time OpenAI turns the tables and countersues Google for decades of unconsented content use
    • It’s fascinating how these struggling newspapers are taking on tech giants in a bid to reclaim their relevance. But let’s face it, their claims seem flimsier than tissue paper. It’s ironic how they turned a blind eye to content scavenging for years, only to cry foul now.
  • It's scary how someone can invade the privacy of thousands of people like that. The fact that at least one person took their life due to this breach is heartbreaking. Justice being served is a relief, but it's hard to shake off the feeling that the punishment might not match the scale of the harm caused. As a law student, I recently delved into a related topic while working on an assignment about cybersecurity laws and their enforcement. It's fascinating to see how legal frameworks evolve to tackle these mo
    • This lawsuit really highlights the ongoing struggle between technology and copyright law. It's concerning to see the potential consequences of AI scraping content without permission. Hopefully, this case sets a precedent for better protection of intellectual property in the digital age.
  • Wow, this lawsuit is definitely going to shake things up! It's interesting to see the different approaches newspapers are taking when it comes to AI and copyright.
