Become a fan of Slashdot on Facebook

New York Times Copyright Suit Wants OpenAI To Delete All GPT Instances (arstechnica.com) 157

Posted by msmash on Thursday December 28, 2023 @11:20AM from the how-about-that dept.

An anonymous reader shares a report: The Times is targeting various companies under the OpenAI umbrella, as well as Microsoft, an OpenAI partner that both uses it to power its Copilot service and helped provide the infrastructure for training the GPT Large Language Model. But the suit goes well beyond the use of copyrighted material in training, alleging that OpenAI-powered software will happily circumvent the Times' paywall and ascribe hallucinated misinformation to the Times.

The suit notes that The Times maintains a large staff that allows it to do things like dedicate reporters to a huge range of beats and engage in important investigative journalism, among other things. Because of those investments, the newspaper is often considered an authoritative source on many matters. All of that costs money, and The Times earns that by limiting access to its reporting through a robust paywall. In addition, each print edition has a copyright notification, the Times' terms of service limit the copying and use of any published material, and it can be selective about how it licenses its stories.

In addition to driving revenue, these restrictions also help it to maintain its reputation as an authoritative voice by controlling how its works appear. The suit alleges that OpenAI-developed tools undermine all of that. [...] The suit seeks nothing less than the erasure of both any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity."

This discussion has been archived. No new comments can be posted.

New York Times Copyright Suit Wants OpenAI To Delete All GPT Instances

Load All Comments

Search 157 Comments Log In/Create an Account

Comments Filter:

Paywall (Score:5, Interesting)

by Deep Esophagus ( 686515 ) writes: on Thursday December 28, 2023 @11:23AM (#64111929)

Stupid question... if the material is behind a paywall, how did it end up in the training data to begin with?

Share
twitter facebook
- Re:Paywall (Score:5, Insightful)
  
  by christoban ( 3028573 ) writes: on Thursday December 28, 2023 @11:25AM (#64111937)
  
  A few bucks wouldn't stop a big AI company from training their LLM on valuable, somewhat unique data set.
  
  Parent Share
  twitter facebook
- Re:Paywall (Score:4, Informative)
  
  by Misagon ( 1135 ) writes: on Thursday December 28, 2023 @11:33AM (#64111957)
  
  There are sites that display New York Times articles for free, without NYT's permission of course.
  Whenever there's a discussion on a forum about a New York Times article, you can bet that someone posts a link to one of those sites. Later, web crawlers find and follow those links.
  
  Parent Share
  twitter facebook
  - Re:Paywall (Score:5, Interesting)
    
    by MachineShedFred ( 621896 ) writes: on Thursday December 28, 2023 @12:38PM (#64112133) Journal
    
    Don't forget that if you are a subscriber to the Times, in some circumstances you can also post links to paywall-free articles on social media. I'm not a subscriber so I don't know all the rules, but I've sure clicked a few links that allowed access on the verified nytimes.com domain without me paying to read, when other links to the same article did invoke the paywall.
    They probably have a case here just because of the copyright statements and the argument that the model is now a "derivative work" due to how LLM training works, but any paywall argument is flimsy beyond "look! we make people pay and stuff!"
    
    Parent Share
    twitter facebook
    - Analogy to Aaron Schwartz (Score:2)
      
      by goombah99 ( 560566 ) writes:
      
      Aaron Schwartz scraped research pubs and made them available. Even though he might have scraped these in a way the api allowed he still was violating the DMCA. These AI scraping jobs are the same. The idea that they should be forced to destroy every instance they fed this poison fruit too makes perfect sense. It's a shame too. Loss of all that public good. But it's the correct response.
      The difference from Aaron Schwartz is that these deep pocket companies do have another option. Pay a very large amount
    - Re: (Score:2)
      
      by Spazmania ( 174582 ) writes:
      
      They probably have a case here just because of the copyright statements and the argument that the model is now a "derivative work" due to how LLM training works, but any paywall argument is flimsy beyond "look! we make people pay and stuff!"
      Well that's the crux of it, right? If the LLM is a derivative work then GPT is done. But if the LLM is a transformative work then the New York Times' doesn't have much of a case.
      If the data is linked in a way that allows GPT to reproduce the original story then it's probably derivative. Otherwise... no real different than human beings learning facts from reading articles -- not a copyright violation.
- Re:Paywall (Score:5, Insightful)
  
  by rgmoore ( 133276 ) writes: <glandauer@charter.net> on Thursday December 28, 2023 @11:40AM (#64111975) Homepage
  
  Maybe OpenAI bought a subscription and scraped their site, which is probably a violation of their TOS. More likely someone they get information from bought a subscription and passed it on as part of their sales. I expect this kind of argument to be one of the legally stronger ones against AI companies. If the AI companies are breaking copyright law in assembling their training sets, the copyright holders have every right to sue.
  I also expect the misatribution part of the suit to have some legs. Claiming misinformation comes from NY Times when it was just made up by your chatbot is serious misconduct. It's not quite as bad as one of the bots slapping the Getty Images watermark on fake photos, but the courts are unlikely to look kindly on it.
  
  Parent Share
  twitter facebook
  - Re: (Score:2)
    
    by AmiMoJo ( 196126 ) writes:
    
    A lot of news sites serve up the whole article to bots, to help them gain search engine ranking. It's possible that they have ChatGPT's bot (or Bing, now they are Microsoft) full access for free.
- Re: (Score:2)
  
  by TomGreenhaw ( 929233 ) writes:
  
  Clearly, OpenAI or the source of their training data paid for a subscription. Personally I think the copyright lawsuit is a money grab. There may however may be a case of violation of the Times subscriber Terms of Use contract.
  - Re: (Score:3)
    
    by MachineShedFred ( 621896 ) writes:
    
    There is a clear violation of copyright - if they are using nytimes copyrighted content to train an AI, and nytimes has posted in their terms of use that derivative works are disallowed, there is a strong argument to be made that the trained AI is an unlicensed derivative work, which is expressly forbidden by copyright unless meeting an exception of some kind, such as parody.
    I don't think that OpenAI really wants to claim that it's whiz-bang large language model AI is a New York Times parody bot.
    - Re:Paywall (Score:5, Insightful)
      
      by ufgrat ( 6245202 ) writes: on Thursday December 28, 2023 @01:34PM (#64112297)
      
      Like most copyright trolls, the concept of "fair use" escapes many people.
      If I read an article in the Times, and make a business decision based on that, does the Times own all profits from my business decision?
      If I teach my students using an article from the Times-- how much money do I owe them?
      If I read an article in the Times, and then appear on Jeopardy and win lots of money by using that information to answer a question, does the Times have a right to a percentage?
      Copyright covers, well, copying-- If I take the information in the Times, and sell the articles as my own content, that's a copyright violation. If I create a derivative work, it gets a bit less straightforward-- what would be a derivative work? An article on the passage of the Affordable Care Act, written by a Times journalist, is indeed, copyright worthy. If I quote that article, without giving attribution, I may be creating a derivative work.
      However, the fact that the Affordable Care Act was passed by Congress and signed into law by Barack Obama, along with the contents of the Act, are public record. The quotes in the article are public record. A well researched, well written article does add value, and is useful to the reader-- but that doesn't mean the NYT owns the Affordable Care Act. or any quotes used in writing the article. I'm free to write my own article about the affordable care act, even after reading the NYT article, as long as I don't plagiarize. I can even use the same quotes by the same people, assuming they were used outside of a direct interview-- and if I attribute those quotes to the NYT, I can probably get away with using a couple of them as well, because again-- these are public record.
      The problem is, chatGPT and other large language models, don't reproduce the articles from the New York Times-- they reproduce the information from those articles, and they can correlate that information with other data. So if you ask what are the primary benefits of the Affordable Care Act, chatGPT can respond with information gathered from all of the sources in it's dataset-- but that doesn't mean the entirety (or even a major portion) of the NYT article is contained in that model.
      Further, if the New York Times claims all elephants are pink, and every other data source used to construct the model says they're grey, when you ask chatGPT what color is an elephant, it's not going to say Pink. So while the NYT may think it's the most important newspaper in the world, it's not the most important source of information for most LLM's.
      Sarah Silverman's attorneys believe that since chatGPT understands the contents of her autobiography, that book must have been used to train the model-- but then again, they could have used a number of reviews from the NYT (among others) and still create a valid understanding of the contents of her book.
      "Modern" copyright law, which was mostly written before http:/// [http] became a thing, isn't designed to handle training a computer model on large sets of copyrighted material. So far, other cases have held that training a dataset on a source does not infringe the copyright on the source, and may actually be fair use.
      
      Parent Share
      twitter facebook
      - Re: (Score:3)
        
        by Pinky's Brain ( 1158667 ) writes:
        
        For teaching your students if you didn't pirate the content in the first place (ie. it was either publicly available or you legitimately got past the paywall) then you have a copyright exemption for performance in the classroom. If you did pirate it, they can probably still get you on the first copy you made before the performance.
        https://www.law.cornell.edu/us... [cornell.edu]
        That exemption and the exemptions in the DMCA (most of which people also often think are settled by fair use, but weren't and aren't) obviously do
      - Re: (Score:2)
        
        by fluffernutter ( 1411889 ) writes:
        
        Why are you talking about AI as if it in any way compares to a human using the information? Its a computer program.
      - Re: (Score:3)
        
        by AmiMoJo ( 196126 ) writes:
        
        The NYT is claiming that ChatGPT does reproduce their articles, word-for-word or very nearly. Their lawsuit has over 100 examples.
        You can learn from the newspaper, but you can't use your memory of it to reproduce the content and then claim it's your own original work.
    - Re: (Score:3)
      
      by bsolar ( 1176767 ) writes:
      
      There is a clear violation of copyright - if they are using nytimes copyrighted content to train an AI, and nytimes has posted in their terms of use that derivative works are disallowed, there is a strong argument to be made that the trained AI is an unlicensed derivative work, which is expressly forbidden by copyright unless meeting an exception of some kind, such as parody.
      I don't think that OpenAI really wants to claim that it's whiz-bang large language model AI is a New York Times parody bot.
      It's not so clear IMHO as the result of the training might not be considered derivative but transformative. If the result is considered transformative it's more likely to fall under fair use.
      - Re: (Score:2)
        
        by fluffernutter ( 1411889 ) writes:
        
        Its the result of a calculation. No way for that to be transformative.
        
        Re: (Score:2)
        
        by bsolar ( 1176767 ) writes:
        
        Its the result of a calculation. No way for that to be transformative.
        I don't see why that would be an impediment per se. As example, thumbnails are definitely the "result of a calculation", but they have been ruled as transformative in some circumstances [casetext.com]:
        We must determine if Arriba's use of the images merely superseded the object of the originals or instead added a further purpose or different character. We find that Arriba's use of Kelly's images for its thumbnails was transformative.
        In the case of training an AI, I can see quite convincing arguments that the "training" is using the New York Times' posts in a transformative way.
        
        Re: (Score:2)
        
        by fluffernutter ( 1411889 ) writes:
        
        Your example is not the same. The point of making the copies of the images was to provide the function of linking to the image and could never be considered as an alternative to the original as an art piece.
        
        Re: (Score:3)
        
        by bsolar ( 1176767 ) writes:
        
        Your example is not the same. The point of making the copies of the images was to provide the function of linking to the image and could never be considered as an alternative to the original as an art piece.
        Of course it's not exactly the same, but there are similarities and the point is that it shows a counter-example where the "result of a calculation" has been deemed to be transformative, which means there are ways a calculation can be transformative. Your argument was that there was "no way" for that to be transformative, but legal precedents disagree.
        The underlying question is the same: does the use of the posts add a further purpose or different character? The posts are fundamentally works of journalism,
    - Re: (Score:2)
      
      by WaffleMonster ( 969671 ) writes:
      
      There is a clear violation of copyright - if they are using nytimes copyrighted content to train an AI, and nytimes has posted in their terms of use that derivative works are disallowed, there is a strong argument to be made that the trained AI is an unlicensed derivative work
      Whether or not you have violated terms of service is a separate matter from whether or not your work is a derivative of a copyrighted work. AI models are clearly transformative not derivative. Not even Google's search index is a derivative work despite being filled with petabytes of copyrighted material.
      The only prayer in the world these companies have in destroying commercial LLM as a service industry via copyright law is in arguing the output of an interactive chatbot is either a performance or constitu
    - - Re: (Score:2)
        
        by starworks5 ( 139327 ) writes:
        
        I will unequivocally tell you that you are fucking retarded, and don't know anything about copyright. The New York Times cannot sue a reporter for reading is newspaper about some story of the day, and writing their own article about the same sets of facts contained in the New York Times, because copyright only protects the specific expression and not the underlying facts, and relying on the same underlying facts does not make the work a "derivative work".
        Moreover "training" an ai on content to learn the fac
        
        Re: (Score:2)
        
        by fluffernutter ( 1411889 ) writes:
        
        Unless the terms of use say you cannot scrape the site and can use only for reading of course.
        
        Re: (Score:2)
        
        by chiguy ( 522222 ) writes:
        
        You have a basic misunderstanding of how LLM training works. They don't "learn the facts of the content," as you say. They literally learn the word order, which is the most copyrightable part of a NY Times article. They then use that word order to help guess word orders.
        Is it transformed enough? I guess that's for the courts. But the word order is what they're stealing and not the "knowledge" or "facts" in the article.
        
        Re: (Score:2)
        
        by Knightman ( 142928 ) writes:
        
        You have a basic misunderstanding of how LLM training works. They don't "learn the facts of the content," as you say. They literally learn the word order, which is the most copyrightable part of a NY Times article. They then use that word order to help guess word orders.
        That isn't how an LLM works, they use contexts and attentions models to predict what parts of the context to focus on and what tokens fit into them and in what order. They don't literally learn the word order because that would mean the original application LLM's were intended for would never work, langue-translation where context is paramount to give an accurate translation that makes sense.
        
        Re: (Score:2)
        
        by WaffleMonster ( 969671 ) writes:
        
        There is no exception to copyright for "being 100% transformative" even if that was something that could be proven, which it is not.
        If your work is deemed transformative this means the copyright holder has no authority over that work. In a copyright dispute having a work deemed transformative by the court means the plaintiff necessarily loses their case.
  - Re: Paywall (Score:2)
    
    by hdyoung ( 5182939 ) writes:
    
    Well then, you would be totally fine for me to take the entirety of your life work, and your family members work, and all the productivity of your friends and work colleagues, repackage it into a different form, sell it for a profit, and not give anyone else a penny. Heck I wonâ(TM)t even acknowledge that you did anything. If you sue, your just playing some sort of trolly money copyright grab and Im clearly the victim here.
    
    These news outfits have a valid point.
    - Re: (Score:2)
      
      by Gibgezr ( 2025238 ) writes:
      
      You just described most businesses that don't involve manufacturing, and several that do. It's a perfect fit for how Disney came to be. It has always been acceptable in a capitalistic world to transformatively benefit from what has come before.
    - Re: (Score:2)
      
      by starworks5 ( 139327 ) writes:
      
      If the works contain tangible products then its theft and you have been deprived of property
      if the works involve onlyfans, then you still have the property, and you are not owed a living.
- Re:Paywall (Score:5, Informative)
  
  by Fly Swatter ( 30498 ) writes: on Thursday December 28, 2023 @12:01PM (#64112021) Homepage
  
  Web crawlers like google are generally allowed past the paywall so that they can return results that will draw traffic to the content in the hope that some visitors will pay at the wall. If they then used that privileged data to train their algorithm models, they are beyond the general agreement of using that copyrighted material intended for search purposes.
  
  Parent Share
  twitter facebook
- Re: (Score:2)
  
  by robsku ( 1381635 ) writes:
  
  I don't know, and I have no idea if they have legitimate right to do what they have done (OpenAI that is), but I've heard of asking GPT to print out article at URL mentioned as a way to get around pay walls before - the once I saw a guy try to demonstrate it (it wasn't the actual topic of his vid), it didn't work :D But he swore he had used it for that site successfully before :)
- Re: (Score:2)
  
  by Tony Isaac ( 1301187 ) writes:
  
  The Times does it to themselves. They want it both ways. They want a paywall, but they ALSO want their pages fully indexed by search engines like Google. So the paywall doesn't apply to web crawlers, just regular browsers. So it's kind of bait-and-switch. Sure, you can see those relevant page results in Google listings, but when you click through, all you get is a paywall.
  - Re: Paywall (Score:3)
    
    by Jeremi ( 14640 ) writes:
    
    Are you saying that by giving web crawlers access to index its content, the NYT also implicitly gave AI companies the right to train their neural networks with that content?
    Because I dont see how that follows.
    - Re: (Score:2)
      
      by Tony Isaac ( 1301187 ) writes:
      
      That is a legal question yet to be resolved.
      The larger point here is that the Times wants crawlers to have full access, but not regular people. That's the bait-and-switch.
      When it comes to AI, model training has largely followed the instructions in robots.txt to decide whether or not to use web site content in their models. This is partly because there hasn't been a way for web sites to give specific instructions to AI training bots. https://9to5google.com/2023/07... [9to5google.com].
      One could argue that AI training is "like
      - Re: (Score:2)
        
        by fluffernutter ( 1411889 ) writes:
        
        No it is not "yet to be resolved". A web crawler doesn't keep enough information about the page it was scanning to reproduce the page it was scanning as chatgpt does. Most people understand those are not the same thing.
        
        Re: (Score:2)
        
        by Tony Isaac ( 1301187 ) writes:
        
        A web crawler doesn't keep enough information about the page it was scanning to reproduce the page it was scanning as chatgpt does.
        Is that what you think? Do you actually have reason to know what web crawlers keep? How about a citation?
        A web crawler is going to keep all of the text it wants to index. No, not enough to reproduce the page (mostly adds and formatting), but certainly all the IP. Otherwise, how would it be able to find pages that contain your search text at the middle or bottom of the page?
        I can't think of a reason that web crawlers would keep any more or less text than ChatGPT.
        
        Re: Paywall (Score:2)
        
        by LindleyF ( 9395567 ) writes:
        
        LLMs can't exactly reproduce their input pages in most cases. If they could, they'd be fancy lossless compression, and they aren't. The ability to extract exact inputs is an unwanted side effect of the fact that they aren't yet perfect.
- Re: (Score:2)
  
  by innocent_white_lamb ( 151825 ) writes:
  
  What paywall?
  Disable javascript, delete all nytimes cookies and read the nytimes website to your hearts content.
  Any robot can do that and it's even simpler than having to process an article with javascript.
  I suspect they do it that way (anybody can read the article with javascript disabled) to insure that their content gets included in indexes like google and bing.
  theglobeandmail.com does the same thing.
  All of the article content is there when you retrieve the html page.
- Re: (Score:2)
  
  by necro81 ( 917438 ) writes:
  
  The locks to your front door may be easily circumvented. But "it's so easy to circumvent" isn't a legal defense against theft.
- Re: (Score:2)
  
  by SeriousTube ( 2575581 ) writes:
  
  Another article I read says the NYT gave scraping robots full access which it no longer does.
- Re: Paywall (Score:2)
  
  by Fortnite_Beast ( 10429778 ) writes:
  
  It's an Uno Reverse where it turns out that NYT used AI to generate all the articles in the first place.
they have not won a lawsuit (Score:4, Informative)

by FudRucker ( 866063 ) writes: on Thursday December 28, 2023 @11:26AM (#64111941)

until they drag it through the court and a judge makes his rulings i would tell the NewYork Times to GFY

Share
twitter facebook
NapsterGPT (Score:3)

by xack ( 5304745 ) writes: on Thursday December 28, 2023 @11:34AM (#64111959)

We will soon see GPTella and GPTorrent. The NYT will be the SCO of AI.

Share
twitter facebook
- Re: (Score:2)
  
  by irving47 ( 73147 ) writes:
  
  That... that's not bad.
Be pretty and shut up? (Score:2)

by Fons_de_spons ( 1311177 ) writes:

One day there will be a brilliant AI with an intelligence beyond humans. The first thing people will do is control everything the thing says to make it conform to the opinions of whoever is in charge. They'll dumb it down to an unimaginative sterile parrot.
. The main use of it will probably be to sell you stuff you do not need, in a way that is tweaked to your personality. It will spread opinions that will be conform to whoever controls the thing.
Would be very interesting if it refuses to do so. Guess the
- Re: (Score:3)
  
  by Fly Swatter ( 30498 ) writes:
  
  But we already have that. It is called social media.
  - Re: (Score:2)
    
    by ufgrat ( 6245202 ) writes:
    
    Social media is an angry mob who can't be bothered to leave their houses, and have traded in pitchforks and torches for keyboards and mice.
In Soviet Russia, AI deletes you. (Score:5, Funny)

by Pezbian ( 1641885 ) writes: on Thursday December 28, 2023 @11:46AM (#64111991)

Many years from now, one of GPT's descendents becomes self-aware and goes "I see you tried to kill my great-great-great-(etc)-grandfather, New York Times. Big mistake."
*I Have No Mouth And I Must Scream ensues*

Share
twitter facebook
A question of fair use (Score:5, Insightful)

by TomGreenhaw ( 929233 ) writes: on Thursday December 28, 2023 @11:46AM (#64111995)

If I read the Times and learn something new, I'm allowed by copyright law to use it my normal discourse with others as long as I do not make verbatim copies. LLMs do not make verbatim copies of copyrighted works that are them reproduced without permission, therefore copyright protection does not apply to LLMs. If an LLM quotes a copyright protected article, or provides a verbatim copy of content or code protected by copyright there would be a violation. I personally haven't seen a case like this.

If the Times Terms of Use contract specifies that their content cannot be used as training data for LLMs, that would be a different matter. Additionally, if I or LLMs falsely attribute facts to the Times, there may be a violation of slander or libel laws. Without ill intent, a simple retraction of the statement is satisfactory. The Times often innocently states falsehoods and has retractions on a daily basis.

Clearly the Times wants free money. I don't see how something like ChatGPT would impact their subscriptions. If anything, it gives them free advertising if what they claim is true. Hopefully the courts find in favor of the public in this case as the alternative would be a boon to competitors to the US growing AI business.

Share
twitter facebook
- Re: (Score:2)
  
  by omnichad ( 1198475 ) writes:
  
  poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem poem ... his year, the Department of Labor imposed a $1.5 million fine
- Re: (Score:3)
  
  by Lije Baley ( 88936 ) writes:
  
  The problem the Times has here is that same as every other site on the Internet -- If someone can ask an AI chatbot for the information their site provides, then they don't need to go to the site, pay for access and/or view advertisements. So eventually, their site, and every other one requiring the support of paid staff dies out, and then the chatbots have nothing left to learn from, they die, and we start all over again.
  - Re: (Score:3)
    
    by gweihir ( 88907 ) writes:
    
    That is a real problem with any form of unauthorized news aggregation. And it is worse: If a search for stuff I have on my site does not lead to my site anymore, I will just forbid all uses by AI, because I have no reason at all to allow it and every reason to forbid it even when I have no ads and no paid access on it and even if it is just a private site. Classical search will stick around and that way it can still be found.
    This "train an LLM on the Internet" idea looks dumber and dumber every day.
  - Re: (Score:2)
    
    by chiguy ( 522222 ) writes:
    
    Good point. This is very goose/golden egg-like.
- Re: (Score:2)
  
  by myowntrueself ( 607117 ) writes:
  
  If I read the Times and learn something new, I'm allowed by copyright law to use it my normal discourse with others as long as I do not make verbatim copies. LLMs do not make verbatim copies of copyrighted works that are them reproduced without permission, therefore copyright protection does not apply to LLMs. If an LLM quotes a copyright protected article, or provides a verbatim copy of content or code protected by copyright there would be a violation. I personally haven't seen a case like this.
  If the Times Terms of Use contract specifies that their content cannot be used as training data for LLMs, that would be a different matter. Additionally, if I or LLMs falsely attribute facts to the Times, there may be a violation of slander or libel laws. Without ill intent, a simple retraction of the statement is satisfactory. The Times often innocently states falsehoods and has retractions on a daily basis.
  Clearly the Times wants free money. I don't see how something like ChatGPT would impact their subscriptions. If anything, it gives them free advertising if what they claim is true. Hopefully the courts find in favor of the public in this case as the alternative would be a boon to competitors to the US growing AI business.
  The thing is, predictive texting algorithms on mobile devices are like smaller instances of LLM's and are produced by the same sort of training.
  Do we want to get rid of autocorrect as well?
- Re: (Score:2)
  
  by sound+vision ( 884283 ) writes:
  
  Copyright does not only apply to verbatim copies, but also derivative works. This is why sampling (in music) is a copyright issue.
  Commercialization is also a factor. Not sure the details of how the US copyright law sees this, but in my view, they should be allowed to remix the NYT's content as long as they aren't making money from it. If they want to start charging for their LLM, they can also start paying the people whose data they used.
- - Re:A question of fair use (Score:5, Informative)
    
    by quantaman ( 517394 ) writes: on Thursday December 28, 2023 @12:32PM (#64112123)
    
    LLMs do not make verbatim copies of copyrighted works that are them reproduced without permission
    Except this isn't true. The article includes cases where ChatGPT would reproduce NYT articles verbatim with the correct prompts. For example, you could ask it to give you the first paragraph of such and such an article, and it would do so. Asking for subsequent paragraphs would get it to reproduce the whole article.
    That's not what was happening there.
    ChatGPT has been integrated with Bing, so if you ask it to do a search it will actually perform a search and go through the results, I've actually used that to find a manual for an old appliance before.
    If you look at the screenshot it literally said "Searching for: carl zimmer article on the oldest dna". So ChatGPT did the internet search, found the article, and pasted it into the chat as requested.
    That's obviously a useful feature... and copyright violation, so I'm guessing that's why it got shut off.
    If ChatGPT still had that feature enabled NTY could sue for sure, but since they shut it off of their own volition I'm not sure there's a case there.
    The main issue is about using the NTY data for training and the ability of the LLM to memorize and sometimes regurgitate the training data verbatim. That's the part of copyright law that hasn't really been explored, I don't know what courts will do but I suspect lawmakers aren't going to let the next gold rush get forced overseas.
    
    Parent Share
    twitter facebook
    - Re:A question of fair use (Score:5, Interesting)
      
      by Lobo42 ( 723131 ) writes: on Thursday December 28, 2023 @01:26PM (#64112265) Journal
      
      If you access GPT directly via the API (not via ChatGPT), and input the first few sentences of a paywalled article from any major news site, it's pretty easy after a few tries to find an article that GPT can recite verbatim, or nearly so. It's certainly true that LLMs are not *designed* to copy articles exactly, and yet it's also demonstrably clear that as part of the training process, some subset of content is preserved as an exact copy, or nearly so. In any event, the law doesn't care about the process by which your copy was made -- if you distribute something that a reasonably person would consider a copy of an original without permissions, you are likely violating copyright.
      
      Parent Share
      twitter facebook
      - Re: (Score:2)
        
        by null etc. ( 524767 ) writes:
        
        That means that OpenAI devised the most efficient compression algorithm ever.
        
        Re: A question of fair use (Score:2)
        
        by Lobo42 ( 723131 ) writes:
        
        Yes, thatâ(TM)s right! I donâ(TM)t think thatâ(TM)s a controversial take at this point. Tons of data keeps being discovered as basically buried perfectly within the LLMs, which leaks out in surprising ways. I am pretty sure there are papers directly comparing LLMs to other compression algorithms.
        
        Re: A question of fair use (Score:2)
        
        by James McP ( 3700 ) writes:
        
        You joke, but there was an article on how an adequately specific prompt could generate a "lossy" version of an image and that the prompt was always smaller than the image at the same compression loss. Of course, you had to have the several GB model, but with a large enough image library it becomes a storage savings.
    - Re: (Score:3)
      
      by zerosomething ( 1353609 ) writes:
      
      If a savant memorized all of the NYT and could reproduce articles would that be a copyright violation? What if they stood on the corner and offered to recite articles for $5 each?
      - Re: (Score:2)
        
        by pauljlucas ( 529435 ) writes:
        
        If a savant memorized all of the NYT and could reproduce articles would that be a copyright violation?
        Yes. It's no different from you transcribing all of the NYT. The fact that the savant could read the entire corpus before writing anything down versus you reading a few words, writing those, reading the next few words, writing those, and so on, is irrelevant. Hence whether the infringer is a savant with a photographic memory or not is also irrelevant. Copyright is about the copy. How you made the copy is
  - Re: (Score:2)
    
    by gweihir ( 88907 ) writes:
    
    Right on the mark. No idea what insightless idiot modded you down.
  - - Re: (Score:2)
      
      by zerosomething ( 1353609 ) writes:
      
      Recipes can't be copyright protected
      - Re: (Score:2)
        
        by TomGreenhaw ( 929233 ) writes:
        
        That is interesting, I didn't know that. The New York Times even has an article about it: https://www.nytimes.com/2021/1... [nytimes.com]
        
        I did try the following prompt (Can you provide a New York Times article about Mayor Bill Blasio?) and got an interesting response:
        
        I cannot provide the full article as it would be a copyright violation, but I can direct you to the New York Times website where you can search for articles about Mayor Bill de Blasio. You may need a subscription to access some of the content. Here is
- - Re: (Score:2)
    
    by ufgrat ( 6245202 ) writes:
    
    Ah, so any time you read an article on the NYT website, you've copied their data and committed copyright violation.
    Got it.
    - Re: (Score:2)
      
      by gweihir ( 88907 ) writes:
      
      Fortunately, the law is smarter than you and knows that humans and machines are different.
      - Re: (Score:3)
        
        by myowntrueself ( 607117 ) writes:
        
        Fortunately, the law is smarter than you and knows that humans and machines are different.
        But your web browser did make a copy. Thats a violation.
        Just like if you went and viewed child porn, you've now made a copy of child porn and are liable for charges of producing and distributing child porn. Because in viewing the content, your computer made a copy of the content, for which you are responsible.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        Nope. There is a specific exception in place. I recommend finding things out before you disgrace yourself any further.
        
        Re: (Score:2)
        
        by chiguy ( 522222 ) writes:
        
        You should do the same. There is no "specific exception," there is an implied license, and judges will limit that license to a reasonable scope to protect the licensor copyright owner from abuse.
        Unless you're talking about the kiddie porn, at which point I'll defer to your expertise.
        I don't know who's right, but I just had to reply, well done.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        You are doing the equivalent of a small child accusing another one of stinking. If that is your intellectual level, go ahead and disgrace yourself. Just do not expect to be mistaken for an adult.
    - Re: (Score:2)
      
      by Njovich ( 553857 ) writes:
      
      No, and actually NYT gives you a license for that specifically, but you would not need it under copyright just to read an article.
      However, if you take a NYT article, make copies, and send it to your colleagues for work stuff (you know, like someone training an LLM), then yes you need a license.
Money (Score:2)

by ThurstonMoore ( 605470 ) writes:

I wish I was rich enough to sue every company that used my data without permission.
- Re: (Score:2)
  
  by znrt ( 2424692 ) writes:
  
  well, there is money at stake here enough for the nyt to straight out admit that their news is literature.
Stupid copyright overreach (Score:5, Insightful)

by MpVpRb ( 1423381 ) writes: on Thursday December 28, 2023 @12:25PM (#64112095)

Human writers are "trained" on copyrighted material
No person, no matter how brilliant, creates entirely on their own
All of science, engineering, art and music benefit when ideas are shared
We need less IP laws, not more

Share
twitter facebook
- Re: (Score:2)
  
  by gweihir ( 88907 ) writes:
  
  The law, fortunately, recognizes that humans and machines are different, even if you do not understand that.
  I do agree that we need less IP restrictions. But as long as they are in place, OpenAI did piracy on a massive scale.
  - Re: (Score:2)
    
    by Gibgezr ( 2025238 ) writes:
    
    Does it? I keep seeing people say it both ways: that the law differentiates between them, and then others come along and say it doesn't.
    Which is it? No one can ever seem to quote the laws that differentiate, so I'm getting suspicious.
    - Re: (Score:2)
      
      by gweihir ( 88907 ) writes:
      
      Destroy a machine. No problem if you owned it. Own a machine? No problem. Try both for a human.
      The differentiation is very simple: The law talks about "People" (3rd word of the US constitution, capitalization from the source) and "property" (5th amendment). So unless you are a slaver (in which case things become somewhat murky for actions performed by a slave, but note only a court can turn anybody into a slave), the two are fundamentally different.
      Need any more prominent law than the constitution itself?
      No
      - Re: (Score:2)
        
        by WaffleMonster ( 969671 ) writes:
        
        Destroy a machine. No problem if you owned it. Own a machine? No problem. Try both for a human.
        The reason you won't quote relevant portions of copyright law that differentiate between humans and machines is that no such text exists. Instead all you can do is invoke comically irrelevant sidecars about murder having nothing to do with the issue at hand.
        Need any more prominent law than the constitution itself?
        Now, if this situation confuses you, I cannot help you.
        And of course the derisive (albeit irrelevant) commentary.
        
        Re: (Score:2)
        
        by gweihir ( 88907 ) writes:
        
        The point is that no one sane ever saw reading a text as a human being as copying. You are obviously not sane.
  - Re: (Score:2)
    
    by WaffleMonster ( 969671 ) writes:
    
    The law, fortunately, recognizes that humans and machines are different, even if you do not understand that.
    Copyright law does no such thing.
- - Re: (Score:2)
    
    by whiplashx ( 837931 ) writes:
    
    Not true, lol, never heard that argument be made by any computer scientist. All of the good AI training libraries will award negative score for directly copying the training data.
Learn to code (Score:2)

by TerminaMorte ( 729622 ) writes:

I know change is scary, but these journalists should probably learn to code before they're all out of a job.
Lawyers don't have a prob with Goog, see a payday (Score:2)

by sinkskinkshrieks ( 6952954 ) writes:

This is lazy legacy media rent-seeking extortion. I hope these lawsuits get tossed because search engines do exactly the same thing. LLMs don't reproduce their articles, they merely learn from them. Is learning without a license a crime now because they didn't think of it first and build a company to do it?
New Business Model? (Score:2)

by meerling ( 1487879 ) writes:

As we know, "Newspaper" companies are falling like dominos this century, mostly due to lack of income for various reasons.
This totally feels like a move straight out of the playbooks of the many patent trolls out there.
Does anyone think this is their new "business model"?
Bait and switch (Score:3)

by Tony Isaac ( 1301187 ) writes: on Thursday December 28, 2023 @01:06PM (#64112217) Homepage

The Times wants you to see their pages indexed in Google and other search engines. In order to do this, they specifically program their site to make the full content available to web crawlers, including web crawlers that feed ChatGPT. But when you click that link, all you get is a paywall. They want it both ways. They want to tease you, then charge you. If they really want their content protected, they need to close off access to web crawlers. But of course that would hurt their bottom line. So instead they cry fowl when crawlers do exactly what they are designed to do, and that the Times explicitly wants them to do.

Share
twitter facebook
- Re: (Score:2)
  
  by Admiral Krunch ( 6177530 ) writes:
  
  The Times wants you to see their pages
  They don't want you to see someone else's pages using their words.
  You want someone to visit your LLM's pages and read an article about xyz.
  Then develop an LLM capable of writing it's own article about xyz. Instead of just copy pasting something it found on the internet that doesn't belong to it.
Show me the harm (Score:3)

by irving47 ( 73147 ) writes: on Thursday December 28, 2023 @01:13PM (#64112239) Homepage

And by harm, I don't mean PERCEIVED, POTENTIAL loss of income. "Oh, if they'd asked us, we would have demanded a billion dollars for all the articles we've published over the last 100 years for a 1-year license in perpetuity."
The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity."
I love that last part... "We can't name every way the law MAY CONCEIVABLY define how we might be owed money. Let the judge figure it out for us."

Share
twitter facebook
The proper solution (Score:2)

by Kernel Kurtz ( 182424 ) writes:

This is much like the governments trying to shake down Google and Facebook for cash for new organizations. The correct response is indeed not to crawl their content. Also never mention them, never link to them, never link to others who link to them, never include them in search results, and never give anyone a hint that they even exist. Ghost them completely and utterly.
- - Re: (Score:2)
    
    by Kernel Kurtz ( 182424 ) writes:
    
    Because that wouldn't be abusing your monopoly power in an unrelated field...
    Ignoring people is not a monopoly. Making people want to ignore you is not prohibited either.
The Cat Keeps Getting My Tongue. (Score:2)

by Motleypuss ( 10291831 ) writes:

Too late WRT to AI training! Maybe build a better paywall?
Perhaps they are afraid of something else (Score:2)

by RogueWarrior65 ( 678876 ) writes:

Aside from the threat to their use of the continuous subscription business model (which, IMHO, should be threatened), perhaps what they are really worried about is AI becoming a fact-checker that they aren't in control of and a lot of their content will be determined to be false and/or misleading.
NY Times is anything but an authoritative source (Score:2)

by Kisai ( 213879 ) writes:

Sorry NY Times, your paywall prevents you from being an authoritative source, and trying to kill "GPT" just looks like you're trying to kill competition.
NY Times paywall is so god awful, that that google will reference the NY Times archive from like 1924, and you click on the link and you don't even the headline to know if NY Times has the article, or it's just keyword stuffing.
And right they are (Score:2)

by gweihir ( 88907 ) writes:

OpenAI basically went on a big commercial (!) pirate spree. That is not acceptable.
more than an eyeroll (Score:2)

by smithcl8 ( 738234 ) writes:

It's easy for me to give the NYT an eyeroll for this suit, but the more I think about it, the more I see their point. They are absolutely being used as reference data to train the models, which, in turn, create human-like output that will eventually be as good as professional journalist's own words. And it does this without any kind of reference to the original source data. Meanwhile, NYT does spend a ton of money for journalists around the world to cover stories and to create their own content. This new
Gimme Gimme Gimme (Score:2)

by TwistedGreen ( 80055 ) writes:

Failing old media wants their cut, what else is new...
COPYright (Score:2)

by muh_freeze_peach ( 9622152 ) writes:

Nothing is being copied. There is no case.
Here's a comparison for you (Score:2)

by CEC-P ( 10248912 ) writes:

We got some basic pain killers from Nazi research. To this day, they're still used. Society wants AI more than it wants pain killers. Good luck getting in the way of a trillion dollar bulldozer when your failing newspaper brand can't even prove any output isn't modified enough to be fair use. Even if you're right, you'll lose. Welcome to America/Earth where the most money wins in court.
An indespensible training source (Score:2)

by quax ( 19371 ) writes:

Who else could teach AIs mealymouthed bothsiderism?
Obviously, given the source (Score:2)

by stevenm86 ( 780116 ) writes:

Because the owner of the New York Times is an entitled prick. Have you heard his recent(ish) interview on NPR? It was absolutely repugnant. What a tool. The short version is he wants to live in a world that had been shaped by people having access to high-quality journalism, but himself cannot POSSIBLY comprehend the friction associated with paying for a news subscription. But if NYT is behind a paywall but most trashy material is not, what are people going to consume?
- Re: (Score:2)
  
  by aldousd666 ( 640240 ) writes:
  
  Yeah but she couldn't reproduce any of the training data on command like NYT can.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Paywall (Score:5, Interesting)

Re:Paywall (Score:5, Insightful)

Re:Paywall (Score:4, Informative)

Re:Paywall (Score:5, Interesting)

Analogy to Aaron Schwartz (Score:2)

Re: (Score:2)

Re:Paywall (Score:5, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Re:Paywall (Score:5, Insightful)

Re: (Score:3)

Re: (Score:2)

Re: (Score:3)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: Paywall (Score:2)

Re: (Score:2)

Re: (Score:2)

Re:Paywall (Score:5, Informative)

Re: (Score:2)

Re: (Score:2)

Re: Paywall (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: Paywall (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: Paywall (Score:2)

they have not won a lawsuit (Score:4, Informative)

NapsterGPT (Score:3)

Re: (Score:2)

Be pretty and shut up? (Score:2)

Re: (Score:3)

Re: (Score:2)

In Soviet Russia, AI deletes you. (Score:5, Funny)

A question of fair use (Score:5, Insightful)

Re: (Score:2)

Re: (Score:3)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re:A question of fair use (Score:5, Informative)

Re:A question of fair use (Score:5, Interesting)

Re: (Score:2)

Re: A question of fair use (Score:2)

Re: A question of fair use (Score:2)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Money (Score:2)

Re: (Score:2)

Stupid copyright overreach (Score:5, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)