The Courts

OpenAI Accidentally Deleted Potential Evidence in New York Times Copyright Lawsuit (techcrunch.com)

An anonymous reader shares a report: Lawyers for The New York Times and Daily News, which are suing OpenAI for allegedly scraping their works to train its AI models without permission, say OpenAI engineers accidentally deleted data potentially relevant to the case. Earlier this fall, OpenAI agreed to provide two virtual machines so that counsel for The Times and Daily News could perform searches for their copyrighted content in its AI training sets.

In a letter, attorneys for the publishers say that they and experts they hired have spent over 150 hours since November 1 searching OpenAI's training data. But on November 14, OpenAI engineers erased all the publishers' search data stored on one of the virtual machines, according to the aforementioned letter, which was filed in the U.S. District Court for the Southern District of New York late Wednesday. OpenAI tried to recover the data -- and was mostly successful. However, because the folder structure and file names were "irretrievably" lost, the recovered data "cannot be used to determine where the news plaintiffs' copied articles were used to build [OpenAI's] models," per the letter. "News plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time," counsel for The Times and Daily News wrote.


  • You misspelled.. (Score:5, Insightful)

    by The Faywood Assassin ( 542375 ) <benyjr AT yahoo DOT ca> on Thursday November 21, 2024 @09:07AM (#64962101) Homepage

    on purpose.

    "Accidentally" my aunt fanny.

    • by HiThere ( 15173 ) <.charleshixsn. .at. .earthlink.net.> on Thursday November 21, 2024 @09:20AM (#64962147)

      Maybe. Considering the problems I've read about where there was NO benefit, I'm willing to believe that it was unintentional. But it's still OpenAI's responsibility, and they need to pay all relevant expenses, including any legal expenses (extra lawyer hours), etc., more expenses for additional court time, etc., etc. And there should be notice by the court that it MAY have been intentional.

      • by burtosis ( 1124179 ) on Thursday November 21, 2024 @10:33AM (#64962343)

        Maybe. Considering the problems I've read about where there was NO benefit, I'm willing to believe that it was unintentional. But it's still OpenAI's responsibility, and they need to pay all relevant expenses, including any legal expenses (extra lawyer hours), etc., more expenses for additional court time, etc., etc. And there should be notice by the court that it MAY have been intentional.

        Intentional should barely factor into anything. If the deleted materials meet the definition of reasonable suspicion then it should be the full crackdown of the law with felony charges. People should be shitting themselves continuously until they have backed up and preserved it seven ways to Sunday. None of this oopsies crap should ever fly.

        • Reasonable suspicion is an absurd standard that only applies to officer safety for detained subjects. So long as the lawyers are able to recreate the necessary analysis this will just be a billing problem. It gets messy if openai succeeds on defending their position because they won't likely be on the hook for opposing legal fees except for these extra ones caused by their negligence(?)

          • by HiThere ( 15173 )

            Yes, but they *should* be on the hook for these expenses before the trial even starts.

            • Yes, but they *should* be on the hook for these expenses before the trial even starts.

              Precisely. By inaction or malicious action a crime has been committed beyond reasonable doubt.

          • Reasonable suspicion is an absurd standard that only applies to officer safety for detained subjects. So long as the lawyers are able to recreate the necessary analysis this will just be a billing problem. It gets messy if openai succeeds on defending their position because they won't likely be on the hook for opposing legal fees except for these extra ones caused by their negligence(?)

            So a cop pulls me over because I’m driving erratically. After having reasonable suspicion to search the vehicle, as soon as it starts I push a button and disappear lots of items from the vehicle. That is guilt, tautologically and would be a felony if that’s how physics worked. Now I have that same device and after being advised of the search I know it has a touchy delete button and it goes off accidentally. That’s also an action that should be a felony because I knew the risk and faile

        • If the deleted materials meet the definition of reasonable suspicion then it should be the full crackdown of the law with felony charges.

          Have there ever been felony charges for deleting data in a civil case?

      • When the penalties for "accidentally" deleting the data will be FAR, FAR less than they would be for those from all of the IP they used without permission, "accidentally" becomes rather suspiciously convenient.

      • Being responsible, following the laws, adhering to rules, doing proper diligence, all gets in the way of making profits. What are you, some kind of commie?

    • Absolutely.
    • by gweihir ( 88907 )

      Indeed. They cannot be this _extremely_ incompetent. Also, backups are a thing.

      • "They cannot be this _extremely_ incompetent."

        You don't have much involvement in IT or follow IT news much, do you?

        "Also, backups are a thing."

        And people who don't seem to realize that backups which are never test-restored to prove they work are not backups at all are also a thing.

        • by gweihir ( 88907 )

          You overlook how extremely important it was to get this one right. If you give this to ordinary IT ops, you have already failed massively. Looks like _you_ would have done exactly that.

    • by tlhIngan ( 30335 )

      On purpose.

      "Accidentally" my aunt fanny.

      You'd think so, but it's in the original filing that the plaintiffs have said they acknowledge it to be a legitimate accidental error on OpenAI's part and not malicious.

      The stuff that got deleted was work they were using during discovery - they were provided a VM to find their contents. The VM was reverted back to the original state so they lost the results of their research.

      The filing was basically to ask the court for more time because this happened and the work tha

  • "Accidentally" (Score:5, Insightful)

    by khchung ( 462899 ) on Thursday November 21, 2024 @09:09AM (#64962109) Journal

    "My dog ate my homework"

    Yeah, we totally believed it. /s

    Did OpenAI hire teenagers to work as "engineers"? Have they ever heard of taking backups? Do they have no disaster recovery plan? Oh wait, is this their disaster recovery plan against the disaster of being sued?

    • Re:"Accidentally" (Score:4, Interesting)

      by bill_mcgonigle ( 4333 ) * on Thursday November 21, 2024 @09:14AM (#64962125) Homepage Journal

      > the disaster of being sued?

      You got it, bud.

      The LLM companies appear to be taking the Uber strategy - burn VC money doing something wildly illegal but wildly popular to force a legal reform.

      Can I root against both these companies somehow?

      • Fake hiring without paying taxes is on a different level than doing statistical analysis of text. Nobody could prove harm from generative AI "infringement" in court so far. I mean, even if you wanted to recreate the whole training set from the model, it would not be possible. Or any long text, impossible to recreate. If someone wanted to infringe copyright, we have internet for free copying that works 1000x better than AI models, and it copies things precisely. Who would prefer to read the LLM hallucinated
        • So breaking the law a little bit is fine so long as no one gets hurt? It's late at night at 3am and there's no one on the streets so it's fine if I blow through all the red lights.

          The copying they're doing is both against sites' Terms Of Use, is their complete works, and is for commercial purposes so none of the copyright fair use arguments apply. Why are you fine with larger companies getting away with breaking Terms Of Use while regular users get screwed when a company comes after them for even less tri

    • Do they have no disaster recovery plan?

      Of course they do. ChatGPT came up with it for them.

    • Actually in this case it's more "Teacher's dog ate your homework" and now you need to do it again.

  • That's a great way to turn a civil issue into a criminal one. Sounds like it can be resolved monetarily but they better watch those fat fingers next time.

    • Unfortunately, the 'whoopsie, I lost the incriminating evidence' cases are very hard to prove (which is why so many of these accidents happen...)
      • by cob666 ( 656740 )

        Unfortunately, the 'whoopsie, I lost the incriminating evidence' cases are very hard to prove (which is why so many of these accidents happen...)

        I would normally agree with you, but in this case (yes, pun intended) what was deleted wasn't potentially incriminating evidence, it was the results of the plaintiff's search for incriminating evidence on their system.

      • I think the dude in NY who said "I changed the password to keep the evidence safe, but then I forgot it" had the best excuse. The burden of proof is on the government.

    • by gweihir ( 88907 )

      And maybe do minimal due diligence and not be so grossly negligent as having no backups. Seriously, nobody with any sysadmin experience can believe this was an accident.

  • Bad For OpenAI (Score:5, Interesting)

    by StormReaver ( 59959 ) on Thursday November 21, 2024 @09:16AM (#64962133)

    In most cases, failure to preserve evidence results in an adverse inference against the failing party. If that happens here, the court could instruct the jury to assume the allegations against OpenAI are true. That would end OpenAI.

    I would think that people who destroy evidence have concluded that the deletion of evidence would result in a penalty less severe than if the evidence had been preserved. I suspect there are some really, really devastating proofs in that deleted data, so much so that OpenAI concluded that facing the consequences of deleting it is better than the consequences of it being made public.

    • Re:Bad For OpenAI (Score:5, Informative)

      by DarkOx ( 621550 ) on Thursday November 21, 2024 @09:35AM (#64962197) Journal

      IANAL but I have been involved in enough discovery processes, given testimony, and seen the outcome of enough cases that I can say once you have been ordered to preserve something or facilitate discovery, if someone tells you to instead destroy evidence, that is bad advice.

      This will hurt their cause in court; it will hurt their cause a lot if the judge comes to believe it was willful.

      • Re:Bad For OpenAI (Score:5, Interesting)

        by thegarbz ( 1787294 ) on Thursday November 21, 2024 @02:03PM (#64962963)

        Given the Half-Life anniversary recently there's been a lot of interesting stories about Valve coming out. One of them was related to a contractual dispute that nearly sank the company. Apparently after a relatively benign court case was started against Vivendi, Vivendi retaliated with full intent to completely destroy not just Valve, but also personally bankrupt its founders, while stealing the Half-Life IP. Turns out that during the discovery phase, in a malicious compliance move, they released an insane amount of Korean-language rubbish, virtually dumping all communications from the company into Valve's lap, hoping to overwhelm them with both the language and the volume of documents and run down any money Valve had to fight the suit.

        Why is this relevant? In that trove of rubbish a Korean-speaking intern found a single email that was instructing someone at Vivendi to destroy information related to the trial. The judge ultimately threw out every counterclaim from Vivendi, found for Valve on all claims, and terminated their agreements without penalty to Valve. Basically the email instructing documents to be destroyed caused the case to be summarily decided against them and saved Valve and Newell from going bankrupt.

        https://www.gamedeveloper.com/... [gamedeveloper.com]

    • Except literally nobody wants there to be an assumption. Per the article, "The plaintiffs’ counsel makes clear that they have no reason to believe the deletion was intentional."

      Usually the goal is to win the case. Unless you want to set a precedent. You want rock solid proof, not any sort of default assumption. Because you then use this to go after everyone else doing the same thing.

    • Re:Bad For OpenAI (Score:4, Interesting)

      by hdyoung ( 5182939 ) on Thursday November 21, 2024 @09:53AM (#64962241)
      It wouldn't end OpenAI. This is all about $$$. OpenAI scraped a ton of protected info because it would help them make $$$, and the legal owners of that info want their share of the $$$. OpenAI is essentially a huge linear algebra matrix, connected to the internet by a bit of clever programming, supported by a small team of humans. Their assets are as intangible as it gets, but investors are irrational and have priced the company based on the assumption that it will eventually dominate “all the business on the planet”.

      In reality, liquidating OpenAI would leave nothing but pocket change in terms of real $$$ to split up. The old publishing houses are suing in order to gain a share of the potential future $$$, not shut them down.

      But deleting court-protected data was NOT a smart move. OpenAI just handed the publishing houses a MUCH longer lever to use as they try to extract $$$. The payout cost just went up.
      • I bet NYT will still lose. What they are after is copyright over abstract ideas. That would kill creativity. Copyright is a dead man walking, it means nothing to generative models. You can't reverse the trend, we used to be passive consumers of books, music, radio and TV. Now we are interactive, we prefer to play games, use social networks and web search. We create the content we are reading, like here on /. This process has been going on for 25 years, generative AI is just the last nail in the coffin. Auth
        • Web content is not public domain. There is no explicit or implicit license for the literal copy into the training set and the only exemption possible is fair use.

          It's a plain and simple copyright issue for literal copies of registered works. Registered so no need to prove damages even for statutory fines. Fair use or bust.

    • If it was discoverable, then failure to protect that information is as much as admitting the plaintiff's claims. This could lead to such wonderful things as summary judgment, which should happen. Generative AI companies need to understand this and while the amount of data may be in the petabyte range, it still has to be preserved.

      • Why are you rooting for copyright hoarders? Authors have been unable to make a living off royalties for a very long time. Books, music, art - they are all incapable of paying for a decent living. Instead we have corporations collecting as much of these copy rights as they can. And they use it for ad revenue which leads to shit content, just attention grabbers. The high quality work is now done collaboratively in open source and public domain.

        On the other hand, as a user of generative AI you get all the p
        • I don't think articles of the NYT that they create and retain copyright are a "hoarder." I was bringing up the legal ramifications, even if unintentional, of failing to preserve evidence when directed by a judicial proceeding. If you don't like corporations holding copyrights, then leave it to the original authors or contributors but then that creates other issues. There's also legal precedents for E-Discovery and handling of data, the casework is huge, but it's well known to preserve everything that is cit

    • Never attribute to malice...

      This doesn't really pass a smell test. Not just the fact that they only deleted data from one virtual machine, but they didn't delete source data just the work completed, meaning that work can with some effort be redone, and they also attempted to recover the data.

      If this was a coverup it was truly an incompetent one.

    • I think everyone aside from OpenAI would love to be surprised and see that a massive $billion firm is held to the same standard, and punished the same way that normal people would be punished if we destroyed material evidence in a case.

    • Re:Bad For OpenAI (Score:4, Insightful)

      by quantaman ( 517394 ) on Thursday November 21, 2024 @11:50AM (#64962583)

      In most cases, failure to preserve evidence results in an adverse inference against the failing party. If that happens here, the court could instruct the jury to assume the allegations against OpenAI are true. That would end OpenAI.

      I would think that people who destroy evidence have concluded that the deletion of evidence would result in a penalty less severe than if the evidence had been preserved. I suspect there are some really, really devastating proofs in that deleted data, so much so that OpenAI concluded that facing the consequences of deleting it is better than the consequences of it being made public.

      Read the summary more closely.

      OpenAI gave the NYTimes a couple VMs so its experts could search through OpenAI's training data.

      OpenAI accidentally deleted some of the analysis that the experts generated, but the original training data is still there.

      The only consequence of this is that the experts need to spend some time regenerating that analysis. As "failure to preserve evidence" goes, this is largely the equivalent of accidentally knocking over a stack of papers on someone's desk.

    • by gweihir ( 88907 )

      I suspect there are some really, really devastating proofs in that deleted data, so much so that OpenAI concluded that facing the consequences of deleting it is better than the consequences of it being made public.

      That would need to be some evidence. But yes, it is possible. As this deletion looks extremely bad and may well have them lose the case, I suspect it would need to be at least evidence of serious criminal wrongdoing on "people go to prison" level.

  • The AI has a mind of its own! That's Open AI's excuse.
    • Maybe it was the very same NYT articles that gave the AI this idea.... we need to investigate this for 10 years to learn.
  • Teehee
  • by fluffernutter ( 1411889 ) on Thursday November 21, 2024 @09:47AM (#64962221)
    I'm surprised the author of this resisted the urge to surround accidentally with quotes.
  • Ooopsie. I'm so sorry that relevant evidence was destroyed.
    happens all the time, especially when it could be damaging to us. .... chuckles...
  • by fuzzyfuzzyfungus ( 1223518 ) on Thursday November 21, 2024 @10:15AM (#64962287) Journal
    Even if this was actually accidental rather than 'accidental'; it's a bad look. Someone just deleted litigation-critical data? No data classification? No "let the retention policy handle the deletion instead of just cowboy purging stuff"? And then they tried to recover the data in a way that lost all file and folder names, rather than just restoring backups from your backup mechanism?

    I'd honestly be curious to know what OpenAI's IT operations look like. Charitably presuming that it's not just outright destruction of evidence; it sounds like they've got eleventy-zillion dollars worth of 'AI' specialists and enough data scrapers to plunder the internet twice over and some engineers keeping the APIs running and so on; but the boring-beige-IT-computer-janitor-nonsense maturity of a vastly smaller company.

    Heck, I worked for a random town's school department over a decade ago and they had backup policies set up that included the ability to mark things as under litigation holds and exempt from deletion through normal manual or scheduled actions (usually pretty low stakes stuff; but at least once or twice a year a parent would get really fighty about an IEP or something, and litigation is litigation in terms of data preservation).
  • And no one so much as implemented, much less followed, 3 2 1 backup schemes?

  • by nehumanuscrede ( 624750 ) on Thursday November 21, 2024 @10:43AM (#64962369)

    Until the penalties for destroying evidence exceed the penalties for the crimes they are accused of, this will always be a thing.

    Companies simply weigh which is the lesser of two evils and go with that.

    That said, if OpenAI is so incompetent with data they've been ordered to retain, imagine how incompetent they will be with any of the data they will collect and store on you.

    • Exactly, this speaks more about OpenAI's incompetence than anything else. They only destroyed the lawyers' work (which can be redone) so no evidence is destroyed, but how can you trust that OpenAI provided all of the files used for training if they can't even perform a backup properly?
  • #OOOPS

  • Hanlon's Razor [wikipedia.org] suggests that yes, this was an accident due to incompetence.

    On the other hand, all real-world evidence coming from OpenAI strongly indicates that no one should give them the benefit of the doubt.
  • Didn't read the article, but from the summary it seems OpenAI only deleted some of the analysis results of the evidence. The evidence (OpenAI's training data potentially containing NYT IP) is still intact; the foul up just means the search for NYT IP in that evidence has to be redone.

  • Unbelievable, unretrievable and not even the devil could get that lucky.

    OpenAI version of magical realism for an excuse

  • ... is one sign of intelligence. The AI deleted the stuff on its own.
