OpenAI Accidentally Deleted Potential Evidence in New York Times Copyright Lawsuit (techcrunch.com)
An anonymous reader shares a report: Lawyers for The New York Times and Daily News, which are suing OpenAI for allegedly scraping their works to train its AI models without permission, say OpenAI engineers accidentally deleted data potentially relevant to the case. Earlier this fall, OpenAI agreed to provide two virtual machines so that counsel for The Times and Daily News could perform searches for their copyrighted content in its AI training sets.
In a letter, attorneys for the publishers say that they and experts they hired have spent over 150 hours since November 1 searching OpenAI's training data. But on November 14, OpenAI engineers erased all the publishers' search data stored on one of the virtual machines, according to the aforementioned letter, which was filed in the U.S. District Court for the Southern District of New York late Wednesday. OpenAI tried to recover the data -- and was mostly successful. However, because the folder structure and file names were "irretrievably" lost, the recovered data "cannot be used to determine where the news plaintiffs' copied articles were used to build [OpenAI's] models," per the letter. "News plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time," counsel for The Times and Daily News wrote.
You misspelled.. (Score:5, Insightful)
on purpose.
"Accidentally" my aunt fanny.
Re:You misspelled.. (Score:5, Insightful)
Maybe. Considering the problems I've read about where there was NO benefit, I'm willing to believe that it was unintentional. But it's still OpenAI's responsibility, and they need to pay all relevant expenses, including any legal expenses (extra lawyer hours), etc., more expenses for additional court time, etc., etc. And there should be notice by the court that it MAY have been intentional.
Re:You misspelled.. (Score:4, Insightful)
Maybe. Considering the problems I've read about where there was NO benefit, I'm willing to believe that it was unintentional. But it's still OpenAI's responsibility, and they need to pay all relevant expenses, including any legal expenses (extra lawyer hours), etc., more expenses for additional court time, etc., etc. And there should be notice by the court that it MAY have been intentional.
Intentional should barely factor into anything. If the deleted materials meet the definition of reasonable suspicion then it should be the full crackdown of the law with felony charges. People should be shitting themselves continuously until they have backed up and preserved it seven ways to Sunday. None of this oopsies crap should ever fly.
Re: You misspelled.. (Score:2)
Reasonable suspicion is an absurd standard that only applies to officer safety for detained subjects. So long as the lawyers are able to recreate the necessary analysis this will just be a billing problem. It gets messy if openai succeeds on defending their position because they won't likely be on the hook for opposing legal fees except for these extra ones caused by their negligence(?)
Re: (Score:2)
Yes, but they *should* be on the hook for these expenses before the trial even starts.
Re: (Score:2)
Yes, but they *should* be on the hook for these expenses before the trial even starts.
Precisely. By inaction or malicious action a crime has been committed beyond reasonable doubt.
Re: (Score:2)
Reasonable suspicion is an absurd standard that only applies to officer safety for detained subjects. So long as the lawyers are able to recreate the necessary analysis this will just be a billing problem. It gets messy if openai succeeds on defending their position because they won't likely be on the hook for opposing legal fees except for these extra ones caused by their negligence(?)
So a cop pulls me over because I’m driving erratically. After having reasonable suspicion to search the vehicle, as soon as it starts I push a button and disappear lots of items from the vehicle. That is guilt, tautologically and would be a felony if that’s how physics worked. Now I have that same device and after being advised of the search I know it has a touchy delete button and it goes off accidentally. That’s also an action that should be a felony because I knew the risk and faile
Re: (Score:3)
This is a civil case.
Re: (Score:2)
This is a civil case.
Destroying evidence is a separate criminal act. No longer a civil matter, tampering with and destruction of evidence are criminal charges.
Re: (Score:2)
Civil cases (like this one) and criminal cases have far different standards.
Re: (Score:2)
Re: (Score:2)
When the penalties for "accidentally" deleting the data will be FAR, FAR less than they would be for those from all of the IP they used without permission, "accidentally" becomes rather suspiciously convenient.
Re: (Score:3)
Being responsible, following the laws, adhering to rules, doing proper diligence, all gets in the way of making profits. What are you, some kind of commie?
Re: (Score:2)
Re: (Score:2)
Indeed. They cannot be this _extremely_ incompetent. Also, backups are a thing.
Re: (Score:2)
"They cannot be this _extremely_ incompetent."
You don't have much involvement in IT or follow IT news much, do you?
"Also, backups are a thing."
And people who don't seem to realize that backups that are not regularly restored from to prove that they work are not backups at all are also a thing.
"Accidentally" (Score:4, Insightful)
"My dog ate my homework"
Yeah, we totally believed it. /s
Did OpenAI hire teenagers to work as "engineers"? Have they ever heard of taking backups? Do they have no disaster recovery plan? Oh wait, is this their disaster recovery plan against the disaster of being sued?
Re:"Accidentally" (Score:4, Interesting)
> the disaster of being sued?
You got it, bud.
The LLM companies appear to be taking the Uber strategy - burn VC money doing something wildly illegal but wildly popular to force a legal reform.
Can I root against both these companies somehow?
Re: (Score:2)
Re: (Score:1)
So breaking the law a little bit is fine so long as no one gets hurt? It's late at night at 3am and there's no one on the streets so it's fine if I blow through all the red lights.
The copying they're doing is both against sites' Terms Of Use, is their complete works, and is for commercial purposes so none of the copyright fair use arguments apply. Why are you fine with larger companies getting away with breaking Terms Of Use while regular users get screwed when a company comes after them for even less tri
Re: (Score:2)
Of course they do. ChatGPT came up with it for them.
Re: (Score:3)
Actually in this case it's more "Teacher's dog ate your homework" and now you need to do it again.
Crime (Score:2)
That's a great way to turn a civil issue into a criminal one. Sounds like it can be resolved monetarily but they better watch those fat fingers next time.
Re: (Score:1)
Re: (Score:2)
Unfortunately, the 'whoopsie, I lost the incriminating evidence' cases are very hard to prove (which is why so many of these accidents happen...)
I would normally agree with you, but in this case (yes, pun intended) what was deleted wasn't potentially incriminating evidence, it was the results of the plaintiff's search for incriminating evidence on their system.
Re: (Score:1)
I think the dude in NY who said "I changed the password to keep the evidence safe, but then I forgot it" had the best excuse. The burden of proof is on the government.
Re: (Score:2)
The government? It is a civil case.
Re:Crime (Score:4, Funny)
Wake me when a suit from a company does jail time.
That would add a whole new - and very welcome - connotation to the word "lawsuit".
Re: (Score:2)
And maybe do minimal due diligence and not be so grossly negligent as having no backups. Seriously, nobody with any sysadmin experience can believe this was an accident.
Bad For OpenAI (Score:5, Interesting)
In most cases, failure to preserve evidence results in an adverse inference against the failing party. If that happens here, the court could instruct the jury to assume the allegations against OpenAI are true. That would end OpenAI.
I would think that people who destroy evidence have concluded that the deletion of evidence would result in a penalty less severe than if the evidence had been preserved. I suspect there are some really, really devastating proofs in that deleted data, so much so that OpenAI concluded that facing the consequences of deleting it is better than the consequences of it being made public.
Re:Bad For OpenAI (Score:5, Informative)
IANAL but I have been involved in enough discovery processes, given testimony, and seen the outcome of enough cases that I can say once you have been ordered to preserve something or facilitate discovery, if someone tells you to instead destroy evidence, that is bad advice.
This will hurt their cause in court; it will hurt their cause a lot if the judge comes to believe it was wilful.
Re:Bad For OpenAI (Score:5, Interesting)
Given the Half-Life anniversary recently there's been a lot of interesting stories about Valve coming out. One of them was related to a contractual dispute that nearly sank the company. Apparently after a relatively benign court case was started against Vivendi, they retaliated with full intent to completely destroy not just Valve, but also personally bankrupt their founders, while stealing the Half-Life IP. Turns out that during the discovery phase, in a malicious compliance move, they released an insane amount of Korean language rubbish, virtually dumping all communications from the company into Valve's lap, hoping to overwhelm them with both the language and the amount of documents, running down any money Valve had to fight the suit.
Why is this relevant? In that trove of rubbish a Korean speaking intern found a single email that was instructing someone at Vivendi to destroy information related to the trial. The judge ultimately threw out every counterclaim from Vivendi, found for Valve in all claims, and terminated their agreements without penalty to Valve. Basically the email instructing documents to be destroyed caused the case to be summarily decided against them and saved Valve and Newell from going bankrupt.
https://www.gamedeveloper.com/... [gamedeveloper.com]
Re: (Score:2)
Except literally nobody wants there to be an assumption. Per the article, "The plaintiffs’ counsel makes clear that they have no reason to believe the deletion was intentional."
Usually the goal is to win the case. Unless you want to set a precedent. You want rock solid proof, not any sort of default assumption. Because you then use this to go after everyone else doing the same thing.
Re:Bad For OpenAI (Score:4, Interesting)
In reality, liquidating OpenAI would leave nothing but pocket change in terms of real $$$ to split up. The old publishing houses are suing in order to gain a share of the potential future $$$, not shut them down.
But deleting court-protected data was NOT a smart move. OpenAI just handed the publishing houses a MUCH longer lever to use as they try to extract $$$. The payout cost just went up.
Re: (Score:1)
Re: (Score:2)
Web content is not public domain. There is no explicit or implicit license for the literal copy into the training set and the only exemption possible is fair use.
It's a plain and simple copyright issue for literal copies of registered works. Registered so no need to prove damages even for statutory fines. Fair use or bust.
Re: (Score:3)
If it was discoverable, then failure to protect that information is as much as admitting the plaintiff's claims. This could lead to such wonderful things as summary judgment, which should happen. Generative AI companies need to understand this and while the amount of data may be in the petabyte range, it still has to be preserved.
Re: (Score:1)
On the other hand, as a user of generative AI you get all the p
Re: (Score:2)
I don't think articles of the NYT that they create and retain copyright are a "hoarder." I was bringing up the legal ramifications, even if unintentional, of failing to preserve evidence when directed by a judicial proceeding. If you don't like corporations holding copyrights, then leave it to the original authors or contributors but then that creates other issues. There's also legal precedents for E-Discovery and handling of data, the casework is huge, but it's well known to preserve everything that is cit
Re: (Score:2)
Never attribute to malice...
This doesn't really pass a smell test. Not just the fact that they only deleted data from one virtual machine, but they didn't delete source data just the work completed, meaning that work can with some effort be redone, and they also attempted to recover the data.
If this was a coverup it was truly an incompetent one.
Re: (Score:2)
I think everyone aside from OpenAI would love to be surprised and see that a massive $billion firm is held to the same standard, and punished the same way that normal people would be punished if we destroyed material evidence in a case.
Re:Bad For OpenAI (Score:4, Insightful)
In most cases, failure to preserve evidence results in an adverse inference against the failing party. If that happens here, the court could instruct the jury to assume the allegations against OpenAI are true. That would end OpenAI.
I would think that people who destroy evidence have concluded that the deletion of evidence would result in a penalty less severe than if the evidence had been preserved. I suspect there are some really, really devastating proofs in that deleted data, so much so that OpenAI concluded that facing the consequences of deleting it is better than the consequences of it being made public.
Read the summary more closely.
OpenAI gave the NYTimes a couple VMs so its experts could search through OpenAI's training data.
OpenAI accidentally deleted some of the analysis that the experts generated, but the original training data is still there.
The only consequence to this is the experts need to spend some time regenerating that analysis. As "failure to preserve evidence" goes, this is largely the equivalent of accidentally knocking over a stack of papers on someone's desk.
Re: (Score:2)
I suspect there are some really, really devastating proofs in that deleted data, so much so that OpenAI concluded that facing the consequences of deleting it is better than the consequences of it being made public.
That would need to be some evidence. But yes, it is possible. As this deletion looks extremely bad and may well have them lose the case, I suspect it would need to be at least evidence of serious criminal wrongdoing on "people go to prison" level.
ChatGPT Did it! (Score:2)
Re: (Score:2)
Oopsie (Score:2)
Quote (Score:3)
ha ha! I'd say that too! (Score:2)
happens all the time, especially when it could be damaging to us.
Sloppy at best (Score:3)
I'd honestly be curious to know what OpenAI's IT operations look like. Charitably presuming that it's not just outright destruction of evidence; it sounds like they've got eleventy-zillion dollars worth of 'AI' specialists and enough data scrapers to plunder the internet twice over and some engineers keeping the APIs running and so on; but the boring-beige-IT-computer-janitor-nonsense maturity of a vastly smaller company.
Heck, I worked for a random town's school department over a decade ago and they had backup policies set up that included the ability to mark things as under litigation holds and exempt from deletion through normal manual or scheduled actions (usually pretty low stakes stuff; but at least once or twice a year a parent would get really fighty about an IEP or something, and litigation is litigation in terms of data preservation).
All those big brains (Score:2)
And no one so much as implemented, much less followed, 3 2 1 backup schemes?
Of course it was :| (Score:5, Insightful)
Until the penalties for destroying evidence exceed the penalties for the crimes they are accused of, this will always be a thing.
Companies simply weigh which is the lesser of two evils and go with that.
That said, if OpenAI is so incompetent with data they've been ordered to retain, imagine how incompetent they will be with any of the data they will collect and store on you.
Re: (Score:2)
Hashtag (Score:2)
#OOOPS
Hanlon's Razor (Score:2)
On the other hand, all real-world evidence coming from OpenAI strongly indicates that no one should give them the benefit of the doubt.
Didn't delete the evidence (Score:1)
Didn't read the article, but from the summary it seems OpenAI only deleted some of the analysis results of the evidence. The evidence (OpenAI's training data potentially containing NYT IP) is still intact; the foul up just means the search for NYT IP in that evidence has to be redone.
Re:Didn't delete the evidence (Score:5, Insightful)
meh, hand of God made them do it (Score:2)
Unbelievable, unretrievable and not even the devil could get that lucky.
OpenAI version of magical realism for an excuse
Self preservation ... (Score:2)