Google Hit With Lawsuit Alleging It Stole Data From Millions of Users To Train Its AI Tools (cnn.com) 46
"CNN reports on a wide-ranging class action lawsuit claiming Google scraped and misused data to train its AI systems," writes long-time Slashdot reader david.emery. "This goes to the heart of what can be done with information that is available over the internet." From the report: The complaint alleges that Google "has been secretly stealing everything ever created and shared on the internet by hundreds of millions of Americans" and using this data to train its AI products, such as its chatbot Bard. The complaint also claims Google has taken "virtually the entirety of our digital footprint," including "creative and copywritten works" to build its AI products. The complaint points to a recent update to Google's privacy policy that explicitly states the company may use publicly accessible information to train its AI models and tools such as Bard.
In response to an earlier Verge report on the update, the company said its policy "has long been transparent that Google uses publicly available information from the open web to train language models for services like Google Translate. This latest update simply clarifies that newer services like Bard are also included." [...] The suit is seeking injunctive relief in the form of a temporary freeze on commercial access to and commercial development of Google's generative AI tools like Bard. It is also seeking unspecified damages and payments as financial compensation to people whose data was allegedly misappropriated by Google. The firm says it has lined up eight plaintiffs, including a minor. "Google needs to understand that 'publicly available' has never meant free to use for any purpose," Tim Giordano, one of the attorneys at Clarkson bringing the suit against Google, told CNN in an interview. "Our personal information and our data is our property, and it's valuable, and nobody has the right to just take it and use it for any purpose."
The plaintiffs, the Clarkson Law Firm, previously filed a similar lawsuit against OpenAI last month.
Google must be arrested. (Score:2)
Somehow corporations can't commit crimes? (Score:3, Informative)
People, not things, commit crimes. (Score:2)
A "corporation" in legal code, is approximately equivalent to a Virtual Interface in programming. It is only a "person" for the sake of torts (lawsuits, contracts, and the like), not criminal law. And despite the rhetoric you hear from the idiot left (admittedly not quite so numerous as the idiot right), this is a very good thing.
Just for example, say you're trying to build a high rise. If you know anything about buildings, you know this involves all sorts of people, firms, specialists, and groups. You've g
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Individual responsibility doesn't work here. Let's take dumping benzene. John Smith is a truck driver. He needs his job to help support his family. His employer has deliberately set things in a way that he needs to dump the benzene in the river to meet the numbers he needs to keep his job. If John stays legal, he gets fired, and Jim is hired to do the same thing. Due to the corporate policy, and the lack of strong labor unions, the company will find someone to dump the benzene in the river, whether i
Re: (Score:1)
Re: (Score:2)
If you steal a pack of gum from Walmart, you can be arrested, and you would be charged under criminal law.
If a corporation defrauds millions of people, $10 each, and makes hundreds of millions of dollars in ill-gotten gains ... they can only be sued under civil law ... and are even able to write themselves out of that in their fine print.
Only one of those so-called people have the potential of being locked in a cage, will for the rest of their lives have trouble renting an apartm
Re: (Score:2)
The reason is that if a collective of people accomplish something good, you'd like them all to be recognized for the accomplishment, and possibly profit. If a collective of people accomplish something bad, the more difficult task of apportioning blame appropriately, in ratio to level of individual culpability, becomes necessary, and in many cases, it's easy to deflect blame within a group.
Re: (Score:2)
Corporations were created to shield rich people from their decisions when those decisions end up being detrimental to a whole larger segment of society. "Wasn't me, your honor. It was the corporation." And for whatever reason, modern governments have pretty much just let themselves be absorbed into the monster that is corporate control, rather than trying to regulate and keep them in check.
Re: (Score:1)
Guilty (Score:1)
Re: (Score:2)
"Copywritten" is a legitimate word, being an adjective derived from copywriting [wikipedia.org]. Of course, it means something completely different to "copyrighted".
You can't steal (Score:3, Interesting)
publicly available data
Re:You can't steal (Score:5, Insightful)
publicly available data
Sure you can. Many songs, movies, photos, articles, etc. are publicly available for viewing on the internet. Even though viewing is unlimited, deriving new commercial works based on that content is prohibited by copyright laws. We'll see if a court accepts a fair use claim for the content.
Re: (Score:2)
> deriving new commercial works based on that content is prohibited by copyright laws.
If deriving is not allowed, why do we have so many zombie movies?
Re: (Score:3)
You can steal FOSS.
Re: The stupidity of.... (Score:3)
I believe the cases are about reproducing the data, not merely referencing it
AI devs got greedy and lazy with the training data (Score:5, Informative)
I've said it before and I'll say it again. AI devs got greedy and lazy with the training data.
* Get copyrighted books wholesale in the training set taken from KNOWN pirate (e)books sites? Check.*
* Scan code without checking that the licenses allow it (Like BSD, MIT, Apache or MPL)? Check.
* Scan/scrape text from the internet without making sure that the licenses allow it (like, say, Creative Commons)? Check.
* Scan material from sites using free APIs or scraping tools without negotiating a license from the site owners? Check.
So, instead of curating the dataset, they lifted it, much of it wholesale, without the proper credit/negotiation/care.
Now, slowly but deftly, the chickens are coming home to roost.
Couple that with the fact that part of the data used to train LLMs was SEO crap, and most likely data that will be used to train future generation models will use crap generated by past LLMs , and this is a recipe for "a bad experience" tm
Let's hope that these guys do a better job curating the training data for the newer LLMs like Llama2, ChatGPT4, Bard2 and such.
One can only dream, right?
* I bring this topic first because, with the headlines being so clickbaity, no one noticed that Sarah Silverman was only part of a group of writers suing, and that part of the lawsuit claimed that their books were "ingested" by the AI from pirate ebook sites, without any form of compensation (not even buying one copy) for the authors. For crying out loud, even "Reader's Digest" has their ducks in a row regarding copyright and compensation before making their abridged versions.
Re: (Score:1)
You're way way stretching the definition of a derivative work.
An LLM is not a derivative work of some random copyrighted work that it was trained on.
By that argument, google search is a derivative work because it cannot do a search without going through the data.
Re: (Score:2)
Yes, he's totally missing the point. Google used information that was published publicly, for public use, in the training of their models. Next, we'll have people insisting that it's ok to READ publicly-available information, but only if it's read for the correct purpose.
Re: (Score:2)
> Get copyrighted books wholesale in the training set taken from KNOWN pirate (e)books sites? Check.*
You never heard of Google Books, where they scanned thousands of books? So they are the last company that would need to use a pirate site for this and, more importantly, they most likely have more books than pirate sites.
> Scan code without checking that the licenses allow it (Like BSD, MIT, Apache or MPL)? Check.
I don't understand why reading code and learning from it and using that knowledge to write new
Web Scraping (Score:5, Insightful)
Web scraping is not stealing, and it is not illegal. It may be a violation of terms of use agreements, but that requires an agreement be made.
Publicly available information is just that. If you published it on the www, you made it available to others to read and use. You may have a copyright claim if your published work (published on the www is published) is re-published (in whole or in part), but you cannot claim copyright on facts. Information is not copyrightable, only the expression of the information as a creative work.
Information about you is not your information. Personal data is the data you personally possess. If you want to keep information about you private, keep it to yourself. Once you tell someone a secret, it is theirs to keep or share.
Re: (Score:3)
In Europe your personal information very much is yours, and even if you tell it to someone they are still bound to only use it in ways absolutely necessary to provide a service, or in ways that you have explicitly and freely given them permission to.
As for information published publicly, you may have read "all rights reserved" in books. Obviously reading the book is not illegal, and nor is using knowledge you learned from the book. But doing other things with it, like making a complete copy and giving it to
Re: (Score:2)
Re: (Score:2)
Web scraping is not stealing, and it is not illegal. It may be a violation of terms of use agreements
Creating derived works from scraped content is copyright infringement. Whether or not using scraped data to train AI models constitutes creation of a derived work is something courts and/or legislative bodies are going to have to decide. That's basically the question at issue in this lawsuit.
Re: (Score:2)
Creating derived works from scraped content is copyright infringement.
I should have been a little more precise here. I should have said:
Creating derived works from scraped content that is copyrighted and for which the scraper/user does not have a license to create derived works is copyright infringement.
Re: (Score:2)
...And I don't see how it's decidable, because you would have to show that significant parts of the original work were found in the derived work. But there is no one derived work, only the potential for a derived work. In addition, the original work is nowhere to be found inside the bot, only millions or billions or trillions of weights on fragments and relationships.
Re: (Score:2)
Agreed. IMO, I think this is exactly analogous to the way human creators absorb large amounts of content and then synthesize new things which may have obvious influences but are considered new expressions. In some cases human creators create expressions that are so obviously similar to something they saw that it does cross the line into infringement. I don't see how we can treat AI any differently. The mere training can't be considered creation of a derived work. Using the resulting AI may cause it to gener
How dare they ? (Score:2)
One User who Doesn't Care. (Score:1)
robots.txt you imbeciles (Score:1)
If you had a robots.txt file that blocked google, Google would have been blocked. I hope you hired those lawyers on a commission basis.
Re: (Score:2)
you will get like $100-$400 for this like the othe (Score:2)
you will get like $100-$400 for this like the other pay outs from big tech.
Link to the complaint (Score:1)
AI vs humans (Score:2)
Re: (Score:2)
Because we simply assume that knowledge granted to humans will make our collective lives a tiny bit better.
With AI, we know it will certainly make someone richer. That's about it.
pffff... (Score:2)
If it's publicly accessible to any browser, it isn't stealing. What does it matter what data these AI systems are trained on, least of all when it is publicly available data which anybody can access with a simple browser?
The plaintiffs need to read some documentation (Score:3)
Wayyyy back in 1994, Martijn Koster wrote about the rather narrow-minded Robots Exclusion Protocol. A protocol that exists solely for human-supremacist content owners to be able to make a public declaration that robots are not welcome to read their stuff. ...so for thirty years any speciesists who didn't want an AI reading their content have had an easy way to prevent such from happening. No one has any excuse for getting upset and suing now.
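For anyone curious what that public declaration looks like in practice, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt contents, the example.com URLs, and the is_allowed helper are all made up for illustration. Keep in mind the protocol is advisory: this only shows what a crawler that chooses to honor the file would conclude, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler honoring robots_txt may fetch url.

    is_allowed is an illustrative helper, not part of any library.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeOtherBot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.com/my-post")
        print(f"{agent}: {'allowed' if verdict else 'blocked'}")
```

The rules are matched per user-agent token, so a site owner can only exclude the crawlers they know to name (or everyone, via the * entry).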
Secret?? (Score:1)
I don't know that this is illegal, or wrong. (Score:2)
I don't see it as that clear-cut.
Artists keep looking at other works in their field, and learning from them. All art is derivative to some extent, and a lot of it is copyrighted. I can have a brilliant idea for a story, and I'll still write dialogue and descriptions in a way influenced by everything else I've read. Alternatively, I can write a story with no significant new ideas, taking elements of plot from various other sources. It'll be unpublishable (I hope) but legal. That seems to me to be com
Do you actually care? (Score:2)
A great example is to look at email, if you encrypt and sign your emails with PGP, then at the very least it shows you care about email based communication. If you don't, then the chances you care about any communica