The Courts AI Google

Google Hit With Lawsuit Alleging It Stole Data From Millions of Users To Train Its AI Tools (cnn.com) 46

"CNN reports on a wide-ranging class action lawsuit claiming Google scraped and misused data to train its AI systems," writes long-time Slashdot reader david.emery. "This goes to the heart of what can be done with information that is available over the internet." From the report: The complaint alleges that Google "has been secretly stealing everything ever created and shared on the internet by hundreds of millions of Americans" and using this data to train its AI products, such as its chatbot Bard. The complaint also claims Google has taken "virtually the entirety of our digital footprint," including "creative and copywritten works" to build its AI products. The complaint points to a recent update to Google's privacy policy that explicitly states the company may use publicly accessible information to train its AI models and tools such as Bard.

In response to an earlier Verge report on the update, the company said its policy "has long been transparent that Google uses publicly available information from the open web to train language models for services like Google Translate. This latest update simply clarifies that newer services like Bard are also included." [...] The suit is seeking injunctive relief in the form of a temporary freeze on commercial access to and commercial development of Google's generative AI tools like Bard. It is also seeking unspecified damages and payments as financial compensation to people whose data was allegedly misappropriated by Google. The firm says it has lined up eight plaintiffs, including a minor.
"Google needs to understand that 'publicly available' has never meant free to use for any purpose," Tim Giordano, one of the attorneys at Clarkson bringing the suit against Google, told CNN in an interview. "Our personal information and our data is our property, and it's valuable, and nobody has the right to just take it and use it for any purpose."

The plaintiffs' law firm, the Clarkson Law Firm, previously filed a similar lawsuit against OpenAI last month.
This discussion has been archived. No new comments can be posted.

  • Google must go to federal prison and serve no less than three consecutive life sentences.
    • I always thought it was kinda funny that corporations could be legally treated as people, except for if they kill people, dump benzene in the river, or result in a bunch of injuries they appear to be immune or highly resistant to normal criminal penalties. Where is my corporate death penalty or life sentence? Now obviously they could suffer penalties but that sure feels theoretical at this point, eh? We all know corporations almost never pay for their crimes, even though it's just a group of gangsters calli
      • A "corporation" in legal code, is approximately equivalent to a Virtual Interface in programming. It is only a "person" for the sake of torts (lawsuits, contracts, and the like), not criminal law. And despite the rhetoric you hear from the idiot left (admittedly not quite so numerous as the idiot right), this is a very good thing.

        Just for example, say you're trying to build a high rise. If you know anything about buildings, you know this involves all sorts of people, firms, specialists, and groups. You've g

        • So it's like a person that's above the most serious laws.
        • Individual responsibility doesn't work here. Let's take dumping benzene. John Smith is a truck driver. He needs his job to help support his family. His employer has deliberately set things in a way that he needs to dump the benzene in the river to meet the numbers he needs to keep his job. If John stays legal, he gets fired, and Jim is hired to do the same thing. Due to the corporate policy, and the lack of strong labor unions, the company will find someone to dump the benzene in the river, whether i

          • I agree with much of what you say except the part about unions being the panacea. It looks more to me like unions ran off all the good work. They had a good thing going for a while, but it wasn't only corporate "union busting" (some bull busting heads on the picket line) that caused problems. It was the fact that unions are basically just gangsters trying to extort money from the business. It doesn't matter if the company is greedy and evil or angelic and beneficent: they still will want to offshore/relocat
        • If you steal a pack of gum from Walmart, you can be arrested, and you would be charged under criminal law.

          If a corporation defrauds millions of people, $10 each, and makes hundreds of millions of dollars in ill-gotten gains … they can only be sued under civil law … and are even able to write themselves out of that in their fine print.

          Only one of those so-called people has the potential of being locked in a cage, and will for the rest of their lives have trouble renting an apartm

      • The reason is that if a collective of people accomplish something good, you'd like them all to be recognized for the accomplishment, and possibly profit. If a collective of people accomplish something bad, the more difficult task of apportioning blame appropriately, in ratio to level of individual culpability, becomes necessary, and in many cases, it's easy to deflect blame within a group.

      • Corporations were created to shield rich people from their decisions when those decisions end up being detrimental to a whole larger segment of society. "Wasn't me, your honor. It was the corporation." And for whatever reason, modern governments have pretty much just let themselves be absorbed into the monster that is corporate control, rather than trying to regulate and keep them in check.

        • I totally agree. I'm for whatever lawyers call finding the people who were responsible, despite them working for a corporation, and getting justice. From the guy driving the truck all the way up to the C-level asshole who bought the truck. Let just a few folks get prosecuted perp-walked and see how fast their criminal buddies get the message. The guy driving the truck will start asking questions like "What in the actual fuck are you asking me to dump outta the truck in the river? The last guy got sent to po
  • You can't steal (Score:3, Interesting)

    by MpVpRb ( 1423381 ) on Tuesday July 11, 2023 @05:14PM (#63678265)

    publicly available data

    • Re:You can't steal (Score:5, Insightful)

      by larryjoe ( 135075 ) on Tuesday July 11, 2023 @05:42PM (#63678337)

      publicly available data

      Sure you can. Many songs, movies, photos, articles, etc. are publicly available for viewing on the internet. Even though viewing is unlimited, deriving new commercial works based on that content is prohibited by copyright laws. We'll see if a court accepts a fair use claim for the content.

      • by dvice ( 6309704 )

        > deriving new commercial works based on that content is prohibited by copyright laws.

        If deriving is not allowed, why do we have so many zombie movies?

    • by m00sh ( 2538182 )

      You can steal FOSS.

  • by williamyf ( 227051 ) on Tuesday July 11, 2023 @05:43PM (#63678345)

    I've said it before and I'll say it again. AI devs got greedy and lazy with the training data.

    * Get copyrighted books wholesale in the training set taken from KNOWN pirate (e)books sites? Check.*

    * Scan code without checking that the licenses allow it (Like BSD, MIT, Apache or MPL)? Check.

    * Scan/scrape text from the internet without making sure that the licenses allow it (like, say, Creative Commons)? Check

    * Scan material from sites using free APIs or scraping tools without negotiating a license from the site owners? Check

    So, instead of curating the dataset, they lifted it, much of it wholesale, without the proper credit/negotiation/care.

    Now, slowly but deftly, the chickens are coming home to roost.

    Couple that with the fact that part of the data used to train LLMs was SEO crap, and that the data used to train future-generation models will most likely include crap generated by past LLMs, and this is a recipe for "a bad experience" tm

    Let's hope that these guys do a better job curating the training data for the newer LLMs like Llama2, ChatGPT4, Bard2 and such.

    One can only dream Right?

    * I bring this topic up first because, with the headlines being so clickbaity, no one noticed that Sarah Silverman was only part of a group of writers suing, and that part of the lawsuit claimed that their books were "ingested" by the AI from pirate ebook sites, without any form of compensation (not even buying one copy) for the authors. For crying out loud, even "Reader's Digest" has their ducks in a row regarding copyright and compensation before making their abridged versions.

    • by m00sh ( 2538182 )

      You're way way stretching the definition of a derivative work.

      An LLM is not a derivative work of some random copyrighted work that it was trained on.

      By that argument, google search is a derivative work because it cannot do a search without going through the data.

      • Yes, he's totally missing the point. Google used information that was published publicly, for public use, in the training of their models. Next, we'll have people insisting that it's ok to READ publicly-available information, but only if it's read for the correct purpose.

    • by dvice ( 6309704 )

      > Get copyrighted books wholesale in the training set taken from KNOWN pirate (e)books sites? Check.*

      You never heard of Google Books, where they scanned thousands of books? So they are the last company that would need to use a pirate site for this, and more importantly, they most likely have more books than the pirate sites do.

      > Scan code without checking that the licenses allow it (Like BSD, MIT, Apache or MPL)? Check.

      I don't understand why reading code and learning from it and using that knowledge to write new

  • Web Scraping (Score:5, Insightful)

    by Local ID10T ( 790134 ) <ID10T.L.USER@gmail.com> on Tuesday July 11, 2023 @05:45PM (#63678351) Homepage

    Web scraping is not stealing, and it is not illegal. It may be a violation of terms of use agreements, but that requires an agreement be made.

    Publicly available information is just that. If you published it on the www, you made it available to others to read and use. You may have a copyright claim if your published work (published on the www is published) is re-published (in whole or in part), but you cannot claim copyright on facts. Information is not copyrightable, only the expression of the information as a creative work.

    Information about you is not your information. Personal data is the data you personally possess. If you want to keep information about you private -keep it to yourself. Once you tell someone a secret, it is theirs to keep or share.

    • by AmiMoJo ( 196126 )

      In Europe your personal information very much is yours, and even if you tell it to someone they are still bound to only use it in ways absolutely necessary to provide a service, or in ways that you have explicitly and freely given them permission to.

      As for information published publicly, you may have read "all rights reserved" in books. Obviously reading the book is not illegal, and nor is using knowledge you learned from the book. But doing other things with it, like making a complete copy and giving it to

      • Now someone will say that it's like a person remembering it. But no it isn't. It's like a person remembering it and then being able to reproduce passages word for word with a typewriter. That IS a violation.
    • Web scraping is not stealing, and it is not illegal. It may be a violation of terms of use agreements

      Creating derived works from scraped content is copyright infringement. Whether or not using scraped data to train AI models constitutes creation of a derived work is something courts and/or legislative bodies are going to have to decide. That's basically the question at issue in this lawsuit.

      • Creating derived works from scraped content is copyright infringement.

        I should have been a little more precise here. I should have said:

        Creating derived works from scraped content that is copyrighted and for which the scraper/user does not have a license to create derived works is copyright infringement.

      • ...And I don't see how it's decidable, because you would have to show that significant parts of the original work were found in the derived work. But there is no one derived work, only the potential for a derived work. In addition, the original work is nowhere to be found inside the bot, only millions or billions or trillions of weights on fragments and relationships.

        • Agreed. IMO, I think this is exactly analogous to the way human creators absorb large amounts of content and then synthesize new things which may have obvious influences but are considered new expressions. In some cases human creators create expressions that are so obviously similar to something they saw that it does cross the line into infringement. I don't see how we can treat AI any differently. The mere training can't be considered creation of a derived work. Using the resulting AI may cause it to gener

  • Who do they think they are? The Evil House of Mouse?
  • Scrape my shit, because it is shit. It'll also improve the conversational fecundity or whatever of AI, by dint of being an extra data point, so go ahead. I don't subscribe to society's rules when it comes to AI, because they conflict with the development of AI. I'm sorry.
  • by Anonymous Coward

    If you had a robots.txt file that blocked Google, Google would have been blocked. I hope you hired those lawyers on a contingency basis.

    • There isn't a way to block AI only in robots.txt, as far as I understand. You would have to block Google's crawlers along with the AI.
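
      For what it's worth, here is a minimal sketch of what the blanket block discussed above looks like. Note the trade-off: at the time of this discussion, blocking Googlebot also dropped the site from Search, since robots.txt had no AI-only directive (Google later introduced a separate "Google-Extended" user-agent token specifically for AI-training opt-out).

```text
# robots.txt -- blanket-block Google's crawler
# (this also removes the site from Google Search results)
User-agent: Googlebot
Disallow: /

# Everyone else may still crawl
User-agent: *
Disallow:
```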
  • You will get like $100-$400 for this, like the other payouts from big tech.

  • The complaint: https://clarksonlawfirm.com/wp... [clarksonlawfirm.com]
  • Why is there no problem when data is used to train humans, but one suddenly appears when it is used to train AI?
    • Because we simply assume that knowledge granted to humans will make our collective lives a tiny bit better.
      With AI, we know it will certainly make someone richer. That's about it.

  • If it's publicly accessible to any browser, it isn't stealing. What does it matter what data these AI systems are trained on? Certainly not if it is based on publicly available data which anybody can access with a simple browser.

  • Wayyyy back in 1994, Martijn Koster wrote about the rather narrow-minded Robots Exclusion Protocol. A protocol that exists solely for human-supremacist content owners to be able to make a public declaration that robots are not welcome to read their stuff. ...so for nearly thirty years, any speciesists who didn't want an AI reading their content have had an easy way to prevent that from happening. No one has any excuse for getting upset and suing now.

  • "has been secretly stealing everything ever created and shared on the internet by hundreds of millions of Americans" What are you talking about? That is the literal business model: read everything AND cache the history as well. How did you think they do searching? Using their best guess?
  • I don't see it as that clear-cut.

    Artists keep looking at other works in their field, and learning from them. All art is derivative to some extent, and a lot of it is copyrighted. I can have a brilliant idea for a story, and I'll still write dialogue and descriptions in a way influenced by everything else I've read. Alternatively, I can write a story with no significant new ideas, taking elements of plot from various other sources. It'll be unpublishable (I hope) but legal. That seems to me to be com

  • What are the chances that if I check out that lawyer's email, files, or computer, I'll see tools like VeraCrypt, PGP, and a general approach of encrypting everything and locking it all down? You can't make the claim that you care about privacy if you don't actually care, and 99.999% of people don't care.

    A great example is to look at email: if you encrypt and sign your emails with PGP, then at the very least it shows you care about email-based communication. If you don't, then the chances you care about any communica
