AI Government The Courts

AI Companies Would Be Required To Disclose Copyrighted Training Data Under New Bill (theverge.com)

An anonymous reader quotes a report from The Verge: Two lawmakers have filed a bill that would require creators of foundation models to disclose their sources of training data, so copyright holders know when their information has been taken. The AI Foundation Model Transparency Act -- filed by Reps. Anna Eshoo (D-CA) and Don Beyer (D-VA) -- would direct the Federal Trade Commission (FTC) to work with the National Institute of Standards and Technology (NIST) to establish rules for reporting training data transparency. Companies that make foundation models would be required to report their sources of training data and how the data is retained during the inference process; describe the limitations or risks of the model, how the model aligns with NIST's planned AI Risk Management Framework, and any other federal standards that might be established; and provide information on the computational power used to train and run the model. The bill also says AI developers must report efforts to "red team" the model to prevent it from providing "inaccurate or harmful information" around medical or health-related questions, biological synthesis, cybersecurity, elections, policing, financial loan decisions, education, employment decisions, public services, and vulnerable populations such as children.

The bill calls out the importance of training data transparency around copyright as several lawsuits have been filed against AI companies alleging copyright infringement. It specifically mentions the case brought by artists against Stability AI, Midjourney, and DeviantArt (which was largely dismissed in October, according to VentureBeat), and Getty Images' complaint against Stability AI. The bill still needs to be assigned to a committee and discussed, and it's unclear whether that will happen before the busy election campaign season starts. Eshoo and Beyer's bill complements the Biden administration's AI executive order, which helps establish reporting standards for AI models. The executive order, however, is not law, so if the AI Foundation Model Transparency Act passes, it would make transparency requirements for training data a federal rule.


Comments Filter:
  • For example, have federal law enforcement at least harass the hell out of the top users and downloaders of LAION-5B because it was discovered to have thousands of CSAM images in it [404media.co]. In Discord chats they even admitted it could easily end up being full of CSAM, but were like "whatcha gonna do?"

    From where I stand, it looks an awful lot like Western AI companies are wiping their asses with copyright and CSAM law. At the very least, with these large scraped data sets we could have the feds go around "verifying

    • by bagofbeans ( 567926 ) on Saturday December 23, 2023 @11:29AM (#64101065)

      Will every human author now have to disclose every copyrighted book, film, newspaper, website that they were exposed to during their lives ("training")?

      Thought not.

      More useful would be plagiarism analyses on the outputs.

      • Repeat after me: AI is not human. AI does not have human rights. AI does not have legal rights. AI is just server technology.

        These arguments comparing a machine to a human being as if the two are interchangeable are Hollywood fairy tales for easily duped people, not Slashdot readers. Let's discuss the tech as if it were tech, which it is, not as if it were a human, which it is not.

        The fact is that AI is just software with bugs, and "training" is not the same as a human being reading a book or watching a f

        • Repeat after me: AI is not human.

          Be fair; an analogy can be flawed, but drawing one doesn't require thinking the AI is human - you're comparing the functionality of one system or component to something that tries to replicate it.

          If you had a prosthetic leg that actually functioned perfectly like a real one, comparing its functionality wouldn't automatically be saying "it's a real leg." People who make this argument, IMO, miss that scope where the attempted comparison is being made - even if that doesn't change the flaws that ex

          • Analogies are drawn precisely to transpose existing experience and knowledge into new situations. It's a method of filling in the blanks.

            I think it's counterproductive to encourage people to view AIs as reading books or watching movies by analogy to humans. It suggests that an AI will act and behave roughly like a human author with the "training" it receives. When it inevitably doesn't, it causes disruption and damage to society and markets. That disruption is on a large scale because AIs interact with m

      • That "machines learn just like humans" argument is getting old and has been debunked numerous times.
    • Kinda weird they didn't just link the images through a website and let Cloudflare scan it for them.

      Still going to miss some, but at least the major visual hashes get checked.

    • They don't do much to a person storing copyrighted material in their brain and reproducing it when prompted, so why would they do the same to a computer?
    • have federal law enforcement at least harass the hell out of the top users and downloaders of LAION-5B because it was discovered to have thousands of CSAM images in it.

      Do you mean "in the training data"? I thought the dataset used for image training doesn't (or isn't supposed to) contain actual image data, IIRC.

  • by Rei ( 128717 ) on Saturday December 23, 2023 @09:40AM (#64100913) Homepage

    Just provide us a service where we can upload a file and have an automatic rapid assessment on whether it's copyrighted or not. Thanks!

    Until then, I sure hope this bill is based on a "good faith effort" principle; otherwise, these reporting requirements are a death sentence for most AI training. Because whether something falls under copyright as a unique work, a derivative work, or a non-derivative work is something that even attorneys frequently have to debate. And the concept of hiring an attorney to analyse every single thing that a model trains on is laughable. You can't rely on "what the license of the particular website said" either, because this is the internet, and people constantly post things that they themselves didn't create.

    • by Rei ( 128717 )

      ** for most AI training *in the US*.

    • by Rei ( 128717 )

      Okay, reading over the bill, this is encouraging:

      (b) CONSULTATION.—In establishing the standards and issuing the guidance required by subsection (a), the Commission shall consult with the Director of the National Institute of Standards and Technology, the Director of the Office of Science and Technology Policy, the Register of Copyrights, and other relevant stakeholders, including standards bodies, covered entities, academia, technology experts, and advocates for civil rights and consumers

      There shou

    • If it's not public domain, it's copyrighted; it's really not hard. If you don't want to gamble on the fair use exemption, you'll need specific licenses, so again it's quite easy to keep track.

      Gamble on fair use and just copy everything, or use public domain and specifically licensed content. Either way, easy.

      • PS: Fair use is a REALLY big gamble on the initial pirated copy from the internet. Fair use copying of something you own is one thing; arguing fair use for pirated content is quite another.

      • Re:Sure thing. (Score:4, Insightful)

        by sixsixtysix ( 1110135 ) on Saturday December 23, 2023 @01:45PM (#64101383)
        Sure, but first the public gets to claw back all the stuff that was stolen from the public domain by incessant copyright extensions.
      • by Rei ( 128717 )

        Except that doesn't work either. Someone can mark something "public domain", but that doesn't actually mean it is. Most websites ban copyright violations by their users and mandate that they have the right to use the data as they see fit (whether or not the users still own the copyright), but that doesn't mean that the users aren't actually violating copyright. So even if the website gives you full permission (which you don't actually need) to train off their content, that doesn't mean that you're not

        • Content is either ancient enough to be in the public domain through copyright expiry, or it's copyrighted; treat corner cases as copyrighted. Easy.

          That will lose you an infinitesimal amount of content.

          • by Rei ( 128717 )

            If AIs can only train from Project Gutenberg data from ancient books, they're going to start talking funny and be very confused about modern technology ;)

    • Just provide us a service where we can upload a file and have an automatic rapid assessment on whether it's copyrighted or not. Thanks!

      Interesting idea, but not really relevant. The bill just asks them to disclose the sources, not to establish their copyright status.

      Until then, I sure hope this bill is based on a "good faith effort" principle; otherwise, these reporting requirements are a death sentence for most AI training. Because whether something falls under copyright as a unique work, a derivative work, or a non-derivative work is something that even attorneys frequently have to debate. And the concept of hiring an attorney to analyse every single thing that a model trains on is laughable. You can't rely on "what the license of the particular website said" either, because this is the internet, and people constantly post things that they themselves didn't create.

      Again, this doesn't put the burden of establishing copyright on them; it just requires them to disclose where they got their training data, presumably so the rest of us can figure out how to start addressing the copyright question.

      And given that, I don't think the reporting requirements would be very burdensome; they probably already have every bit of training data timestamped with a URL. The
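      To make that concrete, here is a minimal sketch of the kind of per-item provenance record a disclosure report could be generated from. Every field name here is hypothetical; the bill does not specify any format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TrainingRecord:
    """One entry in a hypothetical training-data disclosure log."""
    url: str              # where the item was scraped from
    retrieved_at: datetime
    license_claimed: str  # what the source site asserted; not independently verified
    sha256: str           # content hash, so items can be matched in later audits

record = TrainingRecord(
    url="https://example.com/article",
    retrieved_at=datetime(2023, 12, 1, tzinfo=timezone.utc),
    license_claimed="unknown",
    sha256="0" * 64,  # placeholder hash
)
```

      Note that `license_claimed` deliberately records an assertion rather than a fact, which matches the point above: disclosure identifies sources without settling their copyright status.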

    • Just provide us a service where we can upload a file and have an automatic rapid assessment on whether it's copyrighted or not. Thanks!

      Wouldn't most works technically be copyrighted, since copyright is automatic in most countries? Maybe I am being a dumbass in asking this... but my greater point is the flaw of hyperfocusing on copyright status as the line to draw in the sand: if, for instance, people are conditioned to think "copyrighted automatically == off limits", and that thinking gets into legislative efforts, that'd kill off a lot of efforts to make training data

  • We may as well use the "Mickey Mouse date" (January 1, 2024) as the symbolic day traditional copyright died. Computers and AI automate so much that just a single copyrighted "atom" can infect a whole work. Remember the FUD about the GPL being "viral"? Now, with Copilot and other AI "concrete mixers", we are churning out whole buildings of copyright-laundered works. It's time to think beyond copyright, and maybe the concepts related to it in general. Actors and artists should be transit
  • The source of the data was the Internet. Happy now? First the DMCA, then "click to accept cookies" and now this. It makes politicians look extremely stupid.

  • I'll use a VPN to connect to an LLM outside of the US. From writing bash scripts to researching products, I now rely on LLMs.

    • by Rei ( 128717 )

      Or run your own. Mixtral GGUFs are now out which can run on a 24GB card. Mixtral performs better than GPT-3.5, though not yet better than GPT-4, and runs super-fast (its architecture is basically a "mini GPT-4")

      • Or run your own. Mixtral GGUFs are now out which can run on a 24GB card. Mixtral performs better than GPT-3.5, though not yet better than GPT-4, and runs super-fast (its architecture is basically a "mini GPT-4")

        Do people actually like Mixtral? I mean, it's fast for a 56B model due to its MoE scheme, but given a choice of fast models I would rather run a tuned Yi-34B, which fits entirely in VRAM and runs at more or less the same speed as Mixtral because nothing has to spill over to RAM.

        I think MoE is a cool idea, but the model itself... is a bit meh...

        • by Rei ( 128717 )

          Mixtral only has 46.7B parameters. Only the linear networks are broken out into experts, not the heads. You should be able to run it entirely in VRAM on a 24GB card if sufficiently quantized (Q3_K_M states that max RAM required = 22.86 GB, though I haven't had a chance to try it yet; Q2_K is supposed to run in 18.14 GB. And even if there's *some* offloading required, the amount needed should be so small that it shouldn't affect performance)

          I'd still like to know what Yi's tuning method is. Would be grea
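          The size figures in the comments above can be sanity-checked with simple arithmetic. This is only a rough sketch: real GGUF files add overhead for embeddings, metadata, and the KV cache, so treat the results as estimates.

```python
MIXTRAL_PARAMS = 46.7e9  # total parameter count discussed above (not 8 x 7B = 56B)

def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Back-of-envelope file/VRAM footprint for a quantized model."""
    return n_params * bits_per_weight / 8 / 1e9

def effective_bits(n_params: float, size_gb: float) -> float:
    """Invert the estimate: effective bits per weight implied by a reported size."""
    return size_gb * 1e9 * 8 / n_params

# The 22.86 GB quoted for Q3_K_M works out to ~3.9 effective bits per weight,
# and 18.14 GB for Q2_K to ~3.1. K-quants mix precisions across tensors,
# which is why the effective rate differs from the nominal 3 or 2 bits.
```

          The same arithmetic shows why a 24GB card is borderline: a 4-bit-class quant of 46.7B parameters already needs roughly 23 GB before any context memory.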

  • by ahoffer0 ( 1372847 ) on Saturday December 23, 2023 @02:12PM (#64101451)

    and it influenced my thinking. It was part of my training data as a human being. And the book was copyrighted! Will this law finally hold me accountable for being educated?

    • That's the silly part of this. The training is irrelevant in terms of copyright. What matters is the output of the model and what is actually stored within it.

      If the model contains enough of the work to constitute infringement, and the model itself is being distributed then that's already illegal. If the model can be directed to substantially recreate an infringing work then surely that's little different to running off bad photocopies of a book.

      • by Rei ( 128717 )

        Apparently my brain is a gigantic copyright infringement under your theory of "anything physically capable of being coaxed into reproducing copyrighted data is a violation" model of copyright law.

        You realize that WordPad "can be directed to substantially recreate an infringing work", as can a piece of paper and a couple of crayons? If the user is deliberately choosing to violate copyright law, then that's not the tool's violation. There are of course cases where defendants were accused of creating tools s

        • >Apparently my brain is a gigantic copyright infringement under your theory of "anything physically capable of being coaxed into reproducing copyrighted data is a violation" model of copyright law.

          If the model contains enough of the work to constitute infringement, and the model itself is being distributed then that's already illegal.

          Yes, if you intend to duplicate your brain for distribution.

          >You realize that wordpad "can be directed to substantially recreate an infringing work", as can a piece of

      • Fortunately AI has a fix for its own copyright problem. We can use models to read, summarize, and report on copyrighted data. This synthetic data would contain the same information but express it differently. Basically, it follows the idea vs. expression distinction closely - it separates one from the other. So the downstream models can't possibly reproduce copyrighted text because they have never seen any.
        • Yeah, that's the key. So long as it doesn't retain enough to violate copyright, either in its database or in what it can output, then it's all good. It could still run into trademark issues - for example, if it churns out new Batman stories that incorporate elements (e.g. Batman, the Batmobile) protected by trademarks.

        • by Rei ( 128717 )

          This isn't the problem at all. Knowing what data is copyrightable under what set of permissions is the problem. And you can't just trust a given site's shrinkwrap license, because people post copyright violations across the internet every day.

          An artist may mark their work "ABSOLUTELY COPYRIGHTED ALL RIGHTS RESERVED AI NOT ALLOWED STAY AWAY!", but then one of their fans goes and posts it on a social media site, maybe as a meme template so that it no longer matches a similarity metric (not that determining "whi
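          The summarize-and-retrain pipeline proposed a few comments up can be sketched as follows. The `summarize` function here is a stub standing in for an LLM call, and nothing in this sketch comes from the bill itself.

```python
def summarize(text: str) -> str:
    """Stub standing in for an LLM summarization call: keep the ideas,
    discard the original expression. A real pipeline would prompt a model here."""
    # Trivial placeholder: keep only the first sentence.
    return text.split(".")[0] + "."

def build_synthetic_corpus(documents: list[str]) -> list[str]:
    """Downstream models would train only on these rewritten texts,
    never on the original expression."""
    return [summarize(doc) for doc in documents]

corpus = build_synthetic_corpus(
    ["The quick brown fox jumps over the lazy dog. It was a cold day."]
)
```

          Whether such paraphrasing actually launders the expression, and whether the summaries are themselves derivative works, is exactly the open legal question in this thread.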

  • This is a meta bill that punts the actual regulation-writing to technocrats and specifies dopey things that are fundamentally impossible to comply with, like "and whether and how data is collected and retained during inference".

    Copyright owners don't have a right to the underlying knowledge contained in their works nor to track or judge usage. The copyright system is designed to give the rights holder control over production and performance of fixed works. Nothing more than that.
