AI Databases Government

US Lawmaker Proposes a Public Database of All AI Training Material

An anonymous reader quotes a report from Ars Technica: Amid a flurry of lawsuits over AI models' training data, US Representative Adam Schiff (D-Calif.) has introduced (PDF) a bill that would require AI companies to disclose exactly which copyrighted works are included in datasets training AI systems. The Generative AI Disclosure Act "would require a notice to be submitted to the Register of Copyrights prior to the release of a new generative AI system with regard to all copyrighted works used in building or altering the training dataset for that system," Schiff said in a press release.

The bill is retroactive and would apply to all AI systems available today, as well as to all AI systems to come. It would take effect 180 days after it's enacted, requiring anyone who creates or alters a training set not only to list works referenced by the dataset, but also to provide a URL to the dataset within 30 days before the AI system is released to the public. That URL would presumably give creators a way to double-check if their materials have been used and seek any credit or compensation available before the AI tools are in use. All notices would be kept in a publicly available online database.

Currently, creators who don't have access to training datasets rely on AI models' outputs to figure out if their copyrighted works may have been included in training various AI systems. The New York Times, for example, prompted ChatGPT to spit out excerpts of its articles, relying on a tactic to identify training data by asking ChatGPT to produce lines from specific articles, which OpenAI has curiously described as "hacking." Under Schiff's law, The New York Times would need to consult the database to ID all articles used to train ChatGPT or any other AI system. Any AI maker who violates the act would risk a "civil penalty in an amount not less than $5,000," the proposed bill said.

Schiff described the act as championing "innovation while safeguarding the rights and contributions of creators, ensuring they are aware when their work contributes to AI training datasets."

"This is about respecting creativity in the age of AI and marrying technological progress with fairness," Schiff said.

  • Doesn't the Constitution prohibit the making of ex post facto laws?
    • No, or we wouldn't require drivers to have licenses and vehicle registration wouldn't be a thing.
    • Re:retroactive? (Score:5, Informative)

      by Mitchblahman ( 3022943 ) on Thursday April 11, 2024 @05:55PM (#64387710)

      This does not count as ex post facto. They would not be punished for their actions prior to this law, but would be for not complying with it going forward.

      To go off the license example: you would be fine if you drove without a license before licenses were required. But if you continued to drive without one after they became required, you would get in trouble.

    • The retrospectiveness isn't ex-post-facto in the sense the law talks about.

      It would be if they were fined for not providing that link in the past. Providing that link in the future, for the previous data, is what's at question.

  • Or is this just a good old fashioned human-dupe?

  • This is a law that will allow the federal government to take total control of AI forever, and will further allow them to take control of all creative work forever, which will essentially repeal the First Amendment.

    It will proceed thusly: as of now, the presumption is that all work is created by people. This will give way to suspicion that all work is created by AI, which will lead to "approved" and "not approved" works, which will lead to "this work is presumed to have been created by AI but your papers are

    • by Kisai ( 213879 )

      Nope.

      All AI models should disclose their entire datasets. If something is copyrighted, then the copyright owner should at the bare minimum know, so if someone uses the model and "generates something that devalues their copyrighted work" they know that model in fact ingested it, and it's not the product of "infinite monkeys with typewriters" RNG.

      The biggest reason to do this is not with text or visuals however, it's to do with deep-fakes. Because if a model has in fact been trained on, say present US preside

      • All AI models should disclose their entire datasets. If something is copyrighted, then the copyright owner should at the bare minimum know, so if someone uses the model and "generates something that devalues their copyrighted work"

        Copyright law grants no such protection over the value of a copyright holder's work. It merely protects works from unauthorized performance and reproduction.

        The biggest reason to do this is not with text or visuals however, it's to do with deep-fakes. Because if a model has in fact been trained on, say present US president's faces, video, audio, and can reproduce that president's speech and visuals at a fairly high accuracy, then anyone using that model should disclose that it was done with that model, that it's not real.

        Good luck, model adaption can be done by anyone in seconds to hours. In the case of voice it literally takes seconds. Model pretraining takes real effort, the rest is trivial.

    • This is a law that will allow the federal government to take total control of AI forever

      No. The tech is already out — this horse is so far out of the barn you'd need a passport and numerous border crossings to even find hoofprints.

      Not only is such a law completely unable to regulate GPT/LLM/generative software in the USA's non-commercial software ecosphere, it can have no effect across national borders and you may be absolutely certain that other state actors will simply smile and wave at such ideas (

  • So would it require that this individual message be separately listed, or would a link to the top page be enough? After all, they are all copyrighted material.
    Looking at who came up with this makes it clear why it is so badly created.
    • Theoretically, it would only require a link. However, in practice, this might not be useful. The AI training set might be a modified version of the copyrighted material, as opposed to raw web-pages.

      I don't think the authors of the bill realize:

      1. How massive training datasets can be (exabytes),

      2. The training data might not be random scrapes from the internet. AI training data can be anything, depending on the application.

      3. The copyright office is ill-equipped to handle and store the information.

      • by Anonymous Coward
        Oh dear. You mean to say it's impractical for AI developers to maintain basic standards of accountability, audit trails, and the respect of other people's labour and intellectual property? Well then too fucking bad, I guess they just shouldn't be in business then. I'm sure they will be missed.
  • The list would contain billions, maybe trillions, of items and would be impossible to compile
    Very few images on the public web have embedded copyright information
    I would guess 1% or less

    • And that might well make the bill impossible to comply with, which would likely trigger a judge to knock the bill down as unworkable.

    • More than unworkable. One could get a robot or another AI to feed news (say, from the NYT) into the training DB daily: deliberate contamination to create work for the legal industry. Training DBs are DYNAMIC, and the content could come from anywhere, even globally. The output from AIs will be tiny, which should be fair use. I bet song lyrics will be the first cab off the rank. Just ask about 'She loves me'. People who want content do NOT use AI to get it; it can be downloaded elsewhere. Furthermore, woke cen
    • The majority are copyrighted by default - you often don't need to take any action for your work to be protected by copyright.

      It's on those producing the training database to ensure that every item that goes into it is being used in accordance with the rights available for it.

  • Just like now, except Schiff gets to claim credit for Doing Something?

  • Protectionism (Score:4, Interesting)

    by WaffleMonster ( 969671 ) on Thursday April 11, 2024 @10:37PM (#64388208)

    This is purely about creating laws that are impossible to comply with in order to protect "creators" from the future.

  • If we slow down AI research by using obstacles that don't actually protect anyone, then our adversaries will have it first and have it better.

    Small startups will not be able to train AIs because of the red tape.

    Imagine every professional having to explain all the places they trained.
  • Doesn't this suggested law presuppose that these AI vendors even know what the sources of all their training materials are? I suspect they don't, really, which is a problem in-and-of itself in the sense that we know poor-quality/inaccurate data has been used to train many of these models.

  • A real solution would be to release a retrovirus into the water supply which would terraform everyone's thought process to build in elite-level bureaucracy into every aspect of life from the get-go then we can stifle all innovation and reduce the government footprint.
