US Lawmaker Proposes a Public Database of All AI Training Material
An anonymous reader quotes a report from Ars Technica: Amid a flurry of lawsuits over AI models' training data, US Representative Adam Schiff (D-Calif.) has introduced (PDF) a bill that would require AI companies to disclose exactly which copyrighted works are included in datasets training AI systems. The Generative AI Disclosure Act "would require a notice to be submitted to the Register of Copyrights prior to the release of a new generative AI system with regard to all copyrighted works used in building or altering the training dataset for that system," Schiff said in a press release.
The bill is retroactive and would apply to all AI systems available today, as well as to all AI systems to come. It would take effect 180 days after it's enacted, requiring anyone who creates or alters a training set not only to list works referenced by the dataset, but also to provide a URL to the dataset within 30 days before the AI system is released to the public. That URL would presumably give creators a way to double-check if their materials have been used and seek any credit or compensation available before the AI tools are in use. All notices would be kept in a publicly available online database.
Currently, creators who don't have access to training datasets rely on AI models' outputs to figure out if their copyrighted works may have been included in training various AI systems. The New York Times, for example, prompted ChatGPT to spit out excerpts of its articles, relying on a tactic to identify training data by asking ChatGPT to produce lines from specific articles, which OpenAI has curiously described as "hacking." Under Schiff's law, The New York Times would need to consult the database to ID all articles used to train ChatGPT or any other AI system. Any AI maker who violates the act would risk a "civil penalty in an amount not less than $5,000," the proposed bill said. Schiff described the act as championing "innovation while safeguarding the rights and contributions of creators, ensuring they are aware when their work contributes to AI training datasets."
"This is about respecting creativity in the age of AI and marrying technological progress with fairness," Schiff said.
retroactive? (Score:2)
Re: (Score:2)
Re:retroactive? (Score:5, Informative)
This does not count as ex post facto. They would not be punished for their actions prior to this law, but would be for not complying with it going forward.
To go off the license example. You would be fine if you drove without a license prior to them being required. But if you continued to drive without one after they became required, then you would get in trouble.
Re: (Score:2)
The retrospectiveness isn't ex-post-facto in the sense the law talks about.
It would be if they were fined for not providing that link in the past. Providing that link in the future, for previously used data, is what's at question.
Is an AI making these articles? (Score:2)
Or is this just a good old fashioned human-dupe?
Not Accurate (Score:2)
This is a law that will allow the federal government to take total control of AI forever, and will further allow them to take control of all creative work forever, which will essentially repeal the First Amendment.
It will proceed thusly: as of now, the presumption is that all work is created by people. This will give way to suspicion that all work is created by AI, which will lead to "approved" and "not approved" works, which will lead to "this work is presumed to have been created by AI but your papers are
Re: (Score:2)
Nope.
All AI models should disclose their entire datasets. If something is copyrighted, then the copyright owner should at the bare minimum know, so if someone uses the model and "generates something that devalues their copyrighted work" they know that model in fact ingested it, and it's not the product of "infinite monkeys with typewriters" RNG.
The biggest reason to do this is not with text or visuals however, it's to do with deep-fakes. Because if a model has in fact been trained on, say present US president's faces, video, audio, and can reproduce that president's speech and visuals at a fairly high accuracy, then anyone using that model should disclose that it was done with that model, that it's not real.
Re: (Score:2)
All AI models should disclose their entire datasets. If something is copyrighted, then the copyright owner should at the bare minimum know, so if someone uses the model and "generates something that devalues their copyrighted work"
Copyright law grants no such protection over the value of a copyright holder's work. It merely protects works from unauthorized performance and reproduction.
The biggest reason to do this is not with text or visuals however, it's to do with deep-fakes. Because if a model has in fact been trained on, say present US president's faces, video, audio, and can reproduce that president's speech and visuals at a fairly high accuracy, then anyone using that model should disclose that it was done with that model, that it's not real.
Good luck with that; model adaptation can be done by anyone in seconds to hours. In the case of voice it literally takes seconds. Model pretraining takes real effort; the rest is trivial.
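To make the point concrete, here is a minimal sketch of how little work adapting an existing pretrained model takes, assuming the Hugging Face transformers and peft libraries; the base model name and hyperparameters are illustrative only, not taken from any real deep-fake pipeline.

```python
# Minimal LoRA adaptation sketch: the heavy lifting (pretraining) is already done;
# only a few small adapter matrices get trained on top. Assumes `transformers` and `peft`.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base model

# Attach low-rank adapters to the attention projection; only these are trainable.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], lora_dropout=0.05)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model's weights

# A short fine-tuning loop over a handful of samples is all that remains;
# on a single consumer GPU this is minutes of work, not a pretraining run.
```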
Not to worry (Score:2)
No. The tech is already out — this horse is so far out of the barn you'd need a passport and numerous border crossings to even find hoofprints.
Not only is such a law completely unable to regulate GPT/LLM/generative software in the USA's non-commercial software ecosphere, it can have no effect across national borders and you may be absolutely certain that other state actors will simply smile and wave at such ideas (
Really stupid bill. (Score:2)
Looking at who came up with this makes it clear why it is so badly drafted.
Re: (Score:3)
Theoretically, it would only require a link. In practice, however, this might not be useful: the AI training set might be a modified version of the copyrighted material rather than raw web pages (a rough sketch of such a pipeline follows the list below).
I don't think the authors of the bill realize:
1. How massive training datasets can be (exabytes),
2. The training data might not be random scrapes from the internet; AI training data can be anything, depending on the application.
3. The copyright office is ill-equipped to handle and store the information.
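As a concrete illustration of the "modified version" problem mentioned above, here is a minimal sketch of a scrape-clean-dedup step; the function names and the pipeline itself are hypothetical, not a description of any vendor's actual process.

```python
# Hypothetical sketch of why the final training set may bear little resemblance
# to the raw copyrighted pages it was built from: scrape -> clean -> dedup.
import hashlib
import re

def clean(html: str) -> str:
    """Strip markup; the stored text no longer matches the original page byte-for-byte."""
    return re.sub(r"<[^>]+>", " ", html)

def build_training_shard(pages: list[str]) -> list[str]:
    seen: set[str] = set()
    shard: list[str] = []
    for page in pages:
        text = clean(page)
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:  # near-identical pages collapse into one entry
            continue
        seen.add(digest)
        # Real pipelines would further chunk, filter, and tokenize the text,
        # so a single "item" in the dataset may mix fragments of many sources.
        shard.append(text)
    return shard

print(build_training_shard(["<p>Example article</p>", "<p>Example article</p>"]))
```

Once the text has been chunked, filtered, and tokenized this way, a URL pointing at "the dataset" tells a rights holder very little about whether, or in what form, their page survived into the final training set.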
Really stupid AI companies (Score:1)
Unworkable (Score:2)
The list would contain billions, maybe trillions, of items and would be impossible to compile
Very few images on the public web have embedded copyright information
I would guess 1% or less
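A back-of-envelope calculation, using assumed figures, gives a feel for the scale being claimed here:

```python
# Rough scale estimate; the item count and bytes-per-entry are assumptions
# for illustration only, not figures from the bill or any AI vendor.
items = 1_000_000_000_000      # one trillion listed works
bytes_per_entry = 200          # URL, title, and rights holder, roughly
total_tb = items * bytes_per_entry / 1e12
print(f"~{total_tb:.0f} TB just to store the list")  # ~200 TB
```

Storage alone would be tractable for a large vendor; the harder part, as other comments note, is knowing the provenance of each item in the first place.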
Re: (Score:3)
And that might well make the bill impossible to comply with, which would likely lead a judge to strike the law down as unworkable.
Re: (Score:2)
Re: (Score:2)
The majority are copyrighted by default - you often don't need to take any action for your work to be protected by copyright.
It's on those producing the training database to ensure that every item that goes into it is being used in accordance with the rights available for it.
So...honor system (Score:2)
Just like now, except Schiff gets to claim credit for Doing Something?
Protectionism (Score:4, Interesting)
This is purely about creating laws that are impossible to comply with in order to protect "creators" from the future.
Only lawyers and rich will be able to AI (Score:2)
Small startups will not be able to train AIs because of the red tape.
Imagine every professional having to explain all the places they trained.
Do the vendors even know (Score:2)
Doesn't this suggested law presuppose that these AI vendors even know what the sources of all their training materials are? I suspect they don't, really, which is a problem in and of itself, in the sense that we know poor-quality and inaccurate data has been used to train many of these models.
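For a sense of what answering that question would even require, here is a minimal sketch of a per-item provenance record a vendor might keep; the schema and field names are hypothetical, not anything specified by the bill.

```python
# Hypothetical per-item provenance record: knowing your sources means keeping
# something like this for every document that enters the training set.
from dataclasses import dataclass, asdict
import json

@dataclass
class ProvenanceRecord:
    source_url: str    # where the item was fetched from
    retrieved_at: str  # fetch date (ISO 8601)
    license: str       # best-known rights status, very often just "unknown"
    sha256: str        # hash of the stored text, tying the record to the data

record = ProvenanceRecord(
    source_url="https://example.com/article",
    retrieved_at="2024-04-10",
    license="unknown",
    sha256="0" * 64,   # placeholder digest for the example
)
print(json.dumps(asdict(record), indent=2))
```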
\o/ (Score:1)
A real solution would be to release a retrovirus into the water supply which would terraform everyone's thought process to build elite-level bureaucracy into every aspect of life from the get-go; then we can stifle all innovation and reduce the government footprint.