US Launches Task Force To Open Government Data For AI Research (wsj.com)
An anonymous reader quotes a report from The Wall Street Journal: The Biden administration launched an initiative Thursday aiming to make more government data available to artificial intelligence researchers, part of a broader push to keep the U.S. on the cutting edge of the crucial new technology. The National Artificial Intelligence Research Resource Task Force, a group of 12 members from academia, government, and industry led by officials from the White House Office of Science and Technology Policy and the National Science Foundation, will draft a strategy for potentially giving researchers access to stores of data about Americans, from demographics to health and driving habits. It will also look to make computing power available to analyze the data, with the goal of allowing access to researchers across the country. The task force, which Congress mandated in the National Artificial Intelligence Initiative Act of 2020, is part of an effort across the government to ensure the U.S. remains at the vanguard of technological advancements.
Many researchers, particularly in academia, simply don't have access to these computational resources and data, and this is hampering innovation. One example: The Transportation Department has access to a set of data gathered from vehicle sensors about how people drive, said Erwin Gianchandani, senior adviser at the National Science Foundation and co-chairman of the new AI task force. "Because you have very sensitive data about individuals, there are challenges in being able to make that data available to the broader research community," he said. On the other hand, if researchers could get access, they could develop innovations designed to make driving safer. Census data, medical records, and other data sets could also potentially be made available for research by both private companies and academic institutions, officials said. They said the task force will evaluate how to make such data available while protecting Americans' privacy and addressing other ethical concerns.
Really? (Score:2)
Re: Really? (Score:1)
an Artificial Intelligence
"An?" This is not Neuromancer; "AI" is a commodity - like stupidity.
Good datasets are hard to find (Score:5, Interesting)
Shouldn't they have to prove they've developed an Artificial Intelligence? I would be a little less excited about AI doing the job than a really good algorithm. I can't trust AI, but if an open algorithm is used (and applied 'as is'), I'd be more comfortable letting it process government data. At least I can depend on people being corrupt. Who knows about AI...
I do a lot of AI research, and finding good training data is really hard.
Here's a suggestion for anyone who wants to enter the field: pick a problem that is difficult for computers but that humans find easy, then try to find a dataset to train on, then think through the steps needed to solve the problem.
Text can sometimes be difficult. I've been on Kaggle and seen text challenges where the data contains snips of HTML tags (not the entire tag, just snips here and there). Project Gutenberg has typos in its text and encoding, and there's no easy way to distinguish narrative text from non-narrative forms (dictionary entries, poetry, or inventory lists). Online text from reviews or posts has a ton of abbreviations, misspellings, and leet-speak, and a tone and tenor that isn't representative of normal speech.
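To make that concrete, here's a minimal Python sketch of the kind of cleanup pass such text needs before it's usable for training. The regexes and the sample string are just illustrative placeholders, not drawn from any particular dataset:

    import re

    # Strip stray HTML fragments and normalize scraped text before training.
    # A rough first pass only; real corpora need far more care than this.

    TAG_SNIP = re.compile(r"<[^>]{0,80}>")      # complete tags, e.g. "<p>" or "<a href=...>"
    BROKEN_SNIP = re.compile(r"</?\w+[^<>]*$")  # a tag cut off at the end of a snippet

    def clean_snippet(text: str) -> str:
        text = TAG_SNIP.sub(" ", text)          # drop whole tags
        text = BROKEN_SNIP.sub(" ", text)       # drop a dangling, truncated tag
        text = re.sub(r"\s+", " ", text)        # collapse whitespace
        return text.strip()

    if __name__ == "__main__":
        raw = "Great product!<br>Would buy again <a href='http"
        print(clean_snippet(raw))   # -> "Great product! Would buy again"

Even after a pass like this you still haven't touched the abbreviations, misspellings, or tone problems; those are much harder to fix mechanically.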
The simplest image sources are probably the zip-code digit-recognition images, which are scanned and already hand-labelled, so we know what the correct answer is... which is fine, except that for actual V1-style recognition you need greyscale and not binary (1=white, 0=black) images. You can get greyscale versions, but these are greyscale interpolations of the original binary scans! The data is shot through with quantization noise.
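Here's a toy illustration of that quantization problem, assuming nothing about any particular digit set (the 8x8 array and the 3x3 box blur are made up purely for demonstration):

    import numpy as np

    # A "greyscale" image obtained by blurring a 1-bit scan only ever takes a
    # handful of intensity levels, unlike a true greyscale capture.

    rng = np.random.default_rng(0)
    binary = (rng.random((8, 8)) > 0.5).astype(float)   # stand-in for a 1-bit scan

    # 3x3 box blur: the kind of interpolation that fakes greyscale from binary.
    kernel = np.ones((3, 3)) / 9.0
    padded = np.pad(binary, 1, mode="edge")
    blurred = np.zeros_like(binary)
    for i in range(8):
        for j in range(8):
            blurred[i, j] = np.sum(padded[i:i+3, j:j+3] * kernel)

    print("distinct grey levels after interpolation:",
          len(np.unique(np.round(blurred, 3))))
    # At most 10 levels (0/9 through 9/9); a real greyscale sensor gives hundreds.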
An acquaintance was kind enough to send me a set of high-res topo images (example [duckduckgo.com]) from Mars. The images are X-Y-Z, black-and-white, with Z being the ground-level altitude. Craters are obvious in profile: the base of a crater is lower than the surrounding land and mostly flat. (Mostly - some nuance applies.)
Craters are circles, and a human has no problem identifying the location and size. Craters can overlap, and a human has no problem telling which crater came first, and whether it's old or young depending on the weathering of the edge.
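For a sense of how far a naive baseline falls short of what a human does effortlessly, here's a deliberately crude crater-floor detector sketched in Python. The function, the radius/depth parameters, and the file name in the usage comment are all hypothetical, not part of the actual data set:

    import numpy as np

    # Naive crater-floor detector on an X-Y-Z height map: flag pixels that sit
    # well below the median height of a surrounding square ring. A human beats
    # this easily on overlapping craters and weathered rims - which is the point.

    def crater_floor_mask(z: np.ndarray, radius: int = 8, depth: float = 30.0) -> np.ndarray:
        h, w = z.shape
        mask = np.zeros((h, w), dtype=bool)
        for y in range(radius, h - radius):
            for x in range(radius, w - radius):
                ring = np.concatenate([
                    z[y - radius, x - radius:x + radius + 1],
                    z[y + radius, x - radius:x + radius + 1],
                    z[y - radius:y + radius + 1, x - radius],
                    z[y - radius:y + radius + 1, x + radius],
                ])
                mask[y, x] = z[y, x] < np.median(ring) - depth
        return mask

    # Usage (hypothetical file name; the real topo tiles aren't public here):
    # z = np.load("mars_topo_tile.npy")   # 2-D array of altitudes in metres
    # floors = crater_floor_mask(z)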
I feel lucky to have the Mars image data to play with - having a conceptually simple problem with really good data helps eliminate a lot of proposed algorithms for how AI really works.
But finding good data is surprisingly hard.
(I'm well aware of the myriad AI data corpora online. Many have defects in some form or another, as listed.)
(Apropos of nothing: I scraped Slashdot for all comments scored 3+, thinking it would give me typos and usage representative of quality typing. It mostly does, but the narrative thread doesn't hold together: most comments are responses to other comments and have missing conceptual bits that need the previous comment for context, the tree structure allows multiple responses each missing that context, and so on.)
No Benefit, Only Risk (Score:2)
Bad Smell (Score:2)
Good for watchlist service providers (Score:2)
Sounds like a project designed (Score:1)
Re: (Score:2)
Sounds like a project designed to provide a conduit to leak government info to marketers so it can be sold legally.
Also sounds like a project designed to disseminate database information about the population and develop tools for analyzing it - creating a surveillance state.
Something like the rules change Obama made on his way out, expanding access to raw surveillance from "NSA only and anonymizes it before feeding it to other agencies" to "NSA gives raw feeds to 16 other agencies". (This was alleged to be