Follow Slashdot stories on Twitter

 



Forgot your password?
typodupeerror
×
Government Data Storage

Carl Malamud Answers: Goading the Government To Make Public Data Public 21

You asked Carl Malamud about his experiences and hopes in the gargantuan project he's undertaken to prod the U.S. government into scanning archived documents, and to make public access (rather than availability only through special dispensation) the default for newly created, timely government data. (Malamud points out that if you have comments on what the government should be focusing on preserving, and how they should go about it, the National Archives would like to read them.) Below find answers with a mix of heartening and disheartening information about how the vast project is progressing.


LoC?
by an Anonymous Reader

So how many GB/TB is a Library of Congress? :)

Or, more seriously, how big are you estimating? Are you using raw scans or some sort of compression (JPG, PNG, etc)? What resolution are you using? Do you vary the resolution depending on the document?

What sort of meta data are you putting in?


CM: The reason John Podesta and I suggested a Federal Scanning Commission in our letter at YesWeScan.Org is we really don't know how big the holdings of the government are. I can tell you that the Library of Congress is about 32 million cataloged books (a significant increase from the 6,487 books Thomas Jefferson donated to get them started). But, this is about more than books, it is about paper records, microfilmed technical papers, video, audio, photographs, and much more.

The scale is fairly vast. The Smithsonian has 137 million objects, including about 13 million images. David Ferriero, the Archivist of the United States estimates he has over 10 billion pages of text documents, 7.2 million maps, and 40 million photographs including everything from past census records to presidential dinner menus, and that includes about 7.5 million motion pictures and sound recordings. The Government Printing Office distributes their documents to the Federal Depository Library Program, and that includes over 60 million pages of collections including the Official Journals of Government such as the Federal Register. That's just scratching the surface, and we recommended a Federal Scanning Commission to begin the process of understanding what we have (and what is worth digitizing).

As to standards? There are lots of pretty good standards on how to digitize. NARA, Library of Congress, GPO all spec out document scans at 400 dpi, for example. For photographs, moving images, and other objects, there are some pretty good and pretty detailed standards at www.digitizationguidelines.gov. I know Brewster Kahle's operation and my own tend to work off those specifications (in fact Brewster does quite a bit of scanning for the government).

As to compression? Well, I've found people tend to overcompress things. That said, sometimes the initial quality isn't that great, so a 600 dpi uncompressed scan would be silly in some cases. But, for photographs I try very hard to keep the TIFF images around and not rely on JPEG. Likewise, for audio it is really nice to keep a nice 48 khz version of your file around if you can simply because if you screw up the compression maybe somebody else can do a better job in a few years. Disk space is relatively cheap, so that isn't the barrier it used to be. For video, I rip MPEG2 at whatever it is on a DVD, when I'm actually digitizing I try to get the video bitrate up to 8-10 mbps when ripping a Betacam or Umatic. Some people think that is overkill, but I'd rather be safe than sorry.

Metadata? Well, you got to have it or you're not going to get very far when it comes to access. Many librarians have made perfect the enemy of the good when it comes to metadata and have resisted any attempt at digitization because we don't have the very best metadata we might have. I'm more in the camp of scan what you have and get as much of the metadata as you can into it. For example, we have 3,200 1000-page volumes of briefs from the 9th Circuit of the U.S. Court of Appeals. We didn't have good metadata, but we had the Internet Archive scan them anyway. Then, after we got our PDF files, I shipped those off to a double-key team in India and they broke the briefs up into individual documents and typed the metadata into a spreadsheet for me, which we hope to release soon.

My point is that sometimes you can shoehorn the metadata in after the fact or you can use a variety of techniques to pull the metadata out of the documents (e.g., smart OCR). In theory, you can use crowdsourcing to get the metadata, but so far I've not had a lot of luck persuading thousands of people to spend their time doing that kind of work. A captcha is a quick thing to do and is between you and something you want, whereas entering metadata in for videos or documents is one of those civic duty things that everybody thinks everybody else should be doing.

Total size? Brewster says a book is about 400 Mbytes (though he's very quick to point out that you could put the words in all the books in the library into a terabyte and if you're distributing PDFs, you can easily throw 130,000 full-color, searchable PDFs onto a 4 TB drive). But, you were probably asking about raw data. Here's some raw numbers:

32 million books at 400 Mbytes each is 12.8 petabytes 50 million photos at 150 Mbytes each is 7.5 petabytes 10 billion pieces of paper ("records") at 100 Kbytes each is 1 petabyte 20 years of video at 8 mbps is only 630 Tbytes.

(Somebody check my math?)

If you're talking a decade-long federal digitization initiative, we're looking at well south of 50 petabytes, which seems pretty doable in this day and age!



Can the rare books collections be digitized?
by autophile

Three closely related questions about the rare books collections at the Library of Congress:

1. I know there is some kind of effort going on to digitize the rare books collections, but can it be sped up? There are many high-quality low-cost archival book scanners out there (such as the ones developed at diybookscanner.org).

2. It gets really annoying to have to receive paper copies of books when copies are requested. Why not DVDs of high-quality images?

3. Why is there no outreach by the LoC to smaller, cheaper book scanning efforts? The Internet Archive, DIYBookscanner.org, and Decapod all come to mind.


CM: In reverse order. I don't know why we aren't distributing and decentralizing our scanning efforts. The Internet Archive is a heavy-duty production shop and they do an amazing job, as do folks like Google Books and the folks digitizing things the Mormon Church. But, there are a bunch of DIY solutions and it would be really nice if we could get more people pitching in. The biggest problem on distributing the digitization efforts is quality control. I know when it comes to ripping video, I can easily teach other people how to grab an MPEG2 off a DVD, but when it comes to things like digitizing a Betacam, that takes some training. But, we're all trainable and I wish we could all do more.

Getting back paper copies of books and papers when they're doing a copy anyway is just plain dumb. Likewise with things like FOIA results. John Podesta testified before the Senate about FOIA and said if an agency answers a FOIA request, they should also post their result online so others can see it. That seems pretty obvious.

As far as digitizing rare book collections, there are some amazing pockets throughout the government but there is no real coordination and there certainly is no effort to scan at scale or to come up with a realistic national digitization strategy. That is why we called on the White House to lead the effort. Within the Library of Congress there are some amazing collections, but if you look around to places like the National Agricultural Library or the National Library of Medicine or the libraries in the service academies you'll find lots more. Some have argued that digitizing rare books is silly because the audience is just a few academics, but I can tell you from my own experience helping host the network site for the Archimedes Palimpsest that when you make this kind of information available, there is an amazing long tail.

If you scan it, they will come. And, to answer your question, if we all scan it, they will come much sooner.



Real time legislation drafting
by kerskine

Would it be possible to implement a system that would allow real-time and continuous review of legislation while it's being drafted? Much has been made over the past three years about legislation being available for review before voting by the House or Senate. The final draft for review usually is huge PDF that makes it near impossible for citizens, interest groups, and the media to thoroughly analysis in time.

CM: You want to see the sausage being made not just buy the hot dog! I'll comment on the U.S. Congress since that's the system I know best. Thomas is a pretty good system if you happen to be stuck in 1994. It does have all the amendments and the actions and the various stages that legislation go through. But, it isn't real time, more like "pretty quick." As Van Jacobson once quipped, "Same day service in a nanosecond world." And, Thomas isn't really machine processable, it is final form, usually formatted ASCII text (shades of NROFF!). People like Josh Tauberer who built GovTrack.US have spent considerable time crawling those systems and trying to get the data into regularized formats and make it available to others to reuse via APIs, but that isn't the same as exposing the inner working of the sausage factory.

Majority Leader Cantor's staff has been pushing a system to make the raw data all available in XML from the Clerk's office and I think that is a very promising initiative which hopefully will bear fruit. (They're having a February 2 conference to discuss their plans if you are interested. I have no idea if it will be streamed for those of who aren't Inside the Beltway and I don't know their schedule for moving past conferences and into production.)

Congress is a pretty complicated beast. I know some folks like Sean McGrath have had better luck with some of the state legislatures. The problem is you need to dig deep into the inner working of a legislature. In the Congress, that means you're changing things like authoring tools that are used in the Clerk's office and by all the staff members, so you have to be careful or you get a bunch of really angry Congressman yelling at you because their staff can't crank out the flavor-of-the-week in the form of a bill or amendment.

There's also a bit of an issue of will. My work with the Congress to put hearings on-line showed that you could take the official transcripts of a hearing and use those to generate closed captions on the video. All you need is the official transcript of the hearing, but in order to get those I had to execute a special Memorandum of Understanding with the House Oversight Committee. Other committees guard their transcripts jealously and won't let them out for several when. When I started processing a bunch of historical videos we purchased from C-SPAN, I went to the Government Printing Office and found that many committees never deliver their transcripts, even a decade after the fact!



How to keep track of legislative activity about open access?
by oneiros27

Recently in the federal register, there were two calls for comments about access to data and research from federally funded research:

http://federalregister.gov/a/2011-28623 [federalregister.gov] http://federalregister.gov/a/2011-28621 [federalregister.gov]

I didn't hear about these until ~4 weeks after the original announcement, and with the holidays, it was too late to try to get the societies I'm involved with to prepare and vote on official statements. Are there any places where people can get/post notices of these sorts of things so that we can stay informed and try to help influence policies?


CM: The Federal Register is getting a lot better now that it is a much more open system. The idea of "Federal Register 2.0" was a paper I wrote for the Obama transition, so it is an issue I've tracked pretty closely and frankly, I've been amazed at how much better it is now. What they did is instead of selling the raw data feed for the Federal Register for $17,000/year, they went from SGML to XML and then released the data in bulk for free. A few guys out in San Francisco were looking for something to do to enter a contest and they took that bulk data and dreamed up GovPulse.US. That was such a better version of the Federal Register that the Office of the Federal Register switched the official site over to their open source platform. My point is the tools are there to do better notification mechanisms, and I'm sure the government would welcome somebody grabbing the GovPulse.US code out of Github and making it even better.

That's the technical answer. But, the substantive answer is that there is a huge boatload of stuff in the Federal Register and it is pretty hard to figure out what to pay attention to. I also missed that particular call for comment, and I've even missed several Requests for Information coming out of places I try and pay attention to, like the White House's Office of Science and Technology Policy. And, I do this stuff full-time! Perhaps better targeted notification mechanisms are the answer. Maybe it is a social media solution, where you pay attention to things your friends are paying attention to. I hope the answer is not that the only way to pay attention is to be employed with a beltway bandit which can afford hundreds of minions that do nothing but pay attention to Washington. Indeed, there are some very fancy for-pay services from folks like Congressional Quarterly and Bloomberg that cost an arm and a leg, but I can't help but think there has to be a better way that is also open.



What do you think of corporate partnerships?
by mhh5

I'd like to know what you think about corporate partnerships in the process to get public data released. (I'm not sure if Google Patents existed before the USPTO released its databases.) Do corporations that get involved in the process tend to make the process better without question, or are there tradeoffs in some areas because the corporations always want to help but then try to retain a proprietary version of the data for themselves?

CM: The theory is that the government gets some kind of valuable service (like digitization) that the government wouldn't get otherwise so it is a "win-win." But, the reality is all too often the government gets snookered and what we do is give some corporation exclusive access to some pot of data and the government doesn't get much of anything. The deal between Amazon and the National Archives was a good example of that kind of a private fence around the public domain. With a help from Boing Boing, I started systematically purchasing those public domain videos and re-releasing them in the wild. I have no problem with Amazon selling public domain video, I just hate it when they get a de facto or a contractual exclusive. (My testimony before Congress on this subject is here.)

There are lots of other examples of government getting snookered. For example, the Government Accountability Office let Thomson West get access to 60 million or so pages of federal legislative histories. At great cost to the government, they were all packed up and dispatched to West which digitized them all and then sent them back to the government. West now sells access to his amazing database. What did the government get for it's trouble? A few logins for GAO staffers. Even members of Congress need to pay to access the database! (We have an interesting paper trail on this issue.)

I'm glad you brought up the Google Patent system because I was personally involved in making that happen and I can tell you that this one is totally legit. Jon Orwant is the lead developer on this for Google and I played a small part in helping convince the White House and the Patent Office they ought to give Jon access to their data (the heavy lifting on that deal was by Beth Noveck who was the Deputy CTO at the time). Google makes all the data they got from the Patent Office available for bulk access with no strings attached. I can vouch for that because I did a mirror of their system. Last I heard Google was sending out anywhere from 1 to 10 terabytes of data PER DAY to external sources and even normally very critical folks who work in this arena have been really happy.

The big problem in the Patent Office is their computing infrastructure is a real catastrophe. Their power plant is over 95% capacity (e.g., plug in a computer, bring the building down!) and even though the Under Secretary knew that selling DVD subscriptions was silly, he wasn't able to switch over to an FTP service. He cut the deal with Google Patent and it worked out well for the government, for Google, and for everybody else.

What's the difference between the Google deal and the Amazon deal? In the case of the Amazon and GAO/West deals, the government lawyers did all the negotiating and they were totally outsmarted by some sharks in industry. But, when government has people like Under Secretary Kappos and Beth Noveck doing the negotiating, these things can work out just fine. The key is government should partner with people who want to do public service, not people who want to service the public.

Encouraging Governments?
by theNAM666

In a city such as Nashville, things as basic as business ownership and property records are not available online. In states such as New Jersey, public records such as basic corporate filings (officers, operating address/address for service of process) are accessible only for a fee.

What concrete actions can citizens confronting such situations, take to encourage accessibility and accountability?


CM: I find you need a carrot and a stick to make this stuff happen, especially at the local level. Folks like Everyblock.Com and CodeForAmerica.Org have done great working prying some of these databases loose, but there is still lots to do.

The first thing you should do is pick up the phone (or pick up your email client) and write/call the people who run the system. Ask them if you can have access to the data. Sometimes, it is as simple as that.

Other times, though, it isn't quite as simple since they want the money (or they want the control or they think this should be done by "private industry" by which they mean some buddy who is a contractor). The nice thing about any government system is somebody usually has oversight responsibilities. So, the next step is to find a city council member of state legislator who has oversight on the agency in question and ask them.

Again, life isn't usually that simple, but sometimes you win! If you can't get anywhere that way, what I usually end up doing is basically competing with the government system. Build a proxy system like RECAPtheLaw.Org did to recycle paid documents. Or, get a sponsor and buy a reasonable number of docs and build a web site that looks like it is going to be a real production system.

Then, go back again and ask. Maybe if you have eyeballs or at least have a nice web site, that is enough to get the government moving. But, if that doesn't do the job, you may have no choice but to compete with them for real, which of course requires a big commitment in time and energy and not everybody can do that. I know in the case of the Patent Office, I started pestering them in 1993, including several times when I spent 6-figure sums purchasing their data, and it still took until 2011 to crack that nut.

The real trick is focus/obsession. Pick one thing you really care about and just keep pestering them until you crack it open. If you're surfing from one opengov problem to another, showing up for a 1-day hackathon then moving on to something else, you're not going to get anywhere. Pick something real and make it your thing.



Privately Owned, Copyrighted Law
by AdamnSelene

I think I have read that the law itself cannot be copyrighted and it should be possible to make it available available to everyone. But as a techie who drafts standards and specifications, I was wondering about how far this goes--especially since Congress recently proposed enacting some of our standards into law. (They decided not to, but they read some parts into the committee records as they debated.) Can you still accomplish your project if a governmental body adopts (or considers adopting) a privately owned, copyrighted technical reference manual or set of safety standards as administrative law (or regulations that carry the force of law)? Or would such obstacles keep you from being able to digitize all of the government's laws (and archives of proposed laws)?

CM: The idea that the law has no copyright is a fundamental part of the American system of government. That applies to states and municipalities as well. The basic decision is Wheaton v. Peters from 1834 but that decision has been reaffirmed over and over. The law is sacred in the American system. You can't have equal protection under the law or due process under the law if there is a poll tax on access to justice.

When we get to a privately developed standards however, it turns into a very interesting issue. The basic mechanism is called Incorporation by Reference. The government will take some external document (such as a model building code) and incorporate the entire text to make it the law of the land. A guy named Peter Veeck was responsible for a landmark decision in 2002 when he published the Texas Building Code which was an incorporation of a privately-developed and very expensive model code. The court ruled that while the model code had copyright, the law of the land did not.

Based on the Veeck decision, my group went and posted many of the public safety codes enacted by the states. We started by purchasing model codes, finding the incorporating legislation, and concatenating the two pieces together and posting the resulting PDFs. More recently, we've done some extensive reworking of the California public safety codes, known as Title 24, converting the entire text into valid XHTML, recoding the graphics as SVG graphics, the formulas as MathML, and regenerating the PDF documents as nicely typeset documents instead of low-quality scans. You can see this work on the web but it is also available as Google Code project.

The federal government also uses this mechanism intensively, with over 2,000 standards incorporated into the Code of Federal Regulations. This is non-trivial stuff, things like all the OSHA safety regulations. The issue was recently considered by a federal group called the Administrative Conference of the U.S. which basically rolled over and endorsed the idea that it is ok for important parts of the law to cost money. (Read EFF's protest letter if you want a good critique of what they did.)

I'm not necessarily saying that government should be able to appropriate any privately-developed standard and make it available. And, I'm not necessarily saying you want OSHA bureaucrats drafting the standards. But, I do think the big standards establishment and the government regulators have cut a deal that results in the law not being available and the costs forked off on private citizens and small business with extortionate monopoly prices. I just paid $847 for a 48-page safety standard from Underwriters Labs and $60 for 2-page safety standard from the Society of Automotive Engineers, both of which are mandated by law in the CFR. They do need money to run their operations, but let me just point out that in 2009 the 501(c)(3) nonprofit Underwriters Labs paid their CEO $2,138,984 and the nonprofit SAE paid their CEO $412,578.

Ancestry.com
by An Anonymous Reader

What is your opinion about websites like Ancestry.com which make use of public records and charge a subscription fee for access? What is the incentive for the government to migrate old documents into digital form when services like these exist? Do you think Ancestry.com should be a 501(c)(3)?

CM: I'm not a big fan of for-profit corporations that have a business model of monetizing the public domain. I'm fine if they exist and fine if they make billions of dollars, but if they are the only game in town they've taken something that belongs to all of us and and turned it into their private property.

The government got snookered on the Ancestry.Com deal. They could have insisted that the raw data be available in bulk for anybody else to use. The folks that approach the government to cut these sweetheart deals argue that is unreasonable because they need a "return on investment" and the argue that if they don't get the return on investment they won't do the deal (and by extension nobody else will do the deal).

But, government can argue much harder! For example, instead of negotiating some exclusive thing with Ancestry.Com, how come they didn't ask the Internet Archive to grab the data? Or put together something creative with a couple of foundations that would pay for the digitization in return for the kind of payback the foundations like to see (e.g., good press, photo opportunity with the President, or other tools of the trade)?

You asked if Ancestry.Com should be a 501(c)(3)? Not all nonprofits do something that I think which should be an essential part of their mission, which is allow others to compete with them. I believe providing open access to all data ought to be a precondition to getting nonprofit status (an idea that Gil Elbaz has been pushing for quite some time). A good example of a nonprofit that builds walls is Guidestar which wants to be the place where you go for all your nonprofit information. The IRS should be making all Form 990 returns of nonprofits available in bulk for anybody to use, which would knock the bottom out of Guidestar's attempts to build walls and force them to stay innovative and provide value.

Pacer Problems
by onyxruby

How much difficulty do you anticipate in getting and publishing records in Pacer? If there's one system that should be free it the decisions that our courts make and yet you are charged by the page just to view the results. Are you concerned about a court taking an unkind view on your archiving what is in Pacer?

CM: PACER is an abomination. Do they take a dim view of our efforts? Well, the Administrative Office of the U.S. Courts reacted so strongly to our efforts to make their data available that they called the FBI on Aaron Swartz and cancelled the only meaningful public access system they had, which consisted of one terminal in each of 17 public libraries around the country. In this era of rapidly decreasing costs, they just boosted their access charges from 8 cents a page to 10 cents a page, arguing that this is a bargain compared to 25 cents a page for a copy machine.

What I find so disturbing about PACER is that when we did get 20 million pages of docs, we were able to conduct a comprehensive analysis of privacy violations in the courts, an analysis that led to a nice thank-you letter from the Judicial Conference and changes in their privacy rules. In other words, only when public interest groups got access to the data did we begin to address privacy issues. Public access is not just about pro se prisoners defending themselves from a jail cell, which is the view of many in the Administrative Office of the Courts. Public access is about attempts like ours (and many other folks) to make our system of justice function better. When we say we are "an empire of laws not a nation of men" that means we write down what we are doing in our courts so that it is no longer the arbitrary decisions of individuals. The paper trail is there so we can make sure the system is functioning properly. When you limit that access to those that only have a Gold Card, you pervert democracy and you pervert justice.

This principle that access to justice shouldn't hide behind a cash register goes back to the Greeks. Theseus in Euripedes' Suppliants said "when there are no public laws, one man holds power by keeping the law all for himself, and there is no more equality. But when the laws are written, the weak man and the rich man have equal justice." The PACER system is justice for the rich man.

Steve Schultze and the team at Princeton did a lot of the heavy lifting on this issue, including the very nice RECAPtheLaw.Org system they built. They've also done a lot of financial analysis that shows that the courts are not only recovering their costs for operating the expensive PACER system, they're making a huge profit (to the tune of $100 million/year) and using their excess profits to do things like buy big-screen TVs in direct violation of the E-Government act.

The basic problem on PACER is the Judicial Conference has delegated the issue to a few techie judges who think what they've built is something great. But, PACER is a hairball of bad PERL code and the result has not served the judges, the bar, or the American people very well. My only hope is that eventually, the Judicial Conference will see that their information technology is 30 years behind the rest of the Internet and feel ashamed at the travesty they have wrought. Until then, we have RECAP.

If you're interested in the issue, a couple of resources to look at are the PACER paper trail and a bit of a rant that I delivered at the Gov 2.0 summit.



How to visualize opened data?
by hardwarejunkie9

The amount of information you're trying to free is entirely staggering and consists, largely, of tables of numbers. These numbers are incredibly significant, but people generally can't see them.

After you free all of this information and make it available to the public (as it should be), then what? What do you expect for the public to do with these numbers? Tables of information are not nearly as useful as graphs. This data needs to be seen, but, more importantly, it needs to be understood.

Do you have any ideas for how to disseminate this information? Perhaps a team-up with someone like gapminder.org's Hans Rosling might be particularly valuable for all of us.


CM: Actually, most of the data I'm looking at is not tables of numbers, it is video, images, textual documents, technical papers, maps, and books.

But, I definitely get what you're saying and there are a lot of numbers. For example, the IRS Form 990s should be structured data instead of PDF documents, so extracting the data from the mass of paper is the initial challenge. There are lots of other examples of this kind of initial extraction, getting what were printed paper docs into structured data. There are some interesting tools, such as OCRopus which does layout analysis, but there needs to be much more. One of the reason we called for a Federal Scanning Commission is that we think there is a lot of directed R&D that could not only scale up mass digitization but could also work on the important value-added of extraction of structured data and handling some of the tricky issues like detecting the presence of Social Security Numbers.

Once you have the data, as you say, then what? I'm a big fan of the idea that the government starts by providing bulk data, then they provide an API, and then maybe they also build web sites and apps and other things along with everybody else out there. That's a 3-part hierarchy that Ed Felten and some of his students developed and it should be a law that applies to all government information systems that are externally facing.

The issue here is that all too often people look at a problem like "digitize all government information" and they want to see the whole stack of the solution from one place. But, I think you can do a layered approach and count on the fact that there is always somebody smarter out there and our job is to reduce the barriers to entry. So, how would I visualize the data? I have no idea, but I'd make damned sure that folks like Martin Wattenberg at Many Eyes and Hans Rosling at Gapminder knew the data was out there and then I'd sit back and be amazed at whatever they come up with. How's that for pushing the problem downstream?



Why is data access so hard?
by CanHasDIY

Can you provide any explanation as to why it is so difficult and cost-prohibitive to obtain records from the government, especially considering the abundance of laws requiring government compliance with requests for information (AKA "Sunshine Laws")?

Is it simply a matter of government employee ineptitude, or have you found evidence of a more nefarious rationale?


CM: I get that question a lot. Why would a member of Congress take deliberate steps to stop public hearings from being available? Why would a court administrator deliberately restrict access to public court documents? Usually the answer is, as Heinlein said, "you have attributed conditions to villainy that simply result from stupidity." When I'm explaining why something is so broken on a big government system, my usual answer is that there are a lot of people still stuck in the 1970s and 1980s, when information dissemination was really, really hard and it took men in white lab coats and computers the size of freight trains to process data. In other words, the problem with a lot of folks who are government gatekeepers is they just don't get the Internet and they don't get computers. In fact, usually when some senior bureaucrat is throwing stones at me, you can find younger staffers working for them rolling their eyes.

That's an optimistic view, and if I'm right things will get better. But, I'm often wrong on my predictions of the future. (I was the guy who saw TimBL demo the web in 1992 and thought to myself "interesting, but it won't scale.")

But, there is also some more nefarious stuff happening, often the accumulation of power by being able to cut exclusive deals with contractor buddies. If your life in government consists of receiving emissaries from Lockheed Martin, maybe you think you're making everybody happy by letting them build you a $1 billion computer system. Often, you think your problems are so unique that the $1 billion solution is the only answer.

And, in some cases, as we've seen from numerous GAO reports, Inspector General reports, Congressional hearings, and newspaper articles, there are some really evil people out there who think the public domain and the government is their personal business opportunity. Looting the federal government is the kind of civic crime that ranks right up there in my book with stealing cookies from Girl Scouts and selling fake medicines to sick people.



Who is the worst?
by TheBrez

Which government agency is the worst to get information from?

CM: I don't know who the worst are (there's a lot of competition for that slot), but the ones that piss me off the most are the ones that should know better.

Public.Resource.Org is a really small operation. I'm the only staff member. My part-time sysadmin is @mdkail who is pretty busy with his day job as CIO at NetFlix. My ISP is Jim Martin and his team at ISC who are kind of busy running the F-Root. My office net is supported by the amazing systems team at O'Reilly which rents me office space at below-market rates.

I'll grant you government would have a tough time getting that kind of help. But, I'm a one-man shop and we run the 4th most popular U.S. government video channel on YouTube, we're the source for a lot of the on-line presence of the U.S. Court of Appeals, and we've supported efforts for the U.S. Congress, the White House, and the National Archives. If we can do this out of Northern California, couldn't the vast resources of the federal government in Washington, D.C. do a whole lot better than they're doing now?

For me, my current bete noir is the U.S. Congress. We got half-way through processing their archives of video from congressional hearings, publishing about 31 terabytes of data. Then, a couple of staffers decided this was a bad idea and pulled the rug out from under us. They actually decided it was a bad idea to publish video from public congressional hearings.

Like any agency, Congress is a mixed bag. We had tons of support from Darrell Issa, for example, and ran a very successful pilot project for him for a year. We talked to all sorts of people on committees and in the various agencies that support the Congress. But, at the end of the day, a couple of staff members were able to decide that the public archive shouldn't be public and they terminated our project. (If you have some time, you might like to read our rather surreal paper trail.)

So, rather than the worst, I think we need to look for the most shameful, the ones that have the privilege and the power and could easily do better. I know it is in vogue to throw stones at government in general and Washington in particular, but there are times when government can be so useful and so awe inspiring it takes your breath away. Government can be that shining city on the hill but we all have to take an active part in our government to keep those lights shining bright.
This discussion has been archived. No new comments can be posted.

Carl Malamud Answers: Goading the Government To Make Public Data Public

Comments Filter:
  • And all this ma-mouth of data will be used by AI's systems like IBM's Watson. Imagine the power someone will have with all this information and the ability to use it all at once!
  • by CanHasDIY ( 1672858 ) on Monday January 23, 2012 @03:02PM (#38796255) Homepage Journal
    Pretty much what I expected, but it's nice to hear it coming from someone who actually deals with government record keeping on a regular basis.

    The Heinlein shout out was noted and appreciated, as well.
  • Tracking Regulations (Score:5, Informative)

    by PeregrinatorSF ( 2559277 ) on Monday January 23, 2012 @03:15PM (#38796437)

    Awesome write up Carl!

    I'd like to expand a little on Carl's answer to, "How to keep track of legislative activity about open access? by oneiros27". FederalRegister.gov has exposed the ability to subscribe to any search via email or RSS. While this still suffers from the problem of what to search for, even a naive search can really help to filter the Federal Register content into some thing more digestible. And having it show up in my inbox definitely gets it in front of me if only briefly.

    Additionally the FederalRegister.gov [federalregister.gov] section pages introduced the idea of 'Suggested Searches' not that long ago. They are searches that are centered around a particular topic or sub-topic and also allow a user to further refine them. For example, the "Science & Technology" [federalregister.gov] section has 10 suggested searches among which are "Broadband Policy" [federalregister.gov] and "Patent, Trademark, and Copyright" [federalregister.gov] which you can subscribe to.

    And personally I have a search on Twitter (for "Federal Register" OR "federalregister.gov") which falls inline with Carl's suggestion of a social media solution. This at least filters to top articles that are garnering attention (and as it happens, the mentioned articles around access to data and research from federally funded research showed up in).

    As a developer on FederalRegister.gov I can say we are always open to suggestions and ideas that would allow the FR content to be more easily consumed!

    Bob

    • Well, as you're open to suggestions -- have you considered trying to do synonym/equivalence/broader expansion? And the ability to turn off the stemmed search?

      I think part of the problem is that what I call something might not what it's called when it's listed in the Federal Register. My work revolves around 'physics data' ... but if you search for that term, you get lots of false positives on 'physical data'... but of course, the broader terms 'science data' and 'research data' also apply. It's almost lik

  • (If you have some time, you might like to read our rather surreal paper trail [resource.org].)

    Forbidden

    I don't have permission to access /hbs/pub/ on this server.

    • Whoops. we had a disk issue, were running off a backup. we're back to the main system, should be working now. Sorry about that!

      • I'm not sure which I'm more impressed by:

        • The thorough responses
        • The speed at which you replied to the problem report
        • That your UID is only 4 off from the combination on my luggage.
  • The US government doesn't always do the right thing, doesn't always consider the long term effects, doesn't always have arms length transactions which happen with Solyndras and don't always treat people fairly or disclose information which would refute lawmakers claims trying to justify their programs.

    Disclosure makes 100% sense in a democracy.

    • "You can always trust America to do the right thing - after they have exhausted all other possibilities." -- Winston Churchill
  • Perhaps I am missing something - but there appears to be only a single document that mentions any sort of trouble (https://house.resource.org/hbs/pub/gov.house.20111214.pdf) Are there other documents that illustrate the problem more completely?

    Thank you for taking the time to answer all these questions. This was a fascinating read.

    • Re: (Score:3, Interesting)

      by Carl Malamud ( 12349 )

      There were a few issues that we never got resolved. You'll notice much of the communications is one-way from me to them and while there are a few answers, but some things just get ignored. The big one was an offer we made to do 7 mbps multicast of all 24 channels of House video onto the Internet2 backbone so everybody could get into the archiving business. They decided they'd rather go with a unicast ("webcast") solution provided by the Library of Congress at lower resolutions with no access by the public t

  • I totally expected more comments than this, shameful slashdot!

    This was a great read; very fascinating stuff. Thank you for taking the time to answer that many questions in depth. And you've opened up a whole new area of interest for me as I love digitization and archiving and open government and was unaware of these efforts.

    One last thing, anything a normal citizen can do to help?
  • wow (Score:2, Insightful)

    by Anonymous Coward

    This is the best interview on Slashdot I've seen in a while.

  • Too long, but I still read it!

    Just a shout out to thank Carl Malamud for very detailed answers. Thanks also /. Timothy for setting up the interview.

    As this article has very few comments, which I assume is because the answers were so well defined (:, it'd be great to see the number of page views.

We are Microsoft. Unix is irrelevant. Openness is futile. Prepare to be assimilated.

Working...