For the past 5 years, I’ve haunted the halls of the U.S. Congress with a geeky ask:

broadcast-quality video from all congressional hearings should be posted on the Internet.

I gave a tech talk at Google, drew up business plans (pdf) to start a new nonprofit, enlisted the

help (pdf) of the Public Printer, and harassed my friends in the mainstream media and my friends

working for the former Speaker (pdf).

My motivation has been a deeply felt belief that one should not have to live inside the Washington, D.C.

beltway in order to observe the proceedings of the U.S. Congress. No matter what our political beliefs,

no matter how much we disagree on the issues, we must all agree that the business of the Congress is the

business of the People. Today, that means that business must be conducted so that it is visible on

the Internet.

Today, we are announcing a new site, House.Resource.Org.

The site currently contains over 500 hearings of the House Committee on Oversight and Government

Reform that we obtained from C-SPAN. Under an agreement reached with Chairman Darrell Issa

and Speaker of the House John A. Boehner, we are now in receipt of several

hundred more high-resolution files from 2009 and 2010 hearings that will be loaded on the site. In addition,

the Committee has agreed to furnish us with high-resolution files from all hearings in 2011, which we

will be posting on a weekly basis. Note that this is not a real-time service: we are posting big files

after the fact.

A letter received today from Chairman Darrell Issa and Speaker of the House John A. Boehner

states that it is their hope “that this project is only the beginning of an effort to eventually

bring all congressional committee video online.”

On a technical note, house.resource.org serves the files over HTTP, rsync, and FTP. We’ve also put

in place many of the official GPO transcripts as signed PDF and as raw text. If you’d like to view

the files, you’ll be able to do so on YouTube,

the Internet Archive,

and on C-SPAN. We also expect other organizations

to make use of this material. The C-SPAN video is licensed for non-commercial attribution use and the

material from the Congress is in the public domain.

We have two hacks that I think are fairly significant. First, copies of this data are being furnished

on disk drives to the Office of House Preservation and the National Archives and Records Administration,

officials of both organizations happily accepting this addition to our nation’s permanent record. It

is our hope that archivists, librarians, and historians will make good use of this material.

The second hack is something we are doing that leverages some amazing work being done by the

YouTube engineering team. In many cases, we’ve been

able to take the video of a hearing and mash it up with the official GPO transcript. Look at this

embedded video of a hearing about the AIG Collapse and Federal Rescue:

For this video, we took the text version of the official transcript and hacked it up by hand to

produce a version of the transcript without timecodes.

We cut out any embedded prepared statements, turned the names of the speakers from raw text into a more

cc-friendly [Speaker Name], typed in any commentary at the beginning from C-SPAN, and then the

whole thing was fed into YouTube’s magic transcription engine. What popped back out was

timecode-aligned closed captions based on the official transcript

suitable for use in your own accessibility applications, to use as a search tool into the video, or

as the basis for translation into other languages.
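
The speaker-name step described above was done by hand, but it could be sketched in Python roughly as follows. This is a minimal sketch, not the process actually used: the label pattern (a title, a surname, and a period at the start of a paragraph, as in "Mr. Smith. Good morning.") is an assumption about the transcript layout, and the names `SPEAKER`, `caption_friendly`, and `prepare_transcript` are hypothetical.

```python
import re

# Assumed speaker-label layout: title + surname + period at the start of a
# paragraph, e.g. "Mr. Smith. Good morning." This is a guess for illustration,
# not a documented transcript format.
SPEAKER = re.compile(
    r"^\s*((?:Mr|Mrs|Ms|Dr)\.|Chairman|Chairwoman|Senator|Secretary)\s+"
    r"([A-Z][A-Za-z'\-]+)\.\s+"
)


def caption_friendly(paragraph: str) -> str:
    """Rewrite a leading 'Mr. Smith. ' label as '[Mr. Smith] '."""
    return SPEAKER.sub(
        lambda m: f"[{m.group(1)} {m.group(2)}] ", paragraph, count=1
    )


def prepare_transcript(text: str) -> str:
    """Unwrap each blank-line-separated paragraph and rewrite its label."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # " ".join(p.split()) collapses hard line wraps inside a paragraph.
    return "\n\n".join(caption_friendly(" ".join(p.split())) for p in paragraphs)


if __name__ == "__main__":
    sample = "Mr. Smith. Good\nmorning.\n\nThe witness was sworn."
    print(prepare_transcript(sample))
```

Paragraphs that do not begin with a recognized label pass through unchanged, so cutting prepared statements and typing in opening commentary would still be manual work.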

There are a few limits on this magical service as it is in early beta. We don’t have official

GPO transcripts yet for all the hearings, and the timecode-alignment engine is still limited to

videos that are 90 minutes or less. But there is great hope that this technology can provide accessible

video not only of the workings of Congress but of the workings of any deliberative body that uses

official transcripts, such as courts, city councils, and state legislatures. There is an added bonus,

which is that having such a large trove of verified transcripts that we can align with video means

that this text can be used to train the machine-transcription engine to be more accurate by comparing

what the software recognized with what was actually said.
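
One common way to compare what the software recognized with what was actually said is word error rate: the word-level edit distance between the two texts, divided by the length of the verified transcript. The sketch below illustrates that general measure only. It is not a description of how the YouTube engine is trained or evaluated, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    `reference` is the verified transcript text; `hypothesis` is the
    machine transcription of the same audio.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words (standard Levenshtein rows).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    official = "the committee will come to order"
    recognized = "the committee will come order"
    print(word_error_rate(official, recognized))
```

A lower value means the machine transcription is closer to the official transcript. Punctuation is not stripped here, so a real comparison would need to normalize both texts first.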

If you would like to help with the process of prepping transcripts, please contact me. (Hint: my

email address is on the Public.Resource.Org about page, or contact me on

Twitter where I trade as @carlmalamud.) In terms of

timing, we should have the backfile fully loaded by the end of January. We’re expecting our first

shipment of current hearings by mid-month, and this service should be fully operational by end of

February. Right now, we’re just doing the House Oversight Committee, but we have the capacity

to do one or two more committees, so the service may expand quickly.