So I went down to DC this weekend to participate in the Open Data Day Hackathon! There were some tremendous projects proposed, but the one that caught my eye from the start of the day was one proposed by Jim Harper of the Cato Institute to track down the genealogy of legislation put forth in congress.
Basically, the goal is to programatically find similar passages in multiple bills. This can be used for many purposes, including looking at sections in large omnibus bills and getting an idea if the things that get shoehorned in it have been proposed previously, and what happened then.
So, our team largely consisted of myself, Alexander Furnas of the Sunlight Foundation, and John Bloch of 10up, with guidance from Jim Harper (previously mentioned, of the Cato Institute), Molly Bohmer (also of the Cato Institute), and Kirsten Gullickson providing some clarification on the way the XML data we were working with was structured.
I spent my time building a MySQL database and a PHP import script that could map all the relevant data from the XML files in to it.
Alexander worked in Python primarily fleshing out a way of doing Latent Semantic Analysis on the data we’ve extracted to sort out what is similar to what, and where can we find meaning in it.
John spent his time working on a front-end for the final dataset, to help end-users get something useful out of the data we’re building.
The data that we were pulling from can be readily accessed by anyone through the Library of Congress at the following URLs:
- http://thomas.loc.gov/home/gpoxmlc108/ [download all xmls] (6.2 mb tar & gzip)
- http://thomas.loc.gov/home/gpoxmlc109/ [download all xmls] (32 mb tar & gzip)
- http://thomas.loc.gov/home/gpoxmlc110/ [download all xmls] (69 mb tar & gzip)
- http://thomas.loc.gov/home/gpoxmlc111/ [download all xmls] (110 mb tar & gzip)
- http://thomas.loc.gov/home/gpoxmlc112/ [download all xmls] (91 mb tar & gzip)
I’m currently putting some finishing touches on the DB structure, but when that’s done, I’ll be releasing that and the import script in a subsequent post, as well as a SQL dump for the final accumulated and sorted data — ripe for data mining. As the day was wrapping up, I had someone come to me inquiring about data mining for references to money allocated in appropriations bills and the like, and I was able to very quickly do a MySQL query along the lines of
SELECT * FROM `resolution_text` WHERE `text` LIKE '$%'
to find anything that started with a dollar sign and then listed an amount over a very limited data set of three million rows or such. The final data set will be much larger.