Last night I was cleaning my computer files trying to delete everything I don't need anymore. In particular, I usually try to replace code I've written with more robust and supported alternatives I can find online. There is this Ruby-based reader and writer for the WARC (Web ARChive) file format I wrote almost a year ago. I thought that surely someone else had written a better one by now so I could just use it instead. A quick search on Github and Google suggested otherwise so I decided to release it on the intertubes since there is at least one other human being that needs it1.
For the uninitiated, WARC (Web ARChive) is a file format for storing web crawls. It is cool because it is a standardized way of decoupling data collection from analysis. This way, scientists can easily share collections of web crawl data in form of Web ARChives2 and focus on what matters: studies and analyses. I really like this "crawl now, analyze later" workflow as it lets me test and fine-tune my data extraction code offline.
Concretely, the Web ARChive format just tell you how to combine a bunch of digital resources into one aggregate archival file3. So instead of overwhelming your filesystem by storing each HTTP resource into a separate file, the idea is to combine them into large files that you can store contiguously on hard drives or tapes. It makes it much easier to manage large collections. The awesome Wayback machine which let you browse past snapshots of the Web is actually built on top of WARC. So, yes, there are big projects using this file format and it has some traction in the open source community. Unfortunately, it is not well known so a lot of crawlers don't ouptut WARC files and parsers are not available in every language.
Anyway, I'm happy to provide the Ruby community with my WARC reader and writer. The source code is hosted on a Github repository and is also available via rubygems (just "gem install warc"). Documentation and examples will follow soon. Please contribute to the code if you are interested.
For those of you with Ruby installed on your machine, let's relive a little bit of Internet history together. Download this WARC archive and replay it using my simple WARC proxy:
gem install warc
warc replay sopa-wikipedia-homepage.warc.gz
Fire up your browser and configure it to use the proxy on port 9292. HTTP requests done through the proxy will be served directly from the WARC file which was captured on January 18 2012. Note that there is a dashboard at http://warc/ which shows you the resources available.
Now if you remember, January 18 2012 was the day of protests against the SOPA. If you navigate to Wikipedia's English home page (http://en.wikipedia.org/wiki/Main_Page) you can relive Wikipedia's blackout as if it was happening today.
So this is it! A piece of Internet history you can archive, exchange and replay! Follow the Github repo for more examples and developement.
Hoping this page won't be the first hit on Google anymore for "ruby warc parser". ↩
For example, CommonCrawl and the Internet Archive offer such data. ↩
Exact details can be found in the WARC File format specification ↩