Scraping data at home and in the cloud

After evaluating some open data platforms and continuing to work on my scraper for Tulsa Health Department Restaurant Inspections, I’ve changed my approach a bit.

Problems with ScraperWiki

[Screenshot: ScraperWiki error]

My scraper kept timing out when I tried to run it on ScraperWiki.com. I throttled it up, but if I scrape too fast, THD blocks the ScraperWiki.com IP address, and then we’d have problems running any THD scrapers from ScraperWiki.com. (THD already blocked my local IP once when I scraped too fast.) Even so, the scraper only ever got through 3781 records before timing out, so I need to find another way to run it.
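
For context, the throttle here is just the delay between page requests – a minimal sketch of the idea, where the URL pattern and delay value are illustrative, not THD’s actual endpoints:

```python
import time
import urllib2

DELAY_SECONDS = 2  # slow enough to avoid an IP block, fast enough to finish in time

# Illustrative URL pattern; the real scraper builds these from THD's search results.
urls = ["http://example.com/inspections/%d" % i for i in range(1, 101)]

for url in urls:
    html = urllib2.urlopen(url).read()
    # ... parse the page and store the record here ...
    time.sleep(DELAY_SECONDS)  # be polite between requests
```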

The ScraperWiki project on Bitbucket has some documentation for setting up your own instance, but there are lots of omissions, especially for those of us who know nothing about Twisted or Orbited. I set up the Django component at OklahomaData.org, but even though the ScraperWiki code is AGPL-licensed, ScraperWiki doesn’t seem to want to help us copy and localize their business model to Tulsa, Oklahoma. ;)

Alternative data stores for public data

So I looked around for alternatives to ScraperWiki. Chris from Socrata commented on my last post, and I took his advice to read the Socrata Publisher API docs. I also discovered that Oklahoma already has a Socrata site up at data.ok.gov! (I only noticed because their favicon is the Socrata favicon.) I’m still trying to figure out how to register an app with Socrata’s OAuth 2.0 server-side flow implementation, but we will post our data to data.ok.gov if at all possible. And that got me thinking about …
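
For reference, the server-side flow itself should be the standard OAuth 2.0 authorization-code dance. Here’s a minimal sketch of how I expect it to work, where the client ID/secret, redirect URI, and exact endpoint paths are assumptions to confirm against the Publisher API docs:

```python
import json
import urllib
import urllib2

# Placeholder values; they come from registering an app with the data site,
# which is the part I'm still figuring out.
SITE = "https://data.ok.gov"
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
REDIRECT_URI = "https://example.com/oauth/callback"

# Step 1: send the user to the site's authorization page to approve the app.
authorize_url = "%s/oauth/authorize?%s" % (SITE, urllib.urlencode({
    "client_id": CLIENT_ID,
    "response_type": "code",
    "redirect_uri": REDIRECT_URI,
}))

# Step 2: exchange the ?code= sent back to REDIRECT_URI for an access token.
def get_access_token(code):
    body = urllib.urlencode({
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "grant_type": "authorization_code",
        "redirect_uri": REDIRECT_URI,
        "code": code,
    })
    response = urllib2.urlopen(SITE + "/oauth/access_token", body)
    return json.loads(response.read())["access_token"]
```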

A loosely-coupled approach to data scraping

If we publish to data.ok.gov, we should be able to publish to other data stores too, and we should be able to run our scrapers anywhere. That is, any Tulsa developer should be able to develop, run, and store public data locally or in the cloud.

So now I’m developing my scraper code in git, running it on localhost, and storing the data in a local CouchDB server. The nice thing about this approach is that I can easily move up into the cloud – i.e., GitHub, Heroku, and Cloudant – but I’m not locked into a technology (except maybe git) or a platform/cloud provider.
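
The storage side is only a few lines with the couchdb-python library. In this sketch the database name and environment variable are just what I’d use; pointing COUCHDB_URL at a Cloudant URL is all it takes to store in the cloud instead:

```python
import os
import couchdb  # couchdb-python

# Local server by default; set COUCHDB_URL to a Cloudant URL
# (e.g. https://username:password@username.cloudant.com/) to go cloud.
COUCHDB_URL = os.environ.get("COUCHDB_URL", "http://localhost:5984/")
DB_NAME = "thd_inspections"  # illustrative database name

server = couchdb.Server(COUCHDB_URL)
db = server[DB_NAME] if DB_NAME in server else server.create(DB_NAME)

def store_inspection(record):
    """Save one scraped inspection record as a CouchDB document."""
    return db.save(record)  # returns (doc_id, revision)
```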

We may still need or want to build a “data-pusher” piece that manages where data should be pushed – i.e., local, Cloudant, Freebase, data.ok.gov, etc. But I much prefer this approach for its flexibility and control.
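
That piece might be as simple as this sketch, where each destination is a callable and the pusher fans every document out to whichever destinations are configured (the two destination functions are stubs, not working clients):

```python
def push_to_couchdb(doc):
    # real version: db.save(doc) against local CouchDB or Cloudant
    pass

def push_to_data_ok_gov(doc):
    # real version: POST the document to the Socrata Publisher API
    pass

class DataPusher(object):
    """Fan each scraped document out to every configured data store."""

    def __init__(self, destinations):
        self.destinations = destinations

    def push(self, doc):
        for destination in self.destinations:
            destination(doc)

# Push locally and to data.ok.gov; add or remove destinations as needed.
pusher = DataPusher([push_to_couchdb, push_to_data_ok_gov])
```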


Comments

  1. Tim Black says:

    You said “My scraper kept timing out….”

    If you’re using regular expressions, watch out for hitting the maximum recursion limit due to excessive backtracking. One way I’ve dealt with this in a scraper script is to run the actual scraper code as a separate Python module in a subprocess; if one regular expression takes longer than a second or two to run, kill the subprocess from its caller’s context and report to the user that the regular expression needs to be modified to reduce the recursion (a rough sketch of that pattern follows below).

    • Tim Black says:

      …or just watch out for hitting an excessive backtracking limit; I think I hit a recursion limit in PHP, and backtracking in Python.
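
    A rough sketch of that pattern (the scrape_page.py module, which would run the regexes for one page, and the two-second limit are illustrative, not actual project code):

    ```python
    import subprocess
    import time

    def scrape_with_timeout(url, timeout=2):
        """Run the regex-heavy step in its own process so a runaway
        backtracking regex can't hang the whole scraper."""
        proc = subprocess.Popen(["python", "scrape_page.py", url])
        start = time.time()
        while proc.poll() is None:
            if time.time() - start > timeout:
                proc.kill()  # caller then reports the regex needs less backtracking
                return None
            time.sleep(0.1)
        return proc.returncode
    ```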

  2. The 3781 really stands out to me. Why that number? How many records are there in total? Perhaps, if you are using a 1-second delay between requests: 3781/60/60 = 1.05 hours. Will ScraperWiki only run scrapers for an hour? How long did it run before timing out?

  3. Tim,

    I’m not explicitly using regular expressions, but I’m using pyquery, which might be. So I have little control over how pyquery works, and since pyquery is included in ScraperWiki’s Python 3rd-party libraries, I would assume those libraries are vetted and/or tested for that kind of thing?

    Randall,

    Different runs timed out at different times – 11782s, 8584s, 6585s, then 6866s.

    Heroku ran the entire script and posted 10,283 documents into Cloudant.

  4. groovecoder,

    You could try timing each call to pyquery to see whether it is somehow the source of the timeout.
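
    Something like this sketch would show it (the selector is whatever the scraper already uses; "tr.inspection" here is made up):

    ```python
    import time
    from pyquery import PyQuery

    def timed_pyquery(html, selector):
        """Time one pyquery parse + selection to see if it's the slow part."""
        start = time.time()
        results = PyQuery(html)(selector)
        print "pyquery took %.2fs for selector %r" % (time.time() - start, selector)
        return results

    # e.g. timed_pyquery(page_html, "tr.inspection")
    ```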
