Tulsa Data Platform options

This weekend I’ve been hacking on one of the data ideas we’ve had – scraping the Tulsa Health Department’s restaurant inspection data. I’m evaluating a few options for an Open Data site/hosting, and I’m posting my evaluations here in the hopes they’ll be useful for anyone else trying to do something similar. There’s a basic comparison chart and details below.

|                  | ScraperWiki          | DataCouch     | BuzzData               | Socrata                            |
|------------------|----------------------|---------------|------------------------|------------------------------------|
| Open-source      | Yes                  | Yes           | No                     | No                                 |
| Hosting          | Cloud or self        | Cloud or self | Cloud                  | Cloud                              |
| Data licensing   | Any (free-form)      | ?             | Creative Commons       | Creative Commons, Public Domain    |
| Data formats in  | Anything with a URL  | csv, json     | .csv, .tsv, .xls files | .csv, .tsv, .xls files             |
| Data formats out | csv, json, html, rss | csv, json     | source file            | csv, txt, json, xml, rdf, xls, pdf |
| Project maturity | Stable               | Pre-alpha     | Stable                 | “Enterprise”                       |

DataCouch

VERY unstable. A couple of Tulsa Web Devs have tried to set it up without any luck. Even the datacouch.com site itself goes up and down, and some features don’t work. E.g., right now the Twitter sign-in is broken, so I can’t even tell what the data licensing is.

BuzzData

BuzzData seems more like a social site for sharing data files – i.e., there are no URLs for your data sources, nor for the data you publish on the site. It does feature data history, additional attachments, links to articles and visualizations, and collaborators and followers for each dataset. It seems to fit academic and research collaboration better than application development.

Socrata

Socrata seems like the 800-lb gorilla of data platforms. It also exchanges data as uploaded/downloaded files rather than over plain HTTP request/response, so it’s less useful as a data source for application developers. Socrata seems like the solution we could pitch to city agencies if we ever convince them to open and publish data themselves. They have a “Socrata for Government” white paper and everything.

ScraperWiki

ScraperWiki is my favorite. It’s an open-source Django web app, but it has lots of additional pieces, which makes the initial setup a little hard. (The ScraperWiki installation instructions have some gaps too.) My favorite features:

  • It hosts both the scraper code AND the resulting data. (They gave us a scraper template that lets you host scraper code as a GitHub gist – or you could host your code anywhere that’s URL-accessible, I suppose.)
  • Scraper code can be Python, Ruby, PHP, or JavaScript, with lots of scraping libraries for each (especially Python!) – there’s a sketch of a minimal scraper after this list.
  • Source data can be anything that’s URL-accessible, and there are lots of output formats.
  • It has features for both data developers AND data users – journalists, researchers, app developers – including “request data” (bonus: requests for non-public or non-open data are paid services) and a “get involved” dashboard.
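
To make that concrete, here’s a minimal sketch of what a ScraperWiki scraper looks like in Python. The URL and the page structure are hypothetical stand-ins – the real Tulsa Health Department pages would need their own parsing:

```python
import scraperwiki
import lxml.html

# Hypothetical listing page; the real THD site needs its own parsing.
URL = "http://www.tulsa-health.org/food-safety/inspections"

html = scraperwiki.scrape(URL)             # fetch the page
root = lxml.html.fromstring(html)

for tr in root.xpath("//table//tr")[1:]:   # skip the header row
    cells = [td.text_content().strip() for td in tr.xpath(".//td")]
    if len(cells) < 4:
        continue
    # "name" + "inspected" form the unique key, so re-running the
    # scraper updates rows in the SQLite datastore instead of duplicating.
    scraperwiki.sqlite.save(
        unique_keys=["name", "inspected"],
        data={"name": cells[0], "address": cells[1],
              "inspected": cells[2], "score": cells[3]},
    )
```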

So I set up my own ScraperWiki server, but I still have some things to iron out – I need to set up a mail server, and I need to find out why the scraper editor doesn’t work correctly. I’m having a Skype call with some devs from ScraperWiki, so maybe they can help out. Or we might end up putting our data on scraperwiki.com if we can host our scrapers on GitHub. We’ll see …
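
And if we do end up publishing on scraperwiki.com, applications wouldn’t need files at all – they could pull the data straight off ScraperWiki’s external JSON API. A rough sketch, with a hypothetical scraper short-name:

```python
import requests

# Each scraper's SQLite datastore is queryable over HTTP; "swdata" is
# the default table name. The short-name below is hypothetical.
resp = requests.get(
    "https://api.scraperwiki.com/api/1.0/datastore/sqlite",
    params={
        "format": "jsondict",
        "name": "tulsa_restaurant_inspections",
        "query": "select * from swdata order by inspected desc limit 10",
    },
)
for row in resp.json():
    print(row["name"], row["score"])
```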


Comments

  1. ScraperWiki is a great product, and I’ll admit I’ve experimented with it myself a fair amount. For the task you’re trying to accomplish – scraping a large amount of data and then serving it up to the public – it does a commendable job.

    It pains me to say it, but compared to the other products you list, I guess we are a bit more “enterprisey”. We focus on building a cloud-hosted, turnkey solution for governments, NGOs, and non-profits that want to share their Open Data with their constituents, be they developers, journalists, or everyday citizens. In our ideal world, you wouldn’t have to resort to scraping – that health inspection dataset would already be online, kept constantly up to date, and available via an API. That’s what the City of Chicago did with their restaurant inspection data: http://bit.ly/yBKR6y

    However, we also offer a free public site you can load your datasets onto and use to experiment with our tools. You can sign up for an account at http://opendata.socrata.com and give it a shot. It’s also worth noting that Socrata has a fully functional API, for both open data developers and data publishers, that lets developers build applications that interact with Socrata data through our RESTful endpoints. Check out our developer portal for more detail and some getting-started guides:

    http://dev.socrata.com
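
    For example, reading rows from a published dataset is just an HTTP GET against its resource endpoint. Here’s a rough Python sketch – the domain and the dataset ID are placeholders, not a real dataset:

    ```python
    import requests

    # Placeholder dataset: every Socrata dataset gets a resource endpoint
    # shaped like https://<domain>/resource/<dataset-id>.json
    url = "https://opendata.socrata.com/resource/abcd-1234.json"

    rows = requests.get(url).json()  # a list of row dicts
    for row in rows:
        print(row)
    ```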

    We’re also developing the next generation of the Socrata Open Data API (SODA 2.0), which features, among other things, a MUCH more expressive and easier-to-learn SQL-based query syntax; simpler, cleaner data formats; and tons of other developer-focused features. We’ve gone back to the drawing board and rebuilt the API from scratch to focus on the developer, and I’m very excited to be close to sharing that experience with others.
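
    To give a flavor of that SQL-ish style, filters and projections become $-prefixed query parameters on the same resource endpoint. Another sketch against the placeholder dataset above – the field names are made up:

    ```python
    import requests

    # SoQL-style parameters; "name", "score", and "inspection_date" are
    # hypothetical field names, for illustration only.
    rows = requests.get(
        "https://opendata.socrata.com/resource/abcd-1234.json",
        params={
            "$select": "name, score, inspection_date",
            "$where": "score < 70",
            "$order": "inspection_date DESC",
            "$limit": 10,
        },
    ).json()
    ```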

    Thanks,

    Chris Metcalf
    Developer Evangelist

    http://www.socrata.com
    chris.metcalf (at) socrata.com

  2. Thanks for the info, Chris! I poked around opendata.socrata.com and dev.socrata.com a bit, but I kept running into empty docs. E.g., http://dev.socrata.com/getting-started/ says the service uses standard GET, POST, PUT, and DELETE request methods, but it doesn’t have any documentation for the write operations.

    I also agree that, ideally, governments would publish their data in more usable formats, and we’re hoping our efforts will help encourage our local and state governments to do more of it. Like I said – if any of our government agencies approach us about opening their data, we’re going to refer them to Socrata.

    We’ve done more with ScraperWiki now and are hitting some pain points, so we’re going to keep our scraper code in our own GitHub repository – that way we can run our scrapers on any machine and load the results into any datastore. We’ll look forward to the new Socrata API on opendata.socrata.com!

