NOTE: If you tried to go to nationbrowse.com and ended up here, that site is now defunct. The years-old code was partially broken and the GIS bits were quite the strain on the tiny little VPS it ran on. Spiritual successors that I’ve had the pleasure of working on include the Spokesman-Review Census Center and census.ire.org (which is open source). Definitely check those out for your Census data-browsing needs.
Original post (February 22, 2010) follows:
I haven’t quite graduated yet, but I did take my “capstone” class last semester. The objective was vaguely, “do something innovative,” so I pitched (what I thought was) the data app of my dreams.
This is how it all went down. This is essentially a brain dump of all the little notes I’ve collected while working on this project. Boy, do I collect a lot of notes.
The end result
Quick note: The server running the demo is ill-equipped for the massive dataset size — I’ll talk more about this below. …If you click around and you get a timeout error, wait a minute to let the server catch up (or cache up…) and try again.
In its current state, nationbrowse.com is a mess, but showing it off is the easiest starting point to work from:
Warning: A lot of technical talk, from here on out.
Heavily inspired by: The Apps for America contests [1,2], ThisWeKnow, DataMasher, this Mapping L.A. Neighborhoods project from the Los Angeles Times, and EveryBlock. (ThisWeKnow and DataMasher, we actually hadn’t heard of until partway through the semester — was really great to see more reference projects show up along the way.)
The team: Graham Greenfield, Jeremy Howard, Nick Roma, and myself. While all had programming experience, none of the others had used Python, developed GIS software, or worked on a Web app with real-world data. (It went extremely well. They picked up quickly. Python is awesome.)
Source code: Here, on github.
Caching: Memcached. Using python-memcached instead of (the now unmaintained) cmemcache. Using the cache middleware along with custom caching all over the place. (There are a few notes in the next section, regarding nginx+memcached.)
Mapping: OpenLayers, for client-side shape rendering.
Graphs: Google Chart API.
Issues & things we cut
A lot of our initial ambitions were fiercely struck down by performance considerations. Last I checked, a bzip2-compressed database dump sat at over one gigabyte due to the sheer number of states, counties, and ZIP codes stored and the precision of the shapefiles and statistics. On a VPS with 256MB of RAM, pitting PostgreSQL against a set of data at this size proved to be a royal pain in the ass.
Wanted to use MatPlotLib to generate server-side graphs: again, performance was killing the site. This was actually completely implemented [1, 2], but it wasn’t strong enough for us to demo with. Instead, we built wrappers around the Google Chart API, which offloads the rendering work to some magical Google server.
nginx is being used as a reverse proxy and we’d hoped it could serve cached results directly out of memcached. There are still some issues with corrupted/misencoded data being returned to the browser. (The classic “gibberish loads in browser” effect.) Not sure if this is due to the large size of the things being stored, or to some encoding misconfiguration. If anyone has any ideas, I’d love to hear ’em. (I’m using this serve-from-cache method on this blog, and it’s working just fine, with a near-exact configuration.)
Similar to DataMasher, we wanted to develop a way to let users automatically create comparative (and inferential) statistics. Unlike DataMasher, we sought to build something statistically sound — we were talked out of this by some folks at the Social Science Statistics Center, who noted that blindly comparing Census data would create junk data in nearly every case. At this point, we just threw our all into descriptive statistics — hence a focus on maps, charts, and tables.
Pieces of note
The cacheutil library is a little “swiss army knife” that includes a few useful functions: the safe_get_cache/safe_set_cache/safecache methods and template tag, which sanitize and hash cache keys; some decorators for caching methods, class methods, and class properties; and a middleware for those wanting nginx to serve directly from the cache [1,2].
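For a rough idea of what “sanitize and hash cache keys” means in practice, here is a minimal sketch. It is not the actual cacheutil code: the function name make_safe_key and the "nb" prefix are made up for illustration. The one fact it leans on is that memcached rejects keys longer than 250 bytes or containing whitespace or control characters, so hashing an arbitrary string gives you a key that is always short and always clean.

```python
import hashlib


def make_safe_key(raw_key, prefix="nb"):
    """Return a memcached-safe key derived from an arbitrary string.

    Hypothetical helper, not part of cacheutil. The MD5 hex digest is
    always 32 characters from [0-9a-f], so the result has a fixed length
    and never contains whitespace or control characters.
    """
    digest = hashlib.md5(raw_key.encode("utf-8")).hexdigest()
    return "%s:%s" % (prefix, digest)


# A messy, human-readable key becomes something memcached will accept.
key = make_safe_key("county map: some place / with spaces\n")
```

MD5 is used here only as a fast, stable fingerprint, not for security. The trade-off is that hashed keys are no longer readable when you poke around in the cache.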
A threading shortcut function that allows you to call some function in the background, while the rest of your view moves on and gets returned to the user’s browser. (Useful for loading views or calling functions in advance, to pre-cache ’em before a user actually goes there.)
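A shortcut like that can be very small. The sketch below is an assumption about the general shape, not the project’s actual function: run_in_background and warm_cache are hypothetical names, and the list stands in for a real cache.

```python
import threading


def run_in_background(func, *args, **kwargs):
    """Start func(*args, **kwargs) on a separate thread and return the thread.

    Hypothetical helper. The caller carries on immediately, which is the
    point when a view wants to kick off pre-caching and still respond fast.
    """
    thread = threading.Thread(target=func, args=args, kwargs=kwargs)
    thread.daemon = True  # do not keep the process alive for this thread
    thread.start()
    return thread


warmed = []  # stand-in for a real cache


def warm_cache(place_id):
    # Stand-in for an expensive view or query whose result gets cached.
    warmed.append(place_id)


worker = run_in_background(warm_cache, "example-county")
worker.join()  # only so this example can inspect the result afterwards
```

In a real view you would skip the join() and just return the response. Note that errors raised in the background thread never reach the request that started it, so the background function needs its own error handling.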
Some pluggable utilities for generating Google Graphs URLs.
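To give a feel for what such a utility does, here is a small sketch that builds a pie chart URL. It is not the project’s code: pie_chart_url is a hypothetical name, and the endpoint and the cht/chs/chd/chl parameters are the ones the Google image chart service used at the time. Google has since deprecated that service, so treat this as an illustration of the approach rather than something to deploy.

```python
from urllib.parse import urlencode

# Endpoint of the Google image chart service (since deprecated).
CHART_BASE_URL = "https://chart.googleapis.com/chart"


def pie_chart_url(labels, values, width=400, height=200):
    """Build a pie chart URL. Hypothetical helper.

    Assumes the values are already scaled to the 0-100 range expected
    by the plain text data format ("t:").
    """
    params = [
        ("cht", "p"),                                      # chart type: pie
        ("chs", "%dx%d" % (width, height)),                # size in pixels
        ("chd", "t:" + ",".join(str(v) for v in values)),  # data series
        ("chl", "|".join(labels)),                         # slice labels
    ]
    return CHART_BASE_URL + "?" + urlencode(params)


url = pie_chart_url(["Urban", "Rural"], [60, 40])
```

The appeal is that the server only assembles a string, and the browser fetches the rendered image from somewhere else entirely.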
If you are interested in using MatPlotLib and Django, you can split your chart generation functions and the bits that actually grab the data & generate a PNG response. While this project couldn’t use it in the end, there’s a lot of potential for dynamic awesomeness there.
Ted came up with the name a long time ago, when I first threw around the idea of a data project like this.
After we repeatedly shot down Flash-based maps and discovered that server-side map tiles were out of the question, the dynamic elements of the map ended up heavily inspired by staring at the source of this Los Angeles Times mapping project. (And weeding my way through the OpenLayers documentation and mailing lists.) It’s not the prettiest, but there’s a lot of dynamic flexibility to it that I haven’t yet seen in other OpenLayers implementations.
Setting up a PostGIS database is a pain. Importing the entire State, County, and ZipCode sets is even worse. I did it here. Note that I had to manually import the Puerto Rican municipios (equivalent to counties) by tweaking the INSERT statements, unescaping some of the characters with diacritics, and forcing PostgreSQL to run it as UTF-8. Hopefully that’ll save you some pain if you try this someday.
Census data is a mess. Know how to get to raw data from the homepage? Yeah. (Try the Download Center over here.) The data was pipe-delimited (and therefore, PostgreSQL could import it directly), but turning the many, many arbitrary columns into model fields was a pain.
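If you would rather look at a pipe-delimited file in Python before deciding on model fields, the standard csv module handles it. This is only a sketch: the column names and rows below are made-up placeholders (not real Census columns or figures), and read_pipe_delimited is a hypothetical helper.

```python
import csv
import io

# Made-up placeholder data; real Census files have far more columns.
SAMPLE = (
    "geo_id|name|total_pop\n"
    "0001|Example County|12345\n"
    "0002|Sample County|678\n"
)


def read_pipe_delimited(handle, int_fields=()):
    """Read pipe-delimited rows into dicts keyed by the header line.

    Fields named in int_fields are converted from strings to integers.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="|"):
        row = dict(row)
        for field in int_fields:
            row[field] = int(row[field])
        rows.append(row)
    return rows


rows = read_pipe_delimited(io.StringIO(SAMPLE), int_fields=("total_pop",))
```

Each dict maps fairly directly onto keyword arguments for a model, which at least makes the column-to-field mapping explicit in one place.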
Oh, and mixing data from disparate sources? (Say, the FBI Uniform Crime Reports, whose data is entirely distributed in Excel spreadsheets.) Good luck.
I would really love to see a more open method to access a lot of this data. After working on this project, I have to say that there are still significant barriers to doing useful things with open government data. ThisWeKnow uses RDF/SPARQL and is — judging from their goals and execution — an excellent start.
I don’t believe NationBrowse is “complete.” It’s a nice technology demo and was a nice experiment in how a large data app can be built with very few resources. But it’s a data ghetto. It’s a standalone site, with very little context and very little use of the massive underlying dataset.
If I could have another go at this, I’d have emphasized data export functionality or some other way to get “joined” data from disparate sets and sources. Possibly create an API around the underlying data. And even then, the data still needs to go in, somehow.
But hey, if four guys in college can find a way to make something of that data, for (near-)free, maybe there’s hope.