NationBrowse

NOTE: If you tried to go to nationbrowse.com and ended up here, that site is now defunct. The years-old code was partially broken and the GIS bits were quite the strain on the tiny little VPS it ran on. Spiritual successors that I’ve had the pleasure of working on include the Spokesman-Review Census Center and census.ire.org (which is open source). Definitely check those out for your Census data-browsing needs.

Original post (February 22, 2010) follows:


I haven’t quite graduated yet, but I did take my “capstone” class last semester. The objective was vaguely, “do something innovative,” so I pitched (what I thought was) the data app of my dreams.

This is how it all went down. This is essentially a brain dump of all the little notes I’ve collected while working on this project. Boy, do I collect a lot of notes.

The end result

Quick note: The server running the demo is ill-equipped for the massive dataset size — I’ll talk more about this below. …If you click around and you get a timeout error, wait a minute to let the server catch up (or cache up…) and try again.

NationBrowse screenshot

In its current state, nationbrowse.com is a mess, but showing it off is the easiest starting point to work from:

Warning: A lot of technical talk, from here on out.

Background bits

Heavily inspired by: The Apps for America contests [1,2], ThisWeKnow, DataMasher, this Mapping L.A. Neighborhoods project from the Los Angeles Times, and EveryBlock. (We actually hadn’t heard of ThisWeKnow and DataMasher until partway through the semester — it was really great to see more reference projects show up along the way.)

The team: Graham Greenfield, Jeremy Howard, Nick Roma, and myself. While all had programming experience, none of the others had used Python, developed GIS software, or worked on a Web app with real-world data. (It went extremely well. They picked up quickly. Python is awesome.)

Source code: Here, on github.

The basics: Python, Django, and PostgreSQL. GeoDjango via PostGIS.

Server: Served over Apache+mod_wsgi, on an internal port. nginx sits at port 80 and proxies requests over to the Apache instance.

Caching: Memcached. Using python-memcached instead of (the now unmaintained) cmemcache. Using the cache middleware along with custom caching all over the place. (There are a few notes in the next section, regarding nginx+memcached.)
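
For reference, the Django side of that wiring typically looks something like this (a minimal sketch for a Django 1.x-era project; the server address and timeout are illustrative, not the project's actual settings):

```python
# settings.py -- sketch of the memcached + cache-middleware wiring
# (Django 1.x era). Server address and timeout are illustrative.
CACHE_BACKEND = 'memcached://127.0.0.1:11211/'  # uses python-memcached

MIDDLEWARE_CLASSES = (
    'django.middleware.cache.UpdateCacheMiddleware',     # must be listed first
    'django.middleware.common.CommonMiddleware',
    'django.middleware.cache.FetchFromCacheMiddleware',  # must be listed last
)

CACHE_MIDDLEWARE_SECONDS = 3600  # whole-page cache lifetime
```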

Mapping: OpenLayers, for client-side shape rendering.

Graphs: Google Chart API.

Data: U.S. Census TIGER/Line for shapefiles. U.S. Census 2000 & American Community Survey 2008 for most statistics. FBI Uniform Crime Reports for other numbers.

Issues & things we cut

A lot of our initial ambitions were fiercely struck down by performance considerations. Last I checked, a bzip2-compressed database dump sat at over one gigabyte due to the sheer number of states, counties, and ZIP codes stored and the precision of the shapefiles and statistics. On a VPS with 256MB of RAM, pitting PostgreSQL against a dataset of this size proved to be a royal pain in the ass.

Wanted to use TileCache/Mapnik, the “EveryBlock stack,” to generate maps server-side: performance was awful given the hardware/dataset circumstances. (Not to mention adding the configuration complexity of having a whole Apache mod_python instance running alongside the site’s Django wsgi instance.) Instead: we found a way to render shapes in OpenLayers, on the user’s Web browser, by sending along raw WKT geo data in the Javascript for a given map. The (sometimes huge) increase in page size was well worth avoiding the (dangerously high) server load.
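
Roughly, the server side of that approach looks like the sketch below, assuming GeoDjango models (the model and field names are made up for illustration). GEOSGeometry objects expose a .wkt property, and simplify() can knock the vertex count down before the shapes ever hit the template:

```python
# views.py -- sketch of handing raw WKT to the template, where OpenLayers
# parses and renders it client-side. Model and field names are hypothetical.
from django.shortcuts import render_to_response
from places.models import County  # hypothetical GeoDjango model

def county_map(request, state_abbr):
    shapes = []
    for county in County.objects.filter(state=state_abbr):
        geom = county.geom.simplify(0.01, preserve_topology=True)  # shrink the payload
        shapes.append({'name': county.name, 'wkt': geom.wkt})
    # The template writes each WKT string into the page's Javascript,
    # where OpenLayers.Format.WKT turns it back into vector features.
    return render_to_response('maps/county_map.html', {'shapes': shapes})
```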

Wanted to use MatPlotLib, to generate server-side graphs: again, performance was killing the site. This was actually completely implemented [1, 2], but not strong enough for us to demo with. Instead: we built wrappers around the Google Chart API, which offloads the rendering work to some magical Google server.

nginx is being used as a reverse proxy and we’d hoped it could serve cached results directly out of memcached. There are still some issues with corrupted/misencoded data being returned to the browser. (The classic “gibberish loads in browser” effect.) Not sure if this is due to the large size of things being stored, or some encoding misconfiguration — if anyone has any ideas, I’d love to hear ’em. (I’m using this serve-from-cache method on this blog, and it’s working just fine, with a near-exact configuration.)
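
For the curious, the arrangement is roughly the sketch below (the key scheme and settings are illustrative, not the site's real ones). One thing worth double-checking in a setup like this: nginx's memcached module serves whatever bytes it finds at the key verbatim, so storing anything other than a plain byte string (python-memcached will happily pickle a unicode object, for instance) comes back to the browser as gibberish.

```python
# middleware.py -- sketch of stashing whole responses where nginx's
# memcached module can find them. Key prefix and settings are illustrative.
import memcache

MC = memcache.Client(['127.0.0.1:11211'])
KEY_PREFIX = '/nb'  # must match whatever $memcached_key nginx builds

class NginxMemcachedMiddleware(object):
    def process_response(self, request, response):
        if request.method == 'GET' and response.status_code == 200:
            # Store a plain byte string so nginx can serve it verbatim.
            MC.set(KEY_PREFIX + request.path, str(response.content), time=3600)
        return response
```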

Similar to DataMasher, we wanted to develop a way to let users automatically create comparative (and inferential) statistics. Unlike DataMasher, we sought to build something statistically sound — we were talked out of this by some folks at the Social Science Statistics Center, who noted that blindly comparing Census data would create junk data in nearly every case. At this point, we just threw our all into descriptive statistics — hence a focus on maps, charts, and tables.

Pieces of note

The cacheutil library is a little “swiss army knife” that includes a few useful functions: the safe_get_cache/safe_set_cache/safecache methods and template tag, which sanitize and hash cache keys; some decorators for caching methods, class methods, and class properties; and a middleware for those wanting nginx to serve directly from the cache [1,2].
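
The key-sanitizing idea, stripped down (a simplified sketch, not the exact cacheutil code): memcached keys can't contain whitespace or exceed 250 characters, so anything questionable gets hashed into a safe, fixed-length key.

```python
# Simplified sketch of the safe_set_cache / safe_get_cache idea:
# hash arbitrary keys so they are always valid memcached keys.
from hashlib import sha1
from django.core.cache import cache

def _safe_key(key):
    # memcached keys: no whitespace or control characters, 250-char max
    return 'nb:%s' % sha1(str(key)).hexdigest()

def safe_set_cache(key, value, timeout=3600):
    cache.set(_safe_key(key), value, timeout)

def safe_get_cache(key):
    return cache.get(_safe_key(key))
```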

A threading shortcut function that allows you to call some function in the background, while the rest of your view moves on and gets returned to the user’s browser. (Useful for loading views or calling functions in advance, to pre-cache ’em before a user actually goes there.)
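
In spirit, it's along these lines (a stripped-down sketch; the real helper in the repo has more plumbing):

```python
# Sketch of a fire-and-forget helper: run a function in a background
# thread so the current view can return to the browser immediately.
import threading

def run_in_background(func, *args, **kwargs):
    thread = threading.Thread(target=func, args=args, kwargs=kwargs)
    thread.daemon = True  # don't keep the process alive on shutdown
    thread.start()
    return thread

# e.g. warm the cache for an expensive page before anyone asks for it
# (the function name here is made up):
# run_in_background(build_state_map_context, 'missouri')
```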

Some pluggable utilities for generating Google Chart API URLs.
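
The gist of those wrappers (a toy version; the real ones handle more chart types and options): the Chart API is just a URL with the data baked into the query string, so building one is mostly string-wrangling.

```python
# Toy version of a Google Chart API URL builder (pie chart).
import urllib

def pie_chart_url(labels, values, size='400x250', title=''):
    params = {
        'cht': 'p',                                      # chart type: pie
        'chs': size,                                     # width x height in pixels
        'chd': 't:' + ','.join(str(v) for v in values),  # text-encoded data
        'chl': '|'.join(labels),                         # slice labels
        'chtt': title,                                   # chart title
    }
    return 'http://chart.apis.google.com/chart?' + urllib.urlencode(params)

# pie_chart_url(['Under 18', '18-64', '65+'], [24.0, 63.0, 13.0], title='Age')
```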

A ton of Javascript magic, using jQuery and OpenLayers. Between the template and the static helper functions, you get that nice map with toggle-able shapes (to change which variable the map is shaded by) and the nice hover effect on the shapes — as seen on the homepage.

If you are interested in using MatPlotLib and Django, you can split your chart-generation functions from the bits that actually grab the data & generate a PNG response. While this project couldn’t use it in the end, there’s a lot of potential for dynamic awesomeness there.
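
The pattern looks roughly like this (a minimal sketch with made-up data; real views would pull their numbers from the models): build the figure in one place, then have the view push the PNG straight into the response via the Agg backend.

```python
# views.py -- sketch of serving a matplotlib figure as a PNG from Django.
from django.http import HttpResponse
from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg

def build_population_figure():
    """Figure construction lives apart from any HTTP concerns."""
    fig = Figure(figsize=(6, 4))
    ax = fig.add_subplot(111)
    ax.bar([0, 1, 2], [5.6, 5.8, 5.9])  # placeholder numbers
    ax.set_title('Population (millions)')
    return fig

def population_chart(request):
    fig = build_population_figure()
    response = HttpResponse(content_type='image/png')
    FigureCanvasAgg(fig).print_png(response)  # write PNG bytes into the response
    return response
```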

Credits due

Ted came up with the name a long time ago, when I first threw around the idea of a data project like this.

My team was awesome for going along with something so ridiculously ambitious. For a one-semester undergraduate capstone project, in which 75% of the team hadn’t even used the language, it really worked out. Graham and Jeremy were troopers and put a lot of work into the MatPlotLib renderers [1, 2] that weren’t fully implemented in the end product. Nick, without any prior Javascript or jQuery experience, built a GUI “query builder” (which, unfortunately, is not functional in the live demo).

After repeatedly shooting down Flash-based maps and discovering that server-side map tiles were out of the question, the dynamic elements of the map were heavily inspired by staring at the source of this Los Angeles Times mapping project. (And by weeding my way through the OpenLayers documentation and mailing lists.) It’s not the prettiest, but there’s a lot of dynamic flexibility to it that I haven’t yet seen in other OpenLayers implementations.

Last complaints

Setting up a PostGIS database is a pain. Importing the entire State, County, and ZipCode sets is even worse. I did it here — note that I had to manually import the Puerto Rican municipios (the equivalent of counties) by tweaking the INSERT statements, unescaping some of the characters with diacritics, and forcing PostgreSQL to run the import as UTF-8. Hopefully that’ll save you some pain if you try this someday.
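
If you're attempting the same today, GeoDjango's LayerMapping utility can handle much of the shapefile-to-PostGIS legwork. Here's a sketch of one way to do it, with a made-up model and field mapping (not the exact import this project used; TIGER/Line attribute names vary by vintage):

```python
# Sketch of loading a TIGER/Line shapefile into a GeoDjango model.
# Model, field mapping, and path are illustrative; TIGER attribute
# names differ between releases.
from django.contrib.gis.utils import LayerMapping
from places.models import County  # hypothetical model with a MultiPolygonField

county_mapping = {
    'name': 'NAME',          # model field -> shapefile attribute
    'state_fips': 'STATEFP',
    'geom': 'MULTIPOLYGON',  # geometry field maps to the geometry type
}

lm = LayerMapping(County, '/data/tiger/tl_2009_us_county.shp',
                  county_mapping, encoding='latin-1')  # latin-1 handles the diacritics
lm.save(strict=True, verbose=True)
```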

Census data is a mess. Know how to get to raw data from the homepage? Yeah. (Try the Download Center over here.) The data was pipe-delimited (and therefore, PostgreSQL could import it directly), but turning the many, many arbitrary columns into model fields was a pain.
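
Getting one of those pipe-delimited files into PostgreSQL is at least mechanical; here's a sketch via psycopg2's COPY support (database name, table, and file path are made up):

```python
# Sketch of bulk-loading a pipe-delimited Census extract with COPY.
# Database name, table, and path are illustrative.
import psycopg2

conn = psycopg2.connect('dbname=nationbrowse')
cur = conn.cursor()
with open('/data/census/acs_2008_extract.txt') as f:
    cur.copy_expert("COPY acs_place_stats FROM STDIN WITH DELIMITER '|' NULL ''", f)
conn.commit()
```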

Oh, and mixing data from disparate sources? (Say, the FBI Uniform Crime Reports, whose data is entirely distributed in Excel spreadsheets.) Good luck.

I would really love to see a more open method to access a lot of this data. After working on this project, I have to say that there are still significant barriers to doing useful things with open government data. ThisWeKnow uses RDF/SPARQL and is — judging from their goals and execution — an excellent start.

Epilogue

I don’t believe NationBrowse is “complete.” It’s a nice technology demo and was a nice experiment in how a large data app can be built with very few resources. But it’s a data ghetto. It’s a standalone site, with very little context and very little use of the massive underlying dataset.

If I could have another go at this, I’d have emphasized data export functionality or some other way to get “joined” data from disparate sets and sources. Possibly create an API around the underlying data. And even then, the data still needs to go in, somehow.

But hey, if four guys in college can find a way to make something of that data, for (near-)free, maybe there’s hope.

I implore you to dig around in the repository and especially check out the notable bits.

You can comment on this post via Google Buzz. Or, you can contact me directly.

This is a quote I love to come back to, time and again.

Even as a Web developer — a person who gets paid to go out and build up the great expanses of the Internet — I love this quote. And, to a great extent, I believe in it.

Electronic communities build nothing. You wind up with nothing. We are dancing animals. How beautiful it is to get up and go out and do something. We are here on Earth to fart around. Don’t let anybody tell you any different.

— Kurt Vonnegut, in A Man Without a Country


Google Buzz was released earlier this week. Facebook redesigned its main page. A lot of people paid a whole lot of attention to these things.

I had a good conversation with Carolina a couple nights ago, about the substitution of social networks for real social interaction. (Her friend Amanda expressed dismay at the whole thing, which is what got us on the subject.) And while I concede there are plenty of uses for these communities — reconnecting with distant folks, planning events, having non-live conversations in comment streams — I can’t help but notice:

There are an increasing number of people I speak to who believe we’re placing far too much collective importance on these things. Me? I fear the people too young to remember dial-up Internet and earlier. And seriously, think about it: I’m sure there are some kids who communicate through these networking sites more than any other medium — text, phone, or in-person. This is all they’ll have ever known. (In practice, I’m sure the reality lies somewhere between texting and the Internet.)

In my wildest dreams, I imagine we’ll get to a point where this dawns on everyone and we have a large cultural push back. Maybe, like the whole/organic food fad, it’ll only be a minority. But sometimes I feel like the undercurrents are there.

Does anybody even remember Google Wave? Friendster? Xanga?

The iPad & Game Consoles

A quick thought or two on the iPad hubbub and the “casual computing vs. tinkering” conversation that’s been happening as of late. But first:

ThinkPad, anyone?

I concede “iPad” is a terrible name simply because of the similarity to Apple’s existing “iPod,” but I really don’t understand the fascination with “pad” jokes. A “-Pad” name has been pretty successful — without the toilet humor — for about 18 years now.

There are examples of names like this in recent history — take the Nintendo Wii, for example. As with the Wii, I’m pretty sure we’ll move on from picking on nomenclature once we start using the damn thing.

Which sort of leads into my main point

One of the general arguments against the iPad being successful is that it’s more expensive than a netbook, it’s not as full featured, and it doesn’t even multitask, etc.…

Who cares? Between us, my brother and I own several high-end computers that, by default, are closed systems. They don’t multitask. You can’t easily make your own content for them. You can’t really mess around with a lot of the performance-oriented settings.

They are: a Playstation 3, an Xbox 360, a Wii, and a few other systems.

For the most part, direct comparisons between these devices and “general computers” tend to be “apples to oranges” comparisons. (The classic “console vs. PC gaming” argument is probably the best example.)

They’re purpose-built machines; they’re in a different league, and that’s that.

There are lots of folks who own Macs or older PCs and want a way to play the latest games — and many of them own game consoles because that provides the easiest out-of-the-box experience, as opposed to buying and maintaining a PC gaming rig. And it’s much easier than trying to play Crysis on a PC whose hardware is four or five years out of date, or on one with a sub-$500 price tag.

My point is, there is a place for the iPad and people will buy it even if it is (several orders of magnitude) less versatile and far more expensive than a netbook. It doesn’t have to be a netbook to succeed. As long as the iPad gives the user enough of what they want (presumably: Web content, books, and apps) and wraps that up in an enjoyable experience, then Apple has a legitimate competitor to the netbook.

Another point of reference: Some folks will go out and buy an Xbox 360 because of the platform-exclusive titles, like Halo. I could try to talk about how technologically superior the PlayStation 3 is to the Xbox 360, but I can’t specifically dissuade someone who loves Halo. Some folks will go the iPhone/iPad route specifically for the exclusive apps and features, too.

On hackers and tinkerers

On the other hand, there is the “tinkering argument” — that the spread and adoption of these “closed systems” will bring an end to the days of tinkerers.

Video game consoles also provide a great analogue to the iPad’s “closedness” in this regard: they come “closed,” of course. But my Xbox 360 is modified to play burned games, and doing the same to the Wii is, supposedly, a piece of cake. You don’t have to look far to find people willing to do the same with Apple’s closed systems.

(My brother and I do live on the far end of the tinkering range — in both PCs and game consoles — so my experiences are obviously a tiny bit skewed.)

Interestingly enough, I do notice that a great percentage of the PC gamers I know do tinker with the settings, update their drivers, upgrade their parts, etc., on a normal basis — or at the very least, know how to perform those tasks. And while I know of primarily-console folks who’ve modified or hacked their systems, they are a much rarer breed. This is exactly what the fear is: tinkering falling by the wayside because the closed-off systems inherently have fewer things to tinker with.[1]

While I have no reservations about the “closed” nature of the iPad specifically, I am one of the people who will be concerned if this truly is the “future of computing.”

At best, some console hacks are merely inconvenient[2], while at worst there are those that are outright illegal. I strongly believe that those who want to do more with their computing devices will inevitably find a way to do it. I just think it will play out better for everyone if we encourage and facilitate rather than criminalize curiosity and innovation.


[1] Alex Payne & Jim Stogdill both have excellent points on this, which inspired me to write a bit about it.
[2] Older PS3 models do allow you to install Linux on an unmodified console. And as far as I know, there are no hacks for the PS3 that allow you to play burned games.

I don’t have comments set up on this site yet, but if you’d like to, you can comment on this blog post over on Facebook. You don’t even have to be my friend.

I'm a three-time (soon to be four-time) published author. When aspiring authors learn this, they invariably ask what word processor I use. It doesn't fucking matter! I happen to write in Emacs. I also code in Emacs, which is a nice bonus. Other people write and code in vi. Other people write in Microsoft Word and code in TextMate+ or TextEdit or some fancy web-based collaborative editor like EtherPad or Google Wave. Whatever. Picking the right text editor will not make you a better writer. Writing will make you a better writer. Writing, and editing, and publishing, and listening -- really listening -- to what people say about your writing. […] Just fucking write, then publish, then write some more. One day your writing will get featured on a site like Reddit and you'll go from 5 readers to 5000 in a matter of hours, and they'll all tell you how much your writing sucks. And most of them will be right! Learn how to respond to constructive criticism and filter out the trolls, and you can write the next great American novel in edlin.

Mark Pilgrim, on The Setup. (Emphasis mine.)