I love the idea of Wolfram Alpha. I haven’t used it enough to tell how the reality compares. Mostly what I’ve seen is Easter eggs, which are fun, but I wanted to see if I could do something more substantive.
An article I’d seen in the news lately piqued my interest in the number of PhDs by US state. It took a goodly amount of fiddling before I figured out I could get the answer by submitting this query:
‘How many phds in California?’
Take a look at what this gives you: http://www.wolframalpha.com/input/?i=how+many+phds+in+california%3F
Lots of great information, including the specific answer I want, 331,101.
If I want to do this for all fifty states, I’m going to need to use the API, not the web interface. I registered for an API account, which was quick and easy, and checked out the bindings available on their site. The bindings left a lot to be desired, so I decided to just use urllib2.
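To give a feel for what going straight to the API looks like, here is a minimal sketch of building one request URL per state. It is not the notebook's code: it uses Python 3's urllib.parse rather than urllib2, the endpoint is assumed to be the v2 query URL, and build_query_url and the YOUR-APP-ID placeholder are names made up for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint for the v2 query API; check it against your API account docs.
API_ENDPOINT = "http://api.wolframalpha.com/v2/query"


def build_query_url(question, app_id):
    """Return the full request URL for one plain-English question."""
    # urlencode handles the escaping (spaces become '+', '?' becomes '%3F').
    params = urlencode({"input": question, "appid": app_id})
    return API_ENDPOINT + "?" + params


# Shortened list for illustration; the real run covers every state plus DC.
STATES = ["California", "Idaho"]
urls = [build_query_url("How many phds in %s?" % s, "YOUR-APP-ID") for s in STATES]
print(urls[0])
```

Each URL can then be handed to urllib.request.urlopen (urllib2.urlopen on Python 2.7) to get the XML response back.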
Lately I can’t abide dealing directly with XML, so I found a nice library called xmltodict. I’m sure [insert xml parsing technique] would work just as well or better.
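As one example of an alternative technique, here is a rough sketch that pulls the plain-text answers out of a response using the standard library's ElementTree instead of xmltodict. The sample XML is a made-up stand-in shaped like an API response (pods containing subpods containing plaintext); the real responses carry many more attributes and pods, and pod_texts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed response used only to exercise the parser.
SAMPLE_XML = """<queryresult success="true" numpods="2">
  <pod title="Input interpretation">
    <subpod title=""><plaintext>California | PhD holders</plaintext></subpod>
  </pod>
  <pod title="Result">
    <subpod title=""><plaintext>331101 people</plaintext></subpod>
  </pod>
</queryresult>"""


def pod_texts(xml_string):
    """Map each pod title to the list of plaintext strings it contains."""
    root = ET.fromstring(xml_string)
    result = {}
    for pod in root.iter("pod"):
        # Skip empty <plaintext/> elements, whose .text is None.
        texts = [node.text for node in pod.iter("plaintext") if node.text]
        result[pod.get("title")] = texts
    return result


print(pod_texts(SAMPLE_XML)["Result"])
```

Keying on the pod title is fragile if two pods share a title, but it keeps the lookup readable for a one-off script like this.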
As I was working I realized I also wanted to pull in total population, total adult population, etc. so I could talk about percentages.
I did this all in an IPython Notebook. If you’re not using it, you need to start; it’s totally awesome. Check out http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html.
Here is my notebook (created using Python 2.7): https://gist.github.com/4052828
Here is a rendered version of the notebook: http://nbviewer.ipython.org/4052828/
All in all it was a fun exercise, but it still felt like page scraping. The one advantage is that you can easily ask the same question with slight modifications (e.g., how many people in California vs. how many adults in Idaho) and get back essentially the same response structure, so that’s handy. And although Wolfram Alpha returns a machine-readable data structure (XML), it’s not exactly richly semantically tagged. There are plain-text bits that have to be parsed. For example, sometimes a population will be expressed as its raw number, sometimes as 1.2 million, so I had to put special handling in my code for that case. It would be nice if such quantities were available with no plain-text parsing required.
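The special handling for quantities can be sketched roughly like this. It is an illustration of the idea rather than the code from the notebook, and it assumes the text contains a number with optional thousands separators, optionally followed by a scale word; parse_quantity and the MULTIPLIERS table are hypothetical names.

```python
import re

# Scale words this sketch knows about; extend as needed.
MULTIPLIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

# A number (commas and one decimal point allowed), then an optional scale word.
_QUANTITY = re.compile(
    r"([\d,]*\.?\d+)\s*(thousand|million|billion)?", re.IGNORECASE
)


def parse_quantity(text):
    """Turn '331,101 people' or '1.2 million' into an integer count."""
    match = _QUANTITY.search(text)
    if match is None:
        raise ValueError("no number found in %r" % text)
    number = float(match.group(1).replace(",", ""))
    word = match.group(2)
    if word:
        number *= MULTIPLIERS[word.lower()]
    # round() guards against float noise such as 1199999.9999999998.
    return int(round(number))


print(parse_quantity("1.2 million"))
print(parse_quantity("331,101 people"))
```

Note that a value like 1.2 million has already been rounded by the time it reaches you, so any percentage computed from it inherits that imprecision.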
One other improvement would be to thread the calls to the API. It takes a good amount of time to process all of these serially, and since each request is independent, there’s no reason they couldn’t run concurrently. I’ll definitely take a look at doing this.
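A minimal sketch of what threading the calls might look like, using Python 3's concurrent.futures (on Python 2.7 this would need the futures backport or the threading module). The fetch_answer function is a hypothetical stand-in that sleeps instead of hitting the API, and the worker count is an arbitrary choice; check the API's rate limits before picking a real one.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_answer(state):
    """Stand-in for one API round trip (hypothetical; sleeps instead of calling out)."""
    time.sleep(0.1)
    return "answer for %s" % state


def fetch_all(states, fetch=fetch_answer, max_workers=8):
    """Run fetch over every state concurrently and return {state: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(states, pool.map(fetch, states)))


states = ["California", "Idaho", "Vermont", "Texas"]
results = fetch_all(states)
print(results["Idaho"])
```

Threads suit this job because the work is almost entirely waiting on the network, so the interpreter lock is not a bottleneck.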
If you’re just interested in results, here you go. This is the percentage of adults with PhDs and the percentage of adults with at least an associate’s degree, ranked by state from highest to lowest.
EDIT: After writing this, I found http://pypi.python.org/pypi/wolframalpha/1.0. This looks to be a nicer wrapper; I’ll have to take a look and see if it works as advertised. Let me know if you’ve used it successfully.
----------------------------------------------------------------------------------------------------
phds
#1: Washington DC: 3.68%
#2: Maryland: 2.32%
#3: Massachusetts: 2.30%
#4: New Mexico: 1.75%
#5: Vermont: 1.66%
#6: Connecticut: 1.60%
#7: Delaware: 1.56%
#8: Virginia: 1.54%
#9: California: 1.42%
#10: New Jersey: 1.41%
#11: Rhode Island: 1.39%
#12: Colorado: 1.37%
#13: Oregon: 1.35%
#14: Washington State: 1.34%
#15: Pennsylvania: 1.31%
#16: New York: 1.30%
#17: Hawaii: 1.25%
#18: New Hampshire: 1.24%
#19: Arizona: 1.18%
#20: Utah: 1.17%
#21: Minnesota: 1.16%
#22: Montana: 1.14%
#23: North Carolina: 1.13%
#24: Illinois: 1.13%
#25: Nebraska: 1.12%
#26: Maine: 1.12%
#27: Florida: 1.10%
#28: Alaska: 1.10%
#29: Kansas: 1.08%
#30: Tennessee: 1.05%
#31: Missouri: 1.05%
#32: Georgia: 1.04%
#33: Idaho: 1.02%
#34: Wyoming: 1.02%
#35: Iowa: 1.02%
#36: Wisconsin: 0.99%
#37: Michigan: 0.98%
#38: South Dakota: 0.97%
#39: Ohio: 0.97%
#40: Texas: 0.94%
#41: South Carolina: 0.93%
#42: Alabama: 0.92%
#43: Indiana: 0.91%
#44: North Dakota: 0.87%
#45: Mississippi: 0.82%
#46: Oklahoma: 0.81%
#47: Louisiana: 0.81%
#48: Kentucky: 0.80%
#49: Nevada: 0.78%
#50: Arkansas: 0.77%
#51: West Virginia: 0.77%
----------------------------------------------------------------------------------------------------
college graduates
#1: Washington DC: 50.47%
#2: Massachusetts: 48.28%
#3: Connecticut: 45.71%
#4: New Hampshire: 44.77%
#5: Colorado: 44.07%
#6: Vermont: 43.98%
#7: New Jersey: 43.76%
#8: Maryland: 43.53%
#9: Minnesota: 42.94%
#10: Hawaii: 41.90%
#11: Washington State: 41.71%
#12: Virginia: 41.58%
#13: New York: 40.38%
#14: Rhode Island: 39.72%
#15: North Dakota: 39.42%
#16: Maine: 39.06%
#17: Illinois: 39.05%
#18: Oregon: 39.04%
#19: Florida: 38.78%
#20: Nebraska: 38.50%
#21: Montana: 38.27%
#22: Kansas: 38.26%
#23: California: 38.13%
#24: Delaware: 37.30%
#25: South Dakota: 37.15%
#26: Iowa: 36.75%
#27: Wisconsin: 36.73%
#28: Pennsylvania: 36.66%
#29: Utah: 36.56%
#30: Arizona: 36.26%
#31: North Carolina: 35.96%
#32: Michigan: 34.98%
#33: Wyoming: 34.34%
#34: New Mexico: 34.09%
#35: Georgia: 33.89%
#36: South Carolina: 33.78%
#37: Idaho: 33.74%
#38: Ohio: 33.65%
#39: Missouri: 33.54%
#40: Alaska: 33.04%
#41: Texas: 31.96%
#42: Indiana: 31.04%
#43: Oklahoma: 30.72%
#44: Tennessee: 30.29%
#45: Alabama: 30.18%
#46: Nevada: 30.18%
#47: Kentucky: 28.44%
#48: Mississippi: 27.99%
#49: Arkansas: 26.74%
#50: Louisiana: 26.32%
#51: West Virginia: 25.49%