Posts Tagged 'python'

Playing with Wolfram Alpha and Python

I love the idea of Wolfram Alpha, though I haven’t used it enough to tell how the reality compares. Mostly what I’ve seen is easter eggs, which are fun, but I wanted to see if I could do something more substantive.

An article I’d seen in the news recently piqued my interest in the number of PhDs by US state. It took a goodly amount of fiddling before I figured out I could get the answer by submitting this query:

‘How many phds in California?’

Take a look at what this gives you: http://www.wolframalpha.com/input/?i=how+many+phds+in+california%3F

Lots of great information, including the specific answer I want, 331,101.

If I want to do this for all fifty states, I’m going to need to use the API rather than the web interface. Registering for an API account was quick and easy, and I checked out the bindings available on their site. They left a lot to be desired, so I decided to just use urllib2.

Lately I can’t abide dealing directly with XML, so I found this nice library xmltodict. I’m sure [insert xml parsing technique] would work just as well or better.
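To give a feel for the response shape, here’s a minimal sketch that parses a trimmed-down sample of the XML using the stdlib’s ElementTree instead of xmltodict (so it needs no extra packages). The pod/subpod/plaintext layout matches the API’s responses, but the sample document and the `extract_pod_text` helper are illustrative, not the notebook’s actual code.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample of the kind of XML the API returns; the
# pod/subpod/plaintext structure is real, the values illustrative.
SAMPLE_XML = """\
<queryresult success="true">
  <pod title="Result">
    <subpod>
      <plaintext>331101 people</plaintext>
    </subpod>
  </pod>
</queryresult>
"""

def extract_pod_text(xml_string, pod_title):
    # Find the pod with the given title and return its plaintext.
    root = ET.fromstring(xml_string)
    for pod in root.iter('pod'):
        if pod.get('title') == pod_title:
            return pod.find('subpod/plaintext').text
    return None
```

Each answer you care about lives in a pod’s plaintext, which is why the rest of this ends up feeling like page scraping.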

As I was working I realized I also wanted to pull in total population, total adult population, etc. so I could talk about percentages.

I did this all in an IPython Notebook (if you’re not using this, you need to start, it’s totally awesome. Check out http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html).

Here is my notebook (created using Python 2.7): https://gist.github.com/4052828

Here is a rendered version of the notebook: http://nbviewer.ipython.org/4052828/

All in all it was a fun exercise, but it still felt like page scraping. The one advantage is that you can easily ask the same question with slight modifications (e.g., how many people in California vs. how many adults in Idaho) and get back essentially the same response, so that’s handy. And although Wolfram Alpha returns a machine-readable data structure (XML), it’s not exactly richly semantically tagged. There are plain-text bits that have to be parsed. For example, sometimes a population will be expressed as its raw number, sometimes as “1.2 million,” so I had to put special handling in my code for that case. It would be nice if such quantities were available with no plain-text parsing required.
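That special handling can be done with a hypothetical helper along these lines (not the notebook’s exact code), which normalizes both forms to an integer:

```python
import re

def parse_quantity(text):
    """
    Normalize both a raw count like "331101 people" and a
    scaled quantity like "1.2 million" to an int.
    """
    scales = {'thousand': 1e3, 'million': 1e6, 'billion': 1e9}
    match = re.match(r'([\d.,]+)\s*(\w+)?', text.strip())
    number = float(match.group(1).replace(',', ''))
    word = (match.group(2) or '').lower()
    # Unknown (or absent) scale words fall back to a factor of 1.
    return int(round(number * scales.get(word, 1)))
```

So `parse_quantity("1.2 million")` gives 1200000 and `parse_quantity("331101 people")` gives 331101.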

Another adjustment that would be good would be to thread out the calls to the API. Processing all of these serially takes a good amount of time, and there’s no reason the requests couldn’t run in parallel. I’ll definitely take a look at doing this.
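As a sketch of what threading the calls out could look like, here’s a thread-backed pool via multiprocessing.dummy (a drop-in, thread-based clone of the multiprocessing API, well suited to I/O-bound work like these requests). `fetch_state` is a dummy stand-in for the real Wolfram Alpha call:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool

def fetch_state(state):
    # Stand-in stub: imagine this submitting
    # "how many phds in <state>?" to the API.
    return (state, len(state))

states = ['California', 'Idaho', 'Vermont']
pool = Pool(8)  # 8 worker threads; one slow request no longer blocks the rest
results = dict(pool.map(fetch_state, states))
pool.close()
pool.join()
```

With fifty-one serial round trips, even a modest pool size should cut the total wall time substantially.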

If you’re just interested in results, here you go. This is the percentage of adults with PhDs and the percentage of adults with at least an associate’s degree, ranked by state from highest to lowest.

EDIT: After writing this, I found http://pypi.python.org/pypi/wolframalpha/1.0. This looks to be a nicer wrapper; I’ll have to take a look and see if it works as advertised. Let me know if you’ve used it successfully.

----------------------------------------------------------------------------------------------------
phds
#1: Washington DC: 3.68%
#2: Maryland: 2.32%
#3: Massachusetts: 2.30%
#4: New Mexico: 1.75%
#5: Vermont: 1.66%
#6: Connecticut: 1.60%
#7: Delaware: 1.56%
#8: Virginia: 1.54%
#9: California: 1.42%
#10: New Jersey: 1.41%
#11: Rhode Island: 1.39%
#12: Colorado: 1.37%
#13: Oregon: 1.35%
#14: Washington State: 1.34%
#15: Pennsylvania: 1.31%
#16: New York: 1.30%
#17: Hawaii: 1.25%
#18: New Hampshire: 1.24%
#19: Arizona: 1.18%
#20: Utah: 1.17%
#21: Minnesota: 1.16%
#22: Montana: 1.14%
#23: North Carolina: 1.13%
#24: Illinois: 1.13%
#25: Nebraska: 1.12%
#26: Maine: 1.12%
#27: Florida: 1.10%
#28: Alaska: 1.10%
#29: Kansas: 1.08%
#30: Tennessee: 1.05%
#31: Missouri: 1.05%
#32: Georgia: 1.04%
#33: Idaho: 1.02%
#34: Wyoming: 1.02%
#35: Iowa: 1.02%
#36: Wisconsin: 0.99%
#37: Michigan: 0.98%
#38: South Dakota: 0.97%
#39: Ohio: 0.97%
#40: Texas: 0.94%
#41: South Carolina: 0.93%
#42: Alabama: 0.92%
#43: Indiana: 0.91%
#44: North Dakota: 0.87%
#45: Mississippi: 0.82%
#46: Oklahoma: 0.81%
#47: Louisiana: 0.81%
#48: Kentucky: 0.80%
#49: Nevada: 0.78%
#50: Arkansas: 0.77%
#51: West Virginia: 0.77%
----------------------------------------------------------------------------------------------------
college graduates
#1: Washington DC: 50.47%
#2: Massachusetts: 48.28%
#3: Connecticut: 45.71%
#4: New Hampshire: 44.77%
#5: Colorado: 44.07%
#6: Vermont: 43.98%
#7: New Jersey: 43.76%
#8: Maryland: 43.53%
#9: Minnesota: 42.94%
#10: Hawaii: 41.90%
#11: Washington State: 41.71%
#12: Virginia: 41.58%
#13: New York: 40.38%
#14: Rhode Island: 39.72%
#15: North Dakota: 39.42%
#16: Maine: 39.06%
#17: Illinois: 39.05%
#18: Oregon: 39.04%
#19: Florida: 38.78%
#20: Nebraska: 38.50%
#21: Montana: 38.27%
#22: Kansas: 38.26%
#23: California: 38.13%
#24: Delaware: 37.30%
#25: South Dakota: 37.15%
#26: Iowa: 36.75%
#27: Wisconsin: 36.73%
#28: Pennsylvania: 36.66%
#29: Utah: 36.56%
#30: Arizona: 36.26%
#31: North Carolina: 35.96%
#32: Michigan: 34.98%
#33: Wyoming: 34.34%
#34: New Mexico: 34.09%
#35: Georgia: 33.89%
#36: South Carolina: 33.78%
#37: Idaho: 33.74%
#38: Ohio: 33.65%
#39: Missouri: 33.54%
#40: Alaska: 33.04%
#41: Texas: 31.96%
#42: Indiana: 31.04%
#43: Oklahoma: 30.72%
#44: Tennessee: 30.29%
#45: Alabama: 30.18%
#46: Nevada: 30.18%
#47: Kentucky: 28.44%
#48: Mississippi: 27.99%
#49: Arkansas: 26.74%
#50: Louisiana: 26.32%
#51: West Virginia: 25.49%

Get your Wu-name in six lines of Python

I recently had a need to generate screen-names for a bunch of test users. I remembered the wonderful Wu-name generator and it was too tempting to pass up.

import urllib
from lxml.html import fromstring

def get_wu_name(first_name, last_name):
    """
    >>> get_wu_name("Jeff", "Elmore")
    'Ultra-Chronic Monstah'
    """
    w = urllib.urlopen("http://www.recordstore.com/wuname/wuname.pl",
                       urllib.urlencode({'fname': first_name, 'sname': last_name}))
    doc = fromstring(w.read())
    return [d for d in doc.iter('span') if d.get('class') == 'newname'][0].text

Okay, technically it’s more than six if you count the docstring and the line wrap, but still. Also, I felt like there should be a better way to do the last line, but this works well enough.

Smarter caching of Django QuerySets

Let me start by saying I love the Django ORM. It does an excellent job of “just working” and hiding all the dirty SQL business required to get what you need out of your database. There are times, however, when it doesn’t behave exactly like you’d want. One of these that I ran into recently was the behavior of caching (or more specifically, pickling) QuerySets. By default, Django evaluates QuerySets fully whenever they are pickled. I often find myself displaying the first few dozen rows from QuerySets that contain tens of thousands of rows. Each request fetches only the rows it needs, but we have to redo the query on each page load. Caching the QuerySet and sharing it between requests would solve this problem, but if we only need the first few rows, caching all 10,000 doesn’t make sense.

What we need is a QuerySet that can be partially evaluated, cached, and shared between requests. There are a few technical challenges to overcome, so let’s get started by understanding how QuerySets work.

The User model is a convenient one to work with, so let’s create some users. We need enough that they won’t all be pulled down in one chunk.

>>> from django.contrib.auth.models import User
>>> for i in range(1000):
...     User.objects.create(username="test%d" % i)
...
<User: test0>
<User: test1>
<User: test2>
...
<User: test999>
>>>

Now that we have some users to work with, let’s see what’s happening with the QuerySet and its _result_cache. We can use connection.queries to see what’s hitting the db (note: DEBUG must be set to True).

Let’s see what happens when we try to get a specific record:

>>> from django.db import connection
>>> connection.queries = []  # Clear the user-creation queries from before
>>> qs = User.objects.all()
>>> qs[50]
<User: test50>
>>> connection.queries
[{'time': '0.001', 'sql': u'SELECT `auth_user`.`id`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`password`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`is_superuser`, `auth_user`.`last_login`, `auth_user`.`date_joined` FROM `auth_user` LIMIT 1 OFFSET 50'}]
>>> qs._result_cache is None
True

We see that Django executed just enough of a query to get back the single result and didn’t populate its _result_cache at all. Now let’s see what happens when we start iterating over a QuerySet:

>>> connection.queries = []  # Clear the queries again.
>>> for i in qs:
...     break
...
>>> connection.queries
[{'time': '0.038', 'sql': u'SELECT `auth_user`.`id`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`password`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`is_superuser`, `auth_user`.`last_login`, `auth_user`.`date_joined` FROM `auth_user`'}]
>>> len(qs._result_cache)
100

This time there’s nothing in the SQL that limits how many results we get, but we’ve only fetched 100 records, and we know there are 1000. Let’s see what’s actually going on:

class QuerySet(object):
    ...
    def __iter__(self):
        if self._result_cache is None:
            self._iter = self.iterator()
            self._result_cache = []
        if self._iter:
            return self._result_iter()
        # Python's list iterator is better than our version when we're just
        # iterating over the cache.
        return iter(self._result_cache)

    def _result_iter(self):
        pos = 0
        while 1:
            upper = len(self._result_cache)
            while pos < upper:
                yield self._result_cache[pos]
                pos = pos + 1
            if not self._iter:
                raise StopIteration
            if len(self._result_cache) <= pos:
                self._fill_cache()

Since the cache has not yet been populated, __iter__ sets self._iter to self.iterator() (this is the method that gets data from the database and converts each row into a model object; more on it below). _result_iter calls self._fill_cache() to populate _result_cache, then loops back around and yields objects from the cache it just filled. Let’s take a look at what _fill_cache does:

    ...
    def _fill_cache(self, num=None):
        """
        Fills the result cache with 'num' more entries (or until the results
        iterator is exhausted).
        """
        if self._iter:
            try:
                for i in range(num or ITER_CHUNK_SIZE):
                    self._result_cache.append(self._iter.next())
            except StopIteration:
                self._iter = None

QuerySet._fill_cache fills up self._result_cache from self._iter (which, as we saw earlier, was set to self.iterator()). We’re finally getting close to actually hitting the database, but we’re not quite there yet.
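The machinery so far can be modeled in a few lines of plain Python. This is my own toy sketch of the cache-and-iterate pattern, not Django code: it pulls from an underlying iterator in chunks and replays the cache from the start on each iteration.

```python
ITER_CHUNK_SIZE = 100

class CachedIterable(object):
    """Toy model of QuerySet's partially filled _result_cache."""
    def __init__(self, iterable):
        self._iter = iter(iterable)
        self._cache = []

    def __iter__(self):
        pos = 0
        while True:
            # Yield anything already in the cache.
            while pos < len(self._cache):
                yield self._cache[pos]
                pos += 1
            if self._iter is None:  # underlying iterator exhausted
                return
            self._fill_cache()

    def _fill_cache(self, num=None):
        # Pull one more chunk from the source, like QuerySet._fill_cache.
        try:
            for _ in range(num or ITER_CHUNK_SIZE):
                self._cache.append(next(self._iter))
        except StopIteration:
            self._iter = None
```

Iterating a single element fills the cache to 100; iterating past 100 pulls another chunk, which is exactly the behavior we observed in the shell above.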

A lot goes on in QuerySet.iterator, but suffice it to say it is the mechanism whereby database rows are converted into model instances:

class QuerySet(object):
    ...
    def iterator(self):
        ...
        compiler = self.query.get_compiler(using=self.db)
        for row in compiler.results_iter():
            ...
            yield obj

The QuerySet class has a Query object which in turn has a SQLCompiler object. SQLCompiler.results_iter returns a generator that spits out database rows it gets from execute_sql.

class SQLCompiler(object):
    ...
    def results_iter(self):
        ...
        for rows in self.execute_sql(MULTI):
            for row in rows:
                ....
                yield row

SQLCompiler.execute_sql returns an iterator that makes successive calls to cursor.fetchmany() to pull data in chunks of 100 (by default, anyway):

    ...
    def execute_sql(self, result_type=MULTI):
        ...
        result = iter((lambda: cursor.fetchmany(GET_ITERATOR_CHUNK_SIZE)),
                      self.connection.features.empty_fetchmany_value)
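That two-argument form of iter() deserves a note: iter(callable, sentinel) calls the callable repeatedly until it returns the sentinel value. Here is the same pattern in isolation, with a small stub standing in for the cursor:

```python
CHUNK = 3
data = list(range(8))
state = {'pos': 0}

def fetchmany():
    # Stand-in for cursor.fetchmany(): return the next CHUNK rows,
    # or an empty list once the data is exhausted.
    chunk = data[state['pos']:state['pos'] + CHUNK]
    state['pos'] += CHUNK
    return chunk

# iter(callable, sentinel): keep calling fetchmany() until it returns []
chunks = list(iter(fetchmany, []))
# chunks == [[0, 1, 2], [3, 4, 5], [6, 7]]
```

The sentinel in Django’s case is self.connection.features.empty_fetchmany_value, whatever “no more rows” looks like for that backend.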

If we iterate past the objects available in the cache, more rows are yielded from cursor.fetchmany() and converted to model objects in QuerySet.iterator().

>>> connection.queries = []
>>> for i, o in enumerate(qs):
...     if i == 101:
...         break
...
>>> connection.queries
[]
>>> len(qs._result_cache)
200

Note that it doesn’t need to issue any more queries; instead it pulls another 100 rows with fetchmany, converts those rows into model instances, and appends them to _result_cache.

Note: QuerySets have special handling for slicing. If instead of iterating up to 101 we had sliced up to 101, the _result_cache would only have been filled to the point necessary to fulfill the slice request.

Things are looking good at this point. The QuerySet already handles partial evaluation and evaluating more rows as required. We just need to share this partially evaluated QuerySet between requests. This means we either need to put it in the session or put it in the cache. Either way, the QuerySet will have to be pickled. So let’s try that and see what happens.

>>> import cPickle
>>> pickle_str = cPickle.dumps(qs)
>>> qs = cPickle.loads(pickle_str)
>>> len(qs._result_cache)
1000

Unfortunately, as I mentioned earlier, Django completely evaluates the QuerySet when we pickle it:

class QuerySet(object):
    ...
    def __getstate__(self):
        """
        Allows the QuerySet to be pickled.
        """
        # Force the cache to be fully populated.
        len(self)

        obj_dict = self.__dict__.copy()
        obj_dict['_iter'] = None
        return obj_dict

There are two reasons why it’s tricky to pickle a partially evaluated QuerySet. First, you cannot pickle generators, and the QuerySet actually relies on two of them (one in the QuerySet itself and one in the SQLCompiler). Second, even if you could, it still wouldn’t work, because fetching more rows assumes that the database connection is still there, remembers what row we were at, and so on. To avoid these issues, Django fully evaluates the QuerySet and stops iterating upon pickling. What we have to do instead is stop iterating with an only partially populated cache and allow further records to be fetched with a second query rather than with more rows from the original one.
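The first point is easy to demonstrate with plain Python, no Django required:

```python
import pickle

def row_source():
    # Stand-in for the generators inside QuerySet / SQLCompiler.
    yield 1
    yield 2

gen = row_source()
next(gen)  # partially consume it, like a half-iterated QuerySet
try:
    pickle.dumps(gen)
    pickled = True
except TypeError:
    pickled = False
# pickled is False: generators can't be pickled, consumed or not
```

pickle has no way to capture a generator’s suspended execution state, so any object holding one has to drop it (or finish it) before pickling, which is exactly what Django’s __getstate__ does.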

This requires the QuerySet to be in a somewhat unnatural state. Normally QuerySets fall into one of three states:

  • Completely unevaluated (i.e. no cache and not iterating)
  • Partially cached and currently iterating
  • Fully cached and not iterating

What we want is for it to be partially cached and not iterating. There are a bunch of places where this causes problems. An easy way to get around them is to make the QuerySet start iterating as soon as it is unpickled. This confuses Django less and doesn’t make the QuerySet behave any differently.

The behavior we want is that iterating over the QuerySet retrieved from cache should start from the beginning, pulling from the precached values and only going back to the database when it needs to. The default behavior is very close to this: if the QuerySet has enough cached data to fulfill the request, it returns that; if it doesn’t, it fills up its cache a bit more in _fill_cache. The only difference is that a QuerySet retrieved from the cache cannot have any active generators or remember where it was in fetching rows from the generator in the SQLCompiler. Instead we can use set_limits on the QuerySet’s query. set_limits would normally be used for slicing, but it serves this purpose just fine.

The only thing we need to worry about when using set_limits is count(). For convenience I decided to cache the count before pickling and return that instead of trying to get fancy.

After all that preamble, here’s the code:

from django.db.models.query import QuerySet
from django.db.models import Q

class SmartCachingQuerySet(QuerySet):
    """
    This works pretty much exactly like a regular queryset
    except that when you pickle it, it does not fully evaluate
    itself, instead it evaluates DEFAULT_PREFETCH_COUNT
    records.
    """
    DEFAULT_PREFETCH_COUNT = 100
    def __init__(self, *args, **kwargs):
        self.cached_count = None
        super(SmartCachingQuerySet, self).__init__(*args, **kwargs)
        
    def __getstate__(self):
        """
        Override the default behavior of the Django QuerySet
        to only partially evaluate before pickling.
        """
        if not self._result_cache:
            self.prefetch()

        # We need to cache the count. Note, we might be
        # able to be smarter about this, but why bother?
        self.count()

        obj_dict = self.__dict__.copy()
        obj_dict['_iter'] = None
        return obj_dict

    def __setstate__(self, in_dict):
        """
        Restore the iterator and set the limits.
        """
        self.__dict__ = in_dict
        self.__dict__['_iter'] = self.iterator()
        self.__dict__['query'].set_limits(low=len(self.__dict__['_result_cache']))

    def prefetch(self, num_to_prefetch=None):
        """
        Use Django's built in functionality to prefetch
        a number of content models from the db
        """
        self.__iter__()
        self._fill_cache(num_to_prefetch or self.DEFAULT_PREFETCH_COUNT)

    def count(self):
        # "is not None" matters here: a legitimately cached count
        # of zero should not trigger another COUNT(*) query.
        if self.cached_count is not None:
            return self.cached_count

        self.cached_count = self.query.get_count(using=self.db)
        return self.cached_count

    def _filter_or_exclude(self, negate, *args, **kwargs):
        """
        Override the normal behavior of _filter_or_exclude, which
        would see that limits have been set and raise an error.
        Since we're handling this issue in _clone, we don't
        need to do the check.
        """
        clone = self._clone()
        if negate:
            clone.query.add_q(~Q(*args, **kwargs))
        else:
            clone.query.add_q(Q(*args, **kwargs))

        return clone
            
    def order_by(self, *field_names):
        """
        Override order_by for the same reason we overrode _filter_or_exclude.
        """
        qs = self._clone()
        return super(SmartCachingQuerySet, qs).order_by(*field_names)

    def _clone(self, klass=None, setup=False, **kwargs):
        # Store these values and clear them so we can safely clone.
        query_limits = (self.query.low_mark, self.query.high_mark)
        result_cache = self._result_cache
        self.query.clear_limits()
        self._result_cache = None
        cloned_qs = super(SmartCachingQuerySet, self)._clone(klass, setup, **kwargs)

        # Restore everything.
        self.query.set_limits(*query_limits)
        self._result_cache = result_cache
        return cloned_qs

Let’s take a quick look at how this works:

>>> from django.db import connection
>>> from django.core.cache import cache
>>> from django.contrib.auth.models import User
>>> qs = SmartCachingQuerySet(model=User)
>>> cache.set('test', qs)
>>> connection.queries
[{'time': '0.026', 'sql': u'SELECT `auth_user`.`id`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`password`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`is_superuser`, `auth_user`.`last_login`, `auth_user`.`date_joined` FROM `auth_user`'}, {'time': '0.000', 'sql': u'SELECT COUNT(*) FROM `auth_user`'}]
>>> connection.queries = []  # Clear the queries
>>> cached_qs = cache.get('test')
>>> len(cached_qs._result_cache)
100
>>> cached_qs.count()
1000
>>> connection.queries
[]
>>> for i,v in enumerate(cached_qs):
...     if i == 101:
...        break
... 
>>> connection.queries
[{'time': '0.016', 'sql': u'SELECT `auth_user`.`id`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`password`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`is_superuser`, `auth_user`.`last_login`, `auth_user`.`date_joined` FROM `auth_user` LIMIT 18446744073709551615 OFFSET 100'}]
>>> len(cached_qs._result_cache)
200

We can see that the SmartCachingQuerySet provides the same behavior as a normal QuerySet, but what happens behind the scenes is different. After pulling cached_qs from the cache and iterating past the number of items in _result_cache, it hits the database again to get the next 100 objects, using OFFSET 100 to avoid refetching the items it already has.

The last thing we need is for filter, exclude, and order_by to behave appropriately on cached QuerySets. What we want is for each of them to create a new QuerySet with no cache.

>>> filtered_qs = cached_qs.filter(id__gt=10)
>>> filtered_qs._result_cache is None
True
>>> ordered_qs = cached_qs.order_by("-id")
>>> ordered_qs._result_cache is None
True
>>> len(cached_qs._result_cache)
200

This is achieved by overriding _filter_or_exclude and order_by to skip the normal check that would prevent you from cloning a Query with limits set on it. It does not affect the _result_cache on the original QuerySet.

So there you have it: smarter behavior for caching QuerySets. Obviously the same caveats apply for caching in general. If the underlying data changes, a SmartCachingQuerySet pulled from cache will not only be stale but may provide duplicate results or miss results that would have been returned by a new query. Apply invalidation appropriately.

I’d like to give a shout-out to my friend Michaux for his assistance. I haven’t done so yet, but I plan on putting together a package with a sample app, etc., and making it available. I welcome any feedback!

