Archive for the 'Python' Category

Playing with Wolfram Alpha and Python

I love the idea of Wolfram Alpha. I haven’t used it enough to tell how the reality of it compares. Mostly what I’ve seen is easter-eggs, which are fun, but I wanted to see if I could do something more substantive.

An article I’d seen in the news lately piqued my interest in the number of PhDs by US state. It took a goodly amount of fiddling before I figured out I could get the answer by submitting this query:

‘How many phds in California?’

Take a look at what this gives you: http://www.wolframalpha.com/input/?i=how+many+phds+in+california%3F

Lots of great information, including the specific answer I want, 331,101.

If I want to do this for all fifty states, I’m going to need to use the API rather than the web interface. Registering for an API account was quick and easy, and I checked out the bindings available on their site. The bindings left a lot to be desired, so I decided to just use urllib2.

Lately I can’t abide dealing directly with XML, so I found this nice library xmltodict. I’m sure [insert xml parsing technique] would work just as well or better.
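
For reference, the core of the request ends up being just a few lines. Here’s a simplified sketch of the kind of call I ended up making (endpoint and parameter names per Wolfram Alpha’s v2 query API docs; ask_alpha and APP_ID are placeholder names of mine, not part of any official binding):

import urllib
import urllib2
import xmltodict

API_URL = "http://api.wolframalpha.com/v2/query"
APP_ID = "YOUR_APP_ID"  # your Wolfram Alpha API key goes here

def ask_alpha(question):
    """Submit a plain-text question and return the response XML as nested dicts."""
    params = urllib.urlencode({'input': question, 'appid': APP_ID})
    response = urllib2.urlopen("%s?%s" % (API_URL, params))
    return xmltodict.parse(response.read())

# Pods (and their subpods' plaintext values) end up under result['queryresult']['pod'].
result = ask_alpha("how many phds in California?")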

As I was working I realized I also wanted to pull in total population, total adult population, etc. so I could talk about percentages.

I did this all in an IPython Notebook (if you’re not using this, you need to start; it’s totally awesome. Check out http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html).

Here is my notebook (created using Python 2.7): https://gist.github.com/4052828

Here is a rendered version of the notebook: http://nbviewer.ipython.org/4052828/

All in all it was a fun exercise, but it still felt like page scraping. The one advantage is that you can easily ask the same question with slight modifications (e.g., how many people in California vs. how many adults in Idaho) and get back essentially the same response, so that’s handy. And although Wolfram Alpha returns a machine-readable data structure (XML), it’s not exactly richly semantically tagged. There are plain-text bits that have to be parsed. For example, sometimes a population will be expressed as its raw number, sometimes as 1.2 million, so I had to put special handling in my code for that case. It would be nice if such quantities were available with no plain-text parsing required.
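
For the curious, the special handling boiled down to something like this (a simplified sketch, not the exact code from the notebook):

def parse_quantity(text):
    """Turn strings like '331101 people' or '1.2 million people' into plain numbers."""
    multipliers = {'thousand': 1e3, 'million': 1e6, 'billion': 1e9}
    tokens = text.lower().replace(',', '').split()
    value = float(tokens[0])
    if len(tokens) > 1 and tokens[1] in multipliers:
        value *= multipliers[tokens[1]]
    return int(value)

parse_quantity("1.2 million people")  # 1200000
parse_quantity("331101 people")       # 331101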

One other adjustment that would be good would be to thread out the calls to the API. It takes a good amount of time to process all of these serially, and there’s no reason they couldn’t be threaded. I’ll definitely take a look at doing this.
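
If I do get around to it, a thread pool is probably the simplest route. Something like this sketch, reusing the hypothetical ask_alpha helper from above:

from multiprocessing.dummy import Pool  # thread pool, not separate processes

def phds_for_state(state):
    return state, ask_alpha("how many phds in %s?" % state)

states = ['Alabama', 'Alaska', 'Arizona']  # ...and the rest
pool = Pool(8)  # eight requests in flight at a time
results = dict(pool.map(phds_for_state, states))
pool.close()
pool.join()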

If you’re just interested in results, here you go. This is the percentage of adults with PhDs and the percentage of adults with at least an associate’s degree, ranked by state from highest to lowest.

EDIT: After writing this, I found http://pypi.python.org/pypi/wolframalpha/1.0. This looks to be a nicer wrapper; I’ll have to take a look and see if it works as advertised. Let me know if you’ve used it successfully.

----------------------------------------------------------------------------------------------------
phds
#1: Washington DC: 3.68%
#2: Maryland: 2.32%
#3: Massachusetts: 2.30%
#4: New Mexico: 1.75%
#5: Vermont: 1.66%
#6: Connecticut: 1.60%
#7: Delaware: 1.56%
#8: Virginia: 1.54%
#9: California: 1.42%
#10: New Jersey: 1.41%
#11: Rhode Island: 1.39%
#12: Colorado: 1.37%
#13: Oregon: 1.35%
#14: Washington State: 1.34%
#15: Pennsylvania: 1.31%
#16: New York: 1.30%
#17: Hawaii: 1.25%
#18: New Hampshire: 1.24%
#19: Arizona: 1.18%
#20: Utah: 1.17%
#21: Minnesota: 1.16%
#22: Montana: 1.14%
#23: North Carolina: 1.13%
#24: Illinois: 1.13%
#25: Nebraska: 1.12%
#26: Maine: 1.12%
#27: Florida: 1.10%
#28: Alaska: 1.10%
#29: Kansas: 1.08%
#30: Tennessee: 1.05%
#31: Missouri: 1.05%
#32: Georgia: 1.04%
#33: Idaho: 1.02%
#34: Wyoming: 1.02%
#35: Iowa: 1.02%
#36: Wisconsin: 0.99%
#37: Michigan: 0.98%
#38: South Dakota: 0.97%
#39: Ohio: 0.97%
#40: Texas: 0.94%
#41: South Carolina: 0.93%
#42: Alabama: 0.92%
#43: Indiana: 0.91%
#44: North Dakota: 0.87%
#45: Mississippi: 0.82%
#46: Oklahoma: 0.81%
#47: Louisiana: 0.81%
#48: Kentucky: 0.80%
#49: Nevada: 0.78%
#50: Arkansas: 0.77%
#51: West Virginia: 0.77%
----------------------------------------------------------------------------------------------------
college graduates
#1: Washington DC: 50.47%
#2: Massachusetts: 48.28%
#3: Connecticut: 45.71%
#4: New Hampshire: 44.77%
#5: Colorado: 44.07%
#6: Vermont: 43.98%
#7: New Jersey: 43.76%
#8: Maryland: 43.53%
#9: Minnesota: 42.94%
#10: Hawaii: 41.90%
#11: Washington State: 41.71%
#12: Virginia: 41.58%
#13: New York: 40.38%
#14: Rhode Island: 39.72%
#15: North Dakota: 39.42%
#16: Maine: 39.06%
#17: Illinois: 39.05%
#18: Oregon: 39.04%
#19: Florida: 38.78%
#20: Nebraska: 38.50%
#21: Montana: 38.27%
#22: Kansas: 38.26%
#23: California: 38.13%
#24: Delaware: 37.30%
#25: South Dakota: 37.15%
#26: Iowa: 36.75%
#27: Wisconsin: 36.73%
#28: Pennsylvania: 36.66%
#29: Utah: 36.56%
#30: Arizona: 36.26%
#31: North Carolina: 35.96%
#32: Michigan: 34.98%
#33: Wyoming: 34.34%
#34: New Mexico: 34.09%
#35: Georgia: 33.89%
#36: South Carolina: 33.78%
#37: Idaho: 33.74%
#38: Ohio: 33.65%
#39: Missouri: 33.54%
#40: Alaska: 33.04%
#41: Texas: 31.96%
#42: Indiana: 31.04%
#43: Oklahoma: 30.72%
#44: Tennessee: 30.29%
#45: Alabama: 30.18%
#46: Nevada: 30.18%
#47: Kentucky: 28.44%
#48: Mississippi: 27.99%
#49: Arkansas: 26.74%
#50: Louisiana: 26.32%
#51: West Virginia: 25.49%

Get your Wu-name in six lines of Python

I recently had a need to generate screen-names for a bunch of test users. I remembered the wonderful Wu-name generator and it was too tempting to pass up.

import urllib
from lxml.html import fromstring
def get_wu_name(first_name, last_name):
    """
    >>> get_wu_name("Jeff", "Elmore")
    'Ultra-Chronic Monstah'
    """
    w = urllib.urlopen("http://www.recordstore.com/wuname/wuname.pl",
                       urllib.urlencode({'fname':first_name, 'sname': last_name}))
    doc = fromstring(w.read())
    return [d for d in doc.iter('span') if d.get('class') == 'newname'][0].text

Okay, technically it’s more than six if you count the docstring and line wrap, but still. Also, I felt like there should be a better way to do the last line, but this works well enough.
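
For what it’s worth, lxml does have a couple of built-in lookups that would tidy up that last line. I haven’t swapped them in, but either of these standard lxml calls should be equivalent:

# Possible alternatives for the last line:
name = doc.find_class('newname')[0].text               # lxml.html class-based lookup
name = doc.xpath("//span[@class='newname']")[0].text   # or an explicit XPath query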

Automatic Downcasting of Inherited Models in Django

I’ve been working with non-abstract model inheritance in Django, and one of the problems I’ve run into is that querying the superclass gives you back superclass instances, when generally what you want is subclass instances. Of course you can query the subclasses directly, but in cases where you want to operate on several different kinds of subclasses at once, querying the superclass is the obvious choice. If you need an introduction to model inheritance (or just a refresher), check out this article by Charles Leifer.

There are a few solutions out there, such as Carl Meyer’s django-model-utils and django-polymorphic-models. These didn’t quite work the way I wanted. I definitely don’t want to downcast individual instances, causing n additional queries for n objects. Also, although django-model-utils offers a cast method that does avoid a query per instance, it didn’t feel right to me. For one thing, it returns a list, not a QuerySet, which means you lose lazy evaluation, which is critical when working with large datasets. There are a few other differences that I will get into later, but for now let’s get back to the task at hand.

For this to work, we need our superclass to be able to find its subclasses automatically. Luckily, since non-abstract inheritance is handled by a OneToOneField, our superclass already knows about its subclasses, since an attribute is added for each subclass by the OneToOne. Let’s see what this looks like.

Say we have these models:

class Base(models.Model):
    name = models.CharField(max_length=255)
   
class SubA(Base):
    sub_field_a = models.CharField(max_length=255)
    
class SubB(Base):
    sub_field_b = models.CharField(max_length=255)

Then in the shell, let’s create some objects and see what’s happening in the database as we access the subclasses.

>>> from django.db import connection
>>> SubA.objects.create(name='Sub A Test', sub_field_a="A")
<SubA: SubA object>
>>> SubB.objects.create(name='Sub B Test', sub_field_b="B")
<SubB: SubB object>
>>> connection.queries = [] # clear the queries.
>>> base_objs = Base.objects.all()
>>> base_objs = list(base_objs) #Let's force it to evaluate.
>>> base_objs[0]
<Base: Base object>
>>> base_objs[0].suba
<SubA: SubA object>
>>> base_objs[1].subb
<SubB: SubB object>
>>> connection.queries
[{'sql': 'SELECT "testing_base"."id", "testing_base"."name" FROM "testing_base"',
  'time': '0.001'},
 {'sql': 'SELECT "testing_base"."id", "testing_base"."name", "testing_suba"."base_ptr_id", "testing_suba"."sub_field_a" FROM "testing_suba" INNER JOIN "testing_base" ON ("testing_suba"."base_ptr_id" = "testing_base"."id") WHERE "testing_suba"."base_ptr_id" = 1 ',
  'time': '0.000'},
 {'sql': 'SELECT "testing_base"."id", "testing_base"."name", "testing_subb"."base_ptr_id", "testing_subb"."sub_field_b" FROM "testing_subb" INNER JOIN "testing_base" ON ("testing_subb"."base_ptr_id" = "testing_base"."id") WHERE "testing_subb"."base_ptr_id" = 2 ',
  'time': '0.000'}]

We can see that although we can access the subclass instance directly from the superclass instance, it has to make another trip to the database to get each subclass. All it’s using to do this is a join, which could’ve easily been done on the first query.

Enter select_related(), which will tell the queryset to go ahead and get the related information in the first query. Let’s do the same thing, but use select_related first.

>>> base_objs = Base.objects.select_related('suba','subb').all()
>>> base_objs = list(base_objs) #Let's force it to evaluate.
>>> base_objs[0].suba
<SubA: SubA object>
>>> base_objs[1].subb
<SubB: SubB object>
>>> connection.queries
[{'sql': 'SELECT "testing_base"."id", "testing_base"."name", "testing_suba"."base_ptr_id", "testing_suba"."sub_field_a", "testing_subb"."base_ptr_id", "testing_subb"."sub_field_b" FROM "testing_base" LEFT OUTER JOIN "testing_suba" ON ("testing_base"."id" = "testing_suba"."base_ptr_id") LEFT OUTER JOIN "testing_subb" ON ("testing_base"."id" = "testing_subb"."base_ptr_id")',
  'time': '0.001'}]

Now we see that we were able to access each different subclass with only one query. So this takes care of the meat of the problem; now we just need to make it more convenient.

To this end, I have three goals:

  • You should be able to specify which subclasses to automatically downcast, but you should not be required to do so.
  • You should get subclass instances automatically.
  • This mixed QuerySet should be filterable, clonable, and in all respects interchangeable with a standard QuerySet.

To achieve the first goal, we need to introspect the model a bit to find the subclasses. We can find all the OneToOne relationships to the model that could be from subclasses: they will be instances of a SingleRelatedObjectDescriptor. We can then filter through those further to ensure that they relate to a subclass.

For the second goal, we need to override the iterator method on QuerySet. Since each superclass instance will now have a prepopulated attribute for its appropriate subclass, we just need to iterate through the subclasses and return instead the one that happens to be non-null (falling back, of course, to just returning the superclass if no subclass instances exist).

Finally, for the third goal, we really just need to make sure we haven’t broken anything. Since we’ve subclassed QuerySet and only made very slight modifications, we should have no problems. The only thing we need to do is override _clone to pass along our extra information about subclasses.

from django.db.models.fields.related import SingleRelatedObjectDescriptor
from django.db.models.query import QuerySet

class InheritanceQuerySet(QuerySet):
    def select_subclasses(self, *subclasses):
        if not subclasses:
            subclasses = [o for o in dir(self.model)
                          if isinstance(getattr(self.model, o), SingleRelatedObjectDescriptor)\
                          and issubclass(getattr(self.model,o).related.model, self.model)]
        new_qs = self.select_related(*subclasses)
        new_qs.subclasses = subclasses
        return new_qs

    def _clone(self, klass=None, setup=False, **kwargs):
        try:
            kwargs.update({'subclasses': self.subclasses})
        except AttributeError:
            pass
        return super(InheritanceQuerySet, self)._clone(klass, setup, **kwargs)
        
    def iterator(self):
        iter = super(InheritanceQuerySet, self).iterator()
        if getattr(self, 'subclasses', False):
            for obj in iter:
                obj = [getattr(obj, s) for s in self.subclasses if getattr(obj, s)] or [obj]
                yield obj[0]
        else:
            for obj in iter:
                yield obj

Let’s take a look at how this works.

>>> qs = InheritanceQuerySet(model=Base)
>>> qs
[<Base: Base object>, <Base: Base object>]
>>> qs.select_subclasses()
[<SubA: SubA object>, <SubB: SubB object>]
>>> qs.select_subclasses('suba')
[<SubA: SubA object>, <Base: Base object>]
>>> qs.select_subclasses('subb').exclude(name__icontains="a")
[<SubB: SubB object>]

By default the InheritanceQuerySet works the same as a regular QuerySet, but if you call select_subclasses (the same way you’d call select_related), you get the subclasses. You can specify a subset of the subclasses you wish to automatically get or if you do not specify, you’ll get all of them. You can filter, exclude, or do any other QuerySet operations before or after calling select_subclasses. This behavior satisfies all of the desired goals.
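
If you’d rather have this available as Base.objects.select_subclasses() instead of constructing the QuerySet by hand, hooking it up through a custom manager is straightforward. A minimal sketch (the manager name and wiring are mine, not part of the code above):

from django.db import models

class InheritanceManager(models.Manager):
    """Expose select_subclasses() directly on the model's manager."""
    def get_query_set(self):  # spelled get_queryset in newer Django versions
        return InheritanceQuerySet(self.model, using=self._db)

    def select_subclasses(self, *subclasses):
        return self.get_query_set().select_subclasses(*subclasses)

# Then on the Base model from earlier:
#     objects = InheritanceManager()
# and Base.objects.select_subclasses() returns SubA/SubB instances directly.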

One drawback of this approach is that it will only handle one level of inheritance. Some of the other tools out there do handle longer inheritance chains, and it would certainly be possible to modify this to do that as well, but honestly, I’ve never had occasion to do more than one level of non-abstract model inheritance and I struggle to imagine a case where that would be the optimal approach.

A word about performance

I was curious what the impact on performance would be when joining to perhaps several different tables with lots of rows. I decided to run some simulations to get a feel for the effect. This seemed like a good time to compare with some of the existing tools as well. I selected django-model-utils since it offers two approaches to downcasting: a cast() call on a model instance and a cast() call on a QuerySet. I assume that django-polymorphic-models will perform similarly to the first approach employed by django-model-utils.

So here’s the plan. We’ll reuse our same models from earlier and see how performance looks when trying to fetch the first 100 objects as we increase the total number of objects in the database. I’ve also included as a baseline a superclass query that doesn’t fetch the subclasses at all.


Obviously, the QuerySet cast() call is the clear loser here. That method must fully evaluate the queryset to yield subclass instances so you lose lazy evaluation. Let’s drop that method from the picture and see what’s happening with the other players:


We can see that the impact of individual queries is immediate, and while it does rise with the number of rows in the database, that rise more or less mirrors the increase for the standard query. This makes sense given that it does the same base query and then does 100 straight PK lookups. The select_subclasses query has much better performance with less data but degrades at a greater rate. I do wonder what could be done database-tuning-wise to improve performance for a query like this, but that’s a blog post for another day.

Something else to consider is that if you’re working with a very large queryset then you’ll probably be using pagination or slicing if you’re trying to get a small subset of results, so let’s see how these methods perform when we use slicing (which will use the LIMIT clause in the SQL if the QuerySet is unevaluated).
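
Roughly speaking, the sliced variants of the benchmark boil down to something like this (not the exact benchmark code, but the shape of it):

# Slicing before evaluation keeps the QuerySet lazy, so the LIMIT ends up in the SQL.
standard = list(Base.objects.all()[:100])
subclassed = list(InheritanceQuerySet(model=Base).select_subclasses()[:100])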

Letting the database handle limiting the number of rows returned has a huge effect on performance. In this case, individual queries no longer make any sense since all the other methods are now dealing with so many fewer rows. An interesting note is the excellent performance of the QuerySet cast() method. Apparently joins are just that expensive.

Overall, the select_subclasses query seems like a reasonable approach. The performance hit over the standard query is significant but not overwhelming, especially if you don’t have lots and lots of data. Also, while performance of the QuerySet cast() method is excellent in the slicing case, the fact that you lose lazy evaluation is troubling. This means that the last thing you do with the queryset must be to call cast(). For something like pagination you would have to call the cast() method AFTER paginating, probably inside your template. This feels wrong to me; I’d rather not have to think about queries involving subclasses any differently than other queries. Another concern I had with django-model-utils is that it automatically adds a field to any model you want to use it with. This means you can’t just bolt it on without db modifications, and it means that it will do two additional queries each time you create an object.

The performance for individual cast() calls might seem appealing with huge amounts of data, but proper use of slicing and pagination eliminates its advantage.

Obviously, it’s all about your specific problem. If you need killer performance, have lots of data, and don’t mind having to work around making the cast() call as your last step, then the django-model-utils QuerySet.cast() method is an excellent choice. If you rarely need access to subclasses, but you want it to be convenient when you do, then the individual cast() call or django-polymorphic-models is right for you. I believe the select_subclasses approach fills a niche as well: a manageable hit to performance, no database modification required, and a convenient and familiar interface that doesn’t affect behavior unless you need it to.

At the end of the day, the fact is that if you need subclass instances, you’re going to have to request more data from the database which causes a performance hit. To me this is comparable to select_related: by default additional data requires another trip to the database, but if you know you’re going to need it you can get it ahead of time more efficiently. That’s why I modeled select_subclasses after select_related.

Smarter caching of Django QuerySets

Let me start by saying I love the Django ORM.  It does an excellent job of “just working” and hiding all the dirty SQL business required to get what you need out of your database.  There are times however when it doesn’t behave exactly like you’d want.  One of these that I ran into recently was the behavior of caching (or more specifically pickling) QuerySets.  By default Django evaluates QuerySets fully whenever they are pickled.  I often find myself displaying the first few dozen rows from QuerySets that contain tens of thousands of rows.  Each request only fetches the rows it needs but we have to redo the query on each page load. Caching the QuerySet and sharing it between requests would solve this problem but if we only need the first few rows, caching all 10,000 doesn’t make sense.

What we need is a QuerySet that can be partially evaluated, cached, and shared between requests. There are a few technical challenges to overcome, so let’s get started by understanding how QuerySets work.

The User model is a convenient one to work with, so let’s create some users. We need enough that they won’t all be pulled down in one chunk.

>>> from django.contrib.auth.models import User
>>> for i in range(1000):
>>>     User.objects.create(username="test%d"% i)
<User: test0>
<User: test1>
<User: test2>
...
<User: test999>
>>>

Now that we have some users to work with, let’s see what’s happening with the QuerySet and its result_cache. We can use connection.queries to see what’s happening with the db (note: DEBUG must be set to True).

Let’s see what happens when we try to get a specific record:

>>> from django.db import connection
>>> connection.queries = [] #Clear user creation queries from before
>>> qs = User.objects.all()
>>> qs[50]
<User: test50>
>>> connection.queries
[{'time': '0.001', 'sql': u'SELECT `auth_user`.`id`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`password`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`is_superuser`, `auth_user`.`last_login`, `auth_user`.`date_joined` FROM `auth_user` LIMIT 1 OFFSET 50'}]
>>> qs._result_cache is None
True

We see that Django executed just enough of the query to get back the single result and didn’t populate its result_cache at all. Now let’s see what happens when we start iterating on a QuerySet.

>>> connection.queries = [] #Clear the queries again.
>>> for i in qs:
>>>    break
>>> connection.queries
[{'time': '0.038', 'sql': u'SELECT `auth_user`.`id`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`password`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`is_superuser`, `auth_user`.`last_login`, `auth_user`.`date_joined` FROM `auth_user`'}]
>>> len(qs._result_cache)
100

This time there’s nothing in the SQL that limits how many results we get, but we’ve only fetched 100 records and we know there are 1000. Let’s see what’s actually going on:

class QuerySet(object):
    ...
    def __iter__(self):
        if self._result_cache is None:
            self._iter = self.iterator()
            self._result_cache = []
        if self._iter:
            return self._result_iter()
        # Python's list iterator is better than our version when we're just
        # iterating over the cache.
        return iter(self._result_cache)

    def _result_iter(self):
        pos = 0
        while 1:
            upper = len(self._result_cache)
            while pos < upper:
                yield self._result_cache[pos]
                pos = pos + 1
            if not self._iter:
                raise StopIteration
            if len(self._result_cache) <= pos:
                self._fill_cache()

Since the cache has not yet been populated, __iter__ will set self._iter to self.iterator() (this is the method that gets data from the database and converts each row into a model object; more on this below). _result_iter calls self._fill_cache() to populate result_cache, then loops back around and yields objects from the cache it just filled. Let’s take a look at what _fill_cache does:

    ...
    def _fill_cache(self, num=None):
        """
        Fills the result cache with 'num' more entries (or until the results
        iterator is exhausted).
        """
        if self._iter:
            try:
                for i in range(num or ITER_CHUNK_SIZE):
                    self._result_cache.append(self._iter.next())
            except StopIteration:
                self._iter = None

QuerySet._fill_cache fills up self._result_cache from self._iter (which as we saw earlier was set to self.iterator()). We’re finally getting close to actually hitting the database but we’re not quite there yet.

A lot goes on in QuerySet.iterator, but suffice it to say it is the mechanism whereby database rows are converted into model instances:

class QuerySet(object):
    ...
    def iterator(self):
        ...
        compiler = self.query.get_compiler(using=self.db)
        for row in compiler.results_iter():
            ...
            yield obj

The QuerySet class has a Query object which in turn has a SQLCompiler object. SQLCompiler.results_iter returns a generator that spits out database rows it gets from execute_sql.

class SQLCompiler(object):
    ...
    def results_iter(self):
        ...
        for rows in self.execute_sql(MULTI):
            for row in rows:
                ....
                yield row

SQLCompiler.execute_sql returns an iterator that does successive calls to cursor.fetchmany() to pull data in chunks of 100 (by default, anyway):

    ....
    def execute_sql(self, result_type=MULTI):    
        ...
        result = iter((lambda:
                  cursor.fetchmany(GET_ITERATOR_CHUNK_SIZE)), 
                  self.connection.features.empty_fetchmany_value)

If we iterate past the objects available in the cache, more rows are yielded from cursor.fetchmany() and converted to models objects in QuerySet.iterator().

>>> connection.queries = []
>>> for i, o in enumerate(qs):
>>>   if i == 101:
>>>      break
>>> connection.queries
[]
>>> len(qs._result_cache)
200

Note that it doesn’t need to do any more queries, instead it yields another 100 results with fetchmany, converts those rows into model instances, and populates _result_cache.

Note: QuerySets have special handling for slicing. If instead of iterating up to 101 we had sliced up to 101, the _result_cache would only have been filled to the point necessary to fulfill the slice request.

Things are looking good at this point. The QuerySet already handles partial evaluation and evaluating more rows as required. We just need to share this QuerySet that’s already partially evaluated between requests. This means we either need to put it in the session or put it in the cache. Either way, the QuerySet will have to be pickled. So let’s try that and see what happens.

>>> import cPickle
>>> pickle_str = cPickle.dumps(qs)
>>> qs = cPickle.loads(pickle_str)
>>> len(qs._result_cache)
1000

Unfortunately, as I mentioned earlier, Django has decided to completely evaluate the QuerySet when we pickled it:

class QuerySet(object):
    ...
    def __getstate__(self):
        """
        Allows the QuerySet to be pickled.
        """
        # Force the cache to be fully populated.
        len(self)

        obj_dict = self.__dict__.copy()
        obj_dict['_iter'] = None
        return obj_dict

There are two reasons why it’s tricky to pickle a partially evaluated QuerySet. First, you cannot pickle generators, and the QuerySet actually relies on two generators (one in the QuerySet itself and one in the SQLCompiler). Second, even if you could, it still wouldn’t work, because fetching more rows assumes that the database connection is still there and remembers what row we were at, etc. To avoid these issues, Django fully evaluates the QuerySet and stops iterating upon pickling. What we have to do is stop iterating with an only partially populated cache and allow further records to be fetched with a second query instead of fetching more rows from the same query.

This requires the QuerySet to be in a somewhat unnatural state. Normally QuerySets fall into one of three states:

  • Completely unevaluated (i.e. no cache and not iterating)
  • Partially cached and currently iterating
  • Fully cached and not iterating

What we want is for it to be partially cached and not iterating. There are a bunch of places where this causes problems. An easy way to get around them is to make the QuerySet start iterating as soon as it is unpickled. This confuses Django less and doesn’t make the QuerySet behave any differently. The behavior we want is that iterating on a QuerySet retrieved from cache should start from the beginning, pulling from the precached values and only going back to the database when it needs to. The default behavior is very close to this: if the QuerySet has enough cached data to fulfill the request, it returns that; if it doesn’t have enough, it fills up its cache a bit more in _fill_cache. The only difference is that a QuerySet retrieved from the cache cannot have any active generators or remember where it was in fetching rows from the generator in the SQLCompiler. Instead we can use set_limits on the QuerySet’s query; set_limits would normally be used for slicing, but it serves this purpose just fine. The only thing we need to worry about when using set_limits is count(). For convenience I decided to cache the count before pickling and return that instead of trying to get fancy.

After all that preamble, here’s the money-shot:

from django.db.models.query import QuerySet
from django.db.models import Q

class SmartCachingQuerySet(QuerySet):
    """
    This works pretty much exactly like a regular queryset
    except that when you pickle it, it does not fully evaluate
    itself, instead it evaluates DEFAULT_PREFETCH_COUNT
    records.
    """
    DEFAULT_PREFETCH_COUNT = 100
    def __init__(self, *args, **kwargs):
        self.cached_count = None
        super(SmartCachingQuerySet, self).__init__(*args, **kwargs)
        
    def __getstate__(self):
        """
        Override the default behavior of the Django QuerySet
        to only partially evaluate before pickling.
        """
        if not self._result_cache:
            self.prefetch()

        #We need to cache the count. Note, we might be
        #able to be smarter about this but why bother?
        self.count()
        
        obj_dict = self.__dict__.copy()
        obj_dict['_iter'] =  None
        return obj_dict

    def __setstate__(self, in_dict):
        """
        Restore the iterator and set the limits.
        """
        self.__dict__ = in_dict
        self.__dict__['_iter'] = self.iterator()
        self.__dict__['query'].set_limits(low=len(self.__dict__['_result_cache']))        

    def prefetch(self, num_to_prefetch=None):
        """
        Use Django's built in functionality to prefetch
        a number of content models from the db
        """
        self.__iter__()
        self._fill_cache(num_to_prefetch or self.DEFAULT_PREFETCH_COUNT)

    def count(self):
        # Check against None so that a cached count of zero is still honored.
        if self.cached_count is not None:
            return self.cached_count

        self.cached_count = self.query.get_count(using=self.db)
        return self.cached_count

    def _filter_or_exclude(self, negate, *args, **kwargs):
        """
        Override the normal behavior of _filter_or_exclude, which
        would check whether limits have been set and raise an error.
        Since we're handling this issue in _clone, we don't
        need to do the check.
        """
        clone = self._clone()
        if negate:
            clone.query.add_q(~Q(*args, **kwargs))
        else:
            clone.query.add_q(Q(*args, **kwargs))

        return clone

    def order_by(self, *field_names):
        """
        Override order_by for the same reason we overrode _filter_or_exclude.
        """
        qs = self._clone()
        return super(SmartCachingQuerySet, qs).order_by(*field_names)

    def _clone(self, klass=None, setup=False, **kwargs):
        #Store these values and clear them so we can safely clone.
        query_limits = (self.query.low_mark, self.query.high_mark)
        result_cache = self._result_cache
        self.query.clear_limits()
        self._result_cache = None
        cloned_qs = super(SmartCachingQuerySet, self)._clone(klass, setup, **kwargs)

        #Restore stuff.
        self.query.set_limits(*query_limits)
        self._result_cache = result_cache
        return cloned_qs

Let’s take a quick look at how this works:

>>> from django.db import connection
>>> from django.core.cache import cache
>>> from django.contrib.auth.models import User
>>> qs = SmartCachingQuerySet(model=User)
>>> cache.set('test',qs)
>>> connection.queries
[{'time': '0.026', 'sql': u'SELECT `auth_user`.`id`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`password`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`is_superuser`, `auth_user`.`last_login`, `auth_user`.`date_joined` FROM `auth_user`'}, {'time': '0.000', 'sql': u'SELECT COUNT(*) FROM `auth_user`'}]
>>> connection.queries = [] #Clear the queries
>>> cached_qs = cache.get('test')
>>> len(cached_qs._result_cache)
100
>>> cached_qs.count()
1000
>>> connection.queries
[]
>>> for i,v in enumerate(cached_qs):
...     if i == 101:
...        break
... 
>>> connection.queries
[{'time': '0.016', 'sql': u'SELECT `auth_user`.`id`, `auth_user`.`username`, `auth_user`.`first_name`, `auth_user`.`last_name`, `auth_user`.`email`, `auth_user`.`password`, `auth_user`.`is_staff`, `auth_user`.`is_active`, `auth_user`.`is_superuser`, `auth_user`.`last_login`, `auth_user`.`date_joined` FROM `auth_user` LIMIT 18446744073709551615 OFFSET 100'}]
>>> len(cached_qs._result_cache)
200

We can see that the SmartCachingQuerySet provides the same behavior as a normal QuerySet, but what happens behind the scenes is different. After pulling cached_qs from the cache and iterating past the number of items in _result_cache, it will hit the database again to get the next 100 objects using OFFSET 100 to avoid fetching the same items it already has.

The last thing we need to make sure we can do is filter, exclude, and order cached QuerySets and make sure they behave appropriately. What we want is for them to create a new QuerySet with no cache.

>>> filtered_qs = cached_qs.filter(id__gt=10)
>>> filtered_qs._result_cache is None
True
>>> ordered_qs = cached_qs.order_by("-id")
>>> ordered_qs._result_cache is None
True
>>> len(cached_qs._result_cache)
200

This effect is achieved by overriding _filter_or_exclude and order_by to avoid the normal check that would prevent you from cloning a Query that had limits set on it. This does not affect the _result_cache on the original QuerySet.

So there you have it: smarter behavior for caching QuerySets. Obviously the same caveats apply for caching in general. If the underlying data changes, a SmartCachingQuerySet pulled from cache will not only be stale but may provide duplicate results or miss results that would have been returned by a new query. Apply invalidation appropriately.
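
For context, here’s roughly how I’d expect to use this in a view (the cache key, timeout, and template name are placeholders of mine):

from django.contrib.auth.models import User
from django.core.cache import cache
from django.shortcuts import render_to_response

def user_list(request):
    qs = cache.get('user_list_qs')
    if qs is None:
        qs = SmartCachingQuerySet(model=User)
        cache.set('user_list_qs', qs, 300)  # pickling here triggers the partial prefetch
    # Only the first page of results is needed; the cached copy already has them.
    return render_to_response('user_list.html', {'users': qs[:25]})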

I’d like to give a shout-out to my friend Michaux for his assistance. I haven’t done so yet, but I plan on putting together a package with a sample app, etc., and making it available. I welcome any feedback!

Productivity Monitor with Subversion and Arduino

I’m a big fan of metrics. We use Scrumworks at the office and we put our burndowns on a whiteboard so we can see how each sprint is going day-to-day. Having this feedback always available is both informative and motivating. But we only update the burndown once a day; what about a more instantaneous feedback mechanism? I figured there are two sources for a measure of our productivity: Scrumworks and Subversion. I decided that Subversion would provide a more immediate indicator of our progress than Scrumworks. So I had a data source; now I just needed a display mechanism.

I’ve been playing a lot lately with the Arduino and specifically with RGB LEDs. It’s super easy to communicate with the Arduino over its USB cable, so some of my first experiments with the RGB LEDs involved controlling them with Python. Now, the last time we were at Ikea I bought this weird little rubber lamp creature thing named Spöka. It’s just two green LEDs inside with a rechargeable battery and some limited circuitry. There’s actually room enough for an Arduino Pro Mini, and the power cord appears to provide 4.8V, so it would be perfect. Of course, I don’t have a Pro Mini, so for now I’m prototyping it with my Duemilanove.

Here’s the Spöka cover and the box I’ll be putting the Arduino/wiring in.  I cut a hole in the top to fit the little creature on to.

I’m really terrible at soldering, but I did manage to get my RGB LED soldered onto a little board with some wires connected out. Note: I actually ended up redoing this with a brighter LED I got from superbrightleds.com; I also soldered in a resistor.

The wiring is very simple, just the USB cable in and the few wires from the LED connected to the PWM pins on the Arduino.

The Arduino reads from the serial connection one byte each for the red, green, and blue channels. There is a bit of fanciness with the red LED since it operates at a different voltage. I’ve had some difficulty getting good color mixing with the red. I imagine this could be resolved by wiring a different resistor for the red LED, but I haven’t tried that yet, plus I’d need three resistors instead of one. I’ve also added some “pulsating” action that makes the LEDs fade in and out a bit. It’s especially cool with the color mixing.
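
On the Python side, driving the colors is just a matter of writing those three bytes with pyserial. A sketch (the port name and baud rate are whatever your Arduino sketch happens to use):

import serial

arduino = serial.Serial('/dev/ttyUSB0', 9600)

def set_color(red, green, blue):
    """Send one byte per channel; the Arduino maps them onto its PWM pins."""
    arduino.write(chr(red) + chr(green) + chr(blue))

set_color(0, 255, 0)   # happy monster: green
set_color(255, 0, 0)   # angry monster: red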

Here are some photos of the device in action.

Now we just need to connect this up to Subversion. I used the pysvn and pyserial libraries. Basically, the program monitors a local copy of a repo; when it starts, it gets all the commits that have occurred within the last three hours, then every 30 seconds it updates the repo and incorporates any new commits into the calculations. To calculate the “productivity score,” it sums up an exponential term decaying with time for each commit, scaled by the “size” of the commit as determined by the length of its diff with the previous revision. It then uses this productivity score to set the colors on the Arduino: green is good, blue is okay, red is bad.
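
The scoring itself is simple. This isn’t the exact code, but the calculation looks roughly like this (the half-life and thresholds here are placeholders; the real values took a fair bit of tuning, as described below):

import math
import time

HALF_LIFE_SECONDS = 30 * 60  # a commit's contribution halves every ~30 minutes

def productivity_score(commits, now=None):
    """commits is a list of (timestamp, diff_length) pairs for recent revisions."""
    now = now or time.time()
    score = 0.0
    for committed_at, diff_length in commits:
        age = now - committed_at
        score += diff_length * math.exp(-age * math.log(2) / HALF_LIFE_SECONDS)
    return score

def score_to_color(score):
    """Green is good, blue is okay, red is bad."""
    if score > 1000:
        return (0, 255, 0)
    elif score > 300:
        return (0, 0, 255)
    return (255, 0, 0)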

I’ve tried it out at the office and so far the response from the team has been really positive.  I’ve been playing a lot with the coefficients to get a nice effect from the system.  There are a number of factors to balance.  First, we want it to decay quickly enough that it provides good feedback but not so quickly that it’s too difficult to “keep it green.”  Also we want small commits to contribute a reasonable amount to the score but we want to ensure that large commits (for example adding a bunch of test data or something) don’t kick the score up so high that it stays green for hours.  There are still adjustments to be made but for now it works reasonably well; small to medium sized commits will keep the monster (this is the name that’s stuck) happy for about 20 to 30 minutes and pretty much after an hour and a half or two hours, it’s angry no matter what we’ve done.  With 3 or 4 people all working on a project we really shouldn’t go 2 hours without a commit and it’s not unreasonable to expect at least a small commit every 30 minutes.

There have been a few unanticipated benefits so far. First, the system rewards more frequent commits even if they’re small. This is a better practice in general, so that’s been nice; it has also encouraged more small refactoring work. It’s more rewarding to make something like a small naming-convention correction if you get some nice visual feedback. Also, since it turns green (or at least blue) within 30 seconds of a commit, it’s a nice way to notify other team members that a commit has been made. This leads to more frequent updating and fewer conflicts. Finally, it can serve as an indicator that people aren’t sure what to do. If the monster stays red for a while, that’s a good time to get up, walk around, and see what’s going on.

Since the color changes no matter who makes a commit, it is a team metric instead of a personal metric.  I think this is important for keeping it light-hearted and not something used to pressure people to feel like they have to be making commits every 30 minutes.  So far we haven’t had any problem with this but in the wrong hands or implemented poorly it could end up being something that would reduce morale instead of improving it.

Code is available on GitHub.  It’s not as clean as I’d like it to be, but it works well enough.