Automatic Downcasting of Inherited Models in Django

I’ve been working with non-abstract model inheritance in Django, and one problem I’ve run into is that querying the superclass returns superclass instances when what you generally want is subclass instances. Of course you can query the subclasses directly, but when you want to operate on several different kinds of subclasses at once, querying the superclass is the obvious choice. If you need an introduction to model inheritance (or just a refresher), check out this article by Charles Leifer.

There are a few solutions out there, such as Carl Meyer’s django-model-utils and django-polymorphic-models, but these didn’t quite work as I wanted. I definitely don’t want to downcast individual instances, which causes n additional queries for n objects. And although django-model-utils offers a cast method that avoids a query per instance, it didn’t feel right to me. For one thing, it returns a list rather than a QuerySet, which means you lose lazy evaluation, and that is critical when working with large datasets. There are a few other differences that I will get into later, but for now let’s get back to the task at hand.

For this to work, we need our superclass to be able to find its subclasses automatically. Luckily, since non-abstract inheritance is handled by a OneToOneField, our superclass already knows about its subclasses: the OneToOneField adds an attribute to the superclass for each subclass. Let’s see what this looks like.

Say we have these models:

class Base(models.Model):
    name = models.CharField(max_length=255)
   
class SubA(Base):
    sub_field_a = models.CharField(max_length=255)
    
class SubB(Base):
    sub_field_b = models.CharField(max_length=255)

Then in the shell, let’s create some objects and see what’s happening in the database as we access the subclasses.

>>> from django.db import connection
>>> SubA.objects.create(name='Sub A Test', sub_field_a="A")
<SubA: SubA object>
>>> SubB.objects.create(name='Sub B Test', sub_field_b="B")
<SubB: SubB object>
>>> connection.queries = [] # clear the queries.
>>> base_objs = Base.objects.all()
>>> base_objs = list(base_objs) #Let's force it to evaluate.
>>> base_objs[0]
<Base: Base object>
>>> base_objs[0].suba
<SubA: SubA object>
>>> base_objs[1].subb
<SubB: SubB object>
>>> connection.queries
[{'sql': 'SELECT "testing_base"."id", "testing_base"."name" FROM "testing_base"',
  'time': '0.001'},
 {'sql': 'SELECT "testing_base"."id", "testing_base"."name", "testing_suba"."base_ptr_id", "testing_suba"."sub_field_a" FROM "testing_suba" INNER JOIN "testing_base" ON ("testing_suba"."base_ptr_id" = "testing_base"."id") WHERE "testing_suba"."base_ptr_id" = 1 ',
  'time': '0.000'},
 {'sql': 'SELECT "testing_base"."id", "testing_base"."name", "testing_subb"."base_ptr_id", "testing_subb"."sub_field_b" FROM "testing_subb" INNER JOIN "testing_base" ON ("testing_subb"."base_ptr_id" = "testing_base"."id") WHERE "testing_subb"."base_ptr_id" = 2 ',
  'time': '0.000'}]

We can see that although we can access the subclass instance directly from the superclass instance it has to make another trip to the database to get each subclass. All it’s using to do this is a join, which could’ve easily been done on the first query.

Enter select_related(), which will tell the queryset to go ahead and get the related information in the first query. Let’s do the same thing, but use select_related first.

>>> base_objs = Base.objects.select_related('suba','subb').all()
>>> base_objs = list(base_objs) #Let's force it to evaluate.
>>> base_objs[0].suba
<SubA: SubA object>
>>> base_objs[1].subb
<SubB: SubB object>
>>> connection.queries
[{'sql': 'SELECT "testing_base"."id", "testing_base"."name", "testing_suba"."base_ptr_id", "testing_suba"."sub_field_a", "testing_subb"."base_ptr_id", "testing_subb"."sub_field_b" FROM "testing_base" LEFT OUTER JOIN "testing_suba" ON ("testing_base"."id" = "testing_suba"."base_ptr_id") LEFT OUTER JOIN "testing_subb" ON ("testing_base"."id" = "testing_subb"."base_ptr_id")',
  'time': '0.001'}]

Now we see that we were able to access each different subclass with only one query. This takes care of the meat of the problem; now we just need to make it more convenient.
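To see the join outside of Django, here is a minimal sketch using Python’s standard-library sqlite3 module. The table and column names are copied from the logged SQL above; the in-memory database and the classify helper are assumptions made for illustration and are not part of Django.

```python
import sqlite3

# In-memory stand-in for the tables Django would create for Base, SubA and SubB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE testing_base (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE testing_suba (base_ptr_id INTEGER PRIMARY KEY, sub_field_a TEXT);
    CREATE TABLE testing_subb (base_ptr_id INTEGER PRIMARY KEY, sub_field_b TEXT);
    INSERT INTO testing_base VALUES (1, 'Sub A Test');
    INSERT INTO testing_base VALUES (2, 'Sub B Test');
    INSERT INTO testing_suba VALUES (1, 'A');
    INSERT INTO testing_subb VALUES (2, 'B');
""")

# The same shape of query that select_related('suba', 'subb') produced above.
rows = conn.execute("""
    SELECT testing_base.id, testing_base.name,
           testing_suba.base_ptr_id, testing_suba.sub_field_a,
           testing_subb.base_ptr_id, testing_subb.sub_field_b
    FROM testing_base
    LEFT OUTER JOIN testing_suba ON testing_base.id = testing_suba.base_ptr_id
    LEFT OUTER JOIN testing_subb ON testing_base.id = testing_subb.base_ptr_id
    ORDER BY testing_base.id
""").fetchall()


def classify(row):
    """Hypothetical helper: name the subclass whose join columns are non-NULL."""
    if row[2] is not None:
        return "SubA"
    if row[4] is not None:
        return "SubB"
    return "Base"


kinds = [classify(row) for row in rows]
```

Each row carries NULLs for the subclass tables it does not belong to, which is what lets a single query tell the subclasses apart.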

To this end, I have three goals:

  • You should be able to specify which subclasses to automatically downcast but you should not be required to do so.
  • You should get subclass instances automatically.
  • This mixed QuerySet should be filterable, clonable, and in all respects interchangeable for a standard QuerySet.

To achieve the first goal, we need to introspect the model a bit to find the subclasses. We can find all the OneToOne relationships to the model that could be from subclasses: they will be instances of a SingleRelatedObjectDescriptor. We can then filter through those further to ensure that they relate to a subclass.

For the second goal, we need to override the iterator method on QuerySet. Since each superclass instance will now have a prepopulated attribute for its appropriate subclass, we just need to iterate through the subclasses and return the one that happens to be non-null (falling back, of course, to the superclass instance if no subclass instance exists).

Finally, for the third goal, we really just need to make sure we haven’t broken anything. Since we’ve subclassed QuerySet and only made very slight modifications, we should have no problems. The only thing we need to do is override _clone to pass along our extra information about subclasses.

from django.db.models.fields.related import SingleRelatedObjectDescriptor
from django.db.models.query import QuerySet

class InheritanceQuerySet(QuerySet):
    def select_subclasses(self, *subclasses):
        if not subclasses:
            # No names given: find every reverse one-to-one descriptor on the
            # model whose related model is a subclass of this model.
            subclasses = [o for o in dir(self.model)
                          if isinstance(getattr(self.model, o), SingleRelatedObjectDescriptor)
                          and issubclass(getattr(self.model, o).related.model, self.model)]
        # Join the subclass tables in the same query.
        new_qs = self.select_related(*subclasses)
        new_qs.subclasses = subclasses
        return new_qs

    def _clone(self, klass=None, setup=False, **kwargs):
        # Pass the subclass list along so clones keep downcasting.
        try:
            kwargs.update({'subclasses': self.subclasses})
        except AttributeError:
            pass
        return super(InheritanceQuerySet, self)._clone(klass, setup, **kwargs)

    def iterator(self):
        base_iter = super(InheritanceQuerySet, self).iterator()
        if getattr(self, 'subclasses', False):
            for obj in base_iter:
                # Yield the first prepopulated subclass instance, or the
                # superclass instance itself if there is none.
                obj = [getattr(obj, s) for s in self.subclasses if getattr(obj, s)] or [obj]
                yield obj[0]
        else:
            for obj in base_iter:
                yield obj

Let’s take a look at how this works.

>>> qs = InheritanceQuerySet(model=Base)
>>> qs
[<Base: Base object>, <Base: Base object>]
>>> qs.select_subclasses()
[<SubA: SubA object>, <SubB: SubB object>]
>>> qs.select_subclasses('suba')
[<SubA: SubA object>, <Base: Base object>]
>>> qs.select_subclasses('subb').exclude(name__icontains="a")
[<SubB: SubB object>]

By default the InheritanceQuerySet works the same as a regular QuerySet, but if you call select_subclasses (the same way you’d call select_related), you get the subclasses. You can specify a subset of the subclasses you wish to automatically get or if you do not specify, you’ll get all of them. You can filter, exclude, or do any other QuerySet operations before or after calling select_subclasses. This behavior satisfies all of the desired goals.
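The downcasting step can also be exercised on its own, without a database. The following is a minimal sketch, assuming plain SimpleNamespace objects as stand-ins for superclass instances whose suba and subb attributes were already populated by the join; the downcast function and the kind attribute are hypothetical names used only for illustration.

```python
import types
from types import SimpleNamespace


def downcast(objs, subclasses):
    """Lazily yield each object's first non-None subclass attribute,
    falling back to the object itself when none is set."""
    for obj in objs:
        for attr in subclasses:
            sub = getattr(obj, attr, None)
            if sub is not None:
                yield sub
                break
        else:
            # No subclass attribute was populated: keep the superclass instance.
            yield obj


# Stand-ins for rows fetched with the subclass tables joined in.
base_objs = [
    SimpleNamespace(kind="Base", suba=SimpleNamespace(kind="SubA"), subb=None),
    SimpleNamespace(kind="Base", suba=None, subb=SimpleNamespace(kind="SubB")),
    SimpleNamespace(kind="Base", suba=None, subb=None),
]

gen = downcast(base_objs, ["suba", "subb"])
all_kinds = [o.kind for o in gen]
only_a = [o.kind for o in downcast(base_objs, ["suba"])]
```

Because downcast is a generator, nothing is consumed until the caller iterates, which mirrors how overriding iterator() preserves lazy evaluation.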

One drawback of this approach is that it will only handle one level of inheritance. Some of the other tools out there do handle longer inheritance chains, and it would certainly be possible to modify this to do that as well, but honestly, I’ve never had occasion to do more than one level of non-abstract model inheritance and I struggle to imagine a case where that would be the optimal approach.
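For what it’s worth, here is one way the downcasting step might be extended to longer chains. This is a rough sketch under stated assumptions, not something the code above does: it assumes nested relations are named with Django-style double-underscore paths such as 'suba__subsuba', and that SubSubA (a subclass of SubA) is a hypothetical model. The helpers resolve_path and downcast_deep are likewise made up for illustration and operate on plain Python objects.

```python
from types import SimpleNamespace


def resolve_path(obj, path):
    """Follow a 'a__b' style path; return the object at the end of the
    path, or None as soon as any hop is missing."""
    current = obj
    for attr in path.split("__"):
        current = getattr(current, attr, None)
        if current is None:
            return None
    return current


def downcast_deep(obj, paths):
    """Return the most specific subclass instance reachable from obj."""
    # Try the deepest paths first so the most specific subclass wins.
    for path in sorted(paths, key=lambda p: p.count("__"), reverse=True):
        found = resolve_path(obj, path)
        if found is not None:
            return found
    return obj


paths = ["suba", "suba__subsuba", "subb"]

# A Base row whose SubA row also has a (hypothetical) SubSubA row.
deep = SimpleNamespace(
    kind="Base",
    suba=SimpleNamespace(kind="SubA", subsuba=SimpleNamespace(kind="SubSubA")),
    subb=None,
)
# A Base row that stops at SubA.
shallow = SimpleNamespace(
    kind="Base",
    suba=SimpleNamespace(kind="SubA", subsuba=None),
    subb=None,
)
plain = SimpleNamespace(kind="Base", suba=None, subb=None)

results = [downcast_deep(o, paths).kind for o in (deep, shallow, plain)]
```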

A word about performance

I was curious what the impact on performance would be when joining to perhaps several different tables with lots of rows. I decided to run some simulations to get a feel for the effect. This seemed like a good time to compare with some of the existing tools as well. I selected django-model-utils since it offers two approaches to downcasting: a cast() call on a model instance and a cast() call on a QuerySet. I assume that django-polymorphic-models will perform similarly to the first approach employed by django-model-utils.

So here’s the plan. We’ll reuse our same models from earlier and see how performance looks when trying to fetch the first 100 objects as we increase the total number of objects in the database. I’ve also included as a baseline a superclass query that doesn’t fetch the subclasses at all.


Obviously, the QuerySet cast() call is the clear loser here. That method must fully evaluate the queryset to yield subclass instances so you lose lazy evaluation. Let’s drop that method from the picture and see what’s happening with the other players:


We can see that the impact of individual queries is immediate and while it does rise with the number of rows in the database that rise more or less mirrors the increase for the standard query. This makes sense given that it does the same base query and then does 100 straight PK lookups. The select_subclasses query has much better performance with less data but degrades at a greater rate. I do wonder what could be done database-tuning-wise to improve performance for a query like this, but that’s a blogpost for another day.

Something else to consider is that if you’re working with a very large queryset then you’ll probably be using pagination or slicing if you’re trying to get a small subset of results, so let’s see how these methods perform when we use slicing (which will use the LIMIT clause in the SQL if the QuerySet is unevaluated).
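The difference between slicing in Python and slicing in the database can be sketched with the standard-library sqlite3 module. The table name follows the earlier examples; the 1,000 generated rows and the variable names are assumptions chosen purely for illustration, and no timings are implied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testing_base (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO testing_base (name) VALUES (?)",
    [("row %d" % i,) for i in range(1000)],
)

# Eager: every row is fetched into Python, then sliced.
all_rows = conn.execute(
    "SELECT id, name FROM testing_base ORDER BY id"
).fetchall()
eager_page = all_rows[:100]

# Database-side: LIMIT means only 100 rows ever leave the database.
limited_page = conn.execute(
    "SELECT id, name FROM testing_base ORDER BY id LIMIT 100"
).fetchall()
```

Both approaches end up with the same 100 rows, but the first one materializes the whole table to get them.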

Letting the database handle limiting the number of rows returned has a huge effect on performance. In this case, individual queries no longer make any sense since all the other methods are now dealing with so many fewer rows. An interesting note is the excellent performance of the QuerySet cast() method. Apparently joins are just that expensive.

Overall, the select_subclasses query seems like a reasonable approach. The performance hit over the standard query is significant but not overwhelming, especially if you don’t have lots and lots of data. Also, while performance of the QuerySet cast() method is excellent in the slicing case, the fact that you lose lazy evaluation is troubling. This means that the last thing you do with the queryset must be to call cast(). For something like pagination you would have to call the cast() method AFTER paginating, probably inside your template. This feels wrong to me; I’d rather not have to think about queries involving subclasses any differently than other queries. Another concern I had with django-model-utils is that it automatically adds a field to any model you want to use it with. This means you can’t just bolt it on without db modifications, and it means that it will do two additional queries each time you create an object.

The performance of individual cast() calls might seem appealing with huge amounts of data, but proper use of slicing and pagination eliminates its advantage.

Obviously, it’s all about your specific problem. If you need killer performance, have lots of data, and don’t mind having to work around making the cast() call as your last step, then the django-model-utils QuerySet.cast() method is an excellent choice. If you rarely need access to subclasses, but you want it to be convenient when you do, then the individual cast() call or django-polymorphic-models is right for you. I believe the select_subclasses approach fills a niche as well: a manageable hit to performance, no database modification required, and a convenient and familiar interface that doesn’t affect behavior unless you need it to.

At the end of the day, the fact is that if you need subclass instances, you’re going to have to request more data from the database which causes a performance hit. To me this is comparable to select_related: by default additional data requires another trip to the database, but if you know you’re going to need it you can get it ahead of time more efficiently. That’s why I modeled select_subclasses after select_related.

