DjangoCon Europe 2019: How (and why!) to build a Django based project with SQLAlchemy Core for data analysis

Writeup of the DjangoCon Europe 2019 talk »How (and why!) to build a Django based project with SQLAlchemy Core for data analysis« by Gleb Pushkov

Gleb Pushkov: A passionate developer from Ukraine, Kyiv. The first time he got acquainted with Python was 7 years ago.

Scenario

You're building some kind of data analysis app, and

  • your app mostly works with aggregations
  • you have a lot of data
  • you need precise and performant queries
  • you're building advanced queries dynamically
  • you're transforming complex queries from SQL to python
  • your database is not natively supported by Django (e.g. SQL Azure, Sybase, Firebird)

Django is good and with Subqueries, WIndow functions, FilteredRelations, Conditional Expressions, functions for Date, Math, and ext, and Custom DB constraints, you have less reasons to use SQL than a few years ago!

Problems

The ORM has a very specific way of generating queries. It tends to run LEFT OUTER JOIN before the LIMIT clause, for example, if you're slicing an annotation. Instead you'd want nested selects, then the limit, and in the very end the outer join. This is doable in SQLAlchemy, but not in Django ORM, because we can't add Subqueries to FROM. We can do a similar query if we hurt ourselves a bit.

Django ORM and SQLAlchemy have idfferent layers and different goals. Django builds on a non-public API on top of raw SQL, and the ORM on top, while SQLAlchemy provides a lower-level API, that responds to actual SQL. SQLAlchemy offers way more freedom than the ORM.

In SQLAlchemy, it is also normal and doable to build your queries in several layers, referencing and wrapping each other. This is something Django has only just started on with Subquery/Outerref, but it doesn't permit nearly the depth and freedom SQLAlchemy does.

Filtering by an aggregation. The naive Django implementation does multiple queries, so we're looking at Subqueries, but you can't do min aggregations there. You can add annotations, but the docs tell you to be careful/not do it. Or, of course, write raw SQL. SQLAlchemy allows you to include queries in other queries directly, solving this problem.

Joins of non-related models are impossible, as well as RIGHT OUTER JOINs, and Django takes the decision of which JOIN to run (INNER or LEFT OUTER) away from you. And it is very hard to modify the JOIN expression to use OR except by using FilteredRelations.

Recursive CTEs are not supported by Django yet either. Those can be done by raw SQL or by using django-cte-forest (which has limitations since it uses extra, which will be deprecated).

Aggregations over multiple tables are hard to get correct in Django, too – you can use distinct=True on COUNT, but not on other aggregations.

To sum up: Django queries

  • are hard to read in advanced queries,
  • hard to understand what's going on on the SQL level
  • not performant on complex use cases.

Usually this is not a problem in 95% of cases, but in your analysis app as above relies on behaviour Django cannot provide.

How to move to SQLAlchemy

Integrate

Inject SQLAlchemy, and disable pooling by setting the pool to the NullPool. Every uWSGI worker will have n threads, and each thread would have an SQLAlchemy pool, but now has a single Django connection, so it's recommended to add pgbouncer in front of your DB.

Add tales

If you have models for tables, use aldjemy or django-sabridge. If you don't have models yet, use table reflections by running Table('name', meta, autoload=True), or define them explicitly (bu building SQLAlchemy models including relations) or via inline expressions in your queries (you might not want this due to the repetition required).

Tests

engine is a global variable that will be evaluated before django.test.override_settings can chime in, so you'll have to override it depending on your environment. Yuck.

There's an issue with pytest dealing with ResultProxy, which lets pytest hang forever if an exception occurs with an active ResultProxy open, so you'll have to add a decorator that adds connection handling on raised exceptions.

Since you have two connections, Django and SQLAlchemy, but since fixtures will be done via the Django connections, and never committed, your SQLAlchemy connection will not see that data – so using TestCase with fixtures is tricky. If you then use the SQLAlchemy connection to create test data, you'll have to clean it up yourself. Or you hack around READ_COMMITTED, but be sure you know what you're doing.

You can also use TransactionTestCase, which doesn't have the commit/transaction issue, but it will be slower, and models of related tables will be flushed automatically.

Drawbacks

There is little documentation on this integration, so it's hard to start. Tests are hard to set up, and connections are tricky, and you'll be unable to use libraries which work with querysets, such as django-filters, or pagination.

Benefits

You'll have full control over SQL, and you'll be faster to express SQL in your Python code. It's easier to build application-specific in an SQL-generation layer since you can use the built in abstractions. It's readable, maintainable, and performant.