Writeup of the DjangoCon Europe 2019 talk »How (and why!) to build a Django based project with SQLAlchemy Core for data analysis« by Gleb Pushkov
Gleb Pushkov: a passionate developer from Kyiv, Ukraine. He first got acquainted with Python 7 years ago.
You're building some kind of data analysis app, and
- your app mostly works with aggregations
- you have a lot of data
- you need precise and performant queries
- you're building advanced queries dynamically
- you're transforming complex queries from SQL to Python
- your database is not natively supported by Django (e.g. SQL Azure, Sybase, Firebird)
Django is good, and with Subqueries, window functions, FilteredRelation, conditional expressions, database functions for dates, math, text and more, and custom DB constraints, you have fewer reasons to drop to raw SQL than a few years ago!
The ORM has a very specific way of generating queries. For example, if you're slicing an annotation, it tends to run the LEFT OUTER JOIN before the LIMIT. Instead you'd want nested selects, then the limit, and only at the very end the outer join. This is doable in SQLAlchemy, but not in the Django ORM, because we can't add subqueries to FROM. We can get a similar query only if we hurt ourselves a bit.
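A sketch of that shape in SQLAlchemy Core (1.4+ style; the books table is invented for illustration, not from the talk):

```python
import sqlalchemy as sa

# Hypothetical schema for illustration.
metadata = sa.MetaData()
books = sa.Table(
    "books", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("title", sa.String),
    sa.Column("rating", sa.Integer),
)

# The inner select applies ORDER BY and LIMIT first; wrapping it in
# .subquery() puts it into the outer query's FROM, so any further joins
# happen after the limit -- the ordering the Django ORM won't produce.
top = (
    sa.select(books.c.id, books.c.rating)
    .order_by(books.c.rating.desc())
    .limit(2)
    .subquery("top")
)
stmt = sa.select(top.c.id).select_from(top)

engine = sa.create_engine("sqlite://")
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(books.insert(), [
        {"id": 1, "title": "a", "rating": 3},
        {"id": 2, "title": "b", "rating": 9},
        {"id": 3, "title": "c", "rating": 7},
    ])
    ids = [row[0] for row in conn.execute(stmt)]
print(ids)  # the two highest-rated ids
```

The limited subquery sits in FROM, so an outer join added to `stmt` would only ever see the two surviving rows.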
Django ORM and SQLAlchemy have idfferent layers and different goals. Django builds on a non-public API on top of raw SQL, and the ORM on top, while SQLAlchemy provides a lower-level API, that responds to actual SQL. SQLAlchemy offers way more freedom than the ORM.
In SQLAlchemy, it is also normal and doable to build your queries in several layers, referencing and wrapping each other. This is something Django has only just started on with Subquery/Outerref, but it doesn't permit nearly the depth and freedom SQLAlchemy does.
Filtering by an aggregation: the naive Django implementation runs multiple queries, so we're looking at Subquery – but you can't do min aggregations there. You can add annotations, but the docs tell you to be careful / not to do it. Or, of course, write raw SQL. SQLAlchemy lets you include queries in other queries directly, solving this problem.
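For instance – a sketch with an invented orders table, not the talk's own example – the grouped MIN can be wrapped as a subquery and joined straight back onto the rows:

```python
import sqlalchemy as sa

# Hypothetical schema for illustration.
metadata = sa.MetaData()
orders = sa.Table(
    "orders", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("customer_id", sa.Integer),
    sa.Column("amount", sa.Integer),
)

# Aggregate per customer in a subquery, then join the raw rows against
# it -- the aggregation is used directly inside the outer query.
mins = (
    sa.select(
        orders.c.customer_id,
        sa.func.min(orders.c.amount).label("min_amount"),
    )
    .group_by(orders.c.customer_id)
    .subquery("mins")
)
stmt = sa.select(orders.c.id).select_from(
    orders.join(
        mins,
        sa.and_(
            orders.c.customer_id == mins.c.customer_id,
            orders.c.amount == mins.c.min_amount,
        ),
    )
)

engine = sa.create_engine("sqlite://")
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(orders.insert(), [
        {"id": 1, "customer_id": 1, "amount": 5},
        {"id": 2, "customer_id": 1, "amount": 9},
        {"id": 3, "customer_id": 2, "amount": 7},
    ])
    ids = [row[0] for row in conn.execute(stmt)]
print(ids)  # the cheapest order per customer
```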
Joins of non-related models are impossible, as are RIGHT OUTER JOINs, and Django takes the decision of which JOIN to run (INNER or LEFT OUTER) away from you. It is also very hard to modify the JOIN condition to use OR. Recursive CTEs are not supported by Django yet either; they can be done with raw SQL or with third-party helpers (which have limitations, since they build on extra, which will be deprecated).
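In SQLAlchemy, by contrast, a recursive CTE is a first-class construct. A small sketch over an invented self-referencing category table:

```python
import sqlalchemy as sa

# Hypothetical self-referencing table for illustration.
metadata = sa.MetaData()
category = sa.Table(
    "category", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("parent_id", sa.Integer),
)

engine = sa.create_engine("sqlite://")
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(category.insert(), [
        {"id": 1, "parent_id": None},
        {"id": 2, "parent_id": 1},
        {"id": 3, "parent_id": 2},
        {"id": 4, "parent_id": None},
    ])

    # WITH RECURSIVE: seed with the root, then repeatedly join children
    # onto the rows accumulated so far.
    tree = (
        sa.select(category.c.id)
        .where(category.c.id == 1)
        .cte("tree", recursive=True)
    )
    tree = tree.union_all(
        sa.select(category.c.id).where(category.c.parent_id == tree.c.id)
    )
    ids = [row[0] for row in conn.execute(sa.select(tree.c.id))]
print(ids)  # the whole subtree under category 1
```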
Aggregations over multiple tables are hard to get correct in Django, too – the joins multiply rows, and while you can pass distinct to Count, you can't on other aggregations.
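The row multiplication behind this is plain to see in raw SQL; a small stdlib-sqlite sketch with an invented schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY);
CREATE TABLE book  (id INTEGER PRIMARY KEY, author_id INTEGER);
CREATE TABLE award (id INTEGER PRIMARY KEY, author_id INTEGER);
INSERT INTO author VALUES (1);
INSERT INTO book  VALUES (1, 1), (2, 1);
INSERT INTO award VALUES (1, 1), (2, 1), (3, 1);
""")

# Joining two one-to-many tables multiplies rows (2 books x 3 awards
# = 6 rows), so a plain COUNT over the join overcounts.
naive = conn.execute("""
    SELECT COUNT(book.id) FROM author
    JOIN book  ON book.author_id  = author.id
    JOIN award ON award.author_id = author.id
""").fetchone()[0]

# COUNT(DISTINCT ...) repairs it -- this is what Count(..., distinct=True)
# emits; the other aggregates have no such escape hatch.
fixed = conn.execute("""
    SELECT COUNT(DISTINCT book.id) FROM author
    JOIN book  ON book.author_id  = author.id
    JOIN award ON award.author_id = author.id
""").fetchone()[0]
print(naive, fixed)  # 6 2
```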
To sum up, Django queries
- are hard to read in advanced cases,
- make it hard to understand what's going on at the SQL level,
- are not performant for complex use cases.
Usually this is not a problem – 95% of cases are fine – but an analysis app like the one above relies on behaviour Django cannot provide.
How to move to SQLAlchemy
Inject SQLAlchemy, and disable its pooling by setting the pool class to NullPool. Every uWSGI worker has n threads, and each thread would otherwise get its own SQLAlchemy pool on top of the single Django connection, so it's recommended to put pgbouncer in front of your DB.
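Concretely, that could look like the following sketch (the sqlite URL is a stand-in so it runs anywhere; swap in your real postgresql:// DSN):

```python
import sqlalchemy as sa
from sqlalchemy.pool import NullPool

# NullPool disables SQLAlchemy's own pooling: every checkout opens a
# fresh DB connection and closes it on release, leaving connection
# pooling entirely to pgbouncer in front of the database.
engine = sa.create_engine("sqlite://", poolclass=NullPool)
print(type(engine.pool).__name__)  # NullPool
```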
If you already have Django models for your tables, you can derive the SQLAlchemy table definitions from them. If you don't have models yet, use table reflection by running Table('name', meta, autoload=True), define the tables explicitly (by building SQLAlchemy models including relations), or use inline expressions in your queries (you might not want this due to the repetition required).
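A minimal reflection sketch (modern SQLAlchemy spells autoload=True as autoload_with=; the book table here is invented):

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text(
        "CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT)"
    ))

# Load the columns from the live database instead of declaring them by
# hand -- one line per table, whatever its shape.
meta = sa.MetaData()
book = sa.Table("book", meta, autoload_with=engine)
print(list(book.c.keys()))  # ['id', 'title']
```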
The engine is a global variable that will be evaluated before django.test.override_settings can chime in, so you'll have to override it depending on your environment. Yuck.
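One possible workaround – a sketch, not from the talk – is to create the engine lazily, so the settings are read at first use rather than at import time:

```python
from functools import lru_cache

import sqlalchemy as sa
from sqlalchemy.pool import NullPool

# DATABASE_URL stands in for a Django setting; in a real project you
# would read django.conf.settings inside get_engine() instead.
DATABASE_URL = "sqlite://"


@lru_cache(maxsize=None)
def get_engine():
    # Created on the first call rather than at import time, so any
    # overrides applied before the first query are actually seen.
    return sa.create_engine(DATABASE_URL, poolclass=NullPool)
```

Tests that override the URL can call get_engine.cache_clear() to force a rebuild.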
There's an issue with pytest and ResultProxy: pytest hangs forever if an exception occurs while a ResultProxy is still open, so you'll have to add a decorator that takes care of the connection when an exception is raised.
You have two connections, Django's and SQLAlchemy's, but fixtures are loaded via the Django connection and never committed, so your SQLAlchemy connection will not see that data – using TestCase with fixtures is tricky. If you instead use the SQLAlchemy connection to create test data, you'll have to clean it up yourself. Or you hack around READ COMMITTED, but be sure you know what you're doing.
You can also use TransactionTestCase, which doesn't have the commit/transaction issue, but it will be slower, and models of related tables will be flushed automatically.
There is little documentation on this integration, so it's hard to start. Tests are hard to set up, connections are tricky, and you'll be unable to use libraries which operate on querysets. On the other hand, you'll have full control over SQL, and you'll be faster at expressing SQL in your Python code. It's easier to build an application-specific SQL-generation layer, since you can use the built-in abstractions. It's readable, maintainable, and performant.