DjangoCon Europe 2019: How (and why!) to build a Django based project with SQLAlchemy Core for data analysis
Writeup of the DjangoCon Europe 2019 talk »How (and why!) to build a Django based project with SQLAlchemy Core for data analysis« by Gleb Pushkov
Gleb Pushkov: A passionate developer from Ukraine, Kyiv. The first time he got acquainted with Python was 7 years ago.
Scenario
You're building some kind of data analysis app, and
- your app mostly works with aggregations
- you have a lot of data
- you need precise and performant queries
- you're building advanced queries dynamically
- you're transforming complex queries from SQL to python
- your database is not natively supported by Django (e.g. SQL Azure, Sybase, Firebird)
Django is good and with Subqueries, WIndow functions, FilteredRelations, Conditional Expressions, functions for Date, Math, and ext, and Custom DB constraints, you have less reasons to use SQL than a few years ago!
Problems
The ORM has a very specific way of generating queries. It tends to run LEFT OUTER JOIN
before the LIMIT
clause,
for example, if you're slicing an annotation. Instead you'd want nested selects, then the limit, and in the very end the
outer join. This is doable in SQLAlchemy, but not in Django ORM, because we can't add Subqueries to FROM. We can
do a similar query if we hurt ourselves a bit.
Django ORM and SQLAlchemy have idfferent layers and different goals. Django builds on a non-public API on top of raw SQL, and the ORM on top, while SQLAlchemy provides a lower-level API, that responds to actual SQL. SQLAlchemy offers way more freedom than the ORM.
In SQLAlchemy, it is also normal and doable to build your queries in several layers, referencing and wrapping each other. This is something Django has only just started on with Subquery/Outerref, but it doesn't permit nearly the depth and freedom SQLAlchemy does.
Filtering by an aggregation. The naive Django implementation does multiple queries, so we're looking at Subqueries, but you can't do min aggregations there. You can add annotations, but the docs tell you to be careful/not do it. Or, of course, write raw SQL. SQLAlchemy allows you to include queries in other queries directly, solving this problem.
Joins of non-related models are impossible, as well as RIGHT OUTER JOIN
s, and Django takes the decision of which
JOIN
to run (INNER or LEFT OUTER) away from you. And it is very hard to modify the JOIN expression to use OR except by
using FilteredRelations.
Recursive CTEs are not supported by Django yet either. Those can be done by raw SQL or by using django-cte-forest
(which has limitations since it uses extra
, which will be deprecated).
Aggregations over multiple tables are hard to get correct in Django, too – you can use distinct=True
on COUNT
,
but not on other aggregations.
To sum up: Django queries
- are hard to read in advanced queries,
- hard to understand what's going on on the SQL level
- not performant on complex use cases.
Usually this is not a problem in 95% of cases, but in your analysis app as above relies on behaviour Django cannot provide.
How to move to SQLAlchemy
Integrate
Inject SQLAlchemy, and disable pooling by setting the pool to the NullPool
. Every uWSGI worker will have n threads,
and each thread would have an SQLAlchemy pool, but now has a single Django connection, so it's recommended to add
pgbouncer
in front of your DB.
Add tales
If you have models for tables, use aldjemy
or django-sabridge
.
If you don't have models yet, use table reflections by running Table('name', meta, autoload=True
), or define them
explicitly (bu building SQLAlchemy models including relations) or via inline expressions in your queries (you might not
want this due to the repetition required).
Tests
engine
is a global variable that will be evaluated before django.test.override_settings
can chime in, so you'll
have to override it depending on your environment. Yuck.
There's an issue with pytest dealing with ResultProxy, which lets pytest hang forever if an exception occurs with an active ResultProxy open, so you'll have to add a decorator that adds connection handling on raised exceptions.
Since you have two connections, Django and SQLAlchemy, but since fixtures will be done via the Django connections, and
never committed, your SQLAlchemy connection will not see that data – so using TestCase
with fixtures is tricky. If
you then use the SQLAlchemy connection to create test data, you'll have to clean it up yourself. Or you hack around
READ_COMMITTED, but be sure you know what you're doing.
You can also use TransactionTestCase
, which doesn't have the commit/transaction issue, but it will be slower, and
models of related tables will be flushed automatically.
Drawbacks
There is little documentation on this integration, so it's hard to start. Tests are hard to set up, and connections are
tricky, and you'll be unable to use libraries which work with querysets, such as django-filters
, or pagination
.
Benefits
You'll have full control over SQL, and you'll be faster to express SQL in your Python code. It's easier to build application-specific in an SQL-generation layer since you can use the built in abstractions. It's readable, maintainable, and performant.