December 7: Python Data Persistence Modules
Strap in, this is a long one: As part of the Python Standard Library traversal, after yesterday's short post about binary data services, today we're going to look at Python data persistence modules. There are a lot of them:
- You can use `pickle` to serialize and deserialize (nearly) arbitrary Python objects.
- You can use `shelve` to store pickled objects in a persistent key-value store.
- `marshal` exists, but it's internal, so you shouldn't use it.
- The Unix `dbm` library is a thing that exists and has Python standard library support.
- `sqlite3` gives you a full SQL database.
Use the `pickle` module to serialize and deserialize Python objects. Only deserialize data that you trust, as it's emphatically not safe. Consider signing your data or choosing safer serialization formats.
Data stream format
The data format is Python-specific and uses a compact binary representation. You can compress it further, if need be. Use `pickletools` to analyze the generated data. There are six protocol versions (0 through 5) to choose from when pickling.
Just like with `json`, you use `dumps` to store data and `loads` to retrieve data. These functions use `Pickler` and `Unpickler` classes internally, which you can also use and further modify.
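A minimal sketch of that round trip (the dict is just example data):

```python
import pickle

# Serialize an object to pickle's binary format, then restore it.
data = {"name": "advent", "day": 7, "tags": ["python", "stdlib"]}
blob = pickle.dumps(data)      # bytes, in a Python-specific format
restored = pickle.loads(blob)  # only ever do this with trusted data!
assert restored == data
```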
What to pickle
You can use `pickle` on basically all kinds of built-in types, plus on functions and classes defined at the top level of a module.
Additionally, instances of classes whose `__dict__` representation can be pickled can be pickled themselves.
Implementing your own pickling process is not trivial, so read up on it carefully. The primary interface is the `__reduce__()` method. You can also use `__getstate__()` and `__setstate__()` to modify and restore state, e.g. to work around parts of the object state that cannot be pickled, like file objects.
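As a sketch of the `__getstate__()`/`__setstate__()` pair, here is a hypothetical class that drops its unpicklable file handle on pickling and reopens it on unpickling:

```python
import os
import pickle
import tempfile

class Logger:
    """Hypothetical example: holds an open file, which pickle cannot handle."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "a")

    def __getstate__(self):
        # Remove the unpicklable file object from the pickled state.
        state = self.__dict__.copy()
        del state["file"]
        return state

    def __setstate__(self, state):
        # Restore the attributes, then reopen the file by path.
        self.__dict__.update(state)
        self.file = open(self.path, "a")

path = os.path.join(tempfile.mkdtemp(), "app.log")
clone = pickle.loads(pickle.dumps(Logger(path)))
assert clone.path == path and not clone.file.closed
```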
`pickle` supports references to objects that have not been pickled. They are referred to by a persistent ID. The docs have a code example of how to use this ID.
You can modify the unpickling process to restrict itself to a named set of globals, instead of unpickling just about everything. This might make things slightly safer.
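This is done by overriding `Unpickler.find_class()`; a sketch with a made-up allow-list:

```python
import builtins
import io
import pickle

# Hypothetical allow-list: the only globals the stream may reference.
SAFE_NAMES = {("builtins", "range")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global in the stream; deny everything not listed.
        if (module, name) in SAFE_NAMES:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")

def restricted_loads(data):
    """Like pickle.loads(), but restricted to SAFE_NAMES."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

assert list(restricted_loads(pickle.dumps(range(3)))) == [0, 1, 2]
```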
Use `copyreg.pickle` to register a function as the converter for a specific type. This function will then be used by the `pickle` module.
The `shelve` module provides a persistent key-value store for arbitrary Python objects, using pickling.
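Usage is dict-like; a small sketch with a throwaway file:

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "store")

# Anything picklable can be stored under a string key.
with shelve.open(path) as db:
    db["settings"] = {"theme": "dark", "pages": [1, 2, 3]}

# The data survives reopening the shelf.
with shelve.open(path) as db:
    assert db["settings"]["theme"] == "dark"
```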
`marshal` is an insecure, internal Python object serialization protocol. It is not meant to be used as a general persistence module, and serves mainly to read and write the pseudo-compiled `.pyc` files. Use `pickle` if you're not a Python core developer.
The `dbm` module provides an interface to the Unix `dbm` database, which is apparently a thing that exists. It provides a key-value data store. There are different versions (a GNU one, of course, and also a Berkeley one and a "new" one) that are naturally incompatible. Now you know it exists and has a Python implementation. If you want to store arbitrary Python objects (aka have pickle support) in a key-value store, use the `shelve` module instead.
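If you do want to use `dbm` directly: keys and values are bytes (strings get encoded for you). A minimal sketch:

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cache")

# "c" creates the database file if it doesn't exist yet.
with dbm.open(path, "c") as db:
    db["greeting"] = "hello"  # str keys/values are encoded to bytes

# Values always come back as bytes.
with dbm.open(path) as db:
    assert db[b"greeting"] == b"hello"
```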
The `sqlite3` module provides an interface to the SQLite C library, and next to `pathlib` it's my favourite part of the standard library. You can use it with files and with in-memory storage. You will often want something like the excellent `sqlite-utils` in addition.
Connections and cursors
The basic workflow is to create a connection and a cursor, to execute statements with the cursor, and to commit them on the connection. `connect()` can change some parameters (like the isolation level), and takes either a path-like object or `":memory:"` as the connection string.
On a connection object, you can get a cursor, `commit()` statements, and `close()` the connection. Use `interrupt()` to stop running queries. With `create_function()` and `create_collation()`, you can provide user-created functionality and use it in later queries.
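For example, a toy collation that orders strings by their reversed form (the name and logic are made up for illustration):

```python
import sqlite3

def reverse_collation(a, b):
    # Compare the reversed strings; return -1, 0, or 1 like a comparator.
    a, b = a[::-1], b[::-1]
    return (a > b) - (a < b)

connection = sqlite3.connect(":memory:")
connection.create_collation("reversed", reverse_collation)
connection.execute("CREATE TABLE words (word TEXT)")
connection.executemany(
    "INSERT INTO words VALUES (?)", [("apple",), ("pear",), ("plum",)]
)
rows = connection.execute(
    "SELECT word FROM words ORDER BY word COLLATE reversed"
).fetchall()
# Sorted by reversed form: "elppa" < "mulp" < "raep"
assert [row[0] for row in rows] == ["apple", "plum", "pear"]
```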
`load_extension()` loads handy extensions (after you allow this with `enable_load_extension()`). You can change the `text_factory` attribute of a connection to change how database responses get translated to Python. `total_changes` tells you how many rows have been modified during this connection. Use `iterdump()` to systematically retrieve all data in the database.
You can also set some callback functions. `set_authorizer()` sets a callback to be called every time a column is about to be accessed, and it can allow or deny progress. `set_progress_handler()` will trigger its callback every `n` instructions of the SQLite virtual machine (this is useful to execute stuff during long-running queries, even if it's just a progress bar). `set_trace_callback()` allows you to receive a call every time a statement is executed.
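A sketch of the trace callback, which here simply logs every executed statement:

```python
import sqlite3

statements = []
connection = sqlite3.connect(":memory:")
# The callback receives each executed SQL statement as a string.
connection.set_trace_callback(statements.append)
connection.execute("CREATE TABLE numbers (n INTEGER)")
connection.execute("INSERT INTO numbers VALUES (1)")
assert any("INSERT INTO numbers" in s for s in statements)
```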
On cursors, use `execute()` and `executemany()` to execute queries. Please remember to use prepared statements and pass parameters instead of using string formatting. You can also use `executescript()` to execute multiple statements in one go. Use `fetchone()` and `fetchall()` to retrieve result rows. `lastrowid` is really nice to keep in mind.
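Parameter passing and `lastrowid` in one sketch:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# "?" placeholders keep user input out of the SQL string.
cursor.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
assert cursor.lastrowid == 1  # id of the row we just inserted

# executemany() runs the same statement once per parameter tuple.
cursor.executemany("INSERT INTO users (name) VALUES (?)", [("bob",), ("carol",)])
names = [row[0] for row in cursor.execute("SELECT name FROM users ORDER BY id")]
assert names == ["alice", "bob", "carol"]
```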
`complete_statement()` will tell you if a string contains one or multiple complete SQL statements. There's also a set of exception classes that you will want to take a look at.
The `execute()`, `executemany()`, and `executescript()` methods are also available on connection objects, so you can maybe avoid dealing with cursor objects completely.
When you set `connection.row_factory = sqlite3.Row`, you can access result rows by name, and not only by index. Use connection objects as context managers to commit automatically on success and to trigger rollbacks on exceptions.
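Both tips together in one sketch:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row

# As a context manager, the connection commits on success
# and rolls back if an exception escapes the block.
with connection:
    connection.execute("CREATE TABLE posts (title TEXT, day INTEGER)")
    connection.execute("INSERT INTO posts VALUES (?, ?)", ("Persistence", 7))

row = connection.execute("SELECT * FROM posts").fetchone()
assert row["title"] == "Persistence"  # access by name ...
assert row[1] == 7                    # ... or by index
```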
Some Python types are automatically translated, such as `bytes`, which becomes a BLOB. All classes that have a `__conform__(self, protocol)` method will use this method to interact with the database. Alternatively, you can also register adapters for specific types with `sqlite3.register_adapter()`. If you also register a converter, you can completely roundtrip your custom values. The default adapters and converters take care of dates and datetimes.
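A roundtrip sketch with a made-up `Point` class (the official docs use a similar example):

```python
import sqlite3

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def adapt_point(point):
    # Python -> database: store the point as b"x;y".
    return f"{point.x};{point.y}".encode()

def convert_point(blob):
    # Database -> Python: converters always receive bytes.
    x, y = blob.split(b";")
    return Point(float(x), float(y))

sqlite3.register_adapter(Point, adapt_point)
sqlite3.register_converter("point", convert_point)

# PARSE_DECLTYPES makes sqlite3 look at the declared column type ("point").
connection = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
connection.execute("CREATE TABLE coords (position point)")
connection.execute("INSERT INTO coords VALUES (?)", (Point(1.0, 2.5),))
point = connection.execute("SELECT position FROM coords").fetchone()[0]
assert (point.x, point.y) == (1.0, 2.5)
```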