December 7: Python Data Persistence Modules
Strap in, this is a long one: As part of the Python Standard Library traversal, after yesterday's short post about binary data services, today we're going to look at Python data persistence modules. There are a lot of them:
- You can use `pickle` to serialize and deserialize (nearly) arbitrary Python objects.
- You can use `shelve` to store pickled objects in a persistent key-value store.
- `marshal` exists, but it's internal, so you shouldn't use it.
- The Unix `dbm` library is a thing that exists and has Python standard library support.
- `sqlite3` gives you a full SQL database.
Use the `pickle` module to serialize and deserialize Python objects. Only deserialize data that you trust, as it's emphatically not safe. Consider signing your data or choosing safer serialization formats.
Data stream format
The data format is Python-specific and uses a compact binary representation. You can compress it further, if need be. Use `pickletools` to analyze the generated data. There are six protocol versions (0 through 5) to choose from when pickling.
Just like with `json`, you use `dumps` to store data and `loads` to retrieve data. These functions use `Pickler` and `Unpickler` classes internally, which you can also use and further modify.
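A minimal sketch of that round trip (the dict is just example data):

```python
import pickle

# Serialize an object to pickle's binary format, then restore it.
data = {"name": "advent", "day": 7, "tags": ["python", "stdlib"]}
blob = pickle.dumps(data)      # bytes, in a Python-specific format
restored = pickle.loads(blob)  # only ever do this with trusted data!
assert restored == data
```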
What to pickle
You can use `pickle` on basically all kinds of built-in types, plus on functions and classes defined at the top level of a module.
Additionally, instances of classes whose `__dict__` representation can be pickled can be pickled themselves.
Implementing your own pickling process is not trivial, so read up on it carefully. The primary interface is the `__reduce__()` method. You can also use `__getstate__()` and `__setstate__()` to modify and restore state, e.g. to work around parts of the object state that cannot be pickled, like file objects.
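As a sketch of the `__getstate__()`/`__setstate__()` pair, here is a hypothetical class that drops its unpicklable file handle on pickling and reopens it on unpickling:

```python
import os
import pickle
import tempfile

class Logger:
    """Hypothetical example: holds an open file, which pickle cannot handle."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "a")

    def __getstate__(self):
        # Remove the unpicklable file object from the pickled state.
        state = self.__dict__.copy()
        del state["file"]
        return state

    def __setstate__(self, state):
        # Restore the attributes, then reopen the file by path.
        self.__dict__.update(state)
        self.file = open(self.path, "a")

path = os.path.join(tempfile.mkdtemp(), "app.log")
clone = pickle.loads(pickle.dumps(Logger(path)))
assert clone.path == path and not clone.file.closed
```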
`pickle` supports references to objects that have not been pickled. They are referred to by a persistent ID. The docs have a code example of how to use this ID.
You can modify the unpickling process to restrict itself to a named set of globals, instead of unpickling just about everything. This might make things slightly safer.
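This is done by overriding `Unpickler.find_class()`; a sketch with a made-up allow-list:

```python
import builtins
import io
import pickle

# Hypothetical allow-list: the only globals the stream may reference.
SAFE_NAMES = {("builtins", "range")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global in the stream; deny everything not listed.
        if (module, name) in SAFE_NAMES:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")

def restricted_loads(data):
    """Like pickle.loads(), but restricted to SAFE_NAMES."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

assert list(restricted_loads(pickle.dumps(range(3)))) == [0, 1, 2]
```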
Use `copyreg.pickle` to register a function as the converter for a specific type. This function will then be used by the `pickle` module.
The `shelve` module provides a persistent key-value store for arbitrary Python objects, using pickling.
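Usage is dict-like; a small sketch with a throwaway file:

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "store")

# Anything picklable can be stored under a string key.
with shelve.open(path) as db:
    db["settings"] = {"theme": "dark", "pages": [1, 2, 3]}

# The data survives reopening the shelf.
with shelve.open(path) as db:
    assert db["settings"]["theme"] == "dark"
```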
`marshal` is an insecure, internal Python object serialization protocol. It is not meant to be used as a general persistence module, and serves mainly to read and write the pseudo-compiled `.pyc` files. Use `pickle` if you're not a Python core developer.
The `dbm` module provides an interface to the Unix `dbm` database, which is apparently a thing that exists. It provides a key-value data store. There are different versions (a GNU one, of course, and also a Berkeley one and a "new" one) that are naturally incompatible. Now you know it exists and has a Python implementation. If you want to store arbitrary Python objects (aka have pickle support) in a key-value store, use the `shelve` module instead.
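If you do want to use `dbm` directly: keys and values are bytes (strings get encoded for you). A minimal sketch:

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cache")

# "c" creates the database file if it doesn't exist yet.
with dbm.open(path, "c") as db:
    db["greeting"] = "hello"  # str keys/values are encoded to bytes

# Values always come back as bytes.
with dbm.open(path) as db:
    assert db[b"greeting"] == b"hello"
```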
The `sqlite3` module provides an interface to the SQLite C library, and next to `pathlib` it's my favourite part of the standard library. You can use it with files and with in-memory storage. You will often want something like the excellent `sqlite-utils` in addition.
Connections and cursors
The basic workflow is to create a connection and a cursor, to execute statements with the cursor, and to commit them on the connection. `connect()` can change some parameters (like the isolation level), and takes either a path-like object or `":memory:"` as the connection string.
On a connection object, you can get a cursor, `commit()` statements, and `close()` the connection. Use `interrupt()` to stop running queries. With `create_function()` and `create_collation()`, you can provide user-created functionality and use it in later queries.
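For example, a toy collation that orders strings by their reversed form (the name and logic are made up for illustration):

```python
import sqlite3

def reverse_collation(a, b):
    # Compare the reversed strings; return -1, 0, or 1 like a comparator.
    a, b = a[::-1], b[::-1]
    return (a > b) - (a < b)

connection = sqlite3.connect(":memory:")
connection.create_collation("reversed", reverse_collation)
connection.execute("CREATE TABLE words (word TEXT)")
connection.executemany(
    "INSERT INTO words VALUES (?)", [("apple",), ("pear",), ("plum",)]
)
rows = connection.execute(
    "SELECT word FROM words ORDER BY word COLLATE reversed"
).fetchall()
# Sorted by reversed form: "elppa" < "mulp" < "raep"
assert [row[0] for row in rows] == ["apple", "plum", "pear"]
```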
`load_extension()` loads handy extensions (after you allow this with `enable_load_extension()`). You can change the `text_factory` attribute of a connection to change how database responses get translated to Python. `total_changes` tells you how many rows have been modified during this connection. Use `iterdump()` to systematically retrieve all data in the database.
You can also set some callback functions. `set_authorizer()` sets a callback to be called every time a column is about to be accessed, and it can allow or deny progress. `set_progress_handler()` will trigger its callback every `n` instructions of the SQLite virtual machine (this is useful to execute stuff during long-running queries, even if it's just a progress bar). `set_trace_callback()` allows you to receive a call every time a statement is executed.
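A sketch of the trace callback, which here simply logs every executed statement:

```python
import sqlite3

statements = []
connection = sqlite3.connect(":memory:")
# The callback receives each executed SQL statement as a string.
connection.set_trace_callback(statements.append)
connection.execute("CREATE TABLE numbers (n INTEGER)")
connection.execute("INSERT INTO numbers VALUES (1)")
assert any("INSERT INTO numbers" in s for s in statements)
```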
On cursors, use `execute()` and `executemany()` to execute queries. Please remember to use prepared statements and pass parameters instead of using string formatting. You can also use `executescript()` to execute multiple statements in one go. Use `fetchone()` and `fetchall()` to retrieve result rows. `lastrowid` is really nice to keep in mind.
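Parameter passing and `lastrowid` in one sketch:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# "?" placeholders keep user input out of the SQL string.
cursor.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
assert cursor.lastrowid == 1  # id of the row we just inserted

# executemany() runs the same statement once per parameter tuple.
cursor.executemany("INSERT INTO users (name) VALUES (?)", [("bob",), ("carol",)])
names = [row[0] for row in cursor.execute("SELECT name FROM users ORDER BY id")]
assert names == ["alice", "bob", "carol"]
```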
`complete_statement()` will tell you if a string contains one or multiple complete SQL statements. There's also a set of exception classes that you will want to take a look at.
The `execute()`, `executemany()`, and `executescript()` methods are also available on connection objects, so you can maybe avoid dealing with cursor objects completely.
When you set `connection.row_factory = sqlite3.Row`, you can access result rows by name, and not only by index. Use connection objects as context managers to commit automatically on success and to trigger rollbacks on exceptions.
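Both tips together in one sketch:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row

# As a context manager, the connection commits on success
# and rolls back if an exception escapes the block.
with connection:
    connection.execute("CREATE TABLE posts (title TEXT, day INTEGER)")
    connection.execute("INSERT INTO posts VALUES (?, ?)", ("Persistence", 7))

row = connection.execute("SELECT * FROM posts").fetchone()
assert row["title"] == "Persistence"  # access by name ...
assert row[1] == 7                    # ... or by index
```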
Some Python types are automatically translated, such as `bytes`, which becomes a BLOB. All classes that have a `__conform__(self, protocol)` method will use this method to interact with the database. Alternatively, you can also register adapters for specific types with `sqlite3.register_adapter()`. If you also register a converter, you can completely roundtrip your custom values. The default adapters and converters take care of dates and datetimes.
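A roundtrip sketch with a made-up `Point` class (the official docs use a similar example):

```python
import sqlite3

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def adapt_point(point):
    # Python -> database: store the point as b"x;y".
    return f"{point.x};{point.y}".encode()

def convert_point(blob):
    # Database -> Python: converters always receive bytes.
    x, y = blob.split(b";")
    return Point(float(x), float(y))

sqlite3.register_adapter(Point, adapt_point)
sqlite3.register_converter("point", convert_point)

# PARSE_DECLTYPES makes sqlite3 look at the declared column type ("point").
connection = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
connection.execute("CREATE TABLE coords (position point)")
connection.execute("INSERT INTO coords VALUES (?)", (Point(1.0, 2.5),))
point = connection.execute("SELECT position FROM coords").fetchone()[0]
assert (point.x, point.y) == (1.0, 2.5)
```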