December 7: Python Data Persistence Modules
Strap in, this is a long one: As part of the Python Standard Library traversal, after yesterday's short post about binary data services, today we're going to look at Python data persistence modules: pickle, copyreg, shelve, marshal, dbm and sqlite3.
Highlights
- You can use shelve to store pickled objects in a persistent key-value store.
- The Unix dbm library is a thing that exists and has Python standard library support.
pickle
Use the pickle module to serialize and deserialize Python objects. Only deserialize data that you trust, as it's emphatically not safe. Consider signing your data or choosing safer serialization formats.
Data stream format
The data format is Python-specific and uses a compact binary representation. You can compress it further if need be. Use pickletools to analyze the generated data. There are six protocol versions (0 through 5) to choose from when pickling.
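For a taste of what that analysis looks like, here's a minimal sketch using pickletools.dis() to disassemble a pickle stream:

```python
import pickle
import pickletools

data = pickle.dumps({"name": "example", "value": 42})

# Print an opcode-level disassembly of the pickle stream.
pickletools.dis(data)

# optimize() strips unused PUT opcodes, often shrinking the stream.
print(len(pickletools.optimize(data)), "<=", len(data))
```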
Module interface
Just like with json, use dump and dumps to store data and load and loads to retrieve data. These functions use the Pickler and Unpickler classes internally, which you can also use and customise further.
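A minimal round trip with both pairs might look like this (the file name is just an example):

```python
import pickle

record = {"id": 1, "tags": ["a", "b"]}

# dumps/loads work on bytes in memory ...
blob = pickle.dumps(record)
assert pickle.loads(blob) == record

# ... while dump/load work on binary file objects.
with open("record.pickle", "wb") as f:
    pickle.dump(record, f)
with open("record.pickle", "rb") as f:
    assert pickle.load(f) == record
```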
What to pickle
You can pickle basically all built-in types, plus functions and classes defined at a module's top level. Additionally, instances of classes whose __dict__ representation can be pickled can be pickled themselves.
Pickling instances
Implementing your own pickling process is not trivial, so read up on it carefully. The primary interface is the
__reduce__()
method. You can also use __get_state__()
and __set_state__()
to modify and restore state, eg. to
work around parts of the object state that cannot be pickled, like file objects.
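As a sketch of the idea (modelled on the example in the docs), here's a hypothetical class that drops its open file handle when pickled and reopens it when unpickled:

```python
class LogReader:
    def __init__(self, path):
        self.path = path
        self.file = open(path)  # file objects cannot be pickled

    def __getstate__(self):
        # Remove the open file handle from the pickled state.
        state = self.__dict__.copy()
        del state["file"]
        return state

    def __setstate__(self, state):
        # Restore the attributes, then reopen the file.
        self.__dict__.update(state)
        self.file = open(self.path)
```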
pickle supports references to objects that have not been pickled. They are referred to by a persistent ID. The docs have a code example of how to use this ID.
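In rough terms, you override persistent_id() on a Pickler and persistent_load() on the matching Unpickler; this sketch with a made-up User type shows the shape of it:

```python
import io
import pickle

class User:
    def __init__(self, user_id):
        self.user_id = user_id

class UserPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Store a reference instead of the object itself.
        if isinstance(obj, User):
            return ("User", obj.user_id)
        return None  # pickle everything else as usual

class UserUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # Resolve the reference back into an object.
        tag, user_id = pid
        if tag == "User":
            return User(user_id)
        raise pickle.UnpicklingError("unsupported persistent id")

buffer = io.BytesIO()
UserPickler(buffer).dump([User(1), "other data"])
buffer.seek(0)
restored = UserUnpickler(buffer).load()
```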
Restricting globals
You can modify the unpickling process to restrict it to a named set of globals, instead of unpickling just about anything. This might make things slightly safer.
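The hook for this is find_class() on a custom Unpickler; a minimal sketch, with an allow-list of my own choosing:

```python
import builtins
import io
import pickle

SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only allow a small set of safe builtins.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

print(restricted_loads(pickle.dumps(range(5))))  # works
# restricted_loads(pickle.dumps(open))           # raises UnpicklingError
```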
copyreg
Use copyreg.pickle to register a function as converter for a specific type. This function will be used by the copy and pickle modules.
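A small sketch with a hypothetical Point class:

```python
import copy
import copyreg
import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def pickle_point(point):
    # Return a constructor and its arguments, like __reduce__ would.
    return Point, (point.x, point.y)

copyreg.pickle(Point, pickle_point)

p = pickle.loads(pickle.dumps(Point(1, 2)))  # uses pickle_point
q = copy.copy(Point(3, 4))                   # so does copy
```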
shelve
The shelve module provides a persistent key-value store that maps string keys to arbitrary picklable Python objects.
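Usage might look like this (the file name is just an example):

```python
import shelve

# Keys are strings; values can be any picklable object.
with shelve.open("app-state") as db:
    db["window"] = {"width": 800, "height": 600}

# The data survives across processes.
with shelve.open("app-state") as db:
    print(db["window"]["width"])  # 800
```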
marshal
marshal is an insecure, internal Python object serialization protocol. It is not meant to be used as a general persistence module, and serves mainly to read and write the pseudo-compiled .pyc files. Use pickle if you're not a Python core developer.
dbm
The dbm module provides an interface to the Unix dbm database, which is apparently a thing that exists. It provides a key-value data store. There are different versions (a GNU one, of course, and also a Berkeley one and a "new" one), which are naturally incompatible. Now you know it exists and has a Python implementation. If you want to store arbitrary Python objects (aka have pickle support) in a key-value store, use the shelve module instead.
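If you do want to poke at it, a minimal sketch (file name made up; keys and values end up as bytes):

```python
import dbm

# dbm.open picks whichever dbm implementation is available;
# the "c" flag creates the file if it doesn't exist yet.
with dbm.open("cache", "c") as db:
    db["greeting"] = "hello"  # strings are encoded to bytes

with dbm.open("cache") as db:
    print(db[b"greeting"])  # b'hello'
```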
sqlite3
The sqlite3 module provides an interface to SQLite, a C library, and is, next to pathlib, my favourite part of the standard library. You can use it with files and with in-memory storage. You will often want something like the excellent sqlite-utils in addition.
connections and cursors
The basic workflow is to create a connection and a cursor, to execute statements with the cursor, and to commit them on the connection. connect() accepts some parameters (like the isolation level), and takes either a path-like object or ":memory:" as connection string.
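A minimal sketch of that workflow:

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # or a file path
cursor = connection.cursor()

cursor.execute("CREATE TABLE posts (title TEXT, day INTEGER)")
# Always pass parameters instead of formatting them into the string.
cursor.execute("INSERT INTO posts VALUES (?, ?)", ("Data Persistence", 7))
connection.commit()

print(cursor.execute("SELECT title FROM posts").fetchone())
connection.close()
```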
On a connection object, you can get a cursor(), commit() statements and close() the connection. Use interrupt() to stop running queries. With create_function(), create_aggregate() and create_collation(), you can provide user-created functionality and use it in later queries. enable_load_extension() and load_extension() let you load handy extensions. You can change the row_factory and text_factory attributes of a connection to change how database responses get translated to Python.
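As an example of the user-created functionality, create_function() registers a Python callable under a SQL name (the slugify function here is made up):

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Expose a Python function to SQL: name, number of arguments, callable.
connection.create_function("slugify", 1, lambda s: s.lower().replace(" ", "-"))

row = connection.execute("SELECT slugify('Data Persistence')").fetchone()
print(row[0])  # data-persistence
```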
total_changes tells you how many rows have been modified during this connection. Use iterdump() and backup() to systematically retrieve all data in the database.
You can also set some callback functions. set_authorizer() sets a callback to be called every time a column is about to be accessed, and can allow or deny progress. set_progress_handler() will trigger the callback every n instructions of the SQLite virtual machine (this is useful to execute stuff during long-running queries, even if it's just a progress bar). set_trace_callback() allows you to receive a call every time a statement is executed.
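A small sketch of two of those hooks (the handlers here are just illustrations):

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Log every statement as it is executed.
connection.set_trace_callback(print)

# Call a handler every 1000 SQLite VM instructions; returning a truthy
# value from the handler would abort the running query.
connection.set_progress_handler(lambda: print("still working..."), 1000)

connection.execute("CREATE TABLE t (x INTEGER)")
```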
On cursors, use execute() and executemany() to execute queries. Please remember to use prepared statements and pass parameters instead of string formatting. You can also use executescript() to execute multiple statements in one go. Use fetchone(), fetchmany() or fetchall() to retrieve result rows. lastrowid is really nice to keep in mind.
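A sketch of those cursor methods in action:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE tags (name TEXT)")

cursor.execute("INSERT INTO tags VALUES (?)", ("python",))
print(cursor.lastrowid)  # rowid of the row we just inserted

# executemany() runs the statement once per parameter tuple.
cursor.executemany("INSERT INTO tags VALUES (?)", [("sqlite",), ("dbm",)])

cursor.execute("SELECT name FROM tags ORDER BY name")
print(cursor.fetchone())  # first row
print(cursor.fetchall())  # all remaining rows
```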
helpers
complete_statement() will tell you if a string contains one or multiple complete SQL statements. There's also a set of exception classes that you will want to take a look at.
The execute, executemany and executescript methods are also available on connection objects, so you can maybe avoid dealing with cursor objects completely.
When you set connection.row_factory = sqlite3.Row, you can access result rows by name, and not only by index. Use connection objects as context managers to commit automatically on success and trigger rollbacks on exceptions.
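Both of those combined might look like this:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row
connection.execute("CREATE TABLE posts (title TEXT, day INTEGER)")

# The context manager commits on success and rolls back on exceptions.
with connection:
    connection.execute("INSERT INTO posts VALUES (?, ?)", ("sqlite3", 7))

row = connection.execute("SELECT * FROM posts").fetchone()
print(row["title"], row["day"])  # access columns by name
```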
types
Some Python types are automatically translated, such as None to NULL, int to INTEGER, float to REAL, str to TEXT and bytes to BLOB. All classes that have a __conform__(self, protocol) method will use this method to interact with the database. Alternatively, you can also register adapters for specific types with sqlite3.register_adapter(). If you also register a converter, you can completely roundtrip your custom values. The default adapters and converters take care of dates and datetimes.
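A sketch of the full roundtrip with a hypothetical Point type (modelled on the adapter/converter example in the docs):

```python
import sqlite3

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# Adapter: Python value -> SQLite-compatible value (here: TEXT).
sqlite3.register_adapter(Point, lambda p: f"{p.x};{p.y}")

# Converter: bytes from the database -> Python value.
sqlite3.register_converter(
    "point", lambda b: Point(*map(float, b.split(b";")))
)

# detect_types makes sqlite3 apply converters to declared column types.
connection = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
connection.execute("CREATE TABLE positions (p point)")
connection.execute("INSERT INTO positions VALUES (?)", (Point(1.0, 2.5),))

point = connection.execute("SELECT p FROM positions").fetchone()[0]
print(point.x, point.y)  # 1.0 2.5
```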