December 9: Python File Format Modules
Combining the commonplace and the weirdly specific: As part of the Python Standard Library traversal, after yesterday's short post about data compression and archiving modules, today we're going to look at Python file format modules. Those include csv, configparser, netrc, xdrlib and the forgettable plistlib.
Highlights
- Writing your own csv dialect is extremely easy. I wonder how cursed you can make it.
- configparser has an extended interpolation mode that allows you to refer to values from other sections like ${OtherSection:other_value}.
- You can configure lots of things about configparser, like the delimiters, the comment prefix, strict mode, multiline strings with empty lines, empty/flag values, …
- Why on earth is csv in a different documentation block from json and xml?
- I learnt that the netrc FTP configuration format exists, and requires its own stdlib support module.
- I learnt that the xdr file format exists, and at least the Python interface sounds pretty annoying.
- I learnt that the plist Apple format exists, and it sounds not that bad.
csv
The csv module allows you to read and write the different things that call themselves "CSV". It's extremely useful, and its DictReader and DictWriter classes are the best (the docs don't spend nearly enough time stressing that point).
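To show why the dict-based classes are so pleasant, here is a minimal sketch of a round trip through csv.DictWriter and csv.DictReader. The column names and rows are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical sample data, just for illustration.
rows = [
    {"module": "csv", "verdict": "useful"},
    {"module": "xdrlib", "verdict": "annoying"},
]

# An in-memory buffer stands in for a file opened with newline="".
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["module", "verdict"])
writer.writeheader()      # first line: the column names
writer.writerows(rows)    # one line per dictionary

# Read everything back: each row becomes a dict keyed by the header line.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
```

Note that everything comes back as strings, so the round trip is only lossless here because the sample values were strings to begin with.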
Dialects
There are tons of CSV dialects. Python supports some out of the box, and you can add your own. It also ships a Sniffer class that can guess the appropriate format from a sample. Standard dialects are "excel", "excel-tab" (tab-delimited) and "unix" (with \n line endings and quoting all fields).
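As a rough sketch of the guessing part: csv.Sniffer takes a text sample and returns a dialect you can hand to a reader. The sample below is invented, and I restrict the candidate delimiters because the heuristics can be easily confused by small samples.

```python
import csv
import io

# Invented sample: semicolon-separated, as some spreadsheet exports are.
sample = "name;score\nada;1\ngrace;2\n"

# Ask the sniffer to pick a delimiter from a short list of candidates.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")

# The guessed dialect can be passed straight to a reader.
parsed = list(csv.reader(io.StringIO(sample), dialect))
```

The sniffer is a heuristic, so on real-world data it is worth checking the result rather than trusting it blindly.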
When building a dialect, you can specify the delimiter (default ,), whether quote characters inside a field are doubled or prefixed with the escape character, the escape character itself (defaulting to none) and the quote character (default "). You can also instruct the parser to ignore whitespace directly after a delimiter, and enable strict mode.
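Here is a small sketch of a mildly cursed dialect using those options. The dialect name "pipes" and the PipeDialect class are my own inventions; the attribute names are the ones csv.Dialect expects.

```python
import csv
import io


class PipeDialect(csv.Dialect):
    """Hypothetical dialect: pipe-delimited, single quotes, backslash escapes."""

    delimiter = "|"
    quotechar = "'"
    doublequote = False       # do not double quote characters ...
    escapechar = "\\"         # ... escape them with a backslash instead
    skipinitialspace = True   # ignore spaces right after a delimiter
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL
    strict = True             # raise csv.Error on malformed input


# Register it under a name so readers and writers can refer to it.
csv.register_dialect("pipes", PipeDialect)

buffer = io.StringIO()
csv.writer(buffer, dialect="pipes").writerow(["plain", "with|pipe", "it's"])
written = buffer.getvalue()

# Reading with the same dialect recovers the original fields.
parsed = next(csv.reader(io.StringIO(written), dialect="pipes"))
```

Registering the dialect is optional: you can also pass the class (or the individual options as keyword arguments) directly to the reader and writer.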
configparser
The configparser module lets you read and write INI config files. It maps each file to what amounts to nested dictionaries (and you can access values as config["section"]["value"]), and can write them back to disk and handle default values, too. It's great because it's readable and writable. It sucks because everything is a string until you squint really hard (with methods like getboolean).
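A minimal sketch of the dictionary-like access and the squinting, with an invented config (the section and option names are assumptions for illustration) read from a string instead of a file:

```python
import configparser
import io

config = configparser.ConfigParser()
config.read_string("""
[DEFAULT]
debug = no

[server]
host = example.org
port = 8080
debug = yes
""")

host = config["server"]["host"]                # plain string
raw_port = config["server"]["port"]            # still a string: "8080"
port = config["server"].getint("port")         # explicit conversion to int
debug = config["server"].getboolean("debug")   # "yes" becomes True

# Values have to be written as strings, too.
config["server"]["port"] = "9090"
out = io.StringIO()
config.write(out)  # serialises back to INI syntax
```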
You can parse several config files into the same parser, and the last one wins over earlier ones. This is useful to provide fallback values (either by a default config file or by loading a dictionary).
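The "last one wins" behaviour can be sketched like this, using a dictionary for the fallback values and a string standing in for the user's config file (both are made up for the example):

```python
import configparser

parser = configparser.ConfigParser()

# Fallback values, loaded first.
parser.read_dict({"server": {"host": "localhost", "port": "8000"}})

# A later source only overrides the options it actually sets.
parser.read_string("[server]\nport = 9000\n")

host = parser["server"]["host"]   # untouched fallback
port = parser["server"]["port"]   # overridden by the later source
```

With real files, parser.read() accepts a list of paths and silently skips the ones that do not exist, which makes the same pattern easy to apply.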
interpolation
The parsers support interpolation, that is, %(previously_defined_variable)s expansion. You can configure the parser to use extended interpolation, which allows you to refer to values from other sections, like ${OtherSection:other_value}.
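Both styles side by side, as a small sketch with invented section and option names:

```python
import configparser

# Default (basic) interpolation: %(name)s, resolved within the same section
# or the default section.
basic = configparser.ConfigParser()
basic.read_string("""
[paths]
home = /home/user
data = %(home)s/data
""")
data_dir = basic["paths"]["data"]

# Extended interpolation: ${section:option} can reach into other sections.
extended = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation()
)
extended.read_string("""
[paths]
home = /home/user

[app]
cache = ${paths:home}/cache
""")
cache_dir = extended["app"]["cache"]
```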
customisation
You can change a bunch of things about the parsers. For one, by default, sections will be written back in the order they were added, but you can change the dict type (and thereby the ordering). You can elect to support files with no values for some settings (this_is_a_flag). You could even change the comment prefixes (# and ; by default), allow inline comments, or change the delimiters (= and :) to something else. You can disable strict mode, allow empty lines in multiline strings, rename the default section, and add custom data converters.
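To get a feeling for how far this goes, here is a sketch that turns several of those knobs at once. The "->" delimiter, the "//" comment prefix, the "common" section name and the list converter are all arbitrary choices of mine; the keyword arguments are the ones ConfigParser accepts.

```python
import configparser

parser = configparser.ConfigParser(
    allow_no_value=True,                # permit flag-style options
    delimiters=("->",),                 # instead of = and :
    comment_prefixes=("//",),           # instead of # and ;
    inline_comment_prefixes=("//",),    # allow comments after a value
    default_section="common",           # instead of DEFAULT
    # A converter named "list" adds getlist() to the parser and its sections.
    converters={"list": lambda value: [item.strip() for item in value.split(",")]},
)
parser.read_string("""
// a full-line comment
[common]
colour -> blue

[app]
verbose
tags -> a, b, c   // inline comment
""")

flag = parser["app"]["verbose"]        # option without a value
tags = parser["app"].getlist("tags")   # custom converter
colour = parser["app"]["colour"]       # inherited from the renamed default section
```

Options without a value come back as None, so checking for a flag means checking for the key, not for a truthy value.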
netrc
netrc is a format of deep and unending complexity: In three lines, you tell your FTP client about the remote machine name, username and password. The values are space separated. This is it.
With authenticators(host), you retrieve the login data for the specified host as a (login, account, password) tuple. With hosts, you get the complete dictionary of host configs.
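A quick sketch of both calls, using a throwaway file with invented credentials instead of a real ~/.netrc:

```python
import netrc
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical netrc file with a single, made-up entry.
    path = os.path.join(tmp, "example.netrc")
    with open(path, "w") as fh:
        fh.write("machine ftp.example.org\nlogin alice\npassword s3cret\n")
    rc = netrc.netrc(path)  # the file is parsed right away

# authenticators() returns a (login, account, password) tuple ...
login, _account, password = rc.authenticators("ftp.example.org")

# ... or None if there is neither a matching nor a default entry.
missing = rc.authenticators("unknown.example.org")

# hosts maps machine names to the same kind of tuple.
known_hosts = sorted(rc.hosts)
```

Called without an argument, netrc.netrc() reads the .netrc file in your home directory instead.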
xdrlib
I learnt a lot of things today. First, there is such a thing as the XDR (external data representation) serialization format. Second, it's in use in ZFS, R, SpiderMonkey and libvirt. Python supports "most of the data types described in the RFC". You have to use individual pack_{type} and unpack_{type} methods on the Packer and Unpacker classes, which sounds like a ton of fun.
plistlib
I learnt a lot of things today. First, on Apple systems, there are .plist files (for "property lists"). Second, they exist in binary and XML versions. Third, Python has native support for this file format.
These files are usually used for configuration, like Windows INI files. The format supports a set of types, including numbers, strings, booleans, tuples, lists, dictionaries with string keys, bytes and datetimes. You use the load, loads, dump and dumps functions to interface with the files. And that's it.
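For completeness, a small sketch of a round trip through both the XML and the binary flavour. The settings dictionary is invented; dumps and loads work on bytes, while dump and load take a file opened in binary mode.

```python
import datetime
import plistlib

# Made-up settings covering several of the supported types.
settings = {
    "name": "example",
    "retries": 3,
    "ratio": 0.5,
    "enabled": True,
    "tags": ["a", "b"],
    "blob": b"\x00\x01",
    "created": datetime.datetime(2020, 12, 9, 12, 0, 0),
}

xml_bytes = plistlib.dumps(settings)  # XML is the default format
binary_bytes = plistlib.dumps(settings, fmt=plistlib.FMT_BINARY)

# loads() detects the format on its own.
from_xml = plistlib.loads(xml_bytes)
from_binary = plistlib.loads(binary_bytes)
```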