December 9: Python File Format Modules

Combining the commonplace and the weirdly specific: As part of the Python Standard Library traversal, after yesterday's short post about data compression and archiving modules, today we're going to look at Python file format modules. Those include csv, configparser, netrc, xdrlib and the forgettable plistlib.

Highlights

  • Writing your own csv dialect is extremely easy. I wonder how cursed you can make it.
  • configparser has an extended interpolation mode that allows you to refer to values from other sections like ${OtherSection:other_value}.
  • You can configure lots of things about configparser, like the delimiters, comment prefix, strict mode, enable multiline strings with empty lines, permit empty/flag values, …
  • Why on earth is csv in a different documentation block from json and xml?
  • I learnt that the netrc FTP configuration format exists, and requires its own stdlib support module.
  • I learnt that xdr file format exists, and at least the Python interface sounds pretty annoying.
  • I learnt that plist Apple format exists, and it sounds not that bad.

csv

The csv module allows you to read and write the different things that call themselves "CSV". It's extremely useful, and its csv.DictReader and csv.DictWriter classes are the best (the docs don't spend nearly enough time stressing that point).
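
To show why I like the dict-based classes so much, here is a minimal round trip. The column names and the rows are made up for illustration, and I write to an in-memory buffer instead of a real file.

```python
import csv
import io

# Hypothetical sample data, purely for illustration.
rows = [
    {"name": "Ada", "language": "Python"},
    {"name": "Linus", "language": "C"},
]

# newline="" leaves line endings to the csv module (same advice as for open()).
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "language"])
writer.writeheader()
writer.writerows(rows)

# Read everything back: each row becomes a dict keyed by the header line.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
```

Note that everything comes back as strings, so numeric columns would need converting by hand.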

Dialects

There are tons of CSV dialects. Python supports some out of the box, and you can add your own. It also ships a csv.Sniffer class that can guess the appropriate dialect from a sample of the data (readers and writers default to "excel" unless you tell them otherwise). Standard dialects are "excel", "excel-tab" (tab-delimited) and "unix" (with \n line endings and quoting all fields).
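
A small sketch of the sniffer at work. As far as I can tell you have to call csv.Sniffer yourself and pass the result to the reader. The sample data is invented, and I restrict the candidate delimiters to keep the guess predictable.

```python
import csv

# Invented semicolon-separated sample.
sample = "name;age\nalice;30\nbob;25\ncarol;41\n"

# sniff() returns a Dialect subclass describing its best guess.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")

# The guessed dialect can be handed straight to a reader.
rows = list(csv.reader(sample.splitlines(), dialect))
```

The sniffer is heuristic, so for data you control it is probably safer to name the dialect explicitly.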

When building a dialect, you can specify the delimiter (,), choose whether quote characters inside a field are doubled or prefixed with the escape character, set an escape character (defaulting to none), set a quote character ("), instruct the parser to ignore whitespace directly after a delimiter, and enable strict mode.
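
Here is a mildly cursed dialect as a sketch of those options. The class name PipeDialect and the registered name "pipes" are my own inventions, not anything the stdlib ships.

```python
import csv
import io


class PipeDialect(csv.Dialect):
    """Hypothetical dialect: pipes as delimiters, single quotes, backslash escapes."""

    delimiter = "|"
    quotechar = "'"
    doublequote = False        # escape embedded quotes instead of doubling them
    escapechar = "\\"
    skipinitialspace = True    # ignore whitespace right after a delimiter
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL
    strict = True              # raise csv.Error on malformed input


# Registering makes the dialect available by name.
csv.register_dialect("pipes", PipeDialect)

buffer = io.StringIO(newline="")
csv.writer(buffer, dialect="pipes").writerow(["a|b", "it's", "plain"])

# Parse the line we just wrote with the same dialect.
buffer.seek(0)
parsed = next(csv.reader(buffer, dialect="pipes"))
```

Swapping the delimiter and quote character for something even less sensible is left as an exercise.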

configparser

The configparser module lets you read and write INI config files. It maps each file to what amounts to nested dictionaries (and you can access values as config["section"]["value"]), and can write them back to disk and handle default values, too. It's great because it's readable and writable. It sucks because everything is a string until you squint really hard (with methods like getboolean).

You can parse several config files into the same parser, and the last one wins over earlier ones. This is useful to provide fallback values (either by a default config file or by loading a dictionary).
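
A minimal sketch of the layering, assuming a made-up [server] section: defaults come from a dictionary, a later string (standing in for a user config file) overrides some of them, and the typed getters do the squinting.

```python
import configparser

config = configparser.ConfigParser()

# First source: fallback values from a dictionary (hypothetical settings).
config.read_dict({"server": {"host": "localhost", "port": "8080", "debug": "no"}})

# Second source: stands in for a user config file; later sources win.
config.read_string("""
[server]
port = 9090
debug = yes
""")

host = config["server"]["host"]                                # untouched default
port = config["server"].getint("port")                         # overridden, as int
debug = config["server"].getboolean("debug")                   # "yes" becomes True
timeout = config["server"].getfloat("timeout", fallback=30.0)  # missing key
```

With real files you would pass a list of paths to config.read() instead, and missing files are skipped silently.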

interpolation

The parsers support interpolation, that is, %(previously_defined_variable)s expansion. You can configure the parser to use extended interpolation, which allows you to refer to values from other sections, like ${OtherSection:other_value}.
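
Both interpolation styles side by side, as a sketch. The section and option names (paths, app, home and so on) are invented for the example.

```python
import configparser

# Basic interpolation is the default: %(name)s refers to a value in the
# same section (or the default section).
basic = configparser.ConfigParser()
basic.read_string("""
[paths]
home = /home/user
cache = %(home)s/.cache
""")
cache = basic["paths"]["cache"]

# Extended interpolation uses ${section:option} and can cross sections.
extended = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation()
)
extended.read_string("""
[paths]
home = /home/user

[app]
log_dir = ${paths:home}/logs
""")
log_dir = extended["app"]["log_dir"]
```

Interpolation happens on access, not while parsing, so the referenced value only needs to exist by the time you read the option.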

customisation

You can change a bunch of things about the parsers. For one, by default, sections will be written back in the order they were added, but you can change the dict type (and thereby the ordering). You can elect to support files with no values for some settings (this_is_a_flag). You could even change the comment prefixes (# and ; by default, with inline comments off unless you enable them) or the delimiters (= and :) to something else. You can disable strict mode, allow empty lines in multiline strings, rename the default section, and add custom data converters.
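
To give an idea of how far this goes, here is a sketch of a parser with several knobs turned at once. The "=>" delimiter, the "//" comments, the "common" default section and the list converter are all arbitrary choices of mine.

```python
import configparser

config = configparser.ConfigParser(
    allow_no_value=True,               # permit bare flags without a value
    delimiters=("=>",),                # instead of "=" and ":"
    comment_prefixes=("//",),          # instead of "#" and ";"
    inline_comment_prefixes=("//",),   # inline comments are off by default
    default_section="common",          # instead of "DEFAULT"
    # A converter named "list" adds getlist() to the parser and its sections.
    converters={"list": lambda value: [item.strip() for item in value.split(",")]},
)

config.read_string("""
[common]
owner => me

[features]
// a full-line comment
this_is_a_flag
tags => red, green, blue  // an inline comment
""")

flag = config["features"]["this_is_a_flag"]   # None: present, but no value
tags = config["features"].getlist("tags")     # uses the custom converter
owner = config["features"]["owner"]           # inherited from [common]
```

Whether anyone else can still read the resulting file is a separate question.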

netrc

netrc is a format of deep and unending complexity: In three lines, you tell your FTP client about the remote machine name, username and password. The values are whitespace separated. This is it.

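With authenticators(host), you retrieve the login data for the specified host as a (login, account, password) tuple, or None if there is neither a matching entry nor a default one. With the hosts attribute, you get the complete dictionary of host configs.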
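
A quick sketch of the whole API, using a throwaway file so nobody's real ~/.netrc gets involved. The host name and credentials are obviously made up.

```python
import netrc
import os
import tempfile

# Made-up entry; a real file would live at ~/.netrc.
content = "machine ftp.example.org login alice password s3cret\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "netrc")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    # The file is parsed completely in the constructor.
    credentials = netrc.netrc(path)

# authenticators() returns a (login, account, password) tuple ...
login, account, password = credentials.authenticators("ftp.example.org")
# ... or None when the host is unknown and there is no default entry.
missing = credentials.authenticators("unknown.example.org")

known_hosts = list(credentials.hosts)
```

Calling netrc.netrc() without an argument reads the file in your home directory instead.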

xdrlib

I learnt a lot of things today. First, there is such a thing as the XDR (external data representation) serialization format. Second, it's in use in ZFS, R, SpiderMonkey and libvirt. Python supports "most of the data types described in the RFC". You have to use individual pack_{type} and unpack_{type} methods on the Packer and Unpacker classes, which sounds like a ton of fun.

plistlib

I learnt a lot of things today. First, on Apple systems, there are .plist files (for "property lists"). Second, they exist in binary and XML versions. Third, Python has native support for this file format.

These files are usually used for configuration, like Windows INI files. The format supports a set of types, including numbers, strings, booleans, tuples, lists, dictionaries with string keys, bytes and datetimes. You use the load and loads and dump and dumps functions to interface with the files. And that's it.
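
For completeness, a sketch of a round trip through both flavours of the format. The settings dictionary is invented, and I stick to the in-memory dumps/loads pair rather than real files.

```python
import datetime
import plistlib

# Invented settings covering most of the supported types.
settings = {
    "name": "example",
    "launch_count": 3,
    "ratio": 0.5,
    "enabled": True,
    "tags": ["a", "b"],
    "blob": b"\x00\x01",
    "created": datetime.datetime(2020, 12, 9, 12, 0, 0),
}

# XML is the default output format; binary has to be requested explicitly.
xml_bytes = plistlib.dumps(settings)
binary_bytes = plistlib.dumps(settings, fmt=plistlib.FMT_BINARY)

# loads() detects the format on its own.
from_xml = plistlib.loads(xml_bytes)
from_binary = plistlib.loads(binary_bytes)
```

One thing to keep in mind: tuples are written as arrays, so they come back as lists.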