December 2: Python Binary Data Services

As part of the Python Standard Library traversal, today we're following up yesterday's text processing services with the binary data services: struct and codecs.

Highlights

  • codecs.encode("foo", "rot13") 😍
  • Now I know about yet another formatting language in Python. Thanks, struct.

struct

Convert between Python values and C structs (which are represented as bytes objects). You can pack and unpack values according to format strings that you pass in. For larger buffers, iter_unpack iterates over a run of equally sized records, and unpack_from unpacks a record starting at a given offset, so you don't have to slice the buffer yourself.
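To make that concrete, here is a small sketch of the four functions. The record layout (an unsigned short followed by an unsigned int, little-endian) is just something I made up for the example.

```python
import struct

# Made-up record layout: unsigned short + unsigned int, little-endian, no padding.
RECORD = "<HI"

# pack() turns Python values into bytes, unpack() reverses it.
record = struct.pack(RECORD, 1, 2)
values = struct.unpack(RECORD, record)

# Two records back to back in one buffer.
buffer = struct.pack(RECORD, 1, 2) + struct.pack(RECORD, 3, 4)

# iter_unpack() yields one tuple per record in the buffer.
records = list(struct.iter_unpack(RECORD, buffer))

# unpack_from() reads a single record starting at a byte offset.
second = struct.unpack_from(RECORD, buffer, offset=struct.calcsize(RECORD))

print(record, values, records, second)
```

Note that unpack always returns a tuple, even for a single value.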

Format strings

By default, you get native (system-appropriate) byte order, sizes and alignment, but you can change those with the first character of the format string. Other than that, you get a typical type-based placeholder language, aka yet-another formatting language yaay?
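A quick sketch of what that first character does. Only the standard-size results are spelled out here, since the native ones depend on your platform.

```python
import struct

# ">" is big-endian, "<" is little-endian; both use standard sizes and no padding.
big = struct.pack(">I", 1)
little = struct.pack("<I", 1)

# "=" means native byte order, but still standard sizes and no padding.
standard_size = struct.calcsize("=BI")

# "@" (the default) uses native sizes and alignment, so padding may be added.
# The exact number depends on the platform, so it is only printed here.
native_size = struct.calcsize("@BI")

print(big, little, standard_size, native_size)
```

If you are exchanging data with other machines, picking an explicit byte order instead of relying on the default is probably the safer choice.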

codecs

This module defines base classes for codecs: encoders and decoders. Most of the ones included with Python are for text/byte conversion. codecs.encode() and codecs.decode() do what str.encode() and bytes.decode() do. With codecs.lookup(name) you can get a CodecInfo object, which gives you direct access to a stream writer, a stream reader, encode and decode functionality and incremental encoding and decoding. If you have codecs of your own, you can register() them. codecs.open() behaves like general open(), but always opens the underlying file in binary mode and handles the encoding and decoding itself. If you need to do weird transcoding magic, use EncodedFile. codecs also defines BOM constants for when you have to meddle with platform dependent data.
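Here is a short sketch of the module-level functions and of what a CodecInfo object gives you, using utf-8 as the example codec.

```python
import codecs

text = "héllo"

# codecs.encode()/decode() mirror str.encode()/bytes.decode().
encoded = codecs.encode(text, "utf-8")
decoded = codecs.decode(encoded, "utf-8")

# lookup() accepts aliases and returns a CodecInfo with a normalized name.
info = codecs.lookup("utf8")

# The stateless encode function returns (output, amount of input consumed).
data, consumed = info.encode(text)

# The incremental decoder copes with multi-byte characters split across chunks.
decoder = info.incrementaldecoder()
first = decoder.decode(b"h\xc3")                 # "\xc3" is held back
rest = decoder.decode(b"\xa9llo", final=True)    # completes the "é"

print(info.name, data, consumed, first + rest, codecs.BOM_UTF8)
```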

Codec Base Classes

Executive summary: You can implement your own Codec subclasses, and it's neither impossible nor particularly painful. You have to implement stateless encoding and decoding as well as stream reading and writing. Additionally, you are encouraged to support at least the two main kinds of error handling, strict and ignore, and optionally further error modes. The module provides base classes for incremental encoding and decoding, and stream encoding and decoding.
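To get a feel for it, here is a deliberately tiny sketch that only covers the stateless part: a toy codec that swaps case and encodes to ASCII. The codec name swapcase_demo and the helper functions are made up for this example, and a real codec would also supply the stream and incremental classes.

```python
import codecs


def swap_encode(text, errors="strict"):
    # Stateless encoders return (output bytes, number of input characters consumed).
    return text.swapcase().encode("ascii", errors), len(text)


def swap_decode(data, errors="strict"):
    # Stateless decoders return (output string, number of input bytes consumed).
    data = bytes(data)
    return data.decode("ascii", errors).swapcase(), len(data)


def search(name):
    # Search functions get the normalized codec name and return CodecInfo or None.
    if name == "swapcase_demo":
        return codecs.CodecInfo(
            name="swapcase_demo", encode=swap_encode, decode=swap_decode
        )
    return None


codecs.register(search)

encoded = codecs.encode("Hello", "swapcase_demo")
decoded = codecs.decode(encoded, "swapcase_demo")

# The error mode is passed through: "ignore" drops what ASCII can't represent,
# while the default "strict" raises UnicodeEncodeError instead.
lossy = codecs.encode("Héllo", "swapcase_demo", "ignore")

print(encoded, decoded, lossy)
```

Delegating the error handling to the built-in ASCII codec is a shortcut that keeps the sketch short; a codec with its own byte mapping would have to handle the error modes itself.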

Standard Encodings

Python comes with a bunch of standard encodings, not just the usual utf-8, latin-1 etc. If you ever need weird encodings, you'll be thankful for it. Some of these are specific to Python itself and have no application outside the language domain. Naq gurer vf nyfb EBG13 fhccbeg.
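A few of the less usual ones in action, as a small sketch. Note that rot13 is a text-to-text codec and hex is a bytes-to-bytes codec, so both only work through codecs.encode() and codecs.decode(), not through str.encode().

```python
import codecs

# rot13: text in, text out.
secret = "Naq gurer vf nyfb EBG13 fhccbeg."
plain = codecs.decode(secret, "rot13")

# hex: bytes in, bytes out.
hexed = codecs.encode(b"hi", "hex")

# unicode_escape is one of the Python-specific encodings:
# it produces the escapes you would write in a Python string literal.
escaped = codecs.encode("é", "unicode_escape")

print(plain, hexed, escaped)
```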

idna

encodings.idna is there to transform non-ASCII characters in domain names into those xn-- strings, with ToASCII() and ToUnicode(). It also provides the nameprep() function, which normalizes domains, mostly by lowercasing them. If you're looking at this, please use the idna package from PyPI instead, as encodings.idna only supports the outdated RFC 3490 (IDNA 2003).
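For completeness, a sketch of what the built-in codec does, going through the "idna" codec name rather than calling ToASCII() and ToUnicode() directly. The domain is just an example.

```python
from encodings.idna import nameprep

domain = "bücher.example"

# Each non-ASCII label becomes an xn-- (Punycode) label.
ascii_form = domain.encode("idna")

# Decoding turns it back into the Unicode form.
unicode_form = ascii_form.decode("idna")

# nameprep() works on a single label; for plain ASCII it mostly just lowercases.
prepared = nameprep("EXAMPLE")

print(ascii_form, unicode_form, prepared)
```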