December 8: Python Data Compression and Archiving Modules

As part of the Python Standard Library traversal, after yesterday's short post about data persistence, today we're going to look at Python data compression modules. There are not too many of them, but definitely more than I knew: zlib, gzip, bz2, lzma, zipfile, tarfile.

Highlights

  • Now I know that Python supports six compression and archiving modules out of the box.
  • The zipfile module has a command-line interface for creating, extracting and listing archives.
  • Same for tarfile.
  • When opening a tarfile, you can pass the intended compression algorithm with the file mode, like w:bz2.

zlib

zlib is a data compression library, and the Python documentation points you to the zlib manual for the full details. Besides compression and decompression (both one-shot and stream-based), this module provides the checksum functions adler32() and crc32(). For streams, compressobj() and decompressobj() return Compress and Decompress objects, which let you feed data in chunks and inspect or tune the process further.
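To make the difference between the one-shot and the stream-based API concrete, here is a minimal sketch. The sample data, chunk size and variable names are made up for illustration.

```python
import zlib

data = b"spam and eggs " * 100

# One-shot compression and decompression (second argument is the level, 0-9).
packed = zlib.compress(data, 9)
restored = zlib.decompress(packed)

# Checksums: both functions return an unsigned 32-bit integer.
crc = zlib.crc32(data)
adler = zlib.adler32(data)

# Stream-based compression: feed chunks, then flush whatever is still buffered.
compressor = zlib.compressobj(level=6)
chunks = [compressor.compress(data[i:i + 256]) for i in range(0, len(data), 256)]
chunks.append(compressor.flush())
streamed = b"".join(chunks)

# Stream-based decompression mirrors the same pattern.
decompressor = zlib.decompressobj()
stream_restored = decompressor.decompress(streamed) + decompressor.flush()
```

The stream variant is the one to reach for when the input does not fit in memory or arrives piece by piece.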

gzip

The gzip module provides support for gzip and gunzip functionality in Python. You'll mostly use the convenience functions compress() and decompress(), but you can also use the GzipFile class, which exposes the low-level peek() method for looking at upcoming bytes without advancing the file position.
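A short sketch of both styles, assuming some made-up sample data and an in-memory buffer instead of a real .gz file:

```python
import gzip
import io

data = b"nobody expects the spanish inquisition\n" * 50

# Convenience functions for in-memory data.
packed = gzip.compress(data)
restored = gzip.decompress(packed)

# GzipFile wraps any binary file-like object.
with gzip.GzipFile(fileobj=io.BytesIO(packed), mode="rb") as gz:
    # peek() may return more or fewer bytes than requested, so slice the result.
    preview = gz.peek(6)[:6]
    # The file position has not moved, so readline() starts at the beginning.
    first_line = gz.readline()
```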

bz2

The bz2 module provides compress() and decompress() functions for bzip2-compressed data. For incremental compression and decompression, use the BZ2Compressor and BZ2Decompressor classes instead. At the time of writing, the docs describe the classes in this module as safe to use from multiple threads.
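The incremental classes follow the same feed-and-flush pattern as zlib. A minimal sketch with invented sample data:

```python
import bz2

data = b"bring out your dead " * 200

# One-shot helpers.
packed = bz2.compress(data)
restored = bz2.decompress(packed)

# Incremental compression: compress() may return empty bytes until enough
# input has accumulated, so always collect the flush() output too.
compressor = bz2.BZ2Compressor()
parts = [compressor.compress(data[i:i + 1000]) for i in range(0, len(data), 1000)]
parts.append(compressor.flush())

# Incremental decompression; eof turns True once the stream end is reached.
decompressor = bz2.BZ2Decompressor()
stream_restored = decompressor.decompress(b"".join(parts))
reached_end = decompressor.eof
```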

lzma

The lzma module supports LZMA compression and decompression, plus file support for .xz and legacy .lzma files. The docs warn that LZMAFile is not thread-safe, so protect a shared instance with a lock. You interact with LZMACompressor and LZMADecompressor, or with the helper functions compress() and decompress().
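Here is a small sketch of the two container formats and the incremental classes. The data and the split point are arbitrary choices for the example.

```python
import lzma

data = b"and now for something completely different " * 50

# compress() writes the .xz container by default;
# FORMAT_ALONE selects the legacy .lzma container.
xz_blob = lzma.compress(data)
legacy_blob = lzma.compress(data, format=lzma.FORMAT_ALONE)

# decompress() auto-detects the container by default.
restored_xz = lzma.decompress(xz_blob)
restored_legacy = lzma.decompress(legacy_blob)

# Incremental API: feed pieces, then flush.
compressor = lzma.LZMACompressor()
streamed = (
    compressor.compress(data[:500])
    + compressor.compress(data[500:])
    + compressor.flush()
)
decompressor = lzma.LZMADecompressor()
restored_stream = decompressor.decompress(streamed)
```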

zipfile

The zipfile module supports creating, reading, appending to and listing ZIP archives, including the ZIP64 extensions that permit archives larger than 4 GiB (you can turn this off with allowZip64=False for compatibility). Besides storing files uncompressed, you can pick deflate (zlib), bzip2 or LZMA compression, provided the corresponding module is available. You typically interact with ZipFile instances, which you can use as context managers. The module can decrypt password-protected archives (pass pwd or call setpassword()), but it cannot create encrypted ones. testzip() reads every file in the archive and checks CRCs and file headers, returning the name of the first bad file, or None if everything is fine.

Use namelist() to inspect archive contents by name, or infolist() for more detailed information, including modification time, compressed size and uncompressed size. You can extract() a single member or extractall() of them, both of which try to make extraction as safe as possible by stripping leading directory indicators (like ../) and replacing unsafe path characters.
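A minimal round trip through these methods might look like the sketch below. The archive name, member names and temporary directory are placeholders for the example, and ZIP_DEFLATED assumes zlib is available in your Python build.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "monty.zip"
    out_dir = Path(tmp) / "out"

    # Create an archive with deflate compression.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("spam.txt", "spam " * 200)
        zf.writestr("eggs.txt", "eggs " * 200)

    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        # Map each member to (compressed size, uncompressed size).
        sizes = {
            info.filename: (info.compress_size, info.file_size)
            for info in zf.infolist()
        }
        # None means every CRC and header checked out.
        first_bad = zf.testzip()
        zf.extract("spam.txt", path=out_dir)

    extracted = (out_dir / "spam.txt").read_text()
```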

CLI

You can use the zipfile module from the command line, too. Use the -c option to create an archive: python -m zipfile -c monty.zip spam.txt eggs.txt. Use the -e option to extract one into a target directory: python -m zipfile -e monty.zip target-dir/. You can list archive contents with python -m zipfile -l archive.zip.
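If you want to try these commands without touching your own files, the following sketch drives the CLI from Python in a temporary directory. The run_zipfile helper and the file names are invented for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "spam.txt").write_text("spam")
    (tmp_path / "eggs.txt").write_text("eggs")

    def run_zipfile(*args):
        """Run `python -m zipfile ...` inside the temporary directory."""
        return subprocess.run(
            [sys.executable, "-m", "zipfile", *args],
            cwd=tmp, capture_output=True, text=True, check=True,
        )

    run_zipfile("-c", "monty.zip", "spam.txt", "eggs.txt")  # create
    listing = run_zipfile("-l", "monty.zip").stdout         # list
    run_zipfile("-e", "monty.zip", "out")                   # extract into out/

    extracted = sorted(p.name for p in (tmp_path / "out").iterdir())
```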

tarfile

The tarfile module supports creating, opening, appending to and listing TAR archives. You can compress archives with gzip, bz2 or lzma (if the respective module is available). It supports several TAR format extensions and handles not only files and directories, but also links, permissions, devices and file metadata. You typically interact with TarFile objects (available as context managers). When open()-ing an archive, your file mode can indicate the compression algorithm used, like w:bz2. Appending (mode a) only works for uncompressed archives. is_tarfile() is useful to run before attempting to open a file that may not be a TAR archive.

You can get a list of contained data with getnames() or get more information with getmember() and getmembers(). extract() and extractall() work much like their zipfile counterparts, while extractfile() returns a file object for a member without writing it to disk. The docs read like extraction is not as safe as in zipfile, though: it does not necessarily prevent file creation outside the target directory, so be careful with untrusted archives. Add contents with add() and addfile().
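The following sketch puts these calls together for a bz2-compressed archive. File names and contents are placeholders. One thing that goes beyond this post: Python 3.12 added extraction filters, and the sketch passes filter="data" when tarfile.data_filter exists, on the assumption that you want the stricter behaviour for archives you don't fully trust.

```python
import io
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source = tmp_path / "spam.txt"
    source.write_text("spam " * 200)
    archive = tmp_path / "monty.tar.bz2"

    # The part after the colon selects the compression algorithm.
    with tarfile.open(archive, "w:bz2") as tar:
        # add() reads a file from disk; arcname sets its name in the archive.
        tar.add(str(source), arcname="spam.txt")
        # addfile() takes a TarInfo plus a file object with the payload.
        payload = b"eggs"
        info = tarfile.TarInfo(name="eggs.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    is_tar = tarfile.is_tarfile(archive)

    # "r:*" lets tarfile detect the compression on its own.
    with tarfile.open(archive, "r:*") as tar:
        names = tar.getnames()
        member = tar.getmember("eggs.txt")
        eggs = tar.extractfile(member).read()
        # Use the stricter 'data' filter where this Python version has it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(tmp_path / "out", **kwargs)

    extracted = (tmp_path / "out" / "spam.txt").read_text()
```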

CLI

You can use the tarfile module from the command line, like zipfile. Use the -c option to create an archive and the -e option to extract one: python -m tarfile -c monty.tar spam.txt eggs.txt. You can list archive contents with python -m tarfile -l archive.tar. Use -t to test if something is a valid tar archive.
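A quick way to see -t in action is to run it against a real archive and against a plain text file. This sketch does so from Python and only looks at exit codes, because the exact messages may differ between versions. The run_tarfile helper and the file names are made up.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "spam.txt").write_text("spam")
    (tmp_path / "eggs.txt").write_text("eggs")

    def run_tarfile(*args):
        """Run `python -m tarfile ...` inside the temporary directory."""
        return subprocess.run(
            [sys.executable, "-m", "tarfile", *args],
            cwd=tmp, capture_output=True, text=True,
        )

    run_tarfile("-c", "monty.tar", "spam.txt", "eggs.txt")  # create
    listing = run_tarfile("-l", "monty.tar").stdout         # list

    # -t exits with 0 for a valid archive and non-zero otherwise.
    valid_code = run_tarfile("-t", "monty.tar").returncode
    invalid_code = run_tarfile("-t", "spam.txt").returncode
```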