December 8: Python Data Compression and Archiving Modules
As part of my Python Standard Library traversal, and after yesterday's short post about data persistence, today
we're going to look at Python's data compression and archiving modules. There are not too many of them, but
definitely more than I knew: zlib, gzip, bz2, lzma, zipfile, tarfile.
Highlights
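- Now I know that Python ships with six compression and archiving modules out of the box: four for compression formats (zlib, gzip, bz2, lzma) and two for archive formats (zipfile, tarfile).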
- The zipfile module has a command-line interface for compression, extraction and listing of archives.
- Same for tarfile.
- When opening a tarfile, you can pass the intended compression algorithm with the file mode, like w:bz2.
zlib
zlib is a data compression library, and the Python documentation encourages you to read the zlib manual
for the full details. This module provides an interface for two checksum functions (adler32 and crc32) in addition to
compression (both one-shot and stream-based) and decompression. You can use Compress and Decompress objects, created
with compressobj() and decompressobj(), to inspect or change the process further.
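To make that concrete, here is a minimal sketch of the one-shot functions, the two checksums and the stream-based objects. The sample data and the chunk size of 256 bytes are arbitrary choices of mine, not anything the module prescribes.

```python
import zlib

data = b"spam and eggs " * 100

# One-shot compression and decompression.
packed = zlib.compress(data, level=9)
restored = zlib.decompress(packed)

# The two checksum functions the module exposes.
crc = zlib.crc32(data)
adler = zlib.adler32(data)

# Stream-based compression: feed chunks, then flush what is still buffered.
compressor = zlib.compressobj(level=6)
chunks = [data[i:i + 256] for i in range(0, len(data), 256)]
streamed = b"".join(compressor.compress(chunk) for chunk in chunks)
streamed += compressor.flush()

# Stream-based decompression works the same way in reverse.
decompressor = zlib.decompressobj()
streamed_restored = decompressor.decompress(streamed) + decompressor.flush()
```

The stream objects are the ones to reach for when the data does not fit in memory or arrives piece by piece.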
gzip
The gzip module provides support for gzip and gunzip functionality in Python. You'll mostly use the convenience
functions compress and decompress, but you can also use the GzipFile class, which exposes the low-level peek()
method.
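A small sketch of both approaches, assuming an in-memory buffer instead of a real .gz file (the io.BytesIO buffer and the sample text are my own choices). Note that peek() may return more or fewer bytes than requested, so the example only looks at a slice of what comes back.

```python
import gzip
import io

data = b"Nobody expects the Spanish Inquisition! " * 50

# Convenience functions for one-shot use.
packed = gzip.compress(data)
restored = gzip.decompress(packed)

# GzipFile wraps a file object; here that is an in-memory buffer.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(data)

buffer.seek(0)
with gzip.GzipFile(fileobj=buffer, mode="rb") as gz:
    preview = gz.peek(6)[:6]  # look ahead without advancing the position
    first = gz.read(6)        # this read starts at the same position
    rest = gz.read()
```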
bz2
The bz2 module provides compress and decompress functions for bzip2-compressed data. For incremental compression and
decompression, use the BZ2Compressor and BZ2Decompressor classes instead. This module is threadsafe.
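Here is a rough sketch of the difference between the one-shot functions and the incremental classes. The 1024-byte chunk size is an arbitrary assumption for illustration.

```python
import bz2

data = b"It's just a flesh wound. " * 200

# One-shot helpers.
packed = bz2.compress(data)
restored = bz2.decompress(packed)

# Incremental compression: feed chunks, then flush to finish the stream.
compressor = bz2.BZ2Compressor()
pieces = []
for start in range(0, len(data), 1024):
    pieces.append(compressor.compress(data[start:start + 1024]))
pieces.append(compressor.flush())
incremental = b"".join(pieces)

# Incremental decompression; eof becomes True once the stream has ended.
decompressor = bz2.BZ2Decompressor()
incremental_restored = decompressor.decompress(incremental)
stream_finished = decompressor.eof
```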
lzma
The lzma module supports LZMA compression and decompression, plus file support for .xz and .lzma files. Unlike
bz2, its file interface (LZMAFile and lzma.open()) is not threadsafe. You interact with LZMACompressor and
LZMADecompressor, or with the helper functions compress() and decompress().
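A short sketch of the helpers and the incremental classes. It assumes the defaults: compress() writes the .xz container unless told otherwise, and decompress() detects the container format automatically. Splitting the data in half is just my way of simulating two chunks.

```python
import lzma

data = b"We are the knights who say Ni! " * 200

# Default container is .xz; FORMAT_ALONE selects the legacy .lzma container.
packed_xz = lzma.compress(data)
packed_alone = lzma.compress(data, format=lzma.FORMAT_ALONE)

# decompress() auto-detects which of the two containers it was given.
restored = lzma.decompress(packed_xz)
restored_alone = lzma.decompress(packed_alone)

# Incremental use with the compressor and decompressor classes.
half = len(data) // 2
compressor = lzma.LZMACompressor()
incremental = (
    compressor.compress(data[:half])
    + compressor.compress(data[half:])
    + compressor.flush()
)
decompressor = lzma.LZMADecompressor()
incremental_restored = decompressor.decompress(incremental)
```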
zipfile
The zipfile module supports creating, opening, appending and listing ZIP archives, including the ZIP64 extension that
permits files larger than 4 GiB (you can turn this off for compatibility). You can change the compression method to
deflate (zlib), bzip2 or LZMA, provided the corresponding module is available. You typically interact with ZipFile
instances, which you can use as context managers. It can also read password-protected archives: setpassword() sets the
default password for extraction, though the module cannot create encrypted archives. With testzip(), the CRC checksums
and file headers of all members are tested for consistency, and the name of the first bad file is returned (or None).
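As a minimal sketch of creating, appending to and testing an archive: the file names are borrowed from the CLI example further down, and the temporary directory is only there so that nothing is left behind.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "monty.zip")

    # Create the archive, compressing members with deflate (needs zlib).
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("spam.txt", "spam " * 200)
        zf.writestr("eggs.txt", "eggs " * 200)

    # Re-open in append mode to add another member.
    with zipfile.ZipFile(archive_path, "a") as zf:
        zf.writestr("ham.txt", "ham")

    # testzip() returns the first corrupt member name, or None if all is well.
    with zipfile.ZipFile(archive_path) as zf:
        first_bad = zf.testzip()
        names = zf.namelist()
```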
Use namelist() to inspect archive contents by name, or infolist() for more detailed information, including
modification time, compressed size and decompressed size. You can extract() a single path, or extractall(), both of
which try to make extraction as safe as possible by stripping leading directory indicators (like ../) and changing
unsafe path characters.
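The following sketch reads the member details and then extracts an archive that deliberately contains a ../ member name, to show the sanitising described above. The member names and the "out" directory are made up for the example.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "monty.zip")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("spam.txt", "spam " * 200)
        zf.writestr("../eggs.txt", "eggs")  # deliberately unsafe member name

    target = os.path.join(tmp, "out")
    with zipfile.ZipFile(archive_path) as zf:
        # Each ZipInfo carries name, modification time and both sizes.
        details = [
            (info.filename, info.date_time, info.compress_size, info.file_size)
            for info in zf.infolist()
        ]
        zf.extractall(target)

    # The ../ component is dropped, so both files land inside the target.
    extracted = sorted(os.listdir(target))
    escaped = os.path.exists(os.path.join(tmp, "eggs.txt"))
```

I would still not extract archives from untrusted sources blindly, but it is nice to see the path components being dropped.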
CLI
You can use the zipfile module from the command line, too, with the -c option to create an archive and the -e option
to extract one: python -m zipfile -c monty.zip spam.txt eggs.txt. You can list archive contents with
python -m zipfile -l archive.zip.
tarfile
The tarfile module supports creating, opening, appending and listing TAR archives. You can change the compression
method to gzip or bz2 or lzma (if available). It supports several extensions and handles not only files and
directories, but also links, permissions, devices and file metadata. You typically interact with TarFile objects
(available as context managers). When open()-ing an archive, your file mode can indicate the compression algorithm
used, like w:bz2. Appending (mode a) only works for uncompressed archives. is_tarfile() is useful to run before
attempting to open a file that may not be a tar archive.
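A minimal sketch of the file modes and is_tarfile(). The file names are placeholders, and r:* is the mode that lets tarfile detect the compression on its own when reading.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "spam.txt")
    with open(source, "w") as fh:
        fh.write("spam " * 200)

    # The part after the colon selects the compression: here bzip2.
    archive_path = os.path.join(tmp, "monty.tar.bz2")
    with tarfile.open(archive_path, "w:bz2") as tar:
        tar.add(source, arcname="spam.txt")

    # is_tarfile() tells archives apart from ordinary files.
    looks_like_tar = tarfile.is_tarfile(archive_path)
    plain_text_is_tar = tarfile.is_tarfile(source)

    # "r:*" reads with transparent compression detection.
    with tarfile.open(archive_path, "r:*") as tar:
        names = tar.getnames()
```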
You can get a list of contained data with getnames() or get more information with getmember() and getmembers().
extract() and extractall() work much like they do in zipfile, and extractfile() returns a file-like object for a
member without writing it to disk. The docs read like extraction is not as safe, though, and does not try to prevent
file creation outside the target directory. Add contents with add() and addfile().
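Here is a sketch of the inspection and extraction methods, using addfile() with an in-memory payload. One assumption to flag: newer Python versions (3.12, plus some security backports) offer extraction filters, so the example passes filter="data" only when the running interpreter has them, and falls back to a plain extractall() otherwise.

```python
import io
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "monty.tar.gz")
    payload = b"And now for something completely different."

    # addfile() takes a TarInfo plus a file object holding the content.
    with tarfile.open(archive_path, "w:gz") as tar:
        info = tarfile.TarInfo(name="sketch.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    target = os.path.join(tmp, "out")
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        member = tar.getmember("sketch.txt")
        size = member.size
        # Read a member without writing anything to disk.
        content = tar.extractfile(member).read()
        # Use the stricter "data" filter where this Python version has it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(target, filter="data")
        else:
            tar.extractall(target)

    extracted = os.listdir(target)
```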
CLI
You can use the tarfile module from the command line, like zipfile. Use the -c option to create an archive and the -e
option to extract one: python -m tarfile -c monty.tar spam.txt eggs.txt. You can list archive contents with
python -m tarfile -l archive.tar. Use -t to test if something is a valid tar archive.