December 8: Python Data Compression and Archiving Modules
As part of the Python Standard Library
traversal, after yesterday's short post about data
persistence, today we're going to look at Python data compression
modules. There are not too many of them, but definitely more than I knew:
- Now I know that Python supports six different main compression protocols out of the box.
- The zipfile module has a CLI interface for compression, extraction and listing of archives.
- Same for tarfile.
- When opening a tarfile, you can pass the intended compression algorithm with the file mode, like w:bz2.
zlib is a data compression library, and the Python documentation encourages you to read the zlib manual
instead of the Python one for authoritative details. This module provides checksum functions (adler32 and crc32)
in addition to compression (both direct and stream-based) and decompression. You can use
Compress and Decompress objects (created with compressobj() and decompressobj()) to inspect or change
the process further.
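To make the direct and stream-based variants a bit more concrete, here is a minimal sketch. The sample data and the chunk size are made up for illustration; the function names are the ones the zlib module provides.

```python
import zlib

data = b"spam and eggs " * 1000  # made-up sample data

# Direct (one-shot) compression and decompression.
packed = zlib.compress(data, 9)
unpacked = zlib.decompress(packed)

# Checksums over the raw bytes.
crc = zlib.crc32(data)
adler = zlib.adler32(data)

# Stream-based compression: feed chunks, then flush what is still buffered.
compressor = zlib.compressobj(level=6)
chunks = [data[i:i + 4096] for i in range(0, len(data), 4096)]
streamed = b"".join(compressor.compress(chunk) for chunk in chunks)
streamed += compressor.flush()

# Stream-based decompression with a Decompress object.
decompressor = zlib.decompressobj()
restored = decompressor.decompress(streamed) + decompressor.flush()
```

The Decompress object also carries state such as eof and unused_data, which is where the "inspect" part comes in.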
The gzip module provides support for gzip and gunzip functionality in Python. You'll mostly use the convenience
functions open, compress and decompress, but you can also use the
GzipFile class, which exposes the lower-level file interface.
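A short sketch of both routes, assuming a throwaway temporary directory and an invented file name. The convenience functions cover most needs, and GzipFile is shown wrapping a file object that is already open.

```python
import gzip
import os
import tempfile

text = "Nobody expects the Spanish Inquisition!\n" * 100
raw = text.encode("utf-8")

# In-memory round trip with the convenience functions.
packed = gzip.compress(raw)
unpacked = gzip.decompress(packed)

# File round trip: gzip.open() behaves much like the built-in open().
path = os.path.join(tempfile.mkdtemp(), "quotes.txt.gz")  # hypothetical file name
with gzip.open(path, "wt", encoding="utf-8") as handle:
    handle.write(text)
with gzip.open(path, "rt", encoding="utf-8") as handle:
    read_back = handle.read()

# GzipFile can wrap a binary file object you opened yourself.
with open(path, "rb") as raw_file:
    with gzip.GzipFile(fileobj=raw_file, mode="rb") as gz:
        first_line = gz.readline()
```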
bz2 provides a compress and a
decompress function for bzip2 compressed data. For incremental compression and
decompression, use the BZ2Compressor and
BZ2Decompressor classes instead. This module is threadsafe.
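The incremental classes are easiest to understand next to the one-shot functions. This is only a sketch: the chunk contents and the 64-byte read size are arbitrary choices for the example.

```python
import bz2

chunks = [b"spam " * 500, b"eggs " * 500, b"ham " * 500]  # made-up sample data

# Incremental compression: compress() may return empty bytes while it buffers,
# flush() finishes the stream and returns whatever is left.
compressor = bz2.BZ2Compressor()
packed = b"".join(compressor.compress(chunk) for chunk in chunks)
packed += compressor.flush()

# Incremental decompression, feeding the compressed data in small pieces.
decompressor = bz2.BZ2Decompressor()
restored = b""
for start in range(0, len(packed), 64):
    restored += decompressor.decompress(packed[start:start + 64])

# The one-shot helpers for comparison.
one_shot = bz2.decompress(bz2.compress(b"".join(chunks)))
```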
The lzma module supports LZMA compression and decompression, plus file support for
.xz and .lzma files. This
module is not threadsafe. You interact with LZMACompressor and
LZMADecompressor, or with the helper functions open, compress and decompress.
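Here is a small sketch of the helper functions, the incremental classes and the file support side by side. The file name and sample data are invented for the example.

```python
import lzma
import os
import tempfile

data = b"And now for something completely different. " * 200

# One-shot helpers; the default container format is .xz.
packed = lzma.compress(data)
unpacked = lzma.decompress(packed)

# The legacy .lzma container is selected with FORMAT_ALONE.
legacy = lzma.compress(data, format=lzma.FORMAT_ALONE)
legacy_unpacked = lzma.decompress(legacy)  # the format is auto-detected

# Incremental API.
compressor = lzma.LZMACompressor()
streamed = compressor.compress(data[:1000]) + compressor.compress(data[1000:])
streamed += compressor.flush()
restored = lzma.LZMADecompressor().decompress(streamed)

# File support via lzma.open().
path = os.path.join(tempfile.mkdtemp(), "sketch.xz")  # hypothetical file name
with lzma.open(path, "wb") as handle:
    handle.write(data)
with lzma.open(path, "rb") as handle:
    from_file = handle.read()
```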
The zipfile module supports creating, opening, appending and listing ZIP archives, including the ZIP64 extension that
permits files larger than 4GB (you can turn this off for compatibility). You can change the compression method to
deflate (via zlib), bzip2 (via bz2) or lzma (if available). You typically interact with
ZipFile instances, which you can use as context
managers. It also supports passwords, but only for reading: setpassword() sets the default password for
extracting encrypted files, and the module cannot create encrypted archives. With
testzip(), archive checksums and file headers are
tested for consistency.
Use namelist to inspect archive contents by name, or
infolist for more detailed information, including modification
time, compressed size and decompressed size. You can
extract a single path, or
extractall, both of which try to make
extraction as safe as possible by stripping leading directory indicators (like
../) and replacing unsafe path characters.
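A rough walk through the ZipFile methods mentioned above, as a sketch rather than a recipe. The archive and member names are invented, and everything happens in a temporary directory.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "monty.zip")  # hypothetical archive name

# Create an archive with deflate compression (requires zlib to be available).
with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("spam.txt", "spam " * 200)
    archive.writestr("eggs.txt", "eggs " * 200)

# Append another member to the existing archive.
with zipfile.ZipFile(archive_path, "a") as archive:
    archive.writestr("ham.txt", "ham")

with zipfile.ZipFile(archive_path) as archive:
    names = archive.namelist()
    # testzip() returns the name of the first bad member, or None if all is well.
    first_bad = archive.testzip()
    # infolist() yields ZipInfo objects with sizes and timestamps.
    sizes = {info.filename: (info.compress_size, info.file_size)
             for info in archive.infolist()}
    target = os.path.join(workdir, "out")
    archive.extract("spam.txt", path=target)  # a single member
    archive.extractall(path=target)           # everything
```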
You can use the zipfile module from the command line, too, by using the
-c option for compression and the
-e option for extraction, like
python -m zipfile -c monty.zip spam.txt eggs.txt. You can list archive contents with
python -m zipfile -l archive.zip.
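If you want to try the command line interface without leaving Python, you can drive it through subprocess. This is just a sketch: run_zipfile is a hypothetical helper written for this example, and the files live in a temporary directory.

```python
import os
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()
for name in ("spam.txt", "eggs.txt"):
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write(name)

def run_zipfile(*args):
    """Hypothetical helper: run `python -m zipfile` inside workdir."""
    return subprocess.run(
        [sys.executable, "-m", "zipfile", *args],
        cwd=workdir, capture_output=True, text=True, check=True,
    )

run_zipfile("-c", "monty.zip", "spam.txt", "eggs.txt")  # create
listing = run_zipfile("-l", "monty.zip").stdout          # list
run_zipfile("-e", "monty.zip", "out")                    # extract into out/
```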
The tarfile module supports creating, opening, appending and listing TAR archives. You can change the compression
method to gzip, bz2 or lzma (if available). It supports several extensions and handles not only files and
directories, but also links, permissions, devices and file metadata. You typically interact with
TarFile instances (available as context managers). When
open()-ing an archive, your file mode can indicate the compression algorithm, like
w:bz2. Compressed archives cannot be opened for appending data, only uncompressed ones can.
is_tarfile is useful to run before
attempting to open such a file.
You can get a list of contained data with
getnames() or get more information with getmembers(). extract() and
extractall() work the same way they do in
zipfile, though the docs read as if extraction is not
as safe and does not try to prevent file creation outside the archive root. Add contents with add().
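A sketch of the usual tarfile round trip, including a deliberately simple check before extraction. The path filter is an assumption of this example, not something the module applies for you, and it does not cover links pointing outside the target directory. File and archive names are made up.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "spam.txt")
with open(source, "w") as handle:
    handle.write("spam " * 100)

archive_path = os.path.join(workdir, "monty.tar.bz2")  # hypothetical archive name

# "w:bz2" creates a bzip2-compressed archive; add() stores the file as arcname.
with tarfile.open(archive_path, "w:bz2") as archive:
    archive.add(source, arcname="spam.txt")

looks_like_tar = tarfile.is_tarfile(archive_path)

# "r:*" detects the compression transparently when reading.
with tarfile.open(archive_path, "r:*") as archive:
    names = archive.getnames()
    members = archive.getmembers()
    details = [(m.name, m.size, m.isfile()) for m in members]
    # Example-only safety check: skip absolute paths and ".." components.
    safe = [m for m in members
            if not os.path.isabs(m.name) and ".." not in m.name.split("/")]
    archive.extractall(path=os.path.join(workdir, "out"), members=safe)
```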
You can use the tarfile module from the command line, like zipfile. Use the
-c option for compression and the -e option for extraction, like
python -m tarfile -c monty.tar spam.txt eggs.txt. You can list archive contents with
python -m tarfile -l archive.tar. Use
-t to test if something is a valid tar archive.