December 8: Python Data Compression and Archiving Modules
As part of the Python Standard Library traversal, after yesterday's short post about data persistence, today we're going to look at Python data compression modules. There are not too many of them, but definitely more than I knew: zlib, gzip, bz2, lzma, zipfile, tarfile.
Highlights
- Now I know that Python ships six compression and archiving modules out of the box.
- The zipfile module has a command-line interface for compression, extraction and listing of archives.
- Same for tarfile.
- When opening a tarfile, you can pass the intended compression algorithm with the file mode, like w:bz2.
zlib
zlib is a data compression library, and the Python documentation encourages you to read the zlib manual for the details instead of the Python docs. Besides compression (both one-shot and stream-based) and decompression, this module exposes the adler32 and crc32 checksum functions. You can use Compress and Decompress objects (created with compressobj() and decompressobj()) to inspect or change the process further.
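To make the difference between the one-shot and the stream-based interface concrete, here is a minimal sketch. The sample data and the 4096-byte chunk size are arbitrary choices of mine, not recommendations.

```python
import zlib

data = b"spam and eggs " * 1000

# One-shot compression and decompression (9 is the highest compression level).
packed = zlib.compress(data, 9)
restored = zlib.decompress(packed)

# The checksum helpers are plain functions that return integers.
crc = zlib.crc32(data)
adler = zlib.adler32(data)

# Stream-based compression: feed chunks to a compression object,
# then call flush() to collect whatever is still buffered.
compressor = zlib.compressobj(level=6)
chunks = [compressor.compress(data[i:i + 4096]) for i in range(0, len(data), 4096)]
chunks.append(compressor.flush())
streamed = b"".join(chunks)

# The matching decompression object works the same way in reverse.
decompressor = zlib.decompressobj()
streamed_restored = decompressor.decompress(streamed) + decompressor.flush()
```

The stream-based variant is the one to reach for when the input does not fit into memory or arrives piece by piece.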
gzip
The gzip module provides support for gzip and gunzip functionality in Python. You'll mostly use the convenience functions compress and decompress, but you can also use the GzipFile class, which exposes the low-level peek() method.
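A small sketch of both routes, using an in-memory buffer instead of a real file so that it is self-contained. The sample text is made up.

```python
import gzip
import io

data = b"Nobody expects the Spanish Inquisition! " * 100

# Convenience functions for bytes already in memory.
packed = gzip.compress(data)
restored = gzip.decompress(packed)

# GzipFile wraps any file-like object; here that is an in-memory buffer.
with gzip.GzipFile(fileobj=io.BytesIO(packed), mode="rb") as gz:
    # peek() looks ahead without advancing the file position.
    # It may return more or fewer bytes than requested.
    head = gz.peek(6)
    body = gz.read()
```

Because peek() does not move the read position, the read() call afterwards still returns the complete payload.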
bz2
The bz2 module provides compress and decompress functions for bzip2-compressed data. For incremental compression and decompression, use the BZ2Compressor and BZ2Decompressor classes instead. This module is threadsafe.
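Here is a rough sketch of the incremental classes next to the one-shot functions. The three input parts and the 64-byte slice size are only there to simulate data arriving in pieces.

```python
import bz2

parts = [b"spam " * 500, b"eggs " * 500, b"ham " * 500]
joined = b"".join(parts)

# One-shot helpers.
packed = bz2.compress(joined)
restored_oneshot = bz2.decompress(packed)

# Incremental compression: feed parts as they arrive, then flush().
compressor = bz2.BZ2Compressor()
out = [compressor.compress(part) for part in parts]
out.append(compressor.flush())
incremental = b"".join(out)

# Incremental decompression: feed small slices of the compressed stream.
decompressor = bz2.BZ2Decompressor()
restored = b""
for i in range(0, len(incremental), 64):
    restored += decompressor.decompress(incremental[i:i + 64])

# eof becomes True once the end-of-stream marker has been reached.
finished = decompressor.eof
```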
lzma
The lzma module supports LZMA compression and decompression, plus file support for .xz and .lzma files. Its LZMAFile class is not threadsafe, so guard a shared instance with a lock. You interact with LZMACompressor and LZMADecompressor, or with the helper functions compress() and decompress().
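A minimal sketch of the helper functions, the compressor classes and the file support. The file name example.xz is hypothetical and lives in a temporary directory.

```python
import lzma
import os
import tempfile

data = b"And now for something completely different. " * 200

# Helper functions; the default container format is .xz.
packed = lzma.compress(data)
restored = lzma.decompress(packed)

# The legacy .lzma container is selected with FORMAT_ALONE.
# decompress() detects the format automatically by default.
legacy = lzma.compress(data, format=lzma.FORMAT_ALONE)
legacy_restored = lzma.decompress(legacy)

# Incremental interface.
compressor = lzma.LZMACompressor()
stream = compressor.compress(data[:1000]) + compressor.compress(data[1000:])
stream += compressor.flush()
stream_restored = lzma.LZMADecompressor().decompress(stream)

# File support: lzma.open() returns a file object that (de)compresses on the fly.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.xz")
    with lzma.open(path, "wb") as f:
        f.write(data)
    with lzma.open(path, "rb") as f:
        from_file = f.read()
```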
zipfile
The zipfile module supports creating, opening, appending and listing ZIP archives, including the ZIP64 extension that permits files larger than 4 GB (you can turn this off for compatibility with allowZip64=False). You can change the compression method to zlib (ZIP_DEFLATED), bz2 or lzma (if available); by default, files are stored uncompressed. You typically interact with ZipFile instances, which you can use as context managers. It can also read password-protected archives (set the password with setpassword()), but it cannot create encrypted ones. With testzip(), archive checksums and file headers are tested for consistency.
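As a sketch of the create, append and test cycle: the archive name monty.zip and the member names are made up, and I'm assuming zlib is available so that ZIP_DEFLATED works.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "monty.zip")

    # Create a new archive with deflate compression.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("spam.txt", "spam " * 100)
        zf.writestr("eggs.txt", "eggs " * 100)

    # Append another member to the existing archive.
    with zipfile.ZipFile(archive, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("ham.txt", "ham")

    # Read it back: testzip() returns the name of the first bad member,
    # or None if every checksum and header looks fine.
    with zipfile.ZipFile(archive) as zf:
        first_bad = zf.testzip()
        names = zf.namelist()
```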
Use namelist to inspect archive contents by name, or infolist for more detailed information, including modification time, compressed size and decompressed size. You can extract a single path, or extractall, both of which try to make extraction as safe as possible by stripping leading directory indicators (like ../) and changing unsafe path characters.
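A short sketch of inspecting and extracting, again with invented member names. The archive is built in memory first so the example stands on its own.

```python
import io
import os
import tempfile
import zipfile

# Build a small archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("docs/spam.txt", "spam " * 200)
    zf.writestr("eggs.txt", "eggs")

with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()

    # infolist() returns ZipInfo objects with sizes and modification time.
    details = {
        info.filename: (info.file_size, info.compress_size, info.date_time)
        for info in zf.infolist()
    }

    with tempfile.TemporaryDirectory() as tmp:
        # extract() writes one member and returns the path it created.
        extracted_path = zf.extract("eggs.txt", path=tmp)
        with open(extracted_path) as f:
            extracted_text = f.read()

        # extractall() writes every member below the target directory.
        target = os.path.join(tmp, "everything")
        zf.extractall(path=target)
        extracted_entries = sorted(os.listdir(target))
```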
CLI
You can use the zipfile module from the command line, too, with the -c option for compression and the -e option for extraction: python -m zipfile -c monty.zip spam.txt eggs.txt. You can list archive contents with python -m zipfile -l archive.zip.
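If you want to try those commands without touching your own files, this sketch runs them through subprocess in a temporary directory. The output directory name out is my choice; the file names are the ones from the example above.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create two small files to archive.
    for name in ("spam.txt", "eggs.txt"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write(name)

    python = sys.executable  # the interpreter running this script

    # python -m zipfile -c monty.zip spam.txt eggs.txt
    subprocess.run(
        [python, "-m", "zipfile", "-c", "monty.zip", "spam.txt", "eggs.txt"],
        cwd=tmp, check=True,
    )

    # python -m zipfile -l monty.zip
    listing = subprocess.run(
        [python, "-m", "zipfile", "-l", "monty.zip"],
        cwd=tmp, check=True, capture_output=True, text=True,
    ).stdout

    # python -m zipfile -e monty.zip out
    subprocess.run(
        [python, "-m", "zipfile", "-e", "monty.zip", "out"],
        cwd=tmp, check=True,
    )
    extracted = sorted(os.listdir(os.path.join(tmp, "out")))
```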
tarfile
The tarfile module supports creating, opening, appending and listing TAR archives. You can change the compression method to gzip or bz2 or lzma (if available). It supports several extensions and handles not only files and directories, but also links, permissions, devices and file metadata. You typically interact with TarFile objects (available as context managers). When open()-ing an archive, your file mode can indicate the compression algorithm used, like w:bz2. Appending (mode a) only works for uncompressed archives. is_tarfile is useful to run before attempting to open a file that may not be a tar archive.
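Here is a sketch of the file modes in action. All file names are invented; r:* is the read mode that detects the compression for you.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "spam.txt")
    with open(source, "w") as f:
        f.write("spam " * 100)

    # The part after the colon selects the compression algorithm.
    archive = os.path.join(tmp, "monty.tar.bz2")
    with tarfile.open(archive, "w:bz2") as tar:
        tar.add(source, arcname="spam.txt")

    # is_tarfile() answers "can tarfile read this?" without raising.
    archive_is_tar = tarfile.is_tarfile(archive)
    text_is_tar = tarfile.is_tarfile(source)

    # "r:*" reads with transparent compression detection.
    with tarfile.open(archive, "r:*") as tar:
        names = tar.getnames()

    # Appending needs an uncompressed archive.
    plain = os.path.join(tmp, "monty.tar")
    with tarfile.open(plain, "w") as tar:
        tar.add(source, arcname="spam.txt")
    with tarfile.open(plain, "a") as tar:
        tar.add(source, arcname="more-spam.txt")
    with tarfile.open(plain) as tar:
        appended_names = tar.getnames()
```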
You can get a list of contained data with getnames() or get more information with getmember() and getmembers().
extract() and extractall() work the same way they do in zipfile, though the docs read like extraction is not as safe and does not try to prevent file creation outside the archive root. extractfile() returns a file object for reading a member without writing it to disk. Add contents with add() and addfile().
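A sketch of the member-level methods, kept entirely in memory. It uses addfile() to write and extractfile() to read, so nothing is extracted to disk and the safety question does not come up. The member name menu/spam.txt is made up.

```python
import io
import tarfile

payload = b"spam spam spam"

# Write a gzip-compressed archive into an in-memory buffer.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    # addfile() takes a TarInfo describing the member plus a file object
    # that supplies exactly info.size bytes.
    info = tarfile.TarInfo(name="menu/spam.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Read it back from the start of the buffer.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    names = tar.getnames()                   # list of member names
    member = tar.getmember("menu/spam.txt")  # a single TarInfo
    members = tar.getmembers()               # all TarInfo objects
    content = tar.extractfile(member).read()
```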
CLI
You can use the tarfile module from the command line, like zipfile. Use the -c option for compression and the -e option for extraction: python -m tarfile -c monty.tar spam.txt eggs.txt. You can list archive contents with python -m tarfile -l archive.tar. Use -t to test if something is a valid tar archive.
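The same commands can be tried from a script. This sketch runs them through subprocess in a temporary directory and uses the exit status of -t as the answer; the output directory name out is my own pick.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    for name in ("spam.txt", "eggs.txt"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write(name)

    def run_tarfile(*args):
        # Runs "python -m tarfile <args>" inside the temporary directory.
        return subprocess.run(
            [sys.executable, "-m", "tarfile", *args],
            cwd=tmp, capture_output=True, text=True,
        )

    # python -m tarfile -c monty.tar spam.txt eggs.txt
    created = run_tarfile("-c", "monty.tar", "spam.txt", "eggs.txt").returncode

    # python -m tarfile -l monty.tar
    listing = run_tarfile("-l", "monty.tar").stdout

    # python -m tarfile -t <file>: exit status 0 for a valid archive.
    valid_status = run_tarfile("-t", "monty.tar").returncode
    invalid_status = run_tarfile("-t", "spam.txt").returncode

    # python -m tarfile -e monty.tar out
    run_tarfile("-e", "monty.tar", "out")
    extracted = sorted(os.listdir(os.path.join(tmp, "out")))
```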