December 6: Python File and Directory Access

There's no place where "one obvious way to do things" fails as much as it does with file and OS interaction. As part of the Python Standard Library traversal, after yesterday's post about functional programming modules, today we're going to look at the file and directory access modules. Yes, all of them: pathlib, os.path, fileinput, stat, filecmp, tempfile, glob, fnmatch, linecache and shutil.

Highlights

  • The 80% overlap between pathlib and os.path is bordering on hilarious. One obvious way indeed.
  • fileinput is really weird.
  • stat allows you to query and extract results of os.stat()
  • shutil.disk_usage() returns total, used and free bytes for the filesystem containing a given path.
  • shutil.which() looks up executables (platform independent which)
  • shutil.make_archive() supports zip, tar, gztar, bztar and xztar out of the box.

pathlib

Nearly anything you want to do with files or paths, you can do with pathlib. Instantiate a Path("foo/bar"), then enjoy the goodness. You can concatenate paths with Path("a") / "b" / "c", which is really neat. For paths that don't access the filesystem, you can use PurePath instances. Paths are immutable, hashable and orderable.
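A quick sketch of the basics. It uses PurePosixPath so the output is the same on every platform, and the path names are made up for illustration.

```python
from pathlib import Path, PurePosixPath

# The / operator builds a new path; the operands are left untouched.
config = Path("a") / "b" / "c"

# Pure paths never touch the filesystem, so the path need not exist.
pure = PurePosixPath("/etc") / "hosts"

# Paths are hashable and orderable, so sets and sorted() just work.
unique = {PurePosixPath("x"), PurePosixPath("x"), PurePosixPath("y")}
ordered = sorted([PurePosixPath("b"), PurePosixPath("a")])

print(config.parts)  # ('a', 'b', 'c')
print(pure)          # /etc/hosts
print(len(unique))   # 2
print(ordered[0])    # a
```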

Attributes

Paths expose their parts via the attributes parts (all of them), drive, root, anchor, parents (all parent directories), parent, name (the complete file name), suffix (file extension), suffixes (extensions as a list) and stem (name without the last suffix). as_posix() gets you the path with forward slashes, as_uri() as a file:// URI. is_absolute() is useful, is_reserved() is extremely useful on Windows and False everywhere else. relative_to() expresses one path relative to another. with_name() returns a new Path with the same parent, but a new file name. with_suffix() does the same for the file suffix.
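Here is roughly what those attributes give you for one hypothetical path (again as a PurePosixPath, so nothing needs to exist on disk):

```python
from pathlib import PurePosixPath

p = PurePosixPath("/home/user/archive.tar.gz")  # made-up example path

print(p.parts)     # ('/', 'home', 'user', 'archive.tar.gz')
print(p.anchor)    # /
print(p.parent)    # /home/user
print(p.name)      # archive.tar.gz
print(p.suffix)    # .gz
print(p.suffixes)  # ['.tar', '.gz']
print(p.stem)      # archive.tar

# The with_*() methods return new paths; p itself never changes.
print(p.with_name("notes.txt"))  # /home/user/notes.txt
print(p.with_suffix(".bz2"))     # /home/user/archive.tar.bz2
print(p.relative_to("/home"))    # user/archive.tar.gz
print(p.is_absolute())           # True
```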

.match() returns True if the path matches the given pattern. You can use .glob() (or rglob() for an implicit leading **/) to find files matching a pattern starting at the path. With iterdir() you can iterate over all entries (files and subdirectories) in a given directory.
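A small sketch of the difference between glob(), rglob() and iterdir(), run against a throwaway directory. The file and directory names are invented for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg").mkdir()
    (root / "setup.py").touch()
    (root / "pkg" / "core.py").touch()
    (root / "pkg" / "README.md").touch()

    # glob() only looks where the pattern says; rglob() recurses.
    top_level = sorted(p.name for p in root.glob("*.py"))
    everywhere = sorted(p.name for p in root.rglob("*.py"))
    # iterdir() yields direct children, files and directories alike.
    entries = sorted(p.name for p in root.iterdir())
    matches = (root / "pkg" / "core.py").match("*.py")

print(top_level)   # ['setup.py']
print(everywhere)  # ['core.py', 'setup.py']
print(entries)     # ['pkg', 'setup.py']
print(matches)     # True
```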

Methods

cwd() and home() return a new path pointing to your current working directory or your home directory. resolve() turns any path into an absolute path with no symlinks and no relative elements.

exists() checks file or directory existence, and follows symlinks. Path.stat() returns an os.stat_result, which you can either query directly or with the stat module. chmod() changes permissions and modes. You can also use owner() and group(), is_dir(), is_file(), is_mount(), is_symlink(), is_socket(), is_fifo(), is_block_device() and is_char_device() as shortcuts. With samefile() you can compare files, just like with filecmp or os.path. expanduser() replaces ~ and ~user strings in a path.

Create new directories with mkdir() and new files with touch(), and remove files with unlink(). Move files or directories with rename() and replace(). Remove empty directories with rmdir(). Create symlinks with symlink_to() and hard links with hardlink_to() (Python 3.10+; the older link_to() was removed in 3.12), and remove them with unlink(). Open files with .open(), or use read_bytes() and read_text() for immediate read access (and write_bytes() and write_text() for writing).
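A minimal round trip through these methods, inside a temporary directory so nothing is left behind. All names here are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data" / "raw"
    data_dir.mkdir(parents=True)  # parents=True creates "data" as well

    note = data_dir / "note.txt"
    note.write_text("hello\n", encoding="utf-8")
    content = note.read_text(encoding="utf-8")

    target = data_dir / "renamed.txt"
    note.rename(target)
    old_name_exists = note.exists()
    new_name_exists = target.exists()

    target.unlink()   # remove the file
    data_dir.rmdir()  # only works because the directory is now empty
    dir_gone = not data_dir.exists()

print(repr(content))                     # 'hello\n'
print(old_name_exists, new_name_exists)  # False True
print(dir_gone)                          # True
```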

os.path

os.path is in a weird spot: some of its functions are being replaced by the much nicer pathlib module, but nowhere near all of them. For example, basename() has an equivalent on pathlib.Path objects (the name attribute), but commonpath() does not.

These functions, to my knowledge, are available by way of pathlib: abspath (roughly as resolve, which additionally resolves symlinks; absolute() is the closer match), basename (as name), dirname (as parent), exists, expanduser, join (as joinpath or the / operator).

isabs() is replaced by is_absolute() (look at the readability!), isfile by is_file (and same for links and directories), and ismount by is_mount(). splitext() is replaced by Path.suffix and Path.stem. File information functions like getatime, getmtime, getctime and getsize are replaced by Path().stat() attributes.

These functions (again, to my knowledge) are still exclusive to os.path: commonpath returns the longest common sub-path of a list of paths, while commonprefix compares the strings character by character and can therefore return something that isn't a valid path. expandvars expands variables of the form $var and ${var}. normpath collapses redundant separators and up-level references, as in a//b or a/b/../c (lowercasing on Windows is the job of normcase). relpath is kind of equivalent to Path.relative_to, but the signature is different enough to be annoying when converting. samefile is available on Path objects, but sameopenfile (comparing descriptors) is not, and neither is samestat().
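To make the os.path-only functions a bit more concrete, here is a sketch using posixpath (which is what os.path is on Unix) so the results don't depend on your platform. The paths and the DEMO_HOME variable are invented for the example.

```python
import os
import posixpath  # os.path on Unix; used directly for portable output

paths = ["/usr/lib/python3", "/usr/local/lib"]

common_path = posixpath.commonpath(paths)      # whole path components
common_prefix = posixpath.commonprefix(paths)  # character by character
normalized = posixpath.normpath("/usr//lib/../bin/./python")
# relpath happily walks upwards, which Path.relative_to() refuses to do.
relative = posixpath.relpath("/usr/lib/python3", start="/usr/local")

os.environ["DEMO_HOME"] = "/opt/demo"  # hypothetical variable
expanded = posixpath.expandvars("$DEMO_HOME/bin and ${DEMO_HOME}/lib")

print(common_path)    # /usr
print(common_prefix)  # /usr/l
print(normalized)     # /usr/bin/python
print(relative)       # ../lib/python3
print(expanded)       # /opt/demo/bin and /opt/demo/lib
```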

fileinput

In a strange set of capabilities, fileinput allows you to loop over the lines of standard input or a list of files. If you call fileinput.input() without any arguments, it uses sys.argv[1:], falling back to standard input if that is empty. You can use fileinput.filename(), fileinput.lineno() and the like to get information about the file you're currently reading.
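A small sketch of what that looks like when you pass the file list explicitly instead of relying on sys.argv. The two files are created on the fly and their names are made up.

```python
import fileinput
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.txt")
    second = os.path.join(tmp, "second.txt")
    with open(first, "w") as f:
        f.write("a\nb\n")
    with open(second, "w") as f:
        f.write("c\n")

    seen = []
    with fileinput.input(files=[first, second]) as stream:
        for line in stream:
            seen.append((
                os.path.basename(fileinput.filename()),
                fileinput.lineno(),      # cumulative across all files
                fileinput.filelineno(),  # restarts with each file
                line.rstrip("\n"),
            ))

for row in seen:
    print(row)
# ('first.txt', 1, 1, 'a')
# ('first.txt', 2, 2, 'b')
# ('second.txt', 3, 1, 'c')
```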

stat

After calling os.stat() on something, you can use the stat module to interpret the result. It contains just a ton of functions like S_ISDIR() that return a boolean when passed a stat mode. filemode() turns a mode into a human-readable string (e.g. -rwxrwxrwx), and constants such as ST_SIZE are indices into the stat result tuple for extracting other information.
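A short sketch of interpreting stat results, using a temporary file with a made-up name. The exact permission string depends on your platform and umask, so it isn't shown here.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")  # hypothetical file
    with open(path, "wb") as f:
        f.write(b"12345")

    file_info = os.stat(path)
    dir_info = os.stat(tmp)

    dir_is_dir = stat.S_ISDIR(dir_info.st_mode)
    file_is_regular = stat.S_ISREG(file_info.st_mode)
    mode_string = stat.filemode(file_info.st_mode)  # like '-rw-r--r--'
    size = file_info[stat.ST_SIZE]  # same value as file_info.st_size

print(dir_is_dir)       # True
print(file_is_regular)  # True
print(mode_string)
print(size)             # 5
```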

filecmp

filecmp compares files and directories with varying speed and accuracy. For detailed content comparison, use difflib.

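cmp() compares two files and returns True if they seem equal. With shallow=True (the default), files with identical os.stat() signatures are taken to be equal without reading their contents; pass shallow=False to always compare contents. cmpfiles() does the same for a list of file names in two given directories and returns three lists: matches, mismatches and errors. For a comparison object that lets you inspect matching and differing elements, use the dircmp class.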
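Here is a sketch that exercises cmp(), cmpfiles() and dircmp on two small throwaway directories. The directory and file names are hypothetical.

```python
import filecmp
import os
import tempfile

def write(path, text):
    with open(path, "w") as f:
        f.write(text)

with tempfile.TemporaryDirectory() as tmp:
    left = os.path.join(tmp, "left")
    right = os.path.join(tmp, "right")
    os.mkdir(left)
    os.mkdir(right)
    write(os.path.join(left, "same.txt"), "same")
    write(os.path.join(right, "same.txt"), "same")
    write(os.path.join(left, "diff.txt"), "one")
    write(os.path.join(right, "diff.txt"), "two!")
    write(os.path.join(left, "only.txt"), "left only")

    identical = filecmp.cmp(
        os.path.join(left, "same.txt"),
        os.path.join(right, "same.txt"),
        shallow=False,  # force a content comparison
    )
    match, mismatch, errors = filecmp.cmpfiles(
        left, right, ["same.txt", "diff.txt", "only.txt"], shallow=False
    )
    # dircmp computes its attributes lazily, so read them while the
    # directories still exist.
    report = filecmp.dircmp(left, right)
    left_only = report.left_only
    common = sorted(report.common_files)

print(identical)                # True
print(match, mismatch, errors)  # ['same.txt'] ['diff.txt'] ['only.txt']
print(left_only)                # ['only.txt']
print(common)                   # ['diff.txt', 'same.txt']
```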

tempfile

tempfile provides platform-independent temporary files and directories. All of its interface classes can be used as context managers, providing automatic cleanup once you exit the context. If you don't use them as context managers, they will be deleted when you close them or when they get gc'd.

TemporaryFile, the most basic variant, may not be visible in directory listings under Unix, but that's not a platform-independent feature. Use NamedTemporaryFile to get files that are visible consistently across platforms. SpooledTemporaryFile keeps the data spooled in memory until it grows past max_size or you call fileno() or rollover() (which cause the data to be written to disk). TemporaryDirectory also removes all of its contents when it is deleted.

If you don't want the whole context manager thing, you can use mkstemp() and mkdtemp() to create temporary files and directories where you have to handle removal yourself. gettempdir() tells you where tempfile will create its files and directories.
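A sketch of both styles: context managers that clean up after themselves, and mkstemp() where cleanup is on you. File names and suffixes are arbitrary.

```python
import os
import tempfile

# Context manager: the directory and everything in it vanish on exit.
with tempfile.TemporaryDirectory() as tmp_dir:
    scratch = os.path.join(tmp_dir, "scratch.txt")
    with open(scratch, "w") as f:
        f.write("temp")
    existed_inside = os.path.exists(scratch)
dir_cleaned_up = not os.path.exists(tmp_dir)

# NamedTemporaryFile has a real, visible path while it is open.
with tempfile.NamedTemporaryFile(mode="w+", suffix=".txt") as named:
    named.write("hello")
    named.seek(0)
    round_trip = named.read()
    named_path = named.name
named_gone = not os.path.exists(named_path)

# mkstemp() returns an OS-level descriptor and a path; you clean up.
fd, manual_path = tempfile.mkstemp(suffix=".log")
os.close(fd)
still_there = os.path.exists(manual_path)
os.remove(manual_path)

print(existed_inside, dir_cleaned_up)  # True True
print(round_trip, named_gone)          # hello True
print(still_there)                     # True
```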

glob

glob provides Unix-style wildcard expansion. glob.glob() combines os.scandir() and fnmatch.fnmatch() to return a list of files matching your input string. If you set recursive=True, you can also use ** directory wildcards. Use iglob if the result can be huge – it returns an iterator instead of a list. Both these functions raise auditing events.

You can use glob.escape() to deal with paths that contain the special characters *, ? and [.
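A sketch showing plain and recursive globbing, plus why glob.escape() matters for a file name with brackets in it. The directory layout is invented for the example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "src", "pkg"))
    layout = (("a.py",), ("src", "b.py"), ("src", "pkg", "c.py"),
              ("data[1].txt",))
    for parts in layout:
        open(os.path.join(tmp, *parts), "w").close()

    base = glob.escape(tmp)  # in case the temp path has special chars

    flat = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(base, "*.py"))
    )
    deep = sorted(
        os.path.basename(p)
        for p in glob.glob(os.path.join(base, "**", "*.py"), recursive=True)
    )
    # Unescaped, "[1]" is a character class and looks for "data1.txt".
    unescaped = glob.glob(os.path.join(base, "data[1].txt"))
    escaped = glob.glob(os.path.join(base, glob.escape("data[1].txt")))

print(flat)          # ['a.py']
print(deep)          # ['a.py', 'b.py', 'c.py']
print(unescaped)     # []
print(len(escaped))  # 1
```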

fnmatch

fnmatch provides Unix-style wildcard matching, with a fnmatch and a filter function for easy use, and a translate() function that translates the given pattern to a regular expression to be used with re.match().

In most cases, you're going to want to use glob, since it treats / as a special separator character. Or pathlib.Path.glob, of course.
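A quick sketch of the fnmatch functions on plain strings. Note how * happily matches across the / in the last name, which is exactly the difference from glob. The names are made up.

```python
import fnmatch
import re

names = ["report.txt", "notes.md", "src/main.txt"]

# filter() keeps the names that match; "*" also swallows the "/".
text_files = fnmatch.filter(names, "*.txt")

# fnmatchcase() is always case-sensitive, whatever the platform.
case_sensitive = fnmatch.fnmatchcase("README.TXT", "*.txt")

# translate() produces a regular expression for use with re.match().
pattern = re.compile(fnmatch.translate("*.txt"))
regex_hit = pattern.match("report.txt") is not None

print(text_files)      # ['report.txt', 'src/main.txt']
print(case_sensitive)  # False
print(regex_hit)       # True
```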

linecache

linecache is used to retrieve the file contents printed in tracebacks. You call it with getline(filename, lineno).
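A minimal sketch with a throwaway file (the name is arbitrary). Line numbers start at 1, and a line that doesn't exist gives you an empty string rather than an exception.

```python
import linecache
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w") as f:
        f.write("first\nsecond\nthird\n")

    second = linecache.getline(path, 2)
    missing = linecache.getline(path, 99)
    linecache.clearcache()  # drop the cached copy of the file

print(repr(second))   # 'second\n'
print(repr(missing))  # ''
```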

shutil

shutil provides high-level access to files, particularly for copying them. But be warned that even shutil.copy2() cannot transfer all metadata, and the metadata transferred is different depending on your OS. All of these methods raise auditing events.

Directories and files

copyfileobj copies a file-like object to another file-like object, and copyfile copies a file from one path-like to another. copy does the same as copyfile, but also accepts a directory as destination and copies the permission bits. copymode copies only the permission bits, and copystat additionally copies access times and flags. copy2 combines copy behaviour with copystat attribute copying. chown changes user and group ownership of a given path.

copytree copies an entire directory tree. You can ignore files, and you can use ignore_patterns to create an ignore function from a list of ignore patterns. rmtree removes an entire directory tree. move moves either files or directory trees with a copying function of your choice.
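A sketch of copytree() with ignore_patterns(), followed by move() and rmtree(). The project layout is hypothetical and lives in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "project"
    (src / "__pycache__").mkdir(parents=True)
    (src / "main.py").write_text("print('hi')\n")
    (src / "main.pyc").write_text("")
    (src / "__pycache__" / "cached.pyc").write_text("")

    dst = Path(tmp) / "backup"
    shutil.copytree(
        src, dst,
        ignore=shutil.ignore_patterns("*.pyc", "__pycache__"),
    )
    copied = sorted(p.name for p in dst.rglob("*"))

    # move() returns the final destination of the moved tree.
    archived = Path(shutil.move(str(dst), str(Path(tmp) / "archive")))
    archive_has_main = (archived / "main.py").exists()

    shutil.rmtree(src)
    src_gone = not src.exists()

print(copied)            # ['main.py']
print(archive_has_main)  # True
print(src_gone)          # True
```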

disk_usage returns a named tuple of total, used and free bytes for the filesystem containing a given path. which looks up executables, using the current PATH.
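Both are one-liners in practice. In this sketch the numbers depend entirely on your machine, and the command names are just examples (the second one is deliberately nonsense).

```python
import os
import shutil

usage = shutil.disk_usage(os.getcwd())
print(usage.total, usage.used, usage.free)  # bytes, machine-dependent

# which() returns the full path to the executable, or None.
found = shutil.which("python3")  # may well be None on your system
missing = shutil.which("surely-not-a-real-command-xyz")

print(found)
print(missing)  # None
```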

Archiving

make_archive creates an archive. Like all archiving tools, it has an unholy amount of options, and the directory options are particularly confusing. I highly recommend playing around with them a bit. It supports several formats: zip, tar, gztar, bztar and xztar (depending on available modules; use get_archive_formats to check). unpack_archive does the same in reverse.
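Here is a sketch of one combination of the directory options that I find reasonably predictable: give an absolute base_name and a root_dir, and the archive contains the contents of root_dir with relative paths. The directory names are made up, and zip is assumed to be available.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "site" / "docs").mkdir(parents=True)
    (base / "site" / "docs" / "index.txt").write_text("hello")

    # base_name has no extension; make_archive adds ".zip" itself
    # and returns the full name of the archive it created.
    archive = shutil.make_archive(
        str(base / "backup"), "zip", root_dir=str(base / "site")
    )
    archive_name = Path(archive).name

    shutil.unpack_archive(archive, str(base / "restored"))
    restored = (base / "restored" / "docs" / "index.txt").read_text()

    formats = [name for name, _description in shutil.get_archive_formats()]

print(archive_name)      # backup.zip
print(restored)          # hello
print("zip" in formats)  # True
```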

Terminal size

To make sure shutil isn't too intuitive to use, it also houses get_terminal_size(), which returns a named tuple of columns and lines (in that order).
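A tiny sketch. The fallback value is what you get when no terminal is attached, for example when output is piped; the actual numbers depend on your terminal.

```python
import shutil

size = shutil.get_terminal_size(fallback=(80, 24))

# The result unpacks as (columns, lines), i.e. width first.
columns, lines = size
print(f"{size.columns} columns wide, {size.lines} lines tall")
```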