December 6: Python File and Directory Access
There's no place where "one obvious way to do things" fails as much as it does with file and OS interaction. As part of
the Python Standard Library traversal, after
yesterday's post about functional programming modules, today
we're going to look at the file and directory access modules. Yes, all of them: pathlib, os.path, fileinput, stat,
filecmp, tempfile, glob, fnmatch, linecache and shutil.
Highlights
- The 80% overlap between pathlib and os.path is bordering on hilarious. One obvious way indeed.
- fileinput is really weird.
- stat allows you to query and extract results of os.stat().
- shutil.disk_usage() returns total, used and free bytes for a given directory.
- shutil.which() looks up executables (a platform-independent which).
- shutil.make_archive() supports zip, tar, gztar, bztar and xztar out of the box.
pathlib
Nearly anything you want to do with files or paths, you want to do with pathlib
. Instantiate a Path("foo/bar")
,
then enjoy the goodness. You can concatenate paths with Path("a") / "b" / "c"
, which is really neat. For paths that
don't access the filesystem, you can use PurePath
instances. Paths are immutable, hashable, orderable.
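To make the joining and hashing behaviour concrete, here is a minimal sketch. It uses PurePosixPath so that the output is the same on every platform; the path names are made up for illustration.

```python
from pathlib import Path, PurePosixPath

# PurePosixPath never touches the filesystem and always uses forward slashes,
# which keeps this example deterministic across operating systems.
base = PurePosixPath("a")
joined = base / "b" / "c"            # the / operator returns a new path
same = PurePosixPath("a", "b", "c")  # equivalent constructor form

# Paths are immutable and hashable, so equal paths collapse in a set.
unique = {joined, same}

# Paths are orderable, too.
ordered = sorted([PurePosixPath("b"), PurePosixPath("a/z"), PurePosixPath("a")])

print(joined)       # a/b/c
print(len(unique))  # 1

# A concrete Path works the same way, but uses the separator of your OS.
concrete = Path("foo") / "bar"
```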
Attributes
Paths expose their parts via the attributes parts
(all of them), drive
, root
, anchor
, parents
(all parent
directories), parent
, name
(the complete file name), suffix
(file extension), suffixes
(extensions as a list)
and stem
(name without suffix). as_posix()
gets you the path with forward slashes, as_uri()
as a file://
URI.
is_absolute()
is useful, is_reserved()
is extremely useful on Windows and False
everywhere else. relative_to()
expresses one path starting from the other. with_name()
returns a new Path with the same path, but the new filename.
with_suffix()
does the same for the file suffix.
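A quick tour of these attributes, as a sketch on an invented path. Again I'm using PurePosixPath so the results don't depend on the OS you run this on.

```python
from pathlib import PurePosixPath

# Hypothetical path, chosen because it has two suffixes.
p = PurePosixPath("/home/user/project/archive.tar.gz")

print(p.parts)     # ('/', 'home', 'user', 'project', 'archive.tar.gz')
print(p.root)      # /
print(p.parent)    # /home/user/project
print(p.name)      # archive.tar.gz
print(p.suffix)    # .gz
print(p.suffixes)  # ['.tar', '.gz']
print(p.stem)      # archive.tar

# These return new paths; the original is never modified.
renamed = p.with_name("notes.txt")       # /home/user/project/notes.txt
resuffixed = p.with_suffix(".bz2")       # /home/user/project/archive.tar.bz2
relative = p.relative_to("/home/user")   # project/archive.tar.gz
```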
Search
.match()
returns True
if the path matches the given pattern. You can use .glob()
(or rglob()
for an implicit
leading **/
) to find files matching a pattern starting at the path. With iterdir()
you can iterate over all entries (files and subdirectories) in
a given directory.
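Here is a small sketch of the search methods. It builds a throwaway directory tree first (the file names are invented), so you can run it anywhere without touching your real files.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg" / "sub").mkdir(parents=True)
    for rel in ("setup.py", "README.md", "pkg/mod.py", "pkg/sub/deep.py"):
        (root / rel).touch()

    # glob() only looks at the level the pattern describes.
    top_level = sorted(p.name for p in root.glob("*.py"))

    # rglob() behaves as if the pattern started with **/.
    recursive = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.py"))

    # iterdir() yields every direct child, directories included.
    entries = sorted(p.name for p in root.iterdir())

    # match() tests a single path against a pattern without any searching.
    matches = (root / "pkg" / "mod.py").match("*.py")
```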
Methods
cwd()
and home()
return a new path pointing to your current working directory or your home directory. resolve()
turns any path into an absolute
path with no symlinks and no relative elements.
exists()
checks file or directory existence, and follows symlinks. Path.stat()
returns an os.stat_result
, which
you can either query directly or with the stat
module. chmod()
changes permissions and modes. You can also use
owner()
and group()
, is_dir()
, is_file()
, is_mount()
, is_symlink()
, is_socket()
, is_fifo()
,
is_block_device()
and is_char_device()
as shortcuts. With samefile()
you can check whether two paths point to the same file, just like with
os.path.samefile()
. expanduser()
replaces ~
and ~user
strings in a path.
Create new directories with mkdir()
and new files with touch()
, and remove files with unlink()
. Move files or
directories with rename()
and replace()
. Remove empty directories with rmdir()
. Create symlinks with symlink_to() and hard links with link_to() (replaced by hardlink_to() in Python 3.10 and removed
in 3.12), and remove either with unlink()
. Open files with .open()
, or read_bytes()
and read_text()
for immediate read access (and write_bytes()
and write_text()
for writing).
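The create, read, move and delete cycle looks roughly like this. It's a sketch that runs inside a temporary directory, and the file names are made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()

    note = data_dir / "note.txt"
    note.touch()                                 # create an empty file
    note.write_text("hello", encoding="utf-8")   # open, write and close in one call
    content = note.read_text(encoding="utf-8")
    size = note.stat().st_size                   # same data as os.stat()

    target = data_dir / "renamed.txt"
    note.rename(target)                          # move the file
    old_exists = note.exists()
    new_exists = target.is_file()

    target.unlink()                              # remove the file
    data_dir.rmdir()                             # only works on empty directories
    dir_exists = data_dir.exists()
```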
os.path
os.path
is in a weird spot: Some of its functions are being replaced by the much nicer pathlib
module, but nowhere
near all of them. For example, basename()
is available on pathlib.Path
objects, but commonpath
is not.
These functions, to my knowledge, are available by way of pathlib
: abspath (as absolute, or as resolve if you also want symlinks resolved), basename (as name), dirname (as parent),
exists, expanduser, join (as joinpath).
isabs()
is replaced by is_absolute()
(look at the readability!), isfile
by is_file
(and same for links and
directories), and ismount by is_mount()
. splitext()
is replaced by Path.suffix
and
Path.stem
. File information functions like getatime
, getmtime
, getctime
and getsize
are replaced by
Path().stat()
attributes.
These functions (again, to my knowledge), are still exclusive to os.path
: commonpath
tells you the longest common
path from a list of paths, and commonprefix
does the same character by character, so its result is not necessarily a valid path. expandvars
expands
variables of the form $var
and ${var}
. normcase
changes a path to lowercase on Windows (and converts forward slashes to backslashes) and leaves it unchanged
otherwise. relpath
is kind of equivalent to Path.relative_to
, but the signature is different enough to be annoying
when converting. samefile
is available on Path
objects, but sameopenfile
(comparing descriptors) is not, and
neither is samestat()
.
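A short sketch of the os.path-only functions. I'm calling posixpath directly, which is the module behind os.path on Unix-like systems, so that the results are the same on every platform. The paths and the DEMO_HOME variable are invented for the example.

```python
import os
import posixpath

paths = ["/usr/lib/python3", "/usr/local/lib"]

# commonpath works on whole path components ...
common_path = posixpath.commonpath(paths)      # '/usr'
# ... while commonprefix compares character by character.
common_prefix = posixpath.commonprefix(paths)  # '/usr/l'

# expandvars substitutes $var and ${var} from the environment.
os.environ["DEMO_HOME"] = "/srv/demo"          # hypothetical variable
expanded = posixpath.expandvars("${DEMO_HOME}/logs")

# relpath may walk upwards with '..', which Path.relative_to() traditionally refuses to do.
relative = posixpath.relpath("/usr/lib/python3", start="/usr/local")

# normpath collapses duplicate separators and '..' components.
normalized = posixpath.normpath("/usr//lib/../bin/")
```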
fileinput
In a strange set of capabilities, fileinput
allows you to loop over standard input or a list of files. If you call
fileinput.input()
without any arguments, it uses sys.argv[1:], and falls back to standard input if that list is empty
. You can use fileinput.filename()
and
fileinput.lineno()
and the like to get information about the file you're currently reading.
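To see the bookkeeping in action, here is a sketch that passes an explicit list of files instead of relying on sys.argv. The two text files are created on the fly and are purely illustrative.

```python
import fileinput
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "first.txt"
    second = Path(tmp) / "second.txt"
    first.write_text("a\nb\n", encoding="utf-8")
    second.write_text("c\n", encoding="utf-8")

    seen = []
    # The files are read one after the other, as if they were a single stream.
    with fileinput.input(files=[str(first), str(second)]) as lines:
        for line in lines:
            seen.append((
                Path(fileinput.filename()).name,  # file currently being read
                fileinput.lineno(),               # cumulative line number
                fileinput.filelineno(),           # line number within the current file
                line.rstrip("\n"),
            ))
```

Note that lineno() keeps counting across files, while filelineno() starts over for each one.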
stat
After calling os.stat()
on something, you can use the stat
module to interpret the result. It contains just a ton of
functions like S_ISDIR()
that return a boolean when passed a stat mode. It can also be used to extract the file mode
(e.g. as a -rwxrwxrwx
string, via filemode()), and to extract other information from stat results, using index constants like ST_SIZE
.
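A minimal sketch of interpreting stat results, using a temporary directory and a five-byte file created just for the example.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "data.bin")
    with open(file_path, "wb") as handle:
        handle.write(b"12345")

    dir_result = os.stat(tmp)
    file_result = os.stat(file_path)

    # The S_IS*() functions take the st_mode field and return a boolean.
    dir_is_dir = stat.S_ISDIR(dir_result.st_mode)
    file_is_regular = stat.S_ISREG(file_result.st_mode)

    # filemode() renders the mode the way `ls -l` does; the exact bits depend on your OS and umask.
    mode_string = stat.filemode(file_result.st_mode)

    # The ST_* constants are indexes into the stat result when you treat it as a tuple.
    size_by_index = file_result[stat.ST_SIZE]
```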
filecmp
filecmp
compares files and directories with varying speed and accuracy. For detailed content comparison, use
difflib
.
cmp()
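compares two files and returns a boolean. If called with shallow=True (the default), files with identical os.stat()
signatures are taken to be equal without reading their contents. cmpfiles() does the same for a list of file names in
two given directories and returns three lists: matches, mismatches and errors. The dircmp class gives you a comparison
object which you can use to see matching and different elements in the comparison process.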
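Here is a sketch of the three entry points side by side. The two directories and their files are invented and live in a temporary directory.

```python
import filecmp
import os
import tempfile

def write(path, text):
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

with tempfile.TemporaryDirectory() as tmp:
    left = os.path.join(tmp, "left")
    right = os.path.join(tmp, "right")
    os.mkdir(left)
    os.mkdir(right)
    write(os.path.join(left, "same.txt"), "identical")
    write(os.path.join(right, "same.txt"), "identical")
    write(os.path.join(left, "changed.txt"), "old")
    write(os.path.join(right, "changed.txt"), "new!")
    write(os.path.join(left, "only_left.txt"), "x")

    # shallow=False forces a comparison of the actual contents.
    files_equal = filecmp.cmp(
        os.path.join(left, "same.txt"), os.path.join(right, "same.txt"), shallow=False
    )

    # cmpfiles() takes the list of names to compare and returns three lists.
    match, mismatch, errors = filecmp.cmpfiles(
        left, right, ["same.txt", "changed.txt", "only_left.txt"], shallow=False
    )

    # dircmp computes its attributes lazily, so read them while the directories still exist.
    comparison = filecmp.dircmp(left, right)
    left_only = comparison.left_only
    common_files = sorted(comparison.common_files)
```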
tempfile
tempfile
provides platform-independent temporary files and directories. All of its interface classes can be used as
context managers, providing automatic cleanup once you exit the context. If you don't use them as context managers, they
will be deleted when you close them or when they get gc'd.
TemporaryFile
, the simplest variant, is not visible in directory listings under Unix, but that's not a platform independent
feature. Use NamedTemporaryFile
to get cross-platform consistent, visible files. SpooledTemporaryFile
keeps the data
spooled in memory until it grows past max_size or you call fileno()
or rollover()
(which cause the data to be written to disk).
TemporaryDirectory
also removes all of its contents when it is deleted.
If you don't want the whole context manager thing, you can use mkstemp()
and mkdtemp()
to create temporary files and
directories where you have to handle removal yourself. gettempdir()
tells you where tempfile
will create its files
and directories.
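A sketch of the context-manager style next to the do-it-yourself style. Nothing here is specific to my setup; the suffixes and payloads are arbitrary.

```python
import os
import tempfile

# The directory and everything in it disappear when the block exits.
with tempfile.TemporaryDirectory() as tmp_dir:
    inner = os.path.join(tmp_dir, "scratch.txt")
    with open(inner, "w", encoding="utf-8") as handle:
        handle.write("temporary")
    existed_inside = os.path.exists(inner)
dir_gone = not os.path.exists(tmp_dir)

# A named temporary file has a real path you can hand to other code.
with tempfile.NamedTemporaryFile(suffix=".log") as named:
    named_visible = os.path.exists(named.name)

# Spooled files live in memory until they are rolled over to disk.
with tempfile.SpooledTemporaryFile(max_size=1024) as spooled:
    spooled.write(b"small payload")
    spooled.rollover()          # force the switch to a real file
    spooled.seek(0)
    payload = spooled.read()

# mkstemp() returns an open file descriptor and a path; cleanup is up to you.
fd, manual_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
manual_existed = os.path.exists(manual_path)
os.remove(manual_path)
```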
glob
glob
provides Unix-style wildcard expansion. glob.glob()
combines os.scandir()
and fnmatch.fnmatch()
to return
a list of files matching your input string. If you set recursive=True
, you can also use **
directory wildcards. Use
iglob
if the result can be huge – it returns an iterator instead of a list. Both these functions raise auditing
events.
You can use glob.escape()
to deal with paths that contain the special characters *
, ?
and [
.
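Here is a sketch that shows the recursive flag and why escaping matters. The tree, including the awkwardly named notes[1].txt, is made up and created in a temporary directory.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "src", "pkg"))
    for rel in ("a.py", "src/b.py", "src/pkg/c.py", "notes[1].txt"):
        open(os.path.join(tmp, *rel.split("/")), "w").close()

    # Without recursive=True, only the named level is searched.
    flat = sorted(os.path.basename(p) for p in glob.glob(os.path.join(tmp, "*.py")))

    # With recursive=True, ** matches any number of directories, including none.
    deep = sorted(
        os.path.basename(p)
        for p in glob.glob(os.path.join(tmp, "**", "*.py"), recursive=True)
    )

    # iglob() yields the same results lazily.
    lazy = sorted(os.path.basename(p) for p in glob.iglob(os.path.join(tmp, "*.py")))

    # [1] is a character class, so the unescaped pattern looks for notes1.txt.
    unescaped = glob.glob(os.path.join(tmp, "notes[1].txt"))
    literal = glob.glob(glob.escape(os.path.join(tmp, "notes[1].txt")))
```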
fnmatch
fnmatch
provides Unix-style wildcard matching, with a fnmatch
and a filter
function for easy use, and a translate()
function that translates the given pattern to a regular expression to be used with re.match()
.
In most cases, you're going to want to use glob
, since it treats / as a special separator character. Or
as special separator character. Or
pathlib.Path.glob
, of course.
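The difference is easy to see in a sketch: fnmatch works on plain strings and lets * run straight across a slash. The names below are invented.

```python
import fnmatch
import re

names = ["report.txt", "report.csv", "notes.txt", "data/report.txt"]

# filter() keeps the names that match; note that * also matches the "/" in the last entry.
txt_files = fnmatch.filter(names, "*.txt")

# fnmatchcase() is always case-sensitive, regardless of the operating system.
case_sensitive = fnmatch.fnmatchcase("REPORT.TXT", "*.txt")

# translate() turns the pattern into a regular expression string.
regex = re.compile(fnmatch.translate("report.*"))
regex_matches = [name for name in names if regex.match(name)]
```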
linecache
linecache
is used to retrieve the file contents printed in tracebacks. You call it with getline(filename, lineno)
.
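As a quick sketch, with a three-line file written just for the occasion:

```python
import linecache
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "example.py"
    source.write_text("first = 1\nsecond = 2\nthird = 3\n", encoding="utf-8")

    # Line numbers start at 1; the trailing newline is part of the result.
    second_line = linecache.getline(str(source), 2)

    # Asking for a line that does not exist returns an empty string instead of raising.
    missing_line = linecache.getline(str(source), 99)

    # The module caches file contents, so clear the cache when you are done.
    linecache.clearcache()
```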
shutil
shutil
provides high-level file operations, particularly for copying files. But be warned that even shutil.copy2()
cannot transfer all metadata, and the metadata transferred differs depending on your OS. All of these methods
raise auditing events.
Directories and files
copyfileobj
copies a file-like object to another file-like object, and copyfile
copies a file from one path-like to
another. copy
does the same as copyfile
, but also accepts a directory as destination. copymode
copies permission
bits, and copystat
additionally copies access and modification times and flags. copy2
combines copy
behaviour with copystat
attribute copying. chown
changes user and group ownership of a given path.
copytree
copies an entire directory tree. You can ignore files, and you can use ignore_patterns
to create an ignore
function from a list of ignore patterns. rmtree
removes an entire directory tree. move
moves either files or
directory trees with a copying function of your choice.
disk_usage
returns a tuple of total, used and free bytes in a given directory. which
looks up executables, using the
current PATH
.
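A sketch that strings the most common calls together. The directory layout is invented, and everything happens inside a temporary directory.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.makedirs(os.path.join(src, "nested"))
    for rel in ("keep.txt", "skip.pyc", os.path.join("nested", "deep.txt")):
        with open(os.path.join(src, rel), "w", encoding="utf-8") as handle:
            handle.write("data")

    # copy2() copies contents plus as much metadata as the platform allows.
    single_copy = shutil.copy2(os.path.join(src, "keep.txt"), os.path.join(tmp, "copy.txt"))

    # ignore_patterns() builds the ignore callable that copytree() expects.
    dst = shutil.copytree(
        src, os.path.join(tmp, "dst"), ignore=shutil.ignore_patterns("*.pyc")
    )
    copied = sorted(os.listdir(dst))

    # move() returns the final destination.
    moved = shutil.move(dst, os.path.join(tmp, "moved"))

    usage = shutil.disk_usage(tmp)   # named tuple: total, used, free (in bytes)

    shutil.rmtree(moved)
    moved_exists = os.path.exists(moved)

# which() returns None if the executable is not on PATH.
python_path = shutil.which("python3") or shutil.which("python")
```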
Archiving
make_archive
creates an archive. Like all archiving tools, it has an unholy amount of options, and the directory options
are particularly confusing. I highly recommend playing around with them a bit. It supports several formats: zip, tar,
gztar, bztar and xztar (depending on available modules – use get_archive_formats
). unpack_archive
does the same in
reverse.
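Since the directory options are the confusing part, here is a sketch of one combination that I find easy to reason about: root_dir is where the archiving starts, base_dir is what ends up inside the archive. It assumes the zip format is available (it needs zlib), and the project directory is invented.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    project = os.path.join(tmp, "project")
    os.makedirs(project)
    with open(os.path.join(project, "readme.txt"), "w", encoding="utf-8") as handle:
        handle.write("hello")

    # base_name is the archive path without extension; the extension is added for you.
    archive_path = shutil.make_archive(
        base_name=os.path.join(tmp, "backup"),
        format="zip",
        root_dir=tmp,         # paths in the archive are relative to this directory
        base_dir="project",   # only this subdirectory is archived
    )

    extract_dir = os.path.join(tmp, "restored")
    shutil.unpack_archive(archive_path, extract_dir)
    with open(os.path.join(extract_dir, "project", "readme.txt"), encoding="utf-8") as handle:
        restored_text = handle.read()

    # Names of the formats available in this interpreter.
    formats = [name for name, _description in shutil.get_archive_formats()]
```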
Terminal size
To make sure shutil
isn't too intuitive to use, it also houses get_terminal_size()
, which returns a named tuple of columns
and lines.
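A tiny sketch. The fallback is what you get when no terminal is attached, for example when the output is piped.

```python
import shutil

# The result is a named tuple, so both attribute access and unpacking work.
size = shutil.get_terminal_size(fallback=(80, 24))
columns, lines = size

print(f"{size.columns} columns x {size.lines} lines")
```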