December 1: Python Text Processing Services
As part of the Python Standard Library traversal, today, we're going through the Python text processing services: string, re, difflib, textwrap, unicodedata, stringprep, readline, rlcompleter.
Highlights
- string.Template is kind of nice for user-facing string substitution.
- string.capwords() runs str.capitalize() on every word in a string.
- difflib.get_close_matches can help you catch user typos.
- re.split is like str.split, but for regular expressions.
- re.sub can take a function instead of a replacement text.
- People who don't know about textwrap: textwrap is good, use it! The text shortening and indent/dedent functionality in particular is a really nice add-on.
string
String constants
Use string constants for all your string grouping needs: ascii_lowercase, ascii_uppercase, digits and whitespace are really handy.
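As a small illustration, these constants work well as ready-made character sets. The slug rule and the is_valid_slug helper below are made up for this sketch and are not part of the string module.

```python
import string

# Hypothetical rule: slugs may contain lowercase ASCII letters, digits and "-".
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def is_valid_slug(text):
    """Return True if text is non-empty and uses only allowed characters."""
    return bool(text) and set(text) <= ALLOWED


print(is_valid_slug("advent-of-python-01"))  # True
print(is_valid_slug("Advent of Python"))     # False
print(repr(string.whitespace))               # all ASCII whitespace characters
```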
Custom string formatting
If the (three!) built-in ways of string formatting are not enough for you, you can use string.Formatter to build your very own string formatter. I assume that if you want to do that, you'll know what you're doing, but … why?
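One conceivable answer to the "why": changing how fields are looked up. The sketch below overrides Formatter.get_value() so that missing named fields produce a visible marker instead of a KeyError. The class name DefaultFormatter and the marker format are assumptions for illustration only.

```python
import string


class DefaultFormatter(string.Formatter):
    """Formatter that substitutes a marker for missing named fields."""

    def get_value(self, key, args, kwargs):
        # Named fields arrive as str keys, positional fields as int keys.
        if isinstance(key, str):
            return kwargs.get(key, f"<{key}?>")
        return super().get_value(key, args, kwargs)


fmt = DefaultFormatter()
print(fmt.format("Hello {name}, you are {age}", name="Ada"))
# Hello Ada, you are <age?>
```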
Template strings
The string module provides the Template class that comes with a far easier substitution mechanism than regular string formatting: You have $variable and ${variable} (for when your variable is directly followed by other word characters). You shove those strings into the Template() constructor and then use substitute() with a dictionary or named arguments.
If you are fine with not substituting all placeholders, use safe_substitute() instead. If you really want to, you can subclass Template and change the delimiters, brace patterns and so on. Template strings are in use in internationalization and similar simple external services.
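A short sketch of the three pieces described above: substitute(), safe_substitute() and a subclass with a different delimiter. The template text and the PercentTemplate class name are invented for this example.

```python
from string import Template

greeting = Template("Hello $name, you owe ${amount}EUR for $item.")

# substitute() needs every placeholder and raises KeyError otherwise.
print(greeting.substitute(name="Ada", amount=5, item="coffee"))
# Hello Ada, you owe 5EUR for coffee.

# safe_substitute() leaves unknown placeholders untouched.
print(greeting.safe_substitute({"name": "Ada"}))
# Hello Ada, you owe ${amount}EUR for $item.


class PercentTemplate(Template):
    """Hypothetical subclass that uses % instead of $ as the delimiter."""

    delimiter = "%"


print(PercentTemplate("100%% of %thing").substitute(thing="cake"))
# 100% of cake
```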
Helper functions
string.capwords() splits a string into words, capitalizes each word, and then joins them again.
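A quick example with made-up input strings. Note that without an explicit separator, runs of whitespace are collapsed into single spaces along the way.

```python
import string

# Default: split on whitespace, capitalize, join with single spaces.
print(string.capwords("the   qUICK brown fox"))  # The Quick Brown Fox

# With an explicit separator, the separator is used for splitting and joining.
print(string.capwords("one-two-three", sep="-"))  # One-Two-Three
```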
re
Who doesn't love regex? This is Python trying to be Perl. Remember to use r"" notation for your regexes – those are raw strings, which means that \ is literally a backslash, and does not start a Python escape sequence.
Regular expression syntax
Python contributes to the different regular expression languages that make up any well-running *nix system, but it generally follows the common paradigms. Be aware that there are special flags to switch to ASCII-only matching, case-insensitive matching, multi-line matching and more. Python does support backrefs (which means Python regular expressions can match languages that are not regular, btw).
Python supports the usual special sequences, such as \d for digits. Try to use them, since their definition of "word" and "whitespace" is more unicode-complete than yours is likely to be.
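To make the points about special sequences, flags and backrefs a bit more concrete, here is a small sketch. The sample strings are invented for this example.

```python
import re

# \d matches any Unicode decimal digit by default, not just 0-9.
arabic_indic = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITS ONE, TWO, THREE
print(bool(re.fullmatch(r"\d+", arabic_indic)))            # True
print(bool(re.fullmatch(r"\d+", arabic_indic, re.ASCII)))  # False

# Flags can be combined with |.
pattern = re.compile(r"^error: (.+)$", re.IGNORECASE | re.MULTILINE)
log = "ok\nERROR: disk full\nerror: timeout"
print(pattern.findall(log))  # ['disk full', 'timeout']

# A backreference: \1 must repeat whatever group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine")
print(doubled.group(1))  # was
```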
Module contents
You create regular expression objects with compile(pattern, flags=0), which is good to use when you want to re-use the patterns. Otherwise, you can have Python create them on the fly and pass normal pattern strings to the following module-level functions.
To use a regular expression, you use search (for the first match), match (to match from the start of the string), fullmatch (to match the entire string, all-or-nothing fashion), findall (for all matches) and finditer (for all matches as an iterator, which saves memory). All of these exist both as methods on regular expression objects and as functions in the re module.
Use sub and subn for search-and-replace functionality.
re.split replicates regular string splitting but with regular expressions.
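The functions above, side by side on a made-up sample string. The to_german helper is a hypothetical name and shows sub taking a function instead of a replacement text.

```python
import re

text = "2023-12-01 start, 2023-12-24 end"
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

print(date.search(text).group())     # 2023-12-01 (first match anywhere)
print(date.match(text) is not None)  # True (the string starts with a date)
print(date.fullmatch(text))          # None (the whole string is not one date)
print(date.findall(text))            # [('2023', '12', '01'), ('2023', '12', '24')]
print([m.group() for m in date.finditer(text)])  # ['2023-12-01', '2023-12-24']


def to_german(match):
    """Replacement function: receives the match object, returns the new text."""
    year, month, day = match.groups()
    return f"{day}.{month}.{year}"


print(date.sub(to_german, text))  # 01.12.2023 start, 24.12.2023 end
print(date.subn("DATE", text))    # ('DATE start, DATE end', 2)

print(re.split(r"\s*[,;]\s*", "a, b;c ,d"))  # ['a', 'b', 'c', 'd']
```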
Match objects
Once you have found a match, you can do a bunch of things with it. With group() you can retrieve subgroups from the match – use their 1-based index or their name (group 0 is the whole match). With groups() you can just get all subgroups in one big tuple, or with groupdict() get the named ones in a, you guessed it, dictionary. start and end give you the start and end position of the subgroup you pass (or use span() to get both at once). The match object also retains access to its string and re, and with pos and endpos you can check the starting and ending positions between which the regex engine tried to find a match.
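A sketch of the match object API on an invented example. The pattern is deliberately simplistic and not a real email validator.

```python
import re

pattern = re.compile(r"(?P<user>[\w.]+)@(?P<host>[\w.]+)")
text = "Contact: ada@example.org"
m = pattern.search(text)

print(m.group(0))       # ada@example.org (the whole match)
print(m.group(1))       # ada (by 1-based index)
print(m.group("host"))  # example.org (by name)
print(m.groups())       # ('ada', 'example.org')
print(m.groupdict())    # {'user': 'ada', 'host': 'example.org'}

print(m.span())                        # (9, 24)
print(m.start("host"), m.end("host"))  # 13 24

# The match remembers where it came from and where the search took place.
print(m.string is text, m.re is pattern)  # True True
print(m.pos, m.endpos)                    # 0 24
```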
difflib
You want to compare any two sequences? difflib has got you covered. Use functions like context_diff, ndiff, unified_diff or diff_bytes to generate a delta: a generator that spits out the lines of a diff. In an ndiff delta, each line starts with "- ", "+ " or two spaces to indicate whether it occurs only in the first input, only in the second, or in both (lines starting with "? " are hints pointing at changes within a line). With restore you can take such an ndiff delta and extract one of the two originating text segments.
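A minimal sketch with two invented line lists. Note that restore expects a delta produced by ndiff (or Differ.compare), not a unified or context diff. The file names passed to unified_diff are only labels for the diff header.

```python
import difflib

old = ["one\n", "two\n", "three\n"]
new = ["one\n", "too\n", "three\n", "four\n"]

# ndiff returns a generator; materialize it so it can be reused below.
delta = list(difflib.ndiff(old, new))
print("".join(delta), end="")

# restore() extracts side 1 or side 2 from an ndiff delta.
restored_old = list(difflib.restore(delta, 1))
restored_new = list(difflib.restore(delta, 2))
print(restored_old == old, restored_new == new)  # True True

# unified_diff produces the format known from `diff -u`.
unified = list(difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt"))
print("".join(unified), end="")
```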
Other fun things you can do with difflib: SequenceMatcher.ratio() can tell you the similarity of two words as a number between 0 and 1, and quick_ratio() gives you a cheaper upper bound on that number. get_close_matches(word, possibilities) can tell you if a word is really close to another one from a list, great for user suggestions. HtmlDiff creates an HTML table (or a complete file containing that table) showing a line-by-line, side-by-side comparison of the input texts.
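For instance, a "did you mean ...?" suggestion could be sketched like this. The command list and the typo are made up for the example.

```python
import difflib

commands = ["status", "commit", "checkout", "cherry-pick"]

# Returns the best matches above a similarity cutoff (0.6 by default).
suggestions = difflib.get_close_matches("comit", commands)
print(suggestions)  # ['commit']

# The similarity score behind that decision.
matcher = difflib.SequenceMatcher(None, "comit", "commit")
print(round(matcher.ratio(), 3))                 # 0.909
print(matcher.quick_ratio() >= matcher.ratio())  # True (upper bound)

# make_table() returns just the table, make_file() a complete HTML document.
table = difflib.HtmlDiff().make_table(["one", "two"], ["one", "too"])
print(table[:40])
```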
textwrap
Ever since I found out about textwrap, I've been using it a lot.
Use wrap(text, width=70) and fill() to wrap text into a list of lines or a single string with newlines respectively. You can also expand tabs, replace or drop whitespace, add an initial indent, change how long words and hyphenated words are handled, and set a maximum number of lines (and when the text is longer, it will be cut off with a placeholder, defaulting to [...]). That's pretty dreamy already, and if you want to customize this behaviour further, you can subclass TextWrapper.
shorten(text, width, placeholder=" [...]") collapses whitespace, then removes words from the end of a text until it fits into the specified width and adds the placeholder. It's extremely useful, but sadly doesn't include a character-based mode.
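A short sketch of wrap, fill and shorten. The sample sentence is invented, and the shorten calls mirror the kind of examples found in the textwrap documentation.

```python
import textwrap

text = "The quick brown fox jumps over the lazy dog and keeps running."

# wrap() returns a list of lines, fill() joins them with newlines.
lines = textwrap.wrap(text, width=20)
print(lines)
print(textwrap.fill(text, width=20) == "\n".join(lines))  # True

# max_lines cuts the output off and appends the placeholder.
clipped = textwrap.fill(text, width=20, max_lines=2)
print(clipped)

# shorten() works on whole words, never on single characters.
print(textwrap.shorten("Hello  world!", width=12))  # Hello world!
print(textwrap.shorten("Hello  world!", width=11))  # Hello [...]
print(textwrap.shorten("Hello world", width=10, placeholder="..."))  # Hello...
```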
Additionally, dedent() removes common leading whitespace from every line in the text it's given, and indent() adds a given prefix to all lines in the given text (by default skipping lines that consist only of whitespace), or only to the lines for which the optional predicate function returns True.
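Both functions on small made-up inputs. The "> " prefix is just an example that produces email-style quoting.

```python
import textwrap

indented = "    first\n      second\n    third\n"

# Only the whitespace common to all lines is removed.
print(repr(textwrap.dedent(indented)))
# 'first\n  second\nthird\n'

quote = "a\n\nb\n"

# By default, whitespace-only lines do not get the prefix.
print(repr(textwrap.indent(quote, "> ")))
# '> a\n\n> b\n'

# A predicate decides per line; here every line gets the prefix.
print(repr(textwrap.indent(quote, "> ", predicate=lambda line: True)))
# '> a\n> \n> b\n'
```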
unicodedata
unicodedata lets you do all sorts of silly things with unicode input, and I should use it more. The most obviously useful and most used function is normalize() (and is_normalized()), which decomposes and composes unicode strings into the various normalization forms (NFC, NFD, NFKC and NFKD), and deals with combining characters and practically-equivalent characters.
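Why this matters, as a small sketch: the same visible character can be spelled with different code points, and normalization makes the spellings comparable. The sample characters are chosen for this example, and is_normalized() requires Python 3.8 or newer.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# The K forms also replace compatibility characters, such as ligatures.
print(unicodedata.normalize("NFKC", "\ufb01"))  # fi

print(unicodedata.is_normalized("NFC", composed))    # True
print(unicodedata.is_normalized("NFC", decomposed))  # False
```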
The next most useful functions are probably name() (unicodedata.name("Ⅻ") is "ROMAN NUMERAL TWELVE") and its reverse operation, lookup().
You can also use the module to analyze a character: category() tells you which general category the character belongs to, bidirectional, combining and mirrored give you further information. If the character represents a number, decimal, digit and numeric can be used to get that number – for example, unicodedata.numeric("Ⅻ") is 12.0.
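The lookup and analysis functions in one place. Apart from Ⅻ, the sample characters are picked for this sketch.

```python
import unicodedata

twelve = "Ⅻ"
print(unicodedata.name(twelve))                  # ROMAN NUMERAL TWELVE
print(unicodedata.lookup("ROMAN NUMERAL TWELVE") == twelve)  # True
print(unicodedata.category(twelve))              # Nl (Number, letter)
print(unicodedata.numeric(twelve))               # 12.0

print(unicodedata.category("a"))          # Ll (Letter, lowercase)
print(unicodedata.bidirectional("a"))     # L (left-to-right)
print(unicodedata.combining("\u0301"))    # 230 (combining acute accent)
print(unicodedata.mirrored("("))          # 1

# decimal() is the strictest of the three number functions:
# superscript two counts as a digit, but not as a decimal.
print(unicodedata.decimal("7"))             # 7
print(unicodedata.digit("\u00b2"))          # 2
print(unicodedata.decimal("\u00b2", None))  # None (the supplied default)
```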
stringprep
Comparing strings is not straightforward, and RFC 3454 regulates how to prepare strings for comparison in different internet application domains. For this purpose, it separates unicode code points into "tables". stringprep provides lovely functions such as in_table_a1 and in_table_c11 to help you comply with this RFC, should you really want to.
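What calling these functions looks like, as a tiny sketch. The two tables used here are C.1.1 (ASCII space characters) and C.1.2 (non-ASCII space characters), and the sample characters are chosen for illustration.

```python
import stringprep

# Table C.1.1 contains the ASCII space characters.
print(stringprep.in_table_c11(" "))  # True
print(stringprep.in_table_c11("a"))  # False

# Table C.1.2 contains non-ASCII spaces, such as the no-break space.
print(stringprep.in_table_c12("\u00a0"))  # True
print(stringprep.in_table_c12(" "))       # False
```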
readline
Python has hooks for readline, so you can manipulate shell history, completion functions and the like. This module is used to write to a ~/.python_history file, for example, using its built-in history file tools.