December 1: Python Text Processing Services

As part of the Python Standard Library traversal, today we're going through the Python text processing services: string, re, difflib, textwrap, unicodedata, stringprep, readline, and rlcompleter.


  • string.Template is kind of nice for user-facing string substitution.
  • string.capwords() runs str.capitalize() on every word in a string.
  • difflib.get_close_matches can help you catch user typos.
  • re.split is like str.split, but for regular expressions.
  • re.sub can take a function instead of a replacement text.
  • If you don't know about textwrap yet: textwrap is good, use it! The text shortening and indent/dedent functionality in particular is a really nice add-on.


String constants

Use string constants for all your string grouping needs: ascii_lowercase, ascii_uppercase, digits, whitespace are really handy.
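A quick look at what these constants actually hold:

```python
import string

# Ready-made character groups, so you never have to type the alphabet yourself:
string.ascii_lowercase  # 'abcdefghijklmnopqrstuvwxyz'
string.ascii_uppercase  # 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
string.digits           # '0123456789'
string.whitespace       # ' \t\n\r\x0b\x0c'
```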

Custom string formatting

If the (three!) built-in ways of string formatting are not enough for you, you can use string.Formatter to build your very own string formatter. I assume that if you want to do that, you'll know what you're doing, but … why?

Template strings

The string module provides the Template class, which comes with a far simpler substitution mechanism than regular string formatting: you write $variable or ${variable} (for when the variable is not separated from the surrounding text). You shove those strings into the Template() constructor and then call substitute() with a dictionary or keyword arguments. If you are fine with leaving some placeholders unsubstituted, use safe_substitute() instead (plain substitute() raises a KeyError for missing values). If you really want to, you can subclass Template and change the delimiter, brace pattern and so on. Template strings are used for internationalization and similar user-facing substitution tasks.
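A minimal sketch of how that looks (the template text is made up):

```python
from string import Template

greeting = Template("Hello, $name! Welcome to ${lang}land.")

# substitute() fills in all placeholders (raises KeyError if one is missing):
greeting.substitute(name="Ada", lang="Python")
# 'Hello, Ada! Welcome to Pythonland.'

# safe_substitute() leaves missing placeholders in place instead:
greeting.safe_substitute(name="Ada")
# 'Hello, Ada! Welcome to ${lang}land.'
```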

Helper functions

string.capwords() splits a string into words, capitalizes each word, and then joins them again.
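In other words, it's roughly `" ".join(w.capitalize() for w in s.split())`:

```python
import string

# Each whitespace-separated word gets str.capitalize() applied:
string.capwords("hello world, it's python")
# "Hello World, It's Python"
```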


re

Who doesn't love regex? This is Python trying to be Perl. Remember to use r"" notation for your regexes – those are raw strings, in which \ is literally a backslash rather than the start of a Python escape sequence.

Regular expression syntax

Python adds its own dialect to the collection of regular expression languages that make up any well-running *nix system, but it generally follows the common paradigms. Be aware that there are special flags to switch to ASCII-only matching, case-insensitive matching, multi-line matching and more. Python does support backreferences (which, by the way, makes the Python regular expression syntax a non-regular language).

Python supports the usual special sequences, such as \d for digits. Prefer them, since their definition of "word" and "whitespace" is more Unicode-complete than yours is likely to be.
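A small illustration of that Unicode-awareness, using the re.ASCII flag mentioned above (sample strings are made up):

```python
import re

# \w is Unicode-aware by default, so accented letters count as word characters:
re.findall(r"\w+", "naïve café")
# ['naïve', 'café']

# With re.ASCII, \w falls back to [a-zA-Z0-9_] and splits on the accents:
re.findall(r"\w+", "naïve café", flags=re.ASCII)
# ['na', 've', 'caf']
```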

Module contents

You create regular expression objects with compile(pattern, flags=0), which is worth doing when you want to reuse a pattern. Otherwise, you can have Python compile them on the fly and pass normal strings to the following functions. To use a regular expression, you use search (for the first match anywhere), match (to match from the start of the string), fullmatch (to match the entire string, all-or-nothing), findall (for all matches as a list) and finditer (for all matches as an iterator, if memory is a concern). All of these exist as methods on regular expression objects and as functions in the re module.
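The five lookup methods side by side, on a compiled pattern (the sample strings are made up):

```python
import re

digits = re.compile(r"\d+")

digits.search("abc 42 def 7")      # first match anywhere -> matches '42'
digits.match("abc 42")             # None: the string doesn't start with digits
digits.fullmatch("42")             # entire string must match -> matches '42'
digits.findall("abc 42 def 7")     # ['42', '7']
[m.group() for m in digits.finditer("abc 42 def 7")]  # same, lazily
```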

Use sub and subn for search-and-replace functionality (subn additionally reports how many substitutions were made).
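As mentioned up top, sub can take a function instead of a replacement string – a sketch with a made-up censoring function:

```python
import re

def censor(match):
    # Replacement functions receive the match object and return the new text.
    return "*" * len(match.group(0))

re.sub(r"\d+", censor, "card 1234, pin 99")   # 'card ****, pin **'
re.subn(r"\d+", censor, "card 1234, pin 99")  # ('card ****, pin **', 2)
```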

re.split replicates regular string splitting but with regular expressions.
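For example (made-up delimiters):

```python
import re

# Split on commas or semicolons, swallowing any trailing whitespace:
re.split(r"[,;]\s*", "a, b; c,d")   # ['a', 'b', 'c', 'd']

# A capturing group keeps the delimiters in the result:
re.split(r"(\d)", "a1b2c")          # ['a', '1', 'b', '2', 'c']
```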

Match objects

Once you have found a match, you can do a bunch of things with it. With group() you can retrieve subgroups from the match – use their 1-based index or their name. With groups() you get all subgroups in one big tuple, or with groupdict() in a, you guessed it, dictionary. start() and end() give you the start and end position of the subgroup you pass (or use span() to get both at once). The match object also retains access to its originating string and pattern (as .string and .re), and .pos and .endpos tell you where the regex engine started and stopped looking for a match.
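Putting the accessors together on one match (the date pattern and input are made up):

```python
import re

m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Published 2024-12, reprinted later")

m.group(0)        # '2024-12' – the whole match
m.group("year")   # '2024' – by name, or equivalently m.group(1) by index
m.groups()        # ('2024', '12')
m.groupdict()     # {'year': '2024', 'month': '12'}
m.span(1)         # (10, 14) – where the first subgroup sits in the string
```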


difflib

You want to compare any two sequences? difflib has got you covered. Use functions like context_diff, ndiff, unified_diff or diff_bytes to generate a delta: a generator that spits out lines prefixed with -, +, or a space, marking lines that occur only in the first input, only in the second, or in both. With restore you can take an ndiff-style delta and extract one of the two original sequences.

Other fun things you can do with difflib: SequenceMatcher.quick_ratio() can tell you the similarity of two sequences. get_close_matches(word, possibilities) can tell you if a word is really close to another one from a list, great for user suggestions. HtmlDiff creates an HTML table (or a complete file containing that table) showing a line-by-line, side-by-side comparison of the input texts.
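The typo-catching use case in miniature (the candidate list comes from the documentation's example):

```python
import difflib

# Candidates are returned best-match first:
difflib.get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
# ['apple', 'ape']

# quick_ratio() gives a fast upper bound on similarity, between 0 and 1:
difflib.SequenceMatcher(None, "abcd", "bcde").quick_ratio()  # 0.75
```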


textwrap

Ever since I found out about textwrap, I've been using it a lot. Use wrap(text, width=70) and fill() to wrap text into a list of lines or a single newline-joined string, respectively. You can also expand tabs, replace or drop whitespace, add an initial indent, change how long words and hyphenated words are handled, and set a maximum number of lines (when the text is longer, it will be cut off with a placeholder, defaulting to [...]). That's pretty dreamy already, and if you want to customize this behaviour further, you can subclass TextWrapper.

shorten(text, width, placeholder=" [...]") collapses whitespace, then removes words from the end of the text until it fits the specified width, and finally adds the placeholder. It's extremely useful, but sadly doesn't include a character-based mode.

Additionally, dedent() removes leading whitespace from every line in the text it's given, and indent() adds a given prefix to all lines in the given text, or only the lines for which the optional predicate function returns True.
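The whole toolbox in one place (sample texts are made up; the shorten example is from the documentation):

```python
import textwrap

text = "The quick brown fox jumps over the lazy dog."

textwrap.wrap(text, width=20)   # list of lines, each at most 20 characters
textwrap.fill(text, width=20)   # the same lines, joined with newlines

textwrap.shorten("Hello  world!", width=11)   # 'Hello [...]'
textwrap.dedent("    a\n    b")               # 'a\nb'
textwrap.indent("a\nb", "> ")                 # '> a\n> b'
```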


unicodedata

unicodedata lets you do all sorts of silly things with unicode input, and I should use it more. The most obviously useful and most used function is normalize() (along with is_normalized()), which decomposes and composes unicode strings into the various normal forms, dealing with combining characters and practically-equivalent characters along the way.

The next most useful functions are probably name() (the name of "Ⅻ" is "ROMAN NUMERAL TWELVE") and its reverse operation, lookup().

You can also use the module to analyze a character: category() tells you which category it belongs to, and bidirectional, combining and mirrored give you further information. If the character represents a number, decimal, digit and numeric can be used to get that number – for example, unicodedata.numeric("Ⅻ") is 12.0.
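All of the above in a few lines:

```python
import unicodedata

# 'e' plus a combining acute accent composes into a single code point:
unicodedata.normalize("NFC", "e\u0301")      # 'é' (i.e. '\xe9')

unicodedata.name("Ⅻ")                        # 'ROMAN NUMERAL TWELVE'
unicodedata.lookup("ROMAN NUMERAL TWELVE")   # 'Ⅻ'
unicodedata.category("A")                    # 'Lu' – uppercase letter
unicodedata.numeric("Ⅻ")                    # 12.0
```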


stringprep

Comparing strings is not straightforward, and RFC 3454 regulates how to prepare strings for comparison in different internet application domains. For this purpose, it sorts unicode code points into "tables". stringprep provides lovely functions such as in_table_a1 and in_table_c11 to help you comply with this RFC, should you really want to.
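For a taste of what these table checks look like (table C.1.1 is the RFC's "ASCII space characters" table, A.1 the unassigned code points):

```python
import stringprep

stringprep.in_table_c11(" ")   # True – the ASCII space is in table C.1.1
stringprep.in_table_a1("a")    # False – "a" is an assigned code point
```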


readline

Python has hooks for readline, so you can manipulate the shell history, completion functions and the like. The interactive interpreter uses this module to read and write the ~/.python_history file, for example, via its built-in history file tools. Its companion module rlcompleter supplies the completer that readline uses to complete Python identifiers and keywords.