December 3: Python Data Types
Strap in, this is a long one: as part of the Python Standard Library
traversal, after yesterday's short post about binary
data services, today we're going to look at Python data type modules.
There are a lot of them: datetime, zoneinfo, calendar, collections, heapq, bisect, array, weakref, types, copy,
pprint, reprlib, enum and graphlib.
Highlights
- Don't use datetime.datetime.time(), use timetz().
- I'll be happy to forget calendar in a few days.
- All of collections is good and underused.
  - deques have a maxlen attribute and will discard items past that length.
- bisect maintains sorted lists cheaply.
- Trigger callbacks on garbage collection with weakref.finalize(obj, callback).
- You can only weak ref list and dict subclasses, not the original types themselves.
- types.coroutine() can turn a generator function into an awaitable coroutine.
- If an enum has two members with the same value, the second one is an alias for the first.
datetime
Aware and naive objects
datetime is a big reason that I have not fallen out of love with Python yet. Native support for naive and aware
immutable objects that block interactions between the two? Sign me up!
timedelta
timedeltas are made out of love. You use them to add intervals to dates or times, but you can also use all the usual math operations on them and compare them to other timedeltas.
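A minimal sketch of that arithmetic. The concrete dates and intervals are made up for illustration.

```python
from datetime import date, timedelta

week = timedelta(weeks=1)
shift = timedelta(hours=36)

# Arithmetic between timedeltas gives timedeltas again ...
total = week + 2 * shift                  # 7 days + 72 hours = 10 days
# ... except true division, which gives a plain float,
ratio = week / shift
# and floor division, which gives an int.
whole_days = shift // timedelta(days=1)   # 1

# Adding a timedelta to a date moves the date.
deadline = date(2020, 12, 3) + week       # 2020-12-10

print(total, ratio, whole_days, deadline)
print(shift < week)                       # comparisons work, too
```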
date
dates use the idealized Gregorian calendar: going infinitely forwards and backwards – remember this if you need dates in Russia in 1900, and/or watch my talk about calendars. Note that dates on their own do not have a timezone, and as such are scary and misleading.
Interfaces include the usual attributes (day, month, year, weekday, you name it). Next to that, there's date.today(),
date.replace(), and ctime() and strftime(), of course.
datetime
datetimes combine, you guessed it, date and time. Use today() and now() for handy
constructors, or better now(timezone.utc), which gives you an aware object. utcnow() looks tempting, but it returns a
naive datetime and is deprecated since Python 3.12. The same goes for timestamps: prefer
fromtimestamp(ts, tz=timezone.utc) over utcfromtimestamp(), because using
naive datetimes is just asking for trouble. combine() merges a date and a time. The attribute fold (0 or 1) tells you if it's
the first or second occurrence of a time when it occurs twice in a day (due to daylight savings and similar edge cases).
You can add timedeltas, or get them by way of subtraction. You can compare them, which is pretty useful. You can extract
the date() or time(), though you should really be using timetz().
To add a timezone to a naive datetime, use replace(tzinfo=tz). To cast a datetime to a certain timezone, use
astimezone(tz). You can use tzname(), utcoffset() and dst() to retrieve timezone related data.
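Here is a small sketch of the difference between replace(tzinfo=...) and astimezone(). The fixed-offset zone plus_two is a made-up example, not a real timezone from the tz database.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed-offset zone, chosen only for illustration.
plus_two = timezone(timedelta(hours=2), name="UTC+2")

naive = datetime(2020, 12, 3, 12, 0)
aware_utc = naive.replace(tzinfo=timezone.utc)   # same wall clock, now labelled as UTC
converted = aware_utc.astimezone(plus_two)       # same instant, different wall clock

print(converted.isoformat())   # 2020-12-03T14:00:00+02:00
print(converted.time())        # time() drops the timezone
print(converted.timetz())      # timetz() keeps it

# Ordering comparisons between naive and aware objects are refused.
try:
    naive < aware_utc
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```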
time
time is like datetime without the date.
tzinfo
tzinfo is an abstract class. Use instances of it, provided by other people who know what they are doing. Hopefully.
strftime and strptime
Hey, look, it's another string formatting language! There's a lot of placeholders.
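A quick round trip, as a sketch. The format string is just one possible choice of placeholders.

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M"   # year-month-day hour:minute

stamp = datetime(2020, 12, 3, 9, 30)
text = stamp.strftime(FORMAT)              # datetime -> str
parsed = datetime.strptime(text, FORMAT)   # str -> datetime

print(text)   # 2020-12-03 09:30
```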
zoneinfo
zoneinfo uses system timezone information, if available, and otherwise the tzdata package from PyPI, if installed. Its
ZoneInfo class implements the tzinfo abstract base class.
calendar
Have you ever said to yourself "I wish I could build a calendar grid like unix's cal does in Python"? Well, aren't you
in luck: calendar.TextCalendar().prmonth(2020, 12) has you covered. There is also an HTMLCalendar and localised
versions of each.
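A short sketch of the text calendar. formatmonth() returns the string that prmonth() would print, and monthdayscalendar() exposes the same grid as numbers.

```python
import calendar

cal = calendar.TextCalendar(firstweekday=calendar.MONDAY)

grid = cal.formatmonth(2020, 12)   # the text that prmonth(2020, 12) prints
print(grid)

# One list per week; 0 marks days that belong to a neighbouring month.
weeks = cal.monthdayscalendar(2020, 12)
print(weeks[0])   # [0, 1, 2, 3, 4, 5, 6]
```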
collections
ChainMap
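ChainMap allows you to group mappings together, and is often faster than building a new dict with chained .update() calls. Lookups search the mappings in order, while writes only touch the first one. You can access
the list of mappings in the maps attribute, and new_child() returns a new ChainMap with an additional mapping in front.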
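A minimal sketch, assuming a typical "command line beats defaults" setup. The dictionaries and keys are invented for the example.

```python
from collections import ChainMap

defaults = {"colour": "red", "user": "guest"}
cli_args = {"user": "admin"}

settings = ChainMap(cli_args, defaults)        # lookups search left to right
print(settings["user"], settings["colour"])    # admin red

# new_child() returns a new ChainMap with one more mapping in front.
scoped = settings.new_child({"colour": "blue"})
scoped["debug"] = True                         # writes go to the first mapping only

print(scoped["colour"], "debug" in settings)   # blue False
```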
Counter
Feed any iterable to a Counter, or hand it keyword arguments. You can also subtract() iterables, or update() it
repeatedly. Access results via normal dictionary access (missing keys count as 0), use .elements() to get each element repeated as often as its count,
and most_common(n) for the n most frequent elements with their counts.
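As a sketch, counting words in a made-up sentence:

```python
from collections import Counter

words = Counter("the quick brown fox jumps over the lazy dog the end".split())
words.update(["fox", "fox"])   # add more observations
words.subtract(["the"])        # take one away again

print(words["the"], words["fox"], words["cat"])   # 2 3 0
top = words.most_common(2)
print(top)                                        # [('fox', 3), ('the', 2)]

# elements() repeats each key as often as its count.
letters = sorted(Counter(a=2, b=1).elements())    # ['a', 'a', 'b']
```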
deque
Deques (pronounced "deck"?!, double ended queue) permit threadsafe and fast (O(1)) appends and pops on both ends of a
list-like. You can pass the constructor a maxlen argument that will cause items to be discarded from the opposite end when the deque grows
past its limit.
It follows a list-like interface, though it adds the appendleft, extendleft and popleft methods for easier access.
Additionally, it comes with a rotate() method that rotates the deque contents n steps to the right (pass a negative n to rotate left).
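A small sketch of maxlen and rotate() in action:

```python
from collections import deque

# With maxlen, appending on the right pushes old items off the left.
recent = deque(maxlen=3)
for n in range(5):
    recent.append(n)
print(recent)                # deque([2, 3, 4], maxlen=3)

d = deque([1, 2, 3, 4, 5])
d.rotate(2)                  # two steps to the right
after_right = list(d)        # [4, 5, 1, 2, 3]
d.rotate(-1)                 # one step to the left
after_left = list(d)         # [5, 1, 2, 3, 4]

d.appendleft(0)
first = d.popleft()          # 0
last = d.pop()               # 4
```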
defaultdict
defaultdict is supremely useful, and I'm not sure there is more to say. Pass it a factory function, and you never have
to see any KeyError ever again. My favourites are list and int (though you can use Counter for most of the int
cases).
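The two favourites, sketched with invented sample data:

```python
from collections import defaultdict

# list factory: group things without checking for the key first
by_length = defaultdict(list)
for word in ["ant", "bee", "wasp", "moth", "fly"]:
    by_length[len(word)].append(word)

# int factory: missing keys start at 0
counts = defaultdict(int)
for letter in "banana":
    counts[letter] += 1

print(dict(by_length))   # {3: ['ant', 'bee', 'fly'], 4: ['wasp', 'moth']}
print(dict(counts))      # {'b': 1, 'a': 3, 'n': 2}
```

One thing to keep in mind: merely looking up a missing key with [] inserts it, so use `in` or .get() if you only want to peek.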
namedtuple
namedtuples are slightly better than peppering everything with just unstructured tuples and dicts. Instantiate them
like namedtuple('Point', ['x', 'y']), then use them as a real class: Point(11, y=30). The field definition can also,
for some reason, be a single string with comma- or whitespace-separated values. Since they are full-featured classes,
you can also use them as base class for inheritance.
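A sketch of both uses. Point3D and its norm1() method are hypothetical, only there to show inheritance.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(11, y=30)

x, y = p                   # still unpacks like a tuple
moved = p._replace(x=0)    # tuples are immutable, so this returns a new Point
as_dict = p._asdict()      # {'x': 11, 'y': 30}

# Field names as one whitespace-separated string, used as a base class.
class Point3D(namedtuple("Point3D", "x y z")):
    __slots__ = ()         # keep instances as light as plain tuples

    def norm1(self):
        return abs(self.x) + abs(self.y) + abs(self.z)

print(p, moved, Point3D(1, -2, 3).norm1())
```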
OrderedDict
OrderedDict is important if you're unable to upgrade to modern Python. My sympathies.
… No, sorry. It still has things like move_to_end(), popitem() with a direction indicator, and better ordering
performance characteristics.
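The reordering helpers, as a quick sketch:

```python
from collections import OrderedDict

od = OrderedDict(a=1, b=2, c=3)

od.move_to_end("a")                # order is now b, c, a
od.move_to_end("c", last=False)    # order is now c, b, a
order = list(od)

oldest = od.popitem(last=False)    # pops from the front: ('c', 3)
newest = od.popitem()              # pops from the back:  ('a', 1)
```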
collections.abc
Use the abstract base classes provided in collections.abc to test whether a collection type provides a certain interface, such as Reversible,
Mapping or Sequence.
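A sketch of such checks. The describe() helper is made up for the example.

```python
from collections.abc import Mapping, Reversible, Sequence

def describe(obj):
    """Hypothetical helper: name the first matching interface."""
    if isinstance(obj, Mapping):
        return "mapping"
    if isinstance(obj, Sequence):
        return "sequence"
    return "something else"

print(describe({"a": 1}), describe([1, 2]), describe("abc"), describe({1, 2}))
# mapping sequence sequence something else

# Sets have no defined order, so they are not Reversible; lists are.
print(isinstance(set(), Reversible), isinstance([], Reversible))   # False True
```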
heapq
heapq implements priority queues as binary heaps stored in a plain list. That means heap[0] is always the smallest item, which is also
returned by heappop(). Initialize one with an empty list or run heapify() on an existing one. heappush() and
heappop() interact with the heap. If you want to do both, run heapreplace() (pop first, then push) or heappushpop() (push first, then pop). With merge() you can combine several
already sorted inputs into one sorted iterator, and you can retrieve the nlargest or nsmallest items. Insertion and removal
run in O(log n).
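A minimal walk through the main functions, with arbitrary numbers:

```python
import heapq

heap = [5, 1, 8, 3]
heapq.heapify(heap)                          # rearranges the list in place

heapq.heappush(heap, 0)
smallest = heapq.heappop(heap)               # 0

# heapreplace() pops first, then pushes.
replaced = heapq.heapreplace(heap, 9)        # returns 1, heap holds 3, 5, 8, 9
# heappushpop() pushes first, then pops.
bounced = heapq.heappushpop(heap, 2)         # 2 is the smallest, so it comes right back

top_two = heapq.nlargest(2, heap)            # [9, 8]

# merge() lazily combines sorted inputs; it returns an iterator, not a heap.
merged = list(heapq.merge([1, 4, 7], [2, 5, 8]))
```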
bisect
Maintain a sorted list without expensive resorting – uses bisecting under the hood. You can tell this one's by and for
the theorists because everything is just called x and a and lo and hi. bisect, bisect_left and
bisect_right find the correct insertion point, while insort, insort_left, insort_right handle the insertion,
too. Check the docs to see how to use bisect to find items in a list.
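A sketch of insertion points and of the lookup pattern. index_of() is a made-up name for the usual "find the exact item" recipe.

```python
import bisect

scores = [10, 20, 20, 30]

left = bisect.bisect_left(scores, 20)     # 1: before the existing 20s
right = bisect.bisect_right(scores, 20)   # 3: after the existing 20s

bisect.insort(scores, 25)                 # list stays sorted: [10, 20, 20, 25, 30]

def index_of(a, x):
    """Locate the leftmost item exactly equal to x in sorted list a."""
    i = bisect.bisect_left(a, x)
    if i != len(a) and a[i] == x:
        return i
    raise ValueError(f"{x!r} not found")

print(index_of(scores, 25))               # 3
```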
array
Arrays are like lists, but constrained to one type out of a fairly small type pool. The type is specified at creation
time. Create one with array.array(), which provides list-style interfaces. Additionally, you can check the size of a single item in bytes
with itemsize, and use byteswap() to swap the byte order on all items (if the items are 1, 2, 4, or 8 bytes long).
You can append additional items with methods like frombytes, fromfile and fromlist, and export with tobytes,
tofile and tolist. The older fromstring and tostring aliases were removed in Python 3.9.
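A small sketch using the "i" type code (signed int). The values are arbitrary.

```python
from array import array

numbers = array("i", [1, 2, 3])
numbers.append(4)

# Items of the wrong type are refused.
try:
    numbers.append(1.5)
    accepted_float = True
except TypeError:
    accepted_float = False

raw = numbers.tobytes()        # itemsize bytes per item
clone = array("i")
clone.frombytes(raw)           # round trip through bytes

print(numbers.itemsize, len(raw), clone.tolist())
```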
weakref
weakref is for people who miss playing with garbage collectors when they use Python: A weak reference lets you access
its referent, but does not keep it from being gc'd. Most types support weak references. Some built-in types, like list
and dict do not, but if you subclass them, the subclasses support weak referencing.
Usually you want to handle many weak references at a time. For this purpose, weakref provides WeakKeyDictionary,
WeakValueDictionary and WeakSet.
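A sketch of both points, assuming CPython, where an object goes away as soon as its last strong reference does (the gc.collect() call is only there as a safety net). TrackedList is a made-up subclass.

```python
import gc
import weakref

class TrackedList(list):
    """Hypothetical subclass; unlike plain list, it supports weak references."""

try:
    weakref.ref([])
    plain_list_supported = True
except TypeError:
    plain_list_supported = False

items = TrackedList([1, 2, 3])
cache = weakref.WeakValueDictionary()
cache["items"] = items
had_entry = "items" in cache       # True while a strong reference exists

del items                          # drop the only strong reference
gc.collect()
has_entry_after = "items" in cache # the entry vanished together with the object

print(plain_list_supported, had_entry, has_entry_after)
```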
Finalizing
With finalize, you can register cleanup functions to be run when an object is
gc'd (the callback runs at most once, so if you call the finalizer manually, it will not run again at collection time). Set the finalizer's atexit attribute to False if you do not want the
callback to trigger when the whole program exits. You can cancel a finalizer by calling its detach()
method.
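A minimal sketch, again assuming CPython's immediate cleanup. TempDir and the paths are invented; note that the callback gets the path string, not the object itself, because a callback that references the object would keep it alive forever.

```python
import gc
import weakref

class TempDir:
    """Hypothetical resource holder, for illustration only."""
    def __init__(self, path):
        self.path = path

cleaned = []

tmp = TempDir("/tmp/example")
fin = weakref.finalize(tmp, cleaned.append, tmp.path)
fin.atexit = False                 # do not run at interpreter exit
alive_before = fin.alive

del tmp
gc.collect()
alive_after = fin.alive            # the callback has run, the finalizer is dead

# A detached finalizer never fires.
other = TempDir("/tmp/other")
fin2 = weakref.finalize(other, cleaned.append, other.path)
fin2.detach()
del other
gc.collect()

print(cleaned)                     # ['/tmp/example']
```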
Usually, those functions and classes will be sufficient, but you can also manually create weak references with
weakref.ref(object[, callback]). If you create several references to the same object, the most recent garbage
collection callback will be called first. You can count and collect all weak references to an object with
getweakrefcount and getweakrefs.
types
Dynamic Type Creation
A module for wizards who create their own types dynamically. Use new_class given a name, base classes, keyword arguments such as a metaclass, and a callback that fills the class namespace, and
prepare_class to just work out the metaclass and create the namespace.
Use resolve_bases() to replace any entries in a bases tuple that are not classes but define __mro_entries__ with the classes that method returns.
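A sketch of new_class() and resolve_bases(). All class names and the fill_namespace callback are hypothetical.

```python
import types

class Base:
    def __init__(self, name):
        self.name = name

def fill_namespace(ns):
    """Called with the fresh class namespace; plays the role of the class body."""
    ns["greeting"] = "hello"
    ns["greet"] = lambda self: f"{self.greeting}, {self.name}"

Greeter = types.new_class("Greeter", (Base,), exec_body=fill_namespace)
print(Greeter("world").greet())                 # hello, world

# An object that is not a class but knows which classes it stands for.
class Alias:
    def __mro_entries__(self, bases):
        return (Base,)

resolved = types.resolve_bases((Alias(),))      # (Base,)
```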
Standard Interpreter Types
types includes some classes that are mostly useful for issubclass and isinstance checks, such as FunctionType,
LambdaType, GeneratorType, CoroutineType and so on.
Coroutines
With types.coroutine(), you can transform a generator function into a coroutine function.
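A minimal sketch, assuming asyncio as the event loop. pause() is a made-up name; a bare yield inside it hands control back to the loop for one iteration.

```python
import asyncio
import types

@types.coroutine
def pause():
    # A generator function; the decorator makes its result awaitable.
    yield

async def main():
    await pause()          # works because of @types.coroutine
    return "done"

result = asyncio.run(main())
print(result)              # done
```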
copy
Refreshingly, copy provides exactly copy.copy for shallow copies and copy.deepcopy for deep copies. deepcopy has
your back, and typically resolves recursion gracefully. You can change copy behaviour on your own classes by defining __copy__() and __deepcopy__().
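A sketch of the difference, plus a customised copy. Session and its fields are hypothetical.

```python
import copy

nested = {"tags": ["a", "b"]}
shallow = copy.copy(nested)        # new dict, same inner list
deep = copy.deepcopy(nested)       # new dict, new inner list
nested["tags"].append("c")

print(shallow["tags"], deep["tags"])   # ['a', 'b', 'c'] ['a', 'b']

# Self-referencing structures survive deepcopy.
loop = []
loop.append(loop)
loop_copy = copy.deepcopy(loop)

class Session:
    """Hypothetical class that refuses to copy its secret."""
    def __init__(self, user, token):
        self.user = user
        self.token = token

    def __copy__(self):
        return Session(self.user, token=None)

clone = copy.copy(Session("ann", "secret"))
```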
pprint
I disagree with the decision to put pprint and reprlib into the data types section, when they would be better placed
in the string services section.
Anyways. pprint serves to represent arbitrary data structures in a way that can be used as input to the Python
interpreter. Instantiate a PrettyPrinter and call PrettyPrinter.pprint. On the printer class, you can set an indent, width,
compact=False, sort_dicts=True, and most interestingly a depth. I can recommend using a higher indent value than
the default 1.
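Use pformat as a shortcut to instantiating a printer and receiving its result as a string, or pp or pprint to
print to stdout or an arbitrary stream. Use saferepr to get a string representation that is protected against recursive data structures: recursive references show up as <Recursion on typename with id=number>.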
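A sketch of the options mentioned above, on invented nested data:

```python
import pprint

data = {"b": [1, 2, {"deep": {"deeper": {"deepest": 1}}}], "a": "x"}

printer = pprint.PrettyPrinter(indent=4, width=40, depth=2)
printer.pprint(data)                      # prints to stdout

# depth=2 shows the dict and the list, everything below becomes {...}
text = pprint.pformat(data, depth=2)

# saferepr() marks recursive references instead of following them.
loop = []
loop.append(loop)
safe = pprint.saferepr(loop)
print(text)
print(safe)
```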
reprlib
reprlib provides an alternative reprlib.repr() that limits the size of the result. You can customize things like the
maximum array entries to be printed and the maximum recursion level. You can also decorate your __repr__() methods
with @recursive_repr(fillvalue="...") to handle how nested objects are represented.
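A short sketch of all three. Node is a hypothetical class that contains itself.

```python
import reprlib

short = reprlib.repr(list(range(100)))
print(short)                     # [0, 1, 2, 3, 4, 5, ...]

# A customised Repr instance with a tighter list limit.
tiny = reprlib.Repr()
tiny.maxlist = 2
tiny_text = tiny.repr([1, 2, 3, 4])   # '[1, 2, ...]'

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

    @reprlib.recursive_repr(fillvalue="<...>")
    def __repr__(self):
        return f"Node({self.name!r}, {self.children!r})"

root = Node("root")
root.children.append(root)       # the node is its own child
print(repr(root))                # Node('root', [<...>])
```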
enum
Module contents
Enums map their attributes ("members", symbolic constants) to unique constant values. They can be compared and iterated
over. You can use the classes Enum, IntEnum for numerical constants, IntFlag for numerical constants that should
support bitwise combination (and remain subclasses of int), Flag for the same just without the int subclassing.
A Flag combination in which no flags are set (value 0) has a boolean evaluation of False.
You can decorate an enum class with enum.unique() to ensure that each value occurs only once (names are always unique).
enum.auto() can be used in the member definition to have the value generated for you. By default you get integers starting
with 1, but you can override _generate_next_value_ on your enum class for different behaviour.
The type of the enum's members is the enum class itself, so you can use isinstance() to check for proper values. You
can iterate over enums to get all their members, or access __members__. Enum members have properties to access their
name and value. They are also hashable, so you can use them as dict keys and in sets.
Methods defined on the enum class are available on the enum members. You can subclass enums only when they don't define
any members. They can be pickled. If you use throwaway values (like auto() or object) for your values, change
__repr__ to hide that value.
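A sketch that touches most of the above. All enum classes here are invented for illustration.

```python
from enum import Enum, Flag, auto, unique

class Perm(Flag):
    READ = auto()     # 1
    WRITE = auto()    # 2
    EXEC = auto()     # 4

rw = Perm.READ | Perm.WRITE
nothing = Perm.READ & Perm.WRITE      # no flags set, value 0
print(bool(rw), bool(nothing))        # True False

class Direction(Enum):
    # Must be defined before the members that use auto().
    def _generate_next_value_(name, start, count, last_values):
        return name.lower()

    NORTH = auto()
    SOUTH = auto()

    def opposite(self):               # methods are available on members
        return Direction.SOUTH if self is Direction.NORTH else Direction.NORTH

try:
    @unique
    class Broken(Enum):
        A = 1
        B = 1                         # duplicate value, rejected by @unique
    duplicates_allowed = True
except ValueError:
    duplicates_allowed = False

lookup = {Direction.NORTH: "up"}      # members are hashable
```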
Creating enums
You create enums by subclassing Enum, and you set members by defining class attributes.
You can also create them programmatically, like Enum('Colour', 'RED BLUE GREEN'). Add a module=__name__ if you want
to pickle them.
Programmatic access
Enums are so weird. You can retrieve enum members by name with MyEnum.NAME or MyEnum["NAME"], and you can retrieve
them by value with MyEnum(VALUE). If a value occurs multiple times, you just get the first one, because all later ones
are actually just aliases to the first member with their value.
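The lookups and the alias behaviour, sketched with made-up enums:

```python
from enum import Enum

class Status(Enum):
    ACTIVE = 1
    ENABLED = 1      # same value, so this is an alias for ACTIVE
    DISABLED = 2

by_name = Status["ENABLED"]           # Status.ACTIVE
by_value = Status(1)                  # Status.ACTIVE

members = list(Status)                # aliases are skipped when iterating
all_names = list(Status.__members__)  # __members__ includes them

# The functional API; module=__name__ helps pickle find the class.
Colour = Enum("Colour", "RED BLUE GREEN", module=__name__)
print(by_name, members, all_names, Colour.BLUE.value)
```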
Comparison
You should compare enum members by identity with is and is not, though use of equality comparison is also supported.
Only IntEnum members can return True when comparing to something other than an enum member.
Enums are usually not ordered – check the docs to see how to implement an OrderedEnum base class.
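A quick sketch of those comparison rules. Size and Level are hypothetical.

```python
from enum import Enum, IntEnum

class Size(Enum):
    SMALL = 1
    LARGE = 2

class Level(IntEnum):
    LOW = 1

same = Size.SMALL is Size.SMALL       # identity is the preferred check
plain_vs_int = Size.SMALL == 1        # False: plain members never equal raw values
int_vs_int = Level.LOW == 1           # True: IntEnum members are ints

try:
    Size.SMALL < Size.LARGE
    ordered = True
except TypeError:
    ordered = False                   # plain enums define no ordering
```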
Notes
Enums are extremely weird when you think in normal Python classes, which is mostly due to their custom metaclass. Members are somewhat like instances, except that they are singletons.
graphlib
graphlib implements TopologicalSorter, a handy sorting and iterating class for hashable node elements. In
topological sort, elements are connected by directed edges. The linear sort order guarantees that of two connected
nodes, the originating node comes before the other in the sorted list. A complete topological ordering is only possible
in acyclic graphs.
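A minimal sketch with an invented build pipeline. Each key maps to the set of nodes that must come before it.

```python
from graphlib import CycleError, TopologicalSorter

graph = {
    "deploy": {"test"},
    "test": {"build"},
    "build": {"fetch"},
}
order = list(TopologicalSorter(graph).static_order())
print(order)   # ['fetch', 'build', 'test', 'deploy']

# Cycles make a complete ordering impossible.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    cyclic = False
except CycleError:
    cyclic = True
```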