December 3: Python Data Types

Strap in, this is a long one: As part of the Python Standard Library traversal, after yesterday's short post about binary data services, today we're going to look at Python data type modules. There's a lot of them: datetime, calendar, collections, heapq, bisect, array, weakref, types, copy, pprint, reprlib and enum.

Highlights

  • Don't use datetime.datetime.time(), use timetz().
  • I'll be happy to forget calendar in a few days.
  • All of collections is good and underused.
    • deques have a maxlen attribute and will discard items past that length.
  • bisect maintains sorted lists cheaply.
  • Trigger callbacks on garbage collection with weakref.finalize(obj, callback)
  • You can only weak ref list and dict subclasses, not the original types themselves.
  • types.coroutine() can turn a generator function into an awaitable coroutine.
  • If an enum has two members with the same value, the second one is an alias for the first.

datetime

Aware and naive objects

datetime is a big reason that I have not fallen out of love with Python yet. Native support for naive and aware immutable objects that block interactions between the two? Sign me up!

timedelta

timedeltas are made out of love. You use them to add intervals to dates or times, but you can also use all the usual math operations on it and compare them to other timedeltas.

date

dates use the idealized Gregorian calendar: going infinitely forwards and backwards – remember this if you need dates in Russia in 1900, and/or watch my talk about calendars. Note that dates on their own do not have a timezone, and as such are scary and misleading.

Interfaces include the usual attributes (day, month, year, weekday, you name it). Next to that, there's date.today(), date.replace(), and ctime() and strftime(), of course.

datetime

datetimes combine, you guessed it, date and time. Use today() and now() (or better utcnow()) for handy constructors. There's also utcfromtimestamp(), which you should be using instead of fromtimestamp(), because using naive datetimes is just asking for trouble. combine() dates and times. The attribute fold (0 or 1) tells you if it's the first or second occurrence of a time when it occurs twice in a day (due to daylight savings and similar edge cases).

You can add timedeltas, or get them by way of subtraction. You can compare them, which is pretty useful. You can extract the date() or time(), though you should really be using timetz().

To add a timezone to a naive datetime, use replace(tzinfo=tz). To cast a datetime to a certain timezone, use astimezone(tz). You can use tzname(), utcoffset() and dst() to retrieve timezone related data.

time

time is like datetime without the date.

tzinfo

tzinfo is an abstract class. Use instances of it, provided by other people who know what they are doing. Hopefully.

strftime and strptime

Hey, look, it's another string formatting language! There's a lot of placeholders.

zoneinfo

zoneinfo uses system timezone information, if available, and otherwise tzinfo from PyPI, if available. It implements the tzinfo abstract base class.

calendar

Have you ever said to yourself "I wish I could build a calendar grid like unix's cal does in Python"? Well, aren't you in luck: calendar.TextCalendar().prmonth(2020, 12) has you covered. There is also an HTMLCalendar and localised versions of each.

collections

ChainMap

ChainMap allows you to group mappings together, and is more efficient than chaining .update() calls. You can access the initial list of mappings in the maps attribute and update it with new_child().

Counter

Feed any iterable to a Counter, or hand it keyword arguments. You can also subtract() iterables, or update() it repeatedly. Access results via normal dictionary access, use .elements() to get the raw elements fed into the counter, and most_common(n) for more evaluation.

deque

Deques (pronounced "deck"?!, double ended queue) permit threadsafe and fast (O(n)) appends and pops to both sides of a list-like. You can pass the constructor a maxlen attribute that will cause items to be discarded when the deque grows past its limit.

It follows a list-like interface, though it adds the appendleft, extendleft and popleft methods for easier access. Additionally, it comes with a rotate() method that rotates the deque contents n steps to the left.

defaultdict

defaultdict is supremely useful, and I'm not sure there is more to say. Pass it a factory function, and you never have to see any KeyError ever again. My favourites are list and int (though you can use Counter for most of the int cases).

namedtuple

namedtuples are slightly better than peppering everything with just unstructured tuples and dicts. Instantiate them like namedtuple('Point', ['x', 'y']), then use them as a real class: Point(11, y=30). The field definition can also, for some reason, be a single string with comma- or whitespace-separated values. Since they are full-featured classes, you can also use them as base class for inheritance.

OrderedDict

OrderedDict is important if you're unable to upgrade to modern Python. My sympathies.

… No, sorry. It still has things like move_to_end(), popitem() with a direction indicator, and better ordering performance characteristics.

collections.abc

Use classes provided in collections.abc to test if a collection type has certain attributes, such as Reversible, Mapping, Sequence.

heapq

heapq implements prioritiy queues, aka binary trees. That means heap[0] is always the smallest item, which is also returned by pop(). Initialize one with an empty array or run heapify() on an existing one. heappush() and heappop() interact with the heap. If you want to do both, run heapreplace(). With merge() you can take several already sorted lists into a heap, and you can retrieve the nlargest or nsmallest items. Insertion and removal, if I understand it correctly, run in O(log n).

bisect

Maintain a sorted list without expensive resorting – uses bisecting under the hood. You can tell this one's by and for the theorists because everything is just called x and a and lo and hi. bisect, bisect_left and bisect_right finds the correct insertion point, while insort, insort_left, insort_right handle the insertion, too. Check the docs to see how to use bisect to find items in a list.

array

Arrays are like lists, but constrained to one type out of a fairly small type pool. The type is specified at creation time. Create one with array.array(), which provides list-style interfaces. Additional methods: check its byte size with array.itemsize, use byteswap() to swap the byte order on all items (if the items are 1, 2, 4, or 8 bytes long). You can append additional items with methods like frombytes, fromfile, fromstring, and export with tobytes, tofile, tostring.

weakref

weakref is for people who miss playing with garbage collectors when they use Python: A weak reference lets you access its referent, but does not keep it from being gc'd. Most types support weak references. Some built-in types, like list and dict do not, but if you subclass them, the subclasses support weak referencing.

Usually you want to handle many weak references at a time. For this purpose, weakref provides WeakKeyDictionary, WeakValueDictionary and WeakSet.

Finalizing

With finalize, you can register cleanup functions to be run when an object is gc'd (but do not call the finalizer manually – it will run at most once!). Pass atexit=False if you do not want the callback to trigger when the whole program exits. You can remove them by running the returned finalizer's detach() method.

Usually, those functions and classes will be sufficient, but you can also manually create weak references with weakref.ref(object[, callback]). If you create several references to the same object, the most recent garbage collection callback will be called first. You can count and collect all weak references to an object with getweakrefcount and getweakrefs.

types

Dynamic Type Creation

A module for wizards who create their own types dynamically. Use new_class given a name, base classes, a metaclass and so on, and prepare_class to just generate the metaclass and namespaces.

Use resolve_bases() to replace the __mro_entries__ method with the unrolled/evaluated MRO.

Standard Interpreter Types

types includes some classes that are mostly useful for issubclass and isinstance checks, such as FunctionType, LambdaType, GeneratorType, CoroutineType and so on.

Coroutines

With types.coroutine(), you can transform a generator function into a coroutine function.

copy

Refreshingly, copy provides exactly copy.copy for shallow copies and copy.deepcopy for deep copies. deepcopy has your back, and typically resolves recursion gracefully. You can change copy behaviour on your own classes.

pprint

I disagree with the decision to put pprint and reprlib into the data types section, when they would be better placed in the string services section.

Anyways. pprint serves to represent arbitrary data structures in a way that can be used as input to the Python interpreter. Instantiate a PrettyPrinter and call PrettyPrinter.pprint. On the printer class, you can set an indent, width, compact=False, sort_dicts=True, and most interestingly a depth. I can recommend using a higher indent value than the default 1.

Use pformat as a shortcut to instantiating a printer and receiving its result as a string, or pp or pprint to print to stdout or an arbitrary stream. Use saferepr to disable all recursion.

reprlib

reprlib provides an alternative reprlib.repr() that limits the size of the result. You can customize things like the maximum array entries to be printed and the maximum recursion level. You can also decorate your __repr__() methods with @recursive_repr(fillvalue="...") to handle how nested objects are represented.

enum

Module contents

Enums map their attributes ("members", symbolic constants) to unique constant values. They can be compared and iterated over. You can use the classes Enum, IntEnum for numerical constants, IntFlag for numerical constants that should support bitwise combintion (and remain subclasses of int), Flag for the same just without the int subclassing. Combinations of Flag enums that are not enum members have a boolean evaluation of False.

You can decorate an enum class with enum.unique() to ensure that each value occurs only once (names are always unique). enum.auto() can be called in the member definition to assign a value by function. By default you get integers starting with 1, but you can override _generate_next_value on your enum class for different behaviour.

The type of the enum's members is the enum class itself, so you can use isinstance() to check for proper values. You can iterate over enums to get all their members, or access __members__. Enum members have properties to access their name and value. They are also hashable, so you can use them as dict keys and in sets.

Methods defined on the enum class are available on the enum members. You can subclass enums only when they don't define any members. They can be pickled. If you use throwaway values (like auto() or object) for your values, change __repr__ to hide that value.

Creating enums

You create enums by subclassing Enum, and you set members by defining class attributes. You can also create them programmatically, like Enum('Colour', 'RED BLUE GREEN'). Add a module=__name__ if you want to pickle them.

Programmatic access

Enums are so weird. You can retrieve enum members by name with MyEnum.NAME or MyEnum["NAME"], and you can retrieve them by value with MyEnum(VALUE). If a value occurs multiple times, you just get the first one, because all later ones are actually just aliases to the first member with their value.

Comparison

You should compare enum members by identity with is and is not, though use of equality comparison is also supported. Only IntEnum members can return True when comparing to something other than an enum member. Enums are usually not ordered – check the docs to see how to implement an OrderedEnum base class.

Notes

Enums are extremely weird when you think in normal Python classes, which is mostly due to their custom metaclass. Members are somewhat like instances, except that they are singletons.

graphlib

graphlib implements TopologicalSorter, a handy sorting and iterating class for hashable node elements. In topological sort, elements are connected by directed edges. The linear sort order guarantees that of two connected nodes, the originating node comes before the other in the sorted list. A complete topological ordering is only possible in acyclic graphs.