December 3: Python Data Types
Strap in, this is a long one: As part of the Python Standard Library
traversal, after yesterday's short post about binary
data services, today we're going to look at Python data type modules.
There's a lot of them: datetime
, calendar
, collections
, heapq
, bisect
, array
, weakref
, types
, copy
,
pprint
, reprlib
and enum
.
Highlights
- Don't use
datetime.datetime.time()
, usetimetz()
. - I'll be happy to forget
calendar
in a few days. - All of
collections
is good and underused.- deques have a
maxlen
attribute and will discard items past that length.
- deques have a
- bisect maintains sorted lists cheaply.
- Trigger callbacks on garbage collection with
weakref.finalize(obj, callback)
- You can only weak ref
list
anddict
subclasses, not the original types themselves. -
types.coroutine()
can turn a generator function into an awaitable coroutine. - If an enum has two members with the same value, the second one is an alias for the first.
datetime
Aware and naive objects
datetime
is a big reason that I have not fallen out of love with Python yet. Native support for naive and aware
immutable objects that block interactions between the two? Sign me up!
timedelta
timedeltas are made out of love. You use them to add intervals to dates or times, but you can also use all the usual math operations on it and compare them to other timedeltas.
date
dates use the idealized Gregorian calendar: going infinitely forwards and backwards – remember this if you need dates in Russia in 1900, and/or watch my talk about calendars. Note that dates on their own do not have a timezone, and as such are scary and misleading.
Interfaces include the usual attributes (day, month, year, weekday, you name it). Next to that, there's date.today()
,
date.replace()
, and ctime()
and strftime()
, of course.
datetime
datetimes combine, you guessed it, date and time. Use today()
and now()
(or better utcnow()
) for handy
constructors. There's also utcfromtimestamp()
, which you should be using instead of fromtimestamp()
, because using
naive datetimes is just asking for trouble. combine()
dates and times. The attribute fold
(0 or 1) tells you if it's
the first or second occurrence of a time when it occurs twice in a day (due to daylight savings and similar edge cases).
You can add timedeltas, or get them by way of subtraction. You can compare them, which is pretty useful. You can extract
the date()
or time()
, though you should really be using timetz()
.
To add a timezone to a naive datetime, use replace(tzinfo=tz)
. To cast a datetime to a certain timezone, use
astimezone(tz)
. You can use tzname()
, utcoffset()
and dst()
to retrieve timezone related data.
time
time
is like datetime
without the date
.
tzinfo
tzinfo
is an abstract class. Use instances of it, provided by other people who know what they are doing. Hopefully.
strftime and strptime
Hey, look, it's another string formatting language! There's a lot of placeholders.
zoneinfo
zoneinfo
uses system timezone information, if available, and otherwise tzinfo
from PyPI, if available. It implements
the tzinfo
abstract base class.
calendar
Have you ever said to yourself "I wish I could build a calendar grid like unix's cal
does in Python"? Well, aren't you
in luck: calendar.TextCalendar().prmonth(2020, 12)
has you covered. There is also an HTMLCalendar
and localised
versions of each.
collections
ChainMap
ChainMap
allows you to group mappings together, and is more efficient than chaining .update()
calls. You can access
the initial list of mappings in the maps
attribute and update it with new_child()
.
Counter
Feed any iterable to a Counter
, or hand it keyword arguments. You can also subtract()
iterables, or update()
it
repeatedly. Access results via normal dictionary access, use .elements()
to get the raw elements fed into the counter,
and most_common(n)
for more evaluation.
deque
Deques (pronounced "deck"?!, double ended queue) permit threadsafe and fast (O(n)
) appends and pops to both sides of a
list-like. You can pass the constructor a maxlen
attribute that will cause items to be discarded when the deque grows
past its limit.
It follows a list-like interface, though it adds the appendleft
, extendleft
and popleft
methods for easier access.
Additionally, it comes with a rotate()
method that rotates the deque contents n steps to the left.
defaultdict
defaultdict is supremely useful, and I'm not sure there is more to say. Pass it a factory function, and you never have
to see any KeyError
ever again. My favourites are list
and int
(though you can use Counter
for most of the int
cases).
namedtuple
namedtuples
are slightly better than peppering everything with just unstructured tuples and dicts. Instantiate them
like namedtuple('Point', ['x', 'y'])
, then use them as a real class: Point(11, y=30)
. The field definition can also,
for some reason, be a single string with comma- or whitespace-separated values. Since they are full-featured classes,
you can also use them as base class for inheritance.
OrderedDict
OrderedDict
is important if you're unable to upgrade to modern Python. My sympathies.
… No, sorry. It still has things like move_to_end()
, popitem()
with a direction indicator, and better ordering
performance characteristics.
collections.abc
Use classes provided in collections.abc
to test if a collection type has certain attributes, such as Reversible
,
Mapping
, Sequence
.
heapq
heapq
implements prioritiy queues, aka binary trees. That means heap[0]
is always the smallest item, which is also
returned by pop()
. Initialize one with an empty array or run heapify()
on an existing one. heappush()
and
heappop()
interact with the heap. If you want to do both, run heapreplace()
. With merge()
you can take several
already sorted lists into a heap, and you can retrieve the nlargest
or nsmallest
items. Insertion and removal, if
I understand it correctly, run in O(log n)
.
bisect
Maintain a sorted list without expensive resorting – uses bisecting under the hood. You can tell this one's by and for
the theorists because everything is just called x
and a
and lo
and hi
. bisect
, bisect_left
and
bisect_right
finds the correct insertion point, while insort
, insort_left
, insort_right
handle the insertion,
too. Check the docs to see how to use bisect
to find items in a list.
array
Arrays are like lists, but constrained to one type out of a fairly small type pool. The type is specified at creation
time. Create one with array.array()
, which provides list-style interfaces. Additional methods: check its byte size
with array.itemsize
, use byteswap()
to swap the byte order on all items (if the items are 1, 2, 4, or 8 bytes long).
You can append additional items with methods like frombytes
, fromfile
, fromstring
, and export with tobytes
,
tofile
, tostring
.
weakref
weakref
is for people who miss playing with garbage collectors when they use Python: A weak reference lets you access
its referent, but does not keep it from being gc'd. Most types support weak references. Some built-in types, like list
and dict
do not, but if you subclass them, the subclasses support weak referencing.
Usually you want to handle many weak references at a time. For this purpose, weakref
provides WeakKeyDictionary
,
WeakValueDictionary
and WeakSet
.
Finalizing
With finalize
, you can register cleanup functions to be run when an object is
gc'd (but do not call the finalizer manually – it will run at most once!). Pass atexit=False
if you do not want the
callback to trigger when the whole program exits. You can remove them by running the returned finalizer's detach()
method.
Usually, those functions and classes will be sufficient, but you can also manually create weak references with
weakref.ref(object[, callback])
. If you create several references to the same object, the most recent garbage
collection callback will be called first. You can count and collect all weak references to an object with
getweakrefcount
and getweakrefs
.
types
Dynamic Type Creation
A module for wizards who create their own types dynamically. Use new_class
given a name, base classes, a metaclass and
so on, and prepare_class
to just generate the metaclass and namespaces.
Use resolve_bases()
to replace the __mro_entries__
method with the unrolled/evaluated MRO.
Standard Interpreter Types
types
includes some classes that are mostly useful for issubclass
and isinstance
checks, such as FunctionType
,
LambdaType
, GeneratorType
, CoroutineType
and so on.
Coroutines
With types.coroutine()
, you can transform a generator function into a coroutine function.
copy
Refreshingly, copy
provides exactly copy.copy
for shallow copies and copy.deepcopy
for deep copies. deepcopy
has
your back, and typically resolves recursion gracefully. You can change copy behaviour on your own classes.
pprint
I disagree with the decision to put pprint
and reprlib
into the data types section, when they would be better placed
in the string services section.
Anyways. pprint
serves to represent arbitrary data structures in a way that can be used as input to the Python
interpreter. Instantiate a PrettyPrinter
and call PrettyPrinter.pprint
. On the printer class, you can set an indent
, width
,
compact=False
, sort_dicts=True
, and most interestingly a depth
. I can recommend using a higher indent value than
the default 1
.
Use pformat
as a shortcut to instantiating a printer and receiving its result as a string, or pp
or pprint
to
print to stdout or an arbitrary stream. Use saferepr
to disable all recursion.
reprlib
reprlib
provides an alternative reprlib.repr()
that limits the size of the result. You can customize things like the
maximum array entries to be printed and the maximum recursion level. You can also decorate your __repr__()
methods
with @recursive_repr(fillvalue="...")
to handle how nested objects are represented.
enum
Module contents
Enums map their attributes ("members", symbolic constants) to unique constant values. They can be compared and iterated
over. You can use the classes Enum
, IntEnum
for numerical constants, IntFlag
for numerical constants that should
support bitwise combintion (and remain subclasses of int
), Flag
for the same just without the int
subclassing.
Combinations of Flag
enums that are not enum members have a boolean evaluation of False
.
You can decorate an enum class with enum.unique()
to ensure that each value occurs only once (names are always unique).
enum.auto()
can be called in the member definition to assign a value by function. By default you get integers starting
with 1
, but you can override _generate_next_value
on your enum class for different behaviour.
The type
of the enum's members is the enum class itself, so you can use isinstance()
to check for proper values. You
can iterate over enums to get all their members, or access __members__
. Enum members have properties to access their
name
and value
. They are also hashable, so you can use them as dict keys and in sets.
Methods defined on the enum class are available on the enum members. You can subclass enums only when they don't define
any members. They can be pickled. If you use throwaway values (like auto()
or object
) for your values, change
__repr__
to hide that value.
Creating enums
You create enums by subclassing Enum
, and you set members by defining class attributes.
You can also create them programmatically, like Enum('Colour', 'RED BLUE GREEN')
. Add a module=__name__
if you want
to pickle them.
Programmatic access
Enums are so weird. You can retrieve enum members by name with MyEnum.NAME
or MyEnum["NAME"]
, and you can retrieve
them by value with MyEnum(VALUE)
. If a value occurs multiple times, you just get the first one, because all later ones
are actually just aliases to the first member with their value.
Comparison
You should compare enum members by identity with is
and is not
, though use of equality comparison is also supported.
Only IntEnum
members can return True
when comparing to something other than an enum member.
Enums are usually not ordered – check the docs to see how to implement an OrderedEnum
base class.
Notes
Enums are extremely weird when you think in normal Python classes, which is mostly due to their custom metaclass. Members are somewhat like instances, except that they are singletons.
graphlib
graphlib
implements TopologicalSorter
, a handy sorting and iterating class for hashable node elements. In
topological sort, elements are connected by directed edges. The linear sort order guarantees that of two connected
nodes, the originating node comes before the other in the sorted list. A complete topological ordering is only possible
in acyclic graphs.