DjangoCon Europe 2019: Logging Rethought 2: The Actions of Frank Taylor Jr.

rixx

2019-04-10

Writeup of the DjangoCon Europe 2019 talk »Logging Rethought 2: The Actions of Frank Taylor Jr.« by Markus Holtermann

Markus Holtermann: Markus Holtermann works as a back-end and infrastructure engineer at Crate.io. He has been a Django core contributor since early 2015. He is a member of the Django security team as well as an organizer and advisor for DjangoCon conferences. Markus has been a project lead at the German ubuntuusers.de community support platform where he discovered Python and Django in 2010.

The current state of logging

The current state of logging is … okay? It certainly works, but there is much to be improved. Logging currently probably looks like this:

import logging
logger = logging.getLogger(__name__)
logger.error('Component xxx failed due to a timeout')

This is produces a timestamped message, but it doesn't necessarily help you figure out later what exactly lead to this issue. There is context missing – local variables, local context, helping you catch configuration mistakes, bugs in code, affected users/connections/areas. Did other services have the same or similar problems? At the same time, just jamming ALL the information in there will make the log messages hard to read and understand.

Structure your data

Instead we should be adding structure to our logging. There is a library called stuctlog that provides this.

import structlog
logger = structlog.getLogger('structlog')
logger.error(
    'auth_provider_failed',
    provder_name=provider_name,
    provider_ip=provider_ip,
    timeout=timeout,
)

You can now export those errors as JSON, and then load them anywhere, query them using the tools of your choice. You can visualize your errors now – because what you see is what you understand. It allows you to look for and find co-occurrences, and relationships between errors.

Tracing events

Zooming out from per-server logging, let's consider a big microservice infrastructure. Reasoning about microservices is hard. Figuring out how errors and events relate to each other is even harder. Attaching an ID to an incoming request, and then keeping it for all related data.

import uuid, structlog

logger = structlog.get_logger('structlog')

def structlog_middleware(get_response):  # Use as first middleware
    def _inner(request):
        request.trace_id = request.headers.get('x_trace_id') or str(uuid.uuid4)
        log = logger.new(trace_id=request.trace_id)
        return get_response(request)
    return _inner

def structlog_user_middleware(get_response):  # Use after authentication middleware
    def _inner(request):
        if request.user.is_authenticated:
            logger.bind(logger, user.id)
        return get_response(request)
    return _inner

Using middlewares like these allow you to trace events, both within a service and across services. (But be mindful what you log and where, both to protect your own data and secrets, but to )

Frank Taylor Jr

Frank William Abagnale Jr is the main character in "Catch me if you can", a conman movie, who nobody could trace. Don't let Frank William Abagnale Jr get away in your system – log smart, structured data to get more intelligence about your users and your servers. Check out the demo repo once it's public, which includes the code, a grafana dashboard, and lots of other helpful tools.