Analyzing public data with Datasette: Django tickets

Note: Early draft, structure and voice and basically everything is still up for debate.

Note: To me, the technical details of scraping and running a setup are hard to put into a nice flow of writing that anybody would want to read, so I put them into explicit appendices. Everybody who's interested can read them; everybody who already knows how to do these things can stop after the fun dataviz stuff.

Note: Should I move the datasette to some path under data.rixx.de, so that it won't conflict with future datasette experiments?

Note: This is the first long and structured blog post in a while (the HTTP codes one was long, but not separated into sections; most other stuff is short or storytelling). I should probably get a sidebar table of contents ready before I publish this.

Note: Especially in the later parts I mix "I want", "you want", and "we want".

BEGIN BLOG POST

END BLOG POST

So that's it – thank you for reading, and thanks to Simon Willison, who not only built Datasette, but also helped out on Twitter and GitHub with advice and patience and enthusiasm, and was kind enough to proofread this blog post.

Appendix: Collecting the data

I got the data from the Django ticket system. It's a Trac instance and offers an RPC API. I wrote some very naive code to run through all ticket IDs, get the ticket's data, and dump it into an SQLite database.

This script is very much not optimized. It could have committed to the database less often, or run in parallel, or done all sorts of things. Instead it ran for a couple of hours, and that was intentional: well-behaved web scrapers should not be too aggressive, and should be mindful of the load they place on the systems they are targeting. XXX TODO LINK THE GOOD WEB SCRAPER POST I CANNOT FIND RIGHT NOW.
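For illustration, "not too aggressive" can be as simple as pausing between requests and backing off when the server struggles. This is a sketch with made-up numbers and a placeholder URL, not the code I actually ran:

import time

import requests


def fetch_politely(url, *, session, pause=1.0, retries=3):
    # Back off increasingly hard when the server returns errors.
    for attempt in range(retries):
        response = session.get(url)
        if response.status_code < 500:
            return response
        time.sleep(pause * 2 ** attempt)
    return response


with requests.Session() as session:
    for ticket_id in range(1, 101):
        fetch_politely(f"https://example.com/tickets/{ticket_id}", session=session)
        time.sleep(1)  # a fixed pause between requests keeps the load predictable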

You can see the complete code here, but the basic flow is this: we create a database if one doesn't exist already. Then we iterate over all tickets in our given range, request each ticket from the API, and if we get a valid response, we transform the JSON to SQL and push it to the database. Afterwards I saw that there is great tooling to put JSON into SQLite databases, so I could have cut that last step, but I'm overall fairly happy that I got to write some simple SQL, so no harm done.
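One such tool is sqlite-utils (which also makes an appearance in the deployment appendix). Here's a rough sketch of what the JSON-to-SQL step could have looked like with it – the details are my assumption, not what my script actually did:

import sqlite_utils

db = sqlite_utils.Database("django_tickets.db")


def store_ticket_with_sqlite_utils(data):
    # data is the parsed JSON-RPC response; ticket.get returns
    # [id, time_created, time_changed, attributes].
    ticket_id = data["result"][0]
    ticket_data = data["result"][-1]
    # insert() creates the table and derives the columns from the dict;
    # alter=True adds columns that appear later, replace=True makes reruns safe.
    db["tickets"].insert(
        {"id": ticket_id, **ticket_data}, pk="id", alter=True, replace=True
    )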

I simplified the code below and removed a bit of error handling – you can find the full version here.

import json
import sqlite3
from datetime import datetime

import requests

BASE_URL = "https://code.djangoproject.com/jsonrpc"
DJANGO_MAX_TICKET = 30588


def get_ticket(*, ticket_id, session):
    # Trac's JSON-RPC endpoint takes the method name and parameters in a POST body.
    response = session.post(
        BASE_URL,
        json.dumps(
            {
                "method": "ticket.get",
                "params": [ticket_id],
                "id": ticket_id,
            }
        ),
        headers={"Content-Type": "application/json"},
    )
    return response.json()


def store_ticket(*, cursor, data):
    # ticket.get returns [id, time_created, time_changed, attributes].
    ticket_id = data["result"][0]
    ticket_data = data["result"][-1]
    insert_keys = […]
    insert_data = {
        key: get_value_from_data(key=key, value=ticket_data[key])
        for key in insert_keys
    }
    cursor.execute(
        f"""
        INSERT INTO tickets ({", ".join(insert_data.keys())})
        VALUES ({", ".join("?" for _ in range(len(insert_data)))})
        """,
        list(insert_data.values()),
    )


def get_value_from_data(*, key, value):
    # Trac serialises datetimes as {"__jsonclass__": ["datetime", "2019-01-01T00:00:00"]}.
    if key in ["changetime", "created"]:
        return datetime.strptime(
            value["__jsonclass__"][-1],
            "%Y-%m-%dT%H:%M:%S",
        )
    if key in […]:
        return bool(int(value))
    return value


def create_db(*, connection):
    cursor = connection.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS tickets (…)")
    connection.commit()


def collect_data(*, connection, start=None, end=None, total=None):
    cursor = connection.cursor()
    start = start or 1
    if not end and total:
        end = start + total - 1
    if not end:
        end = DJANGO_MAX_TICKET
    with requests.Session() as session:
        for ticket_id in range(start, end + 1):
            data = get_ticket(ticket_id=ticket_id, session=session)
            store_ticket(data=data, cursor=cursor)
            # Committing after every single ticket is slow, but an interrupted run loses nothing.
            connection.commit()


def main():
    connection = sqlite3.connect("django_tickets.db")
    create_db(connection=connection)
    collect_data(connection=connection, start=28369)


if __name__ == "__main__":
    main()

Appendix: Deploying Datasette

I think I deployed this Datasette to data.rixx.de in about five minutes – but that's because I have plenty of templates for deploying Python web applications, so all I did was copy two files and execute a couple of commands. Since I vividly remember how often I was incredibly frustrated before I had these templates, let me share the process with you.

But first, a word of advice: if you're playing around with Datasette and you want to share the things you build, there is absolutely no need to deploy it on your own server or VPS. Datasette comes with its own datasette publish command, which supports deployment to Heroku, Google Cloud, or Zeit Now. Simon also has instructions on running Datasette on glitch.com.
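The Heroku route, for example, is a single command (assuming you have the Heroku CLI installed and are logged in):

$ datasette publish heroku django_tickets.db

Datasette then packages up the database and deploys a small application that serves it for you.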

Installation

But I enjoy running things on my own server, so this is how I did it! There are many different ways of installing and running Python tooling on servers, and I found a setup that has been stable for me: for each tool, I create a new user and install the tool and all its dependencies for that user. This way I can't forget to activate a virtualenv, and I never have to figure out whether the current Python version shipped with venv or not. It also gives me isolation in a way that all Linux tooling supports.

# useradd -m datasette

The -m flag creates a home directory, which is not necessary, but nice. For consistency you could put the home directory at /var/www/datasette or /usr/share/webapps/datasette, but I just stuck with the default /home/datasette. Security note: my servers have an explicit list of users allowed to log in via SSH, so allowing this user to run a shell is not a big problem.

Next, I upload my data, give it to the datasette user, and install the necessary tools. # lines are root shells, and $ lines belong to the datasette user:

# cp data.csv ~datasette
# chown datasette:datasette ~datasette/data.csv
# sudo -iu datasette
$ pip install --user datasette sqlite-utils datasette-vega
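A side note on formats: the systemd service below points at data.db, so if what you uploaded is a CSV like in the example above, convert it first. sqlite-utils does this in one line – the table name tickets is my assumption, use whatever fits your data:

$ sqlite-utils insert data.db tickets data.csv --csv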

Running Datasette

We need a way of starting the Datasette process reliably. On the Linux distributions I'm running, systemd services are an easy and readable way to go.

I chose a port randomly to avoid collisions with other tools running in the 800x range. Take the time to make sure that you're running a firewall like ufw or iptables that blocks all ports except the ones you want exposed publicly.
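With ufw, a minimal version of that policy could look like this – a sketch, not a copy-paste recipe, and mind that you don't lock yourself out of SSH:

# ufw default deny incoming
# ufw allow OpenSSH
# ufw allow 80/tcp
# ufw allow 443/tcp
# ufw enable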

Our systemd service is a plain text file placed at /etc/systemd/system/datasette.service, and it looks like this:

[Unit]
Description=datasette server application
After=network.target

[Service]
User=datasette
WorkingDirectory=/home/datasette
ExecStart=/home/datasette/.local/bin/datasette -p 21474 /home/datasette/data.db
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID

[Install]
WantedBy=multi-user.target

Now we can start, stop, and restart Datasette like this, and it will also start automatically after a reboot.

# systemctl daemon-reload
# systemctl enable datasette
# systemctl start datasette

As a bonus, we can look at any logs produced by Datasette by running journalctl -u datasette. journalctl is a powerful tool – you can, for example, add flags like --since="10 minutes ago".

Serving Datasette

I'm using nginx as a web server, which has proven stable and pleasant to work with for me. The last command in my /etc/nginx/nginx.conf configuration file is include /etc/nginx/sites/*.conf;, so that I can have an individual file for each subdomain or project. My /etc/nginx/sites/datasette.conf looks like this:

server {
    listen 443 ssl;
    listen [::]:443 ssl;
    server_name data.rixx.de;

    ssl_certificate /etc/ssl/letsencrypt/certs/data.rixx.de/fullchain.pem;
    ssl_certificate_key /etc/ssl/letsencrypt/certs/data.rixx.de/privkey.pem;

    access_log  /var/log/nginx/data.rixx.de.access.log;
    error_log   /var/log/nginx/data.rixx.de.error.log;

    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    location / {
        proxy_pass http://localhost:21474;
    }
}

For further information, have a look at Mozilla's web server config generator, and the links provided at the bottom of that page. Going into SSL key/cert generation would be a bit much at this point, but I can recommend dehydrated as one of several good ways of generating HTTPS certificates with Let's Encrypt.

And finally, once you've reached this point, take a couple of minutes and run gixy over your nginx configuration to make sure that it does what you think it does – no more, and no less.
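gixy is a Python tool, so it installs the same way as everything else here, and pointing it at the main config file is enough – it follows the include directives on its own:

$ pip install --user gixy
$ gixy /etc/nginx/nginx.conf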

Congrats, you made it! If you want to know more about any of the points above, tell me (here or in the fediverse)!