Setting up Datasette, step by step
Last week, I showed off how I exported my Goodreads data to SQLite, and you can see the data at data.rixx.de. I use Datasette to make the data available to browse, query, plot, export, etc. Datasette is a brilliant project, because it makes data available to you in an interactive way, and allows you to share original data and analysis in a transparent bundle.
I think I deployed this Datasette to data.rixx.de in about five minutes – but that's because I have plenty of templates for the deployment of Python web applications, and I only copied two files and executed a couple of commands. Since I vividly remember being incredibly frustrated more often than I can count when I didn't have these templates, let me share the process with you.
First, a word of advice: If you're playing around with Datasette and you want to share the things you build, there is absolutely no need to deploy it on your own server/VPS. Datasette comes with its own datasette publish command, which supports deployment to Heroku, Google Cloud, and Zeit Now, and has a Docker setup. Simon Willison, the author of Datasette, also has instructions on running Datasette on glitch.com.
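To give you an idea of how little work that route is, here is a sketch of the hosted alternative – it assumes you have the Heroku CLI installed and are logged in, and that your database file is called data.db:

```shell
# Publish the database straight to Heroku; Datasette builds
# and deploys the app for you:
datasette publish heroku data.db

# Or bundle database and server into a local Docker image instead:
datasette package data.db -t my-datasette-demo
```

If that covers your needs, you can stop reading here.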
Setup
User creation
But I enjoy running things on my own server, so this is how I did it! There are many different ways of installing and running Python web tools on servers, but this is mine: For each tool, create a new user, and install all tooling and dependencies for that user.
This has the advantage that you can't forget to activate a virtualenv (server-side debugging is annoying enough without those hassles), and that you don't have to figure out whether the current Python version shipped with venv or not. It also gives you isolation in a way that all Linux tooling supports: systemd services, ansible roles, shell scripts, … everything that interacts with Linux-oid systems knows about users and access rights.
# useradd -m datasette
The -m flag creates a home directory. For consistency you could put the home directory at /var/www/datasette or /usr/share/webapps/datasette, but sticking with the default /home/datasette is fine, too.
Security note: My servers have an explicit list of users allowed to log in via SSH, so allowing this user to run a shell is not a big problem. Your setup may differ, so take a moment to find out if new users have SSH access by default on your system, or if there are any include-all sudo rules, etc.
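Whether the new account can actually log in remotely depends on your sshd and sudo configuration. A quick check might look like this – a sketch only, since the exact directives vary between setups:

```shell
# Does sshd restrict logins to an explicit list of users or groups?
# If nothing matches, every user with a valid shell may be able to log in.
sudo grep -E '^(AllowUsers|AllowGroups|DenyUsers|DenyGroups)' /etc/ssh/sshd_config

# Are there any sweeping sudo rules that would apply to the new user?
sudo grep -rE '^[^#].*ALL' /etc/sudoers /etc/sudoers.d/
```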
Data upload
Next, upload your data. Since Datasette works on any SQLite database, that's just copying a single database file via the tool and protocol of your choice, such as scp, rsync, sftp, …
Hand the file to the datasette user, and don't forget (second try, at least) to change the ownership:

# cp data.db ~datasette
# chown datasette:datasette ~datasette/data.db
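Before wiring anything up, it's worth a quick sanity check that the file really is a readable SQLite database. A sketch, assuming the sqlite3 command-line tool is installed and the file is at ~datasette/data.db (the path the systemd unit below expects):

```shell
# List the tables as the datasette user; an error here means either
# the ownership or the file itself is wrong.
sudo -u datasette sqlite3 ~datasette/data.db '.tables'
```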
Installation
Next, switch to the datasette user and install the necessary tools – I highly recommend installing datasette-vega, which gives Datasette integrated and practical plotting powers. Generating charts is good, and these here are particularly configurable, and provide exports/downloads as PNG and SVG, to boot.
# sudo -i -u datasette
$ pip install --user datasette sqlite-utils datasette-vega
By convention, # lines are root shells, and $ lines belong to the datasette user. ~datasette is a shortcut for "the home directory of the datasette user" – very useful, because it doesn't require you to remember if you configured a non-standard home directory, or to figure out where on somebody else's servers the home directories are located.
sudo -i -u datasette logs you in as the datasette user and tries to set up a regular login shell environment for that user. Running pip (the Python package installer) with the --user flag installs Python packages only for the currently active user account. This requires no superuser access (Python package installation does execute arbitrary code, after all), and does not pollute the global Python installation with user-local dependencies.
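A quick way to confirm the install worked, and to see where the entry points ended up – a sketch, relying on ~/.local/bin being pip's default --user script directory on Linux:

```shell
$ ~/.local/bin/datasette --version

# --user installs are only on PATH if your shell is configured for it;
# the systemd unit below uses the full path, so this is merely convenient:
$ echo "$PATH" | tr ':' '\n' | grep -F .local/bin || echo "~/.local/bin is not on PATH"
```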
Running Datasette
You need a way of starting the Datasette process reliably. On the Linux distributions I'm running, systemd services are an easy and readable way to go.
Security note: Take the time to make sure that you're running a firewall like ufw or iptables which blocks all ports except for the ones you want to expose publicly – usually your SSH port (default: 22) and the HTTP(S) ports 80 and 443.
Choose a port randomly to avoid collisions with other tools running in the 800x range (because that's where lots of Python web services set up camp). Your systemd service is a plain text file placed at /etc/systemd/system/datasette.service, and it looks like this:
[Unit]
Description=datasette server application
After=network.target

[Service]
User=datasette
WorkingDirectory=/home/datasette
ExecStart=/home/datasette/.local/bin/datasette -p 21474 /home/datasette/data.db
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID

[Install]
WantedBy=multi-user.target
Now you can start, stop, and restart Datasette like this, and it will also start automatically after a reboot:
# systemctl daemon-reload
# systemctl enable datasette
# systemctl start datasette
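To confirm the service actually came up, you can ask systemd and then poke the port directly – assuming port 21474 from the unit file above:

```shell
# Is the process running, and what did it log on startup?
# systemctl status datasette

# Does it answer HTTP on its local port? (Only the response
# status line is interesting here.)
# curl -sI http://localhost:21474/ | head -n 1
```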
As a free bonus, you can look at any logs produced by Datasette by running journalctl -u datasette. Journalctl is a powerful tool – you can for example add flags like --since="10 minutes ago", which is a very useful shorthand for "oops, I broke it, what happened".
Serving Datasette
nginx
I'm using nginx as a web server, which has proven stable and pleasant to use for me. The last command in my /etc/nginx/nginx.conf configuration file is include /etc/nginx/sites/*.conf;, so that I can have an individual file for each subdomain or project, all located in /etc/nginx/sites.
My /etc/nginx/sites/datasette.conf takes incoming requests, notes that they arrived via a proxy and via HTTPS, and hands them off to the Datasette process. It looks like this:
server {
    listen 443 ssl;
    listen [::]:443 ssl;

    server_name data.rixx.de;

    ssl_certificate /etc/ssl/letsencrypt/certs/data.rixx.de/fullchain.pem;
    ssl_certificate_key /etc/ssl/letsencrypt/certs/data.rixx.de/privkey.pem;

    access_log /var/log/nginx/data.rixx.de.access.log;
    error_log /var/log/nginx/data.rixx.de.error.log;

    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    location / {
        proxy_pass http://localhost:21474;
    }
}
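After dropping the file in place, nginx needs to validate and reload its configuration – a syntax check first saves you from taking the site down with a typo:

```shell
# Validate the configuration without touching the running server:
# nginx -t

# If the check passes, apply it without dropping connections:
# systemctl reload nginx
```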
Security note: Once you've reached this point, take a couple of minutes and run gixy against your nginx configuration to make sure that it does what you think it does – no more, and no less.
Further reading
For further information, have a look at Mozilla's web server configuration generator, and the links provided at the bottom of that page. Going into SSL key/cert generation would be a bit much at this point, but I can recommend dehydrated as one of several good ways of generating free HTTPS certificates with Let's Encrypt.
Congrats, you made it! If you want to know more about any of the points above, tell me on Twitter, in the fediverse, or via mail.