Setting up Datasette, step by step
Last week, I showed off how I exported my Goodreads data to SQLite, and you can see the data at data.rixx.de. I use Datasette to make the data available to browse, query, plot, export, etc. Datasette is a brilliant project, because it makes data available to you in an interactive way, and allows you to share original data and analysis in a transparent bundle.
I think I deployed this Datasette to data.rixx.de in about five minutes – but that's because I have plenty of templates for the deployment of Python web applications, and I only copied two files and executed a couple of commands. Since I vividly remember being incredibly frustrated more often than I can count when I didn't have these templates, let me share the process with you.
First, a word of advice: If you're playing around with Datasette and you want to share the things you build, there is absolutely no need to deploy it on your own server/VPS. Datasette comes with its own datasette publish command, which supports deployment to Heroku, Google Cloud, and Zeit Now, and has a Docker setup. Simon Willison, the author of Datasette, also has instructions on running Datasette on glitch.com.
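To give you an idea of how little work that route is, here is a sketch of the hosted alternative – it assumes you have the Heroku CLI installed and are logged in, and that your database file is called data.db:

```shell
# Publish the database straight to Heroku; Datasette builds
# and deploys the app for you:
datasette publish heroku data.db

# Or bundle database and server into a local Docker image instead:
datasette package data.db -t my-datasette-demo
```

If that covers your needs, you can stop reading here.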
Setup
User creation
But I enjoy running things on my own server, so this is how I did it! There are many different ways of installing and running Python web tools on servers, but this is mine: For each tool, create a new user, and install all tooling and dependencies for that user.
This has the advantage that you can't forget to activate a virtualenv (server-side debugging is annoying enough without those hassles), and that you don't have to figure out whether the current Python version shipped with venv or not. It also gives you isolation in a way that all Linux tooling supports: systemd services, ansible roles, shell scripts, … everything that interacts with Linux-oid systems knows about users and access rights.
# useradd -m datasette
The -m flag creates a home directory. For consistency you could put the home directory at /var/www/datasette or /usr/share/webapps/datasette, but sticking with the default /home/datasette is fine, too.
Security note: My servers have an explicit list of users allowed to log in via SSH, so allowing this user to run a shell is not a big problem. Your setup may differ, so take a moment to find out if new users have SSH access by default on your system, or if there are any include-all sudo rules, etc.
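Whether the new account can actually log in remotely depends on your sshd and sudo configuration. A quick check might look like this – a sketch only, since the exact directives vary between setups:

```shell
# Does sshd restrict logins to an explicit list of users or groups?
# If nothing matches, every user with a valid shell may be able to log in.
sudo grep -E '^(AllowUsers|AllowGroups|DenyUsers|DenyGroups)' /etc/ssh/sshd_config

# Are there any sweeping sudo rules that would apply to the new user?
sudo grep -rE '^[^#].*ALL' /etc/sudoers /etc/sudoers.d/
```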
Data upload
Next, upload your data. Since Datasette works on any SQLite database, that's just copying a single database file via the tool and protocol of your choice, such as scp, rsync, sftp, …
Hand the file to the datasette user, and don't forget (second try, at least) to change the ownership:

# cp data.db ~datasette
# chown datasette:datasette ~datasette/data.db
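Before wiring anything up, it's worth a quick sanity check that the file really is a readable SQLite database. A sketch, assuming the sqlite3 command-line tool is installed and the file is at ~datasette/data.db (the path the systemd unit below expects):

```shell
# List the tables as the datasette user; an error here means either
# the ownership or the file itself is wrong.
sudo -u datasette sqlite3 ~datasette/data.db '.tables'
```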
Installation
Next, switch to the datasette user and install the necessary tools – I highly recommend installing datasette-vega, which gives Datasette integrated and practical plotting powers. Generating charts is good, and these here are particularly configurable, and provide exports/downloads as PNG and SVG, to boot.
# sudo -i -u datasette
$ pip install --user datasette sqlite-utils datasette-vega
By convention, # lines are root shells, and $ lines belong to the datasette user. ~datasette is a shortcut for "the home directory of the datasette user" – very useful, because it doesn't require you to remember if you configured a non-standard home directory, or to figure out where on somebody else's servers the home directories are located.
sudo -i -u datasette logs you in as the datasette user and tries to set up a regular login shell environment for that user. Running pip (the Python package installer) with the --user flag installs Python packages only for the currently active user account. This requires no superuser access (Python package installation does execute arbitrary code, after all), and does not pollute the global Python installation with user-local dependencies.
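A quick way to confirm the install worked, and to see where the entry points ended up – a sketch, relying on ~/.local/bin being pip's default --user script directory on Linux:

```shell
$ ~/.local/bin/datasette --version

# --user installs are only on PATH if your shell is configured for it;
# the systemd unit below uses the full path, so this is merely convenient:
$ echo "$PATH" | tr ':' '\n' | grep -F .local/bin || echo "~/.local/bin is not on PATH"
```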
Running Datasette
You need a way of starting the Datasette process reliably. On the Linux distributions I'm running, systemd services are an easy and readable way to go.
Security note: Take the time to make sure that you're running a firewall like ufw or iptables which blocks all ports except for the ones you want to expose publicly – usually your SSH port (default: 22) and the HTTP(S) ports 80 and 443.
Choose a port randomly to avoid collisions with other tools running in the 800x range (because that's where lots of Python web services set up camp). Your systemd service is a plain text file placed at /etc/systemd/system/datasette.service, and it looks like this:
[Unit]
Description=datasette server application
After=network.target

[Service]
User=datasette
WorkingDirectory=/home/datasette
ExecStart=/home/datasette/.local/bin/datasette -p 21474 /home/datasette/data.db
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID

[Install]
WantedBy=multi-user.target
Now you can start, stop, and restart Datasette like this, and it will also start automatically after a reboot:
# systemctl daemon-reload
# systemctl enable datasette
# systemctl start datasette
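To confirm the service actually came up, you can ask systemd and then poke the port directly – assuming port 21474 from the unit file above:

```shell
# Is the process running, and what did it log on startup?
# systemctl status datasette

# Does it answer HTTP on its local port? (Only the response
# status line is interesting here.)
# curl -sI http://localhost:21474/ | head -n 1
```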
As a free bonus, you can look at any logs produced by Datasette by running journalctl -u datasette. Journalctl is a powerful tool – you can for example add flags like --since="10 minutes ago", which is a very useful shorthand for "oops, I broke it, what happened".
Serving Datasette
nginx
I'm using nginx as a web server, which has proven stable and pleasant to use for me. The last command in my /etc/nginx/nginx.conf configuration file is include /etc/nginx/sites/*.conf;, so that I can have an individual file for each subdomain or project, all located in /etc/nginx/sites.
My /etc/nginx/sites/datasette.conf takes incoming requests, notes that they arrived via a proxy and via HTTPS, and hands them off to the Datasette process. It looks like this:
server {
    listen 443 ssl;
    listen [::]:443 ssl;

    server_name data.rixx.de;

    ssl_certificate /etc/ssl/letsencrypt/certs/data.rixx.de/fullchain.pem;
    ssl_certificate_key /etc/ssl/letsencrypt/certs/data.rixx.de/privkey.pem;

    access_log /var/log/nginx/data.rixx.de.access.log;
    error_log /var/log/nginx/data.rixx.de.error.log;

    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    location / {
        proxy_pass http://localhost:21474;
    }
}
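After dropping the file in place, nginx needs to validate and reload its configuration – a syntax check first saves you from taking the site down with a typo:

```shell
# Validate the configuration without touching the running server:
# nginx -t

# If the check passes, apply it without dropping connections:
# systemctl reload nginx
```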
Security note: Once you've reached this point, take a couple of minutes and run gixy against your nginx configuration to make sure that it does what you think it does – no more, and no less.
Further reading
For further information, have a look at Mozilla's web server configuration generator, and the links provided at the bottom of that page. Going into SSL key/cert generation would be a bit much at this point, but I can recommend dehydrated as one of several good ways of generating free HTTPS certificates with Let's Encrypt.
Congrats, you made it! If you want to know more about any of the points above, tell me on Twitter, in the fediverse, or via mail.