Static websites from data files

rixx

2020-03-08

Sometimes, you want to publish structured data in a way that is nice for people to look at. Maybe you have a CSV file with the export from a tool you sometimes use, or the JSON data from somebody's API. Instead of using the data in a program, you'd like it to be readable for everybody, maybe even nice to look at, and ideally with a way to provide seamless updates.

I recently found a nice solution to this problem using Jekyll and GitHub (Pull Requests, Pages, and Actions).

The starting point was the “chatterlist” of the German networking community DENOG. It lists all members who want to share their contact information, their affiliation, and a few additional details. The goal in rebuilding this site was to

render the member list to a website
from a data file
which members can modify
without breaking the site
which will reflect modifications to the data file.

Rendering data

Thankfully, we're dealing with a group of technical people here, so I decided that it was okay to expect the members to edit a JSON file. That's lucky, because the static site generator Jekyll comes with support for Data Files in JSON, CSV, or YAML format.

To use these data files, you'll have to put them in the _data directory. You can see the complete structure of the Jekyll project here:

    .
    ├── _config.yml
    ├── _data
    │   └── chatterliste.json
    ├── index.markdown
    └── _layouts
        └── liste.html

If you pare things down, you need only two files for this to work: the data file in _data/chatterliste.json, and the index.markdown page to render the data. I separated the general layout from the index file for better readability, and also included the standard Jekyll config file.

The JSON file contains a list of member objects. In our index.markdown, these member objects are rendered into a table, basically like this:

<table>
  <tbody>
  {% for entry in site.data.chatterliste %}
    <tr>
      <td>{{ entry.Nick }}</td>
      <td>{{ entry.Name }}</td>
    </tr>
  {% endfor %}
  </tbody>
</table>

And that's it! We're done, good work, everybody.

Contribution workflow

… hoooold it, not so fast! How will members update the data?

Updates are handled with Pull Requests on GitHub. Thankfully, GitHub makes this process very accessible: users can edit files directly in the browser, GitHub performs the fork in the background, and then it offers to create a Pull Request right away. The Pull Request system sees active use and seems to work well.

But not everybody is comfortable editing JSON files manually, and sadly GitHub doesn't have a feature to prepare a PR from a template. It does support issue templates, though. I created two templates, one for adding new data and one for changing existing information. These templates are simple Markdown files located in .github/ISSUE_TEMPLATE/. Members were asked to either provide a PR or put their information into a new issue, which also seems to be working well.

Safeguards

Humans are not quite as good at data processing as computers, so it stands to reason that sooner or later people would submit faulty data. Maybe they'd forget to add a closing ", or maybe they'd use an incorrect data type, or maybe they'd just put their entry in the wrong place of the alphabetical order.
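The first of these failure modes, a plain syntax error, can be caught by any JSON parser. The following is only a minimal sketch using Python's standard library, not part of the actual project; the function name find_syntax_error and the sample data are made up for illustration.

```python
import json


def find_syntax_error(text):
    """Return None if text is valid JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        # JSONDecodeError carries the position of the problem in the input.
        return f"line {error.lineno}, column {error.colno}: {error.msg}"
    return None


# Hypothetical entries: the second one is missing the closing quote after the nick.
valid_text = '[{"Nick": "nomaster", "Name": "Example Person"}]'
broken_text = '[\n  {"Nick": "nomaster,\n   "Name": "Example Person"}\n]'

print(find_syntax_error(valid_text))   # None
print(find_syntax_error(broken_text))  # a message pointing at the broken entry
```

A check like this only tells you whether the file parses at all. It says nothing about data types or required fields.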

Our aim was to provide a stable website and data source, so I had to take steps to avoid these problems. To make sure that the list was sorted alphabetically, the project now uses a sorting filter in the index.markdown template:

{% assign sorted = site.data.chatterliste | sort_natural: 'Nick' %}
{% for entry in sorted %}
    …
{% endfor %}
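For readers who don't know Liquid: sort_natural orders entries by the given property while ignoring upper and lower case. The Python sketch below shows roughly the same effect. It is an illustration, not what Jekyll runs internally, and all nicks except "nomaster" are invented.

```python
def sort_by_nick(members):
    """Return members ordered by 'Nick', ignoring upper/lower case."""
    return sorted(members, key=lambda member: member["Nick"].casefold())


# Hypothetical sample data; only the "Nick" key matters here.
members = [
    {"Nick": "nomaster"},
    {"Nick": "Zed"},
    {"Nick": "alice"},
]

ordered = [member["Nick"] for member in sort_by_nick(members)]
print(ordered)  # ['alice', 'nomaster', 'Zed']
```

A plain, case-sensitive sort would put "Zed" first, because uppercase letters sort before lowercase ones. That is why the template uses sort_natural rather than sort.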

The question of data integrity is a bit harder to solve well. We care about the integrity of our JSON file, which needs to be valid JSON and to conform to our assumptions about the data. For example, a user's name has to be a string, and a user's affiliation has to be a list of objects containing at least a name, and optionally a URL.

To this end, I wrote a JSON schema for the member list. It details exactly which properties should be of which type, and which are mandatory or optional. It even includes an example for a correct data entry.
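The schema itself is not reproduced here. As a rough illustration of the kind of rules it expresses, the two example constraints above can be written as a hand-rolled check in Python. This is a sketch under assumptions: the key names "Affiliation", "name", and "url" are guesses for illustration, and treating a missing affiliation as acceptable is also an assumption. The real project relies on the JSON schema, not on code like this.

```python
import json


def check_member(member):
    """Return a list of problems found in one member object (empty if none)."""
    problems = []
    if not isinstance(member.get("Name"), str):
        problems.append("Name should be a string")

    # "Affiliation" is a hypothetical key name; a missing key is treated as fine.
    affiliations = member.get("Affiliation", [])
    if not isinstance(affiliations, list):
        problems.append("Affiliation should be an array")
    else:
        for item in affiliations:
            if not isinstance(item, dict) or not isinstance(item.get("name"), str):
                problems.append("each affiliation needs a string 'name'")
            elif "url" in item and not isinstance(item["url"], str):
                problems.append("affiliation 'url' should be a string")
    return problems


good = json.loads('{"Name": "Example Person", "Affiliation": [{"name": "Example Org"}]}')
bad = json.loads('{"Name": 42, "Affiliation": "Example Org"}')

print(check_member(good))  # []
print(check_member(bad))   # ['Name should be a string', 'Affiliation should be an array']
```

A declarative schema has the advantage that the same file documents the format for humans and drives the validation, so the two cannot drift apart.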

When a member creates a new Pull Request, we want it to be validated before the merge. We can't reasonably ask the member to run the validation, so I decided to give the new built-in GitHub CI, GitHub Actions, a go. This turned out to be fairly easy, thanks to the JSON validation action by Or Rosenblatt. To use it, I added this configuration to .github/workflows/validate.yml, which basically says “Check out the current commit, and check if chatterliste.json conforms to schema.json”:

name: Validate JSONs

on: [push, pull_request]

jobs:
  verify-json-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - name: Validate JSON
        uses: docker://orrosenblatt/validate-json-action:latest
        env:
          INPUT_SCHEMA: /_data/schema.json
          INPUT_JSONS: /_data/chatterliste.json

The JSON schema validation provides helpful feedback on failing checks, as you can see here:

✗ /github/workspace/_data/chatterliste.json
TYPE should be array

  2922 |         "Nick": "nomaster",
  2923 |         "Name": "Mic Szillat",
> 2924 |         "NIC Handle": "NOMA",
       |                       ^^^^^^ 👈   type should be array
  2925 |         "ASN": [
  2926 |             207871,
  2927 |             198018

And that's it! Now we have a finished website, updated live from a validated data file. I hope this was helpful to you – I'm sure I'll use this pattern again to publish data to static websites.