I usually only do writeups for conference talks I attend in person – it's what helps me focus on the speaker and the talk. But I found myself with these notes after watching the recording of Andrew's talk from PyCon Israel 2018, so with his permission I'm releasing the notes, in the hopes that it makes this important talk more accessible.
1. Hard failures
Faulty parts turn themselves off. This requires reliable redundancy, of course. But if you ignore errors instead, you'll never fix them. Accidents or outages never have a single cause, and every error state you leave running can contribute to the next serious outage, so reducing even minor error states is key.
So: fail hard if anything unexpected happens. Validate your data strictly, both on the way in and on the way out, to avoid unexpected states. And to help isolate issues, deploy changes early and often.
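As a rough illustration of "fail hard, validate in and out", here is a minimal sketch. The `Order` type, its fields, and the function names are made up for this example and are not from the talk; the point is only that unknown keys, wrong types, and impossible values raise immediately instead of being silently tolerated.

```python
from dataclasses import dataclass


class ValidationError(Exception):
    """Raised when data does not match the expected shape exactly."""


@dataclass(frozen=True)
class Order:
    # Hypothetical record type, used only for illustration.
    order_id: int
    quantity: int


def parse_order(raw: dict) -> Order:
    """Validate incoming data strictly; refuse anything unexpected."""
    expected = {"order_id", "quantity"}
    if set(raw) != expected:
        # Unknown or missing keys are an error, not something to ignore.
        odd = sorted(map(repr, set(raw) ^ expected))
        raise ValidationError(f"unexpected or missing keys: {odd}")
    for key in expected:
        # bool is a subclass of int, so compare the exact type.
        if type(raw[key]) is not int:
            raise ValidationError(f"{key} must be an int, got {raw[key]!r}")
    if raw["quantity"] <= 0:
        raise ValidationError("quantity must be positive")
    return Order(order_id=raw["order_id"], quantity=raw["quantity"])


def serialize_order(order: Order) -> dict:
    """Validate outgoing data too, so bad state never leaves the process."""
    if order.quantity <= 0:
        raise ValidationError("refusing to emit an order with quantity <= 0")
    return {"order_id": order.order_id, "quantity": order.quantity}
```

With this, `parse_order({"order_id": 1, "quantity": 2, "note": "hi"})` raises rather than dropping the extra field, which is exactly the kind of loud failure that gets fixed.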
Single points of failure can be good. They require less monitoring, and less troubleshooting until they are repaired, due to less complexity. They also need only a single backup.
2. Good alerting
Cockpits are very selective about what sets off which (or any) alarm. There are very few alarm levels, and they're well-designed, provoking a target response.
Do avoid alert fatigue – it's real and dangerous. Once you start ignoring alerts because there are too many of them and too few are relevant, you've got a big problem.
Don't put all errors in the same place. Choose your levels, places, and alert methods carefully, for example:
- Critical: Wake somebody up. Directly actionable. Critical to infrastructure functioning.
- Normal: Fix over the next week.
- Background: Metrics, not errors. Put them on dashboards or in other places to be looked up, but don't inform actively.
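The three levels above could be wired up roughly like this. This is only a sketch: the destination names (`pager`, `ticket_queue`, `dashboard`) are placeholders I made up, and a real setup would call whatever paging, ticketing, and metrics tools you actually use.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"      # wake somebody up
    NORMAL = "normal"          # fix over the next week
    BACKGROUND = "background"  # metrics only, never notify actively


# Hypothetical destinations; one distinct place per level.
ROUTES = {
    Severity.CRITICAL: "pager",
    Severity.NORMAL: "ticket_queue",
    Severity.BACKGROUND: "dashboard",
}


def route_alert(severity: Severity, message: str) -> tuple[str, str]:
    """Return (destination, message); anything that is not a Severity fails hard."""
    if not isinstance(severity, Severity):
        raise TypeError(f"severity must be a Severity, got {severity!r}")
    return ROUTES[severity], message
```

Keeping the set of levels this small is the design choice here: every new level is one more thing a tired person has to interpret.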
Learn from your behaviour: If you ignore an error for weeks without consequences, turn off the error. Consider just removing the cause entirely, because apparently nobody cares.
3. Know your limits
Everything will fail – you should know when. You don't have to be perfect, but you really need to know where you aren't perfect and how imperfect you are.
If you don't know: find out! Load test until your system crashes or starts failing. Check how it starts failing. Also, fuzz test your system – send incorrect and random data of all kinds and see how your system fails.
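A fuzz test does not need to be fancy to be useful. The sketch below throws random bytes at a stand-in parser (`parse_message` is hypothetical, not something from the talk) and treats any exception other than the expected `ValueError` as a finding, because that would be a failure mode you did not know about.

```python
import json
import random


def parse_message(data: bytes) -> dict:
    """Hypothetical system under test: accepts only a UTF-8 JSON object."""
    try:
        value = json.loads(data.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError("malformed message") from exc
    if not isinstance(value, dict):
        raise ValueError("message must be a JSON object")
    return value


def fuzz(iterations: int = 1000, seed: int = 0) -> int:
    """Feed random bytes to parse_message; return how many inputs were rejected.

    Any exception other than ValueError propagates and fails the run.
    """
    rng = random.Random(seed)  # seeded so a failing input can be reproduced
    rejected = 0
    for _ in range(iterations):
        length = rng.randrange(0, 64)
        blob = bytes(rng.randrange(256) for _ in range(length))
        try:
            parse_message(blob)
        except ValueError:
            rejected += 1
    return rejected
```

For anything serious, a dedicated fuzzing or property-based testing tool will find far more than uniform random bytes do, but even this catches parsers that crash in ways nobody planned for.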
What's your minimum equipment list? What can you absolutely not do without? Can you design your system in a way that there are parts that are nice to have, but that can fail separately, e.g. a suggestion system, where missing updates are not the same as a database failure for the whole application?
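The suggestion-system example might look something like the sketch below. Both loader callables are hypothetical and supplied by the caller; the idea is that the essential part is allowed to take the request down, while the nice-to-have part fails separately and visibly.

```python
def render_page(load_article, load_suggestions) -> dict:
    """Build a page where the article is essential and suggestions are optional.

    Both arguments are hypothetical zero-argument callables.
    """
    # Essential: if this raises, the whole request fails hard.
    page = {"article": load_article(), "suggestions": [], "degraded": False}
    try:
        page["suggestions"] = load_suggestions()
    except Exception:
        # Optional: flag the degradation so it can be counted on a
        # dashboard, then serve the page without suggestions.
        page["degraded"] = True
    return page
```

The `degraded` flag matters: degrading quietly without recording it would be ignoring an error, which is the opposite of the first point.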
Risk is perfectly fine and much easier to handle when you're informed about your capabilities. (Don't forget to update the information as the system evolves.)
4. Build for failure
Over three quarters of accidents are caused by the pilots – that's of course partly because the airplanes are built so well by now, but also due to humans being less than reliable under stress.
We already know that everything and everyone is going to fail at some point, so let's guard against it. Assume that everything is going to fail right now and figure out what to do then. Make your code work with failure states, and recover from them, so you'll be prepared for the real thing. Kill your application randomly. Practice server network failures. Develop on unreliable connections.
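One cheap way to practice failure during development is to inject it yourself. The following is a minimal sketch under my own assumptions: `flaky`, `call_with_retries`, and `InjectedFailure` are illustrative names, and real code would catch the actual errors of the dependency rather than a made-up exception.

```python
import random


class InjectedFailure(Exception):
    """Raised on purpose to simulate an unreliable dependency."""


def flaky(func, failure_rate: float, rng: random.Random):
    """Wrap func so that it fails randomly at roughly the given rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated outage")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts: int = 3):
    """Call func up to `attempts` times; re-raise the last failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except InjectedFailure:
            if attempt == attempts:
                raise  # out of retries: fail hard instead of hiding it
```

Wrapping a dependency as `flaky(fetch, 0.2, random.Random())` in a development environment means the recovery path runs every day instead of only during the first real outage.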
Learn how to handle emergencies. Practice emergencies. Memorise the things you need to do immediately, and then have checklists checklists checklists for everything else. Include things you think you'll know. Have a disaster recovery plan. Better yet: Have disaster recovery plans, plural.
5. Communicate well
"You have control" – "I have control" – "You have control" is the typical exchange in aviation.
Software is complex. Complex software means more teams. Even with one small team you have a lot of communication, but as you grow, communication becomes exponentially harder.
Clear communication is absolutely vital, especially for specifications. Write everything down (bonus: more written specs reduce the number and length of meetings). Have a clear chain of command. It rarely comes into play, but it is necessary for the final call on decisions.
6. No blame culture
Software engineering is horrible at dealing with mistakes. There is often a lot of clamoring and assigning blame. This is harmful.
We as people, as cultures, as companies, and as industries, learn from mistakes. When something fails, investigate deeply what led to an outage/failure. Remember: there is never a single cause of a problem. Don't assign blame to people. Instead, focus on making it very very difficult to make that mistake again.
To that end, you need to encourage reporting. If you assign blame, people will hide mistakes, making them worse in the process. Instead, address the issue professionally (yes, that means not firing people, seriously). Reward maintenance as well as firefighting – it's more important to keep the system running than to look heroic while fixing it.
In aviation, every rule is written in blood. In software "engineering", we aren't there yet, but we're heading there. There are already stories of software being built very well and saving the day due to great error handling (Apollo 11). But we also have the Patriot missile software, where a rounding error in the system's timekeeping led to a failed interception that cost 28 lives. Or the Therac-25 radiation therapy machine, which was entirely software controlled, without hardware fail-safes, and which overdosed at least six people, killing three. Or the Uber autonomous vehicle, which chose to hit a pedestrian, because hey, the comfort of the passenger is very important, and suddenly braking is uncomfortable, right?! Or the system in Germany that sent out an erroneous syphilis diagnosis, leading the recipient to suicide.
So keep in mind:
- Hard failures
- Good alerting
- Know your limits
- Build for failure
- Communicate well
- No blame culture