A simple technique for more reliable software
We engineers love the happy paths. Our product leaders obsess over them. We design for them, we are energized by them, and we feel clever and accomplished when we build them. But the unhappy paths, the failures and edge cases, still matter. We tend to give them less thought and attention because we expect them to happen less frequently (hopefully we’re right).
To build reliable software, you have to get comfortable with the unhappy paths.
Here’s a very straightforward, and very effective, technique for building more reliable software by treating unhappy paths as first-class use cases:
0. Open a blank document
1. Identify the top 1-2 most critical parts of your service
This should not take much time. These are the things that immediately come to mind when you think about what your service actually does.
Let’s say we have a service that manages savings accounts and pays out monthly interest. The critical part is interest payouts: specifically, getting them out accurately and on time.
Write that thing (or things) down as a header in your doc.
2. Imagine failure scenarios
Start thinking through all the ways that your critical part could fail. Focus only on *how* something could fail, not *why* it could fail. A good way to force yourself to think this way is to prompt yourself with: “There was a bug and ____”
There was a bug and the job to send payments failed to run
There was a bug and some customers were paid the wrong amount
…
Write these down as subheaders in your doc.
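For the savings account example, the doc might look something like this at this stage (plain header and subheaders, nothing fancy):

```
Interest payouts: accurate and on time

  There was a bug and the job to send payments failed to run
  There was a bug and some customers were paid the wrong amount
  …
```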
3. For each failure case, reason through a fix
In rough order of preference, reason through how each failure can be recovered from.
This part of the process will reveal the specific areas where you need to invest in automation and/or tooling.
As you do this, write these down under each subheader, and take note of the pieces that don’t currently exist and would need to be built.
🤩 Most preferable: Automated recovery 🤩
This category is commonly known as “self-healing” systems…which in many cases is just a fancy way of saying ‘retry a few times and see if it works’. But generally, these are mechanisms that will help your system eventually get into the correct state without manual intervention.
“Make the payments job idempotent and schedule it multiple times”
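To make that concrete, here’s a rough sketch of what an idempotent payout job could look like (Python, with made-up helper and table names; nothing here is a real API). The key idea: each payout is recorded under a deterministic (account, period) key, so scheduling the job two or three times for the same month is harmless.

```python
# Hypothetical sketch of an idempotent payout job. The db helpers
# (accounts_with_interest_due, payout_exists, etc.) are illustrative.

def run_monthly_payouts(db, period: str) -> None:
    """Pay interest for the given period (e.g. "2024-06").

    Safe to schedule multiple times: each payout is keyed by
    (account_id, period), so a rerun skips accounts already paid.
    """
    for account in db.accounts_with_interest_due(period):
        payout_key = (account.id, period)
        if db.payout_exists(payout_key):
            continue  # already paid in an earlier run; skip
        amount = round(account.balance * account.monthly_rate, 2)
        # Record the payout and credit the account in one transaction,
        # so a crash between the two can't double-pay on the next run.
        with db.transaction():
            db.record_payout(payout_key, amount)
            db.credit(account.id, amount)
```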
👌Still good: Manual recovery 👌
This category requires some manual intervention (typically a bug fix and/or admin tooling) but can still recover from failures in bulk.
“After deploying a fix, re-run the payments job using an admin tool”
“After deploying a fix, cure incorrect payments with the payment adjustment admin tool”
These are the types of tools you document in runbooks and train new hires to use.
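If the payout job is already idempotent, the admin tool can be a thin wrapper around the same code path. Here’s a hypothetical CLI sketch covering both examples above (the commands, flags, and helpers are made up for illustration):

```python
# Hypothetical payments admin CLI; command and flag names are illustrative.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Payments admin tool")
    sub = parser.add_subparsers(dest="command", required=True)

    rerun = sub.add_parser("rerun-payouts",
                           help="Re-run the (idempotent) payout job for a period")
    rerun.add_argument("--period", required=True, help='e.g. "2024-06"')

    adjust = sub.add_parser("adjust-payment",
                            help="Cure an incorrect payout with a signed correction")
    adjust.add_argument("--account-id", required=True)
    adjust.add_argument("--amount", type=float, required=True,
                        help="Signed correction amount")
    adjust.add_argument("--reason", required=True,
                        help="Free-text reason, recorded in the audit trail")

    args = parser.parse_args()
    if args.command == "rerun-payouts":
        # In a real tool this would call the same idempotent job the
        # scheduler runs, e.g. run_monthly_payouts(db, args.period).
        print(f"Would re-run payouts for {args.period}")
    else:
        # ...and this would write an adjustment plus an audit record.
        print(f"Would adjust {args.account_id} by {args.amount} ({args.reason})")

if __name__ == "__main__":
    main()
```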
😣 Least preferable: Manual debugging 😣
This is the catch-all category for “something went horribly wrong, and it will take a human to parse out what happened”. This part of the exercise is where you discover the gaps in auditability and debuggability.
- Can you actually find what broke and who it affected?
- Can you, a human, reasonably access the data you need to debug the series of events leading to the failure?
  - Where is that data? If you’re not storing audit trails, consider how you might (see the sketch after this list).
  - Is that data stored in a human-readable format? If you’re forced to store it as bytes or something only machine-readable, consider tooling to easily deserialize it into something readable.
  - How can you query it? Do engineers have reasonable (read-only, if necessary) access to the data they will need?
- Once you find the issue, can you manually adjust things to make it right? Think general use cases, not specific issues, and give yourself general-purpose admin tooling that lets you repair a wide range of issues.
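On the audit-trail point above: even something as simple as an append-only log of human-readable JSON lines goes a long way, because a person can grep it and a tool can still parse it. A minimal sketch (file path and field names are illustrative; in practice you’d want durable, queryable storage):

```python
# Hypothetical audit-trail helper: append one human-readable JSON line
# per event so a human can later reconstruct what the job did and who
# it affected.
import json
from datetime import datetime, timezone

AUDIT_LOG_PATH = "payouts_audit.jsonl"  # illustrative; use durable storage in practice

def audit(event: str, **details) -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        **details,
    }
    with open(AUDIT_LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example calls from inside the payout job:
# audit("payout_attempted", account_id="acct_123", period="2024-06", amount=41.07)
# audit("payout_skipped_already_paid", account_id="acct_123", period="2024-06")
```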
4. Prioritize and execute
After working through the above, you will be left with a series of holes that need to be plugged…since most or all of the mechanisms you’ve identified as necessary probably won’t exist yet. You can treat these like any other software project - prioritize, estimate, and execute (or backlog).
Best case is that you’re left with a more robust, reliable system.
Worst case is that you have a ton of failure cases logged and have thought through remediation options, and you can at least scramble to get them built when you inevitably need them (hopefully not at 3am on a Sunday).
Happy Hacking.