Toil Reduction: The SRE Way
Toil is work that is manual, repetitive, automatable, tactical, and devoid of enduring value, and that scales linearly as a service grows. A core principle of SRE is to progressively eliminate toil, freeing engineers for more valuable, long-term engineering work.
Characteristics of Toil
To identify toil, look for work with these characteristics:
- Manual: A human has to perform the task by hand.
- Repetitive: The task is performed over and over again.
- Automatable: A machine could perform the task more reliably and faster.
- Tactical: It's reactive work, often driven by interrupts, rather than strategic or proactive.
- No Enduring Value: After the task is done, the service is in the same state as before. It hasn't been improved.
- O(n) Scaling: The amount of work scales with the size or usage of the service. More users or more machines mean more toil.
Examples of Toil:
- Manually provisioning a new customer account.
- Restarting a crashed server by hand.
- Applying database schema changes manually.
- Copy-pasting monitoring alerts into a ticket.
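The characteristics above can be turned into a rough checklist. The following is a minimal sketch, not an established scoring method: the `TaskTraits` class, the `toil_score` and `looks_like_toil` functions, and the threshold of 4 are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskTraits:
    """One flag per toil characteristic (hypothetical structure)."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_linearly: bool


def toil_score(traits: TaskTraits) -> int:
    """Count how many of the six characteristics apply to the task."""
    return sum(bool(getattr(traits, f.name)) for f in fields(traits))


def looks_like_toil(traits: TaskTraits, threshold: int = 4) -> bool:
    """Flag a task as likely toil. The default threshold is an assumption."""
    return toil_score(traits) >= threshold


# Restarting a crashed server by hand matches every characteristic.
restart_by_hand = TaskTraits(True, True, True, True, True, True)
# A one-off design review is manual but matches nothing else.
design_review = TaskTraits(True, False, False, False, False, False)

print(looks_like_toil(restart_by_hand), looks_like_toil(design_review))
```

A team would likely tune the threshold to its own judgment; the point is only that the six characteristics can be checked explicitly rather than by feel.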
Why Toil is Harmful
- Burnout: High levels of toil lead to engineer burnout and low morale.
- Slows Innovation: Engineers spending time on toil are not spending time on feature development or improvements.
- Increased Risk: Manual processes are prone to human error, which can cause outages.
- Scalability Limits: You can't hire your way out of a problem that scales linearly with your service.
Strategies for Toil Reduction
The goal for an SRE team is to keep toil to less than 50% of their time. The rest should be spent on engineering projects that reduce future toil or add service features.
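As a small worked example of the 50% target, the sketch below computes the share of time spent on toil from two hour totals. The function names and the sample numbers are assumptions for illustration; how hours are collected is left open.

```python
TOIL_CAP = 0.5  # the "less than 50%" target stated above


def toil_fraction(toil_hours: float, engineering_hours: float) -> float:
    """Return the share of total recorded time that was spent on toil."""
    total = toil_hours + engineering_hours
    if total <= 0:
        raise ValueError("no hours recorded")
    return toil_hours / total


def within_toil_cap(toil_hours: float, engineering_hours: float,
                    cap: float = TOIL_CAP) -> bool:
    """True when toil stays strictly below the cap."""
    return toil_fraction(toil_hours, engineering_hours) < cap


# Hypothetical week: 18 hours of toil, 22 hours of engineering work.
print(f"{toil_fraction(18, 22):.0%}")   # 18 / 40
print(within_toil_cap(18, 22))
print(within_toil_cap(26, 14))
```

With 18 toil hours out of 40 the team sits at 45% and is within the target; with 26 out of 40 it is at 65% and over it.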
- Measure Everything:
  - Track the time spent on operational tasks. Categorize work as either "toil" or "engineering."
  - Use this data to identify the biggest sources of toil and prioritize what to automate first.
- Automate, Automate, Automate:
  - Write scripts and tools to automate repetitive tasks.
  - Invest in configuration management (e.g., Ansible, Puppet) and Infrastructure as Code (e.g., Terraform, CloudFormation) to automate infrastructure management.
- Improve Documentation and Runbooks:
  - While a task is still manual, ensure it's well-documented in a runbook.
  - A good runbook is a stepping stone to automation. If you can write down the steps, you can often script them.
- Build Self-Service Tools:
  - Empower other teams (like developers or customer support) to perform common tasks safely through self-service tools. This reduces the interrupt load on the SRE team.
  - Example: A web portal for developers to create test environments.
- Set an "Error Budget" for Toil:
  - Just as you have an error budget for reliability, you can have a "toil budget." If a team exceeds its toil budget for a quarter, the next quarter's priority must be reducing that toil.
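The measurement and toil-budget ideas above can be sketched together. This is one possible shape, assuming work is logged as entries tagged "toil" or "engineering"; the `WorkEntry` type, the function names, the budget of 40 hours, and the sample log are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class WorkEntry(NamedTuple):
    task: str
    category: str  # "toil" or "engineering"
    hours: float


def rank_toil_sources(entries: list[WorkEntry]) -> list[tuple[str, float]]:
    """Total toil hours per task, largest first, to suggest what to automate."""
    totals: dict[str, float] = defaultdict(float)
    for entry in entries:
        if entry.category == "toil":
            totals[entry.task] += entry.hours
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def next_quarter_priority(entries: list[WorkEntry],
                          toil_budget_hours: float) -> str:
    """Apply the toil-budget rule: overspending makes toil reduction the priority."""
    spent = sum(e.hours for e in entries if e.category == "toil")
    return "reduce toil" if spent > toil_budget_hours else "planned roadmap"


# Hypothetical quarterly log.
quarter_log = [
    WorkEntry("restart crashed server", "toil", 12.0),
    WorkEntry("provision customer account", "toil", 30.0),
    WorkEntry("build provisioning API", "engineering", 40.0),
    WorkEntry("restart crashed server", "toil", 8.0),
]

print(rank_toil_sources(quarter_log))
print(next_quarter_priority(quarter_log, toil_budget_hours=40.0))
```

In this sample, account provisioning (30 hours) outranks server restarts (20 hours) as an automation target, and the 50 toil hours exceed the assumed 40-hour budget, so the next quarter would prioritize toil reduction.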