Error Budget Calculation: How to Track and Use Your Downtime Allowance
Learn how to calculate and manage error budgets for SRE teams. Convert SLA percentages into actionable downtime budgets with practical examples and burn rate tracking.
Detailed Explanation
What Is an Error Budget?
An error budget is the maximum amount of downtime or failed requests your service can incur within a given period while still meeting its SLA. It is the mathematical complement of your SLA: if your SLA is 99.9%, your error budget is 0.1%.
Calculating Error Budgets
The formula is straightforward:
Error Budget = (1 - SLA/100) × Total Minutes in Period
Monthly error budgets for common SLA levels, assuming an average month of 43,800 minutes (365 days / 12):
| SLA | Error Budget % | Monthly Budget (minutes) |
|---|---|---|
| 99% | 1.0% | 438 min (7h 18m) |
| 99.5% | 0.5% | 219 min (3h 39m) |
| 99.9% | 0.1% | 43.8 min |
| 99.95% | 0.05% | 21.9 min |
| 99.99% | 0.01% | 4.38 min |
| 99.999% | 0.001% | 0.44 min (26s) |
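The table values can be reproduced with a few lines of Python. This is a minimal sketch, not a definitive implementation: the function name `error_budget_minutes` and the constant `MINUTES_PER_MONTH` are illustrative, and the month length is assumed to be the 43,800-minute average that the table's figures correspond to.

```python
# Assumption: an "average month" of 365 days / 12 = 43,800 minutes,
# which matches the figures in the table above.
MINUTES_PER_MONTH = 365 * 24 * 60 / 12


def error_budget_minutes(sla_percent: float,
                         period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Return the allowed downtime in minutes: (1 - SLA/100) x period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return (1 - sla_percent / 100) * period_minutes


if __name__ == "__main__":
    # Print the monthly budget for the SLA levels listed in the table.
    for sla in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
        print(f"{sla}% SLA -> {error_budget_minutes(sla):.2f} min per month")
```

Passing a different `period_minutes` (for example `7 * 24 * 60` for a week) gives the budget for other measurement windows.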
Error Budget Policy
A well-defined error budget policy answers these questions:
- What happens when the budget is exhausted? (Typically: freeze deployments, focus on reliability)
- What counts against the budget? (User-facing errors, latency SLO violations, full outages)
- How is the budget measured? (Request success rate, synthetic monitoring, real user metrics)
- Who owns the budget? (Usually shared between product and SRE teams)
Burn Rate Monitoring
Track how fast your error budget is being consumed:
Burn Rate = Actual Error Rate / Allowed Error Rate
- Burn rate = 1.0: Budget is consumed at exactly the sustainable rate and runs out at the end of the period
- Burn rate = 2.0: Budget will be exhausted in half the period
- Burn rate = 10.0: Major incident. Budget will be gone in one-tenth of the period (about 3 days of a 30-day month)
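The burn rate formula and the thresholds above can be sketched in Python as follows. The function names are hypothetical, and the time-to-exhaustion estimate assumes a full budget at the start of a 30-day period and a constant burn rate, which real incidents rarely follow.

```python
def burn_rate(actual_error_rate: float, sla_percent: float) -> float:
    """Burn Rate = Actual Error Rate / Allowed Error Rate.

    actual_error_rate is a fraction (0.01 means 1% of requests failing).
    """
    allowed_error_rate = 1 - sla_percent / 100
    if allowed_error_rate <= 0:
        raise ValueError("a 100% SLA leaves no error budget to burn")
    return actual_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, period_days: float = 30) -> float:
    """Days until a full budget is gone, assuming the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return period_days / rate


if __name__ == "__main__":
    # Illustrative input: 1% of requests failing against a 99.9% SLA.
    rate = burn_rate(actual_error_rate=0.01, sla_percent=99.9)
    print(f"burn rate: {rate:.1f}")
    print(f"days until budget is exhausted: {days_to_exhaustion(rate):.1f}")
```

In practice the actual error rate would come from your monitoring system over a chosen window; the hard-coded value here is only for illustration.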
Practical Example
Your service has a 99.9% SLA with a monthly budget of 43.8 minutes:
- Week 1: Deployment issue causes 5 minutes of errors → 38.8 minutes remaining
- Week 2: Clean week → 38.8 minutes remaining
- Week 3: Database failover takes 3 minutes → 35.8 minutes remaining
- Week 4: 35.8 minutes still available → safe to deploy new features
If the budget had instead been fully consumed by Week 2, the team would freeze feature deployments and focus on reliability improvements until the budget resets.
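The week-by-week bookkeeping in this example can be expressed as a small ledger. This is a sketch under stated assumptions: `remaining_budget` is a hypothetical helper, the incident durations are the ones from the example, and the deploy/freeze rule is the simple "freeze when nothing is left" policy described earlier.

```python
from typing import Iterable, Iterator, Tuple


def remaining_budget(budget_minutes: float,
                     entries: Iterable[Tuple[str, float]]
                     ) -> Iterator[Tuple[str, float]]:
    """Yield (label, minutes_remaining) after subtracting each entry's downtime."""
    remaining = budget_minutes
    for label, downtime_minutes in entries:
        remaining -= downtime_minutes
        yield label, remaining


if __name__ == "__main__":
    # Downtime per week, taken from the example above (99.9% SLA, 43.8 min budget).
    weeks = [("Week 1", 5.0), ("Week 2", 0.0), ("Week 3", 3.0), ("Week 4", 0.0)]
    for label, remaining in remaining_budget(43.8, weeks):
        # Assumed policy: freeze deployments once the budget is used up.
        status = "safe to deploy" if remaining > 0 else "freeze deployments"
        print(f"{label}: {remaining:.1f} min remaining ({status})")
```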
Use Case
SRE teams can use error budget calculations to negotiate SLA targets with stakeholders, set deployment policies, define incident severity thresholds, and balance feature velocity against reliability investment.