Database SLA: Uptime Considerations for Data Services

Q: When is this useful?

Reference this guide when designing database architectures, choosing between managed database services, setting RTO/RPO targets, and understanding how database failover impacts your application's overall SLA.

It covers database-specific SLA considerations: replication lag, failover time, RTO/RPO, and how database availability differs from compute availability.

Detailed Explanation

Database Availability Is Different

Database SLAs are more nuanced than compute SLAs because databases are stateful. A stateless web server can be restarted or replaced within seconds, but a database failover involves data consistency checks, replication catch-up, and connection re-establishment.

Key Database SLA Metrics

| Metric | Definition | Typical Targets |
|--------|-----------|----------------|
| Availability | Percentage of time the database accepts queries | 99.9% - 99.99% |
| RTO (Recovery Time Objective) | Maximum time to restore service after a failure | 1 min - 4 hours |
| RPO (Recovery Point Objective) | Maximum acceptable data loss, measured as a window of time | 0 - 24 hours |
| Replication Lag | Delay between a write on the primary and its appearance on a replica | 0 - 60 seconds |
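
To make the availability targets concrete, it can help to convert a percentage into a downtime budget. The following is a minimal sketch; the function name `downtime_budget_minutes` and the 365-day year are assumptions chosen for illustration, not part of any provider's SLA definition.

```python
def downtime_budget_minutes(availability_pct: float, period_hours: float = 365 * 24) -> float:
    """Return the downtime (in minutes) that an availability target allows over a period."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_hours * 60


if __name__ == "__main__":
    # Compare the typical targets from the table over one 365-day year.
    for target in (99.9, 99.95, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9% allows roughly 8.8 hours of downtime per year, while 99.99% allows under an hour.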

Cloud Database SLA Comparison

| Service | SLA | Failover Time | RPO |
|---------|-----|---------------|-----|
| AWS RDS Multi-AZ | 99.95% | 60-120 seconds | 0 (synchronous) |
| AWS Aurora | 99.99% | <30 seconds | 0 (synchronous) |
| Azure SQL Database (Business Critical, zone-redundant) | 99.995% | ~30 seconds | 0 |
| GCP Cloud SQL (Regional) | 99.95% | ~60 seconds | 0 |
| GCP Cloud Spanner (multi-region) | 99.999% | Automatic | 0 |
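
One way to read the table is to ask how many failovers of a given length fit inside a service's monthly SLA allowance. The sketch below assumes a 30-day month and assumes that every second of failover counts as downtime; providers define and measure SLA downtime in their own terms, so treat the result as a rough planning figure. The function name `failovers_within_budget` is hypothetical.

```python
def failovers_within_budget(sla_pct: float, failover_seconds: float, days: int = 30) -> float:
    """Return how many failovers of the given length fit in the SLA's downtime allowance."""
    budget_seconds = (1 - sla_pct / 100) * days * 24 * 3600
    return budget_seconds / failover_seconds


if __name__ == "__main__":
    # A 99.95% SLA with a worst-case 120-second failover (the RDS Multi-AZ row).
    count = failovers_within_budget(99.95, 120)
    print(f"About {count:.1f} failovers of 120 s fit in a 30-day month at 99.95%")
```

With these inputs the monthly allowance is 21.6 minutes, which is used up by roughly eleven 120-second failovers.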

Failover Impact on Application SLA

Database failover is not instantaneous. During failover:

Connection pool drain: Existing connections are broken (2-5 seconds)
DNS propagation: New endpoint resolves (0-30 seconds)
Replica promotion: New primary takes over (10-120 seconds)
Connection re-establishment: App reconnects (1-10 seconds)

Total perceived downtime: typically 30 seconds to 3 minutes per failover event (the phases above sum to roughly 13 to 165 seconds)
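
The phase ranges above can be added up to bound a single failover. This is a minimal sketch that assumes the phases run strictly one after another; in practice some of them overlap, and the dictionary name `FAILOVER_PHASES` is introduced here only for illustration.

```python
# (best case, worst case) duration of each failover phase, in seconds.
FAILOVER_PHASES = {
    "connection pool drain": (2, 5),
    "DNS propagation": (0, 30),
    "replica promotion": (10, 120),
    "connection re-establishment": (1, 10),
}


def failover_window(phases: dict) -> tuple:
    """Sum best-case and worst-case durations, assuming the phases are sequential."""
    best = sum(low for low, _ in phases.values())
    worst = sum(high for _, high in phases.values())
    return best, worst


if __name__ == "__main__":
    best, worst = failover_window(FAILOVER_PHASES)
    print(f"Estimated failover window: {best} to {worst} seconds")
```

Replica promotion dominates the worst case, so it is usually the first phase worth measuring in a failover drill.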

Write vs Read Availability

A common pattern is to have higher availability for reads than writes:

Writes: Primary only → 99.95% (the primary is a single point of failure)
Reads: Primary + N replicas → 99.999%+ (parallel redundancy, assuming nodes fail independently)

For applications that are read-heavy (most web apps), this means:

Read SLA: Very high (99.99%+) with multiple read replicas
Write SLA: Limited by primary availability (99.95% typically)
Composite: Read and write availability weighted by the read/write ratio
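
The read and write figures above can be reproduced with two short formulas. The sketch below assumes that nodes fail independently (shared failure domains make real numbers lower) and that the composite is a simple request-weighted average; the function names and the 90% read fraction are illustrative assumptions.

```python
def parallel_availability(node_availability: float, nodes: int) -> float:
    """Availability when any one of `nodes` independent nodes can serve the request."""
    return 1 - (1 - node_availability) ** nodes


def composite_availability(read_avail: float, write_avail: float, read_fraction: float) -> float:
    """Request-weighted availability for a workload with the given share of reads."""
    if not 0 <= read_fraction <= 1:
        raise ValueError("read_fraction must be between 0 and 1")
    return read_fraction * read_avail + (1 - read_fraction) * write_avail


if __name__ == "__main__":
    write_avail = 0.9995                                # primary only
    read_avail = parallel_availability(0.9995, nodes=3)  # primary + 2 replicas
    composite = composite_availability(read_avail, write_avail, read_fraction=0.9)
    print(f"Read availability:      {read_avail:.10f}")
    print(f"Write availability:     {write_avail:.10f}")
    print(f"Composite (90% reads):  {composite:.10f}")
```

The weighted average applies when reads and writes are independent requests. If a single user action needs both a read and a write to succeed, the two availabilities multiply instead, and the write path becomes the limit.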

Database-Specific Downtime Causes

Unlike compute instances, databases face unique availability threats:

Schema migrations (ALTER TABLE on large tables can lock writes)
Replication breakage (a replica falls too far behind to be a usable failover target)
Storage exhaustion (a full disk can halt writes or crash the database)
Connection limit exhaustion (too many clients; new connections are refused)
Vacuum/maintenance operations (PostgreSQL VACUUM FULL, MySQL OPTIMIZE TABLE)
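
Two of these causes, storage exhaustion and connection limit exhaustion, can be caught early with a simple headroom check. The following is a minimal sketch: the 80% threshold, the function name `headroom_alerts`, and the connection counts in the usage example are assumptions, and a real deployment would read the connection figures from the database's own statistics.

```python
import shutil


def headroom_alerts(disk_used, disk_total, connections, max_connections, threshold=0.8):
    """Return the names of resources whose usage ratio has reached the threshold."""
    alerts = []
    if disk_used / disk_total >= threshold:
        alerts.append("disk")
    if connections / max_connections >= threshold:
        alerts.append("connections")
    return alerts


if __name__ == "__main__":
    # Disk figures come from the current directory's volume; in practice,
    # point this at the database's data directory instead.
    usage = shutil.disk_usage(".")
    # Hypothetical connection counts, used here only to exercise the check.
    alerts = headroom_alerts(usage.used, usage.total, connections=85, max_connections=100)
    print("Resources near their limit:", alerts or "none")
```

Alerting before a limit is reached leaves time to add storage or a connection pooler, rather than discovering the limit through an outage.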

Use Case