Monitoring and Observability with Feature Flags
Set up monitoring and alerting for feature flag rollouts. Track flag evaluations, compare metrics between variations, and detect issues early.
Detailed Explanation
Monitoring Feature Flag Rollouts
Rolling out feature flags without monitoring is flying blind. You need to track which variation each user receives, compare key metrics between the groups, and set up alerts that catch problems before they spread.
Monitoring Configuration
```json
{
  "new-recommendation-algo": {
    "name": "New Recommendation Algorithm",
    "type": "boolean",
    "enabled": true,
    "defaultValue": false,
    "targeting": [
      { "type": "percentage-rollout", "percentage": 20 }
    ],
    "_monitoring": {
      "primaryMetrics": [
        "recommendation.click_through_rate",
        "recommendation.conversion_rate"
      ],
      "guardrailMetrics": [
        "page.load_time_p99",
        "recommendation.error_rate",
        "support.ticket_count"
      ],
      "alertThresholds": {
        "error_rate_increase": 0.5,
        "latency_p99_increase_ms": 200,
        "conversion_rate_decrease": 2.0
      }
    }
  }
}
```
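The `alertThresholds` block is only useful if something enforces it. A minimal sketch of that check in Python, comparing treatment metrics against control (the threshold keys mirror the config above; the observed metric values and the `check_thresholds` helper are illustrative, not a real library API):

```python
# Sketch: evaluate treatment-vs-control deltas against the flag's alert thresholds.
# Threshold keys mirror "_monitoring.alertThresholds" above; metric values are made up.

def check_thresholds(thresholds, treatment, control):
    """Return the names of any alert thresholds the treatment group breached."""
    breaches = []
    # Percentage-point increase in error rate vs control.
    if treatment["error_rate"] - control["error_rate"] > thresholds["error_rate_increase"]:
        breaches.append("error_rate_increase")
    # Absolute p99 latency regression in milliseconds.
    if treatment["latency_p99_ms"] - control["latency_p99_ms"] > thresholds["latency_p99_increase_ms"]:
        breaches.append("latency_p99_increase_ms")
    # Percentage-point drop in conversion rate.
    if control["conversion_rate"] - treatment["conversion_rate"] > thresholds["conversion_rate_decrease"]:
        breaches.append("conversion_rate_decrease")
    return breaches

thresholds = {"error_rate_increase": 0.5, "latency_p99_increase_ms": 200, "conversion_rate_decrease": 2.0}
control = {"error_rate": 0.01, "latency_p99_ms": 310, "conversion_rate": 3.0}
treatment = {"error_rate": 0.02, "latency_p99_ms": 340, "conversion_rate": 3.2}
print(check_thresholds(thresholds, treatment, control))  # [] — all within threshold
```

Running this check on every metrics-aggregation interval (rather than only at rollout steps) is what lets you catch regressions between percentage increases.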
What to Monitor
| Category | Metrics | Alert When |
|---|---|---|
| Reliability | Error rate, exceptions | > 0.5% increase vs control |
| Performance | p50, p95, p99 latency | > 100ms increase |
| Business | Conversion, revenue | > 2% decrease |
| Infrastructure | CPU, memory, DB queries | > 20% increase |
| User experience | Bounce rate, session duration | Significant change |
Flag Evaluation Logging
Log every flag evaluation for analysis:
```json
{
  "timestamp": "2025-01-15T10:30:00Z",
  "flagKey": "new-recommendation-algo",
  "userId": "user-123",
  "variation": true,
  "reason": "percentage-rollout",
  "evaluationTimeMs": 2
}
```
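One way to emit such records is a small helper that builds the JSON line at evaluation time. A sketch (field names come from the example above; the `log_flag_evaluation` helper and printing to stdout are assumptions — in production you would ship the line to your log pipeline):

```python
import datetime
import json
import time

def log_flag_evaluation(flag_key, user_id, variation, reason, started_at):
    """Emit one structured flag-evaluation record as a JSON line."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
                     .isoformat(timespec="seconds"),
        "flagKey": flag_key,
        "userId": user_id,
        "variation": variation,
        "reason": reason,
        # Elapsed time since evaluation began, in whole milliseconds.
        "evaluationTimeMs": round((time.monotonic() - started_at) * 1000),
    }
    print(json.dumps(record))  # stand-in for a real log shipper
    return record

started = time.monotonic()
# ... evaluate the flag for this user ...
log_flag_evaluation("new-recommendation-algo", "user-123", True, "percentage-rollout", started)
```

Keeping `flagKey`, `variation`, and `reason` on every record is what makes it possible to join evaluations against downstream metrics and attribute each user to treatment or control.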
Dashboard Layout
```
Feature Flag Rollout Dashboard
├─ Flag Status: ON (20% rollout)
├─ Users in treatment: 12,456
├─ Users in control: 49,824
├─ Primary Metrics
│  ├─ CTR: Treatment 12.3% vs Control 10.1% (+2.2 pts)
│  └─ Conversion: Treatment 3.2% vs Control 3.0% (+0.2 pts)
├─ Guardrail Metrics
│  ├─ Error rate: 0.02% vs 0.01% (within threshold)
│  ├─ p99 latency: 340ms vs 310ms (within threshold)
│  └─ Support tickets: 3 vs 2 (normal)
└─ Rollout History
   ├─ Day 1: 5% → no issues
   ├─ Day 3: 10% → no issues
   └─ Day 5: 20% → current
```
Alerting Rules
- Immediate (PagerDuty): Error rate spike > 1% in treatment group
- Warning (Slack): Latency p99 increase > 50ms
- Informational (Dashboard): Any metric divergence > 5%
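The three tiers above can be expressed as a small routing function. A sketch (the channel names stand in for real PagerDuty/Slack/dashboard integrations, and the rules are taken directly from the list above):

```python
# Sketch: route a treatment-vs-control divergence to the right alert tier.
# Channel strings are placeholders for real paging/chat integrations.

def route_alert(metric, treatment, control):
    """Return the alert channel for a metric divergence, or None if quiet."""
    # Immediate: error rate spike > 1 percentage point in treatment.
    if metric == "error_rate" and treatment - control > 1.0:
        return "pagerduty"
    # Warning: p99 latency regression > 50ms.
    if metric == "latency_p99_ms" and treatment - control > 50:
        return "slack"
    # Informational: any relative divergence > 5%.
    if control and abs(treatment - control) / control > 0.05:
        return "dashboard"
    return None

print(route_alert("latency_p99_ms", 340, 310))  # dashboard (≈10% divergence, <50ms)
```

Ordering the checks from most to least severe ensures a single incident pages once rather than triggering every tier.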
Post-Rollout Analysis
After reaching 100%, analyze the complete data:
- Compare all metrics between the treatment and control periods
- Check for any lagging effects (issues that took days to appear)
- Document findings in a rollout retrospective
- Archive the monitoring configuration with the flag cleanup ticket
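For the metric comparison in the retrospective, a two-proportion z-test is one common way to check whether a conversion difference is signal rather than noise. A sketch (the counts are illustrative, sized to match the dashboard's user totals; this assumes conversions are independent per user):

```python
# Sketch: two-proportion z-test for a treatment-vs-control conversion difference.
# Counts are illustrative; |z| > 1.96 corresponds to significance at the 5% level.
import math

def two_proportion_z(conv_t, n_t, conv_c, n_c):
    """z-statistic for the difference between two conversion rates."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    pooled = (conv_t + conv_c) / (n_t + n_c)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    return (p_t - p_c) / se

# ~3.2% conversion in treatment vs ~3.0% in control
z = two_proportion_z(399, 12_456, 1_495, 49_824)
print(f"z = {z:.2f}")
```

With these counts the z-statistic falls short of 1.96, a reminder that a small lift at a 20% rollout may need more exposure time (or 100% of traffic) before it is conclusive.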
Use Case
A media platform rolls out a new recommendation algorithm to 20% of users. Their monitoring dashboard shows a 2.2% improvement in click-through rate with no increase in error rate or latency. Guardrail alerts are configured to automatically page the on-call engineer if error rates spike, giving confidence to increase the rollout to 50%.