Monitoring and Observability with Feature Flags

Set up monitoring and alerting for feature flag rollouts. Track flag evaluations, compare metrics between variations, and detect issues early.

Monitoring Feature Flag Rollouts

Rolling out a feature flag without monitoring means flying blind. You need to track which variation each user receives, compare key metrics between the groups, and set up alerts that catch problems before they spread.

Monitoring Configuration

{
  "new-recommendation-algo": {
    "name": "New Recommendation Algorithm",
    "type": "boolean",
    "enabled": true,
    "defaultValue": false,
    "targeting": [
      { "type": "percentage-rollout", "percentage": 20 }
    ],
    "_monitoring": {
      "primaryMetrics": [
        "recommendation.click_through_rate",
        "recommendation.conversion_rate"
      ],
      "guardrailMetrics": [
        "page.load_time_p99",
        "recommendation.error_rate",
        "support.ticket_count"
      ],
      "alertThresholds": {
        "error_rate_increase": 0.5,
        "latency_p99_increase_ms": 200,
        "conversion_rate_decrease": 2.0
      }
    }
  }
}
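The _monitoring block above can drive an automated guardrail check. A minimal sketch in Python, assuming the threshold names from the config and hypothetical dicts of observed metric values per group (how you fetch those values depends on your metrics backend):

```python
def check_guardrails(thresholds, treatment, control):
    """Compare treatment vs. control metrics against alert thresholds.

    thresholds: the "alertThresholds" dict from the flag config
    treatment/control: observed metric values for each group
    Returns the names of any violated thresholds.
    """
    violations = []
    # Error rate is compared as an absolute percentage-point increase.
    if treatment["error_rate"] - control["error_rate"] > thresholds["error_rate_increase"]:
        violations.append("error_rate_increase")
    # Latency is compared in milliseconds at p99.
    if treatment["latency_p99_ms"] - control["latency_p99_ms"] > thresholds["latency_p99_increase_ms"]:
        violations.append("latency_p99_increase_ms")
    # Conversion is alerted on a decrease, not an increase.
    if control["conversion_rate"] - treatment["conversion_rate"] > thresholds["conversion_rate_decrease"]:
        violations.append("conversion_rate_decrease")
    return violations
```

A rollout controller can run this on a schedule and halt or roll back the flag when the list is non-empty.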

What to Monitor

Category          Metrics                          Alert When
Reliability       Error rate, exceptions           > 0.5% increase vs control
Performance       p50, p95, p99 latency            > 100ms increase
Business          Conversion, revenue              > 2% decrease
Infrastructure    CPU, memory, DB queries          > 20% increase
User experience   Bounce rate, session duration    Significant change

Flag Evaluation Logging

Log every flag evaluation for analysis:

{
  "timestamp": "2025-01-15T10:30:00Z",
  "flagKey": "new-recommendation-algo",
  "userId": "user-123",
  "variation": true,
  "reason": "percentage-rollout",
  "evaluationTimeMs": 2
}
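One way to produce events in that shape is a small wrapper around your flag evaluation call. A sketch, assuming the field names above (in production you would ship the event to a log pipeline rather than print it):

```python
import json
import time
from datetime import datetime, timezone

def log_evaluation(flag_key, user_id, variation, reason, started_at):
    """Emit a structured evaluation event matching the JSON shape above.

    started_at: a time.monotonic() reading taken just before evaluation,
    used to compute evaluationTimeMs.
    """
    event = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "flagKey": flag_key,
        "userId": user_id,
        "variation": variation,
        "reason": reason,
        "evaluationTimeMs": round((time.monotonic() - started_at) * 1000),
    }
    print(json.dumps(event))  # stand-in for a real log shipper
    return event
```

Logging every evaluation is what makes the treatment/control comparisons below possible: you can join these events against business metrics by userId.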

Dashboard Layout

Feature Flag Rollout Dashboard
├─ Flag Status: ON (20% rollout)
├─ Users in treatment: 12,456
├─ Users in control: 49,824
├─── Primary Metrics
│   ├─ CTR: Treatment 12.3% vs Control 10.1% (+2.2%)
│   └─ Conversion: Treatment 3.2% vs Control 3.0% (+0.2%)
├─── Guardrail Metrics
│   ├─ Error rate: 0.02% vs 0.01% (within threshold)
│   ├─ p99 latency: 340ms vs 310ms (within threshold)
│   └─ Support tickets: 3 vs 2 (normal)
└─── Rollout History
    ├─ Day 1: 5% → No issues
    ├─ Day 3: 10% → No issues
    └─ Day 5: 20% → Current

Alerting Rules

  • Immediate (PagerDuty): Error rate spike > 1% in treatment group
  • Warning (Slack): Latency p99 increase > 50ms
  • Informational (Dashboard): Any metric divergence > 5%
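These three tiers can be expressed as a simple severity classifier. A sketch mirroring the rules above; the severity labels and the idea of routing on metric name are assumptions, and you would wire the return value to PagerDuty, Slack, or a dashboard annotation:

```python
def classify_alert(metric, treatment, control):
    """Map a treatment-vs-control divergence to an alert severity.

    Returns "page", "warn", "info", or None, following the tiers:
    page  - error rate spike > 1 percentage point in treatment
    warn  - p99 latency increase > 50 ms
    info  - any relative metric divergence > 5%
    """
    if metric == "error_rate" and treatment - control > 1.0:
        return "page"
    if metric == "latency_p99_ms" and treatment - control > 50:
        return "warn"
    if control and abs(treatment - control) / control * 100 > 5:
        return "info"
    return None
```

Note the ordering: the most severe matching tier wins, so an error-rate spike pages even though it also exceeds the 5% informational threshold.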

Post-Rollout Analysis

After reaching 100%, analyze the complete data:

  1. Compare all metrics between the treatment and control periods
  2. Check for any lagging effects (issues that took days to appear)
  3. Document findings in a rollout retrospective
  4. Archive the monitoring configuration with the flag cleanup ticket
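For step 1, eyeballing a difference is not enough at low traffic; a standard two-proportion z-test tells you whether a conversion delta is likely real. A minimal sketch, not tied to any flag platform:

```python
from math import sqrt

def two_proportion_z(conv_t, n_t, conv_c, n_c):
    """Z-score for the difference in conversion rate, treatment vs control.

    conv_*: conversion counts, n_*: users in each group.
    |z| above ~1.96 suggests significance at the 95% level.
    """
    p_t, p_c = conv_t / n_t, conv_c / n_c
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_t + conv_c) / (n_t + n_c)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    return (p_t - p_c) / se
```

Run the same test on guardrail metrics too; a "2% lift" that fails this check is noise, not a win, and belongs in the retrospective as such.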

Use Case

A media platform rolls out a new recommendation algorithm to 20% of users. Their monitoring dashboard shows a 2.2% improvement in click-through rate with no increase in error rate or latency. Guardrail alerts are configured to automatically page the on-call engineer if error rates spike, giving the team confidence to increase the rollout to 50%.
