[Q1-P0] Add monitoring and alerting for SDK health metrics #19

@karlwaldman

Description

Problem

We learned about the historical endpoint timeout bug from a customer email. Monitoring should have surfaced it before any customer did.

What we don't track:

  • ❌ Timeout rate
  • ❌ Response times by endpoint
  • ❌ Error rates by SDK version
  • ❌ Customer retry patterns

Impact

  • Issues go undetected until customers report them
  • No early warning system
  • Can't measure impact of changes
  • Can't track SDK adoption

Solution

Add comprehensive monitoring and alerting:

Backend Metrics (Production API)

# Track in production API (oilpriceapi-api)

from prometheus_client import Histogram, Counter

# Response time histogram
historical_response_time = Histogram(
    'historical_endpoint_response_seconds',
    'Historical endpoint response time',
    ['endpoint', 'commodity', 'interval'],
    buckets=[1, 5, 10, 30, 60, 90, 120, 180]
)

# Timeout counter
historical_timeout_total = Counter(
    'historical_endpoint_timeout_total',
    'Historical endpoint timeout count',
    ['endpoint', 'sdk_version']
)

# Error counter
sdk_error_total = Counter(
    'sdk_error_total',
    'SDK errors by type',
    ['sdk_version', 'error_type', 'endpoint']
)
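
Note that the HighTimeoutRate alert and the timeout-rate dashboard panel below divide by historical_endpoint_requests_total, which the snippet above never registers. Here is a minimal sketch of that counter plus a request wrapper that updates all of the metrics; the track_historical_request helper and its TimeoutError handling are illustrative assumptions, not existing code in oilpriceapi-api.

# Continues the snippet above (prometheus_client Counter already imported)
import time

# Request counter used as the denominator in the timeout-rate alert below
historical_requests_total = Counter(
    'historical_endpoint_requests_total',
    'Historical endpoint request count',
    ['endpoint', 'sdk_version']
)

def track_historical_request(endpoint, commodity, interval, sdk_version, handler):
    # Count every request so the timeout rate has a denominator
    historical_requests_total.labels(endpoint=endpoint, sdk_version=sdk_version).inc()
    start = time.monotonic()
    try:
        return handler()
    except TimeoutError:
        historical_timeout_total.labels(endpoint=endpoint, sdk_version=sdk_version).inc()
        raise
    except Exception as exc:
        sdk_error_total.labels(sdk_version=sdk_version,
                               error_type=type(exc).__name__,
                               endpoint=endpoint).inc()
        raise
    finally:
        # Observe wall-clock duration whether the call succeeded or not
        historical_response_time.labels(endpoint=endpoint,
                                        commodity=commodity,
                                        interval=interval).observe(time.monotonic() - start)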

Alerts

# prometheus/alerts/sdk_health.yml

groups:
  - name: sdk_health
    interval: 1m
    rules:
      - alert: HistoricalEndpointSlow
        expr: |
          histogram_quantile(0.95,
            rate(historical_endpoint_response_seconds_bucket[5m])
          ) > 60
        for: 5m
        labels:
          severity: warning
          component: sdk
        annotations:
          summary: "Historical endpoint P95 latency >60s"
          description: "95th percentile response time is {{ $value }}s"

      - alert: HighTimeoutRate
        expr: |
          sum(rate(historical_endpoint_timeout_total[5m])) /
          sum(rate(historical_endpoint_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          component: sdk
        annotations:
          summary: "Historical endpoint timeout rate >5%"
          description: "{{ $value | humanizePercentage }} of requests timing out"

      - alert: SDKVersionErrorSpike
        expr: |
          sum by (sdk_version) (rate(sdk_error_total{sdk_version="1.4.2"}[5m])) > 0.10
        for: 5m
        labels:
          severity: critical
          component: sdk
        annotations:
          summary: "SDK v1.4.2 error spike (>0.1 errors/s)"
          description: "SDK {{ $labels.sdk_version }} is emitting {{ $value }} errors/s"

      - alert: CustomerRetryPattern
        expr: |
          sum by (user_id, endpoint) (
            rate(api_requests_total[5m])
          ) > 0.5
        for: 2m
        labels:
          severity: warning
          component: sdk
        annotations:
          summary: "User retrying same query repeatedly"
          description: "User {{ $labels.user_id }} retrying {{ $labels.endpoint }}"

Dashboard

// grafana/dashboards/sdk_health.json

{
  "title": "SDK Health Dashboard",
  "panels": [
    {
      "title": "Historical Endpoint Response Time (P50, P95, P99)",
      "targets": [
        {"expr": "histogram_quantile(0.50, rate(historical_endpoint_response_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.95, rate(historical_endpoint_response_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.99, rate(historical_endpoint_response_seconds_bucket[5m]))"}
      ]
    },
    {
      "title": "Timeout Rate by Endpoint",
      "targets": [
        {"expr": "rate(historical_endpoint_timeout_total[5m]) / rate(historical_endpoint_requests_total[5m])"}
      ]
    },
    {
      "title": "Error Rate by SDK Version",
      "targets": [
        {"expr": "rate(sdk_error_total[5m]) by (sdk_version)"}
      ]
    }
  ]
}

Implementation Plan

Phase 1: Backend Instrumentation (oilpriceapi-api)

  1. Add Prometheus metrics to historical endpoints
  2. Track response times, timeouts, errors
  3. Include SDK version in request tracking (see the sketch below)
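
For step 3, one way to derive the sdk_version label is from the request's User-Agent header. A hedged sketch, assuming the SDK identifies itself as oilpriceapi-python/<version>; the header format and helper name are assumptions, not confirmed SDK behavior.

import re

# Assumed User-Agent format, e.g. "oilpriceapi-python/1.4.2"; adjust to what the SDK actually sends
SDK_UA_PATTERN = re.compile(r'oilpriceapi-python/(?P<version>\d+\.\d+\.\d+)')

def sdk_version_from_user_agent(user_agent: str) -> str:
    """Return the SDK version for metric labels, or 'unknown' for non-SDK clients."""
    match = SDK_UA_PATTERN.search(user_agent or '')
    return match.group('version') if match else 'unknown'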

Phase 2: Alerting

  1. Deploy Prometheus alert rules
  2. Configure Slack notifications (routing sketch below)
  3. Set up on-call rotation for critical alerts
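
For step 2, the Alertmanager routing could look roughly like the following; the receiver names, channels, and webhook URLs are placeholders, not existing configuration.

# alertmanager/alertmanager.yml (sketch)
route:
  receiver: slack-sdk-warnings
  group_by: ['alertname', 'component']
  routes:
    - matchers:
        - severity="critical"
      receiver: slack-sdk-critical

receivers:
  - name: slack-sdk-warnings
    slack_configs:
      - channel: '#sdk-health'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'
        send_resolved: true
  - name: slack-sdk-critical
    slack_configs:
      - channel: '#sdk-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'
        send_resolved: true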

Phase 3: Dashboards

  1. Create Grafana dashboard (provisioning sketch below)
  2. Add to team's monitoring rotation
  3. Review weekly in team meetings
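
For step 1, the dashboard JSON above can be loaded via Grafana's file-based provisioning. A sketch; the provider name and dashboard path are assumptions about this deployment.

# grafana/provisioning/dashboards/sdk_health.yml (sketch)
apiVersion: 1

providers:
  - name: sdk-health
    folder: SDK
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards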

Acceptance Criteria

  • Prometheus metrics added to production API
  • Alert rules deployed and tested
  • Grafana dashboard created
  • Slack notifications configured
  • Runbook created for responding to alerts
  • Team trained on dashboard usage

Estimated Effort

Time: 4-6 hours

  • Backend instrumentation: 2h
  • Alert rules: 1h
  • Dashboard: 1h
  • Testing: 1h
  • Documentation: 1h

Success Metrics

  • Detect issues within 5 minutes of occurrence
  • Zero customer-reported bugs that weren't detected by monitoring
  • Response time to incidents <15 minutes

Test Plan

# Validate the alert rule file syntax
promtool check rules prometheus/alerts/sdk_health.yml

# Push a synthetic HighTimeoutRate alert straight to Alertmanager (default port 9093)
# to exercise the Slack notification route end to end
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "HighTimeoutRate", "severity": "critical", "component": "sdk"}}]'

# Verify alert fires
# Verify Slack notification received
# Verify runbook is followed
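
The rule logic itself can also be unit-tested offline with promtool test rules. A sketch; the file name and input series values are made up for illustration and assume the sdk_health.yml rules shown above sit in the same directory.

# prometheus/alerts/sdk_health_test.yml  (run with: promtool test rules sdk_health_test.yml)
rule_files:
  - sdk_health.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # ~10% of requests time out: 6 timeouts/min vs 60 requests/min
      - series: 'historical_endpoint_timeout_total{endpoint="historical", sdk_version="1.4.2"}'
        values: '0+6x10'
      - series: 'historical_endpoint_requests_total{endpoint="historical", sdk_version="1.4.2"}'
        values: '0+60x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighTimeoutRate
        exp_alerts:
          - exp_labels:
              severity: critical
              component: sdk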

Related

    Labels

      • ops (Operations and infrastructure)
      • priority: critical (Must be fixed immediately)
      • quadrant: q1 (Urgent & Important, Do First)
      • type: monitoring (Monitoring, observability, and alerting)
