[Q1-P0] Add monitoring and alerting for SDK health metrics #19

@karlwaldman

Description

Problem

We learned about the historical endpoint timeout bug from a customer email. Monitoring should have surfaced it before any customer did.

What we don't track:

  • ❌ Timeout rate
  • ❌ Response times by endpoint
  • ❌ Error rates by SDK version
  • ❌ Customer retry patterns

Impact

  • Issues go undetected until customers report them
  • No early warning system
  • Can't measure impact of changes
  • Can't track SDK adoption

Solution

Add comprehensive monitoring and alerting:

Backend Metrics (Production API)

# Track in production API (oilpriceapi-api)

from prometheus_client import Histogram, Counter

# Response time histogram
historical_response_time = Histogram(
    'historical_endpoint_response_seconds',
    'Historical endpoint response time',
    ['endpoint', 'commodity', 'interval'],
    buckets=[1, 5, 10, 30, 60, 90, 120, 180]
)

# Timeout counter
historical_timeout_total = Counter(
    'historical_endpoint_timeout_total',
    'Historical endpoint timeout count',
    ['endpoint', 'sdk_version']
)

# Error counter
sdk_error_total = Counter(
    'sdk_error_total',
    'SDK errors by type',
    ['sdk_version', 'error_type', 'endpoint']
)
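
Note that the HighTimeoutRate alert and the timeout-rate dashboard panel below divide by historical_endpoint_requests_total, which the snippet above never registers. Here is a minimal sketch of that counter plus a request wrapper that updates all of the metrics; the track_historical_request helper and its TimeoutError handling are illustrative assumptions, not existing code in oilpriceapi-api.

# Continues the snippet above (prometheus_client Counter already imported)
import time

# Request counter used as the denominator in the timeout-rate alert below
historical_requests_total = Counter(
    'historical_endpoint_requests_total',
    'Historical endpoint request count',
    ['endpoint', 'sdk_version']
)

def track_historical_request(endpoint, commodity, interval, sdk_version, handler):
    # Count every request so the timeout rate has a denominator
    historical_requests_total.labels(endpoint=endpoint, sdk_version=sdk_version).inc()
    start = time.monotonic()
    try:
        return handler()
    except TimeoutError:
        historical_timeout_total.labels(endpoint=endpoint, sdk_version=sdk_version).inc()
        raise
    except Exception as exc:
        sdk_error_total.labels(sdk_version=sdk_version,
                               error_type=type(exc).__name__,
                               endpoint=endpoint).inc()
        raise
    finally:
        # Observe wall-clock duration whether the call succeeded or not
        historical_response_time.labels(endpoint=endpoint,
                                        commodity=commodity,
                                        interval=interval).observe(time.monotonic() - start)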

Alerts

# prometheus/alerts/sdk_health.yml

groups:
  - name: sdk_health
    interval: 1m
    rules:
      - alert: HistoricalEndpointSlow
        expr: |
          histogram_quantile(0.95,
            rate(historical_endpoint_response_seconds_bucket[5m])
          ) > 60
        for: 5m
        labels:
          severity: warning
          component: sdk
        annotations:
          summary: "Historical endpoint P95 latency >60s"
          description: "95th percentile response time is {{ $value }}s"

      - alert: HighTimeoutRate
        expr: |
          sum(rate(historical_endpoint_timeout_total[5m])) /
          sum(rate(historical_endpoint_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          component: sdk
        annotations:
          summary: "Historical endpoint timeout rate >5%"
          description: "{{ $value | humanizePercentage }} of requests timing out"

      - alert: SDKVersionErrorSpike
        expr: |
          sum by (sdk_version) (rate(sdk_error_total{sdk_version="1.4.2"}[5m])) > 0.10
        for: 5m
        labels:
          severity: critical
          component: sdk
        annotations:
          summary: "SDK v1.4.2 error spike (>0.1 errors/s)"
          description: "SDK {{ $labels.sdk_version }} is emitting {{ $value }} errors/s"

      - alert: CustomerRetryPattern
        expr: |
          sum by (user_id, endpoint) (
            rate(api_requests_total[5m])
          ) > 0.5
        for: 2m
        labels:
          severity: warning
          component: sdk
        annotations:
          summary: "User retrying same query repeatedly"
          description: "User {{ $labels.user_id }} retrying {{ $labels.endpoint }}"

Dashboard

// grafana/dashboards/sdk_health.json

{
  "title": "SDK Health Dashboard",
  "panels": [
    {
      "title": "Historical Endpoint Response Time (P50, P95, P99)",
      "targets": [
        {"expr": "histogram_quantile(0.50, rate(historical_endpoint_response_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.95, rate(historical_endpoint_response_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.99, rate(historical_endpoint_response_seconds_bucket[5m]))"}
      ]
    },
    {
      "title": "Timeout Rate by Endpoint",
      "targets": [
        {"expr": "rate(historical_endpoint_timeout_total[5m]) / rate(historical_endpoint_requests_total[5m])"}
      ]
    },
    {
      "title": "Error Rate by SDK Version",
      "targets": [
        {"expr": "rate(sdk_error_total[5m]) by (sdk_version)"}
      ]
    }
  ]
}

Implementation Plan

Phase 1: Backend Instrumentation (oilpriceapi-api)

  1. Add Prometheus metrics to historical endpoints
  2. Track response times, timeouts, errors
  3. Include SDK version in request tracking (see the sketch below)
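
For step 3, one way to derive the sdk_version label is from the request's User-Agent header. A hedged sketch, assuming the SDK identifies itself as oilpriceapi-python/<version>; the header format and helper name are assumptions, not confirmed SDK behavior.

import re

# Assumed User-Agent format, e.g. "oilpriceapi-python/1.4.2"; adjust to what the SDK actually sends
SDK_UA_PATTERN = re.compile(r'oilpriceapi-python/(?P<version>\d+\.\d+\.\d+)')

def sdk_version_from_user_agent(user_agent: str) -> str:
    """Return the SDK version for metric labels, or 'unknown' for non-SDK clients."""
    match = SDK_UA_PATTERN.search(user_agent or '')
    return match.group('version') if match else 'unknown'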

Phase 2: Alerting

  1. Deploy Prometheus alert rules
  2. Configure Slack notifications (routing sketch below)
  3. Set up on-call rotation for critical alerts
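
For step 2, the Alertmanager routing could look roughly like the following; the receiver names, channels, and webhook URLs are placeholders, not existing configuration.

# alertmanager/alertmanager.yml (sketch)
route:
  receiver: slack-sdk-warnings
  group_by: ['alertname', 'component']
  routes:
    - matchers:
        - severity="critical"
      receiver: slack-sdk-critical

receivers:
  - name: slack-sdk-warnings
    slack_configs:
      - channel: '#sdk-health'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'
        send_resolved: true
  - name: slack-sdk-critical
    slack_configs:
      - channel: '#sdk-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'
        send_resolved: true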

Phase 3: Dashboards

  1. Create Grafana dashboard (provisioning sketch below)
  2. Add to team's monitoring rotation
  3. Review weekly in team meetings
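
For step 1, the dashboard JSON above can be loaded via Grafana's file-based provisioning. A sketch; the provider name and dashboard path are assumptions about this deployment.

# grafana/provisioning/dashboards/sdk_health.yml (sketch)
apiVersion: 1

providers:
  - name: sdk-health
    folder: SDK
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards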

Acceptance Criteria

  • Prometheus metrics added to production API
  • Alert rules deployed and tested
  • Grafana dashboard created
  • Slack notifications configured
  • Runbook created for responding to alerts
  • Team trained on dashboard usage

Estimated Effort

Time: 4-6 hours

  • Backend instrumentation: 2h
  • Alert rules: 1h
  • Dashboard: 1h
  • Testing: 1h
  • Documentation: 1h

Success Metrics

  • Detect issues within 5 minutes of occurrence
  • Zero customer-reported bugs that weren't detected by monitoring
  • Response time to incidents <15 minutes

Test Plan

# Validate the alert rule file syntax
promtool check rules prometheus/alerts/sdk_health.yml

# Push a synthetic HighTimeoutRate alert straight to Alertmanager (default port 9093)
# to exercise the Slack notification route end to end
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "HighTimeoutRate", "severity": "critical", "component": "sdk"}}]'

# Verify alert fires
# Verify Slack notification received
# Verify runbook is followed
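
The rule logic itself can also be unit-tested offline with promtool test rules. A sketch; the file name and input series values are made up for illustration and assume the sdk_health.yml rules shown above sit in the same directory.

# prometheus/alerts/sdk_health_test.yml  (run with: promtool test rules sdk_health_test.yml)
rule_files:
  - sdk_health.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # ~10% of requests time out: 6 timeouts/min vs 60 requests/min
      - series: 'historical_endpoint_timeout_total{endpoint="historical", sdk_version="1.4.2"}'
        values: '0+6x10'
      - series: 'historical_endpoint_requests_total{endpoint="historical", sdk_version="1.4.2"}'
        values: '0+60x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighTimeoutRate
        exp_alerts:
          - exp_labels:
              severity: critical
              component: sdk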

Related

    Labels

      • ops (Operations and infrastructure)
      • priority: critical (Must be fixed immediately)
      • quadrant: q1 (Urgent & Important, Do First)
      • type: monitoring (Monitoring, observability, and alerting)
