Open
Labels: ops (Operations and infrastructure), priority: critical (Must be fixed immediately), quadrant: q1 (Urgent & Important, Do First), type: monitoring (Monitoring, observability, and alerting)
Description
Problem
We learned about the historical timeout bug from a customer email. Monitoring should have caught it first.
What we don't track:
- ❌ Timeout rate
- ❌ Response times by endpoint
- ❌ Error rates by SDK version
- ❌ Customer retry patterns
Impact
- Issues go undetected until customers report them
- No early warning system
- Can't measure impact of changes
- Can't track SDK adoption
Solution
Add comprehensive monitoring and alerting:
Backend Metrics (Production API)
# Track in production API (oilpriceapi-api)
from prometheus_client import Histogram, Counter

# Response time histogram
historical_response_time = Histogram(
    'historical_endpoint_response_seconds',
    'Historical endpoint response time',
    ['endpoint', 'commodity', 'interval'],
    buckets=[1, 5, 10, 30, 60, 90, 120, 180]
)

# Request counter (the timeout-rate alert and dashboard below divide by this)
historical_request_total = Counter(
    'historical_endpoint_requests_total',
    'Historical endpoint request count',
    ['endpoint', 'sdk_version']
)

# Timeout counter
historical_timeout_total = Counter(
    'historical_endpoint_timeout_total',
    'Historical endpoint timeout count',
    ['endpoint', 'sdk_version']
)

# Error counter
sdk_error_total = Counter(
    'sdk_error_total',
    'SDK errors by type',
    ['sdk_version', 'error_type', 'endpoint']
)
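To feed these metrics, every historical request handler needs to observe them. A minimal sketch of what that instrumentation could look like, assuming a Flask-style request object and that the SDK reports its version in a User-Agent header such as oilpriceapi-python/1.4.2; the decorator, the helper, the header format, and the query-parameter names are illustrative assumptions, not existing code.

# Sketch: wrap historical endpoint handlers with request, timing, timeout, and error tracking
import time
from functools import wraps

def sdk_version_from_user_agent(user_agent):
    """Extract '1.4.2' from 'oilpriceapi-python/1.4.2'; fall back to 'unknown'."""
    if user_agent.startswith("oilpriceapi-python/"):
        return user_agent.split("/", 1)[1]
    return "unknown"

def track_historical_request(endpoint):
    def decorator(handler):
        @wraps(handler)
        def wrapper(request, *args, **kwargs):
            sdk_version = sdk_version_from_user_agent(request.headers.get("User-Agent", ""))
            historical_request_total.labels(endpoint=endpoint, sdk_version=sdk_version).inc()
            start = time.monotonic()
            try:
                return handler(request, *args, **kwargs)
            except TimeoutError:
                historical_timeout_total.labels(endpoint=endpoint, sdk_version=sdk_version).inc()
                raise
            except Exception as exc:
                sdk_error_total.labels(
                    sdk_version=sdk_version,
                    error_type=type(exc).__name__,
                    endpoint=endpoint,
                ).inc()
                raise
            finally:
                # 'by_code' and 'interval' query-parameter names are assumptions
                historical_response_time.labels(
                    endpoint=endpoint,
                    commodity=request.args.get("by_code", "unknown"),
                    interval=request.args.get("interval", "unknown"),
                ).observe(time.monotonic() - start)
        return wrapper
    return decorator

A handler would then be decorated with, for example, @track_historical_request('/v1/prices/past_year') (the path is illustrative).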
Alerts
# prometheus/alerts/sdk_health.yml
groups:
  - name: sdk_health
    interval: 1m
    rules:
      - alert: HistoricalEndpointSlow
        expr: |
          histogram_quantile(0.95,
            rate(historical_endpoint_response_seconds_bucket[5m])
          ) > 60
        for: 5m
        labels:
          severity: warning
          component: sdk
        annotations:
          summary: "Historical endpoint P95 latency >60s"
          description: "95th percentile response time is {{ $value }}s"

      - alert: HighTimeoutRate
        expr: |
          rate(historical_endpoint_timeout_total[5m]) /
          rate(historical_endpoint_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
          component: sdk
        annotations:
          summary: "Historical endpoint timeout rate >5%"
          description: "{{ $value | humanizePercentage }} of requests timing out"

      - alert: SDKVersionErrorSpike
        expr: |
          rate(sdk_error_total{sdk_version="1.4.2"}[5m]) > 0.10
        for: 5m
        labels:
          severity: critical
          component: sdk
        annotations:
          summary: "SDK v1.4.2 error rate >10%"
          description: "High error rate detected: {{ $value | humanizePercentage }}"

      - alert: CustomerRetryPattern
        expr: |
          sum by (user_id, endpoint) (
            rate(api_requests_total[5m])
          ) > 0.5
        for: 2m
        labels:
          severity: warning
          component: sdk
        annotations:
          summary: "User retrying same query repeatedly"
          description: "User {{ $labels.user_id }} retrying {{ $labels.endpoint }}"
Dashboard
// grafana/dashboards/sdk_health.json
{
  "title": "SDK Health Dashboard",
  "panels": [
    {
      "title": "Historical Endpoint Response Time (P50, P95, P99)",
      "targets": [
        {"expr": "histogram_quantile(0.50, rate(historical_endpoint_response_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.95, rate(historical_endpoint_response_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.99, rate(historical_endpoint_response_seconds_bucket[5m]))"}
      ]
    },
    {
      "title": "Timeout Rate by Endpoint",
      "targets": [
        {"expr": "rate(historical_endpoint_timeout_total[5m]) / rate(historical_endpoint_requests_total[5m])"}
      ]
    },
    {
      "title": "Error Rate by SDK Version",
      "targets": [
        {"expr": "sum by (sdk_version) (rate(sdk_error_total[5m]))"}
      ]
    }
  ]
}
Implementation Plan
Phase 1: Backend Instrumentation (oilpriceapi-api)
- Add Prometheus metrics to historical endpoints
- Track response times, timeouts, errors
- Include SDK version in request tracking
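On the client side, the SDK can report its version via the User-Agent header so the backend can attach it as a label. A minimal sketch, assuming the client is built on requests; the header format mirrors the parsing sketch above and is an assumption, not current SDK behavior.

# Sketch: SDK-side version reporting
import requests

SDK_VERSION = "1.4.2"  # illustrative; in practice read from the package metadata

session = requests.Session()
session.headers["User-Agent"] = f"oilpriceapi-python/{SDK_VERSION}"
# Every call made through this session now carries the SDK version,
# which the backend instrumentation can record as a metric label.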
Phase 2: Alerting
- Deploy Prometheus alert rules
- Configure Slack notifications (see the Alertmanager sketch after this list)
- Set up on-call rotation for critical alerts
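For the Slack notification step, Alertmanager needs a route and receivers that match the component: sdk alerts defined above. A minimal sketch; the channel names and the webhook URL are placeholders to be replaced with real values.

# alertmanager/alertmanager.yml  (sketch: route SDK alerts to Slack)
route:
  receiver: slack-sdk-health
  routes:
    - matchers:
        - component = "sdk"
        - severity = "critical"
      receiver: slack-sdk-critical

receivers:
  - name: slack-sdk-health
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME
        channel: '#sdk-health'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: slack-sdk-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME
        channel: '#sdk-alerts'
        send_resolved: true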
Phase 3: Dashboards
- Create Grafana dashboard (see the provisioning sketch after this list)
- Add to team's monitoring rotation
- Review weekly in team meetings
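For the dashboard step, Grafana's file provisioning can load grafana/dashboards/sdk_health.json automatically instead of importing it by hand. A minimal sketch; the provider name, folder, and container path are assumptions.

# grafana/provisioning/dashboards/sdk_health.yaml
apiVersion: 1
providers:
  - name: sdk-health
    folder: SDK
    type: file
    disableDeletion: false
    updateIntervalSeconds: 60
    options:
      path: /var/lib/grafana/dashboards   # directory containing sdk_health.json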
Acceptance Criteria
- Prometheus metrics added to production API
- Alert rules deployed and tested
- Grafana dashboard created
- Slack notifications configured
- Runbook created for responding to alerts
- Team trained on dashboard usage
Estimated Effort
Time: 4-6 hours
- Backend instrumentation: 2h
- Alert rules: 1h
- Dashboard: 1h
- Testing: 1h
- Documentation: 1h
Success Metrics
- Detect issues within 5 minutes of occurrence
- Zero customer-reported bugs that weren't detected by monitoring
- Response time to incidents <15 minutes
Test Plan
# Validate and unit-test the alert rules with promtool before deploying
promtool check rules prometheus/alerts/sdk_health.yml
promtool test rules prometheus/alerts/sdk_health_test.yml

# Simulate a firing alert by injecting it directly into Alertmanager
amtool alert add HighTimeoutRate severity=critical component=sdk \
  --alertmanager.url=http://localhost:9093

# Verify alert fires
# Verify Slack notification received
# Verify runbook is followed
Related
- Historical timeout bug (would have been detected in 5min)
- Issue #1: Refactor: Extract duplicate retry logic into shared strategy class (Integration tests)
- Issue #2: Fix broken PyPI examples page (404 error) (Performance tests)