Health System
The health system tracks provider reliability in two ways: passive monitoring, based on the outcomes of real requests, and active probing, based on periodic HTTP checks.
Components
Health Tracker (internal/health/tracker.go)
The tracker maintains per-provider health state:
type ProviderHealthState struct {
    State         string    // "healthy", "degraded", "down"
    TotalRequests int64
    TotalErrors   int64
    ConsecErrors  int
    AvgLatencyMs  float64   // Exponential moving average
    LastError     string
    LastSuccessAt time.Time
    CooldownUntil time.Time
}
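The AvgLatencyMs field is described as an exponential moving average. As a rough sketch of how such an average might be updated on each successful request, the helper below blends the newest sample with the previous average. The name updateEMA, the seeded flag, and the smoothing factor 0.2 are assumptions for illustration; the factor the tracker actually uses is not stated here.

```go
package main

import "fmt"

// updateEMA returns the new exponential moving average after observing sample.
// alpha in (0, 1] weights the newest sample; higher values react faster.
// When seeded is false there is no prior average, so the sample is used as-is.
// Hypothetical helper: the tracker's real smoothing factor is not documented.
func updateEMA(prev, sample, alpha float64, seeded bool) float64 {
	if !seeded {
		return sample
	}
	return alpha*sample + (1-alpha)*prev
}

func main() {
	avg, seeded := 0.0, false
	for _, latencyMs := range []float64{100, 100, 500, 100} {
		avg = updateEMA(avg, latencyMs, 0.2, seeded)
		seeded = true
		fmt.Printf("sample=%.0fms avg=%.1fms\n", latencyMs, avg)
	}
}
```

With a small alpha, a single slow request moves the average only partway toward the outlier, which keeps one spike from dominating a latency-based score.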
State Transitions
                  success
    ┌─────────────────────────────────┐
    │                                 │
    ▼        2+ consec errors         │
 Healthy ──────────────────────► Degraded
    ▲                                 │
    │             success             │ 5+ consec errors
    │◄────────────────────────────────┤
    │                                 ▼
    │                                Down
    │        cooldown expired         │
    │           + success             │
    └─────────────────────────────────┘
Configuration
type Config struct {
    DegradedThreshold int           // Consecutive errors to enter degraded (default: 2)
    DownThreshold     int           // Consecutive errors to enter down (default: 5)
    CooldownDuration  time.Duration // Time in down state before retry (default: 30s)
}
Recording Results
// Called after every provider request
tracker.RecordSuccess(providerID, latencyMs)
tracker.RecordError(providerID, errorMsg)
Each success resets the consecutive error counter. Each error increments it and potentially triggers a state transition.
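The transitions described above can be sketched as a small state machine. This is a minimal, single-provider illustration and not the code in tracker.go: the names sketchTracker, onSuccess, onError, defaultConfig, and fakeClock are hypothetical, and two behaviors are assumptions because the source does not specify them (a success during cooldown leaves the provider down, and further errors while down do not extend the cooldown).

```go
package main

import (
	"fmt"
	"time"
)

// Config mirrors the documented thresholds and cooldown.
type Config struct {
	DegradedThreshold int
	DownThreshold     int
	CooldownDuration  time.Duration
}

// defaultConfig returns the defaults listed in the Configuration section.
func defaultConfig() Config {
	return Config{DegradedThreshold: 2, DownThreshold: 5, CooldownDuration: 30 * time.Second}
}

// fakeClock is a hypothetical clock so cooldown expiry can be simulated.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// sketchState holds only the fields needed for transitions.
type sketchState struct {
	State         string
	ConsecErrors  int
	CooldownUntil time.Time
}

// sketchTracker is a single-provider, non-thread-safe illustration.
type sketchTracker struct {
	cfg Config
	now func() time.Time
	s   sketchState
}

func newSketchTracker(cfg Config, now func() time.Time) *sketchTracker {
	return &sketchTracker{cfg: cfg, now: now, s: sketchState{State: "healthy"}}
}

// onSuccess resets the error streak. A degraded provider recovers at once;
// a down provider recovers only after its cooldown has expired (assumption:
// a success during cooldown leaves the state unchanged).
func (t *sketchTracker) onSuccess() string {
	t.s.ConsecErrors = 0
	if t.s.State == "down" && t.now().Before(t.s.CooldownUntil) {
		return t.s.State
	}
	t.s.State = "healthy"
	return t.s.State
}

// onError extends the error streak and applies the two thresholds.
func (t *sketchTracker) onError() string {
	t.s.ConsecErrors++
	switch {
	case t.s.State == "down":
		// Already down: stay down (assumption: the cooldown is not extended).
	case t.s.ConsecErrors >= t.cfg.DownThreshold:
		t.s.State = "down"
		t.s.CooldownUntil = t.now().Add(t.cfg.CooldownDuration)
	case t.s.ConsecErrors >= t.cfg.DegradedThreshold:
		t.s.State = "degraded"
	}
	return t.s.State
}

func main() {
	clk := &fakeClock{}
	tr := newSketchTracker(defaultConfig(), clk.Now)
	for i := 1; i <= 5; i++ {
		fmt.Printf("error %d -> %s\n", i, tr.onError())
	}
	fmt.Println("success during cooldown ->", tr.onSuccess())
	clk.Advance(30 * time.Second)
	fmt.Println("success after cooldown ->", tr.onSuccess())
}
```

Injecting the clock as a function keeps the cooldown logic testable without sleeping; a real tracker would also need per-provider state and locking.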
Health Prober (internal/health/prober.go)
The prober performs active health checks against provider endpoints:
type Probeable interface {
    ID() string
    HealthEndpoint() string
}
Probe Logic
- Sends GET requests to each provider's health endpoint
- Runs all probes concurrently with a per-probe timeout
- 2xx or 405 responses are considered healthy (405 is expected from some endpoints like Anthropic's /v1/messages)
- Any other response or connection error records a failure
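One probe round following the rules above might look like the sketch below. It is not the code in prober.go: probeOne, probeAll, isHealthyStatus, staticTarget, cannedTransport, and demoResults are hypothetical names, HealthEndpoint() is assumed to return a full URL, and the fake transport with .test hostnames exists only so the example runs without network access.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"strings"
	"sync"
	"time"
)

// Probeable matches the interface shown above.
type Probeable interface {
	ID() string
	HealthEndpoint() string
}

// isHealthyStatus applies the documented rule: any 2xx, or 405.
func isHealthyStatus(code int) bool {
	return (code >= 200 && code < 300) || code == http.StatusMethodNotAllowed
}

// probeOne sends one GET bounded by timeout and returns nil when healthy.
func probeOne(c *http.Client, p Probeable, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, p.HealthEndpoint(), nil)
	if err != nil {
		return err
	}
	resp, err := c.Do(req)
	if err != nil {
		return err // connection errors and timeouts count as failures
	}
	defer resp.Body.Close()
	if !isHealthyStatus(resp.StatusCode) {
		return fmt.Errorf("unhealthy status %d", resp.StatusCode)
	}
	return nil
}

// probeAll runs one round: one goroutine per target, then waits for all.
func probeAll(c *http.Client, targets []Probeable, timeout time.Duration) map[string]error {
	var mu sync.Mutex
	var wg sync.WaitGroup
	results := make(map[string]error, len(targets))
	for _, p := range targets {
		wg.Add(1)
		go func(p Probeable) {
			defer wg.Done()
			err := probeOne(c, p, timeout)
			mu.Lock()
			results[p.ID()] = err
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return results
}

// staticTarget and cannedTransport are hypothetical stand-ins so the sketch
// runs without network access; unlisted hosts simulate a connection error.
type staticTarget struct{ id, url string }

func (s staticTarget) ID() string             { return s.id }
func (s staticTarget) HealthEndpoint() string { return s.url }

type cannedTransport map[string]int

func (t cannedTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	code, ok := t[r.URL.Host]
	if !ok {
		return nil, fmt.Errorf("dial %s: connection refused", r.URL.Host)
	}
	body := io.NopCloser(strings.NewReader(""))
	return &http.Response{StatusCode: code, Header: http.Header{}, Body: body, Request: r}, nil
}

// demoResults probes fake providers answering 200, 405, 500, and not at all.
func demoResults() map[string]error {
	client := &http.Client{Transport: cannedTransport{"a.test": 200, "b.test": 405, "c.test": 500}}
	targets := []Probeable{
		staticTarget{"a", "http://a.test/health"},
		staticTarget{"b", "http://b.test/v1/messages"},
		staticTarget{"c", "http://c.test/health"},
		staticTarget{"d", "http://d.test/health"},
	}
	return probeAll(client, targets, 10*time.Second)
}

func main() {
	for id, err := range demoResults() {
		fmt.Printf("%s healthy=%t err=%v\n", id, err == nil, err)
	}
}
```

In a real prober, a nil result would presumably feed tracker.RecordSuccess and a non-nil result tracker.RecordError, with rounds repeated on the configured interval.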
Configuration
type ProberConfig struct {
    Interval time.Duration // Time between probe rounds (default: 30s)
    Timeout  time.Duration // Per-probe HTTP timeout (default: 10s)
}
Provider Health Endpoints
| Provider | Endpoint | Success |
|---|---|---|
| OpenAI | GET /v1/models | 2xx |
| Anthropic | GET /v1/messages | 2xx or 405 |
| vLLM | GET /health | 2xx |
Integration with Routing
The routing engine queries health state during model selection:
- Eligibility: Models from providers in "down" state are excluded
- Scoring: The failure rate (totalErrors / totalRequests) contributes to the model's score
- Latency: The exponential moving average latency contributes to the model's score
type HealthChecker interface {
    ProviderState(providerID string) ProviderHealthState
}
The tracker implements this interface and is passed to the engine via engine.SetHealthChecker().
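To illustrate how a router could consume this interface, the sketch below filters out candidates whose provider is down and computes the failure rate used in scoring. It assumes nothing about the real engine: candidate, staticHealth, eligible, failureRate, and the demo data are hypothetical, the struct is trimmed to the fields the example reads, and how the failure rate and latency are weighted into a final score is not covered because the source does not specify it.

```go
package main

import "fmt"

// ProviderHealthState is trimmed to the fields this sketch reads.
type ProviderHealthState struct {
	State         string
	TotalRequests int64
	TotalErrors   int64
}

// HealthChecker matches the interface shown above.
type HealthChecker interface {
	ProviderState(providerID string) ProviderHealthState
}

// staticHealth is a hypothetical in-memory HealthChecker for the example.
type staticHealth map[string]ProviderHealthState

func (s staticHealth) ProviderState(id string) ProviderHealthState { return s[id] }

// candidate is a hypothetical (model, provider) pair considered by routing.
type candidate struct{ Model, ProviderID string }

// failureRate returns TotalErrors / TotalRequests, or 0 before any traffic.
func failureRate(h ProviderHealthState) float64 {
	if h.TotalRequests == 0 {
		return 0
	}
	return float64(h.TotalErrors) / float64(h.TotalRequests)
}

// eligible drops candidates whose provider is currently "down".
func eligible(hc HealthChecker, in []candidate) []candidate {
	out := make([]candidate, 0, len(in))
	for _, c := range in {
		if hc.ProviderState(c.ProviderID).State != "down" {
			out = append(out, c)
		}
	}
	return out
}

// demoHealth and demoCandidates provide made-up data for the example.
func demoHealth() staticHealth {
	return staticHealth{
		"provider-a": {State: "degraded", TotalRequests: 100, TotalErrors: 25},
		"provider-b": {State: "down", TotalRequests: 40, TotalErrors: 30},
		"provider-c": {State: "healthy"},
	}
}

func demoCandidates() []candidate {
	return []candidate{
		{"model-1", "provider-a"},
		{"model-2", "provider-b"},
		{"model-3", "provider-c"},
	}
}

func main() {
	hc := demoHealth()
	for _, c := range eligible(hc, demoCandidates()) {
		rate := failureRate(hc.ProviderState(c.ProviderID))
		fmt.Printf("%s via %s: failure rate %.2f\n", c.Model, c.ProviderID, rate)
	}
}
```

Guarding the division matters for providers with no recorded traffic, which would otherwise produce NaN and poison any score built on it.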
Observability
Provider health is exposed via:
- GET /admin/v1/health: JSON health state for all providers
- Admin UI health panel: Visual health badges
- SSE events: Error events include provider state changes