Monitoring & Observability

TokenHub provides multiple layers of observability: health tracking, Prometheus metrics, time-series data, request logs, audit logs, reward logs, and real-time SSE events.

Health Endpoint

curl http://localhost:8080/healthz

Status  Meaning
200     System is healthy; adapters and models are registered
503     No adapters or no models are registered

Response:

{"status": "ok", "adapters": 2, "models": 6}
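
A readiness check can reduce this response to a few fields. The following is a minimal sketch, not part of TokenHub: the function name `interpret_healthz` is hypothetical, and it assumes a 503 response may arrive with an empty or non-JSON body, which the text above does not specify.

```python
import json


def interpret_healthz(status_code: int, body: str) -> dict:
    """Summarize a /healthz response (hypothetical helper, not a TokenHub API)."""
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    return {
        # Per the status table: 200 means healthy, 503 means nothing is registered.
        "healthy": status_code == 200,
        "adapters": payload.get("adapters"),
        "models": payload.get("models"),
    }
```

Keeping the parsing separate from the HTTP call makes the check easy to reuse from a deployment script or a test.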

Provider Health

View per-provider health status:

curl http://localhost:8080/admin/v1/health

Response:

{
  "providers": [
    {
      "provider_id": "openai",
      "state": "healthy",
      "total_requests": 1234,
      "total_errors": 5,
      "consec_errors": 0,
      "avg_latency_ms": 456.7,
      "last_error": "",
      "last_success_at": "2026-02-16T12:34:56Z",
      "cooldown_until": "0001-01-01T00:00:00Z"
    }
  ]
}
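
The counters in this response are enough to derive a lifetime error rate per provider. This sketch assumes the response shape shown above; `provider_error_rates` is a hypothetical helper name.

```python
def provider_error_rates(health: dict) -> dict:
    """Map provider_id to total_errors / total_requests (0.0 when there is no traffic)."""
    rates = {}
    for provider in health.get("providers", []):
        total = provider.get("total_requests", 0)
        errors = provider.get("total_errors", 0)
        rates[provider["provider_id"]] = errors / total if total else 0.0
    return rates
```

For the sample response, the `openai` entry comes out at 5 / 1234, or roughly 0.4%.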

Health States

State     Consecutive Errors  Behavior
Healthy   0-1                 Normal routing
Degraded  2-4                 Still routed but penalized in scoring
Down      5+                  Excluded from routing; 30-second cooldown
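
The thresholds above can be expressed as a small classifier. This is an illustrative sketch of the documented thresholds only, not TokenHub's actual code; the function name is hypothetical, and cooldown handling is left out.

```python
def classify_state(consec_errors: int) -> str:
    """Return the health state implied by a consecutive-error count."""
    if consec_errors < 0:
        raise ValueError("consec_errors must be non-negative")
    if consec_errors >= 5:
        return "down"       # excluded from routing
    if consec_errors >= 2:
        return "degraded"   # still routed, penalized in scoring
    return "healthy"        # normal routing
```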

Active Health Probing

TokenHub actively probes provider health endpoints in the background:

Provider   Health Endpoint   Success Criteria
OpenAI     GET /v1/models    2xx response
Anthropic  GET /v1/messages  2xx or 405 response
vLLM       GET /health       2xx response

Probes run every 30 seconds with a 10-second timeout.
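
The success criteria can be sketched as a single predicate. The provider keys (`"openai"`, `"anthropic"`, `"vllm"`) and the helper name are assumptions made for illustration; only the status-code rules come from the table above.

```python
# Extra status codes that count as a successful probe, beyond any 2xx.
# Anthropic's probe is a GET against /v1/messages, so 405 is accepted.
EXTRA_OK_STATUSES = {"anthropic": {405}}


def probe_succeeded(provider: str, status_code: int) -> bool:
    """Return True if a probe response meets the documented success criteria."""
    if 200 <= status_code < 300:
        return True
    return status_code in EXTRA_OK_STATUSES.get(provider.lower(), set())
```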

Prometheus Metrics

Metrics are exposed at:

curl http://localhost:8080/metrics

Available Metrics

Metric                       Type       Labels                         Description
tokenhub_requests_total      counter    mode, model, provider, status  Total requests processed
tokenhub_request_latency_ms  histogram  mode, model, provider          Request latency distribution
tokenhub_cost_usd_total      counter    model, provider                Cumulative estimated cost

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: tokenhub
    scrape_interval: 15s
    static_configs:
      - targets: ['tokenhub:8080']

Example Queries

# Request rate by model
sum by (model) (rate(tokenhub_requests_total[5m]))

# P95 latency (milliseconds), aggregated across all series
histogram_quantile(0.95, sum by (le) (rate(tokenhub_request_latency_ms_bucket[5m])))

# Cost per hour by provider
sum by (provider) (rate(tokenhub_cost_usd_total[1h])) * 3600

# Error rate
sum(rate(tokenhub_requests_total{status="error"}[5m])) /
sum(rate(tokenhub_requests_total[5m]))

Time-Series Database (TSDB)

TokenHub includes a lightweight SQLite-backed TSDB for historical metrics with querying and downsampling.

Query Metrics

curl "http://localhost:8080/admin/v1/tsdb/query?metric=latency&model_id=gpt-4&start=2026-02-16T00:00:00Z&end=2026-02-16T23:59:59Z&step_ms=60000"

Parameter    Required  Description
metric       Yes       Metric name (latency or cost)
model_id     No        Filter by model
provider_id  No        Filter by provider
start        No        Start time (RFC3339)
end          No        End time (RFC3339)
step_ms      No        Downsample bucket in milliseconds
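
When building these URLs programmatically, it helps to let a library handle the escaping of timestamps. The sketch below uses only the parameters listed above; the helper name `tsdb_query_url` and the `base_url` argument are hypothetical.

```python
from urllib.parse import urlencode


def tsdb_query_url(base_url, metric, model_id=None, provider_id=None,
                   start=None, end=None, step_ms=None):
    """Build a /admin/v1/tsdb/query URL, omitting optional parameters left unset."""
    if not metric:
        raise ValueError("metric is required")
    params = {"metric": metric}
    optional = {
        "model_id": model_id,
        "provider_id": provider_id,
        "start": start,      # RFC3339 timestamp
        "end": end,          # RFC3339 timestamp
        "step_ms": step_ms,  # downsample bucket in milliseconds
    }
    for key, value in optional.items():
        if value is not None:
            params[key] = value
    return f"{base_url.rstrip('/')}/admin/v1/tsdb/query?{urlencode(params)}"
```

Note that `urlencode` percent-encodes the colons in RFC3339 timestamps, which is equivalent to the unescaped form in the curl example.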

List Available Metrics

curl http://localhost:8080/admin/v1/tsdb/metrics

Configure Retention

curl -X PUT http://localhost:8080/admin/v1/tsdb/retention \
  -H "Content-Type: application/json" \
  -d '{"retention_days": 14}'

Default retention is 7 days. Old data is automatically pruned hourly.

Manual Prune

curl -X POST http://localhost:8080/admin/v1/tsdb/prune

Request Logs

View paginated request history:

curl "http://localhost:8080/admin/v1/logs?limit=50&offset=0"

Each entry contains:

  • Timestamp, request ID
  • Model ID, provider ID, routing mode
  • Estimated cost, latency
  • HTTP status code, error class (if failed)
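
Walking the full history means stepping `offset` forward by `limit` until the entries run out. The sketch below assumes a caller-supplied `fetch_page(limit, offset)` function that returns a list of entries, and it assumes a page shorter than `limit` marks the end; neither the response envelope nor an end-of-data signal is documented here, so both are illustrative assumptions.

```python
from typing import Callable, Iterator


def iter_entries(fetch_page: Callable[[int, int], list], limit: int = 50) -> Iterator[dict]:
    """Yield every entry from a limit/offset paginated endpoint."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            break  # assumed end-of-data signal: a short (or empty) page
        offset += limit
```

The same loop applies to any other endpoint that takes `limit` and `offset` query parameters.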

Audit Logs

View admin action history:

curl "http://localhost:8080/admin/v1/audit?limit=50&offset=0"

Logged actions:

  • vault.lock, vault.unlock, vault.rotate
  • provider.upsert, provider.delete
  • model.upsert, model.patch, model.delete
  • apikey.create, apikey.rotate, apikey.update, apikey.revoke
  • routing-config.update

Reward Logs

View contextual bandit reward data for RL-based routing analysis:

curl "http://localhost:8080/admin/v1/rewards?limit=50&offset=0"

Each entry contains: request ID, mode, model, provider, token count, token bucket (small/medium/large), latency budget, actual latency, cost, success flag, error class, and computed reward.

Aggregated Statistics

curl http://localhost:8080/admin/v1/stats

Returns global aggregates plus breakdowns by model and by provider.

Server-Sent Events (SSE)

Subscribe to real-time events:

curl -N http://localhost:8080/admin/v1/events

Event types:

Event          Fields                                               When
route_success  model_id, provider_id, latency_ms, cost_usd, reason  Request completed successfully
route_error    latency_ms, error_class, error_msg                   Request failed

Example:

data: {"type":"route_success","model_id":"gpt-4","provider_id":"openai","latency_ms":456.7,"cost_usd":0.023,"reason":"routed-weight-8"}
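
A client consuming the stream needs to pick out `data:` lines and decode their JSON payload. This minimal sketch handles one line at a time and assumes each event fits on a single `data:` line, as in the example above; the function name is hypothetical.

```python
import json


def parse_sse_data(line: str):
    """Decode one SSE line; return the event dict, or None for non-data lines."""
    if not line.startswith("data:"):
        return None  # comments, blank separators, or other SSE fields
    payload = line[len("data:"):].strip()
    if not payload:
        return None
    return json.loads(payload)
```

A consumer would typically branch on the decoded `type` field to separate `route_success` from `route_error` events.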

Suggested Alerts

Alert            Condition                              Severity
High error rate  Error rate > 5% over 5 minutes         Warning
Provider down    Provider in "down" state > 2 minutes   Critical
High latency     P95 latency > 10 seconds               Warning
Cost spike       Hourly cost > 2x 7-day average         Warning
Vault locked     Vault locked during business hours     Critical
No providers     Adapter count = 0                      Critical
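
The threshold-based conditions listed above can be sketched as a pure function over pre-computed inputs. All parameter names here are hypothetical, and the "Provider down" and "Vault locked" conditions are left out because they depend on durations and schedules that this sketch does not model.

```python
def evaluate_alerts(error_rate: float, p95_latency_ms: float,
                    hourly_cost_usd: float, avg_hourly_cost_usd_7d: float,
                    adapter_count: int) -> list:
    """Return (alert, severity) pairs for each threshold that is exceeded."""
    alerts = []
    if error_rate > 0.05:  # error rate over the last 5 minutes, as a fraction
        alerts.append(("High error rate", "Warning"))
    if p95_latency_ms > 10_000:  # 10 seconds
        alerts.append(("High latency", "Warning"))
    if avg_hourly_cost_usd_7d > 0 and hourly_cost_usd > 2 * avg_hourly_cost_usd_7d:
        alerts.append(("Cost spike", "Warning"))
    if adapter_count == 0:
        alerts.append(("No providers", "Critical"))
    return alerts
```

In a Prometheus deployment these checks would more naturally live in alerting rules; the function form is only meant to make the thresholds concrete.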