Introduction
TokenHub is an intelligent LLM routing proxy that sits between your applications and multiple AI providers. It provides a unified API for chat and planning requests while automatically selecting the best model based on cost, latency, capability, and provider health.
What TokenHub Does
- Unified API: Single endpoint for OpenAI, Anthropic, and vLLM models
- Intelligent Routing: Multi-objective model selection considering cost, latency, capability weight, and provider health
- Orchestration: Multi-model reasoning with adversarial critique, voting, and iterative refinement modes
- Credential Security: AES-256-GCM encrypted vault for provider API keys with auto-lock and password rotation
- Client Key Management: Issue, rotate, and revoke API keys for your applications
- Real-Time Monitoring: Prometheus metrics, time-series database, audit logs, and a built-in admin UI
- Streaming: Server-Sent Events (SSE) streaming pass-through to all providers
- Reinforcement Learning: Thompson Sampling bandit policy for adaptive model routing
Architecture at a Glance
┌─────────────┐ ┌──────────────────────────────────────────────┐
│ Client App │────▶│ TokenHub │
│ │◀────│ │
└─────────────┘ │ ┌─────────┐ ┌────────┐ ┌──────────────┐ │
│ │ Router │──│ Health │ │ Admin API │ │
│ │ Engine │ │ Tracker │ │ + UI (SPA) │ │
│ └────┬────┘ └────────┘ └──────────────┘ │
│ │ │
│ ┌────┴──────────────────────────┐ │
│ │ Provider Adapters │ │
│ │ ┌────────┐┌─────────┐┌────┐ │ │
│ │ │ OpenAI ││Anthropic││vLLM│ │ │
│ │ └────────┘└─────────┘└────┘ │ │
│ └───────────────────────────────┘ │
│ │
│ ┌─────────┐ ┌──────┐ ┌──────┐ ┌─────────┐ │
│ │ SQLite │ │ TSDB │ │Vault │ │Temporal │ │
│ └─────────┘ └──────┘ └──────┘ └─────────┘ │
└──────────────────────────────────────────────┘
Who This Documentation Is For
- Users / Application Developers: Learn how to send requests through TokenHub and use features like streaming, directives, and output formatting. Start with the User Guide.
- Administrators: Configure providers, manage credentials, set routing policies, and monitor the system. Start with the Administrator Guide.
- Developers / Contributors: Understand the internals, extend provider support, or contribute to the project. Start with the Developer Guide.
Quick Links
| Task | Where to Go |
|---|---|
| Send your first request | Quick Start |
| Configure providers | Provider Management |
| Set up API keys | API Key Management |
| Command-line admin | tokenhubctl CLI |
| Deploy with Docker | Docker & Compose |
| Full API reference | API Reference |
| Monitor the system | Monitoring |
Quick Start
This guide gets TokenHub running and serving your first request in under five minutes.
Prerequisites
- Docker (for Docker Compose), or Go 1.24+ (for building from source)
- At least one LLM provider endpoint and API key
TokenHub works with any OpenAI-compatible API, the Anthropic API, or vLLM.
This includes services like NVIDIA NIM, Azure OpenAI, Together AI, Groq,
Fireworks, Mistral, local Ollama instances — anything that speaks the
OpenAI /v1/chat/completions protocol.
1. Start the Server
Docker Compose (recommended)
git clone https://github.com/jordanhubbard/tokenhub.git
cd tokenhub
docker compose up -d tokenhub
Build from Source
git clone https://github.com/jordanhubbard/tokenhub.git
cd tokenhub
make install # builds and installs tokenhub + tokenhubctl to ~/.local/bin
tokenhub
TokenHub starts on port 8080 by default. Docker Compose maps this to host port 8090. Adjust the examples below accordingly.
2. Register Providers
A freshly started TokenHub has no providers configured. You need to tell it where your LLM endpoints are. There are several ways to do this. Pick whichever fits your workflow.
Option A: Credentials file (recommended)
The ~/.tokenhub/credentials file is a declarative JSON file that seeds
providers and models at startup. It lives outside the source tree, requires
0600 permissions, and is processed before the service accepts requests.
API keys are automatically stored in the vault (when TOKENHUB_VAULT_PASSWORD
is set) and providers are persisted to the database on first boot. The file
is idempotent — it can stay in place across restarts.
mkdir -p ~/.tokenhub
chmod 700 ~/.tokenhub
cat > ~/.tokenhub/credentials << 'EOF'
{
"providers": [
{
"id": "ollama",
"type": "openai",
"base_url": "http://localhost:11434"
},
{
"id": "nvidia",
"type": "openai",
"base_url": "https://integrate.api.nvidia.com",
"api_key": "nvapi-..."
}
],
"models": [
{
"id": "llama3.1:8b",
"provider_id": "ollama",
"weight": 5,
"max_context_tokens": 8192,
"input_per_1k": 0.0,
"output_per_1k": 0.0
},
{
"id": "meta/llama-3.1-70b-instruct",
"provider_id": "nvidia",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.0003,
"output_per_1k": 0.0003
}
]
}
EOF
chmod 600 ~/.tokenhub/credentials
Then start the server:
make run # builds image, starts compose, tails logs
Override the default path with TOKENHUB_CREDENTIALS_FILE.
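If you want to catch mistakes before first boot, a short script can cross-check the file, for example verifying that every model's provider_id matches a declared provider. This is an illustrative sketch, not a TokenHub tool; the field names come from the example above:

```python
import json

def check_credentials(doc):
    """Return a list of problems found in a credentials document."""
    problems = []
    provider_ids = {p.get("id") for p in doc.get("providers", [])}
    for p in doc.get("providers", []):
        if not p.get("id"):
            problems.append("provider missing id")
        if not p.get("base_url"):
            problems.append(f"provider {p.get('id')!r} missing base_url")
    for m in doc.get("models", []):
        if m.get("provider_id") not in provider_ids:
            problems.append(
                f"model {m.get('id')!r} references unknown provider "
                f"{m.get('provider_id')!r}")
    return problems

doc = json.loads(
    '{"providers":[{"id":"ollama","type":"openai","base_url":"http://localhost:11434"}],'
    '"models":[{"id":"llama3.1:8b","provider_id":"ollama"}]}')
print(check_credentials(doc))  # []
```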
Option B: tokenhubctl (interactive)
With the server already running, use the CLI directly:
export TOKENHUB_URL="http://localhost:8090"
# Register a provider
tokenhubctl provider add '{
"id": "openai",
"type": "openai",
"base_url": "https://api.openai.com",
"api_key": "sk-..."
}'
# Register a model on that provider
tokenhubctl model add '{
"id": "gpt-4o",
"provider_id": "openai",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.0025,
"output_per_1k": 0.01,
"enabled": true
}'
Option C: Admin UI
Open http://localhost:8090/admin in your browser. The setup wizard walks you through adding your first provider: select the type, enter the base URL and API key, test the connection, then discover and register available models — all without touching the command line.
Option D: Admin API (curl)
# Register a provider
curl -X POST http://localhost:8090/admin/v1/providers \
-H "Content-Type: application/json" \
-d '{
"id": "anthropic",
"type": "anthropic",
"base_url": "https://api.anthropic.com",
"api_key": "sk-ant-...",
"enabled": true
}'
# Register a model
curl -X POST http://localhost:8090/admin/v1/models \
-H "Content-Type: application/json" \
-d '{
"id": "claude-sonnet-4-5-20250514",
"provider_id": "anthropic",
"weight": 8,
"max_context_tokens": 200000,
"input_per_1k": 0.003,
"output_per_1k": 0.015,
"enabled": true
}'
Providers persist across restarts. Once registered via the credentials file, the API,

tokenhubctl, or the UI, providers and models are stored in the database and restored automatically on restart. You only need to configure them once. API keys for vault-backed providers require the vault to be unlocked after restart (set TOKENHUB_VAULT_PASSWORD for automatic unlock).
3. Verify It's Running
curl http://localhost:8090/healthz
Or:
tokenhubctl status
Expected response:
{"status": "ok", "adapters": 2, "models": 2}
4. Create an API Key
TokenHub issues its own API keys to clients. Provider keys stay on the server.
tokenhubctl apikey create '{"name":"my-first-key","scopes":"[\"chat\",\"plan\"]"}'
Or via curl:
curl -X POST http://localhost:8090/admin/v1/apikeys \
-H "Content-Type: application/json" \
-d '{"name": "my-first-key", "scopes": "[\"chat\",\"plan\"]"}'
Save the returned key value — it is shown only once:
{
"ok": true,
"key": "tokenhub_a1b2c3d4...",
"id": "a1b2c3d4e5f6g7h8",
"prefix": "tokenhub_a1b2c3d4"
}
5. Send Your First Request
curl -X POST http://localhost:8090/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_a1b2c3d4..." \
-d '{
"request": {
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}
}'
TokenHub selects the best available model based on its routing policy and returns the response:
{
"negotiated_model": "gpt-4o",
"estimated_cost_usd": 0.0023,
"routing_reason": "routed-weight-8",
"response": {
"choices": [{
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
}
}]
}
}
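The envelope keeps routing metadata at the top level and nests the provider's raw payload under response. Pulling the fields out on the client side looks like this (a sketch against the sample response above):

```python
import json

envelope = json.loads("""
{
  "negotiated_model": "gpt-4o",
  "estimated_cost_usd": 0.0023,
  "routing_reason": "routed-weight-8",
  "response": {
    "choices": [{"message": {"role": "assistant",
                             "content": "The capital of France is Paris."}}]
  }
}
""")

# Routing metadata lives at the top level; the provider payload under "response".
model = envelope["negotiated_model"]
answer = envelope["response"]["choices"][0]["message"]["content"]
print(f"{model}: {answer}")  # gpt-4o: The capital of France is Paris.
```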
6. Explore
# See all registered providers and models
tokenhubctl provider list
tokenhubctl model list
# Watch routing decisions in real time
tokenhubctl events
# Open the admin dashboard
open http://localhost:8090/admin
Next Steps
- Provider Management for provider types, credential storage, and model discovery
- Chat API for request options, routing policies, and parameters
- Routing Configuration to tune model selection behavior
- tokenhubctl CLI for command-line administration
- Configuration Reference for all environment variables
User Guide Overview
This section is for application developers integrating with TokenHub. TokenHub exposes two main endpoints:
| Endpoint | Purpose |
|---|---|
POST /v1/chat | Single-turn or multi-turn chat completion |
POST /v1/plan | Multi-model orchestrated reasoning |
Both endpoints accept a unified request format and return the provider's response along with routing metadata (which model was chosen, estimated cost, and routing reason).
Key Concepts
Routing Policies
Every request can include a policy that guides model selection:
- cheap — Minimize cost (prefer smaller, cheaper models)
- normal — Balance cost, latency, capability, and reliability
- high_confidence — Prefer the most capable models regardless of cost
- planning — Optimized for planning and reasoning tasks
- thompson — Adaptive selection using reinforcement learning
If no policy is specified, the server's default routing mode applies.
Model Selection
TokenHub maintains a registry of models from all configured providers. Each model has:
- Weight (0-10): Higher weight = more capable
- Context window: Maximum tokens the model can process
- Pricing: Cost per 1,000 input and output tokens
- Health status: Based on recent success/failure rates
The routing engine scores all eligible models and selects the best match for your request.
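As an illustration only: a multi-objective score might blend these four signals, with the blend weights taken from the routing-mode profiles documented in this guide. The normalization below is an assumption for the sketch, not TokenHub's actual formula:

```python
def score(model, w_cost=0.25, w_latency=0.25, w_failure=0.25, w_capability=0.25):
    # Cheaper, faster, healthier, and more capable models all score higher.
    # The normalization constants here are illustrative assumptions.
    cost_term = 1.0 / (1.0 + 100 * model["cost_per_1k"])
    latency_term = 1.0 / (1.0 + model["p50_latency_ms"] / 1000)
    health_term = 1.0 - model["recent_failure_rate"]
    capability_term = model["weight"] / 10.0  # weight is 0-10
    return (w_cost * cost_term + w_latency * latency_term
            + w_failure * health_term + w_capability * capability_term)

models = [
    {"id": "small", "cost_per_1k": 0.0005, "p50_latency_ms": 400,
     "recent_failure_rate": 0.01, "weight": 3},
    {"id": "large", "cost_per_1k": 0.01, "p50_latency_ms": 2500,
     "recent_failure_rate": 0.02, "weight": 9},
]
# "normal" profile weights (0.25 each) vs "high_confidence" (0.05/0.1/0.15/0.7).
normal_pick = max(models, key=score)["id"]
hc_pick = max(models, key=lambda m: score(m, 0.05, 0.1, 0.15, 0.7))["id"]
print(normal_pick, hc_pick)  # small large
```

The same registry yields different winners under different profiles, which is the point of per-request policies.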
Authentication
All /v1 requests require an API key in the Authorization header:
Authorization: Bearer tokenhub_<key>
API keys are created and managed by administrators. Each key has scopes controlling which endpoints it can access (chat, plan, or both).
Provider Transparency
You interact only with TokenHub. The underlying provider (OpenAI, Anthropic, vLLM) is selected automatically and its API key is never exposed. The response includes which model and provider were used in the negotiated_model field.
Sections
- Chat API — Detailed guide to /v1/chat
- Plan API — Multi-model orchestration via /v1/plan
- Streaming — Server-Sent Events streaming
- Directives — In-band routing overrides embedded in messages
- Output Formats — JSON Schema validation, Markdown, XML output shaping
- Authentication — API key usage and scopes
Chat API
The chat endpoint provides single-turn or multi-turn completions with automatic model routing.
Endpoint: POST /v1/chat
Request Format
{
"request": {
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
"model_hint": "gpt-4",
"estimated_input_tokens": 500,
"parameters": {
"temperature": 0.7,
"max_tokens": 1024,
"top_p": 0.9
},
"stream": false,
"meta": {
"user_id": "u123",
"session": "abc"
}
},
"capabilities": {
"planning": true
},
"policy": {
"mode": "normal",
"max_budget_usd": 0.05,
"max_latency_ms": 15000,
"min_weight": 5
},
"output_format": {
"type": "json",
"schema": "{\"type\":\"object\",\"properties\":{\"answer\":{\"type\":\"string\"}}}",
"max_tokens": 500,
"strip_think": true
}
}
Request Fields
request (required)
| Field | Type | Required | Description |
|---|---|---|---|
messages | array | Yes | Array of {role, content} message objects |
model_hint | string | No | Preferred model ID; tried first before scoring |
estimated_input_tokens | int | No | Token count hint for routing decisions |
parameters | object | No | Provider parameters forwarded as-is (temperature, max_tokens, top_p, etc.) |
stream | bool | No | Enable SSE streaming response |
meta | object | No | Arbitrary metadata for logging and tracing |
output_schema | JSON | No | JSON Schema for structured output validation |
policy (optional)
Controls model selection behavior. All fields are optional and fall back to server defaults.
| Field | Type | Default | Range | Description |
|---|---|---|---|---|
mode | string | normal | See below | Routing mode |
max_budget_usd | float | 0.05 | 0-100 | Maximum cost per request |
max_latency_ms | int | 20000 | 0-300000 | Maximum acceptable latency |
min_weight | int | 0 | 0-10 | Minimum model capability weight |
Routing modes:
| Mode | Cost Weight | Latency Weight | Failure Weight | Capability Weight |
|---|---|---|---|---|
cheap | 0.7 | 0.1 | 0.1 | 0.1 |
normal | 0.25 | 0.25 | 0.25 | 0.25 |
high_confidence | 0.05 | 0.1 | 0.15 | 0.7 |
planning | 0.1 | 0.1 | 0.2 | 0.6 |
thompson | N/A | N/A | N/A | N/A |
The thompson mode uses reinforcement learning (Thompson Sampling with Beta distributions) to adaptively select models based on historical reward data.
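A minimal sketch of that mechanism: keep one Beta(successes + 1, failures + 1) distribution per model, sample from each, and route to the highest draw. TokenHub's actual reward signal and priors may differ:

```python
import random

class ThompsonRouter:
    def __init__(self, model_ids):
        # [alpha, beta] = [successes + 1, failures + 1] per model.
        self.stats = {m: [1, 1] for m in model_ids}

    def pick(self):
        # Sample each arm's Beta posterior; route to the highest draw.
        draws = {m: random.betavariate(a, b) for m, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def record(self, model_id, success):
        self.stats[model_id][0 if success else 1] += 1

random.seed(7)
router = ThompsonRouter(["gpt-4o", "llama3.1:8b"])
# Simulate feedback: gpt-4o succeeds 90% of the time, the other model 40%.
for _ in range(500):
    m = router.pick()
    router.record(m, random.random() < (0.9 if m == "gpt-4o" else 0.4))
a, b = router.stats["gpt-4o"]
print(a / (a + b))  # empirical success estimate; traffic concentrates here
```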
capabilities (optional)
| Field | Type | Description |
|---|---|---|
planning | bool | Indicates request needs planning capability |
Capabilities influence which routing mode profile is used when no explicit mode is set.
output_format (optional)
| Field | Type | Description |
|---|---|---|
type | string | Output format: json, markdown, text, xml |
schema | string | JSON Schema string for validating structured output |
max_tokens | int | Maximum output tokens to request from provider |
strip_think | bool | Remove <think>...</think> blocks from response |
Response Format
{
"negotiated_model": "gpt-4",
"estimated_cost_usd": 0.0023,
"routing_reason": "routed-weight-8",
"response": {
"id": "chatcmpl-...",
"choices": [{
"message": {
"role": "assistant",
"content": "Quantum computing uses..."
}
}],
"usage": {
"prompt_tokens": 45,
"completion_tokens": 128,
"total_tokens": 173
}
}
}
| Field | Description |
|---|---|
negotiated_model | The model ID that was selected |
estimated_cost_usd | Estimated cost based on model pricing and token counts |
routing_reason | Why this model was chosen (see Routing Reasons) |
response | Raw JSON response from the selected provider |
Routing Reasons
| Reason | Description |
|---|---|
routed-weight-N | Selected by scoring; N is the model's weight |
model-hint | Client's model hint was used |
escalated-context-overflow | Escalated to a model with a larger context window |
retried-transient | Retried after a transient provider error |
Error Responses
| Status | Body | Cause |
|---|---|---|
| 400 | "bad json" | Malformed request body |
| 400 | "messages required" | Empty messages array |
| 400 | "max_budget_usd must be between 0 and 100" | Policy validation failure |
| 401 | "missing or invalid api key" | Missing or invalid Authorization header |
| 403 | "scope not allowed" | API key lacks chat scope |
| 502 | Error message | All models failed or no eligible models |
Examples
Minimal Request
curl -X POST http://localhost:8080/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_..." \
-d '{
"request": {
"messages": [{"role": "user", "content": "Hello!"}]
}
}'
Cost-Optimized Request
curl -X POST http://localhost:8080/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_..." \
-d '{
"request": {
"messages": [{"role": "user", "content": "Summarize this text..."}]
},
"policy": {
"mode": "cheap",
"max_budget_usd": 0.001
}
}'
Request with Model Hint
curl -X POST http://localhost:8080/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_..." \
-d '{
"request": {
"messages": [{"role": "user", "content": "Write a poem about the ocean."}],
"model_hint": "claude-opus",
"parameters": {
"temperature": 0.9,
"max_tokens": 2048
}
}
}'
Structured JSON Output
curl -X POST http://localhost:8080/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_..." \
-d '{
"request": {
"messages": [{"role": "user", "content": "List 3 programming languages with their year of creation"}]
},
"output_format": {
"type": "json",
"schema": "{\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"},\"year\":{\"type\":\"integer\"}}}}"
}
}'
Plan API
The plan endpoint provides multi-model orchestrated reasoning. It coordinates multiple LLM calls using different strategies to produce higher-quality outputs than a single model call.
Endpoint: POST /v1/plan
Request Format
{
"request": {
"messages": [
{"role": "user", "content": "Design a REST API for a task management app"}
]
},
"orchestration": {
"mode": "adversarial",
"iterations": 2,
"primary_model_id": "claude-opus",
"review_model_id": "gpt-4",
"primary_min_weight": 5,
"review_min_weight": 8,
"return_plan_only": false,
"output_schema": "{\"type\":\"object\"}"
}
}
Orchestration Modes
Adversarial
A three-phase plan-critique-refine loop:
- Plan: Primary model generates an initial plan
- Critique: Review model analyzes the plan and provides feedback
- Refine: Primary model improves the plan based on the critique
The critique-refine cycle repeats for the configured number of iterations.
{
"orchestration": {
"mode": "adversarial",
"iterations": 2
}
}
Response:
{
"negotiated_model": "claude-opus",
"estimated_cost_usd": 0.15,
"routing_reason": "adversarial-orchestration",
"response": {
"initial_plan": "Here is the initial API design...",
"critique": "The design has these issues: ...",
"refined_plan": "Here is the improved design addressing the feedback..."
}
}
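With stubbed model calls, the adversarial loop can be sketched as follows; call here is a stand-in for an LLM request, not TokenHub's API:

```python
def adversarial(call, prompt, iterations=2):
    """Plan-critique-refine with a primary and a review model."""
    plan = call("primary", prompt)
    for _ in range(iterations):
        critique = call("review", f"Critique this plan:\n{plan}")
        plan = call("primary",
                    f"Improve the plan given this critique:\n{critique}\n\nPlan:\n{plan}")
    return plan

# Stubbed calls just record the sequence of phases.
trace = []
def fake_call(role, prompt):
    trace.append(role)
    return f"<{role} output {len(trace)}>"

adversarial(fake_call, "Design a REST API", iterations=2)
print(trace)  # ['primary', 'review', 'primary', 'review', 'primary']
```

Note the call count: 1 plan plus 2 per iteration, which matches the cost multipliers listed later in this guide.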
Vote
Multiple models respond independently, then a judge model selects the best:
- N models (voters) each generate a response to the same prompt
- A judge model reviews all responses and selects the best one
{
"orchestration": {
"mode": "vote"
}
}
Response:
{
"negotiated_model": "gpt-4",
"estimated_cost_usd": 0.08,
"routing_reason": "vote-orchestration",
"response": {
"responses": [
{"model": "gpt-4", "content": "Response A...", "selected": true},
{"model": "claude-sonnet", "content": "Response B...", "selected": false},
{"model": "gpt-3.5-turbo", "content": "Response C...", "selected": false}
],
"selected": 0,
"judge": "claude-opus"
}
}
Refine
A single model iteratively improves its own response:
- Model generates an initial response
- Model reviews and refines its own response (repeats for N iterations)
{
"orchestration": {
"mode": "refine",
"iterations": 3
}
}
Response:
{
"negotiated_model": "claude-opus",
"estimated_cost_usd": 0.12,
"routing_reason": "refine-orchestration",
"response": {
"refined_response": "Final refined response...",
"iterations": 3,
"model": "claude-opus"
}
}
Planning
Simple single-route with the planning weight profile (prioritizes capable models):
{
"orchestration": {
"mode": "planning"
}
}
Orchestration Fields
| Field | Type | Default | Range | Description |
|---|---|---|---|---|
mode | string | planning | See above | Orchestration strategy |
iterations | int | 1-2 | 0-10 | Number of refinement iterations |
primary_model_id | string | — | — | Explicit model for primary phase |
review_model_id | string | — | — | Explicit model for review/judge phase |
primary_min_weight | int | 0 | 0-10 | Minimum weight for primary model |
review_min_weight | int | 0 | 0-10 | Minimum weight for review model |
return_plan_only | bool | false | — | Return plan without executing refinement |
output_schema | string | — | — | JSON Schema for structured output validation |
Explicit Model Selection
By default, TokenHub selects models using its routing engine. You can override this with explicit model IDs:
{
"orchestration": {
"mode": "adversarial",
"primary_model_id": "claude-opus",
"review_model_id": "gpt-4"
}
}
Alternatively, use primary_min_weight and review_min_weight to set capability floors without specifying exact models:
{
"orchestration": {
"mode": "adversarial",
"primary_min_weight": 7,
"review_min_weight": 9
}
}
Error Responses
| Status | Body | Cause |
|---|---|---|
| 400 | "messages required" | Empty messages array |
| 400 | "iterations must be between 0 and 10" | Invalid iteration count |
| 400 | "unknown orchestration mode" | Unrecognized mode value |
| 401 | "missing or invalid api key" | Authentication failure |
| 403 | "scope not allowed" | API key lacks plan scope |
| 502 | Error message | Orchestration failed (all models failed) |
Cost Considerations
Orchestration modes make multiple LLM calls. Approximate cost multipliers:
| Mode | Calls per Request | Typical Cost Multiplier |
|---|---|---|
| Planning | 1 | 1x |
| Adversarial (2 iter) | 5 (plan + 2x(critique + refine)) | 5x |
| Vote (3 voters) | 4 (3 voters + 1 judge) | 4x |
| Refine (3 iter) | 4 (initial + 3 refinements) | 4x |
Budget accordingly when setting max_budget_usd in your policy.
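The call counts in the table follow directly from the mode definitions; as a sketch:

```python
# Calls per request for each orchestration mode, per the table above.
def calls_per_request(mode, iterations=1, voters=3):
    if mode == "planning":
        return 1
    if mode == "adversarial":
        return 1 + 2 * iterations   # plan + N x (critique + refine)
    if mode == "vote":
        return voters + 1           # voters + judge
    if mode == "refine":
        return 1 + iterations       # initial + refinements
    raise ValueError(f"unknown orchestration mode: {mode}")

print(calls_per_request("adversarial", iterations=2))  # 5
```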
Streaming
TokenHub supports Server-Sent Events (SSE) streaming for chat requests. When streaming is enabled, tokens are delivered incrementally as they are generated by the provider.
Enabling Streaming
Set stream: true in your request:
{
"request": {
"messages": [{"role": "user", "content": "Tell me a story..."}],
"stream": true
}
}
Response Format
Streaming responses use the text/event-stream content type. Each event is a line prefixed with data: :
data: {"choices":[{"delta":{"content":"Once"},"index":0}]}
data: {"choices":[{"delta":{"content":" upon"},"index":0}]}
data: {"choices":[{"delta":{"content":" a"},"index":0}]}
data: {"choices":[{"delta":{"content":" time"},"index":0}]}
data: [DONE]
The stream ends with data: [DONE].
Response Headers
Streaming responses include these headers:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
X-TokenHub-Model: gpt-4
X-TokenHub-Provider: openai
X-TokenHub-Reason: routed-weight-8
The X-TokenHub-* headers provide routing metadata that would normally be in the JSON response envelope.
Example with curl
curl -N -X POST http://localhost:8080/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_..." \
-d '{
"request": {
"messages": [{"role": "user", "content": "Count from 1 to 10 slowly."}],
"stream": true
}
}'
The -N flag disables output buffering so tokens appear as they arrive.
Example with Python
import requests
import json
response = requests.post(
"http://localhost:8080/v1/chat",
headers={
"Content-Type": "application/json",
"Authorization": "Bearer tokenhub_..."
},
json={
"request": {
"messages": [{"role": "user", "content": "Tell me a story."}],
"stream": True
}
},
stream=True
)
for line in response.iter_lines():
if line:
text = line.decode("utf-8")
if text.startswith("data: ") and text != "data: [DONE]":
chunk = json.loads(text[6:])
delta = chunk["choices"][0].get("delta", {})
if "content" in delta:
print(delta["content"], end="", flush=True)
Provider Compatibility
All three provider adapters support streaming:
| Provider | Streaming Protocol |
|---|---|
| OpenAI | SSE (native) |
| Anthropic | SSE (native) |
| vLLM | SSE (OpenAI-compatible) |
TokenHub passes the SSE stream through directly from the selected provider. The event format matches the provider's native format.
Failover Behavior
Streaming uses the same model selection and failover logic as non-streaming requests. If the selected model fails to establish a stream, TokenHub falls back through eligible models in scored order.
However, once streaming has begun (first bytes sent to the client), failover is not possible. If the provider disconnects mid-stream, the stream ends with an error event.
Limitations
- Streaming is only available on /v1/chat, not /v1/plan
- Output format validation (output_format.schema) is not applied to streaming responses
- Cost estimation in streaming responses may be less accurate since token counts are not known until the stream completes
- When Temporal workflows are enabled, streaming bypasses Temporal and uses direct engine dispatch
In-Band Directives
TokenHub supports embedding routing directives directly in message content. This allows clients to override routing policy without changing the request structure, which is useful when working through intermediary systems that pass messages through unchanged.
Single-Line Directive
Embed a directive anywhere in a message's content using the @@tokenhub prefix:
@@tokenhub mode=cheap budget=0.01 latency=5000 min_weight=5
Example in a full request:
{
"request": {
"messages": [
{
"role": "user",
"content": "@@tokenhub mode=cheap budget=0.005\nSummarize this document..."
}
]
}
}
Block Directive
For complex directives (especially those containing JSON schemas), use the block format:
@@tokenhub
mode=high_confidence
budget=0.10
latency=30000
min_weight=8
output_schema={"type":"object","properties":{"answer":{"type":"string"},"confidence":{"type":"number"}}}
@@end
The block starts with @@tokenhub on its own line and ends with @@end.
Supported Keys
| Key | Type | Maps To | Description |
|---|---|---|---|
mode | string | policy.mode | Routing mode (cheap, normal, high_confidence, planning, adversarial) |
budget | float | policy.max_budget_usd | Maximum cost in USD |
latency | int | policy.max_latency_ms | Maximum latency in milliseconds |
min_weight | int | policy.min_weight | Minimum model capability weight |
output_schema | JSON | request.output_schema | JSON Schema for structured output |
Processing Rules
- Scanning: TokenHub scans all messages for directives. The last directive found takes precedence.
- Stripping: Directives are removed from message content before forwarding to the provider. The LLM never sees @@tokenhub text.
- Override: Directive values override both server defaults and request-level policy fields.
- Partial override: You can set only the fields you want to override. Unspecified fields retain their values from the request policy or server defaults.
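As an illustration of the scan, strip, and last-wins rules, here is a toy parser for the single-line form. Block directives, and values containing spaces such as JSON schemas, would need the block parser; this is a sketch, not TokenHub's implementation:

```python
import re

DIRECTIVE = re.compile(r"^@@tokenhub[ \t]+(.*)$", re.MULTILINE)

def extract_directives(messages):
    """Scan messages for single-line directives, strip them from content,
    and return (cleaned_messages, merged_overrides). Later directives win."""
    overrides = {}
    cleaned = []
    for msg in messages:
        content = msg["content"]
        for match in DIRECTIVE.finditer(content):
            for pair in match.group(1).split():
                key, _, value = pair.partition("=")
                overrides[key] = value  # last directive found takes precedence
        # Strip the directive so the LLM never sees it.
        content = DIRECTIVE.sub("", content).strip()
        cleaned.append({**msg, "content": content})
    return cleaned, overrides

msgs = [{"role": "user",
         "content": "@@tokenhub mode=cheap budget=0.005\nSummarize this document..."}]
cleaned, ov = extract_directives(msgs)
print(ov)                     # {'mode': 'cheap', 'budget': '0.005'}
print(cleaned[0]["content"])  # Summarize this document...
```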
Examples
Cost-optimize a specific request
@@tokenhub mode=cheap budget=0.001
What is 2 + 2?
Force high-quality response
@@tokenhub mode=high_confidence min_weight=9
Write a detailed analysis of the economic implications of quantum computing.
Structured output via directive
@@tokenhub
output_schema={"type":"object","properties":{"name":{"type":"string"},"population":{"type":"integer"}}}
@@end
What is the most populous city in Japan?
Output Formats
TokenHub can shape provider responses into specific output formats. This is useful for applications that need structured data from LLM responses.
Configuration
Set the output_format field in your chat request:
{
"output_format": {
"type": "json",
"schema": "{\"type\":\"object\",\"properties\":{\"answer\":{\"type\":\"string\"}}}",
"max_tokens": 500,
"strip_think": true
}
}
Format Types
JSON
Validates the response against a JSON Schema. If the provider's output doesn't match the schema, TokenHub returns a validation error.
{
"output_format": {
"type": "json",
"schema": "{\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"},\"value\":{\"type\":\"number\"}}}}"
}
}
The schema is passed as a string (not a nested object) to allow maximum flexibility.
Markdown
Requests the provider to format its response as Markdown:
{
"output_format": {
"type": "markdown"
}
}
Text
Plain text output with optional truncation:
{
"output_format": {
"type": "text",
"max_tokens": 200
}
}
XML
Requests XML-formatted output:
{
"output_format": {
"type": "xml"
}
}
Output Format Fields
| Field | Type | Description |
|---|---|---|
type | string | Output format: json, markdown, text, xml |
schema | string | JSON Schema for validation (only with type: "json") |
max_tokens | int | Maximum output tokens to request from the provider |
strip_think | bool | Remove <think>...</think> reasoning blocks from the response |
Think Block Stripping
Some models (particularly those with chain-of-thought reasoning) wrap their internal reasoning in <think>...</think> tags. Setting strip_think: true removes these blocks from the final response:
Before stripping:
<think>
The user wants to know the capital of France. This is a straightforward factual question.
</think>
The capital of France is Paris.
After stripping:
The capital of France is Paris.
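The behavior can be sketched with a regex (TokenHub's actual handling, for example of unclosed tags, may differ):

```python
import re

# Non-greedy so multiple <think> blocks in one response are each removed.
THINK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text):
    return THINK.sub("", text).strip()

raw = """<think>
The user wants to know the capital of France. This is a straightforward factual question.
</think>
The capital of France is Paris."""
print(strip_think(raw))  # The capital of France is Paris.
```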
JSON Schema Validation
When type: "json" is specified with a schema, TokenHub:
- Sends the request to the provider (with a system message hint to produce JSON)
- Parses the provider's response as JSON
- Validates against the provided JSON Schema
- Returns the validated JSON in the response
If validation fails, the error is returned in the response body with a 502 status.
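For illustration, here is a toy validator covering only the type/properties/required/items subset used in the examples above; a real deployment would rely on a full JSON Schema validator:

```python
import json

def validate_subset(instance, schema):
    """Validate against a tiny subset of JSON Schema. Returns None if valid,
    else a description of the first problem found. Illustrative only."""
    t = schema.get("type")
    checks = {"object": dict, "array": list, "string": str,
              "integer": int, "number": (int, float), "boolean": bool}
    if t and not isinstance(instance, checks[t]):
        return f"expected {t}, got {type(instance).__name__}"
    if t == "object":
        for key in schema.get("required", []):
            if key not in instance:
                return f"missing required property {key!r}"
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                err = validate_subset(instance[key], sub)
                if err:
                    return f"{key}: {err}"
    if t == "array":
        for i, item in enumerate(instance):
            err = validate_subset(item, schema.get("items", {}))
            if err:
                return f"[{i}]: {err}"
    return None

schema = json.loads(
    '{"type":"object","properties":{"answer":{"type":"string"}},"required":["answer"]}')
print(validate_subset({"answer": "Paris"}, schema))  # None
print(validate_subset({"answer": 42}, schema))       # answer: expected string, got int
```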
Authentication
All requests to TokenHub's consumer API (/v1/*) require authentication via API keys.
API Key Format
TokenHub API keys follow this format:
tokenhub_<64 hex characters>
Example: tokenhub_a1b2c3d4e5f6789012345678abcdef0123456789abcdef0123456789abcdef01
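Client code can cheaply reject malformed keys before making a request; a sketch of the documented shape:

```python
import re

# The documented key shape: "tokenhub_" followed by 64 hex characters.
KEY_RE = re.compile(r"^tokenhub_[0-9a-f]{64}$")

def looks_like_key(value):
    return KEY_RE.fullmatch(value) is not None

good = "tokenhub_a1b2c3d4e5f6789012345678abcdef0123456789abcdef0123456789abcdef01"
print(looks_like_key(good))            # True
print(looks_like_key("tokenhub_xyz"))  # False
```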
Using API Keys
Include the key in the Authorization header as a Bearer token:
curl -X POST http://localhost:8080/v1/chat \
-H "Authorization: Bearer tokenhub_a1b2c3d4..." \
-H "Content-Type: application/json" \
-d '{"request": {"messages": [{"role": "user", "content": "Hello"}]}}'
Scopes
Each API key has scopes that control which endpoints it can access:
| Scope | Endpoint | Description |
|---|---|---|
chat | POST /v1/chat | Chat completion requests |
plan | POST /v1/plan | Orchestrated planning requests |
A key with scopes ["chat", "plan"] can access both endpoints. A key with only ["chat"] receives a 403 Forbidden when calling /v1/plan.
If scopes are empty ([]), the key has access to all endpoints.
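The scope rule reduces to a one-liner; a sketch:

```python
def scope_allowed(key_scopes, required):
    # An empty scope list means unrestricted access, per the rule above.
    return not key_scopes or required in key_scopes

print(scope_allowed(["chat", "plan"], "plan"))  # True
print(scope_allowed(["chat"], "plan"))          # False (403 scope not allowed)
print(scope_allowed([], "plan"))                # True (empty = all endpoints)
```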
Error Responses
| Status | Message | Cause |
|---|---|---|
| 401 | "missing or invalid api key" | No Authorization header, invalid format, wrong key, expired, or disabled |
| 403 | "scope not allowed" | Valid key but lacks the required scope |
Key Lifecycle
- Created by an administrator via the admin API or UI
- Distributed to the client application (plaintext shown only once at creation)
- Used by the client for all /v1 requests
- Rotated periodically (manually or on a configured schedule)
- Revoked when no longer needed
Keys can be configured with:
- Expiration: Automatic expiry after a set duration
- Rotation schedule: Recommended rotation period in days
- Enable/disable: Temporarily deactivate without deleting
Security Properties
- Plaintext is never stored: Only a bcrypt hash is persisted
- Shown once: The plaintext key is returned only at creation and rotation
- Provider isolation: Clients authenticate with TokenHub keys. Provider API keys are stored encrypted in the vault and never exposed.
- Validation cache: A 5-minute TTL cache reduces bcrypt overhead without compromising security
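A validation cache of this kind can be sketched as a map from key to a (result, expiry) pair; caching only successful validations avoids confusing a cached negative with a miss. Illustrative only, not TokenHub's implementation:

```python
import time

class TTLCache:
    """Remember the result of an expensive check (e.g. a bcrypt comparison)
    for a fixed TTL, so repeated requests skip the slow path."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (result, expires_at)

    def get(self, key):
        hit = self.entries.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]
        return None  # miss or expired: caller re-runs the full check

    def put(self, key, result):
        self.entries[key] = (result, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=300)
cache.put("tokenhub_abc...", True)
print(cache.get("tokenhub_abc..."))  # True (cached, no bcrypt needed)
print(cache.get("tokenhub_other"))   # None (cache miss)
```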
See API Key Management for the administrator's guide to creating and managing keys.
Administrator Guide Overview
This section covers how to configure, manage, and monitor a TokenHub deployment.
Administration Model
TokenHub uses a three-tier security model:
- Admin token (TOKENHUB_ADMIN_TOKEN): Authenticates access to the admin API (/admin/v1/*) and the admin dashboard. The UI requires the token at login; all admin API calls include it as Authorization: Bearer <token>. Retrieve it with tokenhubctl admin-token.
- Vault password: A separate secret that encrypts provider API keys at rest. Even a valid admin token cannot decrypt the vault — the vault must be explicitly unlocked after each restart (or set TOKENHUB_VAULT_PASSWORD for auto-unlock).
- API keys: Issued to client applications for /v1 endpoint access. Managed via the admin API or UI.
In production, always set TOKENHUB_ADMIN_TOKEN and restrict network access to /admin/* at the firewall, VPN, or reverse proxy level.
Administration Tools
Admin UI
The built-in web dashboard at /admin provides a graphical interface for all admin operations. See Admin UI.
tokenhubctl
A command-line tool for scripting and quick administration. Covers all admin API operations. See tokenhubctl CLI.
curl / Admin API
All operations are available via the REST API at /admin/v1/*. See API Reference.
Admin Endpoints
| Category | Endpoints | Purpose |
|---|---|---|
| Vault | /admin/v1/vault/* | Lock, unlock, rotate vault password |
| Providers | /admin/v1/providers | Register, edit, and manage LLM providers |
| Models | /admin/v1/models | Register, edit, and manage model configurations |
| Discovery | /admin/v1/providers/{id}/discover | Discover models from a provider's API |
| Simulation | /admin/v1/routing/simulate | What-if routing simulation |
| Routing | /admin/v1/routing-config | Set default routing policy |
| API Keys | /admin/v1/apikeys | Create, rotate, revoke client API keys |
| Health | /admin/v1/health | View provider health status |
| Stats | /admin/v1/stats | View aggregated request statistics |
| Logs | /admin/v1/logs | View request logs |
| Audit | /admin/v1/audit | View audit trail |
| Rewards | /admin/v1/rewards | View contextual bandit reward data |
| Engine | /admin/v1/engine/models | View runtime model registry and adapter info |
| TSDB | /admin/v1/tsdb/* | Query time-series metrics |
| Workflows | /admin/v1/workflows | View Temporal workflow executions |
| Events | /admin/v1/events | SSE stream of real-time events |
Sections
- Vault & Credentials — Encrypted credential storage
- Provider Management — Configure LLM providers
- Model Management — Configure model registry
- Routing Configuration — Tune model selection
- API Key Management — Issue and manage client keys
- Monitoring & Observability — Health, metrics, logs, and alerts
- Admin UI — Built-in dashboard
- tokenhubctl CLI — Command-line administration
Vault & Credentials
TokenHub includes an AES-256-GCM encrypted vault for storing provider API keys securely. Provider credentials are encrypted at rest and only decrypted in memory when the vault is unlocked.
Vault password vs. admin token: The vault password is not the same as your admin token. The admin token authenticates HTTP requests to the admin API. The vault password derives the encryption key used to protect stored credentials. Both are required in a production deployment: the admin token to access the API, and the vault password to decrypt provider keys.
How It Works
- An administrator sets a vault password when first configuring TokenHub
- The password is run through Argon2id key derivation (OWASP-recommended parameters) to produce an encryption key
- Provider API keys are encrypted with AES-256-GCM and stored in SQLite
- A random salt is generated per vault instance and persisted alongside the encrypted data
- After server restart, the vault must be unlocked with the same password before provider requests can be made
Vault States
| State | Description |
|---|---|
| Not initialized | First-time setup required — choose a master password |
| Locked | Credentials encrypted; provider requests will fail |
| Unlocked | Credentials decrypted in memory; requests are served normally |
Auto-Unlock (Headless)
Set TOKENHUB_VAULT_PASSWORD to unlock the vault automatically at startup.
This is required for automated/headless deployments where no operator is
present to enter the password interactively.
export TOKENHUB_VAULT_PASSWORD="your-secure-password"
On first boot this also initializes the vault, so no interactive setup is needed.
Operations
Unlock the Vault
Via the admin UI (recommended for first-time setup — the UI asks for the password twice to prevent typos), or via API/CLI:
tokenhubctl vault unlock "your-secure-password"
Or via curl:
curl -X POST http://localhost:8080/admin/v1/vault/unlock \
-H "Content-Type: application/json" \
-d '{"admin_password": "your-secure-password"}'
Response:
{"ok": true}
Lock the Vault
curl -X POST http://localhost:8080/admin/v1/vault/lock
Response:
{"ok": true, "already_locked": false}
Rotate the Vault Password
Re-encrypts all stored credentials with a new password:
curl -X POST http://localhost:8080/admin/v1/vault/rotate \
-H "Content-Type: application/json" \
-d '{
"old_password": "current-password",
"new_password": "new-secure-password"
}'
This operation is atomic — all credentials are re-encrypted in a single transaction.
Auto-Lock
The vault automatically locks after 30 minutes of inactivity. Every successful credential access resets the timer.
When the vault auto-locks:
- In-flight requests that have already retrieved credentials continue normally
- New requests will fail with a provider error until the vault is unlocked again
- An audit log entry is recorded
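The inactivity window behaves like a monotonic-clock deadline that every successful credential access pushes forward. A minimal sketch of that behavior (the class and method names are illustrative, not TokenHub internals):

```python
import time

AUTO_LOCK_SECONDS = 30 * 60  # 30-minute inactivity window

class VaultTimer:
    """Tracks vault inactivity; every credential access resets the deadline."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._deadline = self._now() + AUTO_LOCK_SECONDS

    def touch(self):
        # Called on every successful credential access.
        self._deadline = self._now() + AUTO_LOCK_SECONDS

    def should_lock(self):
        return self._now() >= self._deadline

# Simulate with a fake clock to show the reset behavior.
t = [0.0]
timer = VaultTimer(now=lambda: t[0])
t[0] = 29 * 60          # 29 minutes pass: still unlocked
assert not timer.should_lock()
timer.touch()           # credential access resets the window
t[0] = 58 * 60          # 29 minutes since the reset: still unlocked
assert not timer.should_lock()
t[0] = 59 * 60 + 1      # more than 30 minutes since the reset
assert timer.should_lock()
```

The fake clock makes the reset semantics explicit: only the time since the last access matters, not the time since unlock.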
Credential Storage
When you register a provider with cred_store: "vault", TokenHub stores the API key encrypted in the vault under the key provider:{provider_id}:api_key.
The credential lifecycle:
- Admin provides API key when creating/updating a provider
- Key is encrypted and stored in the vault
- Key is also persisted (encrypted) in the database for recovery after restart
- When the vault is unlocked, the salt and encrypted blob are loaded from the database
- Keys are decrypted only in memory
Security Parameters
| Parameter | Value |
|---|---|
| Encryption | AES-256-GCM |
| Key derivation | Argon2id |
| Argon2id time | 3 iterations |
| Argon2id memory | 64 MB |
| Argon2id threads | 4 |
| Salt | 16 bytes, random per vault |
| Auto-lock timeout | 30 minutes |
Best Practices
- Use a strong vault password: At least 16 characters with mixed case, numbers, and symbols
- Use TOKENHUB_VAULT_PASSWORD for automated deployments so the vault unlocks on restart
- Rotate regularly: Use the rotate endpoint to change the vault password periodically
- Monitor auto-lock: Set up alerts if the vault locks unexpectedly during business hours
- Backup the database: The vault salt and encrypted blob are stored in SQLite. Back up the database file to ensure credential recovery
- Network isolation: Restrict access to vault admin endpoints to trusted networks
Provider Management
Providers are the LLM services that TokenHub routes requests to. TokenHub ships with adapter support for OpenAI, Anthropic, and vLLM (OpenAI-compatible).
Registration Methods
Credentials File (recommended)
The ~/.tokenhub/credentials file is a declarative JSON file processed at
startup. Providers are persisted to the database and API keys are stored in
the vault (when unlocked via TOKENHUB_VAULT_PASSWORD). The file is
idempotent — it can remain in place across restarts.
The file must have 0600 permissions and live outside the source tree.
{
"providers": [
{
"id": "openai",
"type": "openai",
"base_url": "https://api.openai.com",
"api_key": "sk-..."
},
{
"id": "anthropic",
"type": "anthropic",
"base_url": "https://api.anthropic.com",
"api_key": "sk-ant-..."
},
{
"id": "ollama-local",
"type": "openai",
"base_url": "http://localhost:11434"
}
],
"models": [
{
"id": "gpt-4o",
"provider_id": "openai",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.0025,
"output_per_1k": 0.01
}
]
}
| Field | Type | Required | Description |
|---|---|---|---|
id | string | Yes | Unique provider identifier |
type | string | Yes | Provider type: openai, anthropic, or vllm |
base_url | string | Yes | Provider API base URL |
api_key | string | No | API key (stored in vault when available, omit for keyless providers) |
enabled | bool | No | Whether the provider is active (default: true) |
Override the default path with TOKENHUB_CREDENTIALS_FILE.
Admin API / tokenhubctl
Providers can be registered and managed dynamically via the admin API or tokenhubctl at any time after the service starts.
Admin UI
The setup wizard at /admin walks through adding providers interactively.
API Operations
Create or Update a Provider
curl -X POST http://localhost:8080/admin/v1/providers \
-H "Content-Type: application/json" \
-d '{
"id": "openai-prod",
"type": "openai",
"enabled": true,
"base_url": "https://api.openai.com",
"cred_store": "vault",
"api_key": "sk-..."
}'
Or with tokenhubctl:
tokenhubctl provider add '{"id":"openai-prod","type":"openai","base_url":"https://api.openai.com","api_key":"sk-..."}'
| Field | Type | Required | Description |
|---|---|---|---|
id | string | Yes | Unique provider identifier |
type | string | Yes | Provider type: openai, anthropic, or vllm |
enabled | bool | No | Whether the provider is active (default: true) |
base_url | string | Yes | Provider API base URL |
cred_store | string | No | Where to store credentials: vault or none |
api_key | string | No | API key (stored according to cred_store) |
List Providers
curl http://localhost:8080/admin/v1/providers
tokenhubctl provider list
The tokenhubctl provider list command merges providers from both the persistent store and the runtime engine, showing base URLs derived from adapter health endpoints and indicating whether each provider is store-persisted or runtime-only.
API keys are never returned in list responses.
Edit a Provider
Partial updates via PATCH:
curl -X PATCH http://localhost:8080/admin/v1/providers/openai \
-H "Content-Type: application/json" \
-d '{"base_url": "https://api.openai.com", "enabled": true}'
Or:
tokenhubctl provider edit openai '{"base_url":"https://api.openai.com","enabled":true}'
Patchable fields: type, base_url, enabled, api_key, cred_store.
Delete a Provider
curl -X DELETE http://localhost:8080/admin/v1/providers/openai-staging
tokenhubctl provider delete openai-staging
Discover Models
Query a provider's API to discover available models:
curl http://localhost:8080/admin/v1/providers/openai/discover
tokenhubctl provider discover openai
This calls the provider's /v1/models endpoint (using the stored API key from the vault if available) and returns the list of models with a registered flag indicating which are already configured in TokenHub.
Credential Storage Options
cred_store | Description |
|---|---|
vault | API key is encrypted and stored in the vault (default when api_key is provided) |
none | No credentials needed (e.g., local vLLM/Ollama without auth) |
When using vault, the API key is encrypted with AES-256-GCM and only available when the vault is unlocked.
Supported Provider Types
OpenAI (openai)
- API endpoint: /v1/chat/completions
- Health probe: GET /v1/models
- Streaming: SSE (native)
- Authentication: Authorization: Bearer <key>
Anthropic (anthropic)
- API endpoint: /v1/messages
- Health probe: GET /v1/messages (405 response = healthy)
- Streaming: SSE (native)
- Authentication: x-api-key: <key>, anthropic-version: 2023-06-01
vLLM (vllm)
- API endpoint: /v1/chat/completions (OpenAI-compatible)
- Health probe: GET /health
- Streaming: SSE (OpenAI-compatible)
- Authentication: None (or custom header if configured)
- Multi-endpoint: Supports multiple endpoints with round-robin load balancing
Audit Trail
All provider mutations are logged in the audit trail:
- provider.upsert — Provider created or updated
- provider.patch — Provider partially updated
- provider.delete — Provider removed
Model Management
Models are the LLM model definitions that TokenHub uses for routing decisions. Each model is associated with a provider and has properties that affect routing: capability weight, context window size, and pricing.
Default Models
TokenHub registers these default models at startup:
| Model ID | Provider | Weight | Context | Input $/1K | Output $/1K |
|---|---|---|---|---|---|
gpt-4 | openai | 8 | 128,000 | $0.010 | $0.030 |
gpt-3.5-turbo | openai | 3 | 16,385 | $0.0005 | $0.0015 |
claude-opus | anthropic | 10 | 200,000 | $0.015 | $0.075 |
claude-sonnet | anthropic | 7 | 200,000 | $0.003 | $0.015 |
Defaults are overridden if persisted models exist in the database or are registered via the credentials file.
API Operations
Create or Update a Model
curl -X POST http://localhost:8080/admin/v1/models \
-H "Content-Type: application/json" \
-d '{
"id": "gpt-4-turbo",
"provider_id": "openai",
"weight": 7,
"max_context_tokens": 128000,
"input_per_1k": 0.01,
"output_per_1k": 0.03,
"enabled": true
}'
Or with tokenhubctl:
tokenhubctl model add '{"id":"gpt-4-turbo","provider_id":"openai","weight":7,"max_context_tokens":128000,"input_per_1k":0.01,"output_per_1k":0.03,"enabled":true}'
| Field | Type | Required | Description |
|---|---|---|---|
id | string | Yes | Model identifier (must match provider's model name) |
provider_id | string | Yes | ID of the registered provider |
weight | int | Yes | Capability weight (0-10); higher = more capable |
max_context_tokens | int | Yes | Maximum context window in tokens |
input_per_1k | float | Yes | Cost per 1,000 input tokens in USD |
output_per_1k | float | Yes | Cost per 1,000 output tokens in USD |
enabled | bool | Yes | Whether the model is available for routing |
Model IDs can contain slashes (e.g., Qwen/Qwen2.5-Coder-32B-Instruct, nvidia/openai/gpt-oss-20b). The API handles them correctly.
List Models
curl http://localhost:8080/admin/v1/models
tokenhubctl model list
The tokenhubctl model list command merges models from both the persistent store and the runtime engine, so models registered via environment variables or the credentials file are also shown.
Patch a Model
Update individual fields without resending the full configuration:
curl -X PATCH http://localhost:8080/admin/v1/models/gpt-4o \
-H "Content-Type: application/json" \
-d '{
"weight": 9,
"enabled": true,
"input_per_1k": 0.012
}'
Or:
tokenhubctl model edit gpt-4o '{"weight":9}'
Patchable fields: weight, enabled, input_per_1k, output_per_1k, max_context_tokens.
Runtime-only models (those registered via env vars or credentials file but not in the store) can also be patched. The first patch creates a store record seeded from the engine's runtime data.
Enable / Disable a Model
Quick shortcuts via tokenhubctl:
tokenhubctl model enable gpt-4o
tokenhubctl model disable gpt-4o-legacy
Delete a Model
curl -X DELETE http://localhost:8080/admin/v1/models/gpt-4-legacy
tokenhubctl model delete gpt-4-legacy
Weight Guidelines
The model weight is the primary indicator of model capability used in routing decisions:
| Weight | Intended For |
|---|---|
| 1-3 | Simple tasks, low cost (e.g., GPT-3.5 Turbo) |
| 4-6 | General purpose (e.g., GPT-4 Turbo, Claude Sonnet) |
| 7-8 | Complex reasoning (e.g., GPT-4, Claude Opus) |
| 9-10 | Highest capability (e.g., next-gen frontier models) |
Different routing modes weight the capability score differently:
- cheap mode barely considers weight (0.1 factor)
- high_confidence and planning modes heavily favor higher weights (0.6-0.7 factor)
- normal mode balances weight equally with cost, latency, and reliability (0.25 each)
Context Window
The max_context_tokens field tells the router whether a model can handle a given request size. The router applies a 15% headroom buffer — a model with 128,000 tokens can handle requests estimated up to ~108,800 tokens.
Token estimation uses estimated_input_tokens from the request if provided, otherwise falls back to a characters / 4 heuristic.
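Combining the headroom buffer with the token estimation fallback, the eligibility check can be sketched as follows (function names are illustrative; the 15% and characters/4 constants come from the text above):

```python
HEADROOM = 0.15  # the router reserves 15% of the context window

def estimated_tokens(request: dict) -> int:
    """Use the client's estimate if present, else the characters / 4 heuristic."""
    if "estimated_input_tokens" in request:
        return request["estimated_input_tokens"]
    text = "".join(m.get("content", "") for m in request.get("messages", []))
    return len(text) // 4

def fits_context(max_context_tokens: int, request: dict) -> bool:
    usable = int(max_context_tokens * (1 - HEADROOM))
    return estimated_tokens(request) <= usable

# A 128,000-token model serves requests up to ~108,800 estimated tokens.
assert fits_context(128_000, {"estimated_input_tokens": 108_000})
assert not fits_context(128_000, {"estimated_input_tokens": 120_000})
# Fallback path: 4,000 characters estimate to ~1,000 tokens.
assert fits_context(128_000, {"messages": [{"content": "x" * 4000}]})
```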
Pricing
Model pricing is used for:
- Cost estimation: Returned in the response as estimated_cost_usd
- Budget filtering: Models exceeding the request's max_budget_usd are excluded
- Cost scoring: In routing modes that consider cost (especially cheap mode)
Keep pricing up to date as providers change their rates.
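As a concrete example, the per-1K rates translate into an estimated cost like this (token counts here are hypothetical; the arithmetic mirrors the pricing fields above):

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_per_1k: float, output_per_1k: float) -> float:
    """Estimated request cost from per-1,000-token rates."""
    return (input_tokens / 1000) * input_per_1k + (output_tokens / 1000) * output_per_1k

# gpt-4-style pricing: $0.01 per 1K input tokens, $0.03 per 1K output tokens.
cost = estimate_cost_usd(2_000, 500, 0.01, 0.03)
assert abs(cost - 0.035) < 1e-9  # $0.020 input + $0.015 output
```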
Audit Trail
Model mutations are logged:
- model.upsert — Model created or updated
- model.patch — Model partially updated
- model.delete — Model removed
Routing Configuration
TokenHub's routing engine uses a multi-objective scoring function to select the best model for each request. Administrators can configure the default routing behavior that applies when clients don't specify a policy.
Default Routing Settings
View Current Defaults
curl http://localhost:8080/admin/v1/routing-config
Response:
{
"default_mode": "normal",
"default_max_budget_usd": 0.05,
"default_max_latency_ms": 20000
}
Update Defaults
curl -X PUT http://localhost:8080/admin/v1/routing-config \
-H "Content-Type: application/json" \
-d '{
"default_mode": "normal",
"default_max_budget_usd": 0.10,
"default_max_latency_ms": 30000
}'
| Field | Type | Range | Description |
|---|---|---|---|
default_mode | string | See below | Default routing mode |
default_max_budget_usd | float | 0-100 | Default cost ceiling per request |
default_max_latency_ms | int | 0-300000 | Default latency ceiling |
Changes take effect immediately for new requests and are persisted to the database.
Routing Modes
Each mode applies different weights to the four scoring objectives:
| Mode | Cost | Latency | Failure Rate | Capability | Use Case |
|---|---|---|---|---|---|
cheap | 0.7 | 0.1 | 0.1 | 0.1 | Minimize costs for simple tasks |
normal | 0.25 | 0.25 | 0.25 | 0.25 | Balanced operation |
high_confidence | 0.05 | 0.1 | 0.15 | 0.7 | Complex tasks needing strong models |
planning | 0.1 | 0.1 | 0.2 | 0.6 | Multi-step reasoning tasks |
adversarial | 0.1 | 0.1 | 0.2 | 0.6 | Adversarial orchestration |
thompson | — | — | — | — | Adaptive RL-based selection |
How Scoring Works
For modes other than thompson, the scoring formula is:
score = (cost_norm × w_cost) + (latency_norm × w_latency) + (failure_norm × w_failure) - (weight × w_capability)
Where:
- cost_norm: Estimated cost normalized to 0-1 range
- latency_norm: Average latency normalized to 0-1 range
- failure_norm: Error rate from health tracker
- weight: Model capability weight (0-10)
- w_*: Mode-specific weights from the table above
Lower score = better model. Models are sorted by score and tried in order.
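The scoring formula can be written directly in code. The sketch below assumes the normalized inputs are already computed; the candidate models and their values are made up for illustration:

```python
def score(model: dict, w: dict) -> float:
    """Lower is better. Mirrors the scoring formula above."""
    return (model["cost_norm"] * w["cost"]
            + model["latency_norm"] * w["latency"]
            + model["failure_norm"] * w["failure"]
            - model["weight"] * w["capability"])

NORMAL = {"cost": 0.25, "latency": 0.25, "failure": 0.25, "capability": 0.25}
CHEAP = {"cost": 0.7, "latency": 0.1, "failure": 0.1, "capability": 0.1}

# Hypothetical candidates: an expensive capable model vs. a cheap weaker one.
strong = {"cost_norm": 0.9, "latency_norm": 0.5, "failure_norm": 0.1, "weight": 8}
budget = {"cost_norm": 0.05, "latency_norm": 0.4, "failure_norm": 0.2, "weight": 3}

# cheap mode prefers the low-cost model; normal mode prefers the capable one.
assert score(budget, CHEAP) < score(strong, CHEAP)
assert score(strong, NORMAL) < score(budget, NORMAL)
```

The same candidate pool can thus rank differently per request, purely from the mode's weight vector.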
Thompson Sampling
The thompson mode uses a contextual bandit approach:
- Each (model, token_bucket) pair maintains Beta distribution parameters (alpha, beta)
- For each request, a reward value is sampled from each model's Beta distribution
- Models are sorted by sampled reward (highest first)
- Parameters are updated periodically from historical reward data
This approach automatically adapts to changing model performance over time.
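The sampling step can be sketched with the standard library's Beta sampler (the arm parameters below are invented; real parameters come from TokenHub's historical reward data):

```python
import random

def thompson_rank(arms: dict, rng=random) -> list:
    """arms: {model_id: (alpha, beta)} Beta parameters per (model, token_bucket).
    Sample a reward from each Beta posterior and rank models, best first."""
    sampled = {m: rng.betavariate(a, b) for m, (a, b) in arms.items()}
    return sorted(sampled, key=sampled.get, reverse=True)

random.seed(7)
arms = {
    "gpt-4": (80, 20),          # ~80% historical reward
    "gpt-3.5-turbo": (40, 60),  # ~40% historical reward
}
# Over many draws the higher-reward arm usually ranks first, but the
# posterior spread means the weaker arm still gets explored occasionally.
firsts = [thompson_rank(arms)[0] for _ in range(1000)]
assert firsts.count("gpt-4") > 900
```

This is why the policy adapts: if a model's observed rewards drift, its Beta parameters shift and its sampled rank follows.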
Model Eligibility Filtering
Before scoring, the router filters models:
- Enabled: Model must be enabled
- Minimum weight: Must meet the request's min_weight threshold
- Context capacity: Must have enough context window (with 15% headroom)
- Provider health: Provider must not be in the "down" state
- Budget: Estimated cost must be within max_budget_usd
If no models pass filtering, the request fails with a 502 error.
Escalation and Failover
When a provider call fails, the router uses the error classification to decide what to do:
| Error Class | Action |
|---|---|
context_overflow | Find a model with a larger context window |
rate_limited | Skip to the next provider; honor Retry-After header |
transient (5xx) | Retry with exponential backoff (100ms base, 2 retries) |
fatal (4xx) | Try the next model in scored order |
The router tries up to 5 models before giving up.
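The transient-retry branch (100ms base, 2 retries, exponential backoff) can be sketched as follows; the exception class and function names are illustrative stand-ins for the router's error classification:

```python
import time

BASE_DELAY = 0.1   # 100ms base
MAX_RETRIES = 2    # retries after the initial attempt

class TransientError(Exception):
    """Stand-in for an error classified as transient (5xx)."""

def call_with_backoff(call, sleep=time.sleep):
    """Retry transient failures with exponential backoff: 100ms, then 200ms."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return call()
        except TransientError:
            if attempt == MAX_RETRIES:
                raise  # out of retries: surface the error to the router
            sleep(BASE_DELAY * (2 ** attempt))

# Demonstrate: fail twice, then succeed, recording the sleep delays.
delays, attempts = [], {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("502")
    return "ok"

assert call_with_backoff(flaky, sleep=delays.append) == "ok"
assert delays == [0.1, 0.2]
```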
Runtime Model Registry
View the current in-memory model registry and registered adapters:
curl http://localhost:8080/admin/v1/engine/models
Response:
{
"models": [
{
"id": "gpt-4",
"provider_id": "openai",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.01,
"output_per_1k": 0.03,
"enabled": true
}
],
"adapters": ["openai", "anthropic", "vllm"]
}
Audit Trail
Routing configuration changes are logged as routing-config.update in the audit trail.
API Key Management
TokenHub issues its own API keys to client applications. Provider API keys are escrowed in the vault — clients never see them. This provides a clean separation between client authentication and provider credentials.
Key Properties
| Property | Description |
|---|---|
| ID | 16-character hex identifier |
| Prefix | First 8 characters of the key for identification |
| Name | Human-readable label |
| Scopes | JSON array of allowed endpoints (chat, plan) |
| Rotation days | Recommended rotation period (0 = manual only) |
| Expiration | Optional automatic expiry |
| Enabled | Active/inactive toggle |
Operations
Create a Key
curl -X POST http://localhost:8080/admin/v1/apikeys \
-H "Content-Type: application/json" \
-d '{
"name": "production-backend",
"scopes": "[\"chat\",\"plan\"]",
"rotation_days": 90,
"expires_in": "2160h"
}'
| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Human-readable name for the key |
scopes | string | No | JSON array of scopes (default: ["chat","plan"]) |
rotation_days | int | No | Recommended rotation period in days (default: 0) |
expires_in | string | No | Go duration for expiry (e.g., 720h for 30 days) |
Response:
{
"ok": true,
"key": "tokenhub_a1b2c3d4e5f6789012345678abcdef0123456789abcdef0123456789abcdef01",
"id": "a1b2c3d4e5f6g7h8",
"prefix": "tokenhub_a1b2c3d4",
"warning": "Store this key securely. It will not be shown again."
}
Important: The plaintext key is returned only at creation time. Store it securely before closing the response.
List Keys
curl http://localhost:8080/admin/v1/apikeys
Response:
[
{
"id": "a1b2c3d4e5f6g7h8",
"key_prefix": "tokenhub_a1b2c3d4",
"name": "production-backend",
"scopes": "[\"chat\",\"plan\"]",
"created_at": "2026-02-16T10:00:00Z",
"last_used_at": "2026-02-16T12:34:56Z",
"expires_at": "2026-05-16T10:00:00Z",
"rotation_days": 90,
"enabled": true
}
]
Plaintext keys are never shown in list responses.
Rotate a Key
Generate a new key value while keeping the same ID and configuration:
curl -X POST http://localhost:8080/admin/v1/apikeys/a1b2c3d4e5f6g7h8/rotate
Response:
{
"ok": true,
"key": "tokenhub_<new-64-hex-chars>",
"warning": "Store this key securely. It will not be shown again."
}
The old key immediately becomes invalid. Distribute the new key to all clients before rotating.
Update a Key
Modify key metadata without changing the key value:
curl -X PATCH http://localhost:8080/admin/v1/apikeys/a1b2c3d4e5f6g7h8 \
-H "Content-Type: application/json" \
-d '{
"name": "production-backend-v2",
"scopes": "[\"chat\"]",
"enabled": true,
"rotation_days": 60
}'
All fields are optional — only specified fields are updated.
Revoke (Delete) a Key
curl -X DELETE http://localhost:8080/admin/v1/apikeys/a1b2c3d4e5f6g7h8
This permanently removes the key. It cannot be recovered.
Security Details
Storage
- Keys are hashed with bcrypt (cost factor 10) before storage
- To reduce bcrypt overhead per-request, validated keys are cached for 5 minutes
- The SHA-256 digest of the plaintext is bcrypt-hashed (allowing keys longer than bcrypt's 72-byte limit)
Validation Flow
- Extract Bearer tokenhub_... from the Authorization header
- Extract the key prefix (first 8 chars after tokenhub_)
- Check the validation cache (5-minute TTL)
- If not cached: load record by prefix, bcrypt-verify, check enabled + expiry
- Update last_used_at timestamp
- Verify the key's scopes include the requested endpoint
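The prefix-lookup-plus-cache flow can be sketched as below. For the sketch, the expensive bcrypt verification is replaced by a plain SHA-256 comparison (in TokenHub it is the SHA-256 digest that gets bcrypt-hashed); class names and record layout are illustrative:

```python
import hashlib
import time

CACHE_TTL = 300  # 5-minute validation cache

class KeyValidator:
    """Illustrative sketch of prefix lookup + TTL cache, not TokenHub source."""

    def __init__(self, records: dict, now=time.monotonic):
        self._records = records  # prefix -> stored SHA-256 digest (stand-in for bcrypt hash)
        self._cache = {}         # plaintext key -> cache expiry time
        self._now = now

    def validate(self, key: str) -> bool:
        if self._cache.get(key, 0) > self._now():
            return True                      # cache hit: skip the slow verify
        prefix = key[:len("tokenhub_") + 8]  # "tokenhub_" + first 8 chars
        stored = self._records.get(prefix)
        digest = hashlib.sha256(key.encode()).hexdigest()
        if stored is not None and stored == digest:
            self._cache[key] = self._now() + CACHE_TTL
            return True
        return False

key = "tokenhub_a1b2c3d4" + "f" * 56
records = {key[:17]: hashlib.sha256(key.encode()).hexdigest()}
v = KeyValidator(records)
assert v.validate(key)                              # slow path, populates cache
assert v.validate(key)                              # fast path via the TTL cache
assert not v.validate("tokenhub_deadbeef" + "0" * 56)  # unknown prefix
```

Hashing the SHA-256 digest rather than the raw key also sidesteps bcrypt's 72-byte input limit, as noted under Storage.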
Scopes
| Scope | Protects |
|---|---|
chat | POST /v1/chat |
plan | POST /v1/plan |
An empty scopes array [] grants access to all endpoints.
Audit Trail
All key management operations are logged:
- apikey.create — New key created
- apikey.rotate — Key rotated (new value generated)
- apikey.update — Key metadata changed
- apikey.revoke — Key deleted
Best Practices
- Name keys descriptively: Use names like staging-backend, prod-api-v2, data-pipeline
- Use minimal scopes: If a client only needs chat, don't grant plan access
- Set rotation schedules: Configure rotation_days as a reminder to rotate
- Set expiration for temporary keys: Use expires_in for keys issued to contractors or experiments
- Monitor last_used_at: Keys not used for extended periods may be candidates for revocation
- Rotate after incidents: If a key may have been compromised, rotate immediately
Monitoring & Observability
TokenHub provides multiple layers of observability: health tracking, Prometheus metrics, time-series data, request logs, audit logs, reward logs, and real-time SSE events.
Health Endpoint
curl http://localhost:8080/healthz
| Status | Meaning |
|---|---|
| 200 | System is healthy, adapters and models are registered |
| 503 | No adapters or no models are registered |
Response:
{"status": "ok", "adapters": 2, "models": 6}
Provider Health
View per-provider health status:
curl http://localhost:8080/admin/v1/health
Response:
{
"providers": [
{
"provider_id": "openai",
"state": "healthy",
"total_requests": 1234,
"total_errors": 5,
"consec_errors": 0,
"avg_latency_ms": 456.7,
"last_error": "",
"last_success_at": "2026-02-16T12:34:56Z",
"cooldown_until": "0001-01-01T00:00:00Z"
}
]
}
Health States
| State | Consecutive Errors | Behavior |
|---|---|---|
| Healthy | 0-1 | Normal routing |
| Degraded | 2-4 | Still routed but penalized in scoring |
| Down | 5+ | Excluded from routing; 30-second cooldown |
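The state transitions in the table above reduce to a simple threshold function on the consecutive-error counter:

```python
COOLDOWN_SECONDS = 30  # applied when a provider enters the "down" state

def health_state(consec_errors: int) -> str:
    """Map consecutive errors to a routing state, per the table above."""
    if consec_errors >= 5:
        return "down"      # excluded from routing, 30-second cooldown
    if consec_errors >= 2:
        return "degraded"  # still routed, but penalized in scoring
    return "healthy"

assert health_state(0) == "healthy"
assert health_state(1) == "healthy"
assert health_state(3) == "degraded"
assert health_state(5) == "down"
```

A single success resets the counter, so a provider recovers to "healthy" as soon as requests start succeeding again.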
Active Health Probing
TokenHub actively probes provider health endpoints in the background:
| Provider | Health Endpoint | Success Criteria |
|---|---|---|
| OpenAI | GET /v1/models | 2xx response |
| Anthropic | GET /v1/messages | 2xx or 405 response |
| vLLM | GET /health | 2xx response |
Probes run every 30 seconds with a 10-second timeout.
Prometheus Metrics
Expose metrics at:
curl http://localhost:8080/metrics
Available Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
tokenhub_requests_total | counter | mode, model, provider, status | Total requests processed |
tokenhub_request_latency_ms | histogram | mode, model, provider | Request latency distribution |
tokenhub_cost_usd_total | counter | model, provider | Cumulative estimated cost |
Prometheus Configuration
# prometheus.yml
scrape_configs:
- job_name: tokenhub
scrape_interval: 15s
static_configs:
- targets: ['tokenhub:8080']
Example Queries
# Request rate by model
rate(tokenhub_requests_total[5m])
# P95 latency
histogram_quantile(0.95, rate(tokenhub_request_latency_ms_bucket[5m]))
# Cost per hour by provider
rate(tokenhub_cost_usd_total[1h]) * 3600
# Error rate
sum(rate(tokenhub_requests_total{status="error"}[5m])) /
sum(rate(tokenhub_requests_total[5m]))
Time-Series Database (TSDB)
TokenHub includes a lightweight SQLite-backed TSDB for historical metrics with querying and downsampling.
Query Metrics
curl "http://localhost:8080/admin/v1/tsdb/query?metric=latency&model_id=gpt-4&start=2026-02-16T00:00:00Z&end=2026-02-16T23:59:59Z&step_ms=60000"
| Parameter | Required | Description |
|---|---|---|
metric | Yes | Metric name (latency or cost) |
model_id | No | Filter by model |
provider_id | No | Filter by provider |
start | No | Start time (RFC3339) |
end | No | End time (RFC3339) |
step_ms | No | Downsample bucket in milliseconds |
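The step_ms parameter implies fixed-width bucketing of raw samples. A sketch of what that downsampling looks like (averaging per bucket is an assumption for illustration, not a statement about TokenHub's aggregation function):

```python
from collections import defaultdict

def downsample(points, step_ms: int) -> dict:
    """Average (timestamp_ms, value) samples into fixed step_ms buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each sample to the start of its bucket.
        buckets[(ts // step_ms) * step_ms].append(value)
    return {ts: sum(vs) / len(vs) for ts, vs in sorted(buckets.items())}

# Two samples in the first minute, one in the second (timestamps in ms).
points = [(0, 100.0), (30_000, 140.0), (61_000, 200.0)]
result = downsample(points, step_ms=60_000)
assert result == {0: 120.0, 60_000: 200.0}
```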
List Available Metrics
curl http://localhost:8080/admin/v1/tsdb/metrics
Configure Retention
curl -X PUT http://localhost:8080/admin/v1/tsdb/retention \
-H "Content-Type: application/json" \
-d '{"retention_days": 14}'
Default retention is 7 days. Old data is automatically pruned hourly.
Manual Prune
curl -X POST http://localhost:8080/admin/v1/tsdb/prune
Request Logs
View paginated request history:
curl "http://localhost:8080/admin/v1/logs?limit=50&offset=0"
Each entry contains:
- Timestamp, request ID
- Model ID, provider ID, routing mode
- Estimated cost, latency
- HTTP status code, error class (if failed)
Audit Logs
View admin action history:
curl "http://localhost:8080/admin/v1/audit?limit=50&offset=0"
Logged actions:
- vault.lock, vault.unlock, vault.rotate
- provider.upsert, provider.delete
- model.upsert, model.patch, model.delete
- apikey.create, apikey.rotate, apikey.update, apikey.revoke
- routing-config.update
Reward Logs
View contextual bandit reward data for RL-based routing analysis:
curl "http://localhost:8080/admin/v1/rewards?limit=50&offset=0"
Each entry contains: request ID, mode, model, provider, token count, token bucket (small/medium/large), latency budget, actual latency, cost, success flag, error class, and computed reward.
Aggregated Statistics
curl http://localhost:8080/admin/v1/stats
Returns global aggregates plus breakdowns by model and by provider.
Server-Sent Events (SSE)
Subscribe to real-time events:
curl -N http://localhost:8080/admin/v1/events
Event types:
| Event | Fields | When |
|---|---|---|
route_success | model_id, provider_id, latency_ms, cost_usd, reason | Request completed successfully |
route_error | latency_ms, error_class, error_msg | Request failed |
Example:
data: {"type":"route_success","model_id":"gpt-4","provider_id":"openai","latency_ms":456.7,"cost_usd":0.023,"reason":"routed-weight-8"}
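A consumer of this stream only needs to pull out the data: lines and decode the JSON payloads. A minimal parsing sketch (a production client should also handle multi-line data fields, comments, and reconnection):

```python
import json

def parse_sse(lines):
    """Yield decoded event payloads from data: lines of an SSE stream."""
    for line in lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Sample stream matching the event shapes documented above.
stream = [
    'data: {"type":"route_success","model_id":"gpt-4","latency_ms":456.7}',
    '',  # blank line terminates an SSE event
    'data: {"type":"route_error","error_class":"rate_limited"}',
]
events = list(parse_sse(stream))
assert events[0]["model_id"] == "gpt-4"
assert events[1]["type"] == "route_error"
```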
Recommended Alerting Rules
| Alert | Condition | Severity |
|---|---|---|
| High error rate | Error rate > 5% over 5 minutes | Warning |
| Provider down | Provider in "down" state > 2 minutes | Critical |
| High latency | P95 latency > 10 seconds | Warning |
| Cost spike | Hourly cost > 2x 7-day average | Warning |
| Vault locked | Vault locked during business hours | Critical |
| No providers | Adapter count = 0 | Critical |
Admin UI
TokenHub includes a built-in single-page admin dashboard accessible at /admin. The UI is embedded in the binary — no separate frontend build or deployment is needed.
Accessing the UI
Navigate to:
http://localhost:8080/admin
The root URL (http://localhost:8080/) automatically redirects to /admin/.
Authentication
When TOKENHUB_ADMIN_TOKEN is set, the dashboard displays a full-screen Admin Authentication modal on first visit. Paste your admin token and press Authenticate (or Enter). The token is verified against the API before the dashboard loads; an invalid token shows an inline error.
Once authenticated, the token is stored in sessionStorage (cleared when the browser tab closes). A Sign Out button in the header clears the session and re-opens the authentication modal.
To retrieve the admin token:
tokenhubctl admin-token
Cache Busting
The admin HTML is served with Cache-Control: no-cache, must-revalidate and an ETag derived from the content hash. Static assets under /_assets/ are served with immutable cache headers and versioned URLs (?v=<hash>), ensuring browsers always get fresh assets after a rebuild without manual cache clearing.
Dashboard Panels
Vault Controls
The vault panel adapts to three states:
- First-Time Setup: When the vault has never been initialized, the UI displays a prompt to choose a master password (minimum 8 characters) with a confirmation field. Press Enter in the confirmation field or click Initialize Vault to complete setup.
- Locked: When the vault has been initialized but is locked, the UI shows a password input. Press Enter or click Unlock to unlock.
- Unlocked: Shows the unlocked status with a Lock button.
Note: The vault password encrypts your stored provider API keys. It is distinct from your admin token, which authenticates access to the admin API. You need both: the admin token to access the dashboard, and the vault password to decrypt stored credentials.
Provider Management
Full CRUD interface for providers:
- Setup Wizard: Multi-step guided onboarding for new providers — select type (OpenAI/Anthropic/vLLM), enter base URL and API key, test the connection, then discover and register available models.
- Provider Table: Shows all providers from both the persistent store and runtime engine (env vars, credentials file). Runtime-only providers are indicated with a badge. Base URLs are derived from adapter health endpoints when not stored.
- Edit Modal: Click "Edit" on any provider to change type, base URL, API key, or enabled state.
- Discover: Query a provider's API to find available models and register them.
- Delete: Remove a provider from the store.
Model Management
Full CRUD interface for models:
- Add Model Form: Create a new model with provider, weight, context window, and pricing.
- Model Table: Shows all models from both the store and engine, with their provider, weight, context, pricing, and enabled state.
- Edit Modal: Click "Edit" on any model to change weight, max context tokens, pricing, or enabled state.
- Weight Slider: Quick inline weight adjustment (0-10).
- Enable/Disable Toggle: Click the status icon to toggle a model.
- Delete: Remove a model from the store and engine.
Model Selection Graph
An interactive directed acyclic graph (DAG) showing the relationship between providers and models. Built with Cytoscape.js, it is populated on page load with all known providers and models and updates in real time as routing events arrive.
- Provider nodes (colored by health state)
- Model nodes (sized by weight)
- Edges colored by latency: green (<1s), yellow (1-3s), red (>3s)
- Edge thickness based on request volume
- Node size and border based on throughput and latency
Cost and Latency Charts
Multi-series D3.js line charts showing cost and latency trends over time:
- Per-model breakdown
- Configurable time window
- Hover for exact values
What-If Simulator
Test routing decisions without sending a live request:
- Select routing mode, token count, max budget, min weight, and model hint
- See the winning model, eligible candidates, and the routing reason
- Useful for understanding how parameter changes affect model selection
SSE Decision Feed
Live event stream showing every routing decision in real time:
- Model, provider, latency, cost, and reason for each event
- Error events with error classification
- Auto-scrolling event list
Routing Configuration
Set server-wide routing defaults:
- Default mode selector (cheap, normal, high_confidence, planning, adversarial)
- Budget input (USD)
- Latency input (milliseconds)
- Save button with validation
Provider Health
Real-time provider health display:
- State badges: Healthy (green), Degraded (yellow), Down (red)
- Consecutive error count
- Last success timestamp
- Average latency
API Keys
Key management interface:
- Create new keys (name, scopes, rotation, expiry)
- One-time key display modal with copy button
- Rotate keys (with one-time new key display)
- Enable/disable toggle
- Revoke (delete) keys
- Table showing: name, prefix, scopes, created, last used, expires, rotation days, status
Request Log
Paginated request history:
- Model, provider, mode columns
- Latency, cost, status code
- Error class (for failed requests)
- Pagination controls
Audit Log
Paginated audit trail viewer:
- Action type filter
- Timestamp, action, resource ID
- Request ID for correlation
Model Leaderboard
A ranked table of models by performance:
- Success rate
- Average latency
- Total cost
- Request count
Rewards
Contextual bandit reward data for Thompson Sampling analysis.
Workflows (Temporal)
When Temporal is enabled, shows workflow execution history:
- Workflow ID, type, status
- Start time, duration
- Status badges: Running (blue), Completed (green), Failed (red)
- Click to expand activity history
Static Assets
Static assets (Cytoscape.js, D3.js) are served from /_assets/ to avoid conflicts with the /admin/v1 API prefix. All assets are embedded in the binary via Go's embed package and served with immutable cache headers.
Customization
The admin UI is a single index.html file located at web/index.html in the source tree. To customize:
- Edit web/index.html
- Rebuild the binary (make build) or Docker image (make package)
- The updated UI is embedded automatically with fresh cache-busting hashes
tokenhubctl CLI
tokenhubctl is the command-line interface for managing TokenHub. It wraps every admin API endpoint into a convenient, scriptable tool.
Installation
make install # Builds natively and installs to ~/.local/bin
Or build inside the Docker builder container:
make build # Produces bin/tokenhub and bin/tokenhubctl
Configuration
| Variable | Default | Description |
|---|---|---|
| TOKENHUB_URL | http://localhost:8080 | TokenHub server URL |
| TOKENHUB_ADMIN_TOKEN | — | Bearer token for admin endpoints (see admin-token command) |
export TOKENHUB_URL="http://tokenhub.internal:8080"
export TOKENHUB_ADMIN_TOKEN="$(tokenhubctl admin-token)"
Command Reference
General
tokenhubctl admin-token # Print the admin token (env, file, or Docker)
tokenhubctl status # Server info, health, vault state
tokenhubctl health # Provider health table
tokenhubctl version # CLI version
tokenhubctl help # Full usage
Admin Token
The admin-token command retrieves the admin token by checking, in order:
- TOKENHUB_ADMIN_TOKEN environment variable
- ~/.tokenhub/.admin-token file (native deployments)
- docker exec into the running container to read /data/.admin-token
This avoids the need to parse server logs. The token file is written automatically by the server at startup (whether auto-generated or set via env).
Rotating the Admin Token
tokenhubctl rotate-admin-token # Generate a new random token
tokenhubctl rotate-admin-token <token> # Replace with a specific token
After rotation, update your local environment:
make _write-env # Sync token from container to ~/.tokenhub/env
The new token takes effect immediately (no restart required) and is persisted to the data directory so it survives restarts. The old token is invalidated instantly.
Vault
tokenhubctl vault unlock <password>
tokenhubctl vault lock
tokenhubctl vault rotate <old-password> <new-password>
Providers
tokenhubctl provider list
tokenhubctl provider add '<json>'
tokenhubctl provider edit <id> '<json>'
tokenhubctl provider delete <id>
tokenhubctl provider discover <id>
The list command merges providers from both the persistent store and the runtime engine, showing the source of each.
The discover command queries a provider's /v1/models endpoint to list available models and whether each is already registered in TokenHub.
Example:
# Add a new provider
tokenhubctl provider add '{
"id": "openai",
"type": "openai",
"base_url": "https://api.openai.com",
"api_key": "sk-..."
}'
# Update its base URL
tokenhubctl provider edit openai '{"base_url":"https://api.openai.com"}'
# Discover available models
tokenhubctl provider discover openai
Models
tokenhubctl model list
tokenhubctl model add '<json>'
tokenhubctl model edit <id> '<json>'
tokenhubctl model delete <id>
tokenhubctl model enable <id>
tokenhubctl model disable <id>
Model IDs can contain slashes (e.g., Qwen/Qwen2.5-Coder-32B-Instruct). The CLI handles them correctly.
Example:
# Add a model
tokenhubctl model add '{
"id": "gpt-4o",
"provider_id": "openai",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.0025,
"output_per_1k": 0.01,
"enabled": true
}'
# Adjust its weight
tokenhubctl model edit gpt-4o '{"weight": 9}'
# Temporarily disable it
tokenhubctl model disable gpt-4o
Routing
tokenhubctl routing get
tokenhubctl routing set '<json>'
Example:
tokenhubctl routing set '{"default_mode":"cheap","default_max_budget_usd":0.02,"default_max_latency_ms":10000}'
API Keys
tokenhubctl apikey list
tokenhubctl apikey create '<json>'
tokenhubctl apikey rotate <id>
tokenhubctl apikey edit <id> '<json>'
tokenhubctl apikey delete <id>
The create command prints the API key exactly once. Save it immediately.
Example:
tokenhubctl apikey create '{"name":"prod-app","scopes":"[\"chat\",\"plan\"]","monthly_budget_usd":50.0}'
Observability
tokenhubctl logs [--limit N] # Request logs
tokenhubctl audit [--limit N] # Audit trail
tokenhubctl rewards [--limit N] # Thompson Sampling reward data
tokenhubctl stats # Aggregated statistics
tokenhubctl engine models # Runtime model registry and adapter info
tokenhubctl events # Live SSE event stream (Ctrl-C to stop)
Routing Simulation
Run a what-if simulation without sending a real request:
tokenhubctl simulate '{"mode":"cheap","token_count":500}'
tokenhubctl simulate '{"mode":"high_confidence","token_count":2000,"max_budget_usd":0.10}'
TSDB
tokenhubctl tsdb metrics
tokenhubctl tsdb query 'metric=latency&model_id=gpt-4o&step_ms=60000'
tokenhubctl tsdb prune
Output Format
Most commands produce human-readable tabular output. For programmatic use, pipe JSON responses directly from curl or parse tokenhubctl output with standard text tools.
Architecture
TokenHub is a Go application structured as a layered system with clear package boundaries and dependency injection.
Package Layout
tokenhub/
├── cmd/tokenhub/ # Entry point, signal handling, HTTP server lifecycle
├── internal/
│ ├── app/ # Server construction, config loading, wiring
│ ├── apikey/ # API key manager + auth middleware
│ ├── events/ # In-memory event bus (pub/sub for SSE)
│ ├── health/ # Provider health tracker + active prober
│ ├── httpapi/ # HTTP handlers and route mounting
│ ├── logging/ # Structured logging setup (slog)
│ ├── metrics/ # Prometheus metric registry
│ ├── providers/ # Provider adapter contract + context helpers
│ │ ├── openai/ # OpenAI adapter
│ │ ├── anthropic/ # Anthropic adapter
│ │ └── vllm/ # vLLM adapter
│ ├── router/ # Routing engine, scoring, orchestration, Thompson Sampling
│ ├── stats/ # In-memory statistics collector
│ ├── store/ # Persistence layer (SQLite)
│ ├── temporal/ # Temporal workflow integration
│ ├── tsdb/ # Time-series database (SQLite-backed)
│ └── vault/ # AES-256-GCM encrypted credential vault
├── web/ # Embedded admin UI (index.html)
└── docs/ # This documentation
Dependency Flow
cmd/tokenhub/main.go
└── internal/app.NewServer(cfg)
├── vault.New()
├── router.NewEngine()
├── store.NewSQLite()
├── health.NewTracker()
├── health.NewProber() → health.Tracker
├── loadCredentialsFile() → router.Engine
├── loadPersistedProviders() → router.Engine
├── router.NewThompsonSampler()
├── apikey.NewManager() → store.Store
├── metrics.New()
├── events.NewBus()
├── stats.NewCollector()
├── tsdb.New()
├── temporal.New() → (optional)
└── httpapi.MountRoutes() → Dependencies{...}
All dependencies flow downward. HTTP handlers receive a Dependencies struct containing all services they need.
Key Interfaces
router.Sender
The provider adapter contract:
type Sender interface {
ID() string
Send(ctx context.Context, model string, req Request) (ProviderResponse, error)
ClassifyError(err error) *ClassifiedError
}
router.StreamSender
Optional streaming extension:
type StreamSender interface {
Sender
SendStream(ctx context.Context, model string, req Request) (io.ReadCloser, error)
}
health.Probeable
Health probe interface for providers:
type Probeable interface {
ID() string
HealthEndpoint() string
}
store.Store
Persistence interface with methods for models, providers, request logs, audit logs, reward entries, API keys, vault blobs, and routing configuration.
Request Lifecycle
- HTTP handler receives the request, validates input, extracts API key
- Directive parser scans messages for @@tokenhub overrides and strips them
- Policy resolution: Merge request policy with server defaults and directive overrides
- Token estimation: Estimate input tokens (explicit or chars/4 heuristic)
- Model selection: Filter eligible models, score by policy weights, sort
- Provider dispatch: Call the top-scored model's adapter
- Error handling: On failure, classify the error and escalate/retry/failover
- Output shaping: Apply output format (JSON schema validation, think-block stripping)
- Observability: Record metrics, TSDB points, request logs, reward entries, SSE events
- Response: Return the provider response with routing metadata
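The token-estimation step in the lifecycle above uses a chars/4 heuristic when no explicit count is supplied. A minimal sketch (the function name and signature are illustrative, not the engine's actual identifiers):

```go
package main

import "fmt"

// estimateTokens applies the chars/4 heuristic from the request lifecycle:
// if the client supplies an explicit token count it wins; otherwise the
// estimate is the total message length in characters divided by four.
func estimateTokens(messages []string, explicit int) int {
	if explicit > 0 {
		return explicit
	}
	chars := 0
	for _, m := range messages {
		chars += len(m)
	}
	est := chars / 4
	if est < 1 {
		est = 1 // never estimate zero tokens for a non-empty request
	}
	return est
}

func main() {
	msgs := []string{"You are a helpful assistant.", "Summarize this paragraph for me."}
	fmt.Println(estimateTokens(msgs, 0))   // heuristic estimate
	fmt.Println(estimateTokens(msgs, 750)) // explicit count wins
}
```

The estimate then feeds both eligibility filtering (context-window headroom) and cost scoring.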
Concurrency Model
- The HTTP server uses Go's standard net/http with the chi router (goroutine per request)
- The TSDB uses internal write buffering (batched inserts)
- The health prober runs as a background goroutine with configurable interval
- The Thompson Sampler refresh runs as a background goroutine
- The TSDB prune loop runs as a background goroutine (hourly)
- Temporal workflows (when enabled) are managed by the Temporal worker
All background goroutines are cleanly stopped via Server.Close().
Configuration
All configuration is via environment variables, loaded in internal/app/config.go. See Configuration Reference for the complete list.
Embedding
The admin UI (web/index.html) is embedded in the binary using Go's //go:embed directive in the root embed.go file. This means the entire application is a single self-contained binary.
Routing Engine
The routing engine (internal/router/engine.go) is TokenHub's core component. It manages the model registry, scores models against request policies, dispatches to provider adapters, and handles failover.
Engine Structure
type Engine struct {
adapters map[string]Sender // provider ID → adapter
models []Model // registered models
healthChecker HealthChecker // optional health state provider
banditPolicy BanditPolicy // optional Thompson Sampling
defaults EngineConfig // default mode, budget, latency
}
Model Registration
Models and adapters are registered at startup and can be modified at runtime:
eng.RegisterAdapter(openai.New("openai", apiKey, baseURL))
eng.RegisterModel(router.Model{
ID: "gpt-4", ProviderID: "openai",
Weight: 8, MaxContextTokens: 128000,
InputPer1K: 0.01, OutputPer1K: 0.03, Enabled: true,
})
Scoring Algorithm
The scoreModel() function computes a composite score for each eligible model:
score = (costNorm * w.Cost) + (latencyNorm * w.Latency) + (failureNorm * w.Failure) - (weightNorm * w.Weight)
Normalization:
- costNorm: estimatedCost / maxBudgetUSD (clamped to 0-1)
- latencyNorm: avgLatencyMs / maxLatencyMs (from health tracker)
- failureNorm: errorRate (from health tracker, 0-1)
- weightNorm: model.Weight / 10.0
Lower scores are better. The weight term is subtracted (higher weight reduces score).
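The formula and normalization rules above can be condensed into a self-contained sketch. The Weights struct, clamp helper, and the example weight values are illustrative stand-ins, not the engine's real types:

```go
package main

import "fmt"

// Weights holds per-objective weights (values here are invented for the demo).
type Weights struct {
	Cost, Latency, Failure, Weight float64
}

func clamp01(x float64) float64 {
	if x < 0 {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}

// scoreModel mirrors the documented composite: lower is better, and the
// capability-weight term is subtracted, so a heavier model scores lower.
func scoreModel(estCostUSD, maxBudgetUSD, avgLatencyMs, maxLatencyMs, errorRate, modelWeight float64, w Weights) float64 {
	costNorm := clamp01(estCostUSD / maxBudgetUSD)
	latencyNorm := clamp01(avgLatencyMs / maxLatencyMs)
	failureNorm := clamp01(errorRate)
	weightNorm := modelWeight / 10.0
	return costNorm*w.Cost + latencyNorm*w.Latency + failureNorm*w.Failure - weightNorm*w.Weight
}

func main() {
	w := Weights{Cost: 0.4, Latency: 0.3, Failure: 0.2, Weight: 0.1}
	cheap := scoreModel(0.001, 0.05, 800, 10000, 0.0, 5, w)
	pricey := scoreModel(0.04, 0.05, 2500, 10000, 0.1, 8, w)
	fmt.Printf("cheap=%.4f pricey=%.4f\n", cheap, pricey)
}
```

Note that a score can go negative when a model's weight term outweighs its cost and latency penalties; only the relative ordering matters.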
Eligibility Filtering
eligibleModels() filters the model registry:
- Must be Enabled
- Must meet the min_weight threshold
- Must have sufficient context window (estimated tokens * 1.15 headroom)
- Provider must not be in "down" health state
- Estimated cost must be within budget
For thompson mode, eligible models are reordered by Thompson Sampling instead of the scoring function.
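The filter chain above can be sketched as a single pass over the registry. The Model struct and providerState callback below are simplified stand-ins for the engine's actual types:

```go
package main

import "fmt"

// Model is a simplified stand-in for the engine's registered-model type.
type Model struct {
	ID               string
	ProviderID       string
	Weight           int
	MaxContextTokens int
	InputPer1K       float64
	Enabled          bool
}

// eligibleModels applies the documented filters: enabled, weight floor,
// 1.15x context headroom, provider not "down", and estimated cost in budget.
func eligibleModels(models []Model, minWeight, estTokens int, maxBudgetUSD float64, providerState func(string) string) []Model {
	var out []Model
	needed := int(float64(estTokens) * 1.15)
	for _, m := range models {
		estCost := float64(estTokens) / 1000 * m.InputPer1K
		if !m.Enabled || m.Weight < minWeight || m.MaxContextTokens < needed ||
			providerState(m.ProviderID) == "down" || estCost > maxBudgetUSD {
			continue
		}
		out = append(out, m)
	}
	return out
}

func main() {
	models := []Model{
		{ID: "small", ProviderID: "a", Weight: 3, MaxContextTokens: 4096, InputPer1K: 0.0005, Enabled: true},
		{ID: "big", ProviderID: "b", Weight: 9, MaxContextTokens: 128000, InputPer1K: 0.01, Enabled: true},
	}
	healthy := func(string) string { return "healthy" }
	for _, m := range eligibleModels(models, 5, 2000, 0.05, healthy) {
		fmt.Println(m.ID) // only "big" clears the weight floor of 5
	}
}
```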
RouteAndSend Flow
func (e *Engine) RouteAndSend(ctx context.Context, req Request, policy Policy) (Decision, ProviderResponse, error)
- Resolve defaults (fill in zero-value policy fields from server defaults)
- Get eligible models
- If model_hint is set and the model exists, try it first
- Sort remaining models by score
- For each model (up to 5 attempts):
  a. Look up the adapter by model.ProviderID
  b. Call adapter.Send(ctx, model.ID, req)
  c. On success: return decision + response
  d. On error: classify the error and decide the next action:
     - ErrContextOverflow: Find a model with a larger context window
     - ErrRateLimited: Skip to the next provider (honor RetryAfter)
     - ErrTransient: Retry the same model with exponential backoff
     - ErrFatal: Try the next model
Orchestration
Orchestrate() handles multi-model modes:
func (e *Engine) Orchestrate(ctx context.Context, req Request, dir OrchestrationDirective) (Decision, json.RawMessage, error)
See Orchestration Modes for details.
Streaming
func (e *Engine) RouteAndStream(ctx context.Context, req Request, policy Policy) (Decision, io.ReadCloser, error)
Same model selection as RouteAndSend, but calls SendStream() on adapters that implement StreamSender. Returns the raw SSE stream body for the HTTP handler to proxy.
Health Integration
The engine optionally uses a HealthChecker interface:
type HealthChecker interface {
ProviderState(providerID string) ProviderHealthState
}
This provides:
- Error rate for scoring (failureNorm)
- "Down" state for eligibility filtering
- Average latency for scoring (latencyNorm)
Thompson Sampling Integration
When a BanditPolicy is set:
type BanditPolicy interface {
Sample(models []Model, tokenBucket string) []Model
}
In thompson mode, eligibleModels() calls banditPolicy.Sample() instead of the scoring function. The sampler draws from Beta distributions parameterized by historical reward data.
Thread Safety
The engine uses sync.RWMutex to protect the model registry and adapter map. Reads (model selection, routing) take a read lock. Writes (register/unregister) take a write lock.
Orchestration Modes
Orchestration enables multi-model reasoning patterns. The orchestration logic lives in internal/router/engine.go in the Orchestrate() method.
Architecture
Orchestrate(req, directive)
├── adversarial: Plan → Critique → Refine (loop)
├── vote: N Voters → Judge → Select best
├── refine: Generate → Refine → Refine (loop)
└── planning: Single RouteAndSend with planning profile
Model Selection for Orchestration
Each orchestration mode needs a "primary" model and optionally a "review" model. Models are selected by:
- Explicit model ID: primary_model_id / review_model_id in the directive
- Weight floor: primary_min_weight / review_min_weight sets minimum capability
- Automatic: Falls back to routing engine scoring with the appropriate policy
For review models, the policy uses high_confidence mode by default to ensure a capable judge/critic.
Adversarial Mode
Three-phase iterative refinement with a separate critique model:
// Phase 1: Plan
planResp = RouteAndSend(req with "Create a detailed plan...")
// Phase 2: Critique (loop N iterations)
critiqueResp = RouteAndSend(req with "Critique this plan: ...")
// Phase 3: Refine
refinedResp = RouteAndSend(req with "Refine based on critique: ...")
The critique and refine phases repeat for directive.Iterations (default 1).
Output schema:
{
"initial_plan": "Plan text from phase 1",
"critique": "Final critique from last iteration",
"refined_plan": "Final refined plan from last iteration"
}
Vote Mode
Multiple models respond independently, a judge selects the best:
// Phase 1: Collect votes (one per eligible model, up to 3)
for model in eligibleModels:
responses[model] = RouteAndSend(req, model)
// Phase 2: Judge
judgeResp = RouteAndSend(req with "Select the best response (1-N): ...")
selectedIdx = parseNumber(judgeResp) - 1
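The parseNumber step above turns the judge's free-text reply into a 1-based index. The helper name comes from the pseudocode; the parsing details below are an assumption about how such a reply might be handled:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseNumber extracts the first run of digits from the judge's reply, so
// both a bare "2" and "Response 2 is best." resolve to 2. Returns 0 when no
// digit is found, which the caller can treat as an invalid vote.
func parseNumber(s string) int {
	start := -1
	for i, r := range s {
		if r >= '0' && r <= '9' {
			if start < 0 {
				start = i
			}
		} else if start >= 0 {
			n, _ := strconv.Atoi(s[start:i])
			return n
		}
	}
	if start >= 0 {
		n, _ := strconv.Atoi(s[start:])
		return n
	}
	return 0
}

func main() {
	for _, reply := range []string{"2", "Response 2 is best.", "I pick #3"} {
		fmt.Println(parseNumber(reply)) // 1-based; the caller subtracts 1
	}
}
```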
Output schema:
{
"responses": [
{"model": "gpt-4", "content": "...", "selected": true},
{"model": "claude-sonnet", "content": "...", "selected": false}
],
"selected": 0,
"judge": "claude-opus"
}
Refine Mode
Single model iteratively improves its own response:
// Phase 1: Initial response
resp = RouteAndSend(req)
// Phase 2: Iterative refinement (loop N iterations)
for i := 0; i < iterations; i++:
resp = RouteAndSend(req with "Review and improve: " + resp)
Output schema:
{
"refined_response": "Final refined text",
"iterations": 3,
"model": "claude-opus"
}
Planning Mode
Falls through to a standard RouteAndSend with the planning routing profile:
decision, resp, err = RouteAndSend(req, Policy{Mode: "planning"})
Cost and Latency
Orchestration makes multiple LLM calls. The Decision returned by Orchestrate() accumulates costs from all calls:
totalDecision.EstimatedCostUSD += stepDecision.EstimatedCostUSD
The routing reason is set to {mode}-orchestration (e.g., adversarial-orchestration).
Temporal Integration
When Temporal is enabled, orchestration runs as an OrchestrationWorkflow:
- Each LLM call becomes a Temporal activity
- Activities run with retry policies and timeouts
- The full execution is visible in the Temporal UI
- If Temporal is unavailable, falls back to direct orchestration
See Temporal Workflows for details.
Adding New Orchestration Modes
To add a new mode:
- Add the mode name to the validation list in handlers_plan.go
- Add a case in Orchestrate() in engine.go
- Implement the multi-call pattern following existing modes
- Return a json.RawMessage with the composite result
- Update OrchestrationWorkflow in temporal/workflows.go if using Temporal
Provider Adapters
Provider adapters translate TokenHub's generic request format into provider-specific API calls. Each adapter implements the router.Sender interface.
Interface
// Sender is the core provider adapter interface.
type Sender interface {
ID() string
Send(ctx context.Context, model string, req Request) (ProviderResponse, error)
ClassifyError(err error) *ClassifiedError
}
// StreamSender extends Sender with streaming support.
type StreamSender interface {
Sender
SendStream(ctx context.Context, model string, req Request) (io.ReadCloser, error)
}
// Probeable enables active health probing.
type Probeable interface {
ID() string
HealthEndpoint() string
}
ProviderResponse is []byte (raw JSON from the provider).
Existing Adapters
OpenAI (internal/providers/openai/)
- Endpoint: POST {baseURL}/v1/chat/completions
- Health: GET {baseURL}/v1/models
- Auth: Authorization: Bearer {apiKey}
- Request translation: Maps req.Messages to OpenAI chat format, merges req.Parameters
- Error classification:
  - 429 → ErrRateLimited (with Retry-After header parsing)
  - 5xx → ErrTransient
  - Body contains context_length_exceeded → ErrContextOverflow
  - Other → ErrFatal
Anthropic (internal/providers/anthropic/)
- Endpoint:
POST {baseURL}/v1/messages - Health:
GET {baseURL}/v1/messages(405 = healthy) - Auth:
x-api-key: {apiKey},anthropic-version: 2023-06-01 - Request translation: Splits system message from user messages (Anthropic API requires separate
systemfield), defaultsmax_tokensto 4096 if not inreq.Parameters - Error classification: Same pattern as OpenAI
vLLM (internal/providers/vllm/)
- Endpoint: POST {endpoint}/v1/chat/completions (OpenAI-compatible)
- Health: GET {endpoint}/health
- Auth: None (local deployment)
- Features: Multiple endpoints with round-robin load balancing
- Request translation: Same as OpenAI (vLLM implements OpenAI-compatible API)
Common Patterns
Parameter Forwarding
All adapters merge req.Parameters into the provider payload:
for k, v := range req.Parameters {
if k != "model" && k != "messages" {
payload[k] = v
}
}
Reserved keys (model, messages, stream) are never overridden by parameters.
Request ID Propagation
All adapters forward the request ID for distributed tracing:
if reqID := providers.GetRequestID(ctx); reqID != "" {
req.Header.Set("X-Request-ID", reqID)
}
The request ID is injected into the context by the HTTP handler using providers.WithRequestID().
Error Wrapping
Adapters wrap HTTP errors in providers.StatusError:
type StatusError struct {
StatusCode int
Body string
RetryAfterSecs float64
}
The ClassifyError() method on each adapter converts these to router.ClassifiedError for the routing engine's failover logic.
Creating a New Adapter
To add support for a new provider:
- Create internal/providers/{name}/adapter.go
- Implement router.Sender (and optionally router.StreamSender and health.Probeable)
- Add an Option pattern for configuration (timeout, endpoints, etc.)
- Add a case for the new type in registerProviderAdapter() in internal/httpapi/handlers_admin.go
- Register providers and models at runtime via the admin API or tokenhubctl
Example skeleton:
package newprovider
import (
"context"
"github.com/jordanhubbard/tokenhub/internal/router"
)
type Adapter struct {
id string
apiKey string
// ...
}
func New(id, apiKey string) *Adapter {
return &Adapter{id: id, apiKey: apiKey}
}
func (a *Adapter) ID() string { return a.id }
func (a *Adapter) Send(ctx context.Context, model string, req router.Request) (router.ProviderResponse, error) {
// Translate req to provider format, make the HTTP call, return raw JSON.
return nil, nil // TODO
}
func (a *Adapter) ClassifyError(err error) *router.ClassifiedError {
// Classify the error for failover logic.
return nil // TODO
}
func (a *Adapter) HealthEndpoint() string {
return "https://api.newprovider.com/health"
}
Health System
The health system tracks provider reliability and provides both passive monitoring (based on request outcomes) and active probing (periodic HTTP checks).
Components
Health Tracker (internal/health/tracker.go)
The tracker maintains per-provider health state:
type ProviderHealthState struct {
State string // "healthy", "degraded", "down"
TotalRequests int64
TotalErrors int64
ConsecErrors int
AvgLatencyMs float64 // Exponential moving average
LastError string
LastSuccessAt time.Time
CooldownUntil time.Time
}
State Transitions
success
┌─────────────────────────────────┐
│ │
▼ 2+ consec errors │
Healthy ──────────────────────► Degraded
▲ │
│ success │ 5+ consec errors
│◄────────────────────────────────┤
│ ▼
│ Down
│ cooldown expired │
│ + success │
└─────────────────────────────────┘
Configuration
type Config struct {
DegradedThreshold int // Consecutive errors to enter degraded (default: 2)
DownThreshold int // Consecutive errors to enter down (default: 5)
CooldownDuration time.Duration // Time in down state before retry (default: 30s)
}
Recording Results
// Called after every provider request
tracker.RecordSuccess(providerID, latencyMs)
tracker.RecordError(providerID, errorMsg)
Each success resets the consecutive error counter. Each error increments it and potentially triggers a state transition.
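The transition rules can be condensed into a small sketch using the default thresholds above. The function name is illustrative, and the real tracker additionally handles the cooldown window before a down provider is retried:

```go
package main

import "fmt"

// nextState maps a consecutive-error count to a health state using the
// documented thresholds: 2+ errors → degraded, 5+ → down. Any success
// resets the counter, which maps back to "healthy".
func nextState(consecErrors, degradedAt, downAt int) string {
	switch {
	case consecErrors >= downAt:
		return "down"
	case consecErrors >= degradedAt:
		return "degraded"
	default:
		return "healthy"
	}
}

func main() {
	for _, n := range []int{0, 1, 2, 4, 5, 9} {
		fmt.Printf("%d consecutive errors -> %s\n", n, nextState(n, 2, 5))
	}
}
```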
Health Prober (internal/health/prober.go)
The prober performs active health checks against provider endpoints:
type Probeable interface {
ID() string
HealthEndpoint() string
}
Probe Logic
- Sends GET requests to each provider's health endpoint
- Runs all probes concurrently with a per-probe timeout
- 2xx or 405 responses are considered healthy (405 is expected from some endpoints like Anthropic's /v1/messages)
- Any other response or connection error records a failure
Configuration
type ProberConfig struct {
Interval time.Duration // Time between probe rounds (default: 30s)
Timeout time.Duration // Per-probe HTTP timeout (default: 10s)
}
Provider Health Endpoints
| Provider | Endpoint | Success |
|---|---|---|
| OpenAI | GET /v1/models | 2xx |
| Anthropic | GET /v1/messages | 2xx or 405 |
| vLLM | GET /health | 2xx |
Integration with Routing
The routing engine queries health state during model selection:
- Eligibility: Models from providers in "down" state are excluded
- Scoring: The failure rate (totalErrors / totalRequests) contributes to the model's score
- Latency: The exponential moving average latency contributes to the model's score
type HealthChecker interface {
ProviderState(providerID string) ProviderHealthState
}
The tracker implements this interface and is passed to the engine via engine.SetHealthChecker().
Observability
Provider health is exposed via:
- GET /admin/v1/health — JSON health state for all providers
- Admin UI health panel — Visual health badges
- SSE events — Error events include provider state changes
Storage Layer
TokenHub uses SQLite for persistence, providing a zero-dependency embedded database. The storage layer is defined by the store.Store interface and implemented by store.SQLiteStore.
Interface
The Store interface (internal/store/store.go) provides methods for all persistence needs:
Models
UpsertModel(ctx, Model) error
GetModel(ctx, id) (*Model, error)
ListModels(ctx) ([]Model, error)
DeleteModel(ctx, id) error
Providers
UpsertProvider(ctx, Provider) error
ListProviders(ctx) ([]Provider, error)
DeleteProvider(ctx, id) error
Request Logs
LogRequest(ctx, RequestLog) error
ListRequestLogs(ctx, limit, offset) ([]RequestLog, error)
Audit Logs
LogAudit(ctx, AuditEntry) error
ListAuditLogs(ctx, limit, offset) ([]AuditEntry, error)
Reward Entries
LogReward(ctx, RewardEntry) error
ListRewardEntries(ctx, limit, offset) ([]RewardEntry, error)
GetRewardSummary(ctx) ([]RewardSummary, error)
API Keys
CreateAPIKey(ctx, APIKeyRecord) error
GetAPIKey(ctx, id) (*APIKeyRecord, error)
ListAPIKeys(ctx) ([]APIKeyRecord, error)
UpdateAPIKey(ctx, APIKeyRecord) error
DeleteAPIKey(ctx, id) error
Vault Blob
SaveVaultBlob(ctx, salt, data) error
LoadVaultBlob(ctx) (salt, data, error)
Routing Configuration
SaveRoutingConfig(ctx, RoutingConfig) error
LoadRoutingConfig(ctx) (RoutingConfig, error)
Schema
The database schema is created and migrated in sqlite.go's Migrate() method:
models
CREATE TABLE IF NOT EXISTS models (
id TEXT PRIMARY KEY,
provider_id TEXT NOT NULL,
weight INTEGER NOT NULL DEFAULT 5,
max_context_tokens INTEGER NOT NULL DEFAULT 4096,
input_per_1k REAL NOT NULL DEFAULT 0,
output_per_1k REAL NOT NULL DEFAULT 0,
enabled INTEGER NOT NULL DEFAULT 1
);
providers
CREATE TABLE IF NOT EXISTS providers (
id TEXT PRIMARY KEY,
type TEXT NOT NULL,
enabled INTEGER NOT NULL DEFAULT 1,
base_url TEXT NOT NULL DEFAULT '',
cred_store TEXT NOT NULL DEFAULT 'none'
);
request_logs
CREATE TABLE IF NOT EXISTS request_logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
request_id TEXT NOT NULL DEFAULT '',
model_id TEXT NOT NULL DEFAULT '',
provider_id TEXT NOT NULL DEFAULT '',
mode TEXT NOT NULL DEFAULT '',
estimated_cost_usd REAL NOT NULL DEFAULT 0,
latency_ms INTEGER NOT NULL DEFAULT 0,
status_code INTEGER NOT NULL DEFAULT 0,
error_class TEXT NOT NULL DEFAULT ''
);
audit_logs
CREATE TABLE IF NOT EXISTS audit_logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
action TEXT NOT NULL,
resource TEXT NOT NULL DEFAULT '',
request_id TEXT NOT NULL DEFAULT ''
);
reward_entries
CREATE TABLE IF NOT EXISTS reward_entries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
request_id TEXT NOT NULL DEFAULT '',
model_id TEXT NOT NULL DEFAULT '',
provider_id TEXT NOT NULL DEFAULT '',
mode TEXT NOT NULL DEFAULT '',
estimated_tokens INTEGER NOT NULL DEFAULT 0,
token_bucket TEXT NOT NULL DEFAULT '',
latency_budget_ms REAL NOT NULL DEFAULT 0,
latency_ms REAL NOT NULL DEFAULT 0,
cost_usd REAL NOT NULL DEFAULT 0,
success INTEGER NOT NULL DEFAULT 0,
error_class TEXT NOT NULL DEFAULT '',
reward REAL NOT NULL DEFAULT 0
);
api_keys
CREATE TABLE IF NOT EXISTS api_keys (
id TEXT PRIMARY KEY,
key_hash TEXT NOT NULL,
key_prefix TEXT NOT NULL,
name TEXT NOT NULL,
scopes TEXT NOT NULL DEFAULT '["chat","plan"]',
created_at TEXT NOT NULL,
last_used_at TEXT,
expires_at TEXT,
rotation_days INTEGER NOT NULL DEFAULT 0,
enabled INTEGER NOT NULL DEFAULT 1
);
vault_blob
CREATE TABLE IF NOT EXISTS vault_blob (
id TEXT PRIMARY KEY DEFAULT 'singleton',
salt TEXT,
data_json TEXT
);
routing_config
CREATE TABLE IF NOT EXISTS routing_config (
id TEXT PRIMARY KEY DEFAULT 'default',
default_mode TEXT NOT NULL DEFAULT '',
default_max_budget_usd REAL NOT NULL DEFAULT 0,
default_max_latency_ms INTEGER NOT NULL DEFAULT 0
);
SQLite Configuration
The default DSN includes pragmas for performance:
file:/data/tokenhub.sqlite?_pragma=busy_timeout(5000)&_pragma=journal_mode(WAL)
- busy_timeout: Wait up to 5 seconds for locks instead of failing immediately
- journal_mode(WAL): Write-Ahead Logging for concurrent read/write access
TSDB
The time-series database (internal/tsdb/) uses a separate table in the same SQLite database:
CREATE TABLE IF NOT EXISTS tsdb_points (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ts INTEGER NOT NULL, -- Unix nanoseconds
metric TEXT NOT NULL,
model_id TEXT NOT NULL DEFAULT '',
provider_id TEXT NOT NULL DEFAULT '',
value REAL NOT NULL
);
Features:
- Write buffering (batch size 100)
- Automatic retention pruning (default 7 days)
- Downsampling support (configurable step size in queries)
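The write-buffering behavior can be sketched as a simple batching writer. The type and its flush callback are illustrative, not the tsdb package's real API:

```go
package main

import "fmt"

// Point is a single metric sample (simplified from the tsdb_points schema).
type Point struct {
	TS     int64
	Metric string
	Value  float64
}

// bufferedWriter accumulates points and hands them to flush in batches,
// mimicking the TSDB's batch-of-100 insert behavior.
type bufferedWriter struct {
	buf       []Point
	batchSize int
	flush     func([]Point) // would perform one multi-row INSERT
}

func (w *bufferedWriter) Write(p Point) {
	w.buf = append(w.buf, p)
	if len(w.buf) >= w.batchSize {
		w.Flush()
	}
}

// Flush drains any buffered points; called on shutdown so no samples are lost.
func (w *bufferedWriter) Flush() {
	if len(w.buf) == 0 {
		return
	}
	w.flush(w.buf)
	w.buf = nil
}

func main() {
	batches := 0
	w := &bufferedWriter{batchSize: 100, flush: func(pts []Point) {
		batches++
		fmt.Printf("batch %d: %d points\n", batches, len(pts))
	}}
	for i := 0; i < 250; i++ {
		w.Write(Point{TS: int64(i), Metric: "latency", Value: float64(i)})
	}
	w.Flush() // drain the remaining 50
}
```

Batching trades a small write delay for far fewer SQLite transactions, which matters under WAL with many concurrent readers.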
Security Model
TokenHub implements security at multiple layers: credential encryption, client authentication, input validation, and audit logging.
Credential Security
Vault Encryption
Provider API keys are encrypted using AES-256-GCM:
- Admin provides a vault password
- Password + random salt → Argon2id key derivation → 256-bit encryption key
- Each value is encrypted with a unique nonce
- Encrypted values are stored in SQLite
Argon2id Parameters (per OWASP recommendations):
- Time: 3 iterations
- Memory: 64 MB
- Threads: 4
- Salt: 16 random bytes
Key Material Handling
- Encryption keys exist only in memory while the vault is unlocked
- Auto-lock clears the key after 30 minutes of inactivity
- Vault salt is persisted in the database for key re-derivation
- Password rotation re-encrypts all values atomically
Admin Authentication
Admin Token
All /admin/v1/* endpoints require a bearer token set via TOKENHUB_ADMIN_TOKEN.
If not set, the server auto-generates a cryptographically random 64-character hex
token at startup. The token is never logged — it is written to a file at
/data/.admin-token (or ~/.tokenhub/.admin-token for native deployments) and
can be retrieved with:
tokenhubctl admin-token
Client Authentication
API Key Security
- Keys are hashed with bcrypt (cost 10) before storage
- SHA-256 pre-hash allows keys longer than bcrypt's 72-byte input limit
- 5-minute validation cache reduces bcrypt overhead
- Plaintext is shown only once at creation/rotation
Key Validation Flow
Request → Extract Bearer token → Check cache (5min TTL)
├── Cache hit → Check scopes → Allow/Deny
└── Cache miss → Load by prefix → bcrypt verify → Check enabled → Check expiry
├── Valid → Update cache + last_used_at → Check scopes → Allow/Deny
└── Invalid → 401 Unauthorized
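The SHA-256 pre-hash mentioned above is the piece that lifts bcrypt's 72-byte limit. A stdlib-only sketch of that step (the real manager then runs bcrypt, from golang.org/x/crypto/bcrypt, over the digest rather than the raw key):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// prehash reduces an API key of any length to a fixed 64-character hex
// digest, which always fits under bcrypt's 72-byte input limit. The
// function name is illustrative.
func prehash(apiKey string) string {
	sum := sha256.Sum256([]byte(apiKey))
	return hex.EncodeToString(sum[:])
}

func main() {
	longKey := "th_" + strings.Repeat("a", 200) // far past bcrypt's 72-byte limit
	digest := prehash(longKey)
	fmt.Println(len(digest)) // always 64, regardless of key length
}
```

Hashing before bcrypt also means the database never sees key length information, only the fixed-size digest's bcrypt hash.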
Input Validation
All API inputs are validated before processing:
Chat/Plan Endpoints
- Messages array: required, non-empty
- max_budget_usd: 0-100 range
- max_latency_ms: 0-300000 range
- min_weight: 0-10 range
- Orchestration iterations: 0-10 range
- Orchestration mode: must be a known value
Admin Endpoints
- Routing config mode: must be a known value
- Routing config budget/latency: same ranges as consumer API
- Model weight: reasonable range
- API key name: required
Request Isolation
- Each request gets its own context with a unique request ID
- Provider API keys are never exposed to clients
- Client API key records are attached to context but not serialized in responses
- Request parameters are validated before forwarding to providers
Audit Trail
All administrative mutations are logged:
type AuditEntry struct {
Timestamp time.Time
Action string // e.g., "vault.unlock", "model.patch"
Resource string // Resource identifier
RequestID string // For correlation
}
Auditable actions:
- Vault operations (lock, unlock, rotate)
- Provider CRUD
- Model CRUD
- API key lifecycle (create, rotate, update, revoke)
- Routing configuration changes
Network Security
TokenHub itself does not implement TLS. In production:
- Use a reverse proxy (nginx, Caddy, Traefik) for TLS termination
- Restrict admin endpoints to internal networks or VPN
- Use CORS appropriately (currently allows all origins for development)
Recommendations
- Vault password: Use a strong, unique password (16+ characters)
- API key rotation: Rotate keys every 90 days (configurable via rotation_days)
- Network segmentation: Keep admin endpoints behind a VPN or firewall
- TLS everywhere: Terminate TLS at a reverse proxy in front of TokenHub
- Database backups: SQLite file contains encrypted credentials and configuration
- Monitor audit logs: Set up alerting on unexpected admin actions
Temporal Workflows
TokenHub optionally integrates with Temporal for durable workflow execution. When enabled, every chat and orchestration request is dispatched as a Temporal workflow, providing visibility, retry guarantees, and execution history.
Architecture
HTTP Handler
│
├── Temporal Enabled?
│ ├── Yes → Start Temporal Workflow → Wait for result → Return response
│ └── No → Direct engine call → Return response
│
└── Temporal Unavailable (runtime)
└── Fall back to direct engine call
Configuration
| Env Var | Default | Description |
|---|---|---|
TOKENHUB_TEMPORAL_ENABLED | false | Enable Temporal dispatch |
TOKENHUB_TEMPORAL_HOST | localhost:7233 | Temporal server address |
TOKENHUB_TEMPORAL_NAMESPACE | tokenhub | Temporal namespace |
TOKENHUB_TEMPORAL_TASK_QUEUE | tokenhub-tasks | Worker task queue name |
Components
Manager (internal/temporal/manager.go)
The manager creates and manages the Temporal client and worker:
type Manager struct {
client client.Client
worker worker.Worker
}
- New(cfg, activities) — Creates the Temporal client, registers workflows and activities
- Start() — Starts the worker (non-blocking)
- Client() — Returns the Temporal client for HTTP handlers
- Stop() — Gracefully stops the worker and closes the client
Types (internal/temporal/types.go)
Input/output types for workflows:
type ChatInput struct {
RequestID string
APIKeyID string
Request router.Request
Policy router.Policy
}
type ChatOutput struct {
Decision router.Decision
Response json.RawMessage
LatencyMs int64
Error string
}
type OrchestrationInput struct {
RequestID string
APIKeyID string
Request router.Request
Directive router.OrchestrationDirective
}
Activities (internal/temporal/activities.go)
Activities are the atomic units of work. They receive injected dependencies:
type Activities struct {
Engine *router.Engine
Store store.Store
Health *health.Tracker
Metrics *metrics.Registry
EventBus *events.Bus
Stats *stats.Collector
TSDB *tsdb.Store
}
Key activities:
- ChatActivity: Calls engine.RouteAndSend() and returns the result
- LogResultActivity: Persists metrics, request logs, reward entries, TSDB points, and SSE events
Workflows (internal/temporal/workflows.go)
- ChatWorkflow: Calls ChatActivity then LogResultActivity
- OrchestrationWorkflow: Calls ChatActivity for orchestration, then LogResultActivity
HTTP Handler Integration
Handlers check for a Temporal client and dispatch accordingly:
if d.TemporalClient != nil {
run, err := d.TemporalClient.ExecuteWorkflow(ctx, opts, ChatWorkflow, input)
if err != nil {
// Temporal unavailable — fall back
decision, resp, err = d.Engine.RouteAndSend(ctx, req, policy)
} else {
var output ChatOutput
err = run.Get(ctx, &output)
// Use output
}
} else {
decision, resp, err = d.Engine.RouteAndSend(ctx, req, policy)
}
The fallback ensures TokenHub continues to work even if Temporal becomes unavailable at runtime.
Workflow Visibility
Admin endpoints expose Temporal workflow data:
- GET /admin/v1/workflows?limit=50&status=RUNNING — List workflows
- GET /admin/v1/workflows/{id} — Describe a workflow
- GET /admin/v1/workflows/{id}/history — Activity history
Status values: RUNNING, COMPLETED, FAILED, CANCELED, TERMINATED, CONTINUED_AS_NEW, TIMED_OUT
Docker Compose Setup
temporal:
image: temporalio/auto-setup:latest
ports:
- "7233:7233"
environment:
- DB=sqlite
temporal-ui:
image: temporalio/ui:latest
ports:
- "8233:8080"
environment:
- TEMPORAL_ADDRESS=temporal:7233
Access the Temporal Web UI at http://localhost:8233.
Streaming Note
Streaming requests (stream: true) bypass Temporal and use direct engine dispatch. This is because streaming requires a persistent HTTP connection for SSE, which is incompatible with Temporal's request-response workflow model.
Extending TokenHub
This guide covers common extension points for adding functionality to TokenHub.
Adding a New Provider
- Create the adapter package:
internal/providers/newprovider/
├── adapter.go # Sender implementation
└── adapter_test.go # Tests
- Implement the interfaces:
package newprovider
type Adapter struct {
id string
apiKey string
baseURL string
client *http.Client
}
// Required: router.Sender
func (a *Adapter) ID() string { return a.id }
func (a *Adapter) Send(ctx context.Context, model string, req router.Request) (router.ProviderResponse, error) { ... }
func (a *Adapter) ClassifyError(err error) *router.ClassifiedError { ... }
// Optional: router.StreamSender
func (a *Adapter) SendStream(ctx context.Context, model string, req router.Request) (io.ReadCloser, error) { ... }
// Optional: health.Probeable
func (a *Adapter) HealthEndpoint() string { return a.baseURL + "/health" }
- Register via the admin API (providers and models are registered at runtime, not compiled in):
curl -X POST http://localhost:8080/admin/v1/providers \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"id":"newprovider","type":"openai","base_url":"https://api.newprovider.com","api_key":"..."}'
- Add adapter construction in registerProviderAdapter() in handlers_admin.go:
case "newprovider":
d.Engine.RegisterAdapter(newprovider.New(p.ID, apiKey, p.BaseURL, newprovider.WithTimeout(timeout)))
Adding a New Routing Mode
- Define the weight profile in internal/router/engine.go:
var modeWeights = map[string]weights{
// ...existing modes...
"mymode": {Cost: 0.3, Latency: 0.2, Failure: 0.2, Weight: 0.3},
}
- Add validation in internal/httpapi/handlers_chat.go and handlers_plan.go:
case "mymode":
// valid
- Add to routing config validation in handlers_routing.go.
Adding a New Orchestration Mode
- Add the case in engine.Orchestrate():
case "mymode":
// Implement multi-call pattern
result, err := json.Marshal(map[string]any{...})
return totalDecision, result, err
- Add validation in handlers_plan.go.
- Update Temporal if using workflows:
// In OrchestrationWorkflow
case "mymode":
// Implement as Temporal activities
Adding New Admin Endpoints
- Create the handler in internal/httpapi/handlers_newfeature.go:
func NewFeatureHandler(d Dependencies) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
// Handler logic
}
}
- Mount the route in internal/httpapi/routes.go:
r.Get("/admin/v1/newfeature", NewFeatureHandler(d))
- Add to Dependencies if new services are needed.
Adding New Metrics
In internal/metrics/metrics.go:
type Registry struct {
// ...existing metrics...
NewMetric *prometheus.CounterVec
}
func New() *Registry {
r := &Registry{
NewMetric: prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: "tokenhub",
Name: "new_metric_total",
Help: "Description of the new metric",
}, []string{"label1", "label2"}),
}
// Register with Prometheus
return r
}
Adding New Store Operations
- Add to the interface in internal/store/store.go
- Implement it for SQLite in internal/store/sqlite.go
- Add a migration in Migrate() if new tables are needed
- Write tests in internal/store/sqlite_test.go
Testing
TokenHub uses Go's standard testing package. Key test patterns:
- Unit tests: Each package has *_test.go files
- Integration tests: internal/httpapi/handlers_test.go tests the full HTTP stack
- Mock adapters: mockSender in handler tests simulates provider responses
- In-memory SQLite: Tests use the :memory: DSN for isolated databases
Run all tests:
make test # Standard tests
make test-race # With race detector
Build
make build # Build to bin/tokenhub
make package # Build Docker image
make lint # Run linter (requires golangci-lint)
make vet # Go vet
Configuration Reference
TokenHub is configured entirely via environment variables. All variables are optional and have sensible defaults.
Environment Variables
Server
| Variable | Default | Description |
|---|---|---|
TOKENHUB_LISTEN_ADDR | :8080 | HTTP server listen address (binds all interfaces) |
TOKENHUB_LOG_LEVEL | info | Log level: debug, info, warn, error |
TOKENHUB_DB_DSN | /data/tokenhub.sqlite | SQLite database path |
TOKENHUB_VAULT_ENABLED | true | Enable encrypted credential vault |
TOKENHUB_VAULT_PASSWORD | — | Auto-unlock vault at startup (headless mode) |
TOKENHUB_PROVIDER_TIMEOUT_SECS | 30 | HTTP timeout for provider API calls |
Routing Defaults
| Variable | Default | Description |
|---|---|---|
TOKENHUB_DEFAULT_MODE | normal | Default routing mode |
TOKENHUB_DEFAULT_MAX_BUDGET_USD | 0.05 | Default max cost per request (USD) |
TOKENHUB_DEFAULT_MAX_LATENCY_MS | 20000 | Default max latency (milliseconds) |
Security & Hardening
| Variable | Default | Description |
|---|---|---|
TOKENHUB_ADMIN_TOKEN | — | Bearer token for /admin/v1/* access (required in production) |
TOKENHUB_CORS_ORIGINS | * | Comma-separated allowed CORS origins |
TOKENHUB_RATE_LIMIT_RPS | 60 | Max requests per second per IP |
TOKENHUB_RATE_LIMIT_BURST | 120 | Burst capacity per IP |
Credentials
| Variable | Default | Description |
|---|---|---|
TOKENHUB_CREDENTIALS_FILE | ~/.tokenhub/credentials | Path to external credentials JSON file |
Providers are registered at startup via ~/.tokenhub/credentials or at runtime via the admin API, tokenhubctl, or the admin UI. At least one provider must be registered for TokenHub to route requests.
Temporal (Optional)
| Variable | Default | Description |
|---|---|---|
TOKENHUB_TEMPORAL_ENABLED | false | Enable Temporal workflow dispatch |
TOKENHUB_TEMPORAL_HOST | localhost:7233 | Temporal server host:port |
TOKENHUB_TEMPORAL_NAMESPACE | tokenhub | Temporal namespace |
TOKENHUB_TEMPORAL_TASK_QUEUE | tokenhub-tasks | Temporal task queue name |
OpenTelemetry (Optional)
| Variable | Default | Description |
|---|---|---|
TOKENHUB_OTEL_ENABLED | false | Enable OpenTelemetry tracing |
TOKENHUB_OTEL_ENDPOINT | localhost:4318 | OTLP exporter endpoint |
TOKENHUB_OTEL_SERVICE_NAME | tokenhub | Service name for traces |
External Credentials File
The ~/.tokenhub/credentials file is the primary mechanism for bootstrapping
providers and models. It is processed at startup — providers are persisted to
the database and API keys are stored in the vault (when TOKENHUB_VAULT_PASSWORD
is set). The file must have 0600 permissions.
{
"providers": [
{
"id": "openai",
"type": "openai",
"base_url": "https://api.openai.com",
"api_key": "sk-..."
},
{
"id": "vllm-local",
"type": "vllm",
"base_url": "http://localhost:8000"
}
],
"models": [
{
"id": "gpt-4o",
"provider_id": "openai",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.0025,
"output_per_1k": 0.01
}
]
}
The file is idempotent — providers and models are upserted, so it can remain
in place across restarts. api_key is optional for keyless providers (vLLM,
Ollama). All providers default to enabled: true unless explicitly set to false.
Example Configuration
Minimal
./bin/tokenhub
# Then register providers via ~/.tokenhub/credentials, admin API, or UI.
Full Production
export TOKENHUB_LISTEN_ADDR=":8080"
export TOKENHUB_LOG_LEVEL="info"
export TOKENHUB_DB_DSN="/data/tokenhub.sqlite"
export TOKENHUB_VAULT_ENABLED="true"
export TOKENHUB_PROVIDER_TIMEOUT_SECS="30"
# Security
export TOKENHUB_ADMIN_TOKEN="your-secret-admin-token"
export TOKENHUB_CORS_ORIGINS="https://app.example.com"
export TOKENHUB_RATE_LIMIT_RPS="100"
export TOKENHUB_RATE_LIMIT_BURST="200"
# Routing
export TOKENHUB_DEFAULT_MODE="normal"
export TOKENHUB_DEFAULT_MAX_BUDGET_USD="0.10"
export TOKENHUB_DEFAULT_MAX_LATENCY_MS="30000"
# Temporal (optional)
export TOKENHUB_TEMPORAL_ENABLED="true"
export TOKENHUB_TEMPORAL_HOST="temporal:7233"
# OpenTelemetry (optional)
export TOKENHUB_OTEL_ENABLED="true"
export TOKENHUB_OTEL_ENDPOINT="otel-collector:4318"
./bin/tokenhub
# Providers are loaded from ~/.tokenhub/credentials, or registered via admin API/UI.
Runtime Configuration
The following settings can be changed at runtime via the admin API or tokenhubctl without restarting:
- Routing defaults: PUT /admin/v1/routing-config or tokenhubctl routing set
- Models: POST/PATCH/DELETE /admin/v1/models or tokenhubctl model add/edit/delete
- Providers: POST/PATCH/DELETE /admin/v1/providers or tokenhubctl provider add/edit/delete
- API keys: POST/PATCH/DELETE /admin/v1/apikeys or tokenhubctl apikey create/edit/delete
- TSDB retention: PUT /admin/v1/tsdb/retention or tokenhubctl tsdb
Docker & Compose
TokenHub provides a Dockerfile for container builds and a Docker Compose file for local development with all dependencies.
Docker Image
Build
make package
# or
docker buildx build --load -t tokenhub .
The Dockerfile uses a multi-stage build:
- Build stage: golang:1.24-alpine — compiles the Go binary and builds the mdbook documentation
- Runtime stage: alpine:3.21 — lightweight runtime with curl for health checks
The final image runs as a non-root tokenhub user.
Run
docker run -d \
-p 8080:8080 \
-e TOKENHUB_ADMIN_TOKEN="your-admin-token" \
-v tokenhub_data:/data \
tokenhub
The container expects:
- Port 8080: HTTP server (binds all interfaces by default)
- Volume /data: SQLite database persistence
Docker Compose
Full Stack
docker compose up -d
This starts:
| Service | Port | Description |
|---|---|---|
tokenhub | 8080 | TokenHub server |
temporal | 7233 | Temporal server (gRPC) |
temporal-ui | 8233 | Temporal Web UI |
Services
TokenHub
tokenhub:
image: tokenhub:latest
ports:
- "8080:8080"
environment:
- TOKENHUB_LISTEN_ADDR=:8080
- TOKENHUB_DB_DSN=/data/tokenhub.sqlite
- TOKENHUB_VAULT_ENABLED=true
- TOKENHUB_VAULT_PASSWORD=${TOKENHUB_VAULT_PASSWORD}
- TOKENHUB_ADMIN_TOKEN=${TOKENHUB_ADMIN_TOKEN}
volumes:
- tokenhub_data:/data
restart: unless-stopped
Set TOKENHUB_VAULT_PASSWORD to auto-unlock the vault at startup (headless mode). If not set, unlock interactively via UI or tokenhubctl. Providers are loaded from ~/.tokenhub/credentials at startup, or registered at runtime via the admin API, tokenhubctl, or the admin UI.
Note: The TOKENHUB_DB_DSN should be a plain path (e.g., /data/tokenhub.sqlite) when using modernc.org/sqlite (the pure-Go driver). SQLite pragmas are applied programmatically, not via DSN query parameters.
Temporal
temporal:
image: temporalio/auto-setup:latest
ports:
- "7233:7233"
environment:
- DB=sqlite
volumes:
- temporal_data:/etc/temporal/data
temporal-ui:
image: temporalio/ui:latest
ports:
- "8233:8080"
environment:
- TEMPORAL_ADDRESS=temporal:7233
Environment File
Create a .env file for sensitive values:
TOKENHUB_ADMIN_TOKEN=your-secret-admin-token
Without Temporal
To run without Temporal:
docker compose up -d tokenhub
Or set TOKENHUB_TEMPORAL_ENABLED=false.
Provider Bootstrapping
Providers are loaded from ~/.tokenhub/credentials at startup. For Docker,
mount the credentials file into the container or use the host path if running
via Docker Compose with a volume mount. See Provider Management
for the file format.
Health Check
The Docker health check uses the /healthz endpoint:
curl -f http://localhost:8080/healthz
Returns 200 when adapters and models are registered, 503 otherwise.
Data Persistence
All persistent data is stored in SQLite at the path configured by TOKENHUB_DB_DSN. In Docker, mount a volume to /data:
volumes:
- tokenhub_data:/data
This persists:
- Model and provider configurations
- Vault salt and encrypted credentials
- Request logs, audit logs, reward entries
- API keys
- Routing configuration
- TSDB time-series data
Resource Requirements
TokenHub is lightweight:
- Memory: ~50MB baseline, scales with request concurrency
- CPU: Minimal (most time is spent waiting on provider APIs)
- Disk: Depends on log retention; ~1MB per 10,000 requests
Production Checklist
Use this checklist when deploying TokenHub to production.
Pre-Deployment
- Set a strong vault password (16+ characters, mixed case, numbers, symbols)
- Configure at least one provider via the credentials file, admin API, or UI
- Set appropriate routing defaults for your use case
- Create API keys for all client applications
- Configure TSDB retention appropriate for your storage budget
Security Hardening
- Set TOKENHUB_ADMIN_TOKEN: Stable bearer token for /admin/v1/* endpoints (auto-generated and written to /data/.admin-token if not set; retrieve it with tokenhubctl admin-token)
- Set TOKENHUB_CORS_ORIGINS: Restrict CORS to your domain(s) (e.g., https://app.example.com)
- Rate limiting: Review TOKENHUB_RATE_LIMIT_RPS (default: 60/s) and TOKENHUB_RATE_LIMIT_BURST (default: 120) for your traffic patterns
Network Security
- TLS termination: Place TokenHub behind a reverse proxy (nginx, Caddy, Traefik) with TLS
- Firewall rules: Only allow inbound traffic on the configured listen port
Example nginx Configuration
server {
listen 443 ssl;
server_name tokenhub.example.com;
ssl_certificate /etc/letsencrypt/live/tokenhub.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/tokenhub.example.com/privkey.pem;
# Consumer API - publicly accessible with API key auth
location /v1/ {
proxy_pass http://tokenhub:8080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Request-ID $request_id;
# SSE streaming support
proxy_buffering off;
proxy_cache off;
proxy_read_timeout 300s;
}
# Health check
location /healthz {
proxy_pass http://tokenhub:8080;
}
# Metrics (restrict to monitoring network)
location /metrics {
allow 10.0.0.0/8;
deny all;
proxy_pass http://tokenhub:8080;
}
# Admin endpoints (restrict to admin VPN)
location /admin {
allow 10.100.0.0/16;
deny all;
proxy_pass http://tokenhub:8080;
}
}
Database
- Mount a persistent volume for the SQLite database
- WAL mode and busy timeout: Applied programmatically at startup — no DSN query parameters are needed with the pure-Go modernc.org/sqlite driver
- Schedule backups: Periodically copy the SQLite file (safe with WAL mode)
Backup Script
#!/bin/bash
# Safe SQLite backup using the .backup command
sqlite3 /data/tokenhub.sqlite ".backup /backups/tokenhub-$(date +%Y%m%d-%H%M%S).sqlite"
Monitoring
-
Prometheus scraping: Configure Prometheus to scrape
/metrics - Set up alerts based on the recommended alerting rules
- Log aggregation: Forward structured JSON logs to your log management system
- Monitor TSDB size: Set appropriate retention to prevent unbounded growth
Key Metrics to Watch
| Metric | Alert Threshold | Severity |
|---|---|---|
| Error rate | > 5% over 5 min | Warning |
| P95 latency | > 10s | Warning |
| Provider down | > 2 min | Critical |
| Cost spike | > 2x weekly average | Warning |
| Vault locked | During business hours | Critical |
| Disk usage | > 80% | Warning |
Graceful Shutdown
TokenHub handles SIGINT and SIGTERM for graceful shutdown:
- Stop accepting new connections
- Drain in-flight requests (30-second timeout)
- Stop background goroutines (prober, Thompson Sampling refresh, TSDB prune)
- Stop Temporal worker (if enabled)
- Close database connection
In Kubernetes, set terminationGracePeriodSeconds: 35 to allow the full drain.
Scaling Considerations
TokenHub is a single-process application with SQLite. For higher throughput:
- Horizontal: Run multiple instances with separate SQLite databases (no shared state; each instance routes independently)
- Temporal: Enable Temporal for durable workflow execution across restarts
- Read replicas: Not applicable (SQLite is embedded)
- Connection pooling: SQLite WAL mode supports concurrent reads natively
For very high throughput (>1000 req/s), consider migrating the store to PostgreSQL (implement the Store interface for a new backend).
CLI Administration
Use tokenhubctl for scriptable administration and health checks:
# Quick status check
tokenhubctl status
# Verify providers and models
tokenhubctl provider list
tokenhubctl model list
# Watch for issues in real time
tokenhubctl events
See tokenhubctl CLI for the full command reference.
Environment Variables Summary
See Configuration Reference for the complete list of all environment variables and their defaults.
API Reference
Complete reference for all TokenHub HTTP endpoints.
Consumer Endpoints
POST /v1/chat
Send a chat completion request with automatic model routing.
Authentication: Required (Bearer token)
Request Body:
{
"request": {
"messages": [{"role": "string", "content": "string"}],
"model_hint": "string",
"estimated_input_tokens": 0,
"parameters": {},
"stream": false,
"meta": {},
"output_schema": {}
},
"capabilities": {"planning": false},
"policy": {
"mode": "normal",
"max_budget_usd": 0.05,
"max_latency_ms": 20000,
"min_weight": 0
},
"output_format": {
"type": "json",
"schema": "string",
"max_tokens": 0,
"strip_think": false
}
}
Response: 200 OK
{
"negotiated_model": "string",
"estimated_cost_usd": 0.0,
"routing_reason": "string",
"response": {}
}
Errors: 400, 401, 403, 502
POST /v1/plan
Send an orchestrated multi-model request.
Authentication: Required (Bearer token)
Request Body:
{
"request": {
"messages": [{"role": "string", "content": "string"}]
},
"orchestration": {
"mode": "adversarial",
"iterations": 2,
"primary_model_id": "string",
"review_model_id": "string",
"primary_min_weight": 0,
"review_min_weight": 0,
"return_plan_only": false,
"output_schema": "string"
}
}
Response: 200 OK
{
"negotiated_model": "string",
"estimated_cost_usd": 0.0,
"routing_reason": "string",
"response": {}
}
Errors: 400, 401, 403, 502
Health
GET /healthz
System health check.
Response: 200 OK or 503 Service Unavailable
{
"status": "ok",
"adapters": 2,
"models": 6
}
GET /metrics
Prometheus metrics endpoint.
Response: 200 OK (text/plain, Prometheus exposition format)
Admin - Vault
POST /admin/v1/vault/unlock
Body: {"admin_password": "string"}
Response: 200 OK → {"ok": true}
POST /admin/v1/vault/lock
Response: 200 OK → {"ok": true, "already_locked": false}
POST /admin/v1/vault/rotate
Body: {"old_password": "string", "new_password": "string"}
Response: 200 OK → {"ok": true}
Admin - Providers
POST /admin/v1/providers
Create or update a provider.
Body: {"id": "string", "type": "openai|anthropic|vllm", "enabled": true, "base_url": "string", "cred_store": "vault|none", "api_key": "string"}
Response: 200 OK → {"ok": true, "cred_store": "vault"}
GET /admin/v1/providers
List all providers (from the persistent store).
Query: ?limit=N&offset=N
Response: 200 OK → {"items": [{provider objects}], "total": N, "limit": N, "offset": N}
PATCH /admin/v1/providers/{id}
Partial update of a provider. Runtime-only providers (not in the store) are automatically created in the store when first patched.
Body: {"type": "string", "base_url": "string", "enabled": true, "api_key": "string", "cred_store": "string"}
Response: 200 OK → {"ok": true, "provider": {updated provider}}
DELETE /admin/v1/providers/{id}
Delete a provider.
Response: 200 OK → {"ok": true}
GET /admin/v1/providers/{id}/discover
Discover models available from a provider by querying its /v1/models endpoint.
Response: 200 OK → {"models": [{"id": "string", "registered": false}]}
Admin - Models
POST /admin/v1/models
Create or update a model. Registers the model in both the runtime engine and persistent store.
Body: {"id": "string", "provider_id": "string", "weight": 5, "max_context_tokens": 128000, "input_per_1k": 0.01, "output_per_1k": 0.03, "enabled": true}
Response: 200 OK → {"ok": true}
GET /admin/v1/models
List all models (from the persistent store).
Query: ?limit=N&offset=N
Response: 200 OK → {"items": [{model objects}], "total": N, "limit": N, "offset": N}
PATCH /admin/v1/models/{id}
Partial model update. Model IDs can contain slashes (e.g., Qwen/Qwen2.5-Coder-32B-Instruct). Runtime-only models are automatically seeded into the store from engine data on first patch.
Body: {"weight": 7, "enabled": true, "input_per_1k": 0.015, "output_per_1k": 0.035, "max_context_tokens": 128000}
Response: 200 OK → {"ok": true, "model": {updated model}}
DELETE /admin/v1/models/{id}
Delete a model. Model IDs with slashes are supported.
Response: 200 OK → {"ok": true}
Admin - Routing
GET /admin/v1/routing-config
Get current routing defaults.
Response: 200 OK → {"default_mode": "string", "default_max_budget_usd": 0.05, "default_max_latency_ms": 20000}
PUT /admin/v1/routing-config
Set routing defaults.
Body: {"default_mode": "string", "default_max_budget_usd": 0.1, "default_max_latency_ms": 30000}
Response: 200 OK → {"ok": true}
POST /admin/v1/routing/simulate
Run a what-if routing simulation without sending a real request.
Body: {"mode": "string", "token_count": 500, "max_budget_usd": 0.05, "min_weight": 0, "model_hint": "string"}
Response: 200 OK → {"decision": {decision object}, "eligible": [{model objects}]}
Admin - API Keys
POST /admin/v1/apikeys
Create a new API key.
Body: {"name": "string", "scopes": "[\"chat\",\"plan\"]", "rotation_days": 0, "expires_in": "720h", "monthly_budget_usd": 50.0}
Response: 200 OK → {"ok": true, "key": "tokenhub_...", "id": "string", "prefix": "string", "warning": "string"}
GET /admin/v1/apikeys
List all API keys (no plaintext).
Response: 200 OK → [{key objects without plaintext}]
POST /admin/v1/apikeys/{id}/rotate
Rotate an API key.
Response: 200 OK → {"ok": true, "key": "tokenhub_...", "warning": "string"}
PATCH /admin/v1/apikeys/{id}
Update API key metadata.
Body: {"name": "string", "scopes": "string", "rotation_days": 0, "enabled": true}
Response: 200 OK → {"ok": true}
DELETE /admin/v1/apikeys/{id}
Revoke (delete) an API key.
Response: 200 OK → {"ok": true}
Admin - Observability
GET /admin/v1/health
Provider health status.
Response: 200 OK → {"providers": [{health state objects}]}
GET /admin/v1/stats
Aggregated request statistics.
Response: 200 OK → {"global": {}, "by_model": {}, "by_provider": {}}
GET /admin/v1/logs?limit=100&offset=0
Paginated request logs.
GET /admin/v1/audit?limit=100&offset=0
Paginated audit logs.
GET /admin/v1/rewards?limit=100&offset=0
Paginated reward entries.
GET /admin/v1/engine/models
Runtime model registry, adapter list, and adapter metadata.
Response: 200 OK
{
"models": [{model objects}],
"total": 7,
"adapters": ["openai", "anthropic", "vllm"],
"adapter_info": [
{"id": "openai", "health_endpoint": "https://api.openai.com/v1/models"},
{"id": "vllm", "health_endpoint": "http://vllm-1:8000/health"}
]
}
Admin - TSDB
GET /admin/v1/tsdb/query?metric=latency&model_id=gpt-4&start=...&end=...&step_ms=60000
Query time-series data.
GET /admin/v1/tsdb/metrics
List available TSDB metrics.
POST /admin/v1/tsdb/prune
Manually prune old TSDB data.
PUT /admin/v1/tsdb/retention
Set TSDB retention period.
Body: {"retention_days": 7}
Admin - Workflows (Temporal)
GET /admin/v1/workflows?limit=50&status=RUNNING
List Temporal workflow executions.
GET /admin/v1/workflows/{id}
Describe a workflow execution.
GET /admin/v1/workflows/{id}/history
Get workflow event history.
Admin - Events
GET /admin/v1/events
Server-Sent Events stream.
Content-Type: text/event-stream
Events: route_success, route_error
Admin UI
GET /admin
Serves the embedded admin SPA. The root URL (/) redirects here.
GET /admin/v1/info
Admin status information. Requires admin token authentication (Bearer header or ?token= query parameter).
Response: 200 OK
{
"tokenhub": "admin",
"vault_locked": true,
"vault_initialized": false
}
The vault_initialized field indicates whether the vault has ever been set up (salt exists). The UI uses this to distinguish first-time setup from a normal unlock prompt.
Prometheus Metrics
TokenHub exports Prometheus metrics at the /metrics endpoint.
Available Metrics
tokenhub_requests_total
Type: Counter
Total number of requests processed.
Labels:
| Label | Values | Description |
|---|---|---|
mode | cheap, normal, high_confidence, planning, adversarial, thompson | Routing mode used |
model | gpt-4, claude-opus, etc. | Model that handled the request |
provider | openai, anthropic, vllm | Provider adapter |
status | ok, error | Request outcome |
Examples:
# Total successful requests
tokenhub_requests_total{status="ok"}
# Request rate by provider
rate(tokenhub_requests_total[5m])
# Error rate
sum(rate(tokenhub_requests_total{status="error"}[5m]))
/
sum(rate(tokenhub_requests_total[5m]))
tokenhub_request_latency_ms
Type: Histogram
Request latency distribution in milliseconds.
Labels:
| Label | Values | Description |
|---|---|---|
mode | cheap, normal, etc. | Routing mode |
model | gpt-4, etc. | Model ID |
provider | openai, etc. | Provider ID |
Buckets: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120 ms (exponential, base 2)
Examples:
# Median latency
histogram_quantile(0.5, rate(tokenhub_request_latency_ms_bucket[5m]))
# P95 latency
histogram_quantile(0.95, rate(tokenhub_request_latency_ms_bucket[5m]))
# P99 latency by model
histogram_quantile(0.99, sum(rate(tokenhub_request_latency_ms_bucket[5m])) by (model, le))
# Average latency
rate(tokenhub_request_latency_ms_sum[5m]) / rate(tokenhub_request_latency_ms_count[5m])
tokenhub_cost_usd_total
Type: Counter
Cumulative estimated cost in USD.
Labels:
| Label | Values | Description |
|---|---|---|
model | gpt-4, etc. | Model ID |
provider | openai, etc. | Provider ID |
Examples:
# Total cost in the last hour
increase(tokenhub_cost_usd_total[1h])
# Cost rate (USD per second)
rate(tokenhub_cost_usd_total[5m])
# Cost per hour by model
rate(tokenhub_cost_usd_total[1h]) * 3600
# Most expensive model
topk(3, sum(rate(tokenhub_cost_usd_total[1h])) by (model))
Grafana Dashboard
Suggested Panels
| Panel | Query | Visualization |
|---|---|---|
| Request Rate | sum(rate(tokenhub_requests_total[5m])) | Time series |
| Error Rate | Error rate formula above | Gauge (0-100%) |
| P95 Latency | P95 formula above | Time series |
| Cost per Hour | Cost rate * 3600 | Stat |
| Requests by Model | sum by (model) (rate(tokenhub_requests_total[5m])) | Pie chart |
| Latency Heatmap | tokenhub_request_latency_ms_bucket | Heatmap |
Scrape Configuration
# prometheus.yml
scrape_configs:
- job_name: tokenhub
scrape_interval: 15s
metrics_path: /metrics
static_configs:
- targets: ['tokenhub:8080']
For Docker Compose, use the service name as the target.
Error Classification
TokenHub classifies provider errors to enable intelligent failover. Each error from a provider is classified into one of four categories that determine the routing engine's next action.
Error Classes
context_overflow
The request exceeds the model's context window.
Triggers:
- HTTP 413 from provider
- Response body contains
context_length_exceeded
Router action: Escalate to a model with a larger context window. If no larger model is available, try the next model in scored order.
rate_limited
The provider is throttling requests.
Triggers:
- HTTP 429 from provider
Router action: Skip to a different provider. If the response includes a Retry-After header, the delay is recorded in the classified error for optional use by the caller.
transient
A temporary server-side failure.
Triggers:
- HTTP 5xx from provider
Router action: Retry the same model with exponential backoff:
- Base delay: 100ms
- Maximum retries: 2
- Backoff multiplier: 2x (100ms, 200ms)
After retries are exhausted, try the next model.
fatal
An unrecoverable client error.
Triggers:
- HTTP 4xx (except 429 and 413)
- Any other unclassified error
Router action: Skip to the next model in scored order. No retry.
Error Flow
Provider returns error
│
├── adapter.ClassifyError(err) → ClassifiedError{Class, RetryAfter}
│
└── Router handles based on class:
├── context_overflow → Find bigger model
├── rate_limited → Different provider (respect RetryAfter)
├── transient → Retry with backoff (up to 2x)
└── fatal → Next model
ClassifiedError Type
type ClassifiedError struct {
Err error
Class ErrorClass // "context_overflow", "rate_limited", "transient", "fatal"
RetryAfter float64 // Seconds to wait (from Retry-After header, 429 only)
}
HTTP Error Responses
Consumer API Errors
| Status | Meaning | When |
|---|---|---|
| 400 | Bad Request | Invalid JSON, missing messages, validation failure |
| 401 | Unauthorized | Missing or invalid API key |
| 403 | Forbidden | Valid key but insufficient scopes |
| 502 | Bad Gateway | All models failed, no eligible models, or provider errors |
Admin API Errors
| Status | Meaning | When |
|---|---|---|
| 400 | Bad Request | Invalid parameters or validation failure |
| 404 | Not Found | Resource not found (model, key, provider) |
| 500 | Internal Server Error | Database or vault errors |
Provider-Specific Classification
OpenAI
| HTTP Status | Body Pattern | Error Class |
|---|---|---|
| 429 | — | rate_limited |
| 500-599 | — | transient |
| 400 | context_length_exceeded | context_overflow |
| Other 4xx | — | fatal |
Anthropic
| HTTP Status | Body Pattern | Error Class |
|---|---|---|
| 429 | — | rate_limited |
| 500-599 | — | transient |
| 400 | context_length_exceeded | context_overflow |
| Other 4xx | — | fatal |
vLLM
| HTTP Status | Body Pattern | Error Class |
|---|---|---|
| 429 | — | rate_limited |
| 500-599 | — | transient |
| 400 | context_length_exceeded | context_overflow |
| Other 4xx | — | fatal |
Reward Impact
Error classification affects the contextual bandit reward system:
- Successful requests: Reward computed from latency and cost
- Failed requests: Reward = 0.0 (regardless of error class)
- Error class is stored in reward entries for analysis
This ensures the Thompson Sampling policy learns to avoid unreliable models over time.