Introduction
TokenHub is an intelligent LLM routing proxy that sits between your applications and multiple AI providers. It provides a unified API for chat and planning requests while automatically selecting the best model based on cost, latency, capability, and provider health.
What TokenHub Does
- Unified API: Single endpoint for OpenAI, Anthropic, and vLLM models
- Intelligent Routing: Multi-objective model selection considering cost, latency, capability weight, and provider health
- Orchestration: Multi-model reasoning with adversarial critique, voting, and iterative refinement modes
- Credential Security: AES-256-GCM encrypted vault for provider API keys with auto-lock and password rotation
- Client Key Management: Issue, rotate, and revoke API keys for your applications
- Real-Time Monitoring: Prometheus metrics, time-series database, audit logs, and a built-in admin UI
- Streaming: Server-Sent Events (SSE) streaming pass-through to all providers
- Reinforcement Learning: Thompson Sampling bandit policy for adaptive model routing
Architecture at a Glance
┌─────────────┐ ┌──────────────────────────────────────────────┐
│ Client App │────▶│ TokenHub │
│ │◀────│ │
└─────────────┘ │ ┌─────────┐ ┌────────┐ ┌──────────────┐ │
│ │ Router │──│ Health │ │ Admin API │ │
│ │ Engine │ │ Tracker │ │ + UI (SPA) │ │
│ └────┬────┘ └────────┘ └──────────────┘ │
│ │ │
│ ┌────┴──────────────────────────┐ │
│ │ Provider Adapters │ │
│ │ ┌────────┐┌─────────┐┌────┐ │ │
│ │ │ OpenAI ││Anthropic││vLLM│ │ │
│ │ └────────┘└─────────┘└────┘ │ │
│ └───────────────────────────────┘ │
│ │
│ ┌─────────┐ ┌──────┐ ┌──────┐ ┌─────────┐ │
│ │ SQLite │ │ TSDB │ │Vault │ │Temporal │ │
│ └─────────┘ └──────┘ └──────┘ └─────────┘ │
└──────────────────────────────────────────────┘
Who This Documentation Is For
- Users / Application Developers: Learn how to send requests through TokenHub and use features like streaming, directives, and output formatting. Start with the User Guide.
- Administrators: Configure providers, manage credentials, set routing policies, and monitor the system. Start with the Administrator Guide.
- Developers / Contributors: Understand the internals, extend provider support, or contribute to the project. Start with the Developer Guide.
Quick Links
| Task | Where to Go |
|---|---|
| Send your first request | Quick Start |
| Configure providers | Provider Management |
| Set up API keys | API Key Management |
| Command-line admin | tokenhubctl CLI |
| Deploy with Docker | Docker & Compose |
| Full API reference | API Reference |
| Monitor the system | Monitoring |
Quick Start
This guide gets TokenHub running and serving your first request in under five minutes.
Prerequisites
- Docker (for Docker Compose), or Go 1.24+ (for building from source)
- At least one LLM provider endpoint and API key
TokenHub works with any OpenAI-compatible API, the Anthropic API, or vLLM.
This includes services like NVIDIA NIM, Azure OpenAI, Together AI, Groq,
Fireworks, Mistral, local Ollama instances — anything that speaks the
OpenAI /v1/chat/completions protocol.
1. Start the Server
Docker Compose (recommended)
git clone https://github.com/jordanhubbard/tokenhub.git
cd tokenhub
docker compose up -d tokenhub
Build from Source
git clone https://github.com/jordanhubbard/tokenhub.git
cd tokenhub
make install # builds and installs tokenhub + tokenhubctl to ~/.local/bin
tokenhub
TokenHub starts on port 8080 by default. Docker Compose maps this to host port 8090. Adjust the examples below accordingly.
2. Register Providers
A freshly started TokenHub has no providers configured. You need to tell it where your LLM endpoints are. There are several ways to do this. Pick whichever fits your workflow.
Option A: Credentials file (recommended)
The ~/.tokenhub/credentials file is a declarative JSON file that seeds
providers and models at startup. It lives outside the source tree, requires
0600 permissions, and is processed before the service accepts requests.
API keys are automatically stored in the vault (when TOKENHUB_VAULT_PASSWORD
is set) and providers are persisted to the database on first boot. The file
is idempotent — it can stay in place across restarts.
mkdir -p ~/.tokenhub
chmod 700 ~/.tokenhub
cat > ~/.tokenhub/credentials << 'EOF'
{
"providers": [
{
"id": "ollama",
"type": "openai",
"base_url": "http://localhost:11434"
},
{
"id": "nvidia",
"type": "openai",
"base_url": "https://integrate.api.nvidia.com",
"api_key": "nvapi-..."
}
],
"models": [
{
"id": "llama3.1:8b",
"provider_id": "ollama",
"weight": 5,
"max_context_tokens": 8192,
"input_per_1k": 0.0,
"output_per_1k": 0.0
},
{
"id": "meta/llama-3.1-70b-instruct",
"provider_id": "nvidia",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.0003,
"output_per_1k": 0.0003
}
]
}
EOF
chmod 600 ~/.tokenhub/credentials
Then start the server:
make run # builds image, starts compose, tails logs
Override the default path with TOKENHUB_CREDENTIALS_FILE.
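If you want to catch mistakes before first boot, a short script can cross-check the file, for example verifying that every model's provider_id matches a declared provider. This is an illustrative sketch, not a TokenHub tool; the field names come from the example above:

```python
import json

def check_credentials(doc):
    """Return a list of problems found in a credentials document."""
    problems = []
    provider_ids = {p.get("id") for p in doc.get("providers", [])}
    for p in doc.get("providers", []):
        if not p.get("id"):
            problems.append("provider missing id")
        if not p.get("base_url"):
            problems.append(f"provider {p.get('id')!r} missing base_url")
    for m in doc.get("models", []):
        if m.get("provider_id") not in provider_ids:
            problems.append(
                f"model {m.get('id')!r} references unknown provider "
                f"{m.get('provider_id')!r}")
    return problems

doc = json.loads(
    '{"providers":[{"id":"ollama","type":"openai","base_url":"http://localhost:11434"}],'
    '"models":[{"id":"llama3.1:8b","provider_id":"ollama"}]}')
print(check_credentials(doc))  # []
```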
Option B: tokenhubctl (interactive)
With the server already running, use the CLI directly:
export TOKENHUB_URL="http://localhost:8090"
# Register a provider
tokenhubctl provider add '{
"id": "openai",
"type": "openai",
"base_url": "https://api.openai.com",
"api_key": "sk-..."
}'
# Register a model on that provider
tokenhubctl model add '{
"id": "gpt-4o",
"provider_id": "openai",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.0025,
"output_per_1k": 0.01,
"enabled": true
}'
Option C: Admin UI
Open http://localhost:8090/admin in your browser. The setup wizard walks you through adding your first provider: select the type, enter the base URL and API key, test the connection, then discover and register available models — all without touching the command line.
Option D: Admin API (curl)
# Register a provider
curl -X POST http://localhost:8090/admin/v1/providers \
-H "Content-Type: application/json" \
-d '{
"id": "anthropic",
"type": "anthropic",
"base_url": "https://api.anthropic.com",
"api_key": "sk-ant-...",
"enabled": true
}'
# Register a model
curl -X POST http://localhost:8090/admin/v1/models \
-H "Content-Type: application/json" \
-d '{
"id": "claude-sonnet-4-5-20250514",
"provider_id": "anthropic",
"weight": 8,
"max_context_tokens": 200000,
"input_per_1k": 0.003,
"output_per_1k": 0.015,
"enabled": true
}'
Providers persist across restarts. Once registered via the credentials file, the API,

tokenhubctl, or the UI, providers and models are stored in the database and restored automatically on restart. You only need to configure them once. API keys for vault-backed providers require the vault to be unlocked after restart (set TOKENHUB_VAULT_PASSWORD for automatic unlock).
3. Verify It's Running
curl http://localhost:8090/healthz
Or:
tokenhubctl status
Expected response:
{"status": "ok", "adapters": 2, "models": 2}
4. Create an API Key
TokenHub issues its own API keys to clients. Provider keys stay on the server.
tokenhubctl apikey create '{"name":"my-first-key","scopes":"[\"chat\",\"plan\"]"}'
Or via curl:
curl -X POST http://localhost:8090/admin/v1/apikeys \
-H "Content-Type: application/json" \
-d '{"name": "my-first-key", "scopes": "[\"chat\",\"plan\"]"}'
Save the returned key value — it is shown only once:
{
"ok": true,
"key": "tokenhub_a1b2c3d4...",
"id": "a1b2c3d4e5f6g7h8",
"prefix": "tokenhub_a1b2c3d4"
}
5. Send Your First Request
curl -X POST http://localhost:8090/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_a1b2c3d4..." \
-d '{
"request": {
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}
}'
TokenHub selects the best available model based on its routing policy and returns the response:
{
"negotiated_model": "gpt-4o",
"estimated_cost_usd": 0.0023,
"routing_reason": "routed-weight-8",
"response": {
"choices": [{
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
}
}]
}
}
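The envelope keeps routing metadata at the top level and nests the provider's raw payload under response. Pulling the fields out on the client side looks like this (a sketch against the sample response above):

```python
import json

envelope = json.loads("""
{
  "negotiated_model": "gpt-4o",
  "estimated_cost_usd": 0.0023,
  "routing_reason": "routed-weight-8",
  "response": {
    "choices": [{"message": {"role": "assistant",
                             "content": "The capital of France is Paris."}}]
  }
}
""")

# Routing metadata lives at the top level; the provider payload under "response".
model = envelope["negotiated_model"]
answer = envelope["response"]["choices"][0]["message"]["content"]
print(f"{model}: {answer}")  # gpt-4o: The capital of France is Paris.
```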
6. Explore
# See all registered providers and models
tokenhubctl provider list
tokenhubctl model list
# Watch routing decisions in real time
tokenhubctl events
# Open the admin dashboard
open http://localhost:8090/admin
Next Steps
- Provider Management for provider types, credential storage, and model discovery
- Chat API for request options, routing policies, and parameters
- Routing Configuration to tune model selection behavior
- tokenhubctl CLI for command-line administration
- Configuration Reference for all environment variables
User Guide Overview
This section is for application developers integrating with TokenHub. TokenHub exposes two main endpoints:
| Endpoint | Purpose |
|---|---|
POST /v1/chat | Single-turn or multi-turn chat completion |
POST /v1/plan | Multi-model orchestrated reasoning |
Both endpoints accept a unified request format and return the provider's response along with routing metadata (which model was chosen, estimated cost, and routing reason).
Key Concepts
Routing Policies
Every request can include a policy that guides model selection:
- cheap — Minimize cost (prefer smaller, cheaper models)
- normal — Balance cost, latency, capability, and reliability
- high_confidence — Prefer the most capable models regardless of cost
- planning — Optimized for planning and reasoning tasks
- thompson — Adaptive selection using reinforcement learning
If no policy is specified, the server's default routing mode applies.
Model Selection
TokenHub maintains a registry of models from all configured providers. Each model has:
- Weight (0-10): Higher weight = more capable
- Context window: Maximum tokens the model can process
- Pricing: Cost per 1,000 input and output tokens
- Health status: Based on recent success/failure rates
The routing engine scores all eligible models and selects the best match for your request.
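As an illustration only: a multi-objective score might blend these four signals, with the blend weights taken from the routing-mode profiles documented in this guide. The normalization below is an assumption for the sketch, not TokenHub's actual formula:

```python
def score(model, w_cost=0.25, w_latency=0.25, w_failure=0.25, w_capability=0.25):
    # Cheaper, faster, healthier, and more capable models all score higher.
    # The normalization constants here are illustrative assumptions.
    cost_term = 1.0 / (1.0 + 100 * model["cost_per_1k"])
    latency_term = 1.0 / (1.0 + model["p50_latency_ms"] / 1000)
    health_term = 1.0 - model["recent_failure_rate"]
    capability_term = model["weight"] / 10.0  # weight is 0-10
    return (w_cost * cost_term + w_latency * latency_term
            + w_failure * health_term + w_capability * capability_term)

models = [
    {"id": "small", "cost_per_1k": 0.0005, "p50_latency_ms": 400,
     "recent_failure_rate": 0.01, "weight": 3},
    {"id": "large", "cost_per_1k": 0.01, "p50_latency_ms": 2500,
     "recent_failure_rate": 0.02, "weight": 9},
]
# "normal" profile weights (0.25 each) vs "high_confidence" (0.05/0.1/0.15/0.7).
normal_pick = max(models, key=score)["id"]
hc_pick = max(models, key=lambda m: score(m, 0.05, 0.1, 0.15, 0.7))["id"]
print(normal_pick, hc_pick)  # small large
```

The same registry yields different winners under different profiles, which is the point of per-request policies.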
Authentication
All /v1 requests require an API key in the Authorization header:
Authorization: Bearer tokenhub_<key>
API keys are created and managed by administrators. Each key has scopes controlling which endpoints it can access (chat, plan, or both).
Provider Transparency
You interact only with TokenHub. The underlying provider (OpenAI, Anthropic, vLLM) is selected automatically and its API key is never exposed. The response includes which model and provider were used in the negotiated_model field.
Sections
- Chat API — Detailed guide to /v1/chat
- Plan API — Multi-model orchestration via /v1/plan
- Streaming — Server-Sent Events streaming
- Directives — In-band routing overrides embedded in messages
- Output Formats — JSON Schema validation, Markdown, XML output shaping
- Authentication — API key usage and scopes
Chat API
The chat endpoint provides single-turn or multi-turn completions with automatic model routing.
Endpoint: POST /v1/chat
Request Format
{
"request": {
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
"model_hint": "gpt-4",
"estimated_input_tokens": 500,
"parameters": {
"temperature": 0.7,
"max_tokens": 1024,
"top_p": 0.9
},
"stream": false,
"meta": {
"user_id": "u123",
"session": "abc"
}
},
"capabilities": {
"planning": true
},
"policy": {
"mode": "normal",
"max_budget_usd": 0.05,
"max_latency_ms": 15000,
"min_weight": 5
},
"output_format": {
"type": "json",
"schema": "{\"type\":\"object\",\"properties\":{\"answer\":{\"type\":\"string\"}}}",
"max_tokens": 500,
"strip_think": true
}
}
Request Fields
request (required)
| Field | Type | Required | Description |
|---|---|---|---|
messages | array | Yes | Array of {role, content} message objects |
model_hint | string | No | Preferred model ID; tried first before scoring |
estimated_input_tokens | int | No | Token count hint for routing decisions |
parameters | object | No | Provider parameters forwarded as-is (temperature, max_tokens, top_p, etc.) |
stream | bool | No | Enable SSE streaming response |
meta | object | No | Arbitrary metadata for logging and tracing |
output_schema | JSON | No | JSON Schema for structured output validation |
policy (optional)
Controls model selection behavior. All fields are optional and fall back to server defaults.
| Field | Type | Default | Range | Description |
|---|---|---|---|---|
mode | string | normal | See below | Routing mode |
max_budget_usd | float | 0.05 | 0-100 | Maximum cost per request |
max_latency_ms | int | 20000 | 0-300000 | Maximum acceptable latency |
min_weight | int | 0 | 0-10 | Minimum model capability weight |
Routing modes:
| Mode | Cost Weight | Latency Weight | Failure Weight | Capability Weight |
|---|---|---|---|---|
cheap | 0.7 | 0.1 | 0.1 | 0.1 |
normal | 0.25 | 0.25 | 0.25 | 0.25 |
high_confidence | 0.05 | 0.1 | 0.15 | 0.7 |
planning | 0.1 | 0.1 | 0.2 | 0.6 |
thompson | N/A | N/A | N/A | N/A |
The thompson mode uses reinforcement learning (Thompson Sampling with Beta distributions) to adaptively select models based on historical reward data.
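A minimal sketch of that mechanism: keep one Beta(successes + 1, failures + 1) distribution per model, sample from each, and route to the highest draw. TokenHub's actual reward signal and priors may differ:

```python
import random

class ThompsonRouter:
    def __init__(self, model_ids):
        # [alpha, beta] = [successes + 1, failures + 1] per model.
        self.stats = {m: [1, 1] for m in model_ids}

    def pick(self):
        # Sample each arm's Beta posterior; route to the highest draw.
        draws = {m: random.betavariate(a, b) for m, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def record(self, model_id, success):
        self.stats[model_id][0 if success else 1] += 1

random.seed(7)
router = ThompsonRouter(["gpt-4o", "llama3.1:8b"])
# Simulate feedback: gpt-4o succeeds 90% of the time, the other model 40%.
for _ in range(500):
    m = router.pick()
    router.record(m, random.random() < (0.9 if m == "gpt-4o" else 0.4))
a, b = router.stats["gpt-4o"]
print(a / (a + b))  # empirical success estimate; traffic concentrates here
```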
capabilities (optional)
| Field | Type | Description |
|---|---|---|
planning | bool | Indicates request needs planning capability |
Capabilities influence which routing mode profile is used when no explicit mode is set.
output_format (optional)
| Field | Type | Description |
|---|---|---|
type | string | Output format: json, markdown, text, xml |
schema | string | JSON Schema string for validating structured output |
max_tokens | int | Maximum output tokens to request from provider |
strip_think | bool | Remove <think>...</think> blocks from response |
Response Format
{
"negotiated_model": "gpt-4",
"estimated_cost_usd": 0.0023,
"routing_reason": "routed-weight-8",
"response": {
"id": "chatcmpl-...",
"choices": [{
"message": {
"role": "assistant",
"content": "Quantum computing uses..."
}
}],
"usage": {
"prompt_tokens": 45,
"completion_tokens": 128,
"total_tokens": 173
}
}
}
| Field | Description |
|---|---|
negotiated_model | The model ID that was selected |
estimated_cost_usd | Estimated cost based on model pricing and token counts |
routing_reason | Why this model was chosen (see Routing Reasons) |
response | Raw JSON response from the selected provider |
Routing Reasons
| Reason | Description |
|---|---|
routed-weight-N | Selected by scoring; N is the model's weight |
model-hint | Client's model hint was used |
escalated-context-overflow | Escalated to a model with a larger context window |
retried-transient | Retried after a transient provider error |
Error Responses
| Status | Body | Cause |
|---|---|---|
| 400 | "bad json" | Malformed request body |
| 400 | "messages required" | Empty messages array |
| 400 | "max_budget_usd must be between 0 and 100" | Policy validation failure |
| 401 | "missing or invalid api key" | Missing or invalid Authorization header |
| 403 | "scope not allowed" | API key lacks chat scope |
| 502 | Error message | All models failed or no eligible models |
Examples
Minimal Request
curl -X POST http://localhost:8080/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_..." \
-d '{
"request": {
"messages": [{"role": "user", "content": "Hello!"}]
}
}'
Cost-Optimized Request
curl -X POST http://localhost:8080/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_..." \
-d '{
"request": {
"messages": [{"role": "user", "content": "Summarize this text..."}]
},
"policy": {
"mode": "cheap",
"max_budget_usd": 0.001
}
}'
Request with Model Hint
curl -X POST http://localhost:8080/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_..." \
-d '{
"request": {
"messages": [{"role": "user", "content": "Write a poem about the ocean."}],
"model_hint": "claude-opus",
"parameters": {
"temperature": 0.9,
"max_tokens": 2048
}
}
}'
Structured JSON Output
curl -X POST http://localhost:8080/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_..." \
-d '{
"request": {
"messages": [{"role": "user", "content": "List 3 programming languages with their year of creation"}]
},
"output_format": {
"type": "json",
"schema": "{\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"},\"year\":{\"type\":\"integer\"}}}}"
}
}'
Plan API
The plan endpoint provides multi-model orchestrated reasoning. It coordinates multiple LLM calls using different strategies to produce higher-quality outputs than a single model call.
Endpoint: POST /v1/plan
Request Format
{
"request": {
"messages": [
{"role": "user", "content": "Design a REST API for a task management app"}
]
},
"orchestration": {
"mode": "adversarial",
"iterations": 2,
"primary_model_id": "claude-opus",
"review_model_id": "gpt-4",
"primary_min_weight": 5,
"review_min_weight": 8,
"return_plan_only": false,
"output_schema": "{\"type\":\"object\"}"
}
}
Orchestration Modes
Adversarial
A three-phase plan-critique-refine loop:
- Plan: Primary model generates an initial plan
- Critique: Review model analyzes the plan and provides feedback
- Refine: Primary model improves the plan based on the critique
The critique-refine cycle repeats for the configured number of iterations.
{
"orchestration": {
"mode": "adversarial",
"iterations": 2
}
}
Response:
{
"negotiated_model": "claude-opus",
"estimated_cost_usd": 0.15,
"routing_reason": "adversarial-orchestration",
"response": {
"initial_plan": "Here is the initial API design...",
"critique": "The design has these issues: ...",
"refined_plan": "Here is the improved design addressing the feedback..."
}
}
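With stubbed model calls, the adversarial loop can be sketched as follows; call here is a stand-in for an LLM request, not TokenHub's API:

```python
def adversarial(call, prompt, iterations=2):
    """Plan-critique-refine with a primary and a review model."""
    plan = call("primary", prompt)
    for _ in range(iterations):
        critique = call("review", f"Critique this plan:\n{plan}")
        plan = call("primary",
                    f"Improve the plan given this critique:\n{critique}\n\nPlan:\n{plan}")
    return plan

# Stubbed calls just record the sequence of phases.
trace = []
def fake_call(role, prompt):
    trace.append(role)
    return f"<{role} output {len(trace)}>"

adversarial(fake_call, "Design a REST API", iterations=2)
print(trace)  # ['primary', 'review', 'primary', 'review', 'primary']
```

Note the call count: 1 plan plus 2 per iteration, which matches the cost multipliers listed later in this guide.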
Vote
Multiple models respond independently, then a judge model selects the best:
- N models (voters) each generate a response to the same prompt
- A judge model reviews all responses and selects the best one
{
"orchestration": {
"mode": "vote"
}
}
Response:
{
"negotiated_model": "gpt-4",
"estimated_cost_usd": 0.08,
"routing_reason": "vote-orchestration",
"response": {
"responses": [
{"model": "gpt-4", "content": "Response A...", "selected": true},
{"model": "claude-sonnet", "content": "Response B...", "selected": false},
{"model": "gpt-3.5-turbo", "content": "Response C...", "selected": false}
],
"selected": 0,
"judge": "claude-opus"
}
}
Refine
A single model iteratively improves its own response:
- Model generates an initial response
- Model reviews and refines its own response (repeats for N iterations)
{
"orchestration": {
"mode": "refine",
"iterations": 3
}
}
Response:
{
"negotiated_model": "claude-opus",
"estimated_cost_usd": 0.12,
"routing_reason": "refine-orchestration",
"response": {
"refined_response": "Final refined response...",
"iterations": 3,
"model": "claude-opus"
}
}
Planning
Simple single-route with the planning weight profile (prioritizes capable models):
{
"orchestration": {
"mode": "planning"
}
}
Orchestration Fields
| Field | Type | Default | Range | Description |
|---|---|---|---|---|
mode | string | planning | See above | Orchestration strategy |
iterations | int | 1-2 | 0-10 | Number of refinement iterations |
primary_model_id | string | — | — | Explicit model for primary phase |
review_model_id | string | — | — | Explicit model for review/judge phase |
primary_min_weight | int | 0 | 0-10 | Minimum weight for primary model |
review_min_weight | int | 0 | 0-10 | Minimum weight for review model |
return_plan_only | bool | false | — | Return plan without executing refinement |
output_schema | string | — | — | JSON Schema for structured output validation |
Explicit Model Selection
By default, TokenHub selects models using its routing engine. You can override this with explicit model IDs:
{
"orchestration": {
"mode": "adversarial",
"primary_model_id": "claude-opus",
"review_model_id": "gpt-4"
}
}
Alternatively, use primary_min_weight and review_min_weight to set capability floors without specifying exact models:
{
"orchestration": {
"mode": "adversarial",
"primary_min_weight": 7,
"review_min_weight": 9
}
}
Error Responses
| Status | Body | Cause |
|---|---|---|
| 400 | "messages required" | Empty messages array |
| 400 | "iterations must be between 0 and 10" | Invalid iteration count |
| 400 | "unknown orchestration mode" | Unrecognized mode value |
| 401 | "missing or invalid api key" | Authentication failure |
| 403 | "scope not allowed" | API key lacks plan scope |
| 502 | Error message | Orchestration failed (all models failed) |
Cost Considerations
Orchestration modes make multiple LLM calls. Approximate cost multipliers:
| Mode | Calls per Request | Typical Cost Multiplier |
|---|---|---|
| Planning | 1 | 1x |
| Adversarial (2 iter) | 5 (plan + 2x(critique + refine)) | 5x |
| Vote (3 voters) | 4 (3 voters + 1 judge) | 4x |
| Refine (3 iter) | 4 (initial + 3 refinements) | 4x |
Budget accordingly when setting max_budget_usd in your policy.
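The call counts in the table follow directly from the mode definitions; as a sketch:

```python
# Calls per request for each orchestration mode, per the table above.
def calls_per_request(mode, iterations=1, voters=3):
    if mode == "planning":
        return 1
    if mode == "adversarial":
        return 1 + 2 * iterations   # plan + N x (critique + refine)
    if mode == "vote":
        return voters + 1           # voters + judge
    if mode == "refine":
        return 1 + iterations       # initial + refinements
    raise ValueError(f"unknown orchestration mode: {mode}")

print(calls_per_request("adversarial", iterations=2))  # 5
```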
Streaming
TokenHub supports Server-Sent Events (SSE) streaming for chat requests. When streaming is enabled, tokens are delivered incrementally as they are generated by the provider.
Enabling Streaming
Set stream: true in your request:
{
"request": {
"messages": [{"role": "user", "content": "Tell me a story..."}],
"stream": true
}
}
Response Format
Streaming responses use the text/event-stream content type. Each event is a line prefixed with data: :
data: {"choices":[{"delta":{"content":"Once"},"index":0}]}
data: {"choices":[{"delta":{"content":" upon"},"index":0}]}
data: {"choices":[{"delta":{"content":" a"},"index":0}]}
data: {"choices":[{"delta":{"content":" time"},"index":0}]}
data: [DONE]
The stream ends with data: [DONE].
Response Headers
Streaming responses include these headers:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
X-TokenHub-Model: gpt-4
X-TokenHub-Provider: openai
X-TokenHub-Reason: routed-weight-8
The X-TokenHub-* headers provide routing metadata that would normally be in the JSON response envelope.
Example with curl
curl -N -X POST http://localhost:8080/v1/chat \
-H "Content-Type: application/json" \
-H "Authorization: Bearer tokenhub_..." \
-d '{
"request": {
"messages": [{"role": "user", "content": "Count from 1 to 10 slowly."}],
"stream": true
}
}'
The -N flag disables output buffering so tokens appear as they arrive.
Example with Python
import requests
import json
response = requests.post(
"http://localhost:8080/v1/chat",
headers={
"Content-Type": "application/json",
"Authorization": "Bearer tokenhub_..."
},
json={
"request": {
"messages": [{"role": "user", "content": "Tell me a story."}],
"stream": True
}
},
stream=True
)
for line in response.iter_lines():
if line:
text = line.decode("utf-8")
if text.startswith("data: ") and text != "data: [DONE]":
chunk = json.loads(text[6:])
delta = chunk["choices"][0].get("delta", {})
if "content" in delta:
print(delta["content"], end="", flush=True)
Provider Compatibility
All three provider adapters support streaming:
| Provider | Streaming Protocol |
|---|---|
| OpenAI | SSE (native) |
| Anthropic | SSE (native) |
| vLLM | SSE (OpenAI-compatible) |
TokenHub passes the SSE stream through directly from the selected provider. The event format matches the provider's native format.
Failover Behavior
Streaming uses the same model selection and failover logic as non-streaming requests. If the selected model fails to establish a stream, TokenHub falls back through eligible models in scored order.
However, once streaming has begun (first bytes sent to the client), failover is not possible. If the provider disconnects mid-stream, the stream ends with an error event.
Limitations
- Streaming is only available on /v1/chat, not /v1/plan
- Output format validation (output_format.schema) is not applied to streaming responses
- Cost estimation in streaming responses may be less accurate since token counts are not known until the stream completes
- When Temporal workflows are enabled, streaming bypasses Temporal and uses direct engine dispatch
In-Band Directives
TokenHub supports embedding routing directives directly in message content. This allows clients to override routing policy without changing the request structure, which is useful when working through intermediary systems that pass messages through unchanged.
Single-Line Directive
Embed a directive anywhere in a message's content using the @@tokenhub prefix:
@@tokenhub mode=cheap budget=0.01 latency=5000 min_weight=5
Example in a full request:
{
"request": {
"messages": [
{
"role": "user",
"content": "@@tokenhub mode=cheap budget=0.005\nSummarize this document..."
}
]
}
}
Block Directive
For complex directives (especially those containing JSON schemas), use the block format:
@@tokenhub
mode=high_confidence
budget=0.10
latency=30000
min_weight=8
output_schema={"type":"object","properties":{"answer":{"type":"string"},"confidence":{"type":"number"}}}
@@end
The block starts with @@tokenhub on its own line and ends with @@end.
Supported Keys
| Key | Type | Maps To | Description |
|---|---|---|---|
mode | string | policy.mode | Routing mode (cheap, normal, high_confidence, planning, adversarial) |
budget | float | policy.max_budget_usd | Maximum cost in USD |
latency | int | policy.max_latency_ms | Maximum latency in milliseconds |
min_weight | int | policy.min_weight | Minimum model capability weight |
output_schema | JSON | request.output_schema | JSON Schema for structured output |
Processing Rules
- Scanning: TokenHub scans all messages for directives. The last directive found takes precedence.
- Stripping: Directives are removed from message content before forwarding to the provider. The LLM never sees @@tokenhub text.
- Override: Directive values override both server defaults and request-level policy fields.
- Partial override: You can set only the fields you want to override. Unspecified fields retain their values from the request policy or server defaults.
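As an illustration of the scan, strip, and last-wins rules, here is a toy parser for the single-line form. Block directives, and values containing spaces such as JSON schemas, would need the block parser; this is a sketch, not TokenHub's implementation:

```python
import re

DIRECTIVE = re.compile(r"^@@tokenhub[ \t]+(.*)$", re.MULTILINE)

def extract_directives(messages):
    """Scan messages for single-line directives, strip them from content,
    and return (cleaned_messages, merged_overrides). Later directives win."""
    overrides = {}
    cleaned = []
    for msg in messages:
        content = msg["content"]
        for match in DIRECTIVE.finditer(content):
            for pair in match.group(1).split():
                key, _, value = pair.partition("=")
                overrides[key] = value  # last directive found takes precedence
        # Strip the directive so the LLM never sees it.
        content = DIRECTIVE.sub("", content).strip()
        cleaned.append({**msg, "content": content})
    return cleaned, overrides

msgs = [{"role": "user",
         "content": "@@tokenhub mode=cheap budget=0.005\nSummarize this document..."}]
cleaned, ov = extract_directives(msgs)
print(ov)                     # {'mode': 'cheap', 'budget': '0.005'}
print(cleaned[0]["content"])  # Summarize this document...
```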
Examples
Cost-optimize a specific request
@@tokenhub mode=cheap budget=0.001
What is 2 + 2?
Force high-quality response
@@tokenhub mode=high_confidence min_weight=9
Write a detailed analysis of the economic implications of quantum computing.
Structured output via directive
@@tokenhub
output_schema={"type":"object","properties":{"name":{"type":"string"},"population":{"type":"integer"}}}
@@end
What is the most populous city in Japan?
Output Formats
TokenHub can shape provider responses into specific output formats. This is useful for applications that need structured data from LLM responses.
Configuration
Set the output_format field in your chat request:
{
"output_format": {
"type": "json",
"schema": "{\"type\":\"object\",\"properties\":{\"answer\":{\"type\":\"string\"}}}",
"max_tokens": 500,
"strip_think": true
}
}
Format Types
JSON
Validates the response against a JSON Schema. If the provider's output doesn't match the schema, TokenHub returns a validation error.
{
"output_format": {
"type": "json",
"schema": "{\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"},\"value\":{\"type\":\"number\"}}}}"
}
}
The schema is passed as a string (not a nested object) to allow maximum flexibility.
Markdown
Requests the provider to format its response as Markdown:
{
"output_format": {
"type": "markdown"
}
}
Text
Plain text output with optional truncation:
{
"output_format": {
"type": "text",
"max_tokens": 200
}
}
XML
Requests XML-formatted output:
{
"output_format": {
"type": "xml"
}
}
Output Format Fields
| Field | Type | Description |
|---|---|---|
type | string | Output format: json, markdown, text, xml |
schema | string | JSON Schema for validation (only with type: "json") |
max_tokens | int | Maximum output tokens to request from the provider |
strip_think | bool | Remove <think>...</think> reasoning blocks from the response |
Think Block Stripping
Some models (particularly those with chain-of-thought reasoning) wrap their internal reasoning in <think>...</think> tags. Setting strip_think: true removes these blocks from the final response:
Before stripping:
<think>
The user wants to know the capital of France. This is a straightforward factual question.
</think>
The capital of France is Paris.
After stripping:
The capital of France is Paris.
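The behavior can be sketched with a regex (TokenHub's actual handling, for example of unclosed tags, may differ):

```python
import re

# Non-greedy so multiple <think> blocks in one response are each removed.
THINK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text):
    return THINK.sub("", text).strip()

raw = """<think>
The user wants to know the capital of France. This is a straightforward factual question.
</think>
The capital of France is Paris."""
print(strip_think(raw))  # The capital of France is Paris.
```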
JSON Schema Validation
When type: "json" is specified with a schema, TokenHub:
- Sends the request to the provider (with a system message hint to produce JSON)
- Parses the provider's response as JSON
- Validates against the provided JSON Schema
- Returns the validated JSON in the response
If validation fails, the error is returned in the response body with a 502 status.
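For illustration, here is a toy validator covering only the type/properties/required/items subset used in the examples above; a real deployment would rely on a full JSON Schema validator:

```python
import json

def validate_subset(instance, schema):
    """Validate against a tiny subset of JSON Schema. Returns None if valid,
    else a description of the first problem found. Illustrative only."""
    t = schema.get("type")
    checks = {"object": dict, "array": list, "string": str,
              "integer": int, "number": (int, float), "boolean": bool}
    if t and not isinstance(instance, checks[t]):
        return f"expected {t}, got {type(instance).__name__}"
    if t == "object":
        for key in schema.get("required", []):
            if key not in instance:
                return f"missing required property {key!r}"
        for key, sub in schema.get("properties", {}).items():
            if key in instance:
                err = validate_subset(instance[key], sub)
                if err:
                    return f"{key}: {err}"
    if t == "array":
        for i, item in enumerate(instance):
            err = validate_subset(item, schema.get("items", {}))
            if err:
                return f"[{i}]: {err}"
    return None

schema = json.loads(
    '{"type":"object","properties":{"answer":{"type":"string"}},"required":["answer"]}')
print(validate_subset({"answer": "Paris"}, schema))  # None
print(validate_subset({"answer": 42}, schema))       # answer: expected string, got int
```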
Authentication
All requests to TokenHub's consumer API (/v1/*) require authentication via API keys.
API Key Format
TokenHub API keys follow this format:
tokenhub_<64 hex characters>
Example: tokenhub_a1b2c3d4e5f6789012345678abcdef0123456789abcdef0123456789abcdef01
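Client code can cheaply reject malformed keys before making a request; a sketch of the documented shape:

```python
import re

# The documented key shape: "tokenhub_" followed by 64 hex characters.
KEY_RE = re.compile(r"^tokenhub_[0-9a-f]{64}$")

def looks_like_key(value):
    return KEY_RE.fullmatch(value) is not None

good = "tokenhub_a1b2c3d4e5f6789012345678abcdef0123456789abcdef0123456789abcdef01"
print(looks_like_key(good))            # True
print(looks_like_key("tokenhub_xyz"))  # False
```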
Using API Keys
Include the key in the Authorization header as a Bearer token:
curl -X POST http://localhost:8080/v1/chat \
-H "Authorization: Bearer tokenhub_a1b2c3d4..." \
-H "Content-Type: application/json" \
-d '{"request": {"messages": [{"role": "user", "content": "Hello"}]}}'
Scopes
Each API key has scopes that control which endpoints it can access:
| Scope | Endpoint | Description |
|---|---|---|
chat | POST /v1/chat | Chat completion requests |
plan | POST /v1/plan | Orchestrated planning requests |
A key with scopes ["chat", "plan"] can access both endpoints. A key with only ["chat"] receives a 403 Forbidden when calling /v1/plan.
If scopes are empty ([]), the key has access to all endpoints.
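The scope rule reduces to a one-liner; a sketch:

```python
def scope_allowed(key_scopes, required):
    # An empty scope list means unrestricted access, per the rule above.
    return not key_scopes or required in key_scopes

print(scope_allowed(["chat", "plan"], "plan"))  # True
print(scope_allowed(["chat"], "plan"))          # False (403 scope not allowed)
print(scope_allowed([], "plan"))                # True (empty = all endpoints)
```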
Error Responses
| Status | Message | Cause |
|---|---|---|
| 401 | "missing or invalid api key" | No Authorization header, invalid format, wrong key, expired, or disabled |
| 403 | "scope not allowed" | Valid key but lacks the required scope |
Key Lifecycle
- Created by an administrator via the admin API or UI
- Distributed to the client application (plaintext shown only once at creation)
- Used by the client for all /v1 requests
- Rotated periodically (manually or on a configured schedule)
- Revoked when no longer needed
Keys can be configured with:
- Expiration: Automatic expiry after a set duration
- Rotation schedule: Recommended rotation period in days
- Enable/disable: Temporarily deactivate without deleting
Security Properties
- Plaintext is never stored: Only a bcrypt hash is persisted
- Shown once: The plaintext key is returned only at creation and rotation
- Provider isolation: Clients authenticate with TokenHub keys. Provider API keys are stored encrypted in the vault and never exposed.
- Validation cache: A 5-minute TTL cache reduces bcrypt overhead without compromising security
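A validation cache of this kind can be sketched as a map from key to a (result, expiry) pair; caching only successful validations avoids confusing a cached negative with a miss. Illustrative only, not TokenHub's implementation:

```python
import time

class TTLCache:
    """Remember the result of an expensive check (e.g. a bcrypt comparison)
    for a fixed TTL, so repeated requests skip the slow path."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (result, expires_at)

    def get(self, key):
        hit = self.entries.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]
        return None  # miss or expired: caller re-runs the full check

    def put(self, key, result):
        self.entries[key] = (result, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=300)
cache.put("tokenhub_abc...", True)
print(cache.get("tokenhub_abc..."))  # True (cached, no bcrypt needed)
print(cache.get("tokenhub_other"))   # None (cache miss)
```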
See API Key Management for the administrator's guide to creating and managing keys.
Administrator Guide Overview
This section covers how to configure, manage, and monitor a TokenHub deployment.
Administration Model
TokenHub uses a three-tier security model:
- Admin token (TOKENHUB_ADMIN_TOKEN): Authenticates access to the admin API (/admin/v1/*) and the admin dashboard. The UI requires the token at login; all admin API calls include it as Authorization: Bearer <token>. Retrieve it with tokenhubctl admin-token.
- Vault password: A separate secret that encrypts provider API keys at rest. Even a valid admin token cannot decrypt the vault — the vault must be explicitly unlocked after each restart (or set TOKENHUB_VAULT_PASSWORD for auto-unlock).
- API keys: Issued to client applications for /v1 endpoint access. Managed via the admin API or UI.
In production, always set TOKENHUB_ADMIN_TOKEN and restrict network access to /admin/* at the firewall, VPN, or reverse proxy level.
Administration Tools
Admin UI
The built-in web dashboard at /admin provides a graphical interface for all admin operations. See Admin UI.
tokenhubctl
A command-line tool for scripting and quick administration. Covers all admin API operations. See tokenhubctl CLI.
curl / Admin API
All operations are available via the REST API at /admin/v1/*. See API Reference.
Admin Endpoints
| Category | Endpoints | Purpose |
|---|---|---|
| Vault | /admin/v1/vault/* | Lock, unlock, rotate vault password |
| Providers | /admin/v1/providers | Register, edit, and manage LLM providers |
| Models | /admin/v1/models | Register, edit, and manage model configurations |
| Discovery | /admin/v1/providers/{id}/discover | Discover models from a provider's API |
| Simulation | /admin/v1/routing/simulate | What-if routing simulation |
| Routing | /admin/v1/routing-config | Set default routing policy |
| API Keys | /admin/v1/apikeys | Create, rotate, revoke client API keys |
| Health | /admin/v1/health | View provider health status |
| Stats | /admin/v1/stats | View aggregated request statistics |
| Logs | /admin/v1/logs | View request logs |
| Audit | /admin/v1/audit | View audit trail |
| Rewards | /admin/v1/rewards | View contextual bandit reward data |
| Engine | /admin/v1/engine/models | View runtime model registry and adapter info |
| TSDB | /admin/v1/tsdb/* | Query time-series metrics |
| Workflows | /admin/v1/workflows | View Temporal workflow executions |
| Events | /admin/v1/events | SSE stream of real-time events |
Sections
- Vault & Credentials — Encrypted credential storage
- Provider Management — Configure LLM providers
- Model Management — Configure model registry
- Routing Configuration — Tune model selection
- API Key Management — Issue and manage client keys
- Monitoring & Observability — Health, metrics, logs, and alerts
- Admin UI — Built-in dashboard
- tokenhubctl CLI — Command-line administration
Vault & Credentials
TokenHub includes an AES-256-GCM encrypted vault for storing provider API keys securely. Provider credentials are encrypted at rest and only decrypted in memory when the vault is unlocked.
Vault password vs. admin token: The vault password is not the same as your admin token. The admin token authenticates HTTP requests to the admin API. The vault password derives the encryption key used to protect stored credentials. Both are required in a production deployment: the admin token to access the API, and the vault password to decrypt provider keys.
How It Works
- An administrator sets a vault password when first configuring TokenHub
- The password is run through Argon2id key derivation (OWASP-recommended parameters) to produce an encryption key
- Provider API keys are encrypted with AES-256-GCM and stored in SQLite
- A random salt is generated per vault instance and persisted alongside the encrypted data
- After server restart, the vault must be unlocked with the same password before provider requests can be made
Vault States
| State | Description |
|---|---|
| Not initialized | First-time setup required — choose a master password |
| Locked | Credentials encrypted; provider requests will fail |
| Unlocked | Credentials decrypted in memory; requests are served normally |
Auto-Unlock (Headless)
Set TOKENHUB_VAULT_PASSWORD to unlock the vault automatically at startup.
This is required for automated/headless deployments where no operator is
present to enter the password interactively.
export TOKENHUB_VAULT_PASSWORD="your-secure-password"
On first boot this also initializes the vault, so no interactive setup is needed.
Operations
Unlock the Vault
Via the admin UI (recommended for first-time setup — the UI asks for the password twice to prevent typos), or via API/CLI:
tokenhubctl vault unlock "your-secure-password"
Or via curl:
curl -X POST http://localhost:8080/admin/v1/vault/unlock \
-H "Content-Type: application/json" \
-d '{"admin_password": "your-secure-password"}'
Response:
{"ok": true}
Lock the Vault
curl -X POST http://localhost:8080/admin/v1/vault/lock
Response:
{"ok": true, "already_locked": false}
Rotate the Vault Password
Re-encrypts all stored credentials with a new password:
curl -X POST http://localhost:8080/admin/v1/vault/rotate \
-H "Content-Type: application/json" \
-d '{
"old_password": "current-password",
"new_password": "new-secure-password"
}'
This operation is atomic — all credentials are re-encrypted in a single transaction.
Auto-Lock
The vault automatically locks after 30 minutes of inactivity. Every successful credential access resets the timer.
When the vault auto-locks:
- In-flight requests that have already retrieved credentials continue normally
- New requests will fail with a provider error until the vault is unlocked again
- An audit log entry is recorded
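The inactivity window behaves like a monotonic-clock deadline that every successful credential access pushes forward. A minimal sketch of that behavior (the class and method names are illustrative, not TokenHub internals):

```python
import time

AUTO_LOCK_SECONDS = 30 * 60  # 30-minute inactivity window

class VaultTimer:
    """Tracks vault inactivity; every credential access resets the deadline."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._deadline = self._now() + AUTO_LOCK_SECONDS

    def touch(self):
        # Called on every successful credential access.
        self._deadline = self._now() + AUTO_LOCK_SECONDS

    def should_lock(self):
        return self._now() >= self._deadline

# Simulate with a fake clock to show the reset behavior.
t = [0.0]
timer = VaultTimer(now=lambda: t[0])
t[0] = 29 * 60          # 29 minutes pass: still unlocked
assert not timer.should_lock()
timer.touch()           # credential access resets the window
t[0] = 58 * 60          # 29 minutes since the reset: still unlocked
assert not timer.should_lock()
t[0] = 59 * 60 + 1      # more than 30 minutes since the reset
assert timer.should_lock()
```

The fake clock makes the reset semantics explicit: only the time since the last access matters, not the time since unlock.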
Credential Storage
When you register a provider with cred_store: "vault", TokenHub stores the API key encrypted in the vault under the key provider:{provider_id}:api_key.
The credential lifecycle:
- Admin provides API key when creating/updating a provider
- Key is encrypted and stored in the vault
- Key is also persisted (encrypted) in the database for recovery after restart
- When the vault is unlocked, the salt and encrypted blob are loaded from the database
- Keys are decrypted only in memory
Security Parameters
| Parameter | Value |
|---|---|
| Encryption | AES-256-GCM |
| Key derivation | Argon2id |
| Argon2id time | 3 iterations |
| Argon2id memory | 64 MB |
| Argon2id threads | 4 |
| Salt | 16 bytes, random per vault |
| Auto-lock timeout | 30 minutes |
Best Practices
- Use a strong vault password: At least 16 characters with mixed case, numbers, and symbols
- Use TOKENHUB_VAULT_PASSWORD for automated deployments so the vault unlocks on restart
- Rotate regularly: Use the rotate endpoint to change the vault password periodically
- Monitor auto-lock: Set up alerts if the vault locks unexpectedly during business hours
- Backup the database: The vault salt and encrypted blob are stored in SQLite. Back up the database file to ensure credential recovery
- Network isolation: Restrict access to vault admin endpoints to trusted networks
Provider Management
Providers are the LLM services that TokenHub routes requests to. TokenHub ships with adapter support for OpenAI, Anthropic, and vLLM (OpenAI-compatible).
Registration Methods
Credentials File (recommended)
The ~/.tokenhub/credentials file is a declarative JSON file processed at
startup. Providers are persisted to the database and API keys are stored in
the vault (when unlocked via TOKENHUB_VAULT_PASSWORD). The file is
idempotent — it can remain in place across restarts.
The file must have 0600 permissions and live outside the source tree.
{
"providers": [
{
"id": "openai",
"type": "openai",
"base_url": "https://api.openai.com",
"api_key": "sk-..."
},
{
"id": "anthropic",
"type": "anthropic",
"base_url": "https://api.anthropic.com",
"api_key": "sk-ant-..."
},
{
"id": "ollama-local",
"type": "openai",
"base_url": "http://localhost:11434"
}
],
"models": [
{
"id": "gpt-4o",
"provider_id": "openai",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.0025,
"output_per_1k": 0.01
}
]
}
| Field | Type | Required | Description |
|---|---|---|---|
id | string | Yes | Unique provider identifier |
type | string | Yes | Provider type: openai, anthropic, or vllm |
base_url | string | Yes | Provider API base URL |
api_key | string | No | API key (stored in vault when available, omit for keyless providers) |
enabled | bool | No | Whether the provider is active (default: true) |
Override the default path with TOKENHUB_CREDENTIALS_FILE.
Admin API / tokenhubctl
Providers can be registered and managed dynamically via the admin API or tokenhubctl at any time after the service starts.
Admin UI
The setup wizard at /admin walks through adding providers interactively.
API Operations
Create or Update a Provider
curl -X POST http://localhost:8080/admin/v1/providers \
-H "Content-Type: application/json" \
-d '{
"id": "openai-prod",
"type": "openai",
"enabled": true,
"base_url": "https://api.openai.com",
"cred_store": "vault",
"api_key": "sk-..."
}'
Or with tokenhubctl:
tokenhubctl provider add '{"id":"openai-prod","type":"openai","base_url":"https://api.openai.com","api_key":"sk-..."}'
| Field | Type | Required | Description |
|---|---|---|---|
id | string | Yes | Unique provider identifier |
type | string | Yes | Provider type: openai, anthropic, or vllm |
enabled | bool | No | Whether the provider is active (default: true) |
base_url | string | Yes | Provider API base URL |
cred_store | string | No | Where to store credentials: vault or none |
api_key | string | No | API key (stored according to cred_store) |
List Providers
curl http://localhost:8080/admin/v1/providers
tokenhubctl provider list
The tokenhubctl provider list command merges providers from both the persistent store and the runtime engine, showing base URLs derived from adapter health endpoints and indicating whether each provider is store-persisted or runtime-only.
API keys are never returned in list responses.
Edit a Provider
Partial updates via PATCH:
curl -X PATCH http://localhost:8080/admin/v1/providers/openai \
-H "Content-Type: application/json" \
-d '{"base_url": "https://api.openai.com", "enabled": true}'
Or:
tokenhubctl provider edit openai '{"base_url":"https://api.openai.com","enabled":true}'
Patchable fields: type, base_url, enabled, api_key, cred_store.
Delete a Provider
curl -X DELETE http://localhost:8080/admin/v1/providers/openai-staging
tokenhubctl provider delete openai-staging
Discover Models
Query a provider's API to discover available models:
curl http://localhost:8080/admin/v1/providers/openai/discover
tokenhubctl provider discover openai
This calls the provider's /v1/models endpoint (using the stored API key from the vault if available) and returns the list of models with a registered flag indicating which are already configured in TokenHub.
Credential Storage Options
cred_store | Description |
|---|---|
vault | API key is encrypted and stored in the vault (default when api_key is provided) |
none | No credentials needed (e.g., local vLLM/Ollama without auth) |
When using vault, the API key is encrypted with AES-256-GCM and only available when the vault is unlocked.
Supported Provider Types
OpenAI (openai)
- API endpoint: /v1/chat/completions
- Health probe: GET /v1/models
- Streaming: SSE (native)
- Authentication: Authorization: Bearer <key>
Anthropic (anthropic)
- API endpoint: /v1/messages
- Health probe: GET /v1/messages (405 response = healthy)
- Streaming: SSE (native)
- Authentication: x-api-key: <key>, anthropic-version: 2023-06-01
vLLM (vllm)
- API endpoint: /v1/chat/completions (OpenAI-compatible)
- Health probe: GET /health
- Streaming: SSE (OpenAI-compatible)
- Authentication: None (or custom header if configured)
- Multi-endpoint: Supports multiple endpoints with round-robin load balancing
Audit Trail
All provider mutations are logged in the audit trail:
- provider.upsert — Provider created or updated
- provider.patch — Provider partially updated
- provider.delete — Provider removed
Model Management
Models are the LLM model definitions that TokenHub uses for routing decisions. Each model is associated with a provider and has properties that affect routing: capability weight, context window size, and pricing.
Default Models
TokenHub registers these default models at startup:
| Model ID | Provider | Weight | Context | Input $/1K | Output $/1K |
|---|---|---|---|---|---|
gpt-4 | openai | 8 | 128,000 | $0.010 | $0.030 |
gpt-3.5-turbo | openai | 3 | 16,385 | $0.0005 | $0.0015 |
claude-opus | anthropic | 10 | 200,000 | $0.015 | $0.075 |
claude-sonnet | anthropic | 7 | 200,000 | $0.003 | $0.015 |
Defaults are overridden if persisted models exist in the database or are registered via the credentials file.
API Operations
Create or Update a Model
curl -X POST http://localhost:8080/admin/v1/models \
-H "Content-Type: application/json" \
-d '{
"id": "gpt-4-turbo",
"provider_id": "openai",
"weight": 7,
"max_context_tokens": 128000,
"input_per_1k": 0.01,
"output_per_1k": 0.03,
"enabled": true
}'
Or with tokenhubctl:
tokenhubctl model add '{"id":"gpt-4-turbo","provider_id":"openai","weight":7,"max_context_tokens":128000,"input_per_1k":0.01,"output_per_1k":0.03,"enabled":true}'
| Field | Type | Required | Description |
|---|---|---|---|
id | string | Yes | Model identifier (must match provider's model name) |
provider_id | string | Yes | ID of the registered provider |
weight | int | Yes | Capability weight (0-10); higher = more capable |
max_context_tokens | int | Yes | Maximum context window in tokens |
input_per_1k | float | Yes | Cost per 1,000 input tokens in USD |
output_per_1k | float | Yes | Cost per 1,000 output tokens in USD |
enabled | bool | Yes | Whether the model is available for routing |
Model IDs can contain slashes (e.g., Qwen/Qwen2.5-Coder-32B-Instruct, nvidia/openai/gpt-oss-20b). The API handles them correctly.
List Models
curl http://localhost:8080/admin/v1/models
tokenhubctl model list
The tokenhubctl model list command merges models from both the persistent store and the runtime engine, so models registered via environment variables or the credentials file are also shown.
Patch a Model
Update individual fields without resending the full configuration:
curl -X PATCH http://localhost:8080/admin/v1/models/gpt-4o \
-H "Content-Type: application/json" \
-d '{
"weight": 9,
"enabled": true,
"input_per_1k": 0.012
}'
Or:
tokenhubctl model edit gpt-4o '{"weight":9}'
Patchable fields: weight, enabled, input_per_1k, output_per_1k, max_context_tokens.
Runtime-only models (those registered via env vars or credentials file but not in the store) can also be patched. The first patch creates a store record seeded from the engine's runtime data.
Enable / Disable a Model
Quick shortcuts via tokenhubctl:
tokenhubctl model enable gpt-4o
tokenhubctl model disable gpt-4o-legacy
Delete a Model
curl -X DELETE http://localhost:8080/admin/v1/models/gpt-4-legacy
tokenhubctl model delete gpt-4-legacy
Weight Guidelines
The model weight is the primary indicator of model capability used in routing decisions:
| Weight | Intended For |
|---|---|
| 1-3 | Simple tasks, low cost (e.g., GPT-3.5 Turbo) |
| 4-6 | General purpose (e.g., GPT-4 Turbo, Claude Sonnet) |
| 7-8 | Complex reasoning (e.g., GPT-4, Claude Opus) |
| 9-10 | Highest capability (e.g., next-gen frontier models) |
Different routing modes weight the capability score differently:
- cheap mode barely considers weight (0.1 factor)
- high_confidence and planning modes heavily favor higher weights (0.6-0.7 factor)
- normal mode balances weight equally with cost, latency, and reliability (0.25 each)
Context Window
The max_context_tokens field tells the router whether a model can handle a given request size. The router applies a 15% headroom buffer — a model with 128,000 tokens can handle requests estimated up to ~108,800 tokens.
Token estimation uses estimated_input_tokens from the request if provided, otherwise falls back to a characters / 4 heuristic.
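Combining the headroom buffer with the token estimation fallback, the eligibility check can be sketched as follows (function names are illustrative; the 15% and characters/4 constants come from the text above):

```python
HEADROOM = 0.15  # the router reserves 15% of the context window

def estimated_tokens(request: dict) -> int:
    """Use the client's estimate if present, else the characters / 4 heuristic."""
    if "estimated_input_tokens" in request:
        return request["estimated_input_tokens"]
    text = "".join(m.get("content", "") for m in request.get("messages", []))
    return len(text) // 4

def fits_context(max_context_tokens: int, request: dict) -> bool:
    usable = int(max_context_tokens * (1 - HEADROOM))
    return estimated_tokens(request) <= usable

# A 128,000-token model serves requests up to ~108,800 estimated tokens.
assert fits_context(128_000, {"estimated_input_tokens": 108_000})
assert not fits_context(128_000, {"estimated_input_tokens": 120_000})
# Fallback path: 4,000 characters estimate to ~1,000 tokens.
assert fits_context(128_000, {"messages": [{"content": "x" * 4000}]})
```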
Pricing
Model pricing is used for:
- Cost estimation: Returned in the response as estimated_cost_usd
- Budget filtering: Models exceeding the request's max_budget_usd are excluded
- Cost scoring: In routing modes that consider cost (especially cheap mode)
Keep pricing up to date as providers change their rates.
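As a concrete example, the per-1K rates translate into an estimated cost like this (token counts here are hypothetical; the arithmetic mirrors the pricing fields above):

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_per_1k: float, output_per_1k: float) -> float:
    """Estimated request cost from per-1,000-token rates."""
    return (input_tokens / 1000) * input_per_1k + (output_tokens / 1000) * output_per_1k

# gpt-4-style pricing: $0.01 per 1K input tokens, $0.03 per 1K output tokens.
cost = estimate_cost_usd(2_000, 500, 0.01, 0.03)
assert abs(cost - 0.035) < 1e-9  # $0.020 input + $0.015 output
```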
Audit Trail
Model mutations are logged:
- model.upsert — Model created or updated
- model.patch — Model partially updated
- model.delete — Model removed
Routing Configuration
TokenHub's routing engine uses a multi-objective scoring function to select the best model for each request. Administrators can configure the default routing behavior that applies when clients don't specify a policy.
Default Routing Settings
View Current Defaults
curl http://localhost:8080/admin/v1/routing-config
Response:
{
"default_mode": "normal",
"default_max_budget_usd": 0.05,
"default_max_latency_ms": 20000
}
Update Defaults
curl -X PUT http://localhost:8080/admin/v1/routing-config \
-H "Content-Type: application/json" \
-d '{
"default_mode": "normal",
"default_max_budget_usd": 0.10,
"default_max_latency_ms": 30000
}'
| Field | Type | Range | Description |
|---|---|---|---|
default_mode | string | See below | Default routing mode |
default_max_budget_usd | float | 0-100 | Default cost ceiling per request |
default_max_latency_ms | int | 0-300000 | Default latency ceiling |
Changes take effect immediately for new requests and are persisted to the database.
Routing Modes
Each mode applies different weights to the four scoring objectives:
| Mode | Cost | Latency | Failure Rate | Capability | Use Case |
|---|---|---|---|---|---|
cheap | 0.7 | 0.1 | 0.1 | 0.1 | Minimize costs for simple tasks |
normal | 0.25 | 0.25 | 0.25 | 0.25 | Balanced operation |
high_confidence | 0.05 | 0.1 | 0.15 | 0.7 | Complex tasks needing strong models |
planning | 0.1 | 0.1 | 0.2 | 0.6 | Multi-step reasoning tasks |
adversarial | 0.1 | 0.1 | 0.2 | 0.6 | Adversarial orchestration |
thompson | — | — | — | — | Adaptive RL-based selection |
How Scoring Works
For modes other than thompson, the scoring formula is:
score = (cost_norm × w_cost) + (latency_norm × w_latency) + (failure_norm × w_failure) - (weight × w_capability)
Where:
- cost_norm: Estimated cost normalized to 0-1 range
- latency_norm: Average latency normalized to 0-1 range
- failure_norm: Error rate from health tracker
- weight: Model capability weight (0-10)
- w_*: Mode-specific weights from the table above
Lower score = better model. Models are sorted by score and tried in order.
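The scoring formula can be written directly in code. The sketch below assumes the normalized inputs are already computed; the candidate models and their values are made up for illustration:

```python
def score(model: dict, w: dict) -> float:
    """Lower is better. Mirrors the scoring formula above."""
    return (model["cost_norm"] * w["cost"]
            + model["latency_norm"] * w["latency"]
            + model["failure_norm"] * w["failure"]
            - model["weight"] * w["capability"])

NORMAL = {"cost": 0.25, "latency": 0.25, "failure": 0.25, "capability": 0.25}
CHEAP = {"cost": 0.7, "latency": 0.1, "failure": 0.1, "capability": 0.1}

# Hypothetical candidates: an expensive capable model vs. a cheap weaker one.
strong = {"cost_norm": 0.9, "latency_norm": 0.5, "failure_norm": 0.1, "weight": 8}
budget = {"cost_norm": 0.05, "latency_norm": 0.4, "failure_norm": 0.2, "weight": 3}

# cheap mode prefers the low-cost model; normal mode prefers the capable one.
assert score(budget, CHEAP) < score(strong, CHEAP)
assert score(strong, NORMAL) < score(budget, NORMAL)
```

The same candidate pool can thus rank differently per request, purely from the mode's weight vector.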
Thompson Sampling
The thompson mode uses a contextual bandit approach:
- Each (model, token_bucket) pair maintains Beta distribution parameters (alpha, beta)
- For each request, a reward value is sampled from each model's Beta distribution
- Models are sorted by sampled reward (highest first)
- Parameters are updated periodically from historical reward data
This approach automatically adapts to changing model performance over time.
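The sampling step can be sketched with the standard library's Beta sampler (the arm parameters below are invented; real parameters come from TokenHub's historical reward data):

```python
import random

def thompson_rank(arms: dict, rng=random) -> list:
    """arms: {model_id: (alpha, beta)} Beta parameters per (model, token_bucket).
    Sample a reward from each Beta posterior and rank models, best first."""
    sampled = {m: rng.betavariate(a, b) for m, (a, b) in arms.items()}
    return sorted(sampled, key=sampled.get, reverse=True)

random.seed(7)
arms = {
    "gpt-4": (80, 20),          # ~80% historical reward
    "gpt-3.5-turbo": (40, 60),  # ~40% historical reward
}
# Over many draws the higher-reward arm usually ranks first, but the
# posterior spread means the weaker arm still gets explored occasionally.
firsts = [thompson_rank(arms)[0] for _ in range(1000)]
assert firsts.count("gpt-4") > 900
```

This is why the policy adapts: if a model's observed rewards drift, its Beta parameters shift and its sampled rank follows.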
Model Eligibility Filtering
Before scoring, the router filters models:
- Enabled: Model must be enabled
- Minimum weight: Must meet the request's min_weight threshold
- Context capacity: Must have enough context window (with 15% headroom)
- Provider health: Provider must not be in the "down" state
- Budget: Estimated cost must be within max_budget_usd
If no models pass filtering, the request fails with a 502 error.
Escalation and Failover
When a provider call fails, the router uses the error classification to decide what to do:
| Error Class | Action |
|---|---|
context_overflow | Find a model with a larger context window |
rate_limited | Skip to the next provider; honor Retry-After header |
transient (5xx) | Retry with exponential backoff (100ms base, 2 retries) |
fatal (4xx) | Try the next model in scored order |
The router tries up to 5 models before giving up.
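The transient-retry branch (100ms base, 2 retries, exponential backoff) can be sketched as follows; the exception class and function names are illustrative stand-ins for the router's error classification:

```python
import time

BASE_DELAY = 0.1   # 100ms base
MAX_RETRIES = 2    # retries after the initial attempt

class TransientError(Exception):
    """Stand-in for an error classified as transient (5xx)."""

def call_with_backoff(call, sleep=time.sleep):
    """Retry transient failures with exponential backoff: 100ms, then 200ms."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return call()
        except TransientError:
            if attempt == MAX_RETRIES:
                raise  # out of retries: surface the error to the router
            sleep(BASE_DELAY * (2 ** attempt))

# Demonstrate: fail twice, then succeed, recording the sleep delays.
delays, attempts = [], {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("502")
    return "ok"

assert call_with_backoff(flaky, sleep=delays.append) == "ok"
assert delays == [0.1, 0.2]
```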
Runtime Model Registry
View the current in-memory model registry and registered adapters:
curl http://localhost:8080/admin/v1/engine/models
Response:
{
"models": [
{
"id": "gpt-4",
"provider_id": "openai",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.01,
"output_per_1k": 0.03,
"enabled": true
}
],
"adapters": ["openai", "anthropic", "vllm"]
}
Audit Trail
Routing configuration changes are logged as routing-config.update in the audit trail.
API Key Management
TokenHub issues its own API keys to client applications. Provider API keys are escrowed in the vault — clients never see them. This provides a clean separation between client authentication and provider credentials.
Key Properties
| Property | Description |
|---|---|
| ID | 16-character hex identifier |
| Prefix | First 8 characters of the key for identification |
| Name | Human-readable label |
| Scopes | JSON array of allowed endpoints (chat, plan) |
| Rotation days | Recommended rotation period (0 = manual only) |
| Expiration | Optional automatic expiry |
| Enabled | Active/inactive toggle |
Operations
Create a Key
curl -X POST http://localhost:8080/admin/v1/apikeys \
-H "Content-Type: application/json" \
-d '{
"name": "production-backend",
"scopes": "[\"chat\",\"plan\"]",
"rotation_days": 90,
"expires_in": "2160h"
}'
| Field | Type | Required | Description |
|---|---|---|---|
name | string | Yes | Human-readable name for the key |
scopes | string | No | JSON array of scopes (default: ["chat","plan"]) |
rotation_days | int | No | Recommended rotation period in days (default: 0) |
expires_in | string | No | Go duration for expiry (e.g., 720h for 30 days) |
Response:
{
"ok": true,
"key": "tokenhub_a1b2c3d4e5f6789012345678abcdef0123456789abcdef0123456789abcdef01",
"id": "a1b2c3d4e5f6g7h8",
"prefix": "tokenhub_a1b2c3d4",
"warning": "Store this key securely. It will not be shown again."
}
Important: The plaintext key is returned only at creation time. Store it securely before closing the response.
List Keys
curl http://localhost:8080/admin/v1/apikeys
Response:
[
{
"id": "a1b2c3d4e5f6g7h8",
"key_prefix": "tokenhub_a1b2c3d4",
"name": "production-backend",
"scopes": "[\"chat\",\"plan\"]",
"created_at": "2026-02-16T10:00:00Z",
"last_used_at": "2026-02-16T12:34:56Z",
"expires_at": "2026-05-16T10:00:00Z",
"rotation_days": 90,
"enabled": true
}
]
Plaintext keys are never shown in list responses.
Rotate a Key
Generate a new key value while keeping the same ID and configuration:
curl -X POST http://localhost:8080/admin/v1/apikeys/a1b2c3d4e5f6g7h8/rotate
Response:
{
"ok": true,
"key": "tokenhub_<new-64-hex-chars>",
"warning": "Store this key securely. It will not be shown again."
}
The old key immediately becomes invalid. Distribute the new key to all clients before rotating.
Update a Key
Modify key metadata without changing the key value:
curl -X PATCH http://localhost:8080/admin/v1/apikeys/a1b2c3d4e5f6g7h8 \
-H "Content-Type: application/json" \
-d '{
"name": "production-backend-v2",
"scopes": "[\"chat\"]",
"enabled": true,
"rotation_days": 60
}'
All fields are optional — only specified fields are updated.
Revoke (Delete) a Key
curl -X DELETE http://localhost:8080/admin/v1/apikeys/a1b2c3d4e5f6g7h8
This permanently removes the key. It cannot be recovered.
Security Details
Storage
- Keys are hashed with bcrypt (cost factor 10) before storage
- To reduce bcrypt overhead per-request, validated keys are cached for 5 minutes
- The SHA-256 digest of the plaintext is bcrypt-hashed (allowing keys longer than bcrypt's 72-byte limit)
Validation Flow
- Extract Bearer tokenhub_... from the Authorization header
- Extract the key prefix (first 8 chars after tokenhub_)
- Check the validation cache (5-minute TTL)
- If not cached: load record by prefix, bcrypt-verify, check enabled + expiry
- Update last_used_at timestamp
- Verify the key's scopes include the requested endpoint
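The prefix-lookup-plus-cache flow can be sketched as below. For the sketch, the expensive bcrypt verification is replaced by a plain SHA-256 comparison (in TokenHub it is the SHA-256 digest that gets bcrypt-hashed); class names and record layout are illustrative:

```python
import hashlib
import time

CACHE_TTL = 300  # 5-minute validation cache

class KeyValidator:
    """Illustrative sketch of prefix lookup + TTL cache, not TokenHub source."""

    def __init__(self, records: dict, now=time.monotonic):
        self._records = records  # prefix -> stored SHA-256 digest (stand-in for bcrypt hash)
        self._cache = {}         # plaintext key -> cache expiry time
        self._now = now

    def validate(self, key: str) -> bool:
        if self._cache.get(key, 0) > self._now():
            return True                      # cache hit: skip the slow verify
        prefix = key[:len("tokenhub_") + 8]  # "tokenhub_" + first 8 chars
        stored = self._records.get(prefix)
        digest = hashlib.sha256(key.encode()).hexdigest()
        if stored is not None and stored == digest:
            self._cache[key] = self._now() + CACHE_TTL
            return True
        return False

key = "tokenhub_a1b2c3d4" + "f" * 56
records = {key[:17]: hashlib.sha256(key.encode()).hexdigest()}
v = KeyValidator(records)
assert v.validate(key)                              # slow path, populates cache
assert v.validate(key)                              # fast path via the TTL cache
assert not v.validate("tokenhub_deadbeef" + "0" * 56)  # unknown prefix
```

Hashing the SHA-256 digest rather than the raw key also sidesteps bcrypt's 72-byte input limit, as noted under Storage.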
Scopes
| Scope | Protects |
|---|---|
chat | POST /v1/chat |
plan | POST /v1/plan |
An empty scopes array [] grants access to all endpoints.
Audit Trail
All key management operations are logged:
- apikey.create — New key created
- apikey.rotate — Key rotated (new value generated)
- apikey.update — Key metadata changed
- apikey.revoke — Key deleted
Best Practices
- Name keys descriptively: Use names like staging-backend, prod-api-v2, data-pipeline
- Use minimal scopes: If a client only needs chat, don't grant plan access
- Set rotation schedules: Configure rotation_days as a reminder to rotate
- Set expiration for temporary keys: Use expires_in for keys issued to contractors or experiments
- Monitor last_used_at: Keys not used for extended periods may be candidates for revocation
- Rotate after incidents: If a key may have been compromised, rotate immediately
Monitoring & Observability
TokenHub provides multiple layers of observability: health tracking, Prometheus metrics, time-series data, request logs, audit logs, reward logs, and real-time SSE events.
Health Endpoint
curl http://localhost:8080/healthz
| Status | Meaning |
|---|---|
| 200 | System is healthy, adapters and models are registered |
| 503 | No adapters or no models are registered |
Response:
{"status": "ok", "adapters": 2, "models": 6}
Provider Health
View per-provider health status:
curl http://localhost:8080/admin/v1/health
Response:
{
"providers": [
{
"provider_id": "openai",
"state": "healthy",
"total_requests": 1234,
"total_errors": 5,
"consec_errors": 0,
"avg_latency_ms": 456.7,
"last_error": "",
"last_success_at": "2026-02-16T12:34:56Z",
"cooldown_until": "0001-01-01T00:00:00Z"
}
]
}
Health States
| State | Consecutive Errors | Behavior |
|---|---|---|
| Healthy | 0-1 | Normal routing |
| Degraded | 2-4 | Still routed but penalized in scoring |
| Down | 5+ | Excluded from routing; 30-second cooldown |
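The state transitions in the table above reduce to a simple threshold function on the consecutive-error counter:

```python
COOLDOWN_SECONDS = 30  # applied when a provider enters the "down" state

def health_state(consec_errors: int) -> str:
    """Map consecutive errors to a routing state, per the table above."""
    if consec_errors >= 5:
        return "down"      # excluded from routing, 30-second cooldown
    if consec_errors >= 2:
        return "degraded"  # still routed, but penalized in scoring
    return "healthy"

assert health_state(0) == "healthy"
assert health_state(1) == "healthy"
assert health_state(3) == "degraded"
assert health_state(5) == "down"
```

A single success resets the counter, so a provider recovers to "healthy" as soon as requests start succeeding again.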
Active Health Probing
TokenHub actively probes provider health endpoints in the background:
| Provider | Health Endpoint | Success Criteria |
|---|---|---|
| OpenAI | GET /v1/models | 2xx response |
| Anthropic | GET /v1/messages | 2xx or 405 response |
| vLLM | GET /health | 2xx response |
Probes run every 30 seconds with a 10-second timeout.
Prometheus Metrics
Expose metrics at:
curl http://localhost:8080/metrics
Available Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
tokenhub_requests_total | counter | mode, model, provider, status | Total requests processed |
tokenhub_request_latency_ms | histogram | mode, model, provider | Request latency distribution |
tokenhub_cost_usd_total | counter | model, provider | Cumulative estimated cost |
Prometheus Configuration
# prometheus.yml
scrape_configs:
- job_name: tokenhub
scrape_interval: 15s
static_configs:
- targets: ['tokenhub:8080']
Example Queries
# Request rate by model
rate(tokenhub_requests_total[5m])
# P95 latency
histogram_quantile(0.95, rate(tokenhub_request_latency_ms_bucket[5m]))
# Cost per hour by provider
rate(tokenhub_cost_usd_total[1h]) * 3600
# Error rate
sum(rate(tokenhub_requests_total{status="error"}[5m])) /
sum(rate(tokenhub_requests_total[5m]))
Time-Series Database (TSDB)
TokenHub includes a lightweight SQLite-backed TSDB for historical metrics with querying and downsampling.
Query Metrics
curl "http://localhost:8080/admin/v1/tsdb/query?metric=latency&model_id=gpt-4&start=2026-02-16T00:00:00Z&end=2026-02-16T23:59:59Z&step_ms=60000"
| Parameter | Required | Description |
|---|---|---|
metric | Yes | Metric name (latency or cost) |
model_id | No | Filter by model |
provider_id | No | Filter by provider |
start | No | Start time (RFC3339) |
end | No | End time (RFC3339) |
step_ms | No | Downsample bucket in milliseconds |
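The step_ms parameter implies fixed-width bucketing of raw samples. A sketch of what that downsampling looks like (averaging per bucket is an assumption for illustration, not a statement about TokenHub's aggregation function):

```python
from collections import defaultdict

def downsample(points, step_ms: int) -> dict:
    """Average (timestamp_ms, value) samples into fixed step_ms buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each sample to the start of its bucket.
        buckets[(ts // step_ms) * step_ms].append(value)
    return {ts: sum(vs) / len(vs) for ts, vs in sorted(buckets.items())}

# Two samples in the first minute, one in the second (timestamps in ms).
points = [(0, 100.0), (30_000, 140.0), (61_000, 200.0)]
result = downsample(points, step_ms=60_000)
assert result == {0: 120.0, 60_000: 200.0}
```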
List Available Metrics
curl http://localhost:8080/admin/v1/tsdb/metrics
Configure Retention
curl -X PUT http://localhost:8080/admin/v1/tsdb/retention \
-H "Content-Type: application/json" \
-d '{"retention_days": 14}'
Default retention is 7 days. Old data is automatically pruned hourly.
Manual Prune
curl -X POST http://localhost:8080/admin/v1/tsdb/prune
Request Logs
View paginated request history:
curl "http://localhost:8080/admin/v1/logs?limit=50&offset=0"
Each entry contains:
- Timestamp, request ID
- Model ID, provider ID, routing mode
- Estimated cost, latency
- HTTP status code, error class (if failed)
Audit Logs
View admin action history:
curl "http://localhost:8080/admin/v1/audit?limit=50&offset=0"
Logged actions:
- vault.lock, vault.unlock, vault.rotate
- provider.upsert, provider.delete
- model.upsert, model.patch, model.delete
- apikey.create, apikey.rotate, apikey.update, apikey.revoke
- routing-config.update
Reward Logs
View contextual bandit reward data for RL-based routing analysis:
curl "http://localhost:8080/admin/v1/rewards?limit=50&offset=0"
Each entry contains: request ID, mode, model, provider, token count, token bucket (small/medium/large), latency budget, actual latency, cost, success flag, error class, and computed reward.
Aggregated Statistics
curl http://localhost:8080/admin/v1/stats
Returns global aggregates plus breakdowns by model and by provider.
Server-Sent Events (SSE)
Subscribe to real-time events:
curl -N http://localhost:8080/admin/v1/events
Event types:
| Event | Fields | When |
|---|---|---|
route_success | model_id, provider_id, latency_ms, cost_usd, reason | Request completed successfully |
route_error | latency_ms, error_class, error_msg | Request failed |
Example:
data: {"type":"route_success","model_id":"gpt-4","provider_id":"openai","latency_ms":456.7,"cost_usd":0.023,"reason":"routed-weight-8"}
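A consumer of this stream only needs to pull out the data: lines and decode the JSON payloads. A minimal parsing sketch (a production client should also handle multi-line data fields, comments, and reconnection):

```python
import json

def parse_sse(lines):
    """Yield decoded event payloads from data: lines of an SSE stream."""
    for line in lines:
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Sample stream matching the event shapes documented above.
stream = [
    'data: {"type":"route_success","model_id":"gpt-4","latency_ms":456.7}',
    '',  # blank line terminates an SSE event
    'data: {"type":"route_error","error_class":"rate_limited"}',
]
events = list(parse_sse(stream))
assert events[0]["model_id"] == "gpt-4"
assert events[1]["type"] == "route_error"
```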
Recommended Alerting Rules
| Alert | Condition | Severity |
|---|---|---|
| High error rate | Error rate > 5% over 5 minutes | Warning |
| Provider down | Provider in "down" state > 2 minutes | Critical |
| High latency | P95 latency > 10 seconds | Warning |
| Cost spike | Hourly cost > 2x 7-day average | Warning |
| Vault locked | Vault locked during business hours | Critical |
| No providers | Adapter count = 0 | Critical |
Admin UI
TokenHub includes a built-in single-page admin dashboard accessible at /admin. The UI is embedded in the binary — no separate frontend build or deployment is needed.
Accessing the UI
Navigate to:
http://localhost:8080/admin
The root URL (http://localhost:8080/) automatically redirects to /admin/.
Authentication
When TOKENHUB_ADMIN_TOKEN is set, the dashboard displays a full-screen Admin Authentication modal on first visit. Paste your admin token and press Authenticate (or Enter). The token is verified against the API before the dashboard loads; an invalid token shows an inline error.
Once authenticated, the token is stored in sessionStorage (cleared when the browser tab closes). A Sign Out button in the header clears the session and re-opens the authentication modal.
To retrieve the admin token:
tokenhubctl admin-token
Cache Busting
The admin HTML is served with Cache-Control: no-cache, must-revalidate and an ETag derived from the content hash. Static assets under /_assets/ are served with immutable cache headers and versioned URLs (?v=<hash>), ensuring browsers always get fresh assets after a rebuild without manual cache clearing.
Dashboard Panels
Vault Controls
The vault panel adapts to three states:
- First-Time Setup: When the vault has never been initialized, the UI displays a prompt to choose a master password (minimum 8 characters) with a confirmation field. Press Enter in the confirmation field or click Initialize Vault to complete setup.
- Locked: When the vault has been initialized but is locked, the UI shows a password input. Press Enter or click Unlock to unlock.
- Unlocked: Shows the unlocked status with a Lock button.
Note: The vault password encrypts your stored provider API keys. It is distinct from your admin token, which authenticates access to the admin API. You need both: the admin token to access the dashboard, and the vault password to decrypt stored credentials.
Provider Management
Full CRUD interface for providers:
- Setup Wizard: Multi-step guided onboarding for new providers — select type (OpenAI/Anthropic/vLLM), enter base URL and API key, test the connection, then discover and register available models.
- Provider Table: Shows all providers from both the persistent store and runtime engine (env vars, credentials file). Runtime-only providers are indicated with a badge. Base URLs are derived from adapter health endpoints when not stored.
- Edit Modal: Click "Edit" on any provider to change type, base URL, API key, or enabled state.
- Discover: Query a provider's API to find available models and register them.
- Delete: Remove a provider from the store.
Model Management
Full CRUD interface for models:
- Add Model Form: Create a new model with provider, weight, context window, and pricing.
- Model Table: Shows all models from both the store and engine, with their provider, weight, context, pricing, and enabled state.
- Edit Modal: Click "Edit" on any model to change weight, max context tokens, pricing, or enabled state.
- Weight Slider: Quick inline weight adjustment (0-10).
- Enable/Disable Toggle: Click the status icon to toggle a model.
- Delete: Remove a model from the store and engine.
Model Selection Graph
An interactive directed acyclic graph (DAG) showing the relationship between providers and models. Built with Cytoscape.js, it is populated on page load with all known providers and models and updates in real time as routing events arrive.
- Provider nodes (colored by health state)
- Model nodes (sized by weight)
- Edges colored by latency: green (<1s), yellow (1-3s), red (>3s)
- Edge thickness based on request volume
- Node size and border based on throughput and latency
Cost and Latency Charts
Multi-series D3.js line charts showing cost and latency trends over time:
- Per-model breakdown
- Configurable time window
- Hover for exact values
What-If Simulator
Test routing decisions without sending a live request:
- Select routing mode, token count, max budget, min weight, and model hint
- See the winning model, eligible candidates, and the routing reason
- Useful for understanding how parameter changes affect model selection
SSE Decision Feed
Live event stream showing every routing decision in real time:
- Model, provider, latency, cost, and reason for each event
- Error events with error classification
- Auto-scrolling event list
Routing Configuration
Set server-wide routing defaults:
- Default mode selector (cheap, normal, high_confidence, planning, adversarial)
- Budget input (USD)
- Latency input (milliseconds)
- Save button with validation
Provider Health
Real-time provider health display:
- State badges: Healthy (green), Degraded (yellow), Down (red)
- Consecutive error count
- Last success timestamp
- Average latency
API Keys
Key management interface:
- Create new keys (name, scopes, rotation, expiry)
- One-time key display modal with copy button
- Rotate keys (with one-time new key display)
- Enable/disable toggle
- Revoke (delete) keys
- Table showing: name, prefix, scopes, created, last used, expires, rotation days, status
Request Log
Paginated request history:
- Model, provider, mode columns
- Latency, cost, status code
- Error class (for failed requests)
- Pagination controls
Audit Log
Paginated audit trail viewer:
- Action type filter
- Timestamp, action, resource ID
- Request ID for correlation
Model Leaderboard
A ranked table of models by performance:
- Success rate
- Average latency
- Total cost
- Request count
Rewards
Contextual bandit reward data for Thompson Sampling analysis.
Workflows (Temporal)
When Temporal is enabled, shows workflow execution history:
- Workflow ID, type, status
- Start time, duration
- Status badges: Running (blue), Completed (green), Failed (red)
- Click to expand activity history
Static Assets
Static assets (Cytoscape.js, D3.js) are served from /_assets/ to avoid conflicts with the /admin/v1 API prefix. All assets are embedded in the binary via Go's embed package and served with immutable cache headers.
Customization
The admin UI is a single index.html file located at web/index.html in the source tree. To customize:
- Edit web/index.html
- Rebuild the binary (make build) or Docker image (make package)
- The updated UI is embedded automatically with fresh cache-busting hashes
tokenhubctl CLI
tokenhubctl is the command-line interface for managing TokenHub. It wraps every admin API endpoint into a convenient, scriptable tool.
Installation
make install # Builds natively and installs to ~/.local/bin
Or build inside the Docker builder container:
make build # Produces bin/tokenhub and bin/tokenhubctl
Configuration
| Variable | Default | Description |
|---|---|---|
| TOKENHUB_URL | http://localhost:8080 | TokenHub server URL |
| TOKENHUB_ADMIN_TOKEN | — | Bearer token for admin endpoints (see admin-token command) |
export TOKENHUB_URL="http://tokenhub.internal:8080"
export TOKENHUB_ADMIN_TOKEN="$(tokenhubctl admin-token)"
Command Reference
General
tokenhubctl admin-token # Print the admin token (env, file, or Docker)
tokenhubctl status # Server info, health, vault state
tokenhubctl health # Provider health table
tokenhubctl version # CLI version
tokenhubctl help # Full usage
Admin Token
The admin-token command retrieves the admin token by checking, in order:
- TOKENHUB_ADMIN_TOKEN environment variable
- ~/.tokenhub/.admin-token file (native deployments)
- docker exec into the running container to read /data/.admin-token
This avoids the need to parse server logs. The token file is written automatically by the server at startup (whether auto-generated or set via env).
Rotating the Admin Token
tokenhubctl rotate-admin-token # Generate a new random token
tokenhubctl rotate-admin-token <token> # Replace with a specific token
After rotation, update your local environment:
make _write-env # Sync token from container to ~/.tokenhub/env
The new token takes effect immediately (no restart required) and is persisted to the data directory so it survives restarts. The old token is invalidated instantly.
Vault
tokenhubctl vault unlock <password>
tokenhubctl vault lock
tokenhubctl vault rotate <old-password> <new-password>
Providers
tokenhubctl provider list
tokenhubctl provider add '<json>'
tokenhubctl provider edit <id> '<json>'
tokenhubctl provider delete <id>
tokenhubctl provider discover <id>
The list command merges providers from both the persistent store and the runtime engine, showing the source of each.
The discover command queries a provider's /v1/models endpoint to list available models and whether each is already registered in TokenHub.
Example:
# Add a new provider
tokenhubctl provider add '{
"id": "openai",
"type": "openai",
"base_url": "https://api.openai.com",
"api_key": "sk-..."
}'
# Update its base URL
tokenhubctl provider edit openai '{"base_url":"https://api.openai.com"}'
# Discover available models
tokenhubctl provider discover openai
Models
tokenhubctl model list
tokenhubctl model add '<json>'
tokenhubctl model edit <id> '<json>'
tokenhubctl model delete <id>
tokenhubctl model enable <id>
tokenhubctl model disable <id>
Model IDs can contain slashes (e.g., Qwen/Qwen2.5-Coder-32B-Instruct). The CLI handles them correctly.
Example:
# Add a model
tokenhubctl model add '{
"id": "gpt-4o",
"provider_id": "openai",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.0025,
"output_per_1k": 0.01,
"enabled": true
}'
# Adjust its weight
tokenhubctl model edit gpt-4o '{"weight": 9}'
# Temporarily disable it
tokenhubctl model disable gpt-4o
Routing
tokenhubctl routing get
tokenhubctl routing set '<json>'
Example:
tokenhubctl routing set '{"default_mode":"cheap","default_max_budget_usd":0.02,"default_max_latency_ms":10000}'
API Keys
tokenhubctl apikey list
tokenhubctl apikey create '<json>'
tokenhubctl apikey rotate <id>
tokenhubctl apikey edit <id> '<json>'
tokenhubctl apikey delete <id>
The create command prints the API key exactly once. Save it immediately.
Example:
tokenhubctl apikey create '{"name":"prod-app","scopes":"[\"chat\",\"plan\"]","monthly_budget_usd":50.0}'
Observability
tokenhubctl logs [--limit N] # Request logs
tokenhubctl audit [--limit N] # Audit trail
tokenhubctl rewards [--limit N] # Thompson Sampling reward data
tokenhubctl stats # Aggregated statistics
tokenhubctl engine models # Runtime model registry and adapter info
tokenhubctl events # Live SSE event stream (Ctrl-C to stop)
Routing Simulation
Run a what-if simulation without sending a real request:
tokenhubctl simulate '{"mode":"cheap","token_count":500}'
tokenhubctl simulate '{"mode":"high_confidence","token_count":2000,"max_budget_usd":0.10}'
TSDB
tokenhubctl tsdb metrics
tokenhubctl tsdb query 'metric=latency&model_id=gpt-4o&step_ms=60000'
tokenhubctl tsdb prune
Output Format
Most commands produce human-readable tabular output. For programmatic use, pipe JSON responses directly from curl or parse tokenhubctl output with standard text tools.
Architecture
TokenHub is a Go application structured as a layered system with clear package boundaries and dependency injection.
Package Layout
tokenhub/
├── cmd/tokenhub/ # Entry point, signal handling, HTTP server lifecycle
├── internal/
│ ├── app/ # Server construction, config loading, wiring
│ ├── apikey/ # API key manager + auth middleware
│ ├── events/ # In-memory event bus (pub/sub for SSE)
│ ├── health/ # Provider health tracker + active prober
│ ├── httpapi/ # HTTP handlers and route mounting
│ ├── logging/ # Structured logging setup (slog)
│ ├── metrics/ # Prometheus metric registry
│ ├── providers/ # Provider adapter contract + context helpers
│ │ ├── openai/ # OpenAI adapter
│ │ ├── anthropic/ # Anthropic adapter
│ │ └── vllm/ # vLLM adapter
│ ├── router/ # Routing engine, scoring, orchestration, Thompson Sampling
│ ├── stats/ # In-memory statistics collector
│ ├── store/ # Persistence layer (SQLite)
│ ├── temporal/ # Temporal workflow integration
│ ├── tsdb/ # Time-series database (SQLite-backed)
│ └── vault/ # AES-256-GCM encrypted credential vault
├── web/ # Embedded admin UI (index.html)
└── docs/ # This documentation
Dependency Flow
cmd/tokenhub/main.go
└── internal/app.NewServer(cfg)
├── vault.New()
├── router.NewEngine()
├── store.NewSQLite()
├── health.NewTracker()
├── health.NewProber() → health.Tracker
├── loadCredentialsFile() → router.Engine
├── loadPersistedProviders() → router.Engine
├── router.NewThompsonSampler()
├── apikey.NewManager() → store.Store
├── metrics.New()
├── events.NewBus()
├── stats.NewCollector()
├── tsdb.New()
├── temporal.New() → (optional)
└── httpapi.MountRoutes() → Dependencies{...}
All dependencies flow downward. HTTP handlers receive a Dependencies struct containing all services they need.
Key Interfaces
router.Sender
The provider adapter contract:
type Sender interface {
ID() string
Send(ctx context.Context, model string, req Request) (ProviderResponse, error)
ClassifyError(err error) *ClassifiedError
}
router.StreamSender
Optional streaming extension:
type StreamSender interface {
Sender
SendStream(ctx context.Context, model string, req Request) (io.ReadCloser, error)
}
health.Probeable
Health probe interface for providers:
type Probeable interface {
ID() string
HealthEndpoint() string
}
store.Store
Persistence interface with methods for models, providers, request logs, audit logs, reward entries, API keys, vault blobs, and routing configuration.
Request Lifecycle
- HTTP handler receives the request, validates input, extracts API key
- Directive parser scans messages for @@tokenhub overrides and strips them
- Policy resolution: Merge request policy with server defaults and directive overrides
- Token estimation: Estimate input tokens (explicit or chars/4 heuristic)
- Model selection: Filter eligible models, score by policy weights, sort
- Provider dispatch: Call the top-scored model's adapter
- Error handling: On failure, classify the error and escalate/retry/failover
- Output shaping: Apply output format (JSON schema validation, think-block stripping)
- Observability: Record metrics, TSDB points, request logs, reward entries, SSE events
- Response: Return the provider response with routing metadata
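The token-estimation step in the lifecycle above uses a chars/4 heuristic when no explicit count is supplied. A minimal sketch (the function name and signature are illustrative, not the engine's actual identifiers):

```go
package main

import "fmt"

// estimateTokens applies the chars/4 heuristic from the request lifecycle:
// if the client supplies an explicit token count it wins; otherwise the
// estimate is the total message length in characters divided by four.
func estimateTokens(messages []string, explicit int) int {
	if explicit > 0 {
		return explicit
	}
	chars := 0
	for _, m := range messages {
		chars += len(m)
	}
	est := chars / 4
	if est < 1 {
		est = 1 // never estimate zero tokens for a non-empty request
	}
	return est
}

func main() {
	msgs := []string{"You are a helpful assistant.", "Summarize this paragraph for me."}
	fmt.Println(estimateTokens(msgs, 0))   // heuristic estimate
	fmt.Println(estimateTokens(msgs, 750)) // explicit count wins
}
```

The estimate then feeds both eligibility filtering (context-window headroom) and cost scoring.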
Concurrency Model
- The HTTP server uses Go's standard net/http with the chi router (goroutine per request)
- The TSDB uses internal write buffering (batched inserts)
- The health prober runs as a background goroutine with configurable interval
- The Thompson Sampler refresh runs as a background goroutine
- The TSDB prune loop runs as a background goroutine (hourly)
- Temporal workflows (when enabled) are managed by the Temporal worker
All background goroutines are cleanly stopped via Server.Close().
Configuration
All configuration is via environment variables, loaded in internal/app/config.go. See Configuration Reference for the complete list.
Embedding
The admin UI (web/index.html) is embedded in the binary using Go's //go:embed directive in the root embed.go file. This means the entire application is a single self-contained binary.
Routing Engine
The routing engine (internal/router/engine.go) is TokenHub's core component. It manages the model registry, scores models against request policies, dispatches to provider adapters, and handles failover.
Engine Structure
type Engine struct {
adapters map[string]Sender // provider ID → adapter
models []Model // registered models
healthChecker HealthChecker // optional health state provider
banditPolicy BanditPolicy // optional Thompson Sampling
defaults EngineConfig // default mode, budget, latency
}
Model Registration
Models and adapters are registered at startup and can be modified at runtime:
eng.RegisterAdapter(openai.New("openai", apiKey, baseURL))
eng.RegisterModel(router.Model{
ID: "gpt-4", ProviderID: "openai",
Weight: 8, MaxContextTokens: 128000,
InputPer1K: 0.01, OutputPer1K: 0.03, Enabled: true,
})
Scoring Algorithm
The scoreModel() function computes a composite score for each eligible model:
score = (costNorm * w.Cost) + (latencyNorm * w.Latency) + (failureNorm * w.Failure) - (weightNorm * w.Weight)
Normalization:
- costNorm: estimatedCost / maxBudgetUSD (clamped to 0-1)
- latencyNorm: avgLatencyMs / maxLatencyMs (from health tracker)
- failureNorm: errorRate (from health tracker, 0-1)
- weightNorm: model.Weight / 10.0
Lower scores are better. The weight term is subtracted (higher weight reduces score).
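The formula and normalization rules above can be condensed into a self-contained sketch. The Weights struct, clamp helper, and the example weight values are illustrative stand-ins, not the engine's real types:

```go
package main

import "fmt"

// Weights holds per-objective weights (values here are invented for the demo).
type Weights struct {
	Cost, Latency, Failure, Weight float64
}

func clamp01(x float64) float64 {
	if x < 0 {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}

// scoreModel mirrors the documented composite: lower is better, and the
// capability-weight term is subtracted, so a heavier model scores lower.
func scoreModel(estCostUSD, maxBudgetUSD, avgLatencyMs, maxLatencyMs, errorRate, modelWeight float64, w Weights) float64 {
	costNorm := clamp01(estCostUSD / maxBudgetUSD)
	latencyNorm := clamp01(avgLatencyMs / maxLatencyMs)
	failureNorm := clamp01(errorRate)
	weightNorm := modelWeight / 10.0
	return costNorm*w.Cost + latencyNorm*w.Latency + failureNorm*w.Failure - weightNorm*w.Weight
}

func main() {
	w := Weights{Cost: 0.4, Latency: 0.3, Failure: 0.2, Weight: 0.1}
	cheap := scoreModel(0.001, 0.05, 800, 10000, 0.0, 5, w)
	pricey := scoreModel(0.04, 0.05, 2500, 10000, 0.1, 8, w)
	fmt.Printf("cheap=%.4f pricey=%.4f\n", cheap, pricey)
}
```

Note that a score can go negative when a model's weight term outweighs its cost and latency penalties; only the relative ordering matters.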
Eligibility Filtering
eligibleModels() filters the model registry:
- Must be Enabled
- Must meet the min_weight threshold
- Must have sufficient context window (estimated tokens * 1.15 headroom)
- Provider must not be in "down" health state
- Estimated cost must be within budget
For thompson mode, eligible models are reordered by Thompson Sampling instead of the scoring function.
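The filter chain above can be sketched as a single pass over the registry. The Model struct and providerState callback below are simplified stand-ins for the engine's actual types:

```go
package main

import "fmt"

// Model is a simplified stand-in for the engine's registered-model type.
type Model struct {
	ID               string
	ProviderID       string
	Weight           int
	MaxContextTokens int
	InputPer1K       float64
	Enabled          bool
}

// eligibleModels applies the documented filters: enabled, weight floor,
// 1.15x context headroom, provider not "down", and estimated cost in budget.
func eligibleModels(models []Model, minWeight, estTokens int, maxBudgetUSD float64, providerState func(string) string) []Model {
	var out []Model
	needed := int(float64(estTokens) * 1.15)
	for _, m := range models {
		estCost := float64(estTokens) / 1000 * m.InputPer1K
		if !m.Enabled || m.Weight < minWeight || m.MaxContextTokens < needed ||
			providerState(m.ProviderID) == "down" || estCost > maxBudgetUSD {
			continue
		}
		out = append(out, m)
	}
	return out
}

func main() {
	models := []Model{
		{ID: "small", ProviderID: "a", Weight: 3, MaxContextTokens: 4096, InputPer1K: 0.0005, Enabled: true},
		{ID: "big", ProviderID: "b", Weight: 9, MaxContextTokens: 128000, InputPer1K: 0.01, Enabled: true},
	}
	healthy := func(string) string { return "healthy" }
	for _, m := range eligibleModels(models, 5, 2000, 0.05, healthy) {
		fmt.Println(m.ID) // only "big" clears the weight floor of 5
	}
}
```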
RouteAndSend Flow
func (e *Engine) RouteAndSend(ctx context.Context, req Request, policy Policy) (Decision, ProviderResponse, error)
- Resolve defaults (fill in zero-value policy fields from server defaults)
- Get eligible models
- If model_hint is set and the model exists, try it first
- Sort remaining models by score
- For each model (up to 5 attempts):
  a. Look up the adapter by model.ProviderID
  b. Call adapter.Send(ctx, model.ID, req)
  c. On success: return decision + response
  d. On error: classify the error and decide the next action:
     - ErrContextOverflow: Find a model with a larger context window
     - ErrRateLimited: Skip to the next provider (honor RetryAfter)
     - ErrTransient: Retry the same model with exponential backoff
     - ErrFatal: Try the next model
Orchestration
Orchestrate() handles multi-model modes:
func (e *Engine) Orchestrate(ctx context.Context, req Request, dir OrchestrationDirective) (Decision, json.RawMessage, error)
See Orchestration Modes for details.
Streaming
func (e *Engine) RouteAndStream(ctx context.Context, req Request, policy Policy) (Decision, io.ReadCloser, error)
Same model selection as RouteAndSend, but calls SendStream() on adapters that implement StreamSender. Returns the raw SSE stream body for the HTTP handler to proxy.
Health Integration
The engine optionally uses a HealthChecker interface:
type HealthChecker interface {
ProviderState(providerID string) ProviderHealthState
}
This provides:
- Error rate for scoring (failureNorm)
- "Down" state for eligibility filtering
- Average latency for scoring (latencyNorm)
Thompson Sampling Integration
When a BanditPolicy is set:
type BanditPolicy interface {
Sample(models []Model, tokenBucket string) []Model
}
In thompson mode, eligibleModels() calls banditPolicy.Sample() instead of the scoring function. The sampler draws from Beta distributions parameterized by historical reward data.
Thread Safety
The engine uses sync.RWMutex to protect the model registry and adapter map. Reads (model selection, routing) take a read lock. Writes (register/unregister) take a write lock.
Orchestration Modes
Orchestration enables multi-model reasoning patterns. The orchestration logic lives in internal/router/engine.go in the Orchestrate() method.
Architecture
Orchestrate(req, directive)
├── adversarial: Plan → Critique → Refine (loop)
├── vote: N Voters → Judge → Select best
├── refine: Generate → Refine → Refine (loop)
└── planning: Single RouteAndSend with planning profile
Model Selection for Orchestration
Each orchestration mode needs a "primary" model and optionally a "review" model. Models are selected by:
- Explicit model ID: primary_model_id / review_model_id in the directive
- Weight floor: primary_min_weight / review_min_weight sets minimum capability
- Automatic: Falls back to routing engine scoring with the appropriate policy
For review models, the policy uses high_confidence mode by default to ensure a capable judge/critic.
Adversarial Mode
Three-phase iterative refinement with a separate critique model:
// Phase 1: Plan
planResp = RouteAndSend(req with "Create a detailed plan...")
// Phase 2: Critique (loop N iterations)
critiqueResp = RouteAndSend(req with "Critique this plan: ...")
// Phase 3: Refine
refinedResp = RouteAndSend(req with "Refine based on critique: ...")
The critique and refine phases repeat for directive.Iterations (default 1).
Output schema:
{
"initial_plan": "Plan text from phase 1",
"critique": "Final critique from last iteration",
"refined_plan": "Final refined plan from last iteration"
}
Vote Mode
Multiple models respond independently, a judge selects the best:
// Phase 1: Collect votes (one per eligible model, up to 3)
for model in eligibleModels:
responses[model] = RouteAndSend(req, model)
// Phase 2: Judge
judgeResp = RouteAndSend(req with "Select the best response (1-N): ...")
selectedIdx = parseNumber(judgeResp) - 1
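The parseNumber step above turns the judge's free-text reply into a 1-based index. The helper name comes from the pseudocode; the parsing details below are an assumption about how such a reply might be handled:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseNumber extracts the first run of digits from the judge's reply, so
// both a bare "2" and "Response 2 is best." resolve to 2. Returns 0 when no
// digit is found, which the caller can treat as an invalid vote.
func parseNumber(s string) int {
	start := -1
	for i, r := range s {
		if r >= '0' && r <= '9' {
			if start < 0 {
				start = i
			}
		} else if start >= 0 {
			n, _ := strconv.Atoi(s[start:i])
			return n
		}
	}
	if start >= 0 {
		n, _ := strconv.Atoi(s[start:])
		return n
	}
	return 0
}

func main() {
	for _, reply := range []string{"2", "Response 2 is best.", "I pick #3"} {
		fmt.Println(parseNumber(reply)) // 1-based; the caller subtracts 1
	}
}
```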
Output schema:
{
"responses": [
{"model": "gpt-4", "content": "...", "selected": true},
{"model": "claude-sonnet", "content": "...", "selected": false}
],
"selected": 0,
"judge": "claude-opus"
}
Refine Mode
Single model iteratively improves its own response:
// Phase 1: Initial response
resp = RouteAndSend(req)
// Phase 2: Iterative refinement (loop N iterations)
for i := 0; i < iterations; i++:
resp = RouteAndSend(req with "Review and improve: " + resp)
Output schema:
{
"refined_response": "Final refined text",
"iterations": 3,
"model": "claude-opus"
}
Planning Mode
Falls through to a standard RouteAndSend with the planning routing profile:
decision, resp, err = RouteAndSend(req, Policy{Mode: "planning"})
Cost and Latency
Orchestration makes multiple LLM calls. The Decision returned by Orchestrate() accumulates costs from all calls:
totalDecision.EstimatedCostUSD += stepDecision.EstimatedCostUSD
The routing reason is set to {mode}-orchestration (e.g., adversarial-orchestration).
Temporal Integration
When Temporal is enabled, orchestration runs as an OrchestrationWorkflow:
- Each LLM call becomes a Temporal activity
- Activities run with retry policies and timeouts
- The full execution is visible in the Temporal UI
- If Temporal is unavailable, falls back to direct orchestration
See Temporal Workflows for details.
Adding New Orchestration Modes
To add a new mode:
- Add the mode name to the validation list in handlers_plan.go
- Add a case in Orchestrate() in engine.go
- Implement the multi-call pattern following existing modes
- Return a json.RawMessage with the composite result
- Update OrchestrationWorkflow in temporal/workflows.go if using Temporal
Provider Adapters
Provider adapters translate TokenHub's generic request format into provider-specific API calls. Each adapter implements the router.Sender interface.
Interface
// Sender is the core provider adapter interface.
type Sender interface {
ID() string
Send(ctx context.Context, model string, req Request) (ProviderResponse, error)
ClassifyError(err error) *ClassifiedError
}
// StreamSender extends Sender with streaming support.
type StreamSender interface {
Sender
SendStream(ctx context.Context, model string, req Request) (io.ReadCloser, error)
}
// Probeable enables active health probing.
type Probeable interface {
ID() string
HealthEndpoint() string
}
ProviderResponse is []byte (raw JSON from the provider).
Existing Adapters
OpenAI (internal/providers/openai/)
- Endpoint: POST {baseURL}/v1/chat/completions
- Health: GET {baseURL}/v1/models
- Auth: Authorization: Bearer {apiKey}
- Request translation: Maps req.Messages to OpenAI chat format, merges req.Parameters
- Error classification:
  - 429 → ErrRateLimited (with Retry-After header parsing)
  - 5xx → ErrTransient
  - Body contains context_length_exceeded → ErrContextOverflow
  - Other → ErrFatal
Anthropic (internal/providers/anthropic/)
- Endpoint:
POST {baseURL}/v1/messages - Health:
GET {baseURL}/v1/messages(405 = healthy) - Auth:
x-api-key: {apiKey},anthropic-version: 2023-06-01 - Request translation: Splits system message from user messages (Anthropic API requires separate
systemfield), defaultsmax_tokensto 4096 if not inreq.Parameters - Error classification: Same pattern as OpenAI
vLLM (internal/providers/vllm/)
- Endpoint: POST {endpoint}/v1/chat/completions (OpenAI-compatible)
- Health: GET {endpoint}/health
- Auth: None (local deployment)
- Features: Multiple endpoints with round-robin load balancing
- Request translation: Same as OpenAI (vLLM implements OpenAI-compatible API)
Common Patterns
Parameter Forwarding
All adapters merge req.Parameters into the provider payload:
for k, v := range req.Parameters {
if k != "model" && k != "messages" {
payload[k] = v
}
}
Reserved keys (model, messages, stream) are never overridden by parameters.
Request ID Propagation
All adapters forward the request ID for distributed tracing:
if reqID := providers.GetRequestID(ctx); reqID != "" {
req.Header.Set("X-Request-ID", reqID)
}
The request ID is injected into the context by the HTTP handler using providers.WithRequestID().
Error Wrapping
Adapters wrap HTTP errors in providers.StatusError:
type StatusError struct {
StatusCode int
Body string
RetryAfterSecs float64
}
The ClassifyError() method on each adapter converts these to router.ClassifiedError for the routing engine's failover logic.
Creating a New Adapter
To add support for a new provider:
- Create internal/providers/{name}/adapter.go
- Implement router.Sender (and optionally router.StreamSender and health.Probeable)
- Add an Option pattern for configuration (timeout, endpoints, etc.)
- Add a case for the new type in registerProviderAdapter() in internal/httpapi/handlers_admin.go
- Register providers and models at runtime via the admin API or tokenhubctl
Example skeleton:
package newprovider
import (
"context"
"github.com/jordanhubbard/tokenhub/internal/router"
)
type Adapter struct {
id string
apiKey string
// ...
}
func New(id, apiKey string) *Adapter {
return &Adapter{id: id, apiKey: apiKey}
}
func (a *Adapter) ID() string { return a.id }
func (a *Adapter) Send(ctx context.Context, model string, req router.Request) (router.ProviderResponse, error) {
// Translate req to provider format, make the HTTP call, return raw JSON.
return nil, nil // TODO
}
func (a *Adapter) ClassifyError(err error) *router.ClassifiedError {
// Classify the error for failover logic.
return nil // TODO
}
func (a *Adapter) HealthEndpoint() string {
return "https://api.newprovider.com/health"
}
Health System
The health system tracks provider reliability and provides both passive monitoring (based on request outcomes) and active probing (periodic HTTP checks).
Components
Health Tracker (internal/health/tracker.go)
The tracker maintains per-provider health state:
type ProviderHealthState struct {
State string // "healthy", "degraded", "down"
TotalRequests int64
TotalErrors int64
ConsecErrors int
AvgLatencyMs float64 // Exponential moving average
LastError string
LastSuccessAt time.Time
CooldownUntil time.Time
}
State Transitions
success
┌─────────────────────────────────┐
│ │
▼ 2+ consec errors │
Healthy ──────────────────────► Degraded
▲ │
│ success │ 5+ consec errors
│◄────────────────────────────────┤
│ ▼
│ Down
│ cooldown expired │
│ + success │
└─────────────────────────────────┘
Configuration
type Config struct {
DegradedThreshold int // Consecutive errors to enter degraded (default: 2)
DownThreshold int // Consecutive errors to enter down (default: 5)
CooldownDuration time.Duration // Time in down state before retry (default: 30s)
}
Recording Results
// Called after every provider request
tracker.RecordSuccess(providerID, latencyMs)
tracker.RecordError(providerID, errorMsg)
Each success resets the consecutive error counter. Each error increments it and potentially triggers a state transition.
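The transition rules can be condensed into a small sketch using the default thresholds above. The function name is illustrative, and the real tracker additionally handles the cooldown window before a down provider is retried:

```go
package main

import "fmt"

// nextState maps a consecutive-error count to a health state using the
// documented thresholds: 2+ errors → degraded, 5+ → down. Any success
// resets the counter, which maps back to "healthy".
func nextState(consecErrors, degradedAt, downAt int) string {
	switch {
	case consecErrors >= downAt:
		return "down"
	case consecErrors >= degradedAt:
		return "degraded"
	default:
		return "healthy"
	}
}

func main() {
	for _, n := range []int{0, 1, 2, 4, 5, 9} {
		fmt.Printf("%d consecutive errors -> %s\n", n, nextState(n, 2, 5))
	}
}
```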
Health Prober (internal/health/prober.go)
The prober performs active health checks against provider endpoints:
type Probeable interface {
ID() string
HealthEndpoint() string
}
Probe Logic
- Sends GET requests to each provider's health endpoint
- Runs all probes concurrently with a per-probe timeout
- 2xx or 405 responses are considered healthy (405 is expected from some endpoints like Anthropic's /v1/messages)
- Any other response or connection error records a failure
Configuration
type ProberConfig struct {
Interval time.Duration // Time between probe rounds (default: 30s)
Timeout time.Duration // Per-probe HTTP timeout (default: 10s)
}
Provider Health Endpoints
| Provider | Endpoint | Success |
|---|---|---|
| OpenAI | GET /v1/models | 2xx |
| Anthropic | GET /v1/messages | 2xx or 405 |
| vLLM | GET /health | 2xx |
Integration with Routing
The routing engine queries health state during model selection:
- Eligibility: Models from providers in "down" state are excluded
- Scoring: The failure rate (totalErrors / totalRequests) contributes to the model's score
- Latency: The exponential moving average latency contributes to the model's score
type HealthChecker interface {
ProviderState(providerID string) ProviderHealthState
}
The tracker implements this interface and is passed to the engine via engine.SetHealthChecker().
Observability
Provider health is exposed via:
- GET /admin/v1/health — JSON health state for all providers
- Admin UI health panel — Visual health badges
- SSE events — Error events include provider state changes
Storage Layer
TokenHub uses SQLite for persistence, providing a zero-dependency embedded database. The storage layer is defined by the store.Store interface and implemented by store.SQLiteStore.
Interface
The Store interface (internal/store/store.go) provides methods for all persistence needs:
Models
UpsertModel(ctx, Model) error
GetModel(ctx, id) (*Model, error)
ListModels(ctx) ([]Model, error)
DeleteModel(ctx, id) error
Providers
UpsertProvider(ctx, Provider) error
ListProviders(ctx) ([]Provider, error)
DeleteProvider(ctx, id) error
Request Logs
LogRequest(ctx, RequestLog) error
ListRequestLogs(ctx, limit, offset) ([]RequestLog, error)
Audit Logs
LogAudit(ctx, AuditEntry) error
ListAuditLogs(ctx, limit, offset) ([]AuditEntry, error)
Reward Entries
LogReward(ctx, RewardEntry) error
ListRewardEntries(ctx, limit, offset) ([]RewardEntry, error)
GetRewardSummary(ctx) ([]RewardSummary, error)
API Keys
CreateAPIKey(ctx, APIKeyRecord) error
GetAPIKey(ctx, id) (*APIKeyRecord, error)
ListAPIKeys(ctx) ([]APIKeyRecord, error)
UpdateAPIKey(ctx, APIKeyRecord) error
DeleteAPIKey(ctx, id) error
Vault Blob
SaveVaultBlob(ctx, salt, data) error
LoadVaultBlob(ctx) (salt, data, error)
Routing Configuration
SaveRoutingConfig(ctx, RoutingConfig) error
LoadRoutingConfig(ctx) (RoutingConfig, error)
Schema
The database schema is created and migrated in sqlite.go's Migrate() method:
models
CREATE TABLE IF NOT EXISTS models (
id TEXT PRIMARY KEY,
provider_id TEXT NOT NULL,
weight INTEGER NOT NULL DEFAULT 5,
max_context_tokens INTEGER NOT NULL DEFAULT 4096,
input_per_1k REAL NOT NULL DEFAULT 0,
output_per_1k REAL NOT NULL DEFAULT 0,
enabled INTEGER NOT NULL DEFAULT 1
);
providers
CREATE TABLE IF NOT EXISTS providers (
id TEXT PRIMARY KEY,
type TEXT NOT NULL,
enabled INTEGER NOT NULL DEFAULT 1,
base_url TEXT NOT NULL DEFAULT '',
cred_store TEXT NOT NULL DEFAULT 'none'
);
request_logs
CREATE TABLE IF NOT EXISTS request_logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
request_id TEXT NOT NULL DEFAULT '',
model_id TEXT NOT NULL DEFAULT '',
provider_id TEXT NOT NULL DEFAULT '',
mode TEXT NOT NULL DEFAULT '',
estimated_cost_usd REAL NOT NULL DEFAULT 0,
latency_ms INTEGER NOT NULL DEFAULT 0,
status_code INTEGER NOT NULL DEFAULT 0,
error_class TEXT NOT NULL DEFAULT ''
);
audit_logs
CREATE TABLE IF NOT EXISTS audit_logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
action TEXT NOT NULL,
resource TEXT NOT NULL DEFAULT '',
request_id TEXT NOT NULL DEFAULT ''
);
reward_entries
CREATE TABLE IF NOT EXISTS reward_entries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
request_id TEXT NOT NULL DEFAULT '',
model_id TEXT NOT NULL DEFAULT '',
provider_id TEXT NOT NULL DEFAULT '',
mode TEXT NOT NULL DEFAULT '',
estimated_tokens INTEGER NOT NULL DEFAULT 0,
token_bucket TEXT NOT NULL DEFAULT '',
latency_budget_ms REAL NOT NULL DEFAULT 0,
latency_ms REAL NOT NULL DEFAULT 0,
cost_usd REAL NOT NULL DEFAULT 0,
success INTEGER NOT NULL DEFAULT 0,
error_class TEXT NOT NULL DEFAULT '',
reward REAL NOT NULL DEFAULT 0
);
api_keys
CREATE TABLE IF NOT EXISTS api_keys (
id TEXT PRIMARY KEY,
key_hash TEXT NOT NULL,
key_prefix TEXT NOT NULL,
name TEXT NOT NULL,
scopes TEXT NOT NULL DEFAULT '["chat","plan"]',
created_at TEXT NOT NULL,
last_used_at TEXT,
expires_at TEXT,
rotation_days INTEGER NOT NULL DEFAULT 0,
enabled INTEGER NOT NULL DEFAULT 1
);
vault_blob
CREATE TABLE IF NOT EXISTS vault_blob (
id TEXT PRIMARY KEY DEFAULT 'singleton',
salt TEXT,
data_json TEXT
);
routing_config
CREATE TABLE IF NOT EXISTS routing_config (
id TEXT PRIMARY KEY DEFAULT 'default',
default_mode TEXT NOT NULL DEFAULT '',
default_max_budget_usd REAL NOT NULL DEFAULT 0,
default_max_latency_ms INTEGER NOT NULL DEFAULT 0
);
SQLite Configuration
The default DSN includes pragmas for performance:
file:/data/tokenhub.sqlite?_pragma=busy_timeout(5000)&_pragma=journal_mode(WAL)
- busy_timeout: Wait up to 5 seconds for locks instead of failing immediately
- journal_mode(WAL): Write-Ahead Logging for concurrent read/write access
TSDB
The time-series database (internal/tsdb/) uses a separate table in the same SQLite database:
CREATE TABLE IF NOT EXISTS tsdb_points (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ts INTEGER NOT NULL, -- Unix nanoseconds
metric TEXT NOT NULL,
model_id TEXT NOT NULL DEFAULT '',
provider_id TEXT NOT NULL DEFAULT '',
value REAL NOT NULL
);
Features:
- Write buffering (batch size 100)
- Automatic retention pruning (default 7 days)
- Downsampling support (configurable step size in queries)
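The write-buffering behavior can be sketched as a simple batching writer. The type and its flush callback are illustrative, not the tsdb package's real API:

```go
package main

import "fmt"

// Point is a single metric sample (simplified from the tsdb_points schema).
type Point struct {
	TS     int64
	Metric string
	Value  float64
}

// bufferedWriter accumulates points and hands them to flush in batches,
// mimicking the TSDB's batch-of-100 insert behavior.
type bufferedWriter struct {
	buf       []Point
	batchSize int
	flush     func([]Point) // would perform one multi-row INSERT
}

func (w *bufferedWriter) Write(p Point) {
	w.buf = append(w.buf, p)
	if len(w.buf) >= w.batchSize {
		w.Flush()
	}
}

// Flush drains any buffered points; called on shutdown so no samples are lost.
func (w *bufferedWriter) Flush() {
	if len(w.buf) == 0 {
		return
	}
	w.flush(w.buf)
	w.buf = nil
}

func main() {
	batches := 0
	w := &bufferedWriter{batchSize: 100, flush: func(pts []Point) {
		batches++
		fmt.Printf("batch %d: %d points\n", batches, len(pts))
	}}
	for i := 0; i < 250; i++ {
		w.Write(Point{TS: int64(i), Metric: "latency", Value: float64(i)})
	}
	w.Flush() // drain the remaining 50
}
```

Batching trades a small write delay for far fewer SQLite transactions, which matters under WAL with many concurrent readers.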
Security Model
TokenHub implements security at multiple layers: credential encryption, client authentication, input validation, and audit logging.
Credential Security
Vault Encryption
Provider API keys are encrypted using AES-256-GCM:
- Admin provides a vault password
- Password + random salt → Argon2id key derivation → 256-bit encryption key
- Each value is encrypted with a unique nonce
- Encrypted values are stored in SQLite
Argon2id Parameters (per OWASP recommendations):
- Time: 3 iterations
- Memory: 64 MB
- Threads: 4
- Salt: 16 random bytes
Key Material Handling
- Encryption keys exist only in memory while the vault is unlocked
- Auto-lock clears the key after 30 minutes of inactivity
- Vault salt is persisted in the database for key re-derivation
- Password rotation re-encrypts all values atomically
Admin Authentication
Admin Token
All /admin/v1/* endpoints require a bearer token set via TOKENHUB_ADMIN_TOKEN.
If not set, the server auto-generates a cryptographically random 64-character hex
token at startup. The token is never logged — it is written to a file at
/data/.admin-token (or ~/.tokenhub/.admin-token for native deployments) and
can be retrieved with:
tokenhubctl admin-token
Client Authentication
API Key Security
- Keys are hashed with bcrypt (cost 10) before storage
- SHA-256 pre-hash allows keys longer than bcrypt's 72-byte input limit
- 5-minute validation cache reduces bcrypt overhead
- Plaintext is shown only once at creation/rotation
Key Validation Flow
Request → Extract Bearer token → Check cache (5min TTL)
├── Cache hit → Check scopes → Allow/Deny
└── Cache miss → Load by prefix → bcrypt verify → Check enabled → Check expiry
├── Valid → Update cache + last_used_at → Check scopes → Allow/Deny
└── Invalid → 401 Unauthorized
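The SHA-256 pre-hash mentioned above is the piece that lifts bcrypt's 72-byte limit. A stdlib-only sketch of that step (the real manager then runs bcrypt, from golang.org/x/crypto/bcrypt, over the digest rather than the raw key):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// prehash reduces an API key of any length to a fixed 64-character hex
// digest, which always fits under bcrypt's 72-byte input limit. The
// function name is illustrative.
func prehash(apiKey string) string {
	sum := sha256.Sum256([]byte(apiKey))
	return hex.EncodeToString(sum[:])
}

func main() {
	longKey := "th_" + strings.Repeat("a", 200) // far past bcrypt's 72-byte limit
	digest := prehash(longKey)
	fmt.Println(len(digest)) // always 64, regardless of key length
}
```

Hashing before bcrypt also means the database never sees key length information, only the fixed-size digest's bcrypt hash.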
Input Validation
All API inputs are validated before processing:
Chat/Plan Endpoints
- Messages array: required, non-empty
- max_budget_usd: 0-100 range
- max_latency_ms: 0-300000 range
- min_weight: 0-10 range
- Orchestration iterations: 0-10 range
- Orchestration mode: must be a known value
Admin Endpoints
- Routing config mode: must be a known value
- Routing config budget/latency: same ranges as consumer API
- Model weight: reasonable range
- API key name: required
Request Isolation
- Each request gets its own context with a unique request ID
- Provider API keys are never exposed to clients
- Client API key records are attached to context but not serialized in responses
- Request parameters are validated before forwarding to providers
Audit Trail
All administrative mutations are logged:
type AuditEntry struct {
Timestamp time.Time
Action string // e.g., "vault.unlock", "model.patch"
Resource string // Resource identifier
RequestID string // For correlation
}
Auditable actions:
- Vault operations (lock, unlock, rotate)
- Provider CRUD
- Model CRUD
- API key lifecycle (create, rotate, update, revoke)
- Routing configuration changes
Network Security
TokenHub itself does not implement TLS. In production:
- Use a reverse proxy (nginx, Caddy, Traefik) for TLS termination
- Restrict admin endpoints to internal networks or VPN
- Use CORS appropriately (currently allows all origins for development)
Recommendations
- Vault password: Use a strong, unique password (16+ characters)
- API key rotation: Rotate keys every 90 days (configurable via rotation_days)
- Network segmentation: Keep admin endpoints behind a VPN or firewall
- TLS everywhere: Terminate TLS at a reverse proxy in front of TokenHub
- Database backups: SQLite file contains encrypted credentials and configuration
- Monitor audit logs: Set up alerting on unexpected admin actions
Temporal Workflows
TokenHub optionally integrates with Temporal for durable workflow execution. When enabled, every chat and orchestration request is dispatched as a Temporal workflow, providing visibility, retry guarantees, and execution history.
Architecture
HTTP Handler
│
├── Temporal Enabled?
│ ├── Yes → Start Temporal Workflow → Wait for result → Return response
│ └── No → Direct engine call → Return response
│
└── Temporal Unavailable (runtime)
└── Fall back to direct engine call
Configuration
| Env Var | Default | Description |
|---|---|---|
TOKENHUB_TEMPORAL_ENABLED | false | Enable Temporal dispatch |
TOKENHUB_TEMPORAL_HOST | localhost:7233 | Temporal server address |
TOKENHUB_TEMPORAL_NAMESPACE | tokenhub | Temporal namespace |
TOKENHUB_TEMPORAL_TASK_QUEUE | tokenhub-tasks | Worker task queue name |
Components
Manager (internal/temporal/manager.go)
The manager creates and manages the Temporal client and worker:
type Manager struct {
client client.Client
worker worker.Worker
}
- New(cfg, activities) — Creates the Temporal client, registers workflows and activities
- Start() — Starts the worker (non-blocking)
- Client() — Returns the Temporal client for HTTP handlers
- Stop() — Gracefully stops the worker and closes the client
Types (internal/temporal/types.go)
Input/output types for workflows:
type ChatInput struct {
RequestID string
APIKeyID string
Request router.Request
Policy router.Policy
}
type ChatOutput struct {
Decision router.Decision
Response json.RawMessage
LatencyMs int64
Error string
}
type OrchestrationInput struct {
RequestID string
APIKeyID string
Request router.Request
Directive router.OrchestrationDirective
}
Activities (internal/temporal/activities.go)
Activities are the atomic units of work. They receive injected dependencies:
type Activities struct {
Engine *router.Engine
Store store.Store
Health *health.Tracker
Metrics *metrics.Registry
EventBus *events.Bus
Stats *stats.Collector
TSDB *tsdb.Store
}
Key activities:
- ChatActivity: Calls engine.RouteAndSend() and returns the result
- LogResultActivity: Persists metrics, request logs, reward entries, TSDB points, and SSE events
Workflows (internal/temporal/workflows.go)
- ChatWorkflow: Calls ChatActivity then LogResultActivity
- OrchestrationWorkflow: Calls ChatActivity for orchestration, then LogResultActivity
HTTP Handler Integration
Handlers check for a Temporal client and dispatch accordingly:
if d.TemporalClient != nil {
run, err := d.TemporalClient.ExecuteWorkflow(ctx, opts, ChatWorkflow, input)
if err != nil {
// Temporal unavailable — fall back
decision, resp, err = d.Engine.RouteAndSend(ctx, req, policy)
} else {
var output ChatOutput
err = run.Get(ctx, &output)
// Use output
}
} else {
decision, resp, err = d.Engine.RouteAndSend(ctx, req, policy)
}
The fallback ensures TokenHub continues to work even if Temporal becomes unavailable at runtime.
Workflow Visibility
Admin endpoints expose Temporal workflow data:
- GET /admin/v1/workflows?limit=50&status=RUNNING — List workflows
- GET /admin/v1/workflows/{id} — Describe a workflow
- GET /admin/v1/workflows/{id}/history — Activity history
Status values: RUNNING, COMPLETED, FAILED, CANCELED, TERMINATED, CONTINUED_AS_NEW, TIMED_OUT
Docker Compose Setup
temporal:
image: temporalio/auto-setup:latest
ports:
- "7233:7233"
environment:
- DB=sqlite
temporal-ui:
image: temporalio/ui:latest
ports:
- "8233:8080"
environment:
- TEMPORAL_ADDRESS=temporal:7233
Access the Temporal Web UI at http://localhost:8233.
Streaming Note
Streaming requests (stream: true) bypass Temporal and use direct engine dispatch. This is because streaming requires a persistent HTTP connection for SSE, which is incompatible with Temporal's request-response workflow model.
Extending TokenHub
This guide covers common extension points for adding functionality to TokenHub.
Adding a New Provider
- Create the adapter package:
internal/providers/newprovider/
├── adapter.go # Sender implementation
└── adapter_test.go # Tests
- Implement the interfaces:
package newprovider
type Adapter struct {
id string
apiKey string
baseURL string
client *http.Client
}
// Required: router.Sender
func (a *Adapter) ID() string { return a.id }
func (a *Adapter) Send(ctx context.Context, model string, req router.Request) (router.ProviderResponse, error) { ... }
func (a *Adapter) ClassifyError(err error) *router.ClassifiedError { ... }
// Optional: router.StreamSender
func (a *Adapter) SendStream(ctx context.Context, model string, req router.Request) (io.ReadCloser, error) { ... }
// Optional: health.Probeable
func (a *Adapter) HealthEndpoint() string { return a.baseURL + "/health" }
- Register via the admin API (providers and models are registered at runtime, not compiled in):
curl -X POST http://localhost:8080/admin/v1/providers \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"id":"newprovider","type":"openai","base_url":"https://api.newprovider.com","api_key":"..."}'
- Add adapter construction in registerProviderAdapter() in handlers_admin.go:
case "newprovider":
d.Engine.RegisterAdapter(newprovider.New(p.ID, apiKey, p.BaseURL, newprovider.WithTimeout(timeout)))
Adding a New Routing Mode
- Define the weight profile in internal/router/engine.go:
var modeWeights = map[string]weights{
// ...existing modes...
"mymode": {Cost: 0.3, Latency: 0.2, Failure: 0.2, Weight: 0.3},
}
- Add validation in internal/httpapi/handlers_chat.go and handlers_plan.go:
case "mymode":
// valid
- Add to routing config validation in handlers_routing.go.
Adding a New Orchestration Mode
- Add the case in engine.Orchestrate():
case "mymode":
// Implement multi-call pattern
result, err := json.Marshal(map[string]any{...})
return totalDecision, result, err
- Add validation in handlers_plan.go.
- Update Temporal if using workflows:
// In OrchestrationWorkflow
case "mymode":
// Implement as Temporal activities
Adding New Admin Endpoints
- Create the handler in internal/httpapi/handlers_newfeature.go:
func NewFeatureHandler(d Dependencies) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
// Handler logic
}
}
- Mount the route in internal/httpapi/routes.go:
r.Get("/admin/v1/newfeature", NewFeatureHandler(d))
- Add to Dependencies if new services are needed.
Adding New Metrics
In internal/metrics/metrics.go:
type Registry struct {
// ...existing metrics...
NewMetric *prometheus.CounterVec
}
func New() *Registry {
r := &Registry{
NewMetric: prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: "tokenhub",
Name: "new_metric_total",
Help: "Description of the new metric",
}, []string{"label1", "label2"}),
}
// Register with Prometheus
return r
}
Adding New Store Operations
- Add to the interface in internal/store/store.go
- Implement it for SQLite in internal/store/sqlite.go
- Add a migration in Migrate() if new tables are needed
- Write tests in internal/store/sqlite_test.go
Testing
TokenHub uses Go's standard testing package. Key test patterns:
- Unit tests: Each package has *_test.go files
- Integration tests: internal/httpapi/handlers_test.go tests the full HTTP stack
- Mock adapters: mockSender in handler tests simulates provider responses
- In-memory SQLite: Tests use the :memory: DSN for isolated databases
Run all tests:
make test # Standard tests
make test-race # With race detector
Build
make build # Build to bin/tokenhub
make package # Build Docker image
make lint # Run linter (requires golangci-lint)
make vet # Go vet
Configuration Reference
TokenHub is configured entirely via environment variables. All variables are optional and have sensible defaults.
Environment Variables
Server
| Variable | Default | Description |
|---|---|---|
TOKENHUB_LISTEN_ADDR | :8080 | HTTP server listen address (binds all interfaces) |
TOKENHUB_LOG_LEVEL | info | Log level: debug, info, warn, error |
TOKENHUB_DB_DSN | /data/tokenhub.sqlite | SQLite database path |
TOKENHUB_VAULT_ENABLED | true | Enable encrypted credential vault |
TOKENHUB_VAULT_PASSWORD | — | Auto-unlock vault at startup (headless mode) |
TOKENHUB_PROVIDER_TIMEOUT_SECS | 30 | HTTP timeout for provider API calls |
Routing Defaults
| Variable | Default | Description |
|---|---|---|
TOKENHUB_DEFAULT_MODE | normal | Default routing mode |
TOKENHUB_DEFAULT_MAX_BUDGET_USD | 0.05 | Default max cost per request (USD) |
TOKENHUB_DEFAULT_MAX_LATENCY_MS | 20000 | Default max latency (milliseconds) |
Security & Hardening
| Variable | Default | Description |
|---|---|---|
TOKENHUB_ADMIN_TOKEN | — | Bearer token for /admin/v1/* access (required in production) |
TOKENHUB_CORS_ORIGINS | * | Comma-separated allowed CORS origins |
TOKENHUB_RATE_LIMIT_RPS | 60 | Max requests per second per IP |
TOKENHUB_RATE_LIMIT_BURST | 120 | Burst capacity per IP |
Credentials
| Variable | Default | Description |
|---|---|---|
TOKENHUB_CREDENTIALS_FILE | ~/.tokenhub/credentials | Path to external credentials JSON file |
Providers are registered at startup via ~/.tokenhub/credentials or at runtime via the admin API, tokenhubctl, or the admin UI. At least one provider must be registered for TokenHub to route requests.
Temporal (Optional)
| Variable | Default | Description |
|---|---|---|
TOKENHUB_TEMPORAL_ENABLED | false | Enable Temporal workflow dispatch |
TOKENHUB_TEMPORAL_HOST | localhost:7233 | Temporal server host:port |
TOKENHUB_TEMPORAL_NAMESPACE | tokenhub | Temporal namespace |
TOKENHUB_TEMPORAL_TASK_QUEUE | tokenhub-tasks | Temporal task queue name |
OpenTelemetry (Optional)
| Variable | Default | Description |
|---|---|---|
TOKENHUB_OTEL_ENABLED | false | Enable OpenTelemetry tracing |
TOKENHUB_OTEL_ENDPOINT | localhost:4318 | OTLP exporter endpoint |
TOKENHUB_OTEL_SERVICE_NAME | tokenhub | Service name for traces |
External Credentials File
The ~/.tokenhub/credentials file is the primary mechanism for bootstrapping
providers and models. It is processed at startup — providers are persisted to
the database and API keys are stored in the vault (when TOKENHUB_VAULT_PASSWORD
is set). The file must have 0600 permissions.
{
"providers": [
{
"id": "openai",
"type": "openai",
"base_url": "https://api.openai.com",
"api_key": "sk-..."
},
{
"id": "vllm-local",
"type": "vllm",
"base_url": "http://localhost:8000"
}
],
"models": [
{
"id": "gpt-4o",
"provider_id": "openai",
"weight": 8,
"max_context_tokens": 128000,
"input_per_1k": 0.0025,
"output_per_1k": 0.01
}
]
}
The file is idempotent — providers and models are upserted, so it can remain
in place across restarts. api_key is optional for keyless providers (vLLM,
Ollama). All providers default to enabled: true unless explicitly set to false.
Example Configuration
Minimal
./bin/tokenhub
# Then register providers via ~/.tokenhub/credentials, admin API, or UI.
Full Production
export TOKENHUB_LISTEN_ADDR=":8080"
export TOKENHUB_LOG_LEVEL="info"
export TOKENHUB_DB_DSN="/data/tokenhub.sqlite"
export TOKENHUB_VAULT_ENABLED="true"
export TOKENHUB_PROVIDER_TIMEOUT_SECS="30"
# Security
export TOKENHUB_ADMIN_TOKEN="your-secret-admin-token"
export TOKENHUB_CORS_ORIGINS="https://app.example.com"
export TOKENHUB_RATE_LIMIT_RPS="100"
export TOKENHUB_RATE_LIMIT_BURST="200"
# Routing
export TOKENHUB_DEFAULT_MODE="normal"
export TOKENHUB_DEFAULT_MAX_BUDGET_USD="0.10"
export TOKENHUB_DEFAULT_MAX_LATENCY_MS="30000"
# Temporal (optional)
export TOKENHUB_TEMPORAL_ENABLED="true"
export TOKENHUB_TEMPORAL_HOST="temporal:7233"
# OpenTelemetry (optional)
export TOKENHUB_OTEL_ENABLED="true"
export TOKENHUB_OTEL_ENDPOINT="otel-collector:4318"
./bin/tokenhub
# Providers are loaded from ~/.tokenhub/credentials, or registered via admin API/UI.
Runtime Configuration
The following settings can be changed at runtime via the admin API or tokenhubctl without restarting:
- Routing defaults: PUT /admin/v1/routing-config or tokenhubctl routing set
- Models: POST/PATCH/DELETE /admin/v1/models or tokenhubctl model add/edit/delete
- Providers: POST/PATCH/DELETE /admin/v1/providers or tokenhubctl provider add/edit/delete
- API keys: POST/PATCH/DELETE /admin/v1/apikeys or tokenhubctl apikey create/edit/delete
- TSDB retention: PUT /admin/v1/tsdb/retention or tokenhubctl tsdb
Docker & Compose
TokenHub provides a Dockerfile for container builds and a Docker Compose file for local development with all dependencies.
Docker Image
Build
make package
# or
docker buildx build --load -t tokenhub .
The Dockerfile uses a multi-stage build:
- Build stage: golang:1.24-alpine — compiles the Go binary and builds the mdbook documentation
- Runtime stage: alpine:3.21 — lightweight runtime with curl for health checks
The final image runs as a non-root tokenhub user.
Run
docker run -d \
-p 8080:8080 \
-e TOKENHUB_ADMIN_TOKEN="your-admin-token" \
-v tokenhub_data:/data \
tokenhub
The container expects:
- Port 8080: HTTP server (binds all interfaces by default)
- Volume /data: SQLite database persistence
Docker Compose
Full Stack
docker compose up -d
This starts:
| Service | Port | Description |
|---|---|---|
tokenhub | 8080 | TokenHub server |
temporal | 7233 | Temporal server (gRPC) |
temporal-ui | 8233 | Temporal Web UI |
Services
TokenHub
tokenhub:
image: tokenhub:latest
ports:
- "8080:8080"
environment:
- TOKENHUB_LISTEN_ADDR=:8080
- TOKENHUB_DB_DSN=/data/tokenhub.sqlite
- TOKENHUB_VAULT_ENABLED=true
- TOKENHUB_VAULT_PASSWORD=${TOKENHUB_VAULT_PASSWORD}
- TOKENHUB_ADMIN_TOKEN=${TOKENHUB_ADMIN_TOKEN}
volumes:
- tokenhub_data:/data
restart: unless-stopped
Set TOKENHUB_VAULT_PASSWORD to auto-unlock the vault at startup (headless mode). If not set, unlock interactively via UI or tokenhubctl. Providers are loaded from ~/.tokenhub/credentials at startup, or registered at runtime via the admin API, tokenhubctl, or the admin UI.
Note: The TOKENHUB_DB_DSN should be a plain path (e.g., /data/tokenhub.sqlite) when using modernc.org/sqlite (the pure-Go driver). SQLite pragmas are applied programmatically, not via DSN query parameters.
Temporal
temporal:
image: temporalio/auto-setup:latest
ports:
- "7233:7233"
environment:
- DB=sqlite
volumes:
- temporal_data:/etc/temporal/data
temporal-ui:
image: temporalio/ui:latest
ports:
- "8233:8080"
environment:
- TEMPORAL_ADDRESS=temporal:7233
Environment File
Create a .env file for sensitive values:
TOKENHUB_ADMIN_TOKEN=your-secret-admin-token
Without Temporal
To run without Temporal:
docker compose up -d tokenhub
Or set TOKENHUB_TEMPORAL_ENABLED=false.
Provider Bootstrapping
Providers are loaded from ~/.tokenhub/credentials at startup. For Docker,
mount the credentials file into the container or use the host path if running
via Docker Compose with a volume mount. See Provider Management
for the file format.
Health Check
The Docker health check uses the /healthz endpoint:
curl -f http://localhost:8080/healthz
Returns 200 when adapters and models are registered, 503 otherwise.
Data Persistence
All persistent data is stored in SQLite at the path configured by TOKENHUB_DB_DSN. In Docker, mount a volume to /data:
volumes:
- tokenhub_data:/data
This persists:
- Model and provider configurations
- Vault salt and encrypted credentials
- Request logs, audit logs, reward entries
- API keys
- Routing configuration
- TSDB time-series data
Resource Requirements
TokenHub is lightweight:
- Memory: ~50MB baseline, scales with request concurrency
- CPU: Minimal (most time is spent waiting on provider APIs)
- Disk: Depends on log retention; ~1MB per 10,000 requests
Production Checklist
Use this checklist when deploying TokenHub to production.
Pre-Deployment
- Set a strong vault password (16+ characters, mixed case, numbers, symbols)
- Configure at least one provider via the credentials file, admin API, or UI
- Set appropriate routing defaults for your use case
- Create API keys for all client applications
- Configure TSDB retention appropriate for your storage budget
Security Hardening
- Set TOKENHUB_ADMIN_TOKEN: Stable bearer token for /admin/v1/* endpoints (auto-generated and written to /data/.admin-token if not set; retrieve it with tokenhubctl admin-token)
- Set TOKENHUB_CORS_ORIGINS: Restrict CORS to your domain(s) (e.g., https://app.example.com)
- Rate limiting: Review TOKENHUB_RATE_LIMIT_RPS (default: 60/s) and TOKENHUB_RATE_LIMIT_BURST (default: 120) for your traffic patterns
Network Security
- TLS termination: Place TokenHub behind a reverse proxy (nginx, Caddy, Traefik) with TLS
- Firewall rules: Only allow inbound traffic on the configured listen port
Example nginx Configuration
server {
listen 443 ssl;
server_name tokenhub.example.com;
ssl_certificate /etc/letsencrypt/live/tokenhub.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/tokenhub.example.com/privkey.pem;
# Consumer API - publicly accessible with API key auth
location /v1/ {
proxy_pass http://tokenhub:8080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Request-ID $request_id;
# SSE streaming support
proxy_buffering off;
proxy_cache off;
proxy_read_timeout 300s;
}
# Health check
location /healthz {
proxy_pass http://tokenhub:8080;
}
# Metrics (restrict to monitoring network)
location /metrics {
allow 10.0.0.0/8;
deny all;
proxy_pass http://tokenhub:8080;
}
# Admin endpoints (restrict to admin VPN)
location /admin {
allow 10.100.0.0/16;
deny all;
proxy_pass http://tokenhub:8080;
}
}
Database
- Mount a persistent volume for the SQLite database
- WAL mode and busy timeout: Applied programmatically at startup — no DSN query parameters are needed with the pure-Go modernc.org/sqlite driver
- Schedule backups: Periodically copy the SQLite file (safe with WAL mode)
Backup Script
#!/bin/bash
# Safe SQLite backup using the .backup command
sqlite3 /data/tokenhub.sqlite ".backup /backups/tokenhub-$(date +%Y%m%d-%H%M%S).sqlite"
Monitoring
-
Prometheus scraping: Configure Prometheus to scrape
/metrics - Set up alerts based on the recommended alerting rules
- Log aggregation: Forward structured JSON logs to your log management system
- Monitor TSDB size: Set appropriate retention to prevent unbounded growth
Key Metrics to Watch
| Metric | Alert Threshold | Severity |
|---|---|---|
| Error rate | > 5% over 5 min | Warning |
| P95 latency | > 10s | Warning |
| Provider down | > 2 min | Critical |
| Cost spike | > 2x weekly average | Warning |
| Vault locked | During business hours | Critical |
| Disk usage | > 80% | Warning |
Graceful Shutdown
TokenHub handles SIGINT and SIGTERM for graceful shutdown:
- Stop accepting new connections
- Drain in-flight requests (30-second timeout)
- Stop background goroutines (prober, Thompson Sampling refresh, TSDB prune)
- Stop Temporal worker (if enabled)
- Close database connection
In Kubernetes, set terminationGracePeriodSeconds: 35 to allow the full drain.
Scaling Considerations
TokenHub is a single-process application with SQLite. For higher throughput:
- Horizontal: Run multiple instances with separate SQLite databases (no shared state; each instance routes independently)
- Temporal: Enable Temporal for durable workflow execution across restarts
- Read replicas: Not applicable (SQLite is embedded)
- Connection pooling: SQLite WAL mode supports concurrent reads natively
For very high throughput (>1000 req/s), consider migrating the store to PostgreSQL (implement the Store interface for a new backend).
CLI Administration
Use tokenhubctl for scriptable administration and health checks:
# Quick status check
tokenhubctl status
# Verify providers and models
tokenhubctl provider list
tokenhubctl model list
# Watch for issues in real time
tokenhubctl events
See tokenhubctl CLI for the full command reference.
Environment Variables Summary
See Configuration Reference for the complete list of all environment variables and their defaults.
API Reference
Complete reference for all TokenHub HTTP endpoints.
Consumer Endpoints
POST /v1/chat
Send a chat completion request with automatic model routing.
Authentication: Required (Bearer token)
Request Body:
{
"request": {
"messages": [{"role": "string", "content": "string"}],
"model_hint": "string",
"estimated_input_tokens": 0,
"parameters": {},
"stream": false,
"meta": {},
"output_schema": {}
},
"capabilities": {"planning": false},
"policy": {
"mode": "normal",
"max_budget_usd": 0.05,
"max_latency_ms": 20000,
"min_weight": 0
},
"output_format": {
"type": "json",
"schema": "string",
"max_tokens": 0,
"strip_think": false
}
}
Response: 200 OK
{
"negotiated_model": "string",
"estimated_cost_usd": 0.0,
"routing_reason": "string",
"response": {}
}
Errors: 400, 401, 403, 502
POST /v1/plan
Send an orchestrated multi-model request.
Authentication: Required (Bearer token)
Request Body:
{
"request": {
"messages": [{"role": "string", "content": "string"}]
},
"orchestration": {
"mode": "adversarial",
"iterations": 2,
"primary_model_id": "string",
"review_model_id": "string",
"primary_min_weight": 0,
"review_min_weight": 0,
"return_plan_only": false,
"output_schema": "string"
}
}
Response: 200 OK
{
"negotiated_model": "string",
"estimated_cost_usd": 0.0,
"routing_reason": "string",
"response": {}
}
Errors: 400, 401, 403, 502
Health
GET /healthz
System health check.
Response: 200 OK or 503 Service Unavailable
{
"status": "ok",
"adapters": 2,
"models": 6
}
GET /metrics
Prometheus metrics endpoint.
Response: 200 OK (text/plain, Prometheus exposition format)
Admin - Vault
POST /admin/v1/vault/unlock
Body: {"admin_password": "string"}
Response: 200 OK → {"ok": true}
POST /admin/v1/vault/lock
Response: 200 OK → {"ok": true, "already_locked": false}
POST /admin/v1/vault/rotate
Body: {"old_password": "string", "new_password": "string"}
Response: 200 OK → {"ok": true}
Admin - Providers
POST /admin/v1/providers
Create or update a provider.
Body: {"id": "string", "type": "openai|anthropic|vllm", "enabled": true, "base_url": "string", "cred_store": "vault|none", "api_key": "string"}
Response: 200 OK → {"ok": true, "cred_store": "vault"}
GET /admin/v1/providers
List all providers (from the persistent store).
Query: ?limit=N&offset=N
Response: 200 OK → {"items": [{provider objects}], "total": N, "limit": N, "offset": N}
PATCH /admin/v1/providers/{id}
Partial update of a provider. Runtime-only providers (not in the store) are automatically created in the store when first patched.
Body: {"type": "string", "base_url": "string", "enabled": true, "api_key": "string", "cred_store": "string"}
Response: 200 OK → {"ok": true, "provider": {updated provider}}
DELETE /admin/v1/providers/{id}
Delete a provider.
Response: 200 OK → {"ok": true}
GET /admin/v1/providers/{id}/discover
Discover models available from a provider by querying its /v1/models endpoint.
Response: 200 OK → {"models": [{"id": "string", "registered": false}]}
Admin - Models
POST /admin/v1/models
Create or update a model. Registers the model in both the runtime engine and persistent store.
Body: {"id": "string", "provider_id": "string", "weight": 5, "max_context_tokens": 128000, "input_per_1k": 0.01, "output_per_1k": 0.03, "enabled": true}
Response: 200 OK → {"ok": true}
GET /admin/v1/models
List all models (from the persistent store).
Query: ?limit=N&offset=N
Response: 200 OK → {"items": [{model objects}], "total": N, "limit": N, "offset": N}
PATCH /admin/v1/models/{id}
Partial model update. Model IDs can contain slashes (e.g., Qwen/Qwen2.5-Coder-32B-Instruct). Runtime-only models are automatically seeded into the store from engine data on first patch.
Body: {"weight": 7, "enabled": true, "input_per_1k": 0.015, "output_per_1k": 0.035, "max_context_tokens": 128000}
Response: 200 OK → {"ok": true, "model": {updated model}}
DELETE /admin/v1/models/{id}
Delete a model. Model IDs with slashes are supported.
Response: 200 OK → {"ok": true}
Admin - Routing
GET /admin/v1/routing-config
Get current routing defaults.
Response: 200 OK → {"default_mode": "string", "default_max_budget_usd": 0.05, "default_max_latency_ms": 20000}
PUT /admin/v1/routing-config
Set routing defaults.
Body: {"default_mode": "string", "default_max_budget_usd": 0.1, "default_max_latency_ms": 30000}
Response: 200 OK → {"ok": true}
POST /admin/v1/routing/simulate
Run a what-if routing simulation without sending a real request.
Body: {"mode": "string", "token_count": 500, "max_budget_usd": 0.05, "min_weight": 0, "model_hint": "string"}
Response: 200 OK → {"decision": {decision object}, "eligible": [{model objects}]}
Admin - API Keys
POST /admin/v1/apikeys
Create a new API key.
Body: {"name": "string", "scopes": "[\"chat\",\"plan\"]", "rotation_days": 0, "expires_in": "720h", "monthly_budget_usd": 50.0}
Response: 200 OK → {"ok": true, "key": "tokenhub_...", "id": "string", "prefix": "string", "warning": "string"}
GET /admin/v1/apikeys
List all API keys (no plaintext).
Response: 200 OK → [{key objects without plaintext}]
POST /admin/v1/apikeys/{id}/rotate
Rotate an API key.
Response: 200 OK → {"ok": true, "key": "tokenhub_...", "warning": "string"}
PATCH /admin/v1/apikeys/{id}
Update API key metadata.
Body: {"name": "string", "scopes": "string", "rotation_days": 0, "enabled": true}
Response: 200 OK → {"ok": true}
DELETE /admin/v1/apikeys/{id}
Revoke (delete) an API key.
Response: 200 OK → {"ok": true}
Admin - Observability
GET /admin/v1/health
Provider health status.
Response: 200 OK → {"providers": [{health state objects}]}
GET /admin/v1/stats
Aggregated request statistics.
Response: 200 OK → {"global": {}, "by_model": {}, "by_provider": {}}
GET /admin/v1/logs?limit=100&offset=0
Paginated request logs.
GET /admin/v1/audit?limit=100&offset=0
Paginated audit logs.
GET /admin/v1/rewards?limit=100&offset=0
Paginated reward entries.
GET /admin/v1/engine/models
Runtime model registry, adapter list, and adapter metadata.
Response: 200 OK
{
"models": [{model objects}],
"total": 7,
"adapters": ["openai", "anthropic", "vllm"],
"adapter_info": [
{"id": "openai", "health_endpoint": "https://api.openai.com/v1/models"},
{"id": "vllm", "health_endpoint": "http://vllm-1:8000/health"}
]
}
Admin - TSDB
GET /admin/v1/tsdb/query?metric=latency&model_id=gpt-4&start=...&end=...&step_ms=60000
Query time-series data.
GET /admin/v1/tsdb/metrics
List available TSDB metrics.
POST /admin/v1/tsdb/prune
Manually prune old TSDB data.
PUT /admin/v1/tsdb/retention
Set TSDB retention period.
Body: {"retention_days": 7}
Admin - Workflows (Temporal)
GET /admin/v1/workflows?limit=50&status=RUNNING
List Temporal workflow executions.
GET /admin/v1/workflows/{id}
Describe a workflow execution.
GET /admin/v1/workflows/{id}/history
Get workflow event history.
Admin - Events
GET /admin/v1/events
Server-Sent Events stream.
Content-Type: text/event-stream
Events: route_success, route_error
Admin UI
GET /admin
Serves the embedded admin SPA. The root URL (/) redirects here.
GET /admin/v1/info
Admin status information. Requires admin token authentication (Bearer header or ?token= query parameter).
Response: 200 OK
{
"tokenhub": "admin",
"vault_locked": true,
"vault_initialized": false
}
The vault_initialized field indicates whether the vault has ever been set up (salt exists). The UI uses this to distinguish first-time setup from a normal unlock prompt.
Prometheus Metrics
TokenHub exports Prometheus metrics at the /metrics endpoint.
Available Metrics
tokenhub_requests_total
Type: Counter
Total number of requests processed.
Labels:
| Label | Values | Description |
|---|---|---|
mode | cheap, normal, high_confidence, planning, adversarial, thompson | Routing mode used |
model | gpt-4, claude-opus, etc. | Model that handled the request |
provider | openai, anthropic, vllm | Provider adapter |
status | ok, error | Request outcome |
Examples:
# Total successful requests
tokenhub_requests_total{status="ok"}
# Request rate by provider
rate(tokenhub_requests_total[5m])
# Error rate
sum(rate(tokenhub_requests_total{status="error"}[5m]))
/
sum(rate(tokenhub_requests_total[5m]))
tokenhub_request_latency_ms
Type: Histogram
Request latency distribution in milliseconds.
Labels:
| Label | Values | Description |
|---|---|---|
mode | cheap, normal, etc. | Routing mode |
model | gpt-4, etc. | Model ID |
provider | openai, etc. | Provider ID |
Buckets: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120 ms (exponential, base 2)
Examples:
# Median latency
histogram_quantile(0.5, rate(tokenhub_request_latency_ms_bucket[5m]))
# P95 latency
histogram_quantile(0.95, rate(tokenhub_request_latency_ms_bucket[5m]))
# P99 latency by model
histogram_quantile(0.99, sum(rate(tokenhub_request_latency_ms_bucket[5m])) by (model, le))
# Average latency
rate(tokenhub_request_latency_ms_sum[5m]) / rate(tokenhub_request_latency_ms_count[5m])
tokenhub_cost_usd_total
Type: Counter
Cumulative estimated cost in USD.
Labels:
| Label | Values | Description |
|---|---|---|
model | gpt-4, etc. | Model ID |
provider | openai, etc. | Provider ID |
Examples:
# Total cost in the last hour
increase(tokenhub_cost_usd_total[1h])
# Cost rate (USD per second)
rate(tokenhub_cost_usd_total[5m])
# Cost per hour by model
rate(tokenhub_cost_usd_total[1h]) * 3600
# Most expensive model
topk(3, sum(rate(tokenhub_cost_usd_total[1h])) by (model))
Grafana Dashboard
Suggested Panels
| Panel | Query | Visualization |
|---|---|---|
| Request Rate | sum(rate(tokenhub_requests_total[5m])) | Time series |
| Error Rate | Error rate formula above | Gauge (0-100%) |
| P95 Latency | P95 formula above | Time series |
| Cost per Hour | Cost rate * 3600 | Stat |
| Requests by Model | sum by (model) (rate(tokenhub_requests_total[5m])) | Pie chart |
| Latency Heatmap | tokenhub_request_latency_ms_bucket | Heatmap |
Scrape Configuration
# prometheus.yml
scrape_configs:
- job_name: tokenhub
scrape_interval: 15s
metrics_path: /metrics
static_configs:
- targets: ['tokenhub:8080']
For Docker Compose, use the service name as the target.
Error Classification
TokenHub classifies provider errors to enable intelligent failover. Each error from a provider is classified into one of four categories that determine the routing engine's next action.
Error Classes
context_overflow
The request exceeds the model's context window.
Triggers:
- HTTP 413 from provider
- Response body contains
context_length_exceeded
Router action: Escalate to a model with a larger context window. If no larger model is available, try the next model in scored order.
rate_limited
The provider is throttling requests.
Triggers:
- HTTP 429 from provider
Router action: Skip to a different provider. If the response includes a Retry-After header, the delay is recorded in the classified error for optional use by the caller.
transient
A temporary server-side failure.
Triggers:
- HTTP 5xx from provider
Router action: Retry the same model with exponential backoff:
- Base delay: 100ms
- Maximum retries: 2
- Backoff multiplier: 2x (100ms, 200ms)
After retries are exhausted, try the next model.
fatal
An unrecoverable client error.
Triggers:
- HTTP 4xx (except 429 and 413)
- Any other unclassified error
Router action: Skip to the next model in scored order. No retry.
Error Flow
Provider returns error
│
├── adapter.ClassifyError(err) → ClassifiedError{Class, RetryAfter}
│
└── Router handles based on class:
├── context_overflow → Find bigger model
├── rate_limited → Different provider (respect RetryAfter)
├── transient → Retry with backoff (up to 2x)
└── fatal → Next model
ClassifiedError Type
type ClassifiedError struct {
Err error
Class ErrorClass // "context_overflow", "rate_limited", "transient", "fatal"
RetryAfter float64 // Seconds to wait (from Retry-After header, 429 only)
}
HTTP Error Responses
Consumer API Errors
| Status | Meaning | When |
|---|---|---|
| 400 | Bad Request | Invalid JSON, missing messages, validation failure |
| 401 | Unauthorized | Missing or invalid API key |
| 403 | Forbidden | Valid key but insufficient scopes |
| 502 | Bad Gateway | All models failed, no eligible models, or provider errors |
Admin API Errors
| Status | Meaning | When |
|---|---|---|
| 400 | Bad Request | Invalid parameters or validation failure |
| 404 | Not Found | Resource not found (model, key, provider) |
| 500 | Internal Server Error | Database or vault errors |
Provider-Specific Classification
OpenAI
| HTTP Status | Body Pattern | Error Class |
|---|---|---|
| 429 | — | rate_limited |
| 500-599 | — | transient |
| 400 | context_length_exceeded | context_overflow |
| Other 4xx | — | fatal |
Anthropic
| HTTP Status | Body Pattern | Error Class |
|---|---|---|
| 429 | — | rate_limited |
| 500-599 | — | transient |
| 400 | context_length_exceeded | context_overflow |
| Other 4xx | — | fatal |
vLLM
| HTTP Status | Body Pattern | Error Class |
|---|---|---|
| 429 | — | rate_limited |
| 500-599 | — | transient |
| 400 | context_length_exceeded | context_overflow |
| Other 4xx | — | fatal |
Reward Impact
Error classification affects the contextual bandit reward system:
- Successful requests: Reward computed from latency and cost
- Failed requests: Reward = 0.0 (regardless of error class)
- Error class is stored in reward entries for analysis
This ensures the Thompson Sampling policy learns to avoid unreliable models over time.