Introduction

TokenHub is an intelligent LLM routing proxy that sits between your applications and multiple AI providers. It provides a unified API for chat and planning requests while automatically selecting the best model based on cost, latency, capability, and provider health.

What TokenHub Does

  • Unified API: Single endpoint for OpenAI, Anthropic, and vLLM models
  • Intelligent Routing: Multi-objective model selection considering cost, latency, capability weight, and provider health
  • Orchestration: Multi-model reasoning with adversarial critique, voting, and iterative refinement modes
  • Credential Security: AES-256-GCM encrypted vault for provider API keys with auto-lock and password rotation
  • Client Key Management: Issue, rotate, and revoke API keys for your applications
  • Real-Time Monitoring: Prometheus metrics, time-series database, audit logs, and a built-in admin UI
  • Streaming: Server-Sent Events (SSE) streaming pass-through to all providers
  • Reinforcement Learning: Thompson Sampling bandit policy for adaptive model routing

Architecture at a Glance

┌─────────────┐     ┌──────────────────────────────────────────────┐
│ Client App  │────▶│                  TokenHub                    │
│             │◀────│                                              │
└─────────────┘     │  ┌─────────┐  ┌─────────┐  ┌──────────────┐  │
                    │  │ Router  │──│ Health  │  │  Admin API   │  │
                    │  │ Engine  │  │ Tracker │  │  + UI (SPA)  │  │
                    │  └────┬────┘  └─────────┘  └──────────────┘  │
                    │       │                                      │
                    │  ┌────┴──────────────────────────┐           │
                    │  │        Provider Adapters       │           │
                    │  │  ┌────────┐┌─────────┐┌────┐  │           │
                    │  │  │ OpenAI ││Anthropic││vLLM│  │           │
                    │  │  └────────┘└─────────┘└────┘  │           │
                    │  └───────────────────────────────┘           │
                    │                                              │
                    │  ┌─────────┐ ┌──────┐ ┌──────┐ ┌─────────┐  │
                    │  │ SQLite  │ │ TSDB │ │Vault │ │Temporal │  │
                    │  └─────────┘ └──────┘ └──────┘ └─────────┘  │
                    └──────────────────────────────────────────────┘

Who This Documentation Is For

  • Users / Application Developers: Learn how to send requests through TokenHub and use features like streaming, directives, and output formatting. Start with the User Guide.
  • Administrators: Configure providers, manage credentials, set routing policies, and monitor the system. Start with the Administrator Guide.
  • Developers / Contributors: Understand the internals, extend provider support, or contribute to the project. Start with the Developer Guide.

Task                     Where to Go
Send your first request  Quick Start
Configure providers      Provider Management
Set up API keys          API Key Management
Command-line admin       tokenhubctl CLI
Deploy with Docker       Docker & Compose
Full API reference       API Reference
Monitor the system       Monitoring

Quick Start

This guide gets TokenHub running and serving your first request in under five minutes.

Prerequisites

  • Docker (for Docker Compose), or Go 1.24+ (for building from source)
  • At least one LLM provider endpoint and API key

TokenHub works with any OpenAI-compatible API, the Anthropic API, or vLLM. This includes services like NVIDIA NIM, Azure OpenAI, Together AI, Groq, Fireworks, Mistral, local Ollama instances — anything that speaks the OpenAI /v1/chat/completions protocol.

1. Start the Server

git clone https://github.com/jordanhubbard/tokenhub.git
cd tokenhub
docker compose up -d tokenhub

Build from Source

git clone https://github.com/jordanhubbard/tokenhub.git
cd tokenhub
make install      # builds and installs tokenhub + tokenhubctl to ~/.local/bin
tokenhub

TokenHub starts on port 8080 by default. Docker Compose maps this to host port 8090. Adjust the examples below accordingly.

2. Register Providers

A freshly started TokenHub has no providers configured. You need to tell it where your LLM endpoints are. There are several ways to do this. Pick whichever fits your workflow.

Option A: Credentials file (declarative)

The ~/.tokenhub/credentials file is a declarative JSON file that seeds providers and models at startup. It lives outside the source tree, must have 0600 permissions, and is processed before the service accepts requests.

API keys are automatically stored in the vault (when TOKENHUB_VAULT_PASSWORD is set) and providers are persisted to the database on first boot. The file is idempotent — it can stay in place across restarts.

mkdir -p ~/.tokenhub
chmod 700 ~/.tokenhub
cat > ~/.tokenhub/credentials << 'EOF'
{
  "providers": [
    {
      "id": "ollama",
      "type": "openai",
      "base_url": "http://localhost:11434"
    },
    {
      "id": "nvidia",
      "type": "openai",
      "base_url": "https://integrate.api.nvidia.com",
      "api_key": "nvapi-..."
    }
  ],
  "models": [
    {
      "id": "llama3.1:8b",
      "provider_id": "ollama",
      "weight": 5,
      "max_context_tokens": 8192,
      "input_per_1k": 0.0,
      "output_per_1k": 0.0
    },
    {
      "id": "meta/llama-3.1-70b-instruct",
      "provider_id": "nvidia",
      "weight": 8,
      "max_context_tokens": 128000,
      "input_per_1k": 0.0003,
      "output_per_1k": 0.0003
    }
  ]
}
EOF
chmod 600 ~/.tokenhub/credentials

Then start (or restart) the server so the file is processed:

make run    # builds image, starts compose, tails logs

Override the default path with TOKENHUB_CREDENTIALS_FILE.

Option B: tokenhubctl (interactive)

With the server already running, use the CLI directly:

export TOKENHUB_URL="http://localhost:8090"

# Register a provider
tokenhubctl provider add '{
    "id": "openai",
    "type": "openai",
    "base_url": "https://api.openai.com",
    "api_key": "sk-..."
}'

# Register a model on that provider
tokenhubctl model add '{
    "id": "gpt-4o",
    "provider_id": "openai",
    "weight": 8,
    "max_context_tokens": 128000,
    "input_per_1k": 0.0025,
    "output_per_1k": 0.01,
    "enabled": true
}'

Option C: Admin UI

Open http://localhost:8090/admin in your browser. The setup wizard walks you through adding your first provider: select the type, enter the base URL and API key, test the connection, then discover and register available models — all without touching the command line.

Option D: Admin API (curl)

# Register a provider
curl -X POST http://localhost:8090/admin/v1/providers \
  -H "Content-Type: application/json" \
  -d '{
    "id": "anthropic",
    "type": "anthropic",
    "base_url": "https://api.anthropic.com",
    "api_key": "sk-ant-...",
    "enabled": true
  }'

# Register a model
curl -X POST http://localhost:8090/admin/v1/models \
  -H "Content-Type: application/json" \
  -d '{
    "id": "claude-sonnet-4-5-20250514",
    "provider_id": "anthropic",
    "weight": 8,
    "max_context_tokens": 200000,
    "input_per_1k": 0.003,
    "output_per_1k": 0.015,
    "enabled": true
  }'

Providers persist across restarts. Once registered via the credentials file, the API, tokenhubctl, or the UI, providers and models are stored in the database and restored automatically on restart. You only need to configure them once. API keys for vault-backed providers require the vault to be unlocked after restart (set TOKENHUB_VAULT_PASSWORD for automatic unlock).

3. Verify It's Running

curl http://localhost:8090/healthz

Or:

tokenhubctl status

Expected response:

{"status": "ok", "adapters": 2, "models": 2}

4. Create an API Key

TokenHub issues its own API keys to clients. Provider keys stay on the server.

tokenhubctl apikey create '{"name":"my-first-key","scopes":"[\"chat\",\"plan\"]"}'

Or via curl:

curl -X POST http://localhost:8090/admin/v1/apikeys \
  -H "Content-Type: application/json" \
  -d '{"name": "my-first-key", "scopes": "[\"chat\",\"plan\"]"}'

Save the returned key value — it is shown only once:

{
  "ok": true,
  "key": "tokenhub_a1b2c3d4...",
  "id": "a1b2c3d4e5f6g7h8",
  "prefix": "tokenhub_a1b2c3d4"
}

5. Send Your First Request

curl -X POST http://localhost:8090/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_a1b2c3d4..." \
  -d '{
    "request": {
      "messages": [
        {"role": "user", "content": "What is the capital of France?"}
      ]
    }
  }'

TokenHub selects the best available model based on its routing policy and returns the response:

{
  "negotiated_model": "gpt-4o",
  "estimated_cost_usd": 0.0023,
  "routing_reason": "routed-weight-8",
  "response": {
    "choices": [{
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      }
    }]
  }
}

6. Explore

# See all registered providers and models
tokenhubctl provider list
tokenhubctl model list

# Watch routing decisions in real time
tokenhubctl events

# Open the admin dashboard
open http://localhost:8090/admin

User Guide Overview

This section is for application developers integrating with TokenHub. TokenHub exposes two main endpoints:

Endpoint       Purpose
POST /v1/chat  Single-turn or multi-turn chat completion
POST /v1/plan  Multi-model orchestrated reasoning

Both endpoints accept a unified request format and return the provider's response along with routing metadata (which model was chosen, estimated cost, and routing reason).

Key Concepts

Routing Policies

Every request can include a policy that guides model selection:

  • cheap — Minimize cost (prefer smaller, cheaper models)
  • normal — Balance cost, latency, capability, and reliability
  • high_confidence — Prefer the most capable models regardless of cost
  • planning — Optimized for planning and reasoning tasks
  • thompson — Adaptive selection using reinforcement learning

If no policy is specified, the server's default routing mode applies.

Model Selection

TokenHub maintains a registry of models from all configured providers. Each model has:

  • Weight (0-10): Higher weight = more capable
  • Context window: Maximum tokens the model can process
  • Pricing: Cost per 1,000 input and output tokens
  • Health status: Based on recent success/failure rates

The routing engine scores all eligible models and selects the best match for your request.
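As a mental model, the scoring step can be sketched in a few lines. The weight profiles below match the routing-mode table in the Chat API reference, but the field names and normalization are illustrative, not TokenHub's actual internals:

```python
# Illustrative sketch of multi-objective model scoring (not TokenHub's
# real code). Each candidate is scored on normalized cost, latency,
# failure rate, and capability; the mode's weights set the trade-off.

MODE_WEIGHTS = {
    # mode: (cost, latency, failure, capability)
    "cheap":           (0.7, 0.1, 0.1, 0.1),
    "normal":          (0.25, 0.25, 0.25, 0.25),
    "high_confidence": (0.05, 0.1, 0.15, 0.7),
    "planning":        (0.1, 0.1, 0.2, 0.6),
}

def score(model, mode="normal"):
    """Lower cost/latency/failure and higher weight mean a higher score."""
    wc, wl, wf, wcap = MODE_WEIGHTS[mode]
    return (wc * (1 - model["norm_cost"])
            + wl * (1 - model["norm_latency"])
            + wf * (1 - model["failure_rate"])
            + wcap * model["weight"] / 10)

def select(models, mode="normal", min_weight=0):
    eligible = [m for m in models if m["weight"] >= min_weight]
    return max(eligible, key=lambda m: score(m, mode))

models = [
    {"id": "llama3.1:8b", "weight": 5, "norm_cost": 0.0,
     "norm_latency": 0.2, "failure_rate": 0.01},
    {"id": "gpt-4o", "weight": 8, "norm_cost": 0.6,
     "norm_latency": 0.4, "failure_rate": 0.02},
]

print(select(models, "cheap")["id"])            # the free local model wins
print(select(models, "high_confidence")["id"])  # the more capable model wins
```

The same request can route to different models purely by changing the mode, which is why the policy block matters.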

Authentication

All /v1 requests require an API key in the Authorization header:

Authorization: Bearer tokenhub_<key>

API keys are created and managed by administrators. Each key has scopes controlling which endpoints it can access (chat, plan, or both).

Provider Transparency

You interact only with TokenHub. The underlying provider (OpenAI, Anthropic, vLLM) is selected automatically and its API key is never exposed. The response includes which model and provider were used in the negotiated_model field.

Chat API

The chat endpoint provides single-turn or multi-turn completions with automatic model routing.

Endpoint: POST /v1/chat

Request Format

{
  "request": {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "model_hint": "gpt-4",
    "estimated_input_tokens": 500,
    "parameters": {
      "temperature": 0.7,
      "max_tokens": 1024,
      "top_p": 0.9
    },
    "stream": false,
    "meta": {
      "user_id": "u123",
      "session": "abc"
    }
  },
  "capabilities": {
    "planning": true
  },
  "policy": {
    "mode": "normal",
    "max_budget_usd": 0.05,
    "max_latency_ms": 15000,
    "min_weight": 5
  },
  "output_format": {
    "type": "json",
    "schema": "{\"type\":\"object\",\"properties\":{\"answer\":{\"type\":\"string\"}}}",
    "max_tokens": 500,
    "strip_think": true
  }
}

Request Fields

request (required)

Field                   Type    Required  Description
messages                array   Yes       Array of {role, content} message objects
model_hint              string  No        Preferred model ID; tried first before scoring
estimated_input_tokens  int     No        Token count hint for routing decisions
parameters              object  No        Provider parameters forwarded as-is (temperature, max_tokens, top_p, etc.)
stream                  bool    No        Enable SSE streaming response
meta                    object  No        Arbitrary metadata for logging and tracing
output_schema           JSON    No        JSON Schema for structured output validation

policy (optional)

Controls model selection behavior. All fields are optional and fall back to server defaults.

Field           Type    Default  Range      Description
mode            string  normal   See below  Routing mode
max_budget_usd  float   0.05     0-100      Maximum cost per request
max_latency_ms  int     20000    0-300000   Maximum acceptable latency
min_weight      int     0        0-10       Minimum model capability weight

Routing modes:

Mode             Cost Weight  Latency Weight  Failure Weight  Capability Weight
cheap            0.7          0.1             0.1             0.1
normal           0.25         0.25            0.25            0.25
high_confidence  0.05         0.1             0.15            0.7
planning         0.1          0.1             0.2             0.6
thompson         N/A          N/A             N/A             N/A

The thompson mode uses reinforcement learning (Thompson Sampling with Beta distributions) to adaptively select models based on historical reward data.
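A toy version shows the mechanics (names like ThompsonRouter are ours, not a TokenHub API): each model keeps a Beta posterior over its success rate, each request routes to the model with the highest posterior draw, and the observed outcome updates that posterior.

```python
import random

# Illustrative Thompson Sampling sketch. Each model tracks
# Beta(successes + 1, failures + 1); we sample one draw per model
# per request and route to the highest draw.

class ThompsonRouter:
    def __init__(self, model_ids):
        self.stats = {m: [1, 1] for m in model_ids}  # [alpha, beta]

    def pick(self):
        draws = {m: random.betavariate(a, b) for m, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, model_id, success):
        a, b = self.stats[model_id]
        self.stats[model_id] = [a + 1, b] if success else [a, b + 1]

random.seed(7)
router = ThompsonRouter(["gpt-4o", "llama3.1:8b"])
# Simulated reward rates: gpt-4o succeeds 90% of the time, llama 50%.
rates = {"gpt-4o": 0.9, "llama3.1:8b": 0.5}
for _ in range(500):
    m = router.pick()
    router.update(m, random.random() < rates[m])

best = max(router.stats, key=lambda m: router.stats[m][0])
print(best)  # traffic concentrates on the more reliable model
```

Early on the policy explores both models; as evidence accumulates it exploits the one with the better observed reward, without any hand-tuned weights.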

capabilities (optional)

Field     Type  Description
planning  bool  Indicates request needs planning capability

Capabilities influence which routing mode profile is used when no explicit mode is set.

output_format (optional)

Field        Type    Description
type         string  Output format: json, markdown, text, xml
schema       string  JSON Schema string for validating structured output
max_tokens   int     Maximum output tokens to request from provider
strip_think  bool    Remove <think>...</think> blocks from response

Response Format

{
  "negotiated_model": "gpt-4",
  "estimated_cost_usd": 0.0023,
  "routing_reason": "routed-weight-8",
  "response": {
    "id": "chatcmpl-...",
    "choices": [{
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses..."
      }
    }],
    "usage": {
      "prompt_tokens": 45,
      "completion_tokens": 128,
      "total_tokens": 173
    }
  }
}

Field               Description
negotiated_model    The model ID that was selected
estimated_cost_usd  Estimated cost based on model pricing and token counts
routing_reason      Why this model was chosen (see Routing Reasons)
response            Raw JSON response from the selected provider

Routing Reasons

Reason                      Description
routed-weight-N             Selected by scoring; N is the model's weight
model-hint                  Client's model hint was used
escalated-context-overflow  Escalated to a model with a larger context window
retried-transient           Retried after a transient provider error

Error Responses

Status  Body                                        Cause
400     "bad json"                                  Malformed request body
400     "messages required"                         Empty messages array
400     "max_budget_usd must be between 0 and 100"  Policy validation failure
401     "missing or invalid api key"                Missing or invalid Authorization header
403     "scope not allowed"                         API key lacks the chat scope
502     Error message                               All models failed or no eligible models

Examples

Minimal Request

curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Hello!"}]
    }
  }'

Cost-Optimized Request

curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Summarize this text..."}]
    },
    "policy": {
      "mode": "cheap",
      "max_budget_usd": 0.001
    }
  }'

Request with Model Hint

curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Write a poem about the ocean."}],
      "model_hint": "claude-opus",
      "parameters": {
        "temperature": 0.9,
        "max_tokens": 2048
      }
    }
  }'

Structured JSON Output

curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "List 3 programming languages with their year of creation"}]
    },
    "output_format": {
      "type": "json",
      "schema": "{\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"},\"year\":{\"type\":\"integer\"}}}}"
    }
  }'

Plan API

The plan endpoint provides multi-model orchestrated reasoning. It coordinates multiple LLM calls using different strategies to produce higher-quality outputs than a single model call.

Endpoint: POST /v1/plan

Request Format

{
  "request": {
    "messages": [
      {"role": "user", "content": "Design a REST API for a task management app"}
    ]
  },
  "orchestration": {
    "mode": "adversarial",
    "iterations": 2,
    "primary_model_id": "claude-opus",
    "review_model_id": "gpt-4",
    "primary_min_weight": 5,
    "review_min_weight": 8,
    "return_plan_only": false,
    "output_schema": "{\"type\":\"object\"}"
  }
}

Orchestration Modes

Adversarial

A three-phase plan-critique-refine loop:

  1. Plan: Primary model generates an initial plan
  2. Critique: Review model analyzes the plan and provides feedback
  3. Refine: Primary model improves the plan based on the critique

The critique-refine cycle repeats for the configured number of iterations.

{
  "orchestration": {
    "mode": "adversarial",
    "iterations": 2
  }
}

Response:

{
  "negotiated_model": "claude-opus",
  "estimated_cost_usd": 0.15,
  "routing_reason": "adversarial-orchestration",
  "response": {
    "initial_plan": "Here is the initial API design...",
    "critique": "The design has these issues: ...",
    "refined_plan": "Here is the improved design addressing the feedback..."
  }
}
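The control flow can be sketched as follows; call_model here is a stand-in for routed LLM calls, not a TokenHub API, and the sketch assumes at least one iteration:

```python
# Minimal plan-critique-refine sketch with a stubbed model call.

calls = {"n": 0}

def call_model(role, prompt):
    calls["n"] += 1  # stub; a real client would POST through /v1/chat
    return f"[{role} response {calls['n']}]"

def adversarial(task, iterations=2):
    plan = call_model("primary", f"Plan: {task}")
    result = {"initial_plan": plan}
    for _ in range(iterations):
        critique = call_model("review", f"Critique this plan: {plan}")
        plan = call_model("primary", f"Refine using: {critique}")
    result["critique"] = critique
    result["refined_plan"] = plan
    return result

out = adversarial("Design a REST API for tasks", iterations=2)
print(calls["n"])  # plan + 2 x (critique + refine) = 5 calls
```

Note the call count: two iterations cost five model calls, which is where the 5x multiplier in Cost Considerations comes from.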

Vote

Multiple models respond independently, then a judge model selects the best:

  1. N models (voters) each generate a response to the same prompt
  2. A judge model reviews all responses and selects the best one

{
  "orchestration": {
    "mode": "vote"
  }
}

Response:

{
  "negotiated_model": "gpt-4",
  "estimated_cost_usd": 0.08,
  "routing_reason": "vote-orchestration",
  "response": {
    "responses": [
      {"model": "gpt-4", "content": "Response A...", "selected": true},
      {"model": "claude-sonnet", "content": "Response B...", "selected": false},
      {"model": "gpt-3.5-turbo", "content": "Response C...", "selected": false}
    ],
    "selected": 0,
    "judge": "claude-opus"
  }
}

Refine

A single model iteratively improves its own response:

  1. Model generates an initial response
  2. Model reviews and refines its own response (repeats for N iterations)

{
  "orchestration": {
    "mode": "refine",
    "iterations": 3
  }
}

Response:

{
  "negotiated_model": "claude-opus",
  "estimated_cost_usd": 0.12,
  "routing_reason": "refine-orchestration",
  "response": {
    "refined_response": "Final refined response...",
    "iterations": 3,
    "model": "claude-opus"
  }
}

Planning

Simple single-route with the planning weight profile (prioritizes capable models):

{
  "orchestration": {
    "mode": "planning"
  }
}

Orchestration Fields

Field               Type    Default   Range      Description
mode                string  planning  See above  Orchestration strategy
iterations          int     1-2       0-10       Number of refinement iterations
primary_model_id    string  -         -          Explicit model for primary phase
review_model_id     string  -         -          Explicit model for review/judge phase
primary_min_weight  int     0         0-10       Minimum weight for primary model
review_min_weight   int     0         0-10       Minimum weight for review model
return_plan_only    bool    false     -          Return plan without executing refinement
output_schema       string  -         -          JSON Schema for structured output validation

Explicit Model Selection

By default, TokenHub selects models using its routing engine. You can override this with explicit model IDs:

{
  "orchestration": {
    "mode": "adversarial",
    "primary_model_id": "claude-opus",
    "review_model_id": "gpt-4"
  }
}

Alternatively, use primary_min_weight and review_min_weight to set capability floors without specifying exact models:

{
  "orchestration": {
    "mode": "adversarial",
    "primary_min_weight": 7,
    "review_min_weight": 9
  }
}

Error Responses

Status  Body                                   Cause
400     "messages required"                    Empty messages array
400     "iterations must be between 0 and 10"  Invalid iteration count
400     "unknown orchestration mode"           Unrecognized mode value
401     "missing or invalid api key"           Authentication failure
403     "scope not allowed"                    API key lacks the plan scope
502     Error message                          Orchestration failed (all models failed)

Cost Considerations

Orchestration modes make multiple LLM calls. Approximate cost multipliers:

Mode                  Calls per Request                 Typical Cost Multiplier
Planning              1                                 1x
Adversarial (2 iter)  5 (plan + 2x(critique + refine))  5x
Vote (3 voters)       4 (3 voters + 1 judge)            4x
Refine (3 iter)       4 (initial + 3 refinements)       4x

Budget accordingly when setting max_budget_usd in your policy.
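The call counts above reduce to simple formulas; this sketch (the function name is ours) reproduces the table:

```python
# Calls per request for each orchestration mode, per the cost table.

def calls_per_request(mode, iterations=0, voters=3):
    if mode == "planning":
        return 1
    if mode == "adversarial":
        return 1 + 2 * iterations  # plan + (critique + refine) per iteration
    if mode == "vote":
        return voters + 1          # N voters + 1 judge
    if mode == "refine":
        return 1 + iterations      # initial + N refinements
    raise ValueError("unknown orchestration mode")

print(calls_per_request("adversarial", iterations=2))  # 5
print(calls_per_request("vote", voters=3))             # 4
print(calls_per_request("refine", iterations=3))       # 4
```

Multiplying a typical single-call cost by these counts gives a reasonable ceiling for max_budget_usd.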

Streaming

TokenHub supports Server-Sent Events (SSE) streaming for chat requests. When streaming is enabled, tokens are delivered incrementally as they are generated by the provider.

Enabling Streaming

Set stream: true in your request:

{
  "request": {
    "messages": [{"role": "user", "content": "Tell me a story..."}],
    "stream": true
  }
}

Response Format

Streaming responses use the text/event-stream content type. Each event is a line beginning with the data: prefix:

data: {"choices":[{"delta":{"content":"Once"},"index":0}]}

data: {"choices":[{"delta":{"content":" upon"},"index":0}]}

data: {"choices":[{"delta":{"content":" a"},"index":0}]}

data: {"choices":[{"delta":{"content":" time"},"index":0}]}

data: [DONE]

The stream ends with data: [DONE].

Response Headers

Streaming responses include these headers:

Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
X-TokenHub-Model: gpt-4
X-TokenHub-Provider: openai
X-TokenHub-Reason: routed-weight-8

The X-TokenHub-* headers provide routing metadata that would normally be in the JSON response envelope.

Example with curl

curl -N -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Count from 1 to 10 slowly."}],
      "stream": true
    }
  }'

The -N flag disables output buffering so tokens appear as they arrive.

Example with Python

import requests
import json

response = requests.post(
    "http://localhost:8080/v1/chat",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer tokenhub_..."
    },
    json={
        "request": {
            "messages": [{"role": "user", "content": "Tell me a story."}],
            "stream": True
        }
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        text = line.decode("utf-8")
        if text.startswith("data: ") and text != "data: [DONE]":
            chunk = json.loads(text[6:])
            delta = chunk["choices"][0].get("delta", {})
            if "content" in delta:
                print(delta["content"], end="", flush=True)

Provider Compatibility

All three provider adapters support streaming:

Provider   Streaming Protocol
OpenAI     SSE (native)
Anthropic  SSE (native)
vLLM       SSE (OpenAI-compatible)

TokenHub passes the SSE stream through directly from the selected provider. The event format matches the provider's native format.

Failover Behavior

Streaming uses the same model selection and failover logic as non-streaming requests. If the selected model fails to establish a stream, TokenHub falls back through eligible models in scored order.

However, once streaming has begun (first bytes sent to the client), failover is not possible. If the provider disconnects mid-stream, the stream ends with an error event.

Limitations

  • Streaming is only available on /v1/chat, not /v1/plan
  • Output format validation (output_format.schema) is not applied to streaming responses
  • Cost estimation in streaming responses may be less accurate since token counts are not known until the stream completes
  • When Temporal workflows are enabled, streaming bypasses Temporal and uses direct engine dispatch

In-Band Directives

TokenHub supports embedding routing directives directly in message content. This allows clients to override routing policy without changing the request structure, which is useful when working through intermediary systems that pass messages through unchanged.

Single-Line Directive

Embed a directive anywhere in a message's content using the @@tokenhub prefix:

@@tokenhub mode=cheap budget=0.01 latency=5000 min_weight=5

Example in a full request:

{
  "request": {
    "messages": [
      {
        "role": "user",
        "content": "@@tokenhub mode=cheap budget=0.005\nSummarize this document..."
      }
    ]
  }
}

Block Directive

For complex directives (especially those containing JSON schemas), use the block format:

@@tokenhub
mode=high_confidence
budget=0.10
latency=30000
min_weight=8
output_schema={"type":"object","properties":{"answer":{"type":"string"},"confidence":{"type":"number"}}}
@@end

The block starts with @@tokenhub on its own line and ends with @@end.

Supported Keys

Key            Type    Maps To                Description
mode           string  policy.mode            Routing mode (cheap, normal, high_confidence, planning, adversarial)
budget         float   policy.max_budget_usd  Maximum cost in USD
latency        int     policy.max_latency_ms  Maximum latency in milliseconds
min_weight     int     policy.min_weight      Minimum model capability weight
output_schema  JSON    request.output_schema  JSON Schema for structured output

Processing Rules

  1. Scanning: TokenHub scans all messages for directives. The last directive found takes precedence.
  2. Stripping: Directives are removed from message content before forwarding to the provider. The LLM never sees @@tokenhub text.
  3. Override: Directive values override both server defaults and request-level policy fields.
  4. Partial override: You can set only the fields you want to override. Unspecified fields retain their values from the request policy or server defaults.
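A minimal parser for the single-line form, following the rules above (an illustrative reconstruction, not TokenHub's implementation; the block form is not handled here):

```python
import re

# Extract @@tokenhub key=value directives and strip them from content.

DIRECTIVE = re.compile(r"^@@tokenhub[ \t]+(.*)$", re.MULTILINE)

def extract_directives(content):
    overrides = {}
    def grab(match):
        for pair in match.group(1).split():
            key, _, value = pair.partition("=")
            overrides[key] = value  # later directives overwrite earlier ones
        return ""  # strip the directive line; the LLM never sees it
    cleaned = DIRECTIVE.sub(grab, content).lstrip("\n")
    return cleaned, overrides

cleaned, overrides = extract_directives(
    "@@tokenhub mode=cheap budget=0.005\nSummarize this document..."
)
print(cleaned)     # directive removed
print(overrides)   # parsed key/value overrides
```

Because later matches overwrite earlier keys in the dict, the last directive found naturally takes precedence, as rule 1 requires.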

Examples

Cost-optimize a specific request

@@tokenhub mode=cheap budget=0.001
What is 2 + 2?

Force high-quality response

@@tokenhub mode=high_confidence min_weight=9
Write a detailed analysis of the economic implications of quantum computing.

Structured output via directive

@@tokenhub
output_schema={"type":"object","properties":{"name":{"type":"string"},"population":{"type":"integer"}}}
@@end
What is the most populous city in Japan?

Output Formats

TokenHub can shape provider responses into specific output formats. This is useful for applications that need structured data from LLM responses.

Configuration

Set the output_format field in your chat request:

{
  "output_format": {
    "type": "json",
    "schema": "{\"type\":\"object\",\"properties\":{\"answer\":{\"type\":\"string\"}}}",
    "max_tokens": 500,
    "strip_think": true
  }
}

Format Types

JSON

Validates the response against a JSON Schema. If the provider's output doesn't match the schema, TokenHub returns a validation error.

{
  "output_format": {
    "type": "json",
    "schema": "{\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"},\"value\":{\"type\":\"number\"}}}}"
  }
}

The schema is passed as a string (not a nested object) to allow maximum flexibility.

Markdown

Requests the provider to format its response as Markdown:

{
  "output_format": {
    "type": "markdown"
  }
}

Text

Plain text output with optional truncation:

{
  "output_format": {
    "type": "text",
    "max_tokens": 200
  }
}

XML

Requests XML-formatted output:

{
  "output_format": {
    "type": "xml"
  }
}

Output Format Fields

Field        Type    Description
type         string  Output format: json, markdown, text, xml
schema       string  JSON Schema for validation (only with type: "json")
max_tokens   int     Maximum output tokens to request from the provider
strip_think  bool    Remove <think>...</think> reasoning blocks from the response

Think Block Stripping

Some models (particularly those with chain-of-thought reasoning) wrap their internal reasoning in <think>...</think> tags. Setting strip_think: true removes these blocks from the final response:

Before stripping:

<think>
The user wants to know the capital of France. This is a straightforward factual question.
</think>
The capital of France is Paris.

After stripping:

The capital of France is Paris.
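The stripping behavior can be reproduced with a short regular expression (an illustrative sketch, not TokenHub's code):

```python
import re

# Remove <think>...</think> blocks (including multi-line ones) and
# tidy the surrounding whitespace.

THINK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text):
    return THINK.sub("", text).strip()

raw = """<think>
The user wants to know the capital of France. This is a straightforward
factual question.
</think>
The capital of France is Paris."""

print(strip_think(raw))
```

The non-greedy `.*?` matters: with a greedy match, two think blocks in one response would swallow the answer between them.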

JSON Schema Validation

When type: "json" is specified with a schema, TokenHub:

  1. Sends the request to the provider (with a system message hint to produce JSON)
  2. Parses the provider's response as JSON
  3. Validates against the provided JSON Schema
  4. Returns the validated JSON in the response

If validation fails, the error is returned in the response body with a 502 status.

Authentication

All requests to TokenHub's consumer API (/v1/*) require authentication via API keys.

API Key Format

TokenHub API keys follow this format:

tokenhub_<64 hex characters>

Example: tokenhub_a1b2c3d4e5f6789012345678abcdef0123456789abcdef0123456789abcdef01
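A client-side sanity check for the documented key shape (illustrative only; the server validates keys against bcrypt hashes, not by pattern):

```python
import re

# A TokenHub key is "tokenhub_" followed by 64 lowercase hex characters.

KEY_RE = re.compile(r"^tokenhub_[0-9a-f]{64}$")

def looks_like_tokenhub_key(key):
    return bool(KEY_RE.fullmatch(key))

good = "tokenhub_" + "a1b2c3d4e5f6789012345678abcdef0123456789abcdef0123456789abcdef01"
print(looks_like_tokenhub_key(good))      # True
print(looks_like_tokenhub_key("sk-abc"))  # False
```

A check like this catches accidentally pasted provider keys (sk-..., nvapi-...) before they leave the client.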

Using API Keys

Include the key in the Authorization header as a Bearer token:

curl -X POST http://localhost:8080/v1/chat \
  -H "Authorization: Bearer tokenhub_a1b2c3d4..." \
  -H "Content-Type: application/json" \
  -d '{"request": {"messages": [{"role": "user", "content": "Hello"}]}}'

Scopes

Each API key has scopes that control which endpoints it can access:

Scope  Endpoint       Description
chat   POST /v1/chat  Chat completion requests
plan   POST /v1/plan  Orchestrated planning requests

A key with scopes ["chat", "plan"] can access both endpoints. A key with only ["chat"] receives a 403 Forbidden when calling /v1/plan.

If scopes are empty ([]), the key has access to all endpoints.
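The scope rule is small enough to state as code (a sketch, not TokenHub's implementation):

```python
# Empty scope list means unrestricted; otherwise the endpoint's scope
# must appear in the key's scopes.

def scope_allows(key_scopes, required):
    return not key_scopes or required in key_scopes

print(scope_allows(["chat", "plan"], "plan"))  # True
print(scope_allows(["chat"], "plan"))          # False: 403 scope not allowed
print(scope_allows([], "plan"))                # True: empty scopes allow all
```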

Error Responses

Status  Message                       Cause
401     "missing or invalid api key"  No Authorization header, invalid format, wrong key, expired, or disabled
403     "scope not allowed"           Valid key but lacks the required scope

Key Lifecycle

  1. Created by an administrator via the admin API or UI
  2. Distributed to the client application (plaintext shown only once at creation)
  3. Used by the client for all /v1 requests
  4. Rotated periodically (manually or on a configured schedule)
  5. Revoked when no longer needed

Keys can be configured with:

  • Expiration: Automatic expiry after a set duration
  • Rotation schedule: Recommended rotation period in days
  • Enable/disable: Temporarily deactivate without deleting

Security Properties

  • Plaintext is never stored: Only a bcrypt hash is persisted
  • Shown once: The plaintext key is returned only at creation and rotation
  • Provider isolation: Clients authenticate with TokenHub keys. Provider API keys are stored encrypted in the vault and never exposed.
  • Validation cache: A 5-minute TTL cache reduces bcrypt overhead without compromising security

See API Key Management for the administrator's guide to creating and managing keys.

Administrator Guide Overview

This section covers how to configure, manage, and monitor a TokenHub deployment.

Administration Model

TokenHub uses a three-tier security model:

  1. Admin token (TOKENHUB_ADMIN_TOKEN): Authenticates access to the admin API (/admin/v1/*) and the admin dashboard. The UI requires the token at login; all admin API calls include it as Authorization: Bearer <token>. Retrieve it with tokenhubctl admin-token.
  2. Vault password: A separate secret that encrypts provider API keys at rest. Even a valid admin token cannot decrypt the vault — the vault must be explicitly unlocked after each restart (or set TOKENHUB_VAULT_PASSWORD for auto-unlock).
  3. API keys: Issued to client applications for /v1 endpoint access. Managed via the admin API or UI.

In production, always set TOKENHUB_ADMIN_TOKEN and restrict network access to /admin/* at the firewall, VPN, or reverse proxy level.

Administration Tools

Admin UI

The built-in web dashboard at /admin provides a graphical interface for all admin operations. See Admin UI.

tokenhubctl

A command-line tool for scripting and quick administration. Covers all admin API operations. See tokenhubctl CLI.

curl / Admin API

All operations are available via the REST API at /admin/v1/*. See API Reference.

Admin Endpoints

| Category | Endpoints | Purpose |
|---|---|---|
| Vault | /admin/v1/vault/* | Lock, unlock, rotate vault password |
| Providers | /admin/v1/providers | Register, edit, and manage LLM providers |
| Models | /admin/v1/models | Register, edit, and manage model configurations |
| Discovery | /admin/v1/providers/{id}/discover | Discover models from a provider's API |
| Simulation | /admin/v1/routing/simulate | What-if routing simulation |
| Routing | /admin/v1/routing-config | Set default routing policy |
| API Keys | /admin/v1/apikeys | Create, rotate, revoke client API keys |
| Health | /admin/v1/health | View provider health status |
| Stats | /admin/v1/stats | View aggregated request statistics |
| Logs | /admin/v1/logs | View request logs |
| Audit | /admin/v1/audit | View audit trail |
| Rewards | /admin/v1/rewards | View contextual bandit reward data |
| Engine | /admin/v1/engine/models | View runtime model registry and adapter info |
| TSDB | /admin/v1/tsdb/* | Query time-series metrics |
| Workflows | /admin/v1/workflows | View Temporal workflow executions |
| Events | /admin/v1/events | SSE stream of real-time events |

Sections

Vault & Credentials

TokenHub includes an AES-256-GCM encrypted vault for storing provider API keys securely. Provider credentials are encrypted at rest and only decrypted in memory when the vault is unlocked.

Vault password vs. admin token: The vault password is not the same as your admin token. The admin token authenticates HTTP requests to the admin API. The vault password derives the encryption key used to protect stored credentials. Both are required in a production deployment: the admin token to access the API, and the vault password to decrypt provider keys.

How It Works

  1. An administrator sets a vault password when first configuring TokenHub
  2. The password is run through Argon2id key derivation (OWASP-recommended parameters) to produce an encryption key
  3. Provider API keys are encrypted with AES-256-GCM and stored in SQLite
  4. A random salt is generated per vault instance and persisted alongside the encrypted data
  5. After server restart, the vault must be unlocked with the same password before provider requests can be made

Vault States

| State | Description |
|---|---|
| Not initialized | First-time setup required — choose a master password |
| Locked | Credentials encrypted; provider requests will fail |
| Unlocked | Credentials decrypted in memory; requests are served normally |

Auto-Unlock (Headless)

Set TOKENHUB_VAULT_PASSWORD to unlock the vault automatically at startup. This is required for automated/headless deployments where no operator is present to enter the password interactively.

export TOKENHUB_VAULT_PASSWORD="your-secure-password"

On first boot this also initializes the vault, so no interactive setup is needed.

Operations

Unlock the Vault

Via the admin UI (recommended for first-time setup — the UI asks for the password twice to prevent typos), or via API/CLI:

tokenhubctl vault unlock "your-secure-password"

Or via curl:

curl -X POST http://localhost:8080/admin/v1/vault/unlock \
  -H "Content-Type: application/json" \
  -d '{"admin_password": "your-secure-password"}'

Response:

{"ok": true}

Lock the Vault

curl -X POST http://localhost:8080/admin/v1/vault/lock

Response:

{"ok": true, "already_locked": false}

Rotate the Vault Password

Re-encrypts all stored credentials with a new password:

curl -X POST http://localhost:8080/admin/v1/vault/rotate \
  -H "Content-Type: application/json" \
  -d '{
    "old_password": "current-password",
    "new_password": "new-secure-password"
  }'

This operation is atomic — all credentials are re-encrypted in a single transaction.

Auto-Lock

The vault automatically locks after 30 minutes of inactivity. Every successful credential access resets the timer.

When the vault auto-locks:

  • In-flight requests that have already retrieved credentials continue normally
  • New requests will fail with a provider error until the vault is unlocked again
  • An audit log entry is recorded
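
A minimal sketch of this inactivity-based behavior (the class and method names are hypothetical, not TokenHub's internals):

```python
import time

class AutoLockVault:
    """Sketch of inactivity-based auto-lock; hypothetical class.
    Every successful credential access resets the timer."""

    def __init__(self, timeout_s: float = 30 * 60):
        self.timeout_s = timeout_s
        self.last_access: float | None = None  # None means locked

    def unlock(self) -> None:
        self.last_access = time.monotonic()

    def is_unlocked(self) -> bool:
        if self.last_access is None:
            return False
        if time.monotonic() - self.last_access > self.timeout_s:
            self.last_access = None  # inactivity window elapsed: auto-lock
            return False
        return True

    def get_credential(self, name: str) -> str:
        if not self.is_unlocked():
            raise RuntimeError("vault is locked")
        self.last_access = time.monotonic()  # successful access resets timer
        return f"<decrypted:{name}>"

vault = AutoLockVault(timeout_s=0.05)
vault.unlock()
assert vault.is_unlocked()
time.sleep(0.1)
assert not vault.is_unlocked()  # auto-locked after inactivity
```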

Credential Storage

When you register a provider with cred_store: "vault", TokenHub stores the API key encrypted in the vault under the key provider:{provider_id}:api_key.

The credential lifecycle:

  1. Admin provides API key when creating/updating a provider
  2. Key is encrypted and stored in the vault
  3. Key is also persisted (encrypted) in the database for recovery after restart
  4. When the vault is unlocked, the salt and encrypted blob are loaded from the database
  5. Keys are decrypted only in memory

Security Parameters

| Parameter | Value |
|---|---|
| Encryption | AES-256-GCM |
| Key derivation | Argon2id |
| Argon2id time | 3 iterations |
| Argon2id memory | 64 MB |
| Argon2id threads | 4 |
| Salt | 16 bytes, random per vault |
| Auto-lock timeout | 30 minutes |

Best Practices

  1. Use a strong vault password: At least 16 characters with mixed case, numbers, and symbols
  2. Use TOKENHUB_VAULT_PASSWORD for automated deployments so the vault unlocks on restart
  3. Rotate regularly: Use the rotate endpoint to change the vault password periodically
  4. Monitor auto-lock: Set up alerts if the vault locks unexpectedly during business hours
  5. Backup the database: The vault salt and encrypted blob are stored in SQLite. Back up the database file to ensure credential recovery
  6. Network isolation: Restrict access to vault admin endpoints to trusted networks

Provider Management

Providers are the LLM services that TokenHub routes requests to. TokenHub ships with adapter support for OpenAI, Anthropic, and vLLM (OpenAI-compatible).

Registration Methods

Credentials File

The ~/.tokenhub/credentials file is a declarative JSON file processed at startup. Providers are persisted to the database and API keys are stored in the vault (when unlocked via TOKENHUB_VAULT_PASSWORD). The file is idempotent — it can remain in place across restarts.

The file must have 0600 permissions and live outside the source tree.

{
  "providers": [
    {
      "id": "openai",
      "type": "openai",
      "base_url": "https://api.openai.com",
      "api_key": "sk-..."
    },
    {
      "id": "anthropic",
      "type": "anthropic",
      "base_url": "https://api.anthropic.com",
      "api_key": "sk-ant-..."
    },
    {
      "id": "ollama-local",
      "type": "openai",
      "base_url": "http://localhost:11434"
    }
  ],
  "models": [
    {
      "id": "gpt-4o",
      "provider_id": "openai",
      "weight": 8,
      "max_context_tokens": 128000,
      "input_per_1k": 0.0025,
      "output_per_1k": 0.01
    }
  ]
}

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique provider identifier |
| type | string | Yes | Provider type: openai, anthropic, or vllm |
| base_url | string | Yes | Provider API base URL |
| api_key | string | No | API key (stored in vault when available; omit for keyless providers) |
| enabled | bool | No | Whether the provider is active (default: true) |

Override the default path with TOKENHUB_CREDENTIALS_FILE.

Admin API / tokenhubctl

Providers can be registered and managed dynamically via the admin API or tokenhubctl at any time after the service starts.

Admin UI

The setup wizard at /admin walks through adding providers interactively.

API Operations

Create or Update a Provider

curl -X POST http://localhost:8080/admin/v1/providers \
  -H "Content-Type: application/json" \
  -d '{
    "id": "openai-prod",
    "type": "openai",
    "enabled": true,
    "base_url": "https://api.openai.com",
    "cred_store": "vault",
    "api_key": "sk-..."
  }'

Or with tokenhubctl:

tokenhubctl provider add '{"id":"openai-prod","type":"openai","base_url":"https://api.openai.com","api_key":"sk-..."}'

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique provider identifier |
| type | string | Yes | Provider type: openai, anthropic, or vllm |
| enabled | bool | No | Whether the provider is active (default: true) |
| base_url | string | Yes | Provider API base URL |
| cred_store | string | No | Where to store credentials: vault or none |
| api_key | string | No | API key (stored according to cred_store) |

List Providers

curl http://localhost:8080/admin/v1/providers
tokenhubctl provider list

The tokenhubctl provider list command merges providers from both the persistent store and the runtime engine, showing base URLs derived from adapter health endpoints and indicating whether each provider is store-persisted or runtime-only.

API keys are never returned in list responses.

Edit a Provider

Partial updates via PATCH:

curl -X PATCH http://localhost:8080/admin/v1/providers/openai \
  -H "Content-Type: application/json" \
  -d '{"base_url": "https://api.openai.com", "enabled": true}'

Or:

tokenhubctl provider edit openai '{"base_url":"https://api.openai.com","enabled":true}'

Patchable fields: type, base_url, enabled, api_key, cred_store.

Delete a Provider

curl -X DELETE http://localhost:8080/admin/v1/providers/openai-staging
tokenhubctl provider delete openai-staging

Discover Models

Query a provider's API to discover available models:

curl http://localhost:8080/admin/v1/providers/openai/discover
tokenhubctl provider discover openai

This calls the provider's /v1/models endpoint (using the stored API key from the vault if available) and returns the list of models with a registered flag indicating which are already configured in TokenHub.

Credential Storage Options

| cred_store | Description |
|---|---|
| vault | API key is encrypted and stored in the vault (default when api_key is provided) |
| none | No credentials needed (e.g., local vLLM/Ollama without auth) |

When using vault, the API key is encrypted with AES-256-GCM and only available when the vault is unlocked.

Supported Provider Types

OpenAI (openai)

  • API endpoint: /v1/chat/completions
  • Health probe: GET /v1/models
  • Streaming: SSE (native)
  • Authentication: Authorization: Bearer <key>

Anthropic (anthropic)

  • API endpoint: /v1/messages
  • Health probe: GET /v1/messages (a 2xx or 405 response = healthy)
  • Streaming: SSE (native)
  • Authentication: x-api-key: <key>, anthropic-version: 2023-06-01

vLLM (vllm)

  • API endpoint: /v1/chat/completions (OpenAI-compatible)
  • Health probe: GET /health
  • Streaming: SSE (OpenAI-compatible)
  • Authentication: None (or custom header if configured)
  • Multi-endpoint: Supports multiple endpoints with round-robin load balancing

Audit Trail

All provider mutations are logged in the audit trail:

  • provider.upsert — Provider created or updated
  • provider.patch — Provider partially updated
  • provider.delete — Provider removed

Model Management

Models are the LLM model definitions that TokenHub uses for routing decisions. Each model is associated with a provider and has properties that affect routing: capability weight, context window size, and pricing.

Default Models

TokenHub registers these default models at startup:

| Model ID | Provider | Weight | Context | Input $/1K | Output $/1K |
|---|---|---|---|---|---|
| gpt-4 | openai | 8 | 128,000 | $0.010 | $0.030 |
| gpt-3.5-turbo | openai | 3 | 16,385 | $0.0005 | $0.0015 |
| claude-opus | anthropic | 10 | 200,000 | $0.015 | $0.075 |
| claude-sonnet | anthropic | 7 | 200,000 | $0.003 | $0.015 |

Defaults are overridden if persisted models exist in the database or are registered via the credentials file.

API Operations

Create or Update a Model

curl -X POST http://localhost:8080/admin/v1/models \
  -H "Content-Type: application/json" \
  -d '{
    "id": "gpt-4-turbo",
    "provider_id": "openai",
    "weight": 7,
    "max_context_tokens": 128000,
    "input_per_1k": 0.01,
    "output_per_1k": 0.03,
    "enabled": true
  }'

Or with tokenhubctl:

tokenhubctl model add '{"id":"gpt-4-turbo","provider_id":"openai","weight":7,"max_context_tokens":128000,"input_per_1k":0.01,"output_per_1k":0.03,"enabled":true}'

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Model identifier (must match the provider's model name) |
| provider_id | string | Yes | ID of the registered provider |
| weight | int | Yes | Capability weight (0-10); higher = more capable |
| max_context_tokens | int | Yes | Maximum context window in tokens |
| input_per_1k | float | Yes | Cost per 1,000 input tokens in USD |
| output_per_1k | float | Yes | Cost per 1,000 output tokens in USD |
| enabled | bool | Yes | Whether the model is available for routing |

Model IDs can contain slashes (e.g., Qwen/Qwen2.5-Coder-32B-Instruct, nvidia/openai/gpt-oss-20b). The API handles them correctly.

List Models

curl http://localhost:8080/admin/v1/models
tokenhubctl model list

The tokenhubctl model list command merges models from both the persistent store and the runtime engine, so models registered via environment variables or the credentials file are also shown.

Patch a Model

Update individual fields without resending the full configuration:

curl -X PATCH http://localhost:8080/admin/v1/models/gpt-4o \
  -H "Content-Type: application/json" \
  -d '{
    "weight": 9,
    "enabled": true,
    "input_per_1k": 0.012
  }'

Or:

tokenhubctl model edit gpt-4o '{"weight":9}'

Patchable fields: weight, enabled, input_per_1k, output_per_1k, max_context_tokens.

Runtime-only models (those registered via env vars or credentials file but not in the store) can also be patched. The first patch creates a store record seeded from the engine's runtime data.

Enable / Disable a Model

Quick shortcuts via tokenhubctl:

tokenhubctl model enable gpt-4o
tokenhubctl model disable gpt-4o-legacy

Delete a Model

curl -X DELETE http://localhost:8080/admin/v1/models/gpt-4-legacy
tokenhubctl model delete gpt-4-legacy

Weight Guidelines

The model weight is the primary indicator of model capability used in routing decisions:

| Weight | Intended For |
|---|---|
| 1-3 | Simple tasks, low cost (e.g., GPT-3.5 Turbo) |
| 4-6 | General purpose (e.g., GPT-4 Turbo, Claude Sonnet) |
| 7-8 | Complex reasoning (e.g., GPT-4, Claude Opus) |
| 9-10 | Highest capability (e.g., next-gen frontier models) |

Different routing modes weight the capability score differently:

  • cheap mode barely considers weight (0.1 factor)
  • high_confidence and planning modes heavily favor higher weights (0.6-0.7 factor)
  • normal mode balances weight equally with cost, latency, and reliability (0.25 each)

Context Window

The max_context_tokens field tells the router whether a model can handle a given request size. The router applies a 15% headroom buffer — a model with a 128,000-token window can handle requests estimated at up to ~108,800 tokens.

Token estimation uses estimated_input_tokens from the request if provided, otherwise falls back to a characters / 4 heuristic.
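
A sketch of that estimation and headroom check (function names here are illustrative, not TokenHub internals):

```python
def estimate_tokens(payload: dict) -> int:
    """Documented estimate: prefer estimated_input_tokens when supplied,
    otherwise fall back to total characters / 4."""
    if payload.get("estimated_input_tokens"):
        return int(payload["estimated_input_tokens"])
    text = "".join(m.get("content", "") for m in payload.get("messages", []))
    return len(text) // 4

def fits_context(estimated_tokens: int, max_context_tokens: int) -> bool:
    # 15% headroom: the estimate must fit within 85% of the window.
    return estimated_tokens <= max_context_tokens * 0.85

assert estimate_tokens({"messages": [{"content": "x" * 400}]}) == 100
assert fits_context(100_000, 128_000)       # under ~108,800
assert not fits_context(120_000, 128_000)   # over the headroom limit
```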

Pricing

Model pricing is used for:

  1. Cost estimation: Returned in the response as estimated_cost_usd
  2. Budget filtering: Models exceeding the request's max_budget_usd are excluded
  3. Cost scoring: In routing modes that consider cost (especially cheap mode)

Keep pricing up to date as providers change their rates.

Audit Trail

Model mutations are logged:

  • model.upsert — Model created or updated
  • model.patch — Model partially updated
  • model.delete — Model removed

Routing Configuration

TokenHub's routing engine uses a multi-objective scoring function to select the best model for each request. Administrators can configure the default routing behavior that applies when clients don't specify a policy.

Default Routing Settings

View Current Defaults

curl http://localhost:8080/admin/v1/routing-config

Response:

{
  "default_mode": "normal",
  "default_max_budget_usd": 0.05,
  "default_max_latency_ms": 20000
}

Update Defaults

curl -X PUT http://localhost:8080/admin/v1/routing-config \
  -H "Content-Type: application/json" \
  -d '{
    "default_mode": "normal",
    "default_max_budget_usd": 0.10,
    "default_max_latency_ms": 30000
  }'

| Field | Type | Range | Description |
|---|---|---|---|
| default_mode | string | See below | Default routing mode |
| default_max_budget_usd | float | 0-100 | Default cost ceiling per request |
| default_max_latency_ms | int | 0-300000 | Default latency ceiling |

Changes take effect immediately for new requests and are persisted to the database.

Routing Modes

Each mode applies different weights to the four scoring objectives:

| Mode | Cost | Latency | Failure Rate | Capability | Use Case |
|---|---|---|---|---|---|
| cheap | 0.7 | 0.1 | 0.1 | 0.1 | Minimize costs for simple tasks |
| normal | 0.25 | 0.25 | 0.25 | 0.25 | Balanced operation |
| high_confidence | 0.05 | 0.1 | 0.15 | 0.7 | Complex tasks needing strong models |
| planning | 0.1 | 0.1 | 0.2 | 0.6 | Multi-step reasoning tasks |
| adversarial | 0.1 | 0.1 | 0.2 | 0.6 | Adversarial orchestration |
| thompson | n/a | n/a | n/a | n/a | Adaptive RL-based selection |

How Scoring Works

For modes other than thompson, the scoring formula is:

score = (cost_norm × w_cost) + (latency_norm × w_latency) + (failure_norm × w_failure) - (weight × w_capability)

Where:

  • cost_norm: Estimated cost normalized to 0-1 range
  • latency_norm: Average latency normalized to 0-1 range
  • failure_norm: Error rate from health tracker
  • weight: Model capability weight (0-10)
  • w_*: Mode-specific weights from the table above

Lower score = better model. Models are sorted by score and tried in order.
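
As a sketch, the formula and the mode table reduce to something like the following (names are illustrative; only a few modes shown):

```python
MODE_WEIGHTS = {
    # mode: (w_cost, w_latency, w_failure, w_capability), from the table above
    "cheap":           (0.70, 0.10, 0.10, 0.10),
    "normal":          (0.25, 0.25, 0.25, 0.25),
    "high_confidence": (0.05, 0.10, 0.15, 0.70),
}

def score(mode: str, cost_norm: float, latency_norm: float,
          failure_norm: float, weight: int) -> float:
    """Lower is better; capability is subtracted, so a higher model
    weight lowers (improves) the score."""
    w_cost, w_lat, w_fail, w_cap = MODE_WEIGHTS[mode]
    return (cost_norm * w_cost + latency_norm * w_lat
            + failure_norm * w_fail - weight * w_cap)

# In high_confidence mode, a strong model wins despite a worse cost score:
strong = score("high_confidence", cost_norm=0.9, latency_norm=0.5,
               failure_norm=0.1, weight=10)
weak = score("high_confidence", cost_norm=0.1, latency_norm=0.2,
             failure_norm=0.1, weight=3)
assert strong < weak  # lower score = tried first
```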

Thompson Sampling

The thompson mode uses a contextual bandit approach:

  1. Each (model, token_bucket) pair maintains Beta distribution parameters (alpha, beta)
  2. For each request, a reward value is sampled from each model's Beta distribution
  3. Models are sorted by sampled reward (highest first)
  4. Parameters are updated periodically from historical reward data

This approach automatically adapts to changing model performance over time.

Model Eligibility Filtering

Before scoring, the router filters models:

  1. Enabled: Model must be enabled
  2. Minimum weight: Must meet the request's min_weight threshold
  3. Context capacity: Must have enough context window (with 15% headroom)
  4. Provider health: Provider must not be in the "down" state
  5. Budget: Estimated cost must be within max_budget_usd

If no models pass filtering, the request fails with a 502 error.
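
The filter chain reduces to a sketch like this (field names follow the model schema; provider_states is a hypothetical health map, and the cost estimate counts input tokens only, a simplification):

```python
def eligible(models: list[dict], req: dict,
             provider_states: dict[str, str]) -> list[dict]:
    """Apply the five documented filters in order. Sketch only."""
    out = []
    est = req.get("estimated_input_tokens", 0)
    for m in models:
        if not m["enabled"]:
            continue  # 1. must be enabled
        if m["weight"] < req.get("min_weight", 0):
            continue  # 2. minimum capability weight
        if est > m["max_context_tokens"] * 0.85:
            continue  # 3. context capacity with 15% headroom
        if provider_states.get(m["provider_id"]) == "down":
            continue  # 4. provider must not be down
        cost = est / 1000 * m["input_per_1k"]  # input-only simplification
        if cost > req.get("max_budget_usd", float("inf")):
            continue  # 5. budget ceiling
        out.append(m)
    return out

models = [
    {"id": "gpt-4", "provider_id": "openai", "enabled": True,
     "weight": 8, "max_context_tokens": 128000, "input_per_1k": 0.01},
    {"id": "gpt-3.5-turbo", "provider_id": "openai", "enabled": True,
     "weight": 3, "max_context_tokens": 16385, "input_per_1k": 0.0005},
]
req = {"estimated_input_tokens": 50_000, "min_weight": 5, "max_budget_usd": 1.0}
assert [m["id"] for m in eligible(models, req, {"openai": "healthy"})] == ["gpt-4"]
```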

Escalation and Failover

When a provider call fails, the router uses the error classification to decide what to do:

| Error Class | Action |
|---|---|
| context_overflow | Find a model with a larger context window |
| rate_limited | Skip to the next provider; honor the Retry-After header |
| transient (5xx) | Retry with exponential backoff (100 ms base, 2 retries) |
| fatal (4xx) | Try the next model in scored order |

The router tries up to 5 models before giving up.
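
The transient-retry behavior can be sketched as follows (the send callable and TransientError are hypothetical stand-ins for a classified provider call):

```python
import time

class TransientError(Exception):
    """Stand-in for a classified 5xx provider failure."""

def call_with_retry(send, max_retries: int = 2, base_delay: float = 0.1):
    """Retry transient failures with exponential backoff:
    100 ms after the first failure, 200 ms after the second."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: escalate to the next model
            time.sleep(base_delay * (2 ** attempt))

attempts = []
def flaky():
    # Fails twice with a simulated 5xx, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("HTTP 502")
    return "ok"

assert call_with_retry(flaky) == "ok"
assert len(attempts) == 3  # initial attempt + 2 retries
```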

Runtime Model Registry

View the current in-memory model registry and registered adapters:

curl http://localhost:8080/admin/v1/engine/models

Response:

{
  "models": [
    {
      "id": "gpt-4",
      "provider_id": "openai",
      "weight": 8,
      "max_context_tokens": 128000,
      "input_per_1k": 0.01,
      "output_per_1k": 0.03,
      "enabled": true
    }
  ],
  "adapters": ["openai", "anthropic", "vllm"]
}

Audit Trail

Routing configuration changes are logged as routing-config.update in the audit trail.

API Key Management

TokenHub issues its own API keys to client applications. Provider API keys are escrowed in the vault — clients never see them. This provides a clean separation between client authentication and provider credentials.

Key Properties

| Property | Description |
|---|---|
| ID | 16-character hex identifier |
| Prefix | First 8 characters of the key, for identification |
| Name | Human-readable label |
| Scopes | JSON array of allowed endpoints (chat, plan) |
| Rotation days | Recommended rotation period (0 = manual only) |
| Expiration | Optional automatic expiry |
| Enabled | Active/inactive toggle |

Operations

Create a Key

curl -X POST http://localhost:8080/admin/v1/apikeys \
  -H "Content-Type: application/json" \
  -d '{
    "name": "production-backend",
    "scopes": "[\"chat\",\"plan\"]",
    "rotation_days": 90,
    "expires_in": "2160h"
  }'

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Human-readable name for the key |
| scopes | string | No | JSON array of scopes (default: ["chat","plan"]) |
| rotation_days | int | No | Recommended rotation period in days (default: 0) |
| expires_in | string | No | Go duration for expiry (e.g., 720h for 30 days) |

Response:

{
  "ok": true,
  "key": "tokenhub_a1b2c3d4e5f6789012345678abcdef0123456789abcdef0123456789abcdef01",
  "id": "a1b2c3d4e5f67890",
  "prefix": "tokenhub_a1b2c3d4",
  "warning": "Store this key securely. It will not be shown again."
}

Important: The plaintext key is returned only at creation time. Store it securely before closing the response.

List Keys

curl http://localhost:8080/admin/v1/apikeys

Response:

[
  {
    "id": "a1b2c3d4e5f67890",
    "key_prefix": "tokenhub_a1b2c3d4",
    "name": "production-backend",
    "scopes": "[\"chat\",\"plan\"]",
    "created_at": "2026-02-16T10:00:00Z",
    "last_used_at": "2026-02-16T12:34:56Z",
    "expires_at": "2026-05-16T10:00:00Z",
    "rotation_days": 90,
    "enabled": true
  }
]

Plaintext keys are never shown in list responses.

Rotate a Key

Generate a new key value while keeping the same ID and configuration:

curl -X POST http://localhost:8080/admin/v1/apikeys/a1b2c3d4e5f67890/rotate

Response:

{
  "ok": true,
  "key": "tokenhub_<new-64-hex-chars>",
  "warning": "Store this key securely. It will not be shown again."
}

The old key immediately becomes invalid. Distribute the new key to all clients before rotating.

Update a Key

Modify key metadata without changing the key value:

curl -X PATCH http://localhost:8080/admin/v1/apikeys/a1b2c3d4e5f67890 \
  -H "Content-Type: application/json" \
  -d '{
    "name": "production-backend-v2",
    "scopes": "[\"chat\"]",
    "enabled": true,
    "rotation_days": 60
  }'

All fields are optional — only specified fields are updated.

Revoke (Delete) a Key

curl -X DELETE http://localhost:8080/admin/v1/apikeys/a1b2c3d4e5f67890

This permanently removes the key. It cannot be recovered.

Security Details

Storage

  • Keys are hashed with bcrypt (cost factor 10) before storage
  • To reduce bcrypt overhead per-request, validated keys are cached for 5 minutes
  • The SHA-256 digest of the plaintext is bcrypt-hashed (allowing keys longer than bcrypt's 72-byte limit)

Validation Flow

  1. Extract Bearer tokenhub_... from Authorization header
  2. Extract the key prefix (first 8 chars after tokenhub_)
  3. Check the validation cache (5-minute TTL)
  4. If not cached: load record by prefix, bcrypt-verify, check enabled + expiry
  5. Update last_used_at timestamp
  6. Verify the key's scopes include the requested endpoint
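
A stdlib-only sketch of the cache-then-verify shape (the bcrypt step and prefix lookup are collapsed into a plain digest comparison, so this is not the real verification code):

```python
import hashlib
import time

CACHE_TTL_S = 300  # 5-minute validation cache
_cache: dict[str, float] = {}  # key digest -> cache-entry expiry time

def digest(key: str) -> str:
    # SHA-256 pre-hash, as described above; the real system then
    # bcrypt-hashes this digest (bcrypt is elided from this sketch).
    return hashlib.sha256(key.encode()).hexdigest()

def validate(key: str, stored_digests: set[str]) -> bool:
    """Sketch of steps 1-4: reject non-tokenhub keys, consult the TTL
    cache, and only fall back to the expensive check on a cache miss."""
    if not key.startswith("tokenhub_"):
        return False
    d = digest(key)
    now = time.monotonic()
    if _cache.get(d, 0) > now:
        return True  # cache hit: skip the expensive verification
    if d in stored_digests:
        _cache[d] = now + CACHE_TTL_S
        return True
    return False

store = {digest("tokenhub_" + "ab" * 32)}
assert validate("tokenhub_" + "ab" * 32, store)
assert validate("tokenhub_" + "ab" * 32, store)  # second call hits the cache
assert not validate("tokenhub_" + "cd" * 32, store)
```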

Scopes

| Scope | Protects |
|---|---|
| chat | POST /v1/chat |
| plan | POST /v1/plan |

An empty scopes array [] grants access to all endpoints.

Audit Trail

All key management operations are logged:

  • apikey.create — New key created
  • apikey.rotate — Key rotated (new value generated)
  • apikey.update — Key metadata changed
  • apikey.revoke — Key deleted

Best Practices

  1. Name keys descriptively: Use names like staging-backend, prod-api-v2, data-pipeline
  2. Use minimal scopes: If a client only needs chat, don't grant plan access
  3. Set rotation schedules: Configure rotation_days as a reminder to rotate
  4. Set expiration for temporary keys: Use expires_in for keys issued to contractors or experiments
  5. Monitor last_used_at: Keys not used for extended periods may be candidates for revocation
  6. Rotate after incidents: If a key may have been compromised, rotate immediately

Monitoring & Observability

TokenHub provides multiple layers of observability: health tracking, Prometheus metrics, time-series data, request logs, audit logs, reward logs, and real-time SSE events.

Health Endpoint

curl http://localhost:8080/healthz

| Status | Meaning |
|---|---|
| 200 | System is healthy; adapters and models are registered |
| 503 | No adapters or no models are registered |

Response:

{"status": "ok", "adapters": 2, "models": 6}

Provider Health

View per-provider health status:

curl http://localhost:8080/admin/v1/health

Response:

{
  "providers": [
    {
      "provider_id": "openai",
      "state": "healthy",
      "total_requests": 1234,
      "total_errors": 5,
      "consec_errors": 0,
      "avg_latency_ms": 456.7,
      "last_error": "",
      "last_success_at": "2026-02-16T12:34:56Z",
      "cooldown_until": "0001-01-01T00:00:00Z"
    }
  ]
}

Health States

| State | Consecutive Errors | Behavior |
|---|---|---|
| Healthy | 0-1 | Normal routing |
| Degraded | 2-4 | Still routed, but penalized in scoring |
| Down | 5+ | Excluded from routing; 30-second cooldown |
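
The thresholds above map to a small state machine; a sketch with a hypothetical class name:

```python
class HealthTracker:
    """Sketch of the documented consecutive-error thresholds."""

    def __init__(self):
        self.consec_errors = 0

    def record(self, success: bool) -> None:
        # Any success resets the consecutive-error counter.
        self.consec_errors = 0 if success else self.consec_errors + 1

    @property
    def state(self) -> str:
        if self.consec_errors >= 5:
            return "down"      # excluded from routing; 30 s cooldown
        if self.consec_errors >= 2:
            return "degraded"  # still routed, penalized in scoring
        return "healthy"

t = HealthTracker()
assert t.state == "healthy"
for _ in range(3):
    t.record(success=False)
assert t.state == "degraded"
for _ in range(2):
    t.record(success=False)
assert t.state == "down"
t.record(success=True)
assert t.state == "healthy"
```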

Active Health Probing

TokenHub actively probes provider health endpoints in the background:

| Provider | Health Endpoint | Success Criteria |
|---|---|---|
| OpenAI | GET /v1/models | 2xx response |
| Anthropic | GET /v1/messages | 2xx or 405 response |
| vLLM | GET /health | 2xx response |

Probes run every 30 seconds with a 10-second timeout.

Prometheus Metrics

Expose metrics at:

curl http://localhost:8080/metrics

Available Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| tokenhub_requests_total | counter | mode, model, provider, status | Total requests processed |
| tokenhub_request_latency_ms | histogram | mode, model, provider | Request latency distribution |
| tokenhub_cost_usd_total | counter | model, provider | Cumulative estimated cost |

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: tokenhub
    scrape_interval: 15s
    static_configs:
      - targets: ['tokenhub:8080']

Example Queries

# Request rate by model
rate(tokenhub_requests_total[5m])

# P95 latency
histogram_quantile(0.95, rate(tokenhub_request_latency_ms_bucket[5m]))

# Cost per hour by provider
rate(tokenhub_cost_usd_total[1h]) * 3600

# Error rate
sum(rate(tokenhub_requests_total{status="error"}[5m])) /
sum(rate(tokenhub_requests_total[5m]))

Time-Series Database (TSDB)

TokenHub includes a lightweight SQLite-backed TSDB for historical metrics with querying and downsampling.

Query Metrics

curl "http://localhost:8080/admin/v1/tsdb/query?metric=latency&model_id=gpt-4&start=2026-02-16T00:00:00Z&end=2026-02-16T23:59:59Z&step_ms=60000"

| Parameter | Required | Description |
|---|---|---|
| metric | Yes | Metric name (latency or cost) |
| model_id | No | Filter by model |
| provider_id | No | Filter by provider |
| start | No | Start time (RFC3339) |
| end | No | End time (RFC3339) |
| step_ms | No | Downsample bucket size in milliseconds |
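
One plausible way a step_ms bucket could be applied is to average raw samples into fixed-width time buckets (illustrative sketch, not TokenHub's storage code):

```python
def downsample(points: list[tuple[int, float]],
               step_ms: int) -> list[tuple[int, float]]:
    """Average raw (timestamp_ms, value) samples into step_ms buckets,
    returning one (bucket_start_ms, mean_value) pair per bucket."""
    buckets: dict[int, list[float]] = {}
    for ts, val in points:
        buckets.setdefault(ts - ts % step_ms, []).append(val)
    return [(b, sum(vals) / len(vals)) for b, vals in sorted(buckets.items())]

# Two samples land in the first 60 s bucket and are averaged:
raw = [(0, 100.0), (30_000, 300.0), (60_000, 500.0)]
assert downsample(raw, 60_000) == [(0, 200.0), (60_000, 500.0)]
```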

List Available Metrics

curl http://localhost:8080/admin/v1/tsdb/metrics

Configure Retention

curl -X PUT http://localhost:8080/admin/v1/tsdb/retention \
  -H "Content-Type: application/json" \
  -d '{"retention_days": 14}'

Default retention is 7 days. Old data is automatically pruned hourly.

Manual Prune

curl -X POST http://localhost:8080/admin/v1/tsdb/prune

Request Logs

View paginated request history:

curl "http://localhost:8080/admin/v1/logs?limit=50&offset=0"

Each entry contains:

  • Timestamp, request ID
  • Model ID, provider ID, routing mode
  • Estimated cost, latency
  • HTTP status code, error class (if failed)

Audit Logs

View admin action history:

curl "http://localhost:8080/admin/v1/audit?limit=50&offset=0"

Logged actions:

  • vault.lock, vault.unlock, vault.rotate
  • provider.upsert, provider.delete
  • model.upsert, model.patch, model.delete
  • apikey.create, apikey.rotate, apikey.update, apikey.revoke
  • routing-config.update

Reward Logs

View contextual bandit reward data for RL-based routing analysis:

curl "http://localhost:8080/admin/v1/rewards?limit=50&offset=0"

Each entry contains: request ID, mode, model, provider, token count, token bucket (small/medium/large), latency budget, actual latency, cost, success flag, error class, and computed reward.

Aggregated Statistics

curl http://localhost:8080/admin/v1/stats

Returns global aggregates plus breakdowns by model and by provider.

Server-Sent Events (SSE)

Subscribe to real-time events:

curl -N http://localhost:8080/admin/v1/events

Event types:

| Event | Fields | When |
|---|---|---|
| route_success | model_id, provider_id, latency_ms, cost_usd, reason | Request completed successfully |
| route_error | latency_ms, error_class, error_msg | Request failed |

Example:

data: {"type":"route_success","model_id":"gpt-4","provider_id":"openai","latency_ms":456.7,"cost_usd":0.023,"reason":"routed-weight-8"}

Suggested Alerts

| Alert | Condition | Severity |
|---|---|---|
| High error rate | Error rate > 5% over 5 minutes | Warning |
| Provider down | Provider in "down" state > 2 minutes | Critical |
| High latency | P95 latency > 10 seconds | Warning |
| Cost spike | Hourly cost > 2x the 7-day average | Warning |
| Vault locked | Vault locked during business hours | Critical |
| No providers | Adapter count = 0 | Critical |

Admin UI

TokenHub includes a built-in single-page admin dashboard accessible at /admin. The UI is embedded in the binary — no separate frontend build or deployment is needed.

Accessing the UI

Navigate to:

http://localhost:8080/admin

The root URL (http://localhost:8080/) automatically redirects to /admin/.

Authentication

When TOKENHUB_ADMIN_TOKEN is set, the dashboard displays a full-screen Admin Authentication modal on first visit. Paste your admin token and press Authenticate (or Enter). The token is verified against the API before the dashboard loads; an invalid token shows an inline error.

Once authenticated, the token is stored in sessionStorage (cleared when the browser tab closes). A Sign Out button in the header clears the session and re-opens the authentication modal.

To retrieve the admin token:

tokenhubctl admin-token

Cache Busting

The admin HTML is served with Cache-Control: no-cache, must-revalidate and an ETag derived from the content hash. Static assets under /_assets/ are served with immutable cache headers and versioned URLs (?v=<hash>), ensuring browsers always get fresh assets after a rebuild without manual cache clearing.

Dashboard Panels

Vault Controls

The vault panel adapts to three states:

  • First-Time Setup: When the vault has never been initialized, the UI displays a prompt to choose a master password (minimum 8 characters) with a confirmation field. Press Enter in the confirmation field or click Initialize Vault to complete setup.
  • Locked: When the vault has been initialized but is locked, the UI shows a password input. Press Enter or click Unlock to unlock.
  • Unlocked: Shows the unlocked status with a Lock button.

Note: The vault password encrypts your stored provider API keys. It is distinct from your admin token, which authenticates access to the admin API. You need both: the admin token to access the dashboard, and the vault password to decrypt stored credentials.

Provider Management

Full CRUD interface for providers:

  • Setup Wizard: Multi-step guided onboarding for new providers — select type (OpenAI/Anthropic/vLLM), enter base URL and API key, test the connection, then discover and register available models.
  • Provider Table: Shows all providers from both the persistent store and runtime engine (env vars, credentials file). Runtime-only providers are indicated with a badge. Base URLs are derived from adapter health endpoints when not stored.
  • Edit Modal: Click "Edit" on any provider to change type, base URL, API key, or enabled state.
  • Discover: Query a provider's API to find available models and register them.
  • Delete: Remove a provider from the store.

Model Management

Full CRUD interface for models:

  • Add Model Form: Create a new model with provider, weight, context window, and pricing.
  • Model Table: Shows all models from both the store and engine, with their provider, weight, context, pricing, and enabled state.
  • Edit Modal: Click "Edit" on any model to change weight, max context tokens, pricing, or enabled state.
  • Weight Slider: Quick inline weight adjustment (0-10).
  • Enable/Disable Toggle: Click the status icon to toggle a model.
  • Delete: Remove a model from the store and engine.

Model Selection Graph

An interactive directed acyclic graph (DAG) showing the relationship between providers and models. Built with Cytoscape.js, it is populated on page load with all known providers and models and updates in real time as routing events arrive.

  • Provider nodes (colored by health state)
  • Model nodes (sized by weight)
  • Edges colored by latency: green (<1s), yellow (1-3s), red (>3s)
  • Edge thickness based on request volume
  • Node size and border based on throughput and latency

Cost and Latency Charts

Multi-series D3.js line charts showing cost and latency trends over time:

  • Per-model breakdown
  • Configurable time window
  • Hover for exact values

What-If Simulator

Test routing decisions without sending a live request:

  • Select routing mode, token count, max budget, min weight, and model hint
  • See the winning model, eligible candidates, and the routing reason
  • Useful for understanding how parameter changes affect model selection

SSE Decision Feed

Live event stream showing every routing decision in real time:

  • Model, provider, latency, cost, and reason for each event
  • Error events with error classification
  • Auto-scrolling event list

Routing Configuration

Set server-wide routing defaults:

  • Default mode selector (cheap, normal, high_confidence, planning, adversarial)
  • Budget input (USD)
  • Latency input (milliseconds)
  • Save button with validation

Provider Health

Real-time provider health display:

  • State badges: Healthy (green), Degraded (yellow), Down (red)
  • Consecutive error count
  • Last success timestamp
  • Average latency

API Keys

Key management interface:

  • Create new keys (name, scopes, rotation, expiry)
  • One-time key display modal with copy button
  • Rotate keys (with one-time new key display)
  • Enable/disable toggle
  • Revoke (delete) keys
  • Table showing: name, prefix, scopes, created, last used, expires, rotation days, status

Request Log

Paginated request history:

  • Model, provider, mode columns
  • Latency, cost, status code
  • Error class (for failed requests)
  • Pagination controls

Audit Log

Paginated audit trail viewer:

  • Action type filter
  • Timestamp, action, resource ID
  • Request ID for correlation

Model Leaderboard

A ranked table of models by performance:

  • Success rate
  • Average latency
  • Total cost
  • Request count

Rewards

Contextual bandit reward data for Thompson Sampling analysis.

Workflows (Temporal)

When Temporal is enabled, shows workflow execution history:

  • Workflow ID, type, status
  • Start time, duration
  • Status badges: Running (blue), Completed (green), Failed (red)
  • Click to expand activity history

Static Assets

Static assets (Cytoscape.js, D3.js) are served from /_assets/ to avoid conflicts with the /admin/v1 API prefix. All assets are embedded in the binary via Go's embed package and served with immutable cache headers.

Customization

The admin UI is a single index.html file located at web/index.html in the source tree. To customize:

  1. Edit web/index.html
  2. Rebuild the binary (make build) or Docker image (make package)
  3. The updated UI is embedded automatically with fresh cache-busting hashes

tokenhubctl CLI

tokenhubctl is the command-line interface for managing TokenHub. It wraps every admin API endpoint into a convenient, scriptable tool.

Installation

make install    # Builds natively and installs to ~/.local/bin

Or build inside the Docker builder container:

make build      # Produces bin/tokenhub and bin/tokenhubctl

Configuration

Variable               Default                 Description
TOKENHUB_URL           http://localhost:8080   TokenHub server URL
TOKENHUB_ADMIN_TOKEN   (none)                  Bearer token for admin endpoints (see admin-token command)

export TOKENHUB_URL="http://tokenhub.internal:8080"
export TOKENHUB_ADMIN_TOKEN="$(tokenhubctl admin-token)"

Command Reference

General

tokenhubctl admin-token         # Print the admin token (env, file, or Docker)
tokenhubctl status              # Server info, health, vault state
tokenhubctl health              # Provider health table
tokenhubctl version             # CLI version
tokenhubctl help                # Full usage

Admin Token

The admin-token command retrieves the admin token by checking, in order:

  1. TOKENHUB_ADMIN_TOKEN environment variable
  2. ~/.tokenhub/.admin-token file (native deployments)
  3. docker exec into the running container to read /data/.admin-token

This avoids the need to parse server logs. The token file is written automatically by the server at startup (whether auto-generated or set via env).

Rotating the Admin Token

tokenhubctl rotate-admin-token           # Generate a new random token
tokenhubctl rotate-admin-token <token>   # Replace with a specific token

After rotation, update your local environment:

make _write-env   # Sync token from container to ~/.tokenhub/env

The new token takes effect immediately (no restart required) and is persisted to the data directory so it survives restarts. The old token is invalidated instantly.

Vault

tokenhubctl vault unlock <password>
tokenhubctl vault lock
tokenhubctl vault rotate <old-password> <new-password>

Providers

tokenhubctl provider list
tokenhubctl provider add '<json>'
tokenhubctl provider edit <id> '<json>'
tokenhubctl provider delete <id>
tokenhubctl provider discover <id>

The list command merges providers from both the persistent store and the runtime engine, showing the source of each.

The discover command queries a provider's /v1/models endpoint to list available models and whether each is already registered in TokenHub.

Example:

# Add a new provider
tokenhubctl provider add '{
  "id": "openai",
  "type": "openai",
  "base_url": "https://api.openai.com",
  "api_key": "sk-..."
}'

# Update its base URL
tokenhubctl provider edit openai '{"base_url":"https://api.openai.com"}'

# Discover available models
tokenhubctl provider discover openai

Models

tokenhubctl model list
tokenhubctl model add '<json>'
tokenhubctl model edit <id> '<json>'
tokenhubctl model delete <id>
tokenhubctl model enable <id>
tokenhubctl model disable <id>

Model IDs can contain slashes (e.g., Qwen/Qwen2.5-Coder-32B-Instruct). The CLI handles them correctly.

Example:

# Add a model
tokenhubctl model add '{
  "id": "gpt-4o",
  "provider_id": "openai",
  "weight": 8,
  "max_context_tokens": 128000,
  "input_per_1k": 0.0025,
  "output_per_1k": 0.01,
  "enabled": true
}'

# Adjust its weight
tokenhubctl model edit gpt-4o '{"weight": 9}'

# Temporarily disable it
tokenhubctl model disable gpt-4o

Routing

tokenhubctl routing get
tokenhubctl routing set '<json>'

Example:

tokenhubctl routing set '{"default_mode":"cheap","default_max_budget_usd":0.02,"default_max_latency_ms":10000}'

API Keys

tokenhubctl apikey list
tokenhubctl apikey create '<json>'
tokenhubctl apikey rotate <id>
tokenhubctl apikey edit <id> '<json>'
tokenhubctl apikey delete <id>

The create command prints the API key exactly once. Save it immediately.

Example:

tokenhubctl apikey create '{"name":"prod-app","scopes":"[\"chat\",\"plan\"]","monthly_budget_usd":50.0}'

Observability

tokenhubctl logs [--limit N]       # Request logs
tokenhubctl audit [--limit N]      # Audit trail
tokenhubctl rewards [--limit N]    # Thompson Sampling reward data
tokenhubctl stats                  # Aggregated statistics
tokenhubctl engine models          # Runtime model registry and adapter info
tokenhubctl events                 # Live SSE event stream (Ctrl-C to stop)

Routing Simulation

Run a what-if simulation without sending a real request:

tokenhubctl simulate '{"mode":"cheap","token_count":500}'
tokenhubctl simulate '{"mode":"high_confidence","token_count":2000,"max_budget_usd":0.10}'

TSDB

tokenhubctl tsdb metrics
tokenhubctl tsdb query 'metric=latency&model_id=gpt-4o&step_ms=60000'
tokenhubctl tsdb prune

Output Format

Most commands produce human-readable tabular output. For programmatic use, pipe JSON responses directly from curl or parse tokenhubctl output with standard text tools.

Architecture

TokenHub is a Go application structured as a layered system with clear package boundaries and dependency injection.

Package Layout

tokenhub/
├── cmd/tokenhub/          # Entry point, signal handling, HTTP server lifecycle
├── internal/
│   ├── app/               # Server construction, config loading, wiring
│   ├── apikey/            # API key manager + auth middleware
│   ├── events/            # In-memory event bus (pub/sub for SSE)
│   ├── health/            # Provider health tracker + active prober
│   ├── httpapi/           # HTTP handlers and route mounting
│   ├── logging/           # Structured logging setup (slog)
│   ├── metrics/           # Prometheus metric registry
│   ├── providers/         # Provider adapter contract + context helpers
│   │   ├── openai/        # OpenAI adapter
│   │   ├── anthropic/     # Anthropic adapter
│   │   └── vllm/          # vLLM adapter
│   ├── router/            # Routing engine, scoring, orchestration, Thompson Sampling
│   ├── stats/             # In-memory statistics collector
│   ├── store/             # Persistence layer (SQLite)
│   ├── temporal/          # Temporal workflow integration
│   ├── tsdb/              # Time-series database (SQLite-backed)
│   └── vault/             # AES-256-GCM encrypted credential vault
├── web/                   # Embedded admin UI (index.html)
└── docs/                  # This documentation

Dependency Flow

cmd/tokenhub/main.go
  └── internal/app.NewServer(cfg)
        ├── vault.New()
        ├── router.NewEngine()
        ├── store.NewSQLite()
        ├── health.NewTracker()
        ├── health.NewProber()         → health.Tracker
        ├── loadCredentialsFile()      → router.Engine
        ├── loadPersistedProviders()   → router.Engine
        ├── router.NewThompsonSampler()
        ├── apikey.NewManager()        → store.Store
        ├── metrics.New()
        ├── events.NewBus()
        ├── stats.NewCollector()
        ├── tsdb.New()
        ├── temporal.New()             → (optional)
        └── httpapi.MountRoutes()      → Dependencies{...}

All dependencies flow downward. HTTP handlers receive a Dependencies struct containing all services they need.

Key Interfaces

router.Sender

The provider adapter contract:

type Sender interface {
    ID() string
    Send(ctx context.Context, model string, req Request) (ProviderResponse, error)
    ClassifyError(err error) *ClassifiedError
}

router.StreamSender

Optional streaming extension:

type StreamSender interface {
    Sender
    SendStream(ctx context.Context, model string, req Request) (io.ReadCloser, error)
}

health.Probeable

Health probe interface for providers:

type Probeable interface {
    ID() string
    HealthEndpoint() string
}

store.Store

Persistence interface with methods for models, providers, request logs, audit logs, reward entries, API keys, vault blobs, and routing configuration.

Request Lifecycle

  1. HTTP handler receives the request, validates input, extracts API key
  2. Directive parser scans messages for @@tokenhub overrides and strips them
  3. Policy resolution: Merge request policy with server defaults and directive overrides
  4. Token estimation: Estimate input tokens (explicit or chars/4 heuristic)
  5. Model selection: Filter eligible models, score by policy weights, sort
  6. Provider dispatch: Call the top-scored model's adapter
  7. Error handling: On failure, classify the error and escalate/retry/failover
  8. Output shaping: Apply output format (JSON schema validation, think-block stripping)
  9. Observability: Record metrics, TSDB points, request logs, reward entries, SSE events
  10. Response: Return the provider response with routing metadata
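The chars/4 heuristic from step 4 is easy to sketch. estimateTokens below is an illustrative helper, not the server's actual function:

```go
package main

import "fmt"

// estimateTokens mirrors the chars/4 heuristic described above: when the
// caller does not supply an explicit token count, the combined message
// length in characters is divided by four. (Hypothetical helper.)
func estimateTokens(messages []string) int {
	chars := 0
	for _, m := range messages {
		chars += len(m)
	}
	return chars / 4
}

func main() {
	msgs := []string{"You are a helpful assistant.", "Summarize this document."}
	// 28 + 24 = 52 characters → 13 estimated tokens
	fmt.Println(estimateTokens(msgs)) // 13
}
```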

Concurrency Model

  • The HTTP server uses Go's standard net/http with chi router (goroutine per request)
  • The TSDB uses internal write buffering (batched inserts)
  • The health prober runs as a background goroutine with configurable interval
  • The Thompson Sampler refresh runs as a background goroutine
  • The TSDB prune loop runs as a background goroutine (hourly)
  • Temporal workflows (when enabled) are managed by the Temporal worker

All background goroutines are cleanly stopped via Server.Close().

Configuration

All configuration is via environment variables, loaded in internal/app/config.go. See Configuration Reference for the complete list.

Embedding

The admin UI (web/index.html) is embedded in the binary using Go's //go:embed directive in the root embed.go file. This means the entire application is a single self-contained binary.

Routing Engine

The routing engine (internal/router/engine.go) is TokenHub's core component. It manages the model registry, scores models against request policies, dispatches to provider adapters, and handles failover.

Engine Structure

type Engine struct {
    adapters      map[string]Sender         // provider ID → adapter
    models        []Model                   // registered models
    healthChecker HealthChecker             // optional health state provider
    banditPolicy  BanditPolicy              // optional Thompson Sampling
    defaults      EngineConfig              // default mode, budget, latency
}

Model Registration

Models and adapters are registered at startup and can be modified at runtime:

eng.RegisterAdapter(openai.New("openai", apiKey, baseURL))
eng.RegisterModel(router.Model{
    ID: "gpt-4", ProviderID: "openai",
    Weight: 8, MaxContextTokens: 128000,
    InputPer1K: 0.01, OutputPer1K: 0.03, Enabled: true,
})

Scoring Algorithm

The scoreModel() function computes a composite score for each eligible model:

score = (costNorm * w.Cost) + (latencyNorm * w.Latency) + (failureNorm * w.Failure) - (weightNorm * w.Weight)

Normalization:

  • costNorm: estimatedCost / maxBudgetUSD (clamped to 0-1)
  • latencyNorm: avgLatencyMs / maxLatencyMs (from health tracker)
  • failureNorm: errorRate (from health tracker, 0-1)
  • weightNorm: model.Weight / 10.0

Lower scores are better. The weight term is subtracted (higher weight reduces score).
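A minimal sketch of this scoring function; the Weights struct and parameter names are illustrative stand-ins for the engine's internals:

```go
package main

import "fmt"

// Weights holds the per-mode scoring weights (names are illustrative).
type Weights struct{ Cost, Latency, Failure, Weight float64 }

func clamp01(x float64) float64 {
	if x < 0 {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}

// scoreModel sketches the composite score described above:
// lower is better, and the capability weight term is subtracted.
func scoreModel(estCostUSD, maxBudgetUSD, avgLatencyMs, maxLatencyMs, errorRate, weight float64, w Weights) float64 {
	costNorm := clamp01(estCostUSD / maxBudgetUSD)
	latencyNorm := clamp01(avgLatencyMs / maxLatencyMs)
	failureNorm := clamp01(errorRate)
	weightNorm := weight / 10.0
	return costNorm*w.Cost + latencyNorm*w.Latency + failureNorm*w.Failure - weightNorm*w.Weight
}

func main() {
	w := Weights{Cost: 1, Latency: 1, Failure: 1, Weight: 1}
	// 0.25 (cost) + 0.05 (latency) + 0 (failure) - 0.8 (weight) = -0.5
	fmt.Println(scoreModel(0.005, 0.02, 500, 10000, 0, 8, w))
}
```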

Eligibility Filtering

eligibleModels() filters the model registry:

  1. Must be Enabled
  2. Must meet min_weight threshold
  3. Must have sufficient context window (estimated tokens * 1.15 headroom)
  4. Provider must not be in "down" health state
  5. Estimated cost must be within budget

For thompson mode, eligible models are reordered by Thompson Sampling instead of the scoring function.
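The five eligibility rules can be sketched as a single predicate. Model is trimmed to the fields used here, and the budget check covers input-token cost only as a simplification:

```go
package main

import "fmt"

// Model is trimmed to the fields the eligibility rules use.
type Model struct {
	ID               string
	Weight           int
	MaxContextTokens int
	Enabled          bool
	InputPer1K       float64
}

// eligible sketches the five filtering rules listed above.
func eligible(m Model, minWeight, estTokens int, maxBudgetUSD float64, providerDown bool) bool {
	switch {
	case !m.Enabled: // rule 1
		return false
	case m.Weight < minWeight: // rule 2
		return false
	case float64(m.MaxContextTokens) < float64(estTokens)*1.15: // rule 3: 15% headroom
		return false
	case providerDown: // rule 4
		return false
	case float64(estTokens)/1000*m.InputPer1K > maxBudgetUSD: // rule 5: input-cost estimate
		return false
	}
	return true
}

func main() {
	m := Model{ID: "gpt-4o", Weight: 8, MaxContextTokens: 128000, Enabled: true, InputPer1K: 0.0025}
	fmt.Println(eligible(m, 5, 2000, 0.02, false)) // true
}
```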

RouteAndSend Flow

func (e *Engine) RouteAndSend(ctx context.Context, req Request, policy Policy) (Decision, ProviderResponse, error)

  1. Resolve defaults (fill in zero-value policy fields from server defaults)
  2. Get eligible models
  3. If model_hint is set and the model exists, try it first
  4. Sort remaining models by score
  5. For each model (up to 5 attempts):
     a. Look up the adapter by model.ProviderID
     b. Call adapter.Send(ctx, model.ID, req)
     c. On success: return decision + response
     d. On error: classify the error and decide the next action:
        • ErrContextOverflow: Find a model with a larger context window
        • ErrRateLimited: Skip to the next provider (honor RetryAfter)
        • ErrTransient: Retry the same model with exponential backoff
        • ErrFatal: Try the next model

Orchestration

Orchestrate() handles multi-model modes:

func (e *Engine) Orchestrate(ctx context.Context, req Request, dir OrchestrationDirective) (Decision, json.RawMessage, error)

See Orchestration Modes for details.

Streaming

func (e *Engine) RouteAndStream(ctx context.Context, req Request, policy Policy) (Decision, io.ReadCloser, error)

Same model selection as RouteAndSend, but calls SendStream() on adapters that implement StreamSender. Returns the raw SSE stream body for the HTTP handler to proxy.

Health Integration

The engine optionally uses a HealthChecker interface:

type HealthChecker interface {
    ProviderState(providerID string) ProviderHealthState
}

This provides:

  • Error rate for scoring (failureNorm)
  • "Down" state for eligibility filtering
  • Average latency for scoring (latencyNorm)

Thompson Sampling Integration

When a BanditPolicy is set:

type BanditPolicy interface {
    Sample(models []Model, tokenBucket string) []Model
}

In thompson mode, eligibleModels() calls banditPolicy.Sample() instead of the scoring function. The sampler draws from Beta distributions parameterized by historical reward data.

Thread Safety

The engine uses sync.RWMutex to protect the model registry and adapter map. Reads (model selection, routing) take a read lock. Writes (register/unregister) take a write lock.

Orchestration Modes

Orchestration enables multi-model reasoning patterns. The orchestration logic lives in internal/router/engine.go in the Orchestrate() method.

Architecture

Orchestrate(req, directive)
  ├── adversarial: Plan → Critique → Refine (loop)
  ├── vote:        N Voters → Judge → Select best
  ├── refine:      Generate → Refine → Refine (loop)
  └── planning:    Single RouteAndSend with planning profile

Model Selection for Orchestration

Each orchestration mode needs a "primary" model and optionally a "review" model. Models are selected by:

  1. Explicit model ID: primary_model_id / review_model_id in the directive
  2. Weight floor: primary_min_weight / review_min_weight sets minimum capability
  3. Automatic: Falls back to routing engine scoring with the appropriate policy

For review models, the policy uses high_confidence mode by default to ensure a capable judge/critic.

Adversarial Mode

Three-phase iterative refinement with a separate critique model:

// Phase 1: Plan
planResp = RouteAndSend(req with "Create a detailed plan...")
// Phase 2: Critique (loop N iterations)
critiqueResp = RouteAndSend(req with "Critique this plan: ...")
// Phase 3: Refine
refinedResp = RouteAndSend(req with "Refine based on critique: ...")

The critique and refine phases repeat for directive.Iterations (default 1).

Output schema:

{
  "initial_plan": "Plan text from phase 1",
  "critique": "Final critique from last iteration",
  "refined_plan": "Final refined plan from last iteration"
}

Vote Mode

Multiple models respond independently, a judge selects the best:

// Phase 1: Collect votes (one per eligible model, up to 3)
for model in eligibleModels:
    responses[model] = RouteAndSend(req, model)

// Phase 2: Judge
judgeResp = RouteAndSend(req with "Select the best response (1-N): ...")
selectedIdx = parseNumber(judgeResp) - 1
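One possible shape for the parseNumber step, extracting the judge's 1-based choice; the real helper may differ, and out-of-range answers fall back to the first response here:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// parseChoice sketches how a judge reply like "Response 2 is best" can be
// reduced to a 1-based index. Missing or out-of-range numbers fall back
// to the first response. (Illustrative, not the actual parseNumber.)
func parseChoice(reply string, n int) int {
	m := regexp.MustCompile(`\d+`).FindString(reply)
	v, err := strconv.Atoi(m)
	if err != nil || v < 1 || v > n {
		return 1
	}
	return v
}

func main() {
	fmt.Println(parseChoice("Response 2 is the strongest.", 3)) // 2
	fmt.Println(parseChoice("no clear winner", 3))              // 1
}
```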

Output schema:

{
  "responses": [
    {"model": "gpt-4", "content": "...", "selected": true},
    {"model": "claude-sonnet", "content": "...", "selected": false}
  ],
  "selected": 0,
  "judge": "claude-opus"
}

Refine Mode

Single model iteratively improves its own response:

// Phase 1: Initial response
resp = RouteAndSend(req)

// Phase 2: Iterative refinement (loop N iterations)
for i := 0; i < iterations; i++:
    resp = RouteAndSend(req with "Review and improve: " + resp)

Output schema:

{
  "refined_response": "Final refined text",
  "iterations": 3,
  "model": "claude-opus"
}

Planning Mode

Falls through to a standard RouteAndSend with the planning routing profile:

decision, resp, err = RouteAndSend(req, Policy{Mode: "planning"})

Cost and Latency

Orchestration makes multiple LLM calls. The Decision returned by Orchestrate() accumulates costs from all calls:

totalDecision.EstimatedCostUSD += stepDecision.EstimatedCostUSD

The routing reason is set to {mode}-orchestration (e.g., adversarial-orchestration).

Temporal Integration

When Temporal is enabled, orchestration runs as an OrchestrationWorkflow:

  • Each LLM call becomes a Temporal activity
  • Activities run with retry policies and timeouts
  • The full execution is visible in the Temporal UI
  • If Temporal is unavailable, orchestration falls back to direct in-process execution

See Temporal Workflows for details.

Adding New Orchestration Modes

To add a new mode:

  1. Add the mode name to the validation list in handlers_plan.go
  2. Add a case in Orchestrate() in engine.go
  3. Implement the multi-call pattern following existing modes
  4. Return a json.RawMessage with the composite result
  5. Update the OrchestrationWorkflow in temporal/workflows.go if using Temporal

Provider Adapters

Provider adapters translate TokenHub's generic request format into provider-specific API calls. Each adapter implements the router.Sender interface.

Interface

// Sender is the core provider adapter interface.
type Sender interface {
    ID() string
    Send(ctx context.Context, model string, req Request) (ProviderResponse, error)
    ClassifyError(err error) *ClassifiedError
}

// StreamSender extends Sender with streaming support.
type StreamSender interface {
    Sender
    SendStream(ctx context.Context, model string, req Request) (io.ReadCloser, error)
}

// Probeable enables active health probing.
type Probeable interface {
    ID() string
    HealthEndpoint() string
}

ProviderResponse is []byte (raw JSON from the provider).

Existing Adapters

OpenAI (internal/providers/openai/)

  • Endpoint: POST {baseURL}/v1/chat/completions
  • Health: GET {baseURL}/v1/models
  • Auth: Authorization: Bearer {apiKey}
  • Request translation: Maps req.Messages to OpenAI chat format, merges req.Parameters
  • Error classification:
    • 429 → ErrRateLimited (with Retry-After header parsing)
    • 5xx → ErrTransient
    • Body contains context_length_exceeded → ErrContextOverflow
    • Other → ErrFatal
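The classification table above, sketched in Go; ClassifiedError and the class names are simplified stand-ins for the router types:

```go
package main

import (
	"fmt"
	"strings"
)

// ClassifiedError is a simplified stand-in for router.ClassifiedError.
type ClassifiedError struct {
	Class      string
	RetryAfter float64
}

// classify sketches the OpenAI adapter's classification rules above.
func classify(status int, body string, retryAfter float64) *ClassifiedError {
	switch {
	case status == 429: // rate limit, honor Retry-After
		return &ClassifiedError{Class: "rate_limited", RetryAfter: retryAfter}
	case status >= 500: // server-side, worth retrying
		return &ClassifiedError{Class: "transient"}
	case strings.Contains(body, "context_length_exceeded"):
		return &ClassifiedError{Class: "context_overflow"}
	default:
		return &ClassifiedError{Class: "fatal"}
	}
}

func main() {
	fmt.Println(classify(429, "", 2).Class) // rate_limited
	fmt.Println(classify(503, "", 0).Class) // transient
}
```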

Anthropic (internal/providers/anthropic/)

  • Endpoint: POST {baseURL}/v1/messages
  • Health: GET {baseURL}/v1/messages (405 = healthy)
  • Auth: x-api-key: {apiKey}, anthropic-version: 2023-06-01
  • Request translation: Splits system message from user messages (Anthropic API requires separate system field), defaults max_tokens to 4096 if not in req.Parameters
  • Error classification: Same pattern as OpenAI

vLLM (internal/providers/vllm/)

  • Endpoint: POST {endpoint}/v1/chat/completions (OpenAI-compatible)
  • Health: GET {endpoint}/health
  • Auth: None (local deployment)
  • Features: Multiple endpoints with round-robin load balancing
  • Request translation: Same as OpenAI (vLLM implements OpenAI-compatible API)

Common Patterns

Parameter Forwarding

All adapters merge req.Parameters into the provider payload:

for k, v := range req.Parameters {
    if k != "model" && k != "messages" {
        payload[k] = v
    }
}

Reserved keys (model, messages, stream) are never overridden by parameters.

Request ID Propagation

All adapters forward the request ID for distributed tracing:

if reqID := providers.GetRequestID(ctx); reqID != "" {
    req.Header.Set("X-Request-ID", reqID)
}

The request ID is injected into the context by the HTTP handler using providers.WithRequestID().

Error Wrapping

Adapters wrap HTTP errors in providers.StatusError:

type StatusError struct {
    StatusCode     int
    Body           string
    RetryAfterSecs float64
}

The ClassifyError() method on each adapter converts these to router.ClassifiedError for the routing engine's failover logic.

Creating a New Adapter

To add support for a new provider:

  1. Create internal/providers/{name}/adapter.go
  2. Implement router.Sender (and optionally router.StreamSender and health.Probeable)
  3. Add an Option pattern for configuration (timeout, endpoints, etc.)
  4. Add a case for the new type in registerProviderAdapter() in internal/httpapi/handlers_admin.go
  5. Register providers and models at runtime via the admin API or tokenhubctl

Example skeleton:

package newprovider

import (
    "context"
    "github.com/jordanhubbard/tokenhub/internal/router"
)

type Adapter struct {
    id     string
    apiKey string
    // ...
}

func New(id, apiKey string) *Adapter {
    return &Adapter{id: id, apiKey: apiKey}
}

func (a *Adapter) ID() string { return a.id }

func (a *Adapter) Send(ctx context.Context, model string, req router.Request) (router.ProviderResponse, error) {
    // Translate req to provider format, make HTTP call, return raw JSON
}

func (a *Adapter) ClassifyError(err error) *router.ClassifiedError {
    // Classify the error for failover logic
}

func (a *Adapter) HealthEndpoint() string {
    return "https://api.newprovider.com/health"
}

Health System

The health system tracks provider reliability and provides both passive monitoring (based on request outcomes) and active probing (periodic HTTP checks).

Components

Health Tracker (internal/health/tracker.go)

The tracker maintains per-provider health state:

type ProviderHealthState struct {
    State         string    // "healthy", "degraded", "down"
    TotalRequests int64
    TotalErrors   int64
    ConsecErrors  int
    AvgLatencyMs  float64   // Exponential moving average
    LastError     string
    LastSuccessAt time.Time
    CooldownUntil time.Time
}

State Transitions

                   success
   ┌─────────────────────────────────┐
   │                                 │
   ▼          2+ consec errors       │
Healthy ──────────────────────► Degraded
   ▲                                 │
   │          success                │ 5+ consec errors
   │◄────────────────────────────────┤
   │                                 ▼
   │                               Down
   │          cooldown expired       │
   │          + success              │
   └─────────────────────────────────┘

Configuration

type Config struct {
    DegradedThreshold int           // Consecutive errors to enter degraded (default: 2)
    DownThreshold     int           // Consecutive errors to enter down (default: 5)
    CooldownDuration  time.Duration // Time in down state before retry (default: 30s)
}

Recording Results

// Called after every provider request
tracker.RecordSuccess(providerID, latencyMs)
tracker.RecordError(providerID, errorMsg)

Each success resets the consecutive error counter. Each error increments it and potentially triggers a state transition.

Health Prober (internal/health/prober.go)

The prober performs active health checks against provider endpoints:

type Probeable interface {
    ID() string
    HealthEndpoint() string
}

Probe Logic

  • Sends GET requests to each provider's health endpoint
  • Runs all probes concurrently with a per-probe timeout
  • 2xx or 405 responses are considered healthy (405 is expected from some endpoints like Anthropic's /v1/messages)
  • Any other response or connection error records a failure

Configuration

type ProberConfig struct {
    Interval time.Duration // Time between probe rounds (default: 30s)
    Timeout  time.Duration // Per-probe HTTP timeout (default: 10s)
}

Provider Health Endpoints

Provider     Endpoint             Success
OpenAI       GET /v1/models       2xx
Anthropic    GET /v1/messages     2xx or 405
vLLM         GET /health          2xx

Integration with Routing

The routing engine queries health state during model selection:

  1. Eligibility: Models from providers in "down" state are excluded
  2. Scoring: The failure rate (totalErrors / totalRequests) contributes to the model's score
  3. Latency: The exponential moving average latency contributes to the model's score

The engine reads health state through the HealthChecker interface:

type HealthChecker interface {
    ProviderState(providerID string) ProviderHealthState
}

The tracker implements this interface and is passed to the engine via engine.SetHealthChecker().

Observability

Provider health is exposed via:

  • GET /admin/v1/health — JSON health state for all providers
  • Admin UI health panel — Visual health badges
  • SSE events — Error events include provider state changes

Storage Layer

TokenHub uses SQLite for persistence, providing a zero-dependency embedded database. The storage layer is defined by the store.Store interface and implemented by store.SQLiteStore.

Interface

The Store interface (internal/store/store.go) provides methods for all persistence needs:

Models

UpsertModel(ctx, Model) error
GetModel(ctx, id) (*Model, error)
ListModels(ctx) ([]Model, error)
DeleteModel(ctx, id) error

Providers

UpsertProvider(ctx, Provider) error
ListProviders(ctx) ([]Provider, error)
DeleteProvider(ctx, id) error

Request Logs

LogRequest(ctx, RequestLog) error
ListRequestLogs(ctx, limit, offset) ([]RequestLog, error)

Audit Logs

LogAudit(ctx, AuditEntry) error
ListAuditLogs(ctx, limit, offset) ([]AuditEntry, error)

Reward Entries

LogReward(ctx, RewardEntry) error
ListRewardEntries(ctx, limit, offset) ([]RewardEntry, error)
GetRewardSummary(ctx) ([]RewardSummary, error)

API Keys

CreateAPIKey(ctx, APIKeyRecord) error
GetAPIKey(ctx, id) (*APIKeyRecord, error)
ListAPIKeys(ctx) ([]APIKeyRecord, error)
UpdateAPIKey(ctx, APIKeyRecord) error
DeleteAPIKey(ctx, id) error

Vault Blob

SaveVaultBlob(ctx, salt, data) error
LoadVaultBlob(ctx) (salt, data, error)

Routing Configuration

SaveRoutingConfig(ctx, RoutingConfig) error
LoadRoutingConfig(ctx) (RoutingConfig, error)

Schema

The database schema is created and migrated in sqlite.go's Migrate() method:

models

CREATE TABLE IF NOT EXISTS models (
    id TEXT PRIMARY KEY,
    provider_id TEXT NOT NULL,
    weight INTEGER NOT NULL DEFAULT 5,
    max_context_tokens INTEGER NOT NULL DEFAULT 4096,
    input_per_1k REAL NOT NULL DEFAULT 0,
    output_per_1k REAL NOT NULL DEFAULT 0,
    enabled INTEGER NOT NULL DEFAULT 1
);

providers

CREATE TABLE IF NOT EXISTS providers (
    id TEXT PRIMARY KEY,
    type TEXT NOT NULL,
    enabled INTEGER NOT NULL DEFAULT 1,
    base_url TEXT NOT NULL DEFAULT '',
    cred_store TEXT NOT NULL DEFAULT 'none'
);

request_logs

CREATE TABLE IF NOT EXISTS request_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    request_id TEXT NOT NULL DEFAULT '',
    model_id TEXT NOT NULL DEFAULT '',
    provider_id TEXT NOT NULL DEFAULT '',
    mode TEXT NOT NULL DEFAULT '',
    estimated_cost_usd REAL NOT NULL DEFAULT 0,
    latency_ms INTEGER NOT NULL DEFAULT 0,
    status_code INTEGER NOT NULL DEFAULT 0,
    error_class TEXT NOT NULL DEFAULT ''
);

audit_logs

CREATE TABLE IF NOT EXISTS audit_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    action TEXT NOT NULL,
    resource TEXT NOT NULL DEFAULT '',
    request_id TEXT NOT NULL DEFAULT ''
);

reward_entries

CREATE TABLE IF NOT EXISTS reward_entries (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL,
    request_id TEXT NOT NULL DEFAULT '',
    model_id TEXT NOT NULL DEFAULT '',
    provider_id TEXT NOT NULL DEFAULT '',
    mode TEXT NOT NULL DEFAULT '',
    estimated_tokens INTEGER NOT NULL DEFAULT 0,
    token_bucket TEXT NOT NULL DEFAULT '',
    latency_budget_ms REAL NOT NULL DEFAULT 0,
    latency_ms REAL NOT NULL DEFAULT 0,
    cost_usd REAL NOT NULL DEFAULT 0,
    success INTEGER NOT NULL DEFAULT 0,
    error_class TEXT NOT NULL DEFAULT '',
    reward REAL NOT NULL DEFAULT 0
);

api_keys

CREATE TABLE IF NOT EXISTS api_keys (
    id TEXT PRIMARY KEY,
    key_hash TEXT NOT NULL,
    key_prefix TEXT NOT NULL,
    name TEXT NOT NULL,
    scopes TEXT NOT NULL DEFAULT '["chat","plan"]',
    created_at TEXT NOT NULL,
    last_used_at TEXT,
    expires_at TEXT,
    rotation_days INTEGER NOT NULL DEFAULT 0,
    enabled INTEGER NOT NULL DEFAULT 1
);

vault_blob

CREATE TABLE IF NOT EXISTS vault_blob (
    id TEXT PRIMARY KEY DEFAULT 'singleton',
    salt TEXT,
    data_json TEXT
);

routing_config

CREATE TABLE IF NOT EXISTS routing_config (
    id TEXT PRIMARY KEY DEFAULT 'default',
    default_mode TEXT NOT NULL DEFAULT '',
    default_max_budget_usd REAL NOT NULL DEFAULT 0,
    default_max_latency_ms INTEGER NOT NULL DEFAULT 0
);

SQLite Configuration

The default DSN includes pragmas for performance:

file:/data/tokenhub.sqlite?_pragma=busy_timeout(5000)&_pragma=journal_mode(WAL)

  • busy_timeout: Wait up to 5 seconds for locks instead of failing immediately
  • journal_mode(WAL): Write-Ahead Logging for concurrent read/write access

TSDB

The time-series database (internal/tsdb/) uses a separate table in the same SQLite database:

CREATE TABLE IF NOT EXISTS tsdb_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts INTEGER NOT NULL,        -- Unix nanoseconds
    metric TEXT NOT NULL,
    model_id TEXT NOT NULL DEFAULT '',
    provider_id TEXT NOT NULL DEFAULT '',
    value REAL NOT NULL
);

Features:

  • Write buffering (batch size 100)
  • Automatic retention pruning (default 7 days)
  • Downsampling support (configurable step size in queries)

Security Model

TokenHub implements security at multiple layers: credential encryption, client authentication, input validation, and audit logging.

Credential Security

Vault Encryption

Provider API keys are encrypted using AES-256-GCM:

  1. Admin provides a vault password
  2. Password + random salt → Argon2id key derivation → 256-bit encryption key
  3. Each value is encrypted with a unique nonce
  4. Encrypted values are stored in SQLite

Argon2id Parameters (per OWASP recommendations):

  • Time: 3 iterations
  • Memory: 64 MB
  • Threads: 4
  • Salt: 16 random bytes
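Steps 3-4 of the flow above, sketched with Go's standard library. Key derivation (Argon2id from password + salt) is assumed to have already produced the 32-byte key:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// encrypt seals a value with AES-256-GCM and a fresh random nonce, as in
// step 3 above. We assume a pre-derived 32-byte key; the real vault
// derives it from the password and salt via Argon2id.
func encrypt(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // 32-byte key → AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce so decryption can recover it.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func decrypt(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, fmt.Errorf("sealed value too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	key := make([]byte, 32) // stand-in for an Argon2id-derived key
	sealed, _ := encrypt(key, []byte("sk-secret"))
	plain, _ := decrypt(key, sealed)
	fmt.Println(string(plain)) // sk-secret
}
```

GCM authenticates as well as encrypts, so a tampered ciphertext fails to decrypt rather than yielding garbage.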

Key Material Handling

  • Encryption keys exist only in memory while the vault is unlocked
  • Auto-lock clears the key after 30 minutes of inactivity
  • Vault salt is persisted in the database for key re-derivation
  • Password rotation re-encrypts all values atomically

Admin Authentication

Admin Token

All /admin/v1/* endpoints require a bearer token set via TOKENHUB_ADMIN_TOKEN. If not set, the server auto-generates a cryptographically random 64-character hex token at startup. The token is never logged — it is written to a file at /data/.admin-token (or ~/.tokenhub/.admin-token for native deployments) and can be retrieved with:

tokenhubctl admin-token

Client Authentication

API Key Security

  • Keys are hashed with bcrypt (cost 10) before storage
  • SHA-256 pre-hash allows keys longer than bcrypt's 72-byte input limit
  • 5-minute validation cache reduces bcrypt overhead
  • Plaintext is shown only once at creation/rotation

Key Validation Flow

Request → Extract Bearer token → Check cache (5min TTL)
  ├── Cache hit → Check scopes → Allow/Deny
  └── Cache miss → Load by prefix → bcrypt verify → Check enabled → Check expiry
       ├── Valid → Update cache + last_used_at → Check scopes → Allow/Deny
       └── Invalid → 401 Unauthorized

Input Validation

All API inputs are validated before processing:

Chat/Plan Endpoints

  • Messages array: required, non-empty
  • max_budget_usd: 0-100 range
  • max_latency_ms: 0-300000 range
  • min_weight: 0-10 range
  • Orchestration iterations: 0-10 range
  • Orchestration mode: must be a known value
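The chat/plan range checks above can be sketched as a single validator (illustrative type names, not TokenHub's actual structs):

```go
// validatePolicy enforces the documented ranges before a request is routed.
package main

import (
	"errors"
	"fmt"
)

type Policy struct {
	MaxBudgetUSD float64
	MaxLatencyMs int
	MinWeight    float64
}

func validatePolicy(p Policy) error {
	switch {
	case p.MaxBudgetUSD < 0 || p.MaxBudgetUSD > 100:
		return errors.New("max_budget_usd must be in [0, 100]")
	case p.MaxLatencyMs < 0 || p.MaxLatencyMs > 300000:
		return errors.New("max_latency_ms must be in [0, 300000]")
	case p.MinWeight < 0 || p.MinWeight > 10:
		return errors.New("min_weight must be in [0, 10]")
	}
	return nil
}

func main() {
	fmt.Println(validatePolicy(Policy{MaxBudgetUSD: 0.05, MaxLatencyMs: 20000})) // <nil>
	fmt.Println(validatePolicy(Policy{MaxBudgetUSD: 500}) != nil)                // true
}
```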

Admin Endpoints

  • Routing config mode: must be a known value
  • Routing config budget/latency: same ranges as consumer API
  • Model weight: reasonable range
  • API key name: required

Request Isolation

  • Each request gets its own context with a unique request ID
  • Provider API keys are never exposed to clients
  • Client API key records are attached to context but not serialized in responses
  • Request parameters are validated before forwarding to providers

Audit Trail

All administrative mutations are logged:

type AuditEntry struct {
    Timestamp time.Time
    Action    string  // e.g., "vault.unlock", "model.patch"
    Resource  string  // Resource identifier
    RequestID string  // For correlation
}

Auditable actions:

  • Vault operations (lock, unlock, rotate)
  • Provider CRUD
  • Model CRUD
  • API key lifecycle (create, rotate, update, revoke)
  • Routing configuration changes

Network Security

TokenHub itself does not implement TLS. In production:

  1. Use a reverse proxy (nginx, Caddy, Traefik) for TLS termination
  2. Restrict admin endpoints to internal networks or VPN
  3. Use CORS appropriately (currently allows all origins for development)

Recommendations

  1. Vault password: Use a strong, unique password (16+ characters)
  2. API key rotation: Rotate keys every 90 days (configurable via rotation_days)
  3. Network segmentation: Keep admin endpoints behind a VPN or firewall
  4. TLS everywhere: Terminate TLS at a reverse proxy in front of TokenHub
  5. Database backups: SQLite file contains encrypted credentials and configuration
  6. Monitor audit logs: Set up alerting on unexpected admin actions

Temporal Workflows

TokenHub optionally integrates with Temporal for durable workflow execution. When enabled, every chat and orchestration request is dispatched as a Temporal workflow, providing visibility, retry guarantees, and execution history.

Architecture

HTTP Handler
  │
  ├── Temporal Enabled?
  │     ├── Yes → Start Temporal Workflow → Wait for result → Return response
  │     └── No  → Direct engine call → Return response
  │
  └── Temporal Unavailable (runtime)
        └── Fall back to direct engine call

Configuration

Env Var                        Default          Description
TOKENHUB_TEMPORAL_ENABLED      false            Enable Temporal dispatch
TOKENHUB_TEMPORAL_HOST         localhost:7233   Temporal server address
TOKENHUB_TEMPORAL_NAMESPACE    tokenhub         Temporal namespace
TOKENHUB_TEMPORAL_TASK_QUEUE   tokenhub-tasks   Worker task queue name

Components

Manager (internal/temporal/manager.go)

The manager creates and manages the Temporal client and worker:

type Manager struct {
    client client.Client
    worker worker.Worker
}
  • New(cfg, activities) — Creates Temporal client, registers workflows and activities
  • Start() — Starts the worker (non-blocking)
  • Client() — Returns the Temporal client for HTTP handlers
  • Stop() — Gracefully stops worker and closes client

Types (internal/temporal/types.go)

Input/output types for workflows:

type ChatInput struct {
    RequestID string
    APIKeyID  string
    Request   router.Request
    Policy    router.Policy
}

type ChatOutput struct {
    Decision  router.Decision
    Response  json.RawMessage
    LatencyMs int64
    Error     string
}

type OrchestrationInput struct {
    RequestID string
    APIKeyID  string
    Request   router.Request
    Directive router.OrchestrationDirective
}

Activities (internal/temporal/activities.go)

Activities are the atomic units of work. They receive injected dependencies:

type Activities struct {
    Engine   *router.Engine
    Store    store.Store
    Health   *health.Tracker
    Metrics  *metrics.Registry
    EventBus *events.Bus
    Stats    *stats.Collector
    TSDB     *tsdb.Store
}

Key activities:

  • ChatActivity: Calls engine.RouteAndSend() and returns the result
  • LogResultActivity: Persists metrics, request logs, reward entries, TSDB points, and SSE events

Workflows (internal/temporal/workflows.go)

  • ChatWorkflow: Calls ChatActivity then LogResultActivity
  • OrchestrationWorkflow: Calls ChatActivity for orchestration, then LogResultActivity

HTTP Handler Integration

Handlers check for a Temporal client and dispatch accordingly:

if d.TemporalClient != nil {
    run, err := d.TemporalClient.ExecuteWorkflow(ctx, opts, ChatWorkflow, input)
    if err != nil {
        // Temporal unavailable — fall back
        decision, resp, err = d.Engine.RouteAndSend(ctx, req, policy)
    } else {
        var output ChatOutput
        err = run.Get(ctx, &output)
        // Use output
    }
} else {
    decision, resp, err = d.Engine.RouteAndSend(ctx, req, policy)
}

The fallback ensures TokenHub continues to work even if Temporal becomes unavailable at runtime.

Workflow Visibility

Admin endpoints expose Temporal workflow data:

  • GET /admin/v1/workflows?limit=50&status=RUNNING — List workflows
  • GET /admin/v1/workflows/{id} — Describe workflow
  • GET /admin/v1/workflows/{id}/history — Activity history

Status values: RUNNING, COMPLETED, FAILED, CANCELED, TERMINATED, CONTINUED_AS_NEW, TIMED_OUT

Docker Compose Setup

temporal:
  image: temporalio/auto-setup:latest
  ports:
    - "7233:7233"
  environment:
    - DB=sqlite

temporal-ui:
  image: temporalio/ui:latest
  ports:
    - "8233:8080"
  environment:
    - TEMPORAL_ADDRESS=temporal:7233

Access the Temporal Web UI at http://localhost:8233.

Streaming Note

Streaming requests (stream: true) bypass Temporal and use direct engine dispatch. This is because streaming requires a persistent HTTP connection for SSE, which is incompatible with Temporal's request-response workflow model.

Extending TokenHub

This guide covers common extension points for adding functionality to TokenHub.

Adding a New Provider

  1. Create the adapter package:
internal/providers/newprovider/
├── adapter.go      # Sender implementation
└── adapter_test.go # Tests
  2. Implement the interfaces:
package newprovider

type Adapter struct {
    id      string
    apiKey  string
    baseURL string
    client  *http.Client
}

// Required: router.Sender
func (a *Adapter) ID() string { return a.id }
func (a *Adapter) Send(ctx context.Context, model string, req router.Request) (router.ProviderResponse, error) { ... }
func (a *Adapter) ClassifyError(err error) *router.ClassifiedError { ... }

// Optional: router.StreamSender
func (a *Adapter) SendStream(ctx context.Context, model string, req router.Request) (io.ReadCloser, error) { ... }

// Optional: health.Probeable
func (a *Adapter) HealthEndpoint() string { return a.baseURL + "/health" }
  3. Register via the admin API (providers and models are registered at runtime, not compiled in):
curl -X POST http://localhost:8080/admin/v1/providers \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"id":"newprovider","type":"openai","base_url":"https://api.newprovider.com","api_key":"..."}'
  4. Add adapter construction in registerProviderAdapter() in handlers_admin.go:
case "newprovider":
    d.Engine.RegisterAdapter(newprovider.New(p.ID, apiKey, p.BaseURL, newprovider.WithTimeout(timeout)))

Adding a New Routing Mode

  1. Define the weight profile in internal/router/engine.go:
var modeWeights = map[string]weights{
    // ...existing modes...
    "mymode": {Cost: 0.3, Latency: 0.2, Failure: 0.2, Weight: 0.3},
}
  2. Add validation in internal/httpapi/handlers_chat.go and handlers_plan.go:
case "mymode":
    // valid
  3. Add to routing config validation in handlers_routing.go.
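A weight profile like the one above is typically combined with normalized per-model signals into a single score. This is an illustrative sketch of how such a combination might look, not the engine's actual scoring code:

```go
// score combines normalized signals in [0,1] (lower is better);
// capability weight is inverted because a heavier model should
// lower the score, not raise it.
package main

import "fmt"

type weights struct{ Cost, Latency, Failure, Weight float64 }

func score(w weights, cost, latency, failureRate, capability float64) float64 {
	return w.Cost*cost + w.Latency*latency + w.Failure*failureRate + w.Weight*(1-capability)
}

func main() {
	my := weights{Cost: 0.3, Latency: 0.2, Failure: 0.2, Weight: 0.3}
	cheapSlow := score(my, 0.1, 0.8, 0.05, 0.4)
	pricyFast := score(my, 0.9, 0.1, 0.05, 0.9)
	fmt.Println(cheapSlow, pricyFast) // the faster, more capable model wins here
}
```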

Adding a New Orchestration Mode

  1. Add the case in engine.Orchestrate():
case "mymode":
    // Implement multi-call pattern
    result, err := json.Marshal(map[string]any{...})
    return totalDecision, result, err
  2. Add validation in handlers_plan.go.

  3. Update Temporal if using workflows:

// In OrchestrationWorkflow
case "mymode":
    // Implement as Temporal activities

Adding New Admin Endpoints

  1. Create handler in internal/httpapi/handlers_newfeature.go:
func NewFeatureHandler(d Dependencies) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // Handler logic
    }
}
  2. Mount route in internal/httpapi/routes.go:
r.Get("/admin/v1/newfeature", NewFeatureHandler(d))
  3. Add to Dependencies if new services are needed.

Adding New Metrics

In internal/metrics/metrics.go:

type Registry struct {
    // ...existing metrics...
    NewMetric *prometheus.CounterVec
}

func New() *Registry {
    r := &Registry{
        NewMetric: prometheus.NewCounterVec(prometheus.CounterOpts{
            Namespace: "tokenhub",
            Name:      "new_metric_total",
            Help:      "Description of the new metric",
        }, []string{"label1", "label2"}),
    }
    // Register with Prometheus
    return r
}

Adding New Store Operations

  1. Add to the interface in internal/store/store.go
  2. Implement in SQLite in internal/store/sqlite.go
  3. Add migration in Migrate() if new tables are needed
  4. Write tests in internal/store/sqlite_test.go

Testing

TokenHub uses Go's standard testing package. Key test patterns:

  • Unit tests: Each package has *_test.go files
  • Integration tests: internal/httpapi/handlers_test.go tests the full HTTP stack
  • Mock adapters: mockSender in handler tests simulates provider responses
  • In-memory SQLite: Tests use :memory: DSN for isolated databases

Run all tests:

make test        # Standard tests
make test-race   # With race detector

Build

make build       # Build to bin/tokenhub
make package     # Build Docker image
make lint        # Run linter (requires golangci-lint)
make vet         # Go vet

Configuration Reference

TokenHub is configured entirely via environment variables. All variables are optional and have sensible defaults.

Environment Variables

Server

Variable                         Default                 Description
TOKENHUB_LISTEN_ADDR             :8080                   HTTP server listen address (binds all interfaces)
TOKENHUB_LOG_LEVEL               info                    Log level: debug, info, warn, error
TOKENHUB_DB_DSN                  /data/tokenhub.sqlite   SQLite database path
TOKENHUB_VAULT_ENABLED           true                    Enable encrypted credential vault
TOKENHUB_VAULT_PASSWORD          (unset)                 Auto-unlock vault at startup (headless mode)
TOKENHUB_PROVIDER_TIMEOUT_SECS   30                      HTTP timeout for provider API calls

Routing Defaults

Variable                          Default   Description
TOKENHUB_DEFAULT_MODE             normal    Default routing mode
TOKENHUB_DEFAULT_MAX_BUDGET_USD   0.05      Default max cost per request (USD)
TOKENHUB_DEFAULT_MAX_LATENCY_MS   20000     Default max latency (milliseconds)

Security & Hardening

Variable                    Default   Description
TOKENHUB_ADMIN_TOKEN        (unset)   Bearer token for /admin/v1/* access (required in production)
TOKENHUB_CORS_ORIGINS       *         Comma-separated allowed CORS origins
TOKENHUB_RATE_LIMIT_RPS     60        Max requests per second per IP
TOKENHUB_RATE_LIMIT_BURST   120       Burst capacity per IP

Credentials

Variable                    Default                   Description
TOKENHUB_CREDENTIALS_FILE   ~/.tokenhub/credentials   Path to external credentials JSON file

Providers are registered at startup via ~/.tokenhub/credentials or at runtime via the admin API, tokenhubctl, or the admin UI. At least one provider must be registered for TokenHub to route requests.

Temporal (Optional)

Variable                       Default          Description
TOKENHUB_TEMPORAL_ENABLED      false            Enable Temporal workflow dispatch
TOKENHUB_TEMPORAL_HOST         localhost:7233   Temporal server host:port
TOKENHUB_TEMPORAL_NAMESPACE    tokenhub         Temporal namespace
TOKENHUB_TEMPORAL_TASK_QUEUE   tokenhub-tasks   Temporal task queue name

OpenTelemetry (Optional)

Variable                     Default          Description
TOKENHUB_OTEL_ENABLED        false            Enable OpenTelemetry tracing
TOKENHUB_OTEL_ENDPOINT       localhost:4318   OTLP exporter endpoint
TOKENHUB_OTEL_SERVICE_NAME   tokenhub         Service name for traces

External Credentials File

The ~/.tokenhub/credentials file is the primary mechanism for bootstrapping providers and models. It is processed at startup — providers are persisted to the database and API keys are stored in the vault (when TOKENHUB_VAULT_PASSWORD is set). The file must have 0600 permissions.

{
  "providers": [
    {
      "id": "openai",
      "type": "openai",
      "base_url": "https://api.openai.com",
      "api_key": "sk-..."
    },
    {
      "id": "vllm-local",
      "type": "vllm",
      "base_url": "http://localhost:8000"
    }
  ],
  "models": [
    {
      "id": "gpt-4o",
      "provider_id": "openai",
      "weight": 8,
      "max_context_tokens": 128000,
      "input_per_1k": 0.0025,
      "output_per_1k": 0.01
    }
  ]
}

The file is idempotent — providers and models are upserted, so it can remain in place across restarts. api_key is optional for keyless providers (vLLM, Ollama). All providers default to enabled: true unless explicitly set to false.

Example Configuration

Minimal

./bin/tokenhub
# Then register providers via ~/.tokenhub/credentials, admin API, or UI.

Full Production

export TOKENHUB_LISTEN_ADDR=":8080"
export TOKENHUB_LOG_LEVEL="info"
export TOKENHUB_DB_DSN="/data/tokenhub.sqlite"
export TOKENHUB_VAULT_ENABLED="true"
export TOKENHUB_PROVIDER_TIMEOUT_SECS="30"

# Security
export TOKENHUB_ADMIN_TOKEN="your-secret-admin-token"
export TOKENHUB_CORS_ORIGINS="https://app.example.com"
export TOKENHUB_RATE_LIMIT_RPS="100"
export TOKENHUB_RATE_LIMIT_BURST="200"

# Routing
export TOKENHUB_DEFAULT_MODE="normal"
export TOKENHUB_DEFAULT_MAX_BUDGET_USD="0.10"
export TOKENHUB_DEFAULT_MAX_LATENCY_MS="30000"

# Temporal (optional)
export TOKENHUB_TEMPORAL_ENABLED="true"
export TOKENHUB_TEMPORAL_HOST="temporal:7233"

# OpenTelemetry (optional)
export TOKENHUB_OTEL_ENABLED="true"
export TOKENHUB_OTEL_ENDPOINT="otel-collector:4318"

./bin/tokenhub
# Providers are loaded from ~/.tokenhub/credentials, or registered via admin API/UI.

Runtime Configuration

The following settings can be changed at runtime via the admin API or tokenhubctl without restarting:

  • Routing defaults: PUT /admin/v1/routing-config or tokenhubctl routing set
  • Models: POST/PATCH/DELETE /admin/v1/models or tokenhubctl model add/edit/delete
  • Providers: POST/PATCH/DELETE /admin/v1/providers or tokenhubctl provider add/edit/delete
  • API keys: POST/PATCH/DELETE /admin/v1/apikeys or tokenhubctl apikey create/edit/delete
  • TSDB retention: PUT /admin/v1/tsdb/retention or tokenhubctl tsdb

Docker & Compose

TokenHub provides a Dockerfile for container builds and a Docker Compose file for local development with all dependencies.

Docker Image

Build

make package
# or
docker buildx build --load -t tokenhub .

The Dockerfile uses a multi-stage build:

  1. Build stage: golang:1.24-alpine — compiles the Go binary and builds mdbook documentation
  2. Runtime stage: alpine:3.21 — lightweight runtime with curl for health checks

The final image runs as a non-root tokenhub user.

Run

docker run -d \
  -p 8080:8080 \
  -e TOKENHUB_ADMIN_TOKEN="your-admin-token" \
  -v tokenhub_data:/data \
  tokenhub

The container expects:

  • Port 8080: HTTP server (binds all interfaces by default)
  • Volume /data: SQLite database persistence

Docker Compose

Full Stack

docker compose up -d

This starts:

Service       Port   Description
tokenhub      8080   TokenHub server
temporal      7233   Temporal server (gRPC)
temporal-ui   8233   Temporal Web UI

Services

TokenHub

tokenhub:
  image: tokenhub:latest
  ports:
    - "8080:8080"
  environment:
    - TOKENHUB_LISTEN_ADDR=:8080
    - TOKENHUB_DB_DSN=/data/tokenhub.sqlite
    - TOKENHUB_VAULT_ENABLED=true
    - TOKENHUB_VAULT_PASSWORD=${TOKENHUB_VAULT_PASSWORD}
    - TOKENHUB_ADMIN_TOKEN=${TOKENHUB_ADMIN_TOKEN}
  volumes:
    - tokenhub_data:/data
  restart: unless-stopped

Set TOKENHUB_VAULT_PASSWORD to auto-unlock the vault at startup (headless mode). If not set, unlock interactively via UI or tokenhubctl. Providers are loaded from ~/.tokenhub/credentials at startup, or registered at runtime via the admin API, tokenhubctl, or the admin UI.

Note: TOKENHUB_DB_DSN accepts either a plain path (e.g., /data/tokenhub.sqlite) or a file: URI with _pragma query parameters, which the pure-Go modernc.org/sqlite driver applies at connection time. The default DSN already includes the recommended WAL and busy-timeout pragmas.

Temporal

temporal:
  image: temporalio/auto-setup:latest
  ports:
    - "7233:7233"
  environment:
    - DB=sqlite
  volumes:
    - temporal_data:/etc/temporal/data

temporal-ui:
  image: temporalio/ui:latest
  ports:
    - "8233:8080"
  environment:
    - TEMPORAL_ADDRESS=temporal:7233

Environment File

Create a .env file for sensitive values:

TOKENHUB_ADMIN_TOKEN=your-secret-admin-token

Without Temporal

To run without Temporal:

docker compose up -d tokenhub

Or set TOKENHUB_TEMPORAL_ENABLED=false.

Provider Bootstrapping

Providers are loaded from ~/.tokenhub/credentials at startup. For Docker, mount the credentials file into the container or use the host path if running via Docker Compose with a volume mount. See Provider Management for the file format.

Health Check

The Docker health check uses the /healthz endpoint:

curl -f http://localhost:8080/healthz

Returns 200 when adapters and models are registered, 503 otherwise.

Data Persistence

All persistent data is stored in SQLite at the path configured by TOKENHUB_DB_DSN. In Docker, mount a volume to /data:

volumes:
  - tokenhub_data:/data

This persists:

  • Model and provider configurations
  • Vault salt and encrypted credentials
  • Request logs, audit logs, reward entries
  • API keys
  • Routing configuration
  • TSDB time-series data

Resource Requirements

TokenHub is lightweight:

  • Memory: ~50MB baseline, scales with request concurrency
  • CPU: Minimal (most time is spent waiting on provider APIs)
  • Disk: Depends on log retention; ~1MB per 10,000 requests

Production Checklist

Use this checklist when deploying TokenHub to production.

Pre-Deployment

  • Set a strong vault password (16+ characters, mixed case, numbers, symbols)
  • Configure at least one provider (via ~/.tokenhub/credentials, the admin API, or the admin UI)
  • Set appropriate routing defaults for your use case
  • Create API keys for all client applications
  • Configure TSDB retention appropriate for your storage budget

Security Hardening

  • Set TOKENHUB_ADMIN_TOKEN: Stable bearer token for /admin/v1/* endpoints (auto-generated if not set; retrieve the generated token with tokenhubctl admin-token)
  • Set TOKENHUB_CORS_ORIGINS: Restrict CORS to your domain(s) (e.g., https://app.example.com)
  • Rate limiting: Review TOKENHUB_RATE_LIMIT_RPS (default: 60/s) and TOKENHUB_RATE_LIMIT_BURST (default: 120) for your traffic patterns

Network Security

  • TLS termination: Place TokenHub behind a reverse proxy (nginx, Caddy, Traefik) with TLS
  • Firewall rules: Only allow inbound traffic on the configured listen port

Example nginx Configuration

server {
    listen 443 ssl;
    server_name tokenhub.example.com;

    ssl_certificate     /etc/letsencrypt/live/tokenhub.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/tokenhub.example.com/privkey.pem;

    # Consumer API - publicly accessible with API key auth
    location /v1/ {
        proxy_pass http://tokenhub:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Request-ID $request_id;

        # SSE streaming support
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
    }

    # Health check
    location /healthz {
        proxy_pass http://tokenhub:8080;
    }

    # Metrics (restrict to monitoring network)
    location /metrics {
        allow 10.0.0.0/8;
        deny all;
        proxy_pass http://tokenhub:8080;
    }

    # Admin endpoints (restrict to admin VPN)
    location /admin {
        allow 10.100.0.0/16;
        deny all;
        proxy_pass http://tokenhub:8080;
    }
}

Database

  • Mount a persistent volume for the SQLite database
  • Set WAL mode: Include _pragma=journal_mode(WAL) in the DSN
  • Set busy timeout: Include _pragma=busy_timeout(5000) in the DSN
  • Schedule backups: Periodically copy the SQLite file (safe with WAL mode)

Backup Script

#!/bin/bash
# Safe SQLite backup using the .backup command
sqlite3 /data/tokenhub.sqlite ".backup /backups/tokenhub-$(date +%Y%m%d-%H%M%S).sqlite"

Monitoring

  • Prometheus scraping: Configure Prometheus to scrape /metrics
  • Set up alerts based on the recommended alerting rules
  • Log aggregation: Forward structured JSON logs to your log management system
  • Monitor TSDB size: Set appropriate retention to prevent unbounded growth

Key Metrics to Watch

Metric          Alert Threshold         Severity
Error rate      > 5% over 5 min         Warning
P95 latency     > 10s                   Warning
Provider down   > 2 min                 Critical
Cost spike      > 2x weekly average     Warning
Vault locked    During business hours   Critical
Disk usage      > 80%                   Warning
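For example, the error-rate row could be expressed as a Prometheus alerting rule. This is a sketch; the rule name, threshold, and labels are yours to adapt:

```yaml
groups:
  - name: tokenhub
    rules:
      - alert: TokenHubHighErrorRate
        expr: |
          sum(rate(tokenhub_requests_total{status="error"}[5m]))
            / sum(rate(tokenhub_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TokenHub error rate above 5% for 5 minutes"
```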

Graceful Shutdown

TokenHub handles SIGINT and SIGTERM for graceful shutdown:

  1. Stop accepting new connections
  2. Drain in-flight requests (30-second timeout)
  3. Stop background goroutines (prober, Thompson Sampling refresh, TSDB prune)
  4. Stop Temporal worker (if enabled)
  5. Close database connection

In Kubernetes, set terminationGracePeriodSeconds: 35 to allow the full drain.

Scaling Considerations

TokenHub is a single-process application with SQLite. For higher throughput:

  • Horizontal: Run multiple instances with separate SQLite databases (no shared state; each instance routes independently)
  • Temporal: Enable Temporal for durable workflow execution across restarts
  • Read replicas: Not applicable (SQLite is embedded)
  • Connection pooling: SQLite WAL mode supports concurrent reads natively

For very high throughput (>1000 req/s), consider migrating the store to PostgreSQL (implement the Store interface for a new backend).

CLI Administration

Use tokenhubctl for scriptable administration and health checks:

# Quick status check
tokenhubctl status

# Verify providers and models
tokenhubctl provider list
tokenhubctl model list

# Watch for issues in real time
tokenhubctl events

See tokenhubctl CLI for the full command reference.

Environment Variables Summary

See Configuration Reference for the complete list of all environment variables and their defaults.

API Reference

Complete reference for all TokenHub HTTP endpoints.

Consumer Endpoints

POST /v1/chat

Send a chat completion request with automatic model routing.

Authentication: Required (Bearer token)

Request Body:

{
  "request": {
    "messages": [{"role": "string", "content": "string"}],
    "model_hint": "string",
    "estimated_input_tokens": 0,
    "parameters": {},
    "stream": false,
    "meta": {},
    "output_schema": {}
  },
  "capabilities": {"planning": false},
  "policy": {
    "mode": "normal",
    "max_budget_usd": 0.05,
    "max_latency_ms": 20000,
    "min_weight": 0
  },
  "output_format": {
    "type": "json",
    "schema": "string",
    "max_tokens": 0,
    "strip_think": false
  }
}

Response: 200 OK

{
  "negotiated_model": "string",
  "estimated_cost_usd": 0.0,
  "routing_reason": "string",
  "response": {}
}

Errors: 400, 401, 403, 502


POST /v1/plan

Send an orchestrated multi-model request.

Authentication: Required (Bearer token)

Request Body:

{
  "request": {
    "messages": [{"role": "string", "content": "string"}]
  },
  "orchestration": {
    "mode": "adversarial",
    "iterations": 2,
    "primary_model_id": "string",
    "review_model_id": "string",
    "primary_min_weight": 0,
    "review_min_weight": 0,
    "return_plan_only": false,
    "output_schema": "string"
  }
}

Response: 200 OK

{
  "negotiated_model": "string",
  "estimated_cost_usd": 0.0,
  "routing_reason": "string",
  "response": {}
}

Errors: 400, 401, 403, 502


Health

GET /healthz

System health check.

Response: 200 OK or 503 Service Unavailable

{
  "status": "ok",
  "adapters": 2,
  "models": 6
}

GET /metrics

Prometheus metrics endpoint.

Response: 200 OK (text/plain, Prometheus exposition format)


Admin - Vault

POST /admin/v1/vault/unlock

Body: {"admin_password": "string"}

Response: 200 OK → {"ok": true}


POST /admin/v1/vault/lock

Response: 200 OK → {"ok": true, "already_locked": false}


POST /admin/v1/vault/rotate

Body: {"old_password": "string", "new_password": "string"}

Response: 200 OK → {"ok": true}


Admin - Providers

POST /admin/v1/providers

Create or update a provider.

Body: {"id": "string", "type": "openai|anthropic|vllm", "enabled": true, "base_url": "string", "cred_store": "vault|none", "api_key": "string"}

Response: 200 OK → {"ok": true, "cred_store": "vault"}


GET /admin/v1/providers

List all providers (from the persistent store).

Query: ?limit=N&offset=N

Response: 200 OK → {"items": [{provider objects}], "total": N, "limit": N, "offset": N}


PATCH /admin/v1/providers/{id}

Partial update of a provider. Runtime-only providers (not in the store) are automatically created in the store when first patched.

Body: {"type": "string", "base_url": "string", "enabled": true, "api_key": "string", "cred_store": "string"}

Response: 200 OK → {"ok": true, "provider": {updated provider}}


DELETE /admin/v1/providers/{id}

Delete a provider.

Response: 200 OK → {"ok": true}


GET /admin/v1/providers/{id}/discover

Discover models available from a provider by querying its /v1/models endpoint.

Response: 200 OK → {"models": [{"id": "string", "registered": false}]}


Admin - Models

POST /admin/v1/models

Create or update a model. Registers the model in both the runtime engine and persistent store.

Body: {"id": "string", "provider_id": "string", "weight": 5, "max_context_tokens": 128000, "input_per_1k": 0.01, "output_per_1k": 0.03, "enabled": true}

Response: 200 OK → {"ok": true}


GET /admin/v1/models

List all models (from the persistent store).

Query: ?limit=N&offset=N

Response: 200 OK → {"items": [{model objects}], "total": N, "limit": N, "offset": N}


PATCH /admin/v1/models/{id}

Partial model update. Model IDs can contain slashes (e.g., Qwen/Qwen2.5-Coder-32B-Instruct). Runtime-only models are automatically seeded into the store from engine data on first patch.

Body: {"weight": 7, "enabled": true, "input_per_1k": 0.015, "output_per_1k": 0.035, "max_context_tokens": 128000}

Response: 200 OK → {"ok": true, "model": {updated model}}


DELETE /admin/v1/models/{id}

Delete a model. Model IDs with slashes are supported.

Response: 200 OK → {"ok": true}


Admin - Routing

GET /admin/v1/routing-config

Get current routing defaults.

Response: 200 OK → {"default_mode": "string", "default_max_budget_usd": 0.05, "default_max_latency_ms": 20000}


PUT /admin/v1/routing-config

Set routing defaults.

Body: {"default_mode": "string", "default_max_budget_usd": 0.1, "default_max_latency_ms": 30000}

Response: 200 OK → {"ok": true}


POST /admin/v1/routing/simulate

Run a what-if routing simulation without sending a real request.

Body: {"mode": "string", "token_count": 500, "max_budget_usd": 0.05, "min_weight": 0, "model_hint": "string"}

Response: 200 OK → {"decision": {decision object}, "eligible": [{model objects}]}


Admin - API Keys

POST /admin/v1/apikeys

Create a new API key.

Body: {"name": "string", "scopes": "[\"chat\",\"plan\"]", "rotation_days": 0, "expires_in": "720h", "monthly_budget_usd": 50.0}

Response: 200 OK → {"ok": true, "key": "tokenhub_...", "id": "string", "prefix": "string", "warning": "string"}


GET /admin/v1/apikeys

List all API keys (no plaintext).

Response: 200 OK → [{key objects without plaintext}]


POST /admin/v1/apikeys/{id}/rotate

Rotate an API key.

Response: 200 OK → {"ok": true, "key": "tokenhub_...", "warning": "string"}


PATCH /admin/v1/apikeys/{id}

Update API key metadata.

Body: {"name": "string", "scopes": "string", "rotation_days": 0, "enabled": true}

Response: 200 OK → {"ok": true}


DELETE /admin/v1/apikeys/{id}

Revoke (delete) an API key.

Response: 200 OK → {"ok": true}


Admin - Observability

GET /admin/v1/health

Provider health status.

Response: 200 OK → {"providers": [{health state objects}]}


GET /admin/v1/stats

Aggregated request statistics.

Response: 200 OK → {"global": {}, "by_model": {}, "by_provider": {}}


GET /admin/v1/logs?limit=100&offset=0

Paginated request logs.


GET /admin/v1/audit?limit=100&offset=0

Paginated audit logs.


GET /admin/v1/rewards?limit=100&offset=0

Paginated reward entries.


GET /admin/v1/engine/models

Runtime model registry, adapter list, and adapter metadata.

Response: 200 OK

{
  "models": [{model objects}],
  "total": 7,
  "adapters": ["openai", "anthropic", "vllm"],
  "adapter_info": [
    {"id": "openai", "health_endpoint": "https://api.openai.com/v1/models"},
    {"id": "vllm", "health_endpoint": "http://vllm-1:8000/health"}
  ]
}

Admin - TSDB

GET /admin/v1/tsdb/query?metric=latency&model_id=gpt-4&start=...&end=...&step_ms=60000

Query time-series data.


GET /admin/v1/tsdb/metrics

List available TSDB metrics.


POST /admin/v1/tsdb/prune

Manually prune old TSDB data.


PUT /admin/v1/tsdb/retention

Set TSDB retention period.

Body: {"retention_days": 7}


Admin - Workflows (Temporal)

GET /admin/v1/workflows?limit=50&status=RUNNING

List Temporal workflow executions.


GET /admin/v1/workflows/{id}

Describe a workflow execution.


GET /admin/v1/workflows/{id}/history

Get workflow event history.


Admin - Events

GET /admin/v1/events

Server-Sent Events stream.

Content-Type: text/event-stream

Events: route_success, route_error


Admin UI

GET /admin

Serves the embedded admin SPA. The root URL (/) redirects here.

GET /admin/v1/info

Admin status information. Requires admin token authentication (Bearer header or ?token= query parameter).

Response: 200 OK

{
  "tokenhub": "admin",
  "vault_locked": true,
  "vault_initialized": false
}

The vault_initialized field indicates whether the vault has ever been set up (salt exists). The UI uses this to distinguish first-time setup from a normal unlock prompt.

Prometheus Metrics

TokenHub exports Prometheus metrics at the /metrics endpoint.

Available Metrics

tokenhub_requests_total

Type: Counter

Total number of requests processed.

Labels:

Label      Values                                                            Description
mode       cheap, normal, high_confidence, planning, adversarial, thompson   Routing mode used
model      gpt-4, claude-opus, etc.                                          Model that handled the request
provider   openai, anthropic, vllm                                           Provider adapter
status     ok, error                                                         Request outcome

Examples:

# Total successful requests
tokenhub_requests_total{status="ok"}

# Request rate by provider
sum by (provider) (rate(tokenhub_requests_total[5m]))

# Error rate
sum(rate(tokenhub_requests_total{status="error"}[5m]))
  /
sum(rate(tokenhub_requests_total[5m]))

tokenhub_request_latency_ms

Type: Histogram

Request latency distribution in milliseconds.

Labels:

Label      Values                Description
mode       cheap, normal, etc.   Routing mode
model      gpt-4, etc.           Model ID
provider   openai, etc.          Provider ID

Buckets: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120 ms (exponential, base 2)
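That bucket layout is an exponential series (start 10 ms, factor 2, 10 buckets), the same series client_golang's prometheus.ExponentialBuckets(10, 2, 10) produces. Reproduced here with the standard library only:

```go
// exponentialBuckets generates count histogram bucket bounds starting at
// start, each a fixed factor larger than the last.
package main

import "fmt"

func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	fmt.Println(exponentialBuckets(10, 2, 10))
	// [10 20 40 80 160 320 640 1280 2560 5120]
}
```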

Examples:

# Median latency
histogram_quantile(0.5, rate(tokenhub_request_latency_ms_bucket[5m]))

# P95 latency
histogram_quantile(0.95, rate(tokenhub_request_latency_ms_bucket[5m]))

# P99 latency by model
histogram_quantile(0.99, sum(rate(tokenhub_request_latency_ms_bucket[5m])) by (model, le))

# Average latency
rate(tokenhub_request_latency_ms_sum[5m]) / rate(tokenhub_request_latency_ms_count[5m])

tokenhub_cost_usd_total

Type: Counter

Cumulative estimated cost in USD.

Labels:

Label      Values         Description
model      gpt-4, etc.    Model ID
provider   openai, etc.   Provider ID

Examples:

# Total cost in the last hour
increase(tokenhub_cost_usd_total[1h])

# Cost rate (USD per second)
rate(tokenhub_cost_usd_total[5m])

# Cost per hour by model
rate(tokenhub_cost_usd_total[1h]) * 3600

# Most expensive model
topk(3, sum(rate(tokenhub_cost_usd_total[1h])) by (model))

Grafana Dashboard

Suggested Panels

Panel               Query                                                Visualization
Request Rate        sum(rate(tokenhub_requests_total[5m]))               Time series
Error Rate          Error rate formula above                             Gauge (0-100%)
P95 Latency         P95 formula above                                    Time series
Cost per Hour       Cost rate * 3600                                     Stat
Requests by Model   sum by (model) (rate(tokenhub_requests_total[5m]))   Pie chart
Latency Heatmap     tokenhub_request_latency_ms_bucket                   Heatmap

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: tokenhub
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ['tokenhub:8080']

For Docker Compose, use the service name as the target.
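A minimal Compose sketch (image names and port mappings here are assumptions): because both services join the default Compose network, Prometheus resolves the target tokenhub:8080 by service name.

```yaml
# docker-compose.yml (sketch)
services:
  tokenhub:
    image: tokenhub:latest          # assumed image name
    ports:
      - "8080:8080"
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"
```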

Error Classification

TokenHub classifies provider errors to enable intelligent failover. Each error from a provider is classified into one of four categories that determine the routing engine's next action.

Error Classes

context_overflow

The request exceeds the model's context window.

Triggers:

  • HTTP 413 from provider
  • Response body contains context_length_exceeded

Router action: Escalate to a model with a larger context window. If no larger model is available, try the next model in scored order.


rate_limited

The provider is throttling requests.

Triggers:

  • HTTP 429 from provider

Router action: Skip to a different provider. If the response includes a Retry-After header, the delay is recorded in the classified error for optional use by the caller.


transient

A temporary server-side failure.

Triggers:

  • HTTP 5xx from provider

Router action: Retry the same model with exponential backoff:

  • Base delay: 100ms
  • Maximum retries: 2
  • Backoff multiplier: 2x (100ms, 200ms)

After retries are exhausted, try the next model.


fatal

An unrecoverable client error.

Triggers:

  • HTTP 4xx (except 429 and 413)
  • Any other unclassified error

Router action: Skip to the next model in scored order. No retry.

Error Flow

Provider returns error
  │
  ├── adapter.ClassifyError(err) → ClassifiedError{Class, RetryAfter}
  │
  └── Router handles based on class:
        ├── context_overflow → Find bigger model
        ├── rate_limited → Different provider (respect RetryAfter)
        ├── transient → Retry with backoff (up to 2x)
        └── fatal → Next model

ClassifiedError Type

type ClassifiedError struct {
    Err        error
    Class      ErrorClass  // "context_overflow", "rate_limited", "transient", "fatal"
    RetryAfter float64     // Seconds to wait (from Retry-After header, 429 only)
}

HTTP Error Responses

Consumer API Errors

Status   Meaning        When
400      Bad Request    Invalid JSON, missing messages, validation failure
401      Unauthorized   Missing or invalid API key
403      Forbidden      Valid key but insufficient scopes
502      Bad Gateway    All models failed, no eligible models, or provider errors

Admin API Errors

Status   Meaning                 When
400      Bad Request             Invalid parameters or validation failure
404      Not Found               Resource not found (model, key, provider)
500      Internal Server Error   Database or vault errors

Provider-Specific Classification

OpenAI

HTTP Status   Body Pattern              Error Class
429           (any)                     rate_limited
500-599       (any)                     transient
400           context_length_exceeded   context_overflow
Other 4xx     (any)                     fatal

Anthropic

HTTP Status   Body Pattern              Error Class
429           (any)                     rate_limited
500-599       (any)                     transient
400           context_length_exceeded   context_overflow
Other 4xx     (any)                     fatal

vLLM

HTTP Status   Body Pattern              Error Class
429           (any)                     rate_limited
500-599       (any)                     transient
400           context_length_exceeded   context_overflow
Other 4xx     (any)                     fatal
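Since all three adapters apply the same rules, the classification logic can be sketched once. The function name classify is illustrative; the real adapters expose ClassifyError and also capture Retry-After. The 413 case comes from the context_overflow triggers listed earlier:

```go
package main

import (
	"fmt"
	"strings"
)

// ErrorClass mirrors the four routing categories.
type ErrorClass string

const (
	ContextOverflow ErrorClass = "context_overflow"
	RateLimited     ErrorClass = "rate_limited"
	Transient       ErrorClass = "transient"
	Fatal           ErrorClass = "fatal"
)

// classify maps an HTTP status and response body to an error class,
// following the shared provider rules documented above.
func classify(status int, body string) ErrorClass {
	switch {
	case status == 429:
		return RateLimited
	case status >= 500 && status <= 599:
		return Transient
	case status == 413 || strings.Contains(body, "context_length_exceeded"):
		return ContextOverflow
	default:
		return Fatal
	}
}

func main() {
	fmt.Println(classify(429, ""))                                    // → rate_limited
	fmt.Println(classify(503, ""))                                    // → transient
	fmt.Println(classify(400, `{"code":"context_length_exceeded"}`))  // → context_overflow
	fmt.Println(classify(401, ""))                                    // → fatal
}
```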

Reward Impact

Error classification affects the contextual bandit reward system:

  • Successful requests: Reward computed from latency and cost
  • Failed requests: Reward = 0.0 (regardless of error class)
  • Error class is stored in reward entries for analysis

This ensures the Thompson Sampling policy learns to avoid unreliable models over time.
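The rules above can be sketched as follows. The failure branch (reward 0.0) matches the documentation; the latency/cost shaping for successes is purely an assumed illustration, not TokenHub's exact formula:

```go
package main

import "fmt"

// reward sketches the bandit reward rule. Failures always score 0.0;
// for successes, lower latency and lower cost both push the reward
// toward 1.0 (this shaping is an assumption for illustration).
func reward(failed bool, latencyMs, costUSD float64) float64 {
	if failed {
		return 0.0 // regardless of error class
	}
	latencyScore := 1.0 / (1.0 + latencyMs/1000.0)
	costScore := 1.0 / (1.0 + costUSD*100.0)
	return 0.5*latencyScore + 0.5*costScore
}

func main() {
	fmt.Println(reward(true, 50, 0.0001))  // → 0 (errors never earn reward)
	fmt.Println(reward(false, 200, 0.001)) // fast, cheap success scores high
}
```

Because failed requests always post a zero reward, models that error frequently accumulate poor posteriors and are sampled less often by the Thompson policy.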