The chat endpoint provides single-turn or multi-turn completions with automatic model routing.
Endpoint: POST /v1/chat
{
  "request": {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "model_hint": "gpt-4",
    "estimated_input_tokens": 500,
    "parameters": {
      "temperature": 0.7,
      "max_tokens": 1024,
      "top_p": 0.9
    },
    "stream": false,
    "meta": {
      "user_id": "u123",
      "session": "abc"
    }
  },
  "capabilities": {
    "planning": true
  },
  "policy": {
    "mode": "normal",
    "max_budget_usd": 0.05,
    "max_latency_ms": 15000,
    "min_weight": 5
  },
  "output_format": {
    "type": "json",
    "schema": "{\"type\":\"object\",\"properties\":{\"answer\":{\"type\":\"string\"}}}",
    "max_tokens": 500,
    "strip_think": true
  }
}
Fields of the request object:
| Field | Type | Required | Description |
|---|---|---|---|
| messages | array | Yes | Array of {role, content} message objects |
| model_hint | string | No | Preferred model ID; tried first, before scoring. Use * to let TokenHub assign a wildcard model. |
| estimated_input_tokens | int | No | Token count hint for routing decisions |
| parameters | object | No | Provider parameters forwarded as-is (temperature, max_tokens, top_p, etc.) |
| stream | bool | No | Enable SSE streaming response |
| meta | object | No | Arbitrary metadata for logging and tracing |
| output_schema | JSON | No | JSON Schema for structured output validation |
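As a minimal sketch of how a client might assemble this envelope, the helper below builds the top-level body from the documented fields. The function name build_chat_request is hypothetical, and omitting unset optional fields (rather than sending nulls) is an assumption about what the server prefers.

```python
from typing import Any, Optional


def build_chat_request(
    messages: list[dict[str, str]],
    model_hint: Optional[str] = None,
    estimated_input_tokens: Optional[int] = None,
    parameters: Optional[dict[str, Any]] = None,
    stream: bool = False,
    meta: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble the top-level body with its "request" envelope (hypothetical helper)."""
    if not messages:
        # Mirrors the documented 400 "messages required" error on the client side.
        raise ValueError("messages required")
    request: dict[str, Any] = {"messages": messages}
    optional = {
        "model_hint": model_hint,
        "estimated_input_tokens": estimated_input_tokens,
        "parameters": parameters,
        "meta": meta,
    }
    # Assumption: unset optional fields are left out so server defaults apply.
    request.update({k: v for k, v in optional.items() if v is not None})
    if stream:
        request["stream"] = True
    return {"request": request}


body = build_chat_request(
    [{"role": "user", "content": "Hello!"}],
    model_hint="*",
    parameters={"temperature": 0.7},
)
```

The resulting dictionary can be serialized with json.dumps and sent as the POST body.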
The policy object controls model selection behavior. All fields are optional and fall back to server defaults.
| Field | Type | Default | Range | Description |
|---|---|---|---|---|
| mode | string | normal | See below | Routing mode |
| max_budget_usd | float | 0.05 | 0-100 | Maximum cost per request |
| max_latency_ms | int | 20000 | 0-300000 | Maximum acceptable latency |
| min_weight | int | 0 | 0-10 | Minimum model capability weight |
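The ranges in the table can be checked on the client before a request is sent, which avoids a round trip that would end in a 400. This is a sketch only: validate_policy is a hypothetical helper, the ranges are assumed to be inclusive, and only the max_budget_usd message is known to match the server's wording.

```python
# Ranges copied from the policy table; inclusivity is an assumption.
POLICY_RANGES = {
    "max_budget_usd": (0, 100),
    "max_latency_ms": (0, 300000),
    "min_weight": (0, 10),
}
ROUTING_MODES = {"cheap", "normal", "high_confidence", "planning", "thompson"}


def validate_policy(policy: dict) -> list[str]:
    """Return a list of problems; an empty list means the policy looks valid."""
    errors = []
    mode = policy.get("mode")
    if mode is not None and mode not in ROUTING_MODES:
        errors.append(f"unknown mode: {mode}")
    for field, (low, high) in POLICY_RANGES.items():
        value = policy.get(field)
        if value is None:
            continue  # unset fields fall back to server defaults
        if not low <= value <= high:
            errors.append(f"{field} must be between {low} and {high}")
    return errors


problems = validate_policy({"mode": "cheap", "max_budget_usd": 250})
```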
Routing modes:
| Mode | Cost Weight | Latency Weight | Failure Weight | Capability Weight |
|---|---|---|---|---|
| cheap | 0.7 | 0.1 | 0.1 | 0.1 |
| normal | 0.25 | 0.25 | 0.25 | 0.25 |
| high_confidence | 0.05 | 0.1 | 0.15 | 0.7 |
| planning | 0.1 | 0.1 | 0.2 | 0.6 |
| thompson | N/A | N/A | N/A | N/A |
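To illustrate how such weight profiles can change the outcome, the sketch below combines four per-model signals as a weighted sum. The exact scoring formula is not given here, so the weighted sum, the signal normalization (each in [0, 1], higher is better), and the model names are all assumptions made for illustration.

```python
# Weight profiles copied from the routing modes table.
MODE_WEIGHTS = {
    "cheap": {"cost": 0.7, "latency": 0.1, "failure": 0.1, "capability": 0.1},
    "normal": {"cost": 0.25, "latency": 0.25, "failure": 0.25, "capability": 0.25},
    "high_confidence": {"cost": 0.05, "latency": 0.1, "failure": 0.15, "capability": 0.7},
    "planning": {"cost": 0.1, "latency": 0.1, "failure": 0.2, "capability": 0.6},
}


def score(mode: str, signals: dict[str, float]) -> float:
    """Weighted sum of normalized signals (assumed formula, not the server's)."""
    weights = MODE_WEIGHTS[mode]
    return sum(weights[name] * signals[name] for name in weights)


def pick(mode: str, candidates: dict[str, dict[str, float]]) -> str:
    """Return the candidate with the highest score under the given mode."""
    return max(candidates, key=lambda name: score(mode, candidates[name]))


# Hypothetical models: one inexpensive and fast, one more capable.
candidates = {
    "small-model": {"cost": 0.9, "latency": 0.8, "failure": 0.9, "capability": 0.3},
    "large-model": {"cost": 0.2, "latency": 0.5, "failure": 0.9, "capability": 0.9},
}

cheap_choice = pick("cheap", candidates)
confident_choice = pick("high_confidence", candidates)
```

Under these made-up signals, cheap favors the inexpensive model and high_confidence favors the more capable one.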
The thompson mode uses reinforcement learning (Thompson Sampling with Beta distributions) to adaptively select models based on historical reward data.
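For readers unfamiliar with the technique, here is a generic sketch of Thompson Sampling with Beta distributions. It is not TokenHub's implementation: the class name, the uniform Beta(1, 1) prior, the reward scale in [0, 1], and the simulated success rates are assumptions.

```python
import random


class ThompsonRouter:
    """Keeps a Beta(alpha, beta) posterior per model and samples from each to choose."""

    def __init__(self, models, seed=None):
        self._rng = random.Random(seed)
        # Beta(1, 1) is a uniform prior: no preference before any feedback.
        self._params = {model: [1.0, 1.0] for model in models}

    def select(self) -> str:
        # Draw one sample per model and pick the largest draw.
        samples = {
            model: self._rng.betavariate(alpha, beta)
            for model, (alpha, beta) in self._params.items()
        }
        return max(samples, key=samples.get)

    def update(self, model: str, reward: float) -> None:
        if not 0.0 <= reward <= 1.0:
            raise ValueError("reward must be in [0, 1]")
        # Reward raises alpha; the shortfall (1 - reward) raises beta.
        self._params[model][0] += reward
        self._params[model][1] += 1.0 - reward


# Simulated feedback: "model-a" succeeds 90% of the time, "model-b" 30%.
true_success = {"model-a": 0.9, "model-b": 0.3}
router = ThompsonRouter(true_success, seed=7)
outcomes = random.Random(11)
counts = {name: 0 for name in true_success}
for _ in range(500):
    chosen = router.select()
    counts[chosen] += 1
    reward = 1.0 if outcomes.random() < true_success[chosen] else 0.0
    router.update(chosen, reward)
```

As rewards accumulate, the posterior for the better model concentrates at a higher value, so it is selected more often while the weaker model still receives occasional exploratory traffic.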
| Field | Type | Description |
|---|---|---|
| planning | bool | Indicates request needs planning capability |
Capabilities influence which routing mode profile is used when no explicit mode is set.
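One plausible reading of this rule is sketched below: an explicit policy.mode always wins, and otherwise the planning capability selects the planning profile. The precedence and the fallback to normal (the documented default for mode) are assumptions, and effective_mode is a hypothetical name.

```python
from typing import Optional


def effective_mode(policy: Optional[dict], capabilities: Optional[dict]) -> str:
    """Guess which routing profile applies (assumed precedence, for illustration)."""
    explicit = (policy or {}).get("mode")
    if explicit:
        return explicit  # an explicit mode is never overridden
    if (capabilities or {}).get("planning"):
        return "planning"  # assumption: planning capability maps to the planning profile
    return "normal"  # documented default for policy.mode


implied = effective_mode(None, {"planning": True})
explicit = effective_mode({"mode": "cheap"}, {"planning": True})
```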
Fields of the output_format object:
| Field | Type | Description |
|---|---|---|
| type | string | Output format: json, markdown, text, xml |
| schema | string | JSON Schema string for validating structured output |
| max_tokens | int | Maximum output tokens to request from provider |
| strip_think | bool | Remove <think>...</think> blocks from response |
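The effect of strip_think can be approximated on the client with a regular expression. This sketch only illustrates the idea; how the server treats whitespace, nesting, or unclosed tags is not specified, so those details are assumptions.

```python
import re

# Non-greedy match so that each <think>...</think> block is removed separately.
# re.DOTALL lets "." span newlines inside a block.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and any whitespace that trails them."""
    return _THINK_BLOCK.sub("", text).strip()


cleaned = strip_think("<think>2 + 2 is 4\nchecking again</think>\nThe answer is 4.")
```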
{
  "negotiated_model": "gpt-4",
  "estimated_cost_usd": 0.0023,
  "routing_reason": "routed-weight-8",
  "response": {
    "id": "chatcmpl-...",
    "choices": [{
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses..."
      }
    }],
    "usage": {
      "prompt_tokens": 45,
      "completion_tokens": 128,
      "total_tokens": 173
    }
  }
}
Response fields:
| Field | Description |
|---|---|
| negotiated_model | The model ID that was selected |
| estimated_cost_usd | Estimated cost based on model pricing and token counts |
| routing_reason | Why this model was chosen (see Routing Reasons) |
| response | Raw JSON response from the selected provider |
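A client will typically want the routing metadata and the assistant text together. The sketch below parses a body shaped like the example response; because response is the provider's raw JSON, the choices/message/usage layout is an assumption that holds for the sample shown but may differ between providers. ChatResult and parse_chat_response are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class ChatResult:
    model: str
    cost_usd: float
    routing_reason: str
    content: str
    total_tokens: int


def parse_chat_response(raw: str) -> ChatResult:
    """Extract routing metadata and the first choice's text from a response body."""
    data = json.loads(raw)
    provider = data["response"]  # raw provider payload; shape assumed as in the sample
    usage = provider.get("usage", {})
    return ChatResult(
        model=data["negotiated_model"],
        cost_usd=data["estimated_cost_usd"],
        routing_reason=data["routing_reason"],
        content=provider["choices"][0]["message"]["content"],
        total_tokens=usage.get("total_tokens", 0),
    )


sample = json.dumps({
    "negotiated_model": "gpt-4",
    "estimated_cost_usd": 0.0023,
    "routing_reason": "routed-weight-8",
    "response": {
        "choices": [{"message": {"role": "assistant", "content": "Quantum computing uses..."}}],
        "usage": {"prompt_tokens": 45, "completion_tokens": 128, "total_tokens": 173},
    },
})
result = parse_chat_response(sample)
```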
Routing reasons:
| Reason | Description |
|---|---|
| routed-weight-N | Selected by scoring; N is the model's weight |
| model-hint | Client's model hint was used |
| escalated-context-overflow | Escalated to a model with a larger context window |
| retried-transient | Retried after a transient provider error |
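Since routed-weight-N embeds a number, it helps to split it from the fixed reasons when logging or aggregating. A small sketch, with parse_routing_reason as a hypothetical helper that passes unknown reasons through unchanged:

```python
import re
from typing import Optional

_WEIGHT_REASON = re.compile(r"^routed-weight-(\d+)$")


def parse_routing_reason(reason: str) -> tuple[str, Optional[int]]:
    """Return (kind, weight); weight is None unless the reason is routed-weight-N."""
    match = _WEIGHT_REASON.match(reason)
    if match:
        return "routed-weight", int(match.group(1))
    return reason, None


kind, weight = parse_routing_reason("routed-weight-8")
```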
Errors:
| Status | Body | Cause |
|---|---|---|
| 400 | "bad json" | Malformed request body |
| 400 | "messages required" | Empty messages array |
| 400 | "max_budget_usd must be between 0 and 100" | Policy validation failure |
| 401 | "missing or invalid api key" | Missing or invalid Authorization header |
| 403 | "scope not allowed" | API key lacks chat scope |
| 502 | Error message | All models failed or no eligible models |
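A client can map these status codes to a next step. The mapping below is a suggestion rather than documented behavior; in particular, treating 502 as a cue to retry or relax the policy is an assumption, and the action labels are hypothetical.

```python
def suggest_action(status: int) -> str:
    """Map an HTTP status from the chat endpoint to a suggested client action."""
    if 200 <= status < 300:
        return "ok"
    if status == 400:
        return "fix_request"  # bad JSON, empty messages, or an out-of-range policy value
    if status in (401, 403):
        return "fix_credentials"  # missing or invalid key, or key without chat scope
    if status == 502:
        # Assumption: a later retry or a looser budget/latency/min_weight may succeed.
        return "retry_or_relax_policy"
    return "unexpected"


action = suggest_action(502)
```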
curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Hello!"}]
    }
  }'
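The same call can be prepared from Python with only the standard library. This sketch builds the request object without sending it; make_chat_http_request is a hypothetical helper, and the base URL and API key placeholder are taken from the curl example above.

```python
import json
import urllib.request


def make_chat_http_request(base_url: str, api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat with a JSON body and bearer token."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = make_chat_http_request(
    "http://localhost:8080",
    "tokenhub_...",  # placeholder; substitute a real API key
    {"request": {"messages": [{"role": "user", "content": "Hello!"}]}},
)

# To send it against a running server:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     result = json.load(resp)
```

Non-2xx responses raise urllib.error.HTTPError from urlopen, so callers usually wrap the send in a try/except and read the status from the exception's code attribute.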
curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Summarize this text..."}]
    },
    "policy": {
      "mode": "cheap",
      "max_budget_usd": 0.001
    }
  }'
curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Write a poem about the ocean."}],
      "model_hint": "claude-opus",
      "parameters": {
        "temperature": 0.9,
        "max_tokens": 2048
      }
    }
  }'
When model_hint is *, TokenHub assigns a model server-side. If no explicit
* alias is configured, wildcard requests use an ordered fail-down ladder.
Administrators can replace that ladder with PUT /admin/v1/wildcard-models or
seed it at startup with TOKENHUB_WILDCARD_MODELS_FILE.
curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Review this plan."}],
      "model_hint": "*"
    }
  }'
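The ordered fail-down ladder used for wildcard requests runs on the server, but the idea can be sketched in a few lines: try each model in order and stop at the first success. Everything here is illustrative; the model names, the call_model callback, and the use of RuntimeError to signal a failed attempt are assumptions, not TokenHub internals.

```python
from typing import Callable


def call_with_fail_down(ladder: list[str], call_model: Callable[[str], str]) -> tuple[str, str]:
    """Try models in ladder order; return (model, output) for the first that succeeds."""
    failures = []
    for model in ladder:
        try:
            return model, call_model(model)
        except RuntimeError as exc:
            failures.append(f"{model}: {exc}")  # remember why this rung failed
    raise RuntimeError("all wildcard models failed: " + "; ".join(failures))


def fake_call(model: str) -> str:
    # Hypothetical backend in which the first rung is unavailable.
    if model == "primary-model":
        raise RuntimeError("unavailable")
    return f"answer from {model}"


used, output = call_with_fail_down(["primary-model", "fallback-model"], fake_call)
```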
curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "List 3 programming languages with their year of creation"}]
    },
    "output_format": {
      "type": "json",
      "schema": "{\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"},\"year\":{\"type\":\"integer\"}}}}"
    }
  }'
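Because output_format.schema is a string containing JSON rather than a nested object, escaping it by hand is error-prone. The sketch below builds the string with json.dumps and adds a small check of the returned content; check_languages is a hypothetical helper that covers only this one schema and is not a general JSON Schema validator.

```python
import json

schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "year": {"type": "integer"}},
    },
}

# Compact separators reproduce the escaped string used in the curl example.
output_format = {"type": "json", "schema": json.dumps(schema, separators=(",", ":"))}


def check_languages(content: str) -> list[dict]:
    """Parse model output and verify it is a list of objects with typed name/year."""
    items = json.loads(content)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array")
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("expected each item to be an object")
        if "name" in item and not isinstance(item["name"], str):
            raise ValueError("name must be a string")
        year = item.get("year")
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if "year" in item and (isinstance(year, bool) or not isinstance(year, int)):
            raise ValueError("year must be an integer")
    return items


languages = check_languages('[{"name": "C", "year": 1972}, {"name": "Python", "year": 1991}]')
```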