Chat API

The chat endpoint provides single-turn or multi-turn completions with automatic model routing.

Endpoint: POST /v1/chat

Request Format

{
  "request": {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "model_hint": "gpt-4",
    "estimated_input_tokens": 500,
    "parameters": {
      "temperature": 0.7,
      "max_tokens": 1024,
      "top_p": 0.9
    },
    "stream": false,
    "meta": {
      "user_id": "u123",
      "session": "abc"
    }
  },
  "capabilities": {
    "planning": true
  },
  "policy": {
    "mode": "normal",
    "max_budget_usd": 0.05,
    "max_latency_ms": 15000,
    "min_weight": 5
  },
  "output_format": {
    "type": "json",
    "schema": "{\"type\":\"object\",\"properties\":{\"answer\":{\"type\":\"string\"}}}",
    "max_tokens": 500,
    "strip_think": true
  }
}

Request Fields

request (required)

| Field | Type | Required | Description |
|---|---|---|---|
| messages | array | Yes | Array of {role, content} message objects |
| model_hint | string | No | Preferred model ID; tried first, before scoring |
| estimated_input_tokens | int | No | Token count hint for routing decisions |
| parameters | object | No | Provider parameters forwarded as-is (temperature, max_tokens, top_p, etc.) |
| stream | bool | No | Enable SSE streaming response |
| meta | object | No | Arbitrary metadata for logging and tracing |
| output_schema | JSON | No | JSON Schema for structured output validation |

policy (optional)

Controls model selection behavior. All fields are optional and fall back to server defaults.

| Field | Type | Default | Range | Description |
|---|---|---|---|---|
| mode | string | normal | See below | Routing mode |
| max_budget_usd | float | 0.05 | 0-100 | Maximum cost per request |
| max_latency_ms | int | 20000 | 0-300000 | Maximum acceptable latency |
| min_weight | int | 0 | 0-10 | Minimum model capability weight |
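The ranges above can be checked on the client before a request is sent. The following is a minimal sketch, not the server's implementation: `validate_policy`, `POLICY_RANGES`, and `ROUTING_MODES` are hypothetical names, the bounds are assumed to be inclusive, and only the `max_budget_usd` message wording is taken from the Error Responses table; the other messages are assumed to follow the same pattern.

```python
# Hypothetical client-side check mirroring the documented policy ranges.
# Assumption: range bounds are inclusive.
POLICY_RANGES = {
    "max_budget_usd": (0, 100),
    "max_latency_ms": (0, 300000),
    "min_weight": (0, 10),
}
ROUTING_MODES = {"cheap", "normal", "high_confidence", "planning", "thompson"}


def validate_policy(policy):
    """Return a list of error strings; an empty list means the policy looks valid."""
    errors = []
    mode = policy.get("mode", "normal")  # documented default
    if mode not in ROUTING_MODES:
        errors.append(f"unknown mode: {mode}")
    for field, (low, high) in POLICY_RANGES.items():
        # Omitted fields fall back to server defaults, so only check what is present.
        if field in policy and not (low <= policy[field] <= high):
            errors.append(f"{field} must be between {low} and {high}")
    return errors
```

Running the check locally avoids a round trip that would end in a 400 response.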

Routing modes:

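| Mode | Cost Weight | Latency Weight | Failure Weight | Capability Weight |
|---|---|---|---|---|
| cheap | 0.7 | 0.1 | 0.1 | 0.1 |
| normal | 0.25 | 0.25 | 0.25 | 0.25 |
| high_confidence | 0.05 | 0.1 | 0.15 | 0.7 |
| planning | 0.1 | 0.1 | 0.2 | 0.6 |
| thompson | N/A | N/A | N/A | N/A |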
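To illustrate how a weight profile can shift the outcome, here is a minimal sketch of a weighted-sum scorer. The scoring formula itself is an assumption (this page does not specify how the weights are combined), as are the function names and the convention that every metric is normalized to [0, 1] with higher meaning better (for example, a cost score of 0.9 means "very cheap").

```python
# Weight profiles copied from the routing modes table.
MODE_WEIGHTS = {
    "cheap":           {"cost": 0.7,  "latency": 0.1,  "failure": 0.1,  "capability": 0.1},
    "normal":          {"cost": 0.25, "latency": 0.25, "failure": 0.25, "capability": 0.25},
    "high_confidence": {"cost": 0.05, "latency": 0.1,  "failure": 0.15, "capability": 0.7},
    "planning":        {"cost": 0.1,  "latency": 0.1,  "failure": 0.2,  "capability": 0.6},
}


def score(mode, metrics):
    """Weighted sum of normalized metrics (assumed formula; higher is better)."""
    weights = MODE_WEIGHTS[mode]
    return sum(weights[name] * metrics[name] for name in weights)


def pick_model(mode, candidates):
    """Return the candidate name with the highest score under the given mode."""
    return max(candidates, key=lambda name: score(mode, candidates[name]))


# Hypothetical candidates with made-up metric values.
CANDIDATES = {
    "small-model": {"cost": 0.9, "latency": 0.8, "failure": 0.7, "capability": 0.3},
    "large-model": {"cost": 0.2, "latency": 0.5, "failure": 0.9, "capability": 0.9},
}
```

With these made-up numbers, `pick_model("cheap", CANDIDATES)` favors the small model, while `pick_model("high_confidence", CANDIDATES)` favors the large one.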

The thompson mode does not use a fixed weight profile (hence N/A in the table). Instead, it applies Thompson Sampling with Beta distributions, a reinforcement learning technique, to select models adaptively based on historical reward data.
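The general technique can be sketched as follows. This is a textbook Beta-Bernoulli Thompson sampler, not the server's code: the class name, the uniform Beta(1, 1) prior, and the assumption that rewards lie in [0, 1] are all illustrative choices.

```python
import random


class ThompsonSelector:
    """Keeps one Beta(alpha, beta) posterior per model (hypothetical helper)."""

    def __init__(self, model_ids, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) is a uniform prior: no initial preference between models.
        self.params = {model_id: [1.0, 1.0] for model_id in model_ids}

    def select(self):
        """Draw one sample per model and return the model with the largest draw."""
        samples = {
            model_id: self.rng.betavariate(alpha, beta)
            for model_id, (alpha, beta) in self.params.items()
        }
        return max(samples, key=samples.get)

    def update(self, model_id, reward):
        """Fold an observed reward in [0, 1] into the model's posterior."""
        self.params[model_id][0] += reward
        self.params[model_id][1] += 1.0 - reward
```

Models with little history have wide posteriors and are still sampled occasionally, which is how the method balances exploration against exploiting the best-known model.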

capabilities (optional)

| Field | Type | Description |
|---|---|---|
| planning | bool | Indicates request needs planning capability |

Capabilities influence which routing mode profile is used when no explicit mode is set.
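One plausible reading of that rule is sketched below. The exact mapping is an assumption: this page only states that capabilities influence the profile when no mode is set, so the `planning` to "planning" mapping and the `resolve_mode` name are hypothetical.

```python
def resolve_mode(policy=None, capabilities=None):
    """Pick a routing mode: explicit mode first, then capabilities, then the default."""
    policy = policy or {}
    capabilities = capabilities or {}
    if policy.get("mode"):
        return policy["mode"]  # an explicit mode always wins
    if capabilities.get("planning"):
        return "planning"  # assumed mapping from capability to profile
    return "normal"  # documented default mode
```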

output_format (optional)

| Field | Type | Description |
|---|---|---|
| type | string | Output format: json, markdown, text, xml |
| schema | string | JSON Schema string for validating structured output |
| max_tokens | int | Maximum output tokens to request from provider |
| strip_think | bool | Remove <think>...</think> blocks from response |
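As an illustration of what strip_think does to a response, here is a minimal sketch using a regular expression. It is an approximation of the described behavior, not the server's code: the function name is hypothetical, and the handling of whitespace and of unclosed blocks (left untouched here) is an assumption.

```python
import re

# Non-greedy match so that each <think>...</think> pair is removed separately;
# DOTALL lets a block span multiple lines.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text):
    """Remove every closed <think>...</think> block and trim the result."""
    return THINK_BLOCK.sub("", text).strip()
```

For example, `strip_think("<think>2 + 2 is 4</think>\nThe answer is 4.")` returns `"The answer is 4."`.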

Response Format

{
  "negotiated_model": "gpt-4",
  "estimated_cost_usd": 0.0023,
  "routing_reason": "routed-weight-8",
  "response": {
    "id": "chatcmpl-...",
    "choices": [{
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses..."
      }
    }],
    "usage": {
      "prompt_tokens": 45,
      "completion_tokens": 128,
      "total_tokens": 173
    }
  }
}

| Field | Description |
|---|---|
| negotiated_model | The model ID that was selected |
| estimated_cost_usd | Estimated cost based on model pricing and token counts |
| routing_reason | Why this model was chosen (see Routing Reasons) |
| response | Raw JSON response from the selected provider |
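A client typically wants the routing metadata and the assistant text together. The sketch below shows one way to extract them; `summarize` is a hypothetical helper, and because `response` is the provider's raw JSON, the `choices[0].message.content` path is an assumption that holds for the sample above but may differ for other providers.

```python
import json


def summarize(raw_body):
    """Pull routing metadata and the first assistant message out of a response body."""
    body = json.loads(raw_body)
    # Assumption: the provider payload follows the shape shown in the sample response.
    choices = body["response"].get("choices") or []
    content = choices[0]["message"]["content"] if choices else None
    return {
        "model": body["negotiated_model"],
        "cost_usd": body["estimated_cost_usd"],
        "reason": body["routing_reason"],
        "content": content,
    }
```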

Routing Reasons

| Reason | Description |
|---|---|
| routed-weight-N | Selected by scoring; N is the model's weight |
| model-hint | Client's model hint was used |
| escalated-context-overflow | Escalated to a model with a larger context window |
| retried-transient | Retried after a transient provider error |
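Because routed-weight-N embeds a number, clients that log or aggregate routing reasons may want to split it out. A minimal sketch, assuming N is a non-negative decimal integer (the function name is hypothetical):

```python
import re


def parse_routing_reason(reason):
    """Return (kind, weight); weight is None for reasons that carry no number."""
    match = re.fullmatch(r"routed-weight-(\d+)", reason)
    if match:
        return ("routed-weight", int(match.group(1)))
    return (reason, None)
```

For example, `parse_routing_reason("routed-weight-8")` returns `("routed-weight", 8)`.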

Error Responses

| Status | Body | Cause |
|---|---|---|
| 400 | "bad json" | Malformed request body |
| 400 | "messages required" | Empty messages array |
| 400 | "max_budget_usd must be between 0 and 100" | Policy validation failure |
| 401 | "missing or invalid api key" | Missing or invalid Authorization header |
| 403 | "scope not allowed" | API key lacks chat scope |
| 502 | Error message | All models failed or no eligible models |
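The sketch below shows how a client might send a request and surface these statuses, using only the Python standard library. The helper names are hypothetical, the error body is returned as plain text because its encoding is not specified here, and treating 502 as the only retryable status is an assumption rather than documented behavior.

```python
import json
import urllib.error
import urllib.request

# Assumption: only 502 (all models failed) is worth retrying; 4xx needs a fix first.
RETRYABLE_STATUSES = {502}


def build_request(base_url, api_key, messages, policy=None):
    """Build a POST /v1/chat request without sending it."""
    payload = {"request": {"messages": messages}}
    if policy:
        payload["policy"] = policy
    return urllib.request.Request(
        base_url + "/v1/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


def send_chat(req, timeout=30):
    """Send the request; return (status, body, retryable)."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, json.loads(resp.read()), False
    except urllib.error.HTTPError as err:
        # Error bodies are kept as text because their exact format is not specified.
        text = err.read().decode("utf-8", errors="replace")
        return err.code, text, err.code in RETRYABLE_STATUSES
```

Separating `build_request` from `send_chat` keeps the payload construction testable without a running server.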

Examples

Minimal Request

curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Hello!"}]
    }
  }'

Cost-Optimized Request

curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Summarize this text..."}]
    },
    "policy": {
      "mode": "cheap",
      "max_budget_usd": 0.001
    }
  }'

Request with Model Hint

curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Write a poem about the ocean."}],
      "model_hint": "claude-opus",
      "parameters": {
        "temperature": 0.9,
        "max_tokens": 2048
      }
    }
  }'

Structured JSON Output

curl -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "List 3 programming languages with their year of creation"}]
    },
    "output_format": {
      "type": "json",
      "schema": "{\"type\":\"array\",\"items\":{\"type\":\"object\",\"properties\":{\"name\":{\"type\":\"string\"},\"year\":{\"type\":\"integer\"}}}}"
    }
  }'