Streaming

TokenHub supports Server-Sent Events (SSE) streaming for chat requests. When streaming is enabled, tokens are delivered incrementally as they are generated by the provider.

Enabling Streaming

Set stream: true in your request:

{
  "request": {
    "messages": [{"role": "user", "content": "Tell me a story..."}],
    "stream": true
  }
}

Response Format

Streaming responses use the text/event-stream content type. Each event arrives on its own line, prefixed with data: and followed by a JSON chunk:

data: {"choices":[{"delta":{"content":"Once"},"index":0}]}

data: {"choices":[{"delta":{"content":" upon"},"index":0}]}

data: {"choices":[{"delta":{"content":" a"},"index":0}]}

data: {"choices":[{"delta":{"content":" time"},"index":0}]}

data: [DONE]

The stream ends with data: [DONE].

Response Headers

Streaming responses include these headers:

Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
X-TokenHub-Model: gpt-4
X-TokenHub-Provider: openai
X-TokenHub-Reason: routed-weight-8

The X-TokenHub-* headers provide routing metadata that would normally be in the JSON response envelope.
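Because a streaming body carries no JSON envelope, routing metadata must be read from the headers, which are available before any tokens arrive. A minimal sketch of pulling these fields out of a dict-like headers mapping (the routing_metadata helper is illustrative, not part of TokenHub; the sample values mirror the header list above):

```python
def routing_metadata(headers):
    """Extract TokenHub routing headers from a response's headers mapping.

    Works with any dict-like headers object, e.g. requests.Response.headers,
    which is case-insensitive.
    """
    return {
        "model": headers.get("X-TokenHub-Model"),
        "provider": headers.get("X-TokenHub-Provider"),
        "reason": headers.get("X-TokenHub-Reason"),
    }

# Example with the header values shown above:
meta = routing_metadata({
    "Content-Type": "text/event-stream",
    "X-TokenHub-Model": "gpt-4",
    "X-TokenHub-Provider": "openai",
    "X-TokenHub-Reason": "routed-weight-8",
})
```

With a live request, the same call would be routing_metadata(response.headers) on a requests response opened with stream=True.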

Example with curl

curl -N -X POST http://localhost:8080/v1/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer tokenhub_..." \
  -d '{
    "request": {
      "messages": [{"role": "user", "content": "Count from 1 to 10 slowly."}],
      "stream": true
    }
  }'

The -N flag disables output buffering so tokens appear as they arrive.

Example with Python

import requests
import json

response = requests.post(
    "http://localhost:8080/v1/chat",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer tokenhub_..."
    },
    json={
        "request": {
            "messages": [{"role": "user", "content": "Tell me a story."}],
            "stream": True
        }
    },
    stream=True
)

for line in response.iter_lines():
    if line:  # skip the blank keep-alive lines between events
        text = line.decode("utf-8")
        # each event line is "data: <json>"; the stream ends with "data: [DONE]"
        if text.startswith("data: ") and text != "data: [DONE]":
            chunk = json.loads(text[6:])
            delta = chunk["choices"][0].get("delta", {})
            if "content" in delta:
                print(delta["content"], end="", flush=True)

Provider Compatibility

All three provider adapters support streaming:

Provider     Streaming Protocol
---------    ------------------
OpenAI       SSE (native)
Anthropic    SSE (native)
vLLM         SSE (OpenAI-compatible)

TokenHub passes the SSE stream through directly from the selected provider. The event format matches the provider's native format.
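Since the events are provider-native, a client that may be routed to more than one provider needs a per-provider extraction step, keyed on the X-TokenHub-Provider header. A sketch (the OpenAI/vLLM delta shape matches the events shown earlier; the Anthropic field names follow its published content_block_delta format and are an assumption here, so verify them against the adapter you route to):

```python
def extract_text(provider, data):
    """Pull the text fragment out of one parsed SSE data payload.

    OpenAI and vLLM use choices[0].delta.content; Anthropic (assumed
    here) uses content_block_delta events with delta.text.
    """
    if provider in ("openai", "vllm"):
        delta = (data.get("choices") or [{}])[0].get("delta", {})
        return delta.get("content")
    if provider == "anthropic":
        if data.get("type") == "content_block_delta":
            return data.get("delta", {}).get("text")
    return None

# An OpenAI-style chunk, as in the stream shown earlier:
fragment = extract_text(
    "openai", {"choices": [{"delta": {"content": "Once"}, "index": 0}]}
)
```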

Failover Behavior

Streaming uses the same model selection and failover logic as non-streaming requests. If the selected model fails to establish a stream, TokenHub falls back through eligible models in scored order.

However, once streaming has begun (first bytes sent to the client), failover is not possible. If the provider disconnects mid-stream, the stream ends with an error event.

Limitations

  • Streaming is only available on /v1/chat, not /v1/plan
  • Output format validation (output_format.schema) is not applied to streaming responses
  • Cost estimation in streaming responses may be less accurate since token counts are not known until the stream completes
  • When Temporal workflows are enabled, streaming bypasses Temporal and uses direct engine dispatch