Streaming Events

Server-sent events emitted when stream: true is enabled.

When stream: true is set on POST /v1/chat/completions, the gateway keeps the connection open and pushes tokens as they're generated using server-sent events. The response shape switches from a single JSON object to a sequence of chat.completion.chunk events, terminated by data: [DONE].

This page documents the wire format, the chunk lifecycle, and how to read token usage from a stream.

Enabling streaming

Set stream: true in the request body. The response then arrives with Content-Type: text/event-stream. A minimal request body:

{
  "model": "gpt-4o",
  "messages": [{ "role": "user", "content": "Hi" }],
  "stream": true
}

To also receive token totals at the end of the stream, opt in via stream_options:

{
  "model": "gpt-4o",
  "messages": [{ "role": "user", "content": "Hi" }],
  "stream": true,
  "stream_options": { "include_usage": true }
}

See Token usage in a streamed response for the full behaviour.

Consuming the stream

The simplest path is the official openai SDK - each iteration item is a parsed chunk:

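import OpenAI from 'openai';

// Assumption: the gateway is reached through the SDK's baseURL option, using the
// same host as the raw HTTP example below; apiKey holds your gateway key.
const client = new OpenAI({ apiKey, baseURL: 'https://api.aivene.com/v1' });

const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hi' }],
  stream: true
});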

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}

If you're reading the raw HTTP body, parse SSE lines yourself:

const response = await fetch('https://api.aivene.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${apiKey}`
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hi' }],
    stream: true
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

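let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? ''; // keep the trailing partial line for the next read

  for (const rawLine of lines) {
    const line = rawLine.replace(/\r$/, ''); // tolerate CRLF line endings
    if (!line.startsWith('data: ')) continue;
    const payload = line.slice(6);
    if (payload === '[DONE]') {
      finished = true; // sentinel reached: leave both loops
      break;
    }

    const chunk = JSON.parse(payload);
    if (chunk.error) throw new Error(chunk.error.message); // see Error frames
    const content = chunk.choices?.[0]?.delta?.content; // usage chunks have empty choices
    if (content) process.stdout.write(content);
  }
}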

Wire format

Each event is a single data: line containing a JSON-encoded chunk, followed by a blank line. The stream is closed with a literal data: [DONE] sentinel:

data: {"id":"ilbs_...","object":"chat.completion.chunk","choices":[...]}

data: {"id":"ilbs_...","object":"chat.completion.chunk","choices":[...]}

data: [DONE]

The id is stable across all chunks of one response - use it for correlation and for log lookups in the Console.
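The framing described above can be sketched as a small pure function. This is a minimal illustration and not part of any SDK: parseSseBody and SseParseResult are hypothetical names, and the function assumes the whole response body is already available as one string (a real consumer would buffer incrementally, as in the fetch example earlier).

```typescript
// Hypothetical helper: splits a complete SSE body into its data payloads.
type SseParseResult = { payloads: string[]; done: boolean };

function parseSseBody(body: string): SseParseResult {
  const payloads: string[] = [];
  // Events are separated by a blank line; normalise CRLF line endings first.
  const events = body.replace(/\r\n/g, '\n').split('\n\n');
  for (const event of events) {
    for (const line of event.split('\n')) {
      if (!line.startsWith('data:')) continue; // skip comments and other fields
      const payload = line.slice(5).trimStart();
      if (payload === '[DONE]') {
        // Sentinel: everything after it is ignored.
        return { payloads, done: true };
      }
      payloads.push(payload);
    }
  }
  // No sentinel seen: the stream was cut off before it closed cleanly.
  return { payloads, done: false };
}
```

Each returned payload is still a JSON string; pass it to JSON.parse to get a chunk. A done value of false suggests the connection dropped before the sentinel arrived.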

Chunk anatomy

Every event (except [DONE]) parses to this shape:

interface ChatCompletionChunk {
  id: string;
  object: 'chat.completion.chunk';
  created: number;
  model: string;
  choices: {
    index: number;
    delta: {
      role?: 'assistant';
      content?: string;
      reasoning_content?: string;
      tool_calls?: ToolCallDelta[];
    };
    finish_reason: string | null;
  }[];
  usage?: {
    prompt_tokens: number;
    completion_tokens: number;
    total_tokens: number;
    prompt_tokens_details?: {
      cached_tokens?: number;
      cache_write_tokens?: number;
    };
  };
}

The delta object

delta carries only what's new in the current chunk - never the full running message. The fields appear independently:

Field              When it appears
role               First content chunk only, always 'assistant'.
content            Each text fragment as the model generates it.
reasoning_content  Reasoning models (e.g. DeepSeek, Gemini thinking) when reasoning summaries are enabled.
tool_calls         When the model invokes one or more tools. See Tool call streaming.

Concatenate delta.content across chunks to reconstruct the assistant message.
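As a rough sketch of that concatenation, the function below folds a list of already-parsed chunks into one message. StreamChunk and assembleMessage are illustrative names, and the choice to concatenate reasoning_content the same way as content is an assumption rather than documented behaviour.

```typescript
// Minimal subset of the chunk shape needed for reassembly (illustrative).
interface StreamChunk {
  choices: {
    delta: { role?: string; content?: string; reasoning_content?: string };
    finish_reason: string | null;
  }[];
}

function assembleMessage(chunks: StreamChunk[]): { content: string; reasoning: string } {
  let content = '';
  let reasoning = '';
  for (const chunk of chunks) {
    // choices[0] is undefined for chunks with an empty choices array.
    const delta = chunk.choices[0]?.delta;
    content += delta?.content ?? '';
    // Assumption: reasoning fragments are joined like content fragments.
    reasoning += delta?.reasoning_content ?? '';
  }
  return { content, reasoning };
}
```

Because each delta carries only the new fragment, the order in which chunks are appended matters; process them in arrival order.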

Stream lifecycle

A typical response moves through up to five steps (the usage chunk is optional):

  1. Role chunk - announces the assistant turn.
  2. Content chunks - one per token (or small group of tokens).
  3. Finish chunk - empty delta, populated finish_reason.
  4. Usage chunk (optional) - empty choices, populated usage. Only emitted when stream_options.include_usage is true.
  5. data: [DONE] - closes the stream.

1. Role chunk

{
  "id": "ilbs_ccb8oqnvprv0p2ewiakn4r9s",
  "object": "chat.completion.chunk",
  "created": 1716825600,
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "delta": { "role": "assistant", "content": "" },
    "finish_reason": null
  }]
}

2. Content chunks

{
  "id": "ilbs_ccb8oqnvprv0p2ewiakn4r9s",
  "object": "chat.completion.chunk",
  "created": 1716825600,
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "delta": { "content": "Hello" },
    "finish_reason": null
  }]
}

{
  "id": "ilbs_ccb8oqnvprv0p2ewiakn4r9s",
  "object": "chat.completion.chunk",
  "created": 1716825600,
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "delta": { "content": " there!" },
    "finish_reason": null
  }]
}

3. Finish chunk

{
  "id": "ilbs_ccb8oqnvprv0p2ewiakn4r9s",
  "object": "chat.completion.chunk",
  "created": 1716825600,
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "delta": {},
    "finish_reason": "stop"
  }]
}

4. Usage chunk (opt-in)

See Token usage in a streamed response.
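One way to tell the lifecycle chunks apart in client code is sketched below. classifyChunk and Phase are hypothetical names, and the rules are inferred from the example chunks on this page; tool-call chunks are lumped in with 'content' here for brevity, and any chunk with empty choices is treated as the usage chunk.

```typescript
type Phase = 'role' | 'content' | 'finish' | 'usage';

// Minimal subset of the chunk shape needed for classification (illustrative).
interface LifecycleChunk {
  choices: {
    delta: { role?: string; content?: string };
    finish_reason: string | null;
  }[];
  usage?: object;
}

function classifyChunk(chunk: LifecycleChunk): Phase {
  const choice = chunk.choices[0];
  if (!choice) return 'usage'; // empty choices array: the opt-in usage chunk
  if (choice.finish_reason !== null) return 'finish'; // populated finish_reason
  if (choice.delta.role !== undefined) return 'role'; // announces the assistant turn
  return 'content'; // everything else carries a fragment
}
```

Checking finish_reason before role keeps the function correct even if a provider sends a single chunk that both opens and closes the turn.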

Finish reasons

Value           Meaning
stop            Natural end or a stop sequence was hit.
length          Hit max_completion_tokens (or max_tokens).
tool_calls      Model wants to call one or more tools - resolve them and send the results back in a new request.
content_filter  Output was blocked by safety filters.
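A client will usually branch on these values once the finish chunk arrives. The sketch below maps each documented value to a follow-up step; nextStepFor and the NextStep labels are made up for illustration, and unrecognised values fall through to 'unknown' rather than being treated as an error.

```typescript
type NextStep = 'done' | 'truncated' | 'run_tools' | 'blocked' | 'unknown';

function nextStepFor(finishReason: string | null): NextStep | null {
  switch (finishReason) {
    case null:
      return null; // stream still in progress, nothing to decide yet
    case 'stop':
      return 'done'; // natural end or stop sequence
    case 'length':
      return 'truncated'; // token limit hit; the output may be incomplete
    case 'tool_calls':
      return 'run_tools'; // resolve tools, then send results in a new request
    case 'content_filter':
      return 'blocked'; // output withheld by safety filters
    default:
      return 'unknown'; // value not listed in the table above
  }
}
```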

Tool call streaming

When the model decides to call a tool, the first matching chunk announces the call (id, type, name) with empty arguments:

{
  "choices": [{
    "delta": {
      "tool_calls": [{
        "index": 0,
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": ""
        }
      }]
    },
    "finish_reason": null
  }]
}

Subsequent chunks stream the JSON arguments string in pieces. The index field is your join key - concatenate function.arguments across all chunks with the same index to get the full JSON payload:

{
  "choices": [{
    "delta": {
      "tool_calls": [{
        "index": 0,
        "function": { "arguments": "{\"city\":" }
      }]
    },
    "finish_reason": null
  }]
}

{
  "choices": [{
    "delta": {
      "tool_calls": [{
        "index": 0,
        "function": { "arguments": "\"Tokyo\"}" }
      }]
    },
    "finish_reason": null
  }]
}

The finish chunk for a tool-calling turn carries finish_reason: "tool_calls".
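The join-by-index step can be sketched as follows. The ToolCallDelta shape here is inferred from the example chunks above rather than taken from a published type, and assembleToolCalls and AssembledToolCall are hypothetical names.

```typescript
// Shape inferred from the example chunks; treat it as an assumption.
interface ToolCallDelta {
  index: number;
  id?: string;
  type?: 'function';
  function?: { name?: string; arguments?: string };
}

interface AssembledToolCall {
  id: string;
  name: string;
  arguments: string; // full JSON string once every fragment has arrived
}

function assembleToolCalls(deltas: ToolCallDelta[]): AssembledToolCall[] {
  const byIndex = new Map<number, AssembledToolCall>();
  for (const delta of deltas) {
    let call = byIndex.get(delta.index);
    if (!call) {
      call = { id: '', name: '', arguments: '' };
      byIndex.set(delta.index, call);
    }
    // id and name arrive on the announcing chunk only.
    if (delta.id) call.id = delta.id;
    if (delta.function?.name) call.name = delta.function.name;
    // Argument fragments are appended in arrival order.
    call.arguments += delta.function?.arguments ?? '';
  }
  // Return calls ordered by their index.
  return Array.from(byIndex.entries())
    .sort((a, b) => a[0] - b[0])
    .map((entry) => entry[1]);
}
```

Only call JSON.parse on arguments after the finish chunk has arrived; partial fragments such as {"city": are not valid JSON on their own.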

Token usage in a streamed response

Non-streamed responses always carry usage on the top-level object. Streamed responses do not include it by default - SSE chunks are shaped for incremental rendering, not accounting. To receive token totals from a stream, opt in with stream_options.include_usage: true.

Where it shows up

When opted in, the gateway emits one extra terminal chunk after the chunk that carries finish_reason. Its choices array is empty and the usage field is populated:

{
  "id": "ilbs_ccb8oqnvprv0p2ewiakn4r9s",
  "object": "chat.completion.chunk",
  "created": 1716825600,
  "model": "gpt-4o",
  "choices": [],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 128,
    "total_tokens": 170,
    "prompt_tokens_details": { "cached_tokens": 32 }
  }
}

data: [DONE] follows immediately after.

What gets reported

Field                                     Meaning
prompt_tokens                             Tokens billed for the input (messages + tools + system).
completion_tokens                         Tokens generated by the assistant, including tool-call arguments.
total_tokens                              prompt_tokens + completion_tokens.
prompt_tokens_details.cached_tokens       Subset of prompt_tokens served from prompt cache. Present when the upstream provider reports it.
prompt_tokens_details.cache_write_tokens  Tokens written into prompt cache this turn (Anthropic-style providers).
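The relationships between these fields can be checked with a few lines of arithmetic. In the sketch below, summariseUsage and its result fields are illustrative names; it treats a missing cached_tokens as zero, which is an assumption about how you may want to handle providers that do not report it.

```typescript
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
  prompt_tokens_details?: {
    cached_tokens?: number;
    cache_write_tokens?: number;
  };
}

function summariseUsage(usage: Usage) {
  // Assumption: an absent cached_tokens field means nothing was served from cache.
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return {
    cachedPromptTokens: cached,
    // cached_tokens is a subset of prompt_tokens, so the remainder is uncached input.
    uncachedPromptTokens: usage.prompt_tokens - cached,
    // total_tokens should equal prompt_tokens + completion_tokens.
    totalsConsistent: usage.total_tokens === usage.prompt_tokens + usage.completion_tokens,
  };
}
```

For the example usage chunk on this page (42 prompt tokens, 32 of them cached, 128 completion tokens), this yields 10 uncached prompt tokens and a consistent total of 170.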

How the gateway sources it

Usage is taken from the upstream provider's final SSE event and normalised into the shape above. Behaviour per provider family:

  • OpenAI / Azure OpenAI / OpenAI-compatible - the gateway forwards stream_options.include_usage and surfaces the chunk the provider emits.
  • Anthropic (Claude via OpenAI shape) - the gateway forces include_usage: true upstream, accumulates message_delta usage events, and re-emits them as an OpenAI usage chunk before [DONE].
  • Google (Gemini) - the gateway maps usageMetadata from the final generateContentStream event into the same usage block.

If you do not ask for usage, the gateway still parses any provider usage chunk internally (so logs and billing remain accurate) but strips it from the bytes returned to your client.

Reading it from an SDK

The usage chunk arrives as a normal iteration item - check for it explicitly:

const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hi' }],
  stream: true,
  stream_options: { include_usage: true }
});

let usage;
for await (const chunk of stream) {
  if (chunk.usage) {
    usage = chunk.usage;
    continue;
  }
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}

console.log(usage); // { prompt_tokens, completion_tokens, total_tokens, ... }

Don't want to stream just for usage?

If you only need totals and don't need incremental rendering, drop stream: true and read response.usage from the single JSON payload - it's always present and carries the same fields.

Error frames

If the upstream provider or the gateway itself fails mid-stream, you'll receive an error frame followed by [DONE] instead of a normal finish:

data: {"error":{"message":"upstream timeout","type":"stream_error"}}

data: [DONE]

Stream consumers should always handle a chunk whose top-level error field is populated, in addition to the normal choices-bearing chunks.
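A consumer can fold that check into the same place it handles the [DONE] sentinel. The sketch below is one possible shape: interpretPayload and PayloadResult are hypothetical names, and the error frame layout is taken from the example above.

```typescript
// Error shape as shown in the example frame; the result type is a hypothetical convenience.
interface StreamErrorFrame {
  error: { message: string; type: string };
}

type PayloadResult =
  | { kind: 'done' }
  | { kind: 'error'; message: string; type: string }
  | { kind: 'chunk'; chunk: Record<string, unknown> };

// Takes the text after "data: " on one SSE line and decides what it is.
function interpretPayload(payload: string): PayloadResult {
  if (payload === '[DONE]') return { kind: 'done' };

  const parsed = JSON.parse(payload) as Record<string, unknown>;
  const error = parsed['error'] as StreamErrorFrame['error'] | undefined;
  if (error) {
    // Error frame: there is no choices array to read.
    return { kind: 'error', message: error.message, type: error.type };
  }
  return { kind: 'chunk', chunk: parsed };
}
```

Returning a tagged result instead of throwing lets the caller decide whether to keep any partial output that was already streamed before the error frame.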