Streaming Events
Server-sent events emitted when stream: true is enabled.
When stream: true is set on POST /v1/chat/completions, the gateway
keeps the connection open and pushes tokens as they're generated using
server-sent events.
The response shape switches from a single JSON object to a sequence of
chat.completion.chunk events, terminated by data: [DONE].
This page documents the wire format, the chunk lifecycle, and how to read token usage from a stream.
Enabling streaming
Set stream: true in the request body; the response then arrives with
Content-Type: text/event-stream. A minimal request body:
{
"model": "gpt-4o",
"messages": [{ "role": "user", "content": "Hi" }],
"stream": true
}
To also receive token totals at the end of the stream, opt in via
stream_options:
{
"model": "gpt-4o",
"messages": [{ "role": "user", "content": "Hi" }],
"stream": true,
"stream_options": { "include_usage": true }
}
See Token usage in a streamed response for the full behaviour.
Consuming the stream
The simplest path is the official openai SDK - each iteration item is a
parsed chunk:
const stream = await client.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Hi' }],
stream: true
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) process.stdout.write(content);
}
If you're reading the raw HTTP body, parse SSE lines yourself:
const response = await fetch('https://api.aivene.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${apiKey}`
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hi' }],
    stream: true
  })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
// The loop is labelled so the [DONE] sentinel can stop reading from inside
// the inner loop; a bare return is a syntax error outside a function.
read: while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  // The last element may be a partial line; keep it for the next read.
  buffer = lines.pop() ?? '';
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    // trim() drops a trailing \r if the server uses CRLF line endings.
    const payload = line.slice(6).trim();
    if (payload === '[DONE]') break read;
    const chunk = JSON.parse(payload);
    const content = chunk.choices[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
}
Wire format
Each event is a single data: line containing a JSON-encoded chunk,
followed by a blank line. The stream is closed with a literal
data: [DONE] sentinel:
data: {"id":"ilbs_...","object":"chat.completion.chunk","choices":[...]}

data: {"id":"ilbs_...","object":"chat.completion.chunk","choices":[...]}

data: [DONE]

The id is stable across all chunks of one response - use it for
correlation and for log lookups in the Console.
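The framing rules above can be sketched as a small parser over a fully buffered body. This is an illustration only: parseSseBody is a hypothetical helper name, and the sample id is a placeholder rather than a real response id.

```javascript
// Parse a complete SSE body into chunk objects, stopping at the [DONE] sentinel.
// parseSseBody is a hypothetical helper, not part of any SDK.
function parseSseBody(body) {
  const chunks = [];
  // Events are separated by a blank line; tolerate CRLF line endings.
  for (const event of body.split(/\r?\n\r?\n/)) {
    for (const line of event.split(/\r?\n/)) {
      if (!line.startsWith('data:')) continue; // skip anything that is not a data line
      const payload = line.slice(5).trim();
      if (payload === '[DONE]') return { chunks, done: true };
      chunks.push(JSON.parse(payload));
    }
  }
  return { chunks, done: false }; // body ended without the sentinel
}

// Placeholder body: one chunk event followed by the sentinel.
const sample = [
  'data: {"id":"ilbs_example","object":"chat.completion.chunk","choices":[]}',
  '',
  'data: [DONE]',
  '',
  ''
].join('\n');

const parsed = parseSseBody(sample);
console.log(parsed.chunks.length, parsed.done);
```

A done value of false is a useful signal that the connection dropped before the gateway finished the response.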
Chunk anatomy
Every event (except [DONE]) parses to this shape:
interface ChatCompletionChunk {
id: string;
object: 'chat.completion.chunk';
created: number;
model: string;
choices: {
index: number;
delta: {
role?: 'assistant';
content?: string;
reasoning_content?: string;
tool_calls?: ToolCallDelta[];
};
finish_reason: string | null;
}[];
usage?: {
prompt_tokens: number;
completion_tokens: number;
total_tokens: number;
prompt_tokens_details?: {
cached_tokens?: number;
cache_write_tokens?: number;
};
};
}
The delta object
delta carries only what's new in the current chunk - never the full
running message. The fields appear independently:
| Field | When it appears |
|---|---|
| role | First content chunk only, always 'assistant'. |
| content | Each text fragment as the model generates it. |
| reasoning_content | Reasoning models (e.g. DeepSeek, Gemini thinking) when reasoning summaries are enabled. |
| tool_calls | When the model invokes one or more tools. See Tool call streaming. |
Concatenate delta.content across chunks to reconstruct the assistant
message.
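As a minimal sketch of that concatenation, assuming the chunks have already been parsed into objects (collectContent is a hypothetical helper name):

```javascript
// Rebuild the assistant message from a list of already-parsed chunks.
function collectContent(chunks) {
  let text = '';
  for (const chunk of chunks) {
    // Usage chunks have an empty choices array, so guard every access.
    const piece = chunk.choices?.[0]?.delta?.content;
    if (typeof piece === 'string') text += piece;
  }
  return text;
}

const demoChunks = [
  { choices: [{ index: 0, delta: { role: 'assistant', content: '' }, finish_reason: null }] },
  { choices: [{ index: 0, delta: { content: 'Hello' }, finish_reason: null }] },
  { choices: [{ index: 0, delta: { content: ' there!' }, finish_reason: null }] },
  { choices: [{ index: 0, delta: {}, finish_reason: 'stop' }] }
];

console.log(collectContent(demoChunks)); // prints "Hello there!"
```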
Stream lifecycle
A typical response moves through four phases before the closing sentinel:
- Role chunk - announces the assistant turn.
- Content chunks - one per token (or small group of tokens).
- Finish chunk - empty delta, populated finish_reason.
- Usage chunk (optional) - empty choices, populated usage. Only emitted when stream_options.include_usage is true.
- data: [DONE] - closes the stream.
1. Role chunk
{
"id": "ilbs_ccb8oqnvprv0p2ewiakn4r9s",
"object": "chat.completion.chunk",
"created": 1716825600,
"model": "gpt-4o",
"choices": [{
"index": 0,
"delta": { "role": "assistant", "content": "" },
"finish_reason": null
}]
}
2. Content chunks
{
"id": "ilbs_ccb8oqnvprv0p2ewiakn4r9s",
"object": "chat.completion.chunk",
"created": 1716825600,
"model": "gpt-4o",
"choices": [{
"index": 0,
"delta": { "content": "Hello" },
"finish_reason": null
}]
}
{
"id": "ilbs_ccb8oqnvprv0p2ewiakn4r9s",
"object": "chat.completion.chunk",
"created": 1716825600,
"model": "gpt-4o",
"choices": [{
"index": 0,
"delta": { "content": " there!" },
"finish_reason": null
}]
}
3. Finish chunk
{
"id": "ilbs_ccb8oqnvprv0p2ewiakn4r9s",
"object": "chat.completion.chunk",
"created": 1716825600,
"model": "gpt-4o",
"choices": [{
"index": 0,
"delta": {},
"finish_reason": "stop"
}]
}
4. Usage chunk (opt-in)
See Token usage in a streamed response.
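One way to tell the phases apart in client code is sketched below. classifyChunk and its return labels are hypothetical, and the checks assume chunks shaped like the examples on this page.

```javascript
// Map a parsed chunk to the lifecycle phase it belongs to.
function classifyChunk(chunk) {
  const choices = chunk.choices ?? [];
  // The usage chunk is the only one with an empty choices array.
  if (choices.length === 0) return chunk.usage ? 'usage' : 'unknown';
  const choice = choices[0];
  if (choice.finish_reason != null) return 'finish';
  if (choice.delta?.role) return 'role';
  return 'content';
}

const phases = [
  { choices: [{ index: 0, delta: { role: 'assistant', content: '' }, finish_reason: null }] },
  { choices: [{ index: 0, delta: { content: 'Hello' }, finish_reason: null }] },
  { choices: [{ index: 0, delta: {}, finish_reason: 'stop' }] },
  { choices: [], usage: { prompt_tokens: 42, completion_tokens: 128, total_tokens: 170 } }
].map(classifyChunk);

console.log(phases.join(' > ')); // prints "role > content > finish > usage"
```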
Finish reasons
| Value | Meaning |
|---|---|
| stop | Natural end or a stop sequence was hit. |
| length | Hit max_completion_tokens (or max_tokens). |
| tool_calls | Model wants to call one or more tools - resolve them and send the results back in a new request. |
| content_filter | Output was blocked by safety filters. |
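A client will usually branch on these values once the finish chunk arrives. The sketch below shows one possible mapping; nextAction and the action labels it returns are illustrative names, not anything defined by the gateway.

```javascript
// Translate a finish_reason into what the caller should do next.
// The returned labels are hypothetical, chosen only for this example.
function nextAction(finishReason) {
  switch (finishReason) {
    case 'stop':
      return 'done'; // the message is complete
    case 'length':
      return 'truncated'; // the token limit was reached; the output may be cut off
    case 'tool_calls':
      return 'run_tools'; // resolve the tools, then send the results in a new request
    case 'content_filter':
      return 'blocked'; // the output was withheld by safety filters
    default:
      return 'unknown'; // null (still streaming) or a value this client does not know
  }
}

console.log(nextAction('tool_calls')); // prints "run_tools"
```

Keeping a default branch means a finish reason the client has not seen before is handled explicitly rather than falling through silently.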
Tool call streaming
When the model decides to call a tool, the first matching chunk announces
the call (id, type, name) with empty arguments:
{
"choices": [{
"delta": {
"tool_calls": [{
"index": 0,
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": ""
}
}]
},
"finish_reason": null
}]
}
Subsequent chunks stream the JSON arguments string in pieces. The
index field is your join key - concatenate function.arguments across
all chunks with the same index to get the full JSON payload:
{
"choices": [{
"delta": {
"tool_calls": [{
"index": 0,
"function": { "arguments": "{\"city\":" }
}]
},
"finish_reason": null
}]
}
{
"choices": [{
"delta": {
"tool_calls": [{
"index": 0,
"function": { "arguments": "\"Tokyo\"}" }
}]
},
"finish_reason": null
}]
}
The finish chunk for a tool-calling turn carries finish_reason: "tool_calls".
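The join-by-index rule can be sketched as follows, assuming already-parsed chunks shaped like the examples above. collectToolCalls is a hypothetical helper name, and the arguments string is parsed only after the stream has finished, since intermediate pieces are not valid JSON on their own.

```javascript
// Merge streamed tool_calls deltas into complete calls, keyed by index.
function collectToolCalls(chunks) {
  const calls = new Map(); // index -> { id, type, name, arguments }
  for (const chunk of chunks) {
    for (const part of chunk.choices?.[0]?.delta?.tool_calls ?? []) {
      let call = calls.get(part.index);
      if (!call) {
        call = { id: undefined, type: undefined, name: undefined, arguments: '' };
        calls.set(part.index, call);
      }
      // id, type and name arrive once; arguments arrive in pieces.
      if (part.id) call.id = part.id;
      if (part.type) call.type = part.type;
      if (part.function?.name) call.name = part.function.name;
      call.arguments += part.function?.arguments ?? '';
    }
  }
  // Return the calls ordered by their index.
  return [...calls.entries()].sort((a, b) => a[0] - b[0]).map(([, call]) => call);
}

const toolChunks = [
  { choices: [{ delta: { tool_calls: [{ index: 0, id: 'call_abc123', type: 'function', function: { name: 'get_weather', arguments: '' } }] }, finish_reason: null }] },
  { choices: [{ delta: { tool_calls: [{ index: 0, function: { arguments: '{"city":' } }] }, finish_reason: null }] },
  { choices: [{ delta: { tool_calls: [{ index: 0, function: { arguments: '"Tokyo"}' } }] }, finish_reason: null }] },
  { choices: [{ delta: {}, finish_reason: 'tool_calls' }] }
];

const [call] = collectToolCalls(toolChunks);
console.log(call.name, JSON.parse(call.arguments)); // prints: get_weather { city: 'Tokyo' }
```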
Token usage in a streamed response
Non-streamed responses always carry usage on the top-level object.
Streamed responses do not include it by default - SSE chunks are
shaped for incremental rendering, not accounting. To receive token totals
from a stream, opt in with stream_options.include_usage: true.
Where it shows up
When opted in, the gateway emits one extra terminal chunk after the
chunk that carries finish_reason. Its choices array is empty and the
usage field is populated:
{
"id": "ilbs_ccb8oqnvprv0p2ewiakn4r9s",
"object": "chat.completion.chunk",
"created": 1716825600,
"model": "gpt-4o",
"choices": [],
"usage": {
"prompt_tokens": 42,
"completion_tokens": 128,
"total_tokens": 170,
"prompt_tokens_details": { "cached_tokens": 32 }
}
}
data: [DONE] follows immediately after.
What gets reported
| Field | Meaning |
|---|---|
| prompt_tokens | Tokens billed for the input (messages + tools + system). |
| completion_tokens | Tokens generated by the assistant, including tool-call arguments. |
| total_tokens | prompt_tokens + completion_tokens. |
| prompt_tokens_details.cached_tokens | Subset of prompt_tokens served from prompt cache. Present when the upstream provider reports it. |
| prompt_tokens_details.cache_write_tokens | Tokens written into prompt cache this turn (Anthropic-style providers). |
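Because cached_tokens is a subset of prompt_tokens, the uncached share can be derived by subtraction. The sketch below applies that to the example usage chunk; summariseUsage and its output field names are hypothetical, and a missing prompt_tokens_details is treated as zero cached tokens.

```javascript
// Derive cache figures from a usage object.
function summariseUsage(usage) {
  // The details object is optional, so default to zero cached tokens.
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return {
    cachedPromptTokens: cached,
    uncachedPromptTokens: usage.prompt_tokens - cached,
    cacheHitRatio: usage.prompt_tokens > 0 ? cached / usage.prompt_tokens : 0
  };
}

const exampleUsage = {
  prompt_tokens: 42,
  completion_tokens: 128,
  total_tokens: 170,
  prompt_tokens_details: { cached_tokens: 32 }
};

const summary = summariseUsage(exampleUsage);
console.log(summary.cachedPromptTokens, summary.uncachedPromptTokens); // prints "32 10"
```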
How the gateway sources it
Usage is taken from the upstream provider's final SSE event and normalised into the shape above. Behaviour per provider family:
- OpenAI / Azure OpenAI / OpenAI-compatible - the gateway forwards stream_options.include_usage and surfaces the chunk the provider emits.
- Anthropic (Claude via OpenAI shape) - the gateway forces include_usage: true upstream, accumulates message_delta usage events, and re-emits them as an OpenAI usage chunk before [DONE].
- Google (Gemini) - the gateway maps usageMetadata from the final generateContentStream event into the same usage block.
If you do not ask for usage, the gateway still parses any provider usage chunk internally (so logs and billing remain accurate) but strips it from the bytes returned to your client.
Reading it from an SDK
The usage chunk arrives as a normal iteration item - check for it explicitly:
const stream = await client.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Hi' }],
stream: true,
stream_options: { include_usage: true }
});
let usage;
for await (const chunk of stream) {
if (chunk.usage) {
usage = chunk.usage;
continue;
}
const delta = chunk.choices[0]?.delta?.content;
if (delta) process.stdout.write(delta);
}
console.log(usage); // { prompt_tokens, completion_tokens, total_tokens, ... }
Don't want to stream just for usage?
If you only need totals and don't need incremental rendering, drop
stream: true and read response.usage from the single JSON payload -
it's always present and carries the same fields.
Error frames
If the upstream provider or the gateway itself fails mid-stream, you'll
receive an error frame followed by [DONE] instead of a normal finish:
data: {"error":{"message":"upstream timeout","type":"stream_error"}}
data: [DONE]
Stream consumers should always handle a chunk whose top-level error
field is populated, in addition to the normal choices-bearing chunks.
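A minimal sketch of that check, assuming each data: payload is handled one at a time. StreamError and handlePayload are hypothetical names; the error frame shape is the one shown above.

```javascript
// Error type carrying the fields of an error frame.
class StreamError extends Error {
  constructor(error) {
    super(error.message);
    this.name = 'StreamError';
    this.type = error.type;
  }
}

// Handle one data: payload. Returns false once the stream is finished.
function handlePayload(payload, onContent) {
  if (payload === '[DONE]') return false;
  const chunk = JSON.parse(payload);
  // Check for an error frame before touching choices, which it does not carry.
  if (chunk.error) throw new StreamError(chunk.error);
  const content = chunk.choices?.[0]?.delta?.content;
  if (content) onContent(content);
  return true;
}

const received = [];
let caught = null;
try {
  for (const payload of [
    '{"choices":[{"index":0,"delta":{"content":"Hel"},"finish_reason":null}]}',
    '{"error":{"message":"upstream timeout","type":"stream_error"}}',
    '[DONE]'
  ]) {
    if (!handlePayload(payload, (text) => received.push(text))) break;
  }
} catch (err) {
  caught = err;
}

console.log(received.join(''), caught?.type); // prints "Hel stream_error"
```

Throwing keeps the partial text in received available to the caller, which can then decide whether to show it, retry, or discard it.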