Audio Understanding
Send audio to a chat model and reason over it without transcription.
Audio understanding models accept audio as part of a user message and reason over it alongside any text. Use this for voice note summarization, meeting analysis, sentiment detection, and audio Q&A - all without a separate transcription step.
POST /v1/chat/completions
Which models support audio input?
Not all models accept audio input. Check a model's capabilities with
GET /v1/models and look for audio_input in its supported input types.
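A minimal sketch of that check. The response shape of GET /v1/models is not shown on this page, so the `ModelInfo` type and its `input_types` field are assumptions made for illustration; rename them to match what the endpoint actually returns.

```typescript
// Hypothetical shape: the real /v1/models payload may name this field differently.
interface ModelInfo {
  id: string;
  input_types?: string[]; // assumed field listing the supported input types
}

// Return the ids of the models that advertise audio input.
function audioCapableModels(models: ModelInfo[]): string[] {
  return models
    .filter((m) => (m.input_types || []).some((t) => t === 'audio_input'))
    .map((m) => m.id);
}
```

Models that omit the field are treated as not supporting audio, which is the conservative choice.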
Base64 input
Embed the audio bytes inline as a base64 string and specify the audio format.
import { readFile } from 'node:fs/promises';

// `client` is assumed to be an SDK client that was configured earlier.
// Read the file and encode its bytes as base64.
const bytes = await readFile('voicenote.m4a');
const base64 = bytes.toString('base64');

const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Summarise this voice note and list any action items.' },
      {
        type: 'input_audio',
        input_audio: { data: base64, format: 'm4a' }
      }
    ]
  }]
});

input_audio part fields
| Field | Type | Required | Description |
|---|---|---|---|
| data | string | yes | Base64-encoded audio bytes. |
| format | string | yes | Audio format: wav, mp3, m4a, ogg, flac, aac, webm, aiff, mpeg. |
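The two fields can be assembled by a small helper. The sketch below infers `format` from a file name and rejects extensions outside the list in the table. `audioPartFromFile` and `SUPPORTED_FORMATS` are illustrative names, not part of any SDK, and treating the file extension as the format value is an assumption.

```typescript
// Formats listed in the input_audio fields table.
const SUPPORTED_FORMATS = ['wav', 'mp3', 'm4a', 'ogg', 'flac', 'aac', 'webm', 'aiff', 'mpeg'];

interface InputAudioPart {
  type: 'input_audio';
  input_audio: { data: string; format: string };
}

// Build an input_audio content part from a file name and its base64-encoded bytes.
function audioPartFromFile(fileName: string, base64: string): InputAudioPart {
  const dot = fileName.lastIndexOf('.');
  const format = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();
  if (SUPPORTED_FORMATS.indexOf(format) === -1) {
    throw new Error(`Unsupported audio format: "${format}"`);
  }
  return { type: 'input_audio', input_audio: { data: base64, format } };
}
```

Failing early on an unknown extension avoids a round trip that would likely end in a 400 response.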
Common use cases
Voice note summary
const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Summarise this voice note in 3 bullet points.' },
      { type: 'input_audio', input_audio: { data: base64, format: 'mp3' } }
    ]
  }]
});

Meeting action items
const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Extract action items with owner and deadline from this meeting recording.' },
      { type: 'input_audio', input_audio: { data: meetingAudio, format: 'wav' } }
    ]
  }]
});

Sentiment analysis
const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'sentiment',
      schema: {
        type: 'object',
        properties: {
          sentiment: { type: 'string', enum: ['positive', 'neutral', 'negative'] },
          confidence: { type: 'number' },
          reasoning: { type: 'string' }
        },
        required: ['sentiment', 'confidence', 'reasoning']
      }
    }
  },
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Analyse the sentiment of the speaker.' },
      { type: 'input_audio', input_audio: { data: base64, format: 'm4a' } }
    ]
  }]
});

Combining with other modalities
Audio can be mixed with images and text in the same message.
const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Does the audio description match what is shown in the image?' },
      { type: 'input_audio', input_audio: { data: audioBase64, format: 'mp3' } },
      { type: 'image_url', image_url: { url: 'https://example.com/diagram.png' } }
    ]
  }]
});

Structured output
Pair audio with a response_format of type 'json_schema' for typed extraction.
const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'call_summary',
      schema: {
        type: 'object',
        properties: {
          participants: { type: 'array', items: { type: 'string' } },
          topics: { type: 'array', items: { type: 'string' } },
          action_items: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                task: { type: 'string' },
                owner: { type: 'string' },
                deadline: { type: 'string' }
              },
              required: ['task']
            }
          },
          next_steps: { type: 'string' }
        },
        required: ['topics', 'action_items']
      }
    }
  },
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Extract meeting details from this call.' },
      { type: 'input_audio', input_audio: { data: callAudio, format: 'wav' } }
    ]
  }]
});

Audio understanding vs transcription
| Use case | Endpoint | When to use |
|---|---|---|
| Raw transcript | /v1/audio/transcriptions | Need verbatim text or subtitles |
| Translation to English | /v1/audio/translations | Source is non-English, need English text only |
| Reasoning over audio | /v1/chat/completions with input_audio | Need summarization, Q&A, sentiment, or tool calls on top of speech |
One round trip vs two
If you need both a transcript and reasoning, sending input_audio to chat
returns the reasoning in a single request, with no separate transcription
call first. If you only need the text, the transcription endpoints are
cheaper and faster.
File limits
| Constraint | Limit |
|---|---|
| Max file size | Bound by request body size; keep under ~20 MB |
| Recommended duration | < 5 minutes for best latency |
| Supported formats | wav, mp3, m4a, ogg, flac, aac, webm, aiff, mpeg |
For longer recordings, split on silence first and process in chunks.
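Base64 encoding inflates the payload by about a third, so a size check is best done on the encoded length. A rough sketch, assuming the ~20 MB guideline applies to the encoded audio and is read as 20 MiB; `MAX_BODY_BYTES`, `base64Length`, and `fitsInRequest` are illustrative names.

```typescript
// Assumed budget: the ~20 MB guideline applied to the base64 payload (20 MiB here).
const MAX_BODY_BYTES = 20 * 1024 * 1024;

// Length of the padded base64 string produced from `rawBytes` bytes of audio.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// True when the encoded audio alone stays within the assumed budget.
function fitsInRequest(rawBytes: number, limit: number = MAX_BODY_BYTES): boolean {
  return base64Length(rawBytes) <= limit;
}
```

Under these assumptions, roughly 15 MB of raw audio fills the budget before the rest of the JSON body is counted.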
Cost notes
Audio input is billed as audio input tokens. Token counts appear in the
usage field of the response, and longer audio consumes more tokens.
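When a recording is processed in several requests, the per-response counts can be added up. A minimal sketch, assuming the usage object carries OpenAI-style `prompt_tokens` and `completion_tokens` fields; this page does not list the field names, so treat them as assumptions.

```typescript
// Assumed usage shape (OpenAI-style); the field names are not confirmed by this page.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// Sum token counts across several responses, for example one per audio chunk.
function totalUsage(usages: Usage[]): Usage {
  return usages.reduce(
    (acc, u) => ({
      prompt_tokens: acc.prompt_tokens + u.prompt_tokens,
      completion_tokens: acc.completion_tokens + u.completion_tokens
    }),
    { prompt_tokens: 0, completion_tokens: 0 }
  );
}
```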
Errors
| Status | Reason |
|---|---|
| 400 | Unsupported format or corrupted audio. |
| 413 | File exceeded body size limit. |
| 415 | Model does not support audio input. |
| 429 | Rate limit exceeded. |
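One way to act on these statuses in client code is sketched below. The reason strings come from the table; the retry policy (retry only on 429, with exponential backoff from a 500 ms base) is an assumption rather than documented behaviour, and all three function names are illustrative.

```typescript
// Reasons taken from the error table.
const AUDIO_ERROR_REASONS: Record<number, string> = {
  400: 'Unsupported format or corrupted audio.',
  413: 'File exceeded body size limit.',
  415: 'Model does not support audio input.',
  429: 'Rate limit exceeded.'
};

// Map a status code to its documented reason, with a fallback for anything else.
function explainStatus(status: number): string {
  return AUDIO_ERROR_REASONS[status] || `Unexpected status ${status}`;
}

// Assumption: only 429 is worth retrying unchanged; the others need a different request.
function isRetryable(status: number): boolean {
  return status === 429;
}

// Exponential backoff delay in milliseconds for a 0-based attempt number (hypothetical policy).
function backoffMs(attempt: number, baseMs: number = 500): number {
  return baseMs * Math.pow(2, attempt);
}
```

A 400, 413, or 415 points at the request itself, so the fix is to re-encode the audio, shrink or split the file, or pick a different model.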
Next steps
- Transcriptions - dedicated speech-to-text endpoint.
- Text-to-Speech - generate spoken audio from text.
- Chat Completions reference - the full message schema.