Audio Understanding

Send audio to a chat model and reason over it without transcription.

Audio understanding models accept audio as part of a user message and reason over it alongside any text. Use this for voice note summarization, meeting analysis, sentiment detection, and audio Q&A - all without a separate transcription step.

POST /v1/chat/completions

Which models support audio input?

Not all models accept audio input. Check the model's capabilities on GET /v1/models - look for audio_input in the supported input types.
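
The check described above can be sketched as a small filter over the models list. The response shape used here (a `data` array whose entries carry an `input_types` array) is an assumption for illustration, not a documented schema; pass the real field name once you have inspected the output of GET /v1/models.

```javascript
// Hypothetical helper: return the ids of models that list `audio_input`
// among their supported input types.
// ASSUMPTION: the body looks like { data: [{ id, input_types: [...] }] }.
// The `input_types` field name is a guess; override it via `field`.
function modelsWithAudioInput(modelsResponse, field = 'input_types') {
  const models = Array.isArray(modelsResponse?.data) ? modelsResponse.data : [];
  return models
    .filter((m) => Array.isArray(m[field]) && m[field].includes('audio_input'))
    .map((m) => m.id);
}

// Made-up response body for illustration only (not real API output).
const sample = {
  data: [
    { id: 'model-a', input_types: ['text', 'audio_input'] },
    { id: 'model-b', input_types: ['text'] },
  ],
};
console.log(modelsWithAudioInput(sample)); // [ 'model-a' ]
```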

Base64 input

Embed the audio bytes inline as a base64 string and specify the audio format.

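import { readFile } from 'node:fs/promises';

// Read the audio file and base64-encode its bytes.
const bytes = await readFile('voicenote.m4a');
const base64 = bytes.toString('base64');

// `client` is assumed to be an already-configured SDK client (setup not shown).
const res = await client.chat.completions.create({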
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Summarise this voice note and list any action items.' },
      {
        type: 'input_audio',
        input_audio: { data: base64, format: 'm4a' }
      }
    ]
  }]
});

input_audio part fields

Field	Type	Required	Description
`data`	string	yes	Base64 encoded audio bytes.
`format`	string	yes	Audio format: `wav`, `mp3`, `m4a`, `ogg`, `flac`, `aac`, `webm`, `aiff`, `mpeg`.
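
As a rough sketch, the two fields can be filled from a file name and its bytes. The helper names below are hypothetical, and the code assumes the file extension matches the actual encoding, which the API may or may not verify.

```javascript
// Formats listed in the table above.
const SUPPORTED_FORMATS = ['wav', 'mp3', 'm4a', 'ogg', 'flac', 'aac', 'webm', 'aiff', 'mpeg'];

// Hypothetical helper: derive the `format` value from a file name.
// ASSUMPTION: the extension reflects the real encoding of the file.
function audioFormatFromPath(path) {
  const match = /\.([a-z0-9]+)$/i.exec(path);
  const ext = match ? match[1].toLowerCase() : '';
  if (!SUPPORTED_FORMATS.includes(ext)) {
    throw new Error(`Unsupported audio format: "${ext || 'none'}"`);
  }
  return ext;
}

// Hypothetical helper: build an input_audio content part from a Node.js Buffer.
function toInputAudioPart(bytes, path) {
  return {
    type: 'input_audio',
    input_audio: { data: bytes.toString('base64'), format: audioFormatFromPath(path) },
  };
}

console.log(audioFormatFromPath('voicenote.m4a')); // m4a
```

Rejecting unknown extensions locally avoids a round trip that would likely end in a `400`.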

Common use cases

Voice note summary

const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Summarise this voice note in 3 bullet points.' },
      { type: 'input_audio', input_audio: { data: base64, format: 'mp3' } }
    ]
  }]
});

Meeting action items

// `meetingAudio` is the base64-encoded recording, prepared as in the Base64 input example.
const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Extract action items with owner and deadline from this meeting recording.' },
      { type: 'input_audio', input_audio: { data: meetingAudio, format: 'wav' } }
    ]
  }]
});

Sentiment analysis

const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'sentiment',
      schema: {
        type: 'object',
        properties: {
          sentiment: { type: 'string', enum: ['positive', 'neutral', 'negative'] },
          confidence: { type: 'number' },
          reasoning: { type: 'string' }
        },
        required: ['sentiment', 'confidence', 'reasoning']
      }
    }
  },
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Analyse the sentiment of the speaker.' },
      { type: 'input_audio', input_audio: { data: base64, format: 'm4a' } }
    ]
  }]
});

Combining with other modalities

Audio can be mixed with images and text in the same message.

// `audioBase64` is the base64-encoded audio, prepared as in the Base64 input example.
const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Does the audio description match what is shown in the image?' },
      { type: 'input_audio', input_audio: { data: audioBase64, format: 'mp3' } },
      { type: 'image_url', image_url: { url: 'https://example.com/diagram.png' } }
    ]
  }]
});

Structured output

Pair audio with a response_format of type json_schema for typed extraction.

// `callAudio` is the base64-encoded call recording, prepared as in the Base64 input example.
const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'call_summary',
      schema: {
        type: 'object',
        properties: {
          participants: { type: 'array', items: { type: 'string' } },
          topics: { type: 'array', items: { type: 'string' } },
          action_items: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                task: { type: 'string' },
                owner: { type: 'string' },
                deadline: { type: 'string' }
              },
              required: ['task']
            }
          },
          next_steps: { type: 'string' }
        },
        required: ['topics', 'action_items']
      }
    }
  },
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Extract meeting details from this call.' },
      { type: 'input_audio', input_audio: { data: callAudio, format: 'wav' } }
    ]
  }]
});
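
Reading the result back might look like the sketch below. It assumes the JSON text arrives in `res.choices[0].message.content`, which is the usual chat completions shape but is not stated on this page; `parseCallSummary` is a hypothetical helper name.

```javascript
// Hypothetical helper: parse the structured output of the call above.
// ASSUMPTION: the JSON text is in res.choices[0].message.content.
function parseCallSummary(res) {
  const content = res?.choices?.[0]?.message?.content;
  if (typeof content !== 'string') {
    throw new Error('Response has no text content to parse');
  }
  const summary = JSON.parse(content);
  // `topics` and `action_items` are the fields the schema marks as required.
  if (!Array.isArray(summary.topics) || !Array.isArray(summary.action_items)) {
    throw new Error('Response is missing required fields');
  }
  // Optional fields get explicit defaults so callers need no extra checks.
  return {
    participants: summary.participants ?? [],
    topics: summary.topics,
    action_items: summary.action_items,
    next_steps: summary.next_steps ?? '',
  };
}
```

Checking the required fields after parsing is a defensive choice; it costs little and catches responses that do not match the schema.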

Audio understanding vs transcription

Use case	Endpoint	When to use
Raw transcript	`/v1/audio/transcriptions`	Need verbatim text or subtitles
Translation to English	`/v1/audio/translations`	Source is non-English, need English text only
Reasoning over audio	`/v1/chat/completions` with `input_audio`	Need summarization, Q&A, sentiment, or tool calls on top of speech

One round trip vs two

If you would otherwise transcribe first and then send the transcript to a chat model, input_audio in chat gets you the reasoning in a single request. If you only need the text, the transcription endpoints are cheaper and faster.

File limits

Constraint	Limit
Max file size	Bound by request body size; keep under ~20 MB
Recommended duration	< 5 minutes for best latency
Supported formats	wav, mp3, m4a, ogg, flac, aac, webm, aiff, mpeg
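
Because the audio travels as base64, the encoded payload is about one third larger than the file on disk (every 3 bytes become 4 characters). A minimal sketch of a pre-flight size check follows; treating "~20 MB" as 20 MiB and reserving 64 KiB for the rest of the JSON body are both assumptions, and the helper names are hypothetical.

```javascript
// ASSUMPTION: the "~20 MB" guidance is treated as 20 MiB of request body.
const BODY_LIMIT_BYTES = 20 * 1024 * 1024;

// Length of the padded base64 encoding of `byteLength` raw bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// True if the encoded audio plus `overheadBytes` (prompt, schema and other
// JSON in the body; the default is a guess) stays within the limit.
function fitsInRequest(byteLength, overheadBytes = 64 * 1024) {
  return base64Length(byteLength) + overheadBytes <= BODY_LIMIT_BYTES;
}

console.log(base64Length(3)); // 4
console.log(fitsInRequest(10 * 1024 * 1024)); // true
console.log(fitsInRequest(15 * 1024 * 1024)); // false
```

Under these assumptions, a raw file of roughly 15 MiB already fills the body once encoded.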

For longer recordings, split on silence first and process in chunks.
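
One way to process the chunks is a simple sequential loop, sketched below. The splitting step itself is not shown; `chunks` is assumed to be an array of `{ data, format }` objects you produced beforehand, `summarizeChunks` is a hypothetical name, and the response is assumed to expose its text at `choices[0].message.content`.

```javascript
// Hypothetical sketch: send pre-split audio chunks one at a time and collect
// one partial summary per chunk, in order.
// ASSUMPTIONS: `client` has the same interface as in the examples above;
// `chunks` is [{ data: <base64 string>, format: <string> }, ...].
async function summarizeChunks(client, chunks, model) {
  const partials = [];
  for (const [i, chunk] of chunks.entries()) {
    const res = await client.chat.completions.create({
      model,
      messages: [{
        role: 'user',
        content: [
          { type: 'text', text: `Summarise part ${i + 1} of ${chunks.length} of this recording.` },
          { type: 'input_audio', input_audio: { data: chunk.data, format: chunk.format } },
        ],
      }],
    });
    partials.push(res.choices[0].message.content);
  }
  return partials;
}
```

Sequential requests keep the order stable and are gentle on rate limits; the partial summaries can then be merged with one final text-only request.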

Cost notes

Audio input is billed as audio input tokens. Token counts appear in the usage field of the response. Longer audio consumes more tokens.
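
To track spend across several requests, the counters in `usage` can be summed. This page does not list the counter names inside `usage`, so the sketch below adds up whatever top-level numeric fields it finds; `totalUsage` is a hypothetical helper, and nested breakdown objects, if any, are ignored.

```javascript
// Hypothetical helper: add up every top-level numeric counter found in
// `usage` across several responses, without assuming specific counter names.
function totalUsage(responses) {
  const totals = {};
  for (const res of responses) {
    for (const [key, value] of Object.entries(res?.usage ?? {})) {
      if (typeof value === 'number') {
        totals[key] = (totals[key] ?? 0) + value;
      }
    }
  }
  return totals;
}
```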

Errors

Status	Reason
`400`	Unsupported format or corrupted audio.
`413`	File exceeded body size limit.
`415`	Model does not support audio input.
`429`	Rate limit exceeded.
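
Of these statuses, only `429` is likely to succeed when the same request is sent again; the others call for a different file, a smaller payload, or a different model. A minimal retry sketch, assuming the client throws errors that carry a numeric `status` property (an assumption about the SDK, as are the helper names and the delay values):

```javascript
// Only rate limiting is worth retrying with an unchanged request.
function isRetryable(status) {
  return status === 429;
}

// Hypothetical wrapper: retry `send` on 429 with exponential backoff.
// ASSUMPTION: thrown errors expose the HTTP status as `err.status`.
async function withRetry(send, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await send();
    } catch (err) {
      if (!isRetryable(err?.status) || attempt >= retries) throw err;
      // Wait baseDelayMs, then 2x, 4x, ... before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```

Usage would wrap the call itself, for example `withRetry(() => client.chat.completions.create(request))`.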

Next steps

Transcriptions - dedicated speech-to-text endpoint.
Text-to-Speech - generate spoken audio from text.
Chat Completions reference - the full message schema.