Audio Understanding

Send audio to a chat model and reason over it without transcription.

Audio understanding models accept audio as part of a user message and reason over it alongside any text. Use this for voice note summarization, meeting analysis, sentiment detection, and audio Q&A - all without a separate transcription step.

POST /v1/chat/completions

Which models support audio input?

Not all models accept audio input. Check the model's capabilities on GET /v1/models - look for audio_input in the supported input types.
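
The capability check can be sketched as a small filter over the model list. This is a minimal illustration, not the documented response schema: it assumes the response has a data array of model objects, each with an id and an input_types array. The field name input_types is hypothetical; only the audio_input value comes from the text above.

```javascript
// Assumption: GET /v1/models returns { data: [{ id, input_types: [...] }] }.
// `input_types` is a hypothetical field name; adjust it to the real response.
function audioCapableModels(modelsResponse) {
  const models = Array.isArray(modelsResponse?.data) ? modelsResponse.data : [];
  return models
    .filter((m) => Array.isArray(m.input_types) && m.input_types.includes('audio_input'))
    .map((m) => m.id);
}

// Hand-written sample, not real API output.
const sample = {
  data: [
    { id: 'model-a', input_types: ['text', 'audio_input'] },
    { id: 'model-b', input_types: ['text'] },
  ],
};
console.log(audioCapableModels(sample)); // [ 'model-a' ]
```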

Base64 input

Embed the audio bytes inline as a base64 string and specify the audio format.

import { readFile } from 'node:fs/promises';

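// Read the audio file and base64-encode its bytes.
const bytes = await readFile('voicenote.m4a');
const base64 = bytes.toString('base64');

// `client` is assumed to be an already-configured OpenAI-compatible SDK client.
const res = await client.chat.completions.create({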
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Summarise this voice note and list any action items.' },
      {
        type: 'input_audio',
        input_audio: { data: base64, format: 'm4a' }
      }
    ]
  }]
});

input_audio part fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| data | string | yes | Base64-encoded audio bytes. |
| format | string | yes | Audio format: wav, mp3, m4a, ogg, flac, aac, webm, aiff, mpeg. |
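
A small helper can build the input_audio part and catch an unsupported format before the request is sent. The sketch below is illustrative: toInputAudioPart is a hypothetical name, and it assumes the file extension matches one of the format values listed for the format field.

```javascript
// Format values listed for the `format` field.
const SUPPORTED_FORMATS = ['wav', 'mp3', 'm4a', 'ogg', 'flac', 'aac', 'webm', 'aiff', 'mpeg'];

// Hypothetical helper: builds an input_audio content part from raw bytes,
// taking the format from the file extension (assumed to match the format name).
function toInputAudioPart(bytes, filename) {
  const dot = filename.lastIndexOf('.');
  const format = dot === -1 ? '' : filename.slice(dot + 1).toLowerCase();
  if (!SUPPORTED_FORMATS.includes(format)) {
    throw new Error(`Unsupported audio format: "${format}"`);
  }
  return {
    type: 'input_audio',
    input_audio: { data: Buffer.from(bytes).toString('base64'), format },
  };
}
```

The returned object can be placed directly in the content array of a user message, next to a text part.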

Common use cases

Voice note summary

const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Summarise this voice note in 3 bullet points.' },
      { type: 'input_audio', input_audio: { data: base64, format: 'm4a' } }
    ]
  }]
});

Meeting action items

const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Extract action items with owner and deadline from this meeting recording.' },
      { type: 'input_audio', input_audio: { data: meetingAudio, format: 'wav' } }
    ]
  }]
});

Sentiment analysis

const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'sentiment',
      schema: {
        type: 'object',
        properties: {
          sentiment: { type: 'string', enum: ['positive', 'neutral', 'negative'] },
          confidence: { type: 'number' },
          reasoning: { type: 'string' }
        },
        required: ['sentiment', 'confidence', 'reasoning']
      }
    }
  },
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Analyse the sentiment of the speaker.' },
      { type: 'input_audio', input_audio: { data: base64, format: 'm4a' } }
    ]
  }]
});

Combining with other modalities

Audio can be mixed with images and text in the same message.

const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Does the audio description match what is shown in the image?' },
      { type: 'input_audio', input_audio: { data: audioBase64, format: 'mp3' } },
      { type: 'image_url', image_url: { url: 'https://example.com/diagram.png' } }
    ]
  }]
});

Structured output

Pair audio with a response_format of type 'json_schema' for typed extraction.

const res = await client.chat.completions.create({
  model: 'gemini-2.5-flash-lite',
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'call_summary',
      schema: {
        type: 'object',
        properties: {
          participants: { type: 'array', items: { type: 'string' } },
          topics: { type: 'array', items: { type: 'string' } },
          action_items: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                task: { type: 'string' },
                owner: { type: 'string' },
                deadline: { type: 'string' }
              },
              required: ['task']
            }
          },
          next_steps: { type: 'string' }
        },
        required: ['topics', 'action_items']
      }
    }
  },
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Extract meeting details from this call.' },
      { type: 'input_audio', input_audio: { data: callAudio, format: 'wav' } }
    ]
  }]
});
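
To consume the result, the structured payload has to be parsed out of the completion. The sketch below assumes the schema-constrained output arrives as a JSON string in choices[0].message.content, as it does in OpenAI-style chat completions; parseStructured is a hypothetical helper name.

```javascript
// Hypothetical helper: extracts and parses the JSON payload of a completion.
// Assumes the result is a JSON string in choices[0].message.content.
function parseStructured(res) {
  const content = res?.choices?.[0]?.message?.content;
  if (typeof content !== 'string') {
    throw new Error('No text content in response');
  }
  return JSON.parse(content);
}

// Hand-written response for illustration, not real API output.
const mockResponse = {
  choices: [{
    message: {
      content: '{"topics":["launch"],"action_items":[{"task":"Send notes"}]}',
    },
  }],
};
const summary = parseStructured(mockResponse);
console.log(summary.action_items[0].task); // Send notes
```

Fields outside the schema's required list (owner, deadline, participants, next_steps) may be absent, so check for them before use.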

Audio understanding vs transcription

| Use case | Endpoint | When to use |
| --- | --- | --- |
| Raw transcript | /v1/audio/transcriptions | Need verbatim text or subtitles |
| Translation to English | /v1/audio/translations | Source is non-English, need English text only |
| Reasoning over audio | /v1/chat/completions with input_audio | Need summarization, Q&A, sentiment, or tool calls on top of speech |

One round trip vs two

If you would otherwise transcribe the audio and then reason over the transcript, input_audio in chat gives you the reasoning directly in a single request. If you only need the text, the transcription endpoints are cheaper and faster.

File limits

| Constraint | Limit |
| --- | --- |
| Max file size | Bound by request body size; keep under ~20 MB |
| Recommended duration | < 5 minutes for best latency |
| Supported formats | wav, mp3, m4a, ogg, flac, aac, webm, aiff, mpeg |
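
Because the audio travels as base64 inside the request body, the encoded size is what counts against a body limit. Base64 emits 4 characters for every 3 input bytes, so the payload grows by about a third. The sketch below assumes the ~20 MB guideline applies to the encoded body and reads it as 20 MiB; both are assumptions, and the function names are illustrative.

```javascript
// Assumption: "~20 MB" is read as 20 MiB and applied to the encoded payload.
const MAX_BODY_BYTES = 20 * 1024 * 1024;

// Length of the padded base64 encoding of `byteLength` raw bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// True when the encoded audio alone stays within the limit.
function fitsInBody(byteLength, limit = MAX_BODY_BYTES) {
  return base64Length(byteLength) <= limit;
}

console.log(fitsInBody(15 * 1024 * 1024)); // true
console.log(fitsInBody(16 * 1024 * 1024)); // false
```

Under these assumptions the practical ceiling for the raw file is around 15 MB, and slightly less once the rest of the JSON body is counted.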

For longer recordings, split on silence first and process in chunks.
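
One way to find split points is to scan decoded audio for sustained quiet stretches. The sketch below is a simple RMS-threshold detector, not a production voice activity detector. It assumes the recording has already been decoded to mono PCM samples in the range -1 to 1. The function name and the default window, threshold, and minimum silence values are illustrative choices.

```javascript
// Returns sample indices at the midpoint of each silent stretch that lasts at
// least `minSilenceSec`. Leading and trailing silence produce no split.
function findSilenceSplits(samples, sampleRate, options = {}) {
  const { windowSec = 0.02, threshold = 0.01, minSilenceSec = 0.5 } = options;
  const windowSize = Math.max(1, Math.floor(sampleRate * windowSec));
  const minSilenceSamples = sampleRate * minSilenceSec;
  const windowCount = Math.ceil(samples.length / windowSize);
  const splits = [];
  let runStart = -1; // first window of the current silent run, or -1

  for (let w = 0; w < windowCount; w++) {
    const start = w * windowSize;
    const end = Math.min(start + windowSize, samples.length);
    let sumSquares = 0;
    for (let i = start; i < end; i++) sumSquares += samples[i] * samples[i];
    const silent = Math.sqrt(sumSquares / (end - start)) < threshold;

    if (silent) {
      if (runStart === -1) runStart = w;
    } else if (runStart !== -1) {
      const runSamples = (w - runStart) * windowSize;
      if (runStart > 0 && runSamples >= minSilenceSamples) {
        splits.push(Math.floor((runStart + w) / 2) * windowSize);
      }
      runStart = -1;
    }
  }
  return splits;
}

// Synthetic signal at 1 kHz: 1 s of sound, 1 s of silence, 1 s of sound.
const signal = new Float32Array(3000).fill(0.5);
signal.fill(0, 1000, 2000);
console.log(findSilenceSplits(signal, 1000)); // [ 1500 ]
```

Each resulting segment still has to be re-encoded into one of the supported formats before it is sent as input_audio.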

Cost notes

Audio input is billed as audio input tokens. Token counts appear in the usage field of the response, and longer audio consumes more tokens.
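
Token counts can be read from the response for logging or budgeting. The sketch below assumes the usage object follows the OpenAI-style shape: prompt_tokens, completion_tokens, and an optional prompt_tokens_details.audio_tokens breakdown. The exact field names in this API's usage object are not confirmed by the text above, and describeUsage is a hypothetical helper.

```javascript
// Assumption: OpenAI-style usage fields. Missing counts fall back to 0, and a
// missing audio breakdown is reported as null rather than guessed.
function describeUsage(usage) {
  return {
    promptTokens: usage?.prompt_tokens ?? 0,
    completionTokens: usage?.completion_tokens ?? 0,
    audioTokens: usage?.prompt_tokens_details?.audio_tokens ?? null,
  };
}

// Hand-written usage object for illustration, not real API output.
const exampleUsage = {
  prompt_tokens: 1200,
  completion_tokens: 80,
  prompt_tokens_details: { audio_tokens: 1150 },
};
console.log(describeUsage(exampleUsage).audioTokens); // 1150
```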

Errors

| Status | Reason |
| --- | --- |
| 400 | Unsupported format or corrupted audio. |
| 413 | File exceeded body size limit. |
| 415 | Model does not support audio input. |
| 429 | Rate limit exceeded. |
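
A client can map these status codes to a reaction and retry only the rate-limit case. The sketch below is one possible approach. It assumes thrown errors carry a numeric status property, as the OpenAI SDK's API errors do. The action labels, backoff values, and function names are illustrative.

```javascript
// Maps the status codes listed above to a suggested client reaction.
function actionForStatus(status) {
  switch (status) {
    case 400: return 'fix-audio';    // unsupported format or corrupted audio
    case 413: return 'shrink-audio'; // body too large: compress or chunk
    case 415: return 'change-model'; // model does not accept audio input
    case 429: return 'retry';        // rate limited: back off and retry
    default: return 'fail';
  }
}

// Exponential backoff: 1 s, 2 s, 4 s, ... capped at maxMs.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical wrapper: retries `request` only when the error maps to 'retry'.
async function withRetry(request, maxAttempts = 4) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await request();
    } catch (err) {
      const retryable = actionForStatus(err?.status) === 'retry';
      if (!retryable || attempt + 1 >= maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

A call would be wrapped as withRetry(() => client.chat.completions.create(params)), where params is the request object. Errors with status 400, 413, and 415 are rethrown immediately, since retrying the same request cannot succeed.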

Next steps