Transcriptions

Transcribe speech, translate it, and reason over audio inside chat.

Audio input shows up in three places:

  • POST /v1/audio/transcriptions - speech to text.
  • POST /v1/audio/translations - speech in any language to English text.
  • POST /v1/chat/completions with an input_audio content part - have a chat model reason directly over the audio.

For audio output, see Text-to-Speech.

Transcribe speech

POST /v1/audio/transcriptions

Send the request as multipart/form-data so the audio bytes can stream.

Form fields

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | yes | Transcription model, e.g. whisper-1. |
| file | file | conditional | Audio file. Common formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. Required if url is not provided. |
| url | string | conditional | URL to fetch the audio from. Required if file is not provided. |
| language | string | no | ISO-639-1 code (e.g. en). Skip for auto-detect. |
| prompt | string | no | Bias the transcription with context (names, jargon). |
| response_format | string | no | 'json' (default), 'text', or 'verbose_json'. |
| temperature | number | no | 0 to 1. 0 for most accurate, higher for variety. |
| timestamp_granularities | array | no | Array of 'word' and/or 'segment'. Only applies with verbose_json. |

Example

curl https://api.aivene.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIVENE_API_KEY" \
  -F model="whisper-1" \
  -F file="@meeting.m4a"

Response:

{ "text": "Today we shipped the new dashboard..." }

For timestamps, use verbose_json:

curl https://api.aivene.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIVENE_API_KEY" \
  -F model="whisper-1" \
  -F file="@meeting.m4a" \
  -F response_format="verbose_json"

Response:

{
  "task": "transcribe",
  "language": "english",
  "duration": 92.4,
  "text": "Today we shipped...",
  "segments": [
    { "id": 0, "start": 0.0, "end": 4.2, "text": "Today we shipped" }
  ]
}
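
If you want subtitle text from a verbose_json response, the segments can be converted client-side. A minimal sketch, assuming each segment carries the id, start, end, and text fields shown in the example above, with start and end in seconds. The names toSrtTime and segmentsToSrt are illustrative and not part of the API:

```typescript
// Fields taken from the verbose_json example; real responses may carry more.
interface Segment {
  id: number;
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build SRT cues (numbered from 1) separated by a blank line.
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}
```

Calling segmentsToSrt with the single segment from the example yields one cue running from 00:00:00,000 to 00:00:04,200.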

You can also pass a URL instead of uploading the file:

curl https://api.aivene.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIVENE_API_KEY" \
  -F model="whisper-1" \
  -F url="https://example.com/meeting.m4a"
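
The field rules in the table above can be checked before a request is sent. The sketch below is a hypothetical client-side helper, not part of the API: the interface and function names are assumptions, and it only encodes the constraints stated in the table (model required, file or url required, temperature between 0 and 1, timestamp_granularities only with verbose_json). The two-letter pattern for language is a loose stand-in for an ISO-639-1 check.

```typescript
// Keys mirror the documented form fields; the type itself is illustrative.
interface TranscriptionFields {
  model?: string;
  file?: unknown; // whatever your HTTP client uses for the upload
  url?: string;
  language?: string;
  response_format?: string;
  temperature?: number;
  timestamp_granularities?: string[];
}

// Returns a list of problems; an empty list means the fields look valid.
function validateTranscriptionFields(f: TranscriptionFields): string[] {
  const problems: string[] = [];
  if (!f.model) problems.push("model is required");
  if (f.file === undefined && !f.url) problems.push("provide file or url");
  if (f.language !== undefined && !/^[a-z]{2}$/.test(f.language)) {
    problems.push("language should be a two-letter ISO-639-1 code");
  }
  const formats = ["json", "text", "verbose_json"];
  if (f.response_format !== undefined && formats.indexOf(f.response_format) === -1) {
    problems.push("response_format must be json, text, or verbose_json");
  }
  if (f.temperature !== undefined && !(f.temperature >= 0 && f.temperature <= 1)) {
    problems.push("temperature must be between 0 and 1");
  }
  if (f.timestamp_granularities !== undefined) {
    const bad = f.timestamp_granularities.filter((g) => g !== "word" && g !== "segment");
    if (bad.length > 0) problems.push("timestamp_granularities accepts only word and segment");
    if (f.response_format !== "verbose_json") {
      problems.push("timestamp_granularities only applies with verbose_json");
    }
  }
  return problems;
}
```

Returning every problem at once, rather than throwing on the first, makes it easier to surface all issues to a caller in a single pass.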

Translate speech to English

POST /v1/audio/translations

Works like transcription, but the output is always English regardless of the input language. Useful when you only need an English rendering, for example for cross-lingual summarisation.

Form fields

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | yes | Translation model, e.g. whisper-1. |
| file | file | yes | Audio file. Formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. |
| prompt | string | no | Bias the translation with context (names, jargon). |
| response_format | string | no | 'json' (default), 'text', 'srt', 'verbose_json', or 'vtt'. |
| temperature | number | no | 0 to 1. 0 for most accurate, higher for variety. |

Example

curl https://api.aivene.com/v1/audio/translations \
  -H "Authorization: Bearer $AIVENE_API_KEY" \
  -F model="whisper-1" \
  -F file="@spanish-clip.mp3"

Use srt or vtt if you need subtitle files - the response body contains the raw subtitle text.

Audio inside chat

Models that natively accept audio (e.g. gpt-4o, some Gemini variants) take an input_audio content part on a user message. The model reasons over the audio without a separate transcription step.

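import { readFile } from 'node:fs/promises';

// Read the clip and base64-encode it for the request body.
const bytes = await readFile('voicenote.m4a');
const base64 = bytes.toString('base64');

// `client` is assumed to be an already-configured OpenAI-compatible SDK client.
const res = await client.chat.completions.create({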
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Summarise this voice note and list any action items.' },
      {
        type: 'input_audio',
        input_audio: { data: base64, format: 'm4a' }
      }
    ]
  }]
});

When to use which

Need raw text or subtitles? Use /v1/audio/transcriptions. Need reasoning, summarisation, or tool calls on top of speech? Use chat with an input_audio part - one round trip instead of two.

File limits

| Endpoint | Max file size | Max duration |
|---|---|---|
| /v1/audio/transcriptions | 25 MB | Provider-dependent |
| /v1/audio/translations | 25 MB | Provider-dependent |
| /v1/chat/completions (input_audio) | Bound by request body size; keep clips short | Under 5 min recommended |
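
Because input_audio travels as base64 inside the JSON body, the clip takes up roughly a third more space than the file on disk. A rough sizing sketch follows; the body limit itself is not stated here, so maxBodyBytes and overheadBytes are hypothetical parameters you would set for your own deployment:

```typescript
// Base64 turns every 3 input bytes into 4 output characters, padding the last group.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Hypothetical budget check: does the encoded clip, plus some allowance for the
// rest of the JSON body (prompt text, keys), fit under a given body limit?
function fitsInBody(byteCount: number, maxBodyBytes: number, overheadBytes: number = 1024): boolean {
  return base64Length(byteCount) + overheadBytes <= maxBodyBytes;
}
```

For instance, a 3,000,000-byte clip encodes to 4,000,000 base64 characters.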

For longer recordings, split on silence first and stitch the transcripts client-side.
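
Stitching can be sketched as follows, assuming each chunk was transcribed with verbose_json and you recorded where the chunk starts in the original recording. ChunkTranscript, offsetSeconds, and stitchTranscripts are illustrative names, not part of the API, and the silence-splitting step itself is left to your audio tooling:

```typescript
interface Segment {
  id: number;
  start: number; // seconds, relative to the chunk
  end: number;
  text: string;
}

// One transcribed chunk plus its start position in the full recording.
interface ChunkTranscript {
  offsetSeconds: number;
  text: string;
  segments: Segment[];
}

// Merge chunk transcripts in time order, shifting segment timestamps so they
// are relative to the full recording and renumbering segment ids from 0.
function stitchTranscripts(chunks: ChunkTranscript[]): { text: string; segments: Segment[] } {
  const ordered = chunks.slice().sort((a, b) => a.offsetSeconds - b.offsetSeconds);
  const segments: Segment[] = [];
  for (const chunk of ordered) {
    for (const seg of chunk.segments) {
      segments.push({
        id: segments.length,
        start: chunk.offsetSeconds + seg.start,
        end: chunk.offsetSeconds + seg.end,
        text: seg.text,
      });
    }
  }
  const text = ordered
    .map((c) => c.text.trim())
    .filter((t) => t.length > 0)
    .join(" ");
  return { text, segments };
}
```

Sorting by offset first means chunks can be transcribed in parallel and passed in whatever order they finish.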

Cost notes

Transcription and translation bill per second of input audio. Chat with input_audio bills as audio input tokens, reported in usage.

Errors

| Status | Reason |
|---|---|
| 400 | Unsupported format, corrupted file, or language not supported. |
| 413 | File exceeded 25 MB. |
| 415 | Chat model does not accept audio input. |
| 429 | Rate limit. |
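
One way a client might react to these statuses is sketched below. The mapping follows the table above; treating only 429 as retryable and using exponential backoff are common conventions rather than documented behaviour, and the function names and delay values are assumptions:

```typescript
interface ErrorAdvice {
  retry: boolean;
  hint: string;
}

// Map a documented status code to a suggested client reaction.
function adviseOnAudioError(status: number): ErrorAdvice {
  switch (status) {
    case 400:
      return { retry: false, hint: "Check the file format, file integrity, and language code." };
    case 413:
      return { retry: false, hint: "File is over 25 MB; split the audio into smaller files." };
    case 415:
      return { retry: false, hint: "Pick a chat model that accepts audio, or transcribe first." };
    case 429:
      return { retry: true, hint: "Rate limited; retry after a backoff delay." };
    default:
      return { retry: false, hint: "Unexpected status; inspect the response body." };
  }
}

// Exponential backoff with a cap. The 500 ms base and 8 s cap are arbitrary defaults.
function backoffMs(attempt: number, baseMs: number = 500, capMs: number = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}
```

With the defaults, attempts 0 through 3 wait 500, 1000, 2000, and 4000 ms, and later attempts are capped at 8000 ms.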

Next steps