Transcriptions

Transcribe speech, translate it, and reason over audio inside chat.

Audio input shows up in three places:

POST /v1/audio/transcriptions - speech to text.
POST /v1/audio/translations - speech in any language to English text.
POST /v1/chat/completions with an input_audio content part - have a chat model reason directly over the audio.

For audio output, see Text-to-Speech.

Transcribe speech

POST /v1/audio/transcriptions

Send the request as multipart/form-data so the audio bytes can stream.

Form fields

Field	Type	Required	Description
`model`	string	yes	Transcription model, e.g. `whisper-1`.
`file`	file	conditional	Audio file. Common formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. Required if `url` is not provided.
`url`	string	conditional	URL to fetch the audio from. Required if `file` is not provided.
`language`	string	no	ISO-639-1 code (e.g. `en`). Skip for auto-detect.
`prompt`	string	no	Bias the transcription with context (names, jargon).
`response_format`	string	no	`'json'` (default), `'text'`, or `'verbose_json'`.
`temperature`	number	no	`0` to `1`. Use `0` for the most deterministic output, higher values for more variation.
`timestamp_granularities`	array	no	Array of `'word'` and/or `'segment'`. Only applies with `verbose_json`.
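
The constraints in the table can be checked client-side before any audio is uploaded. The following is a minimal sketch, not part of the API: the function name, the `TranscriptionParams` type, and the `filePath` property are hypothetical, and the rules are only the ones stated in the table. The table does not say what happens when both `file` and `url` are sent, so the sketch leaves that case alone.

```typescript
// Hypothetical pre-flight check for the form fields listed above.
// Names are illustrative; only the rules come from the field table.
type ResponseFormat = 'json' | 'text' | 'verbose_json';

interface TranscriptionParams {
  model: string;
  filePath?: string; // local audio to send as the `file` field
  url?: string; // remote audio to send as the `url` field
  language?: string;
  prompt?: string;
  response_format?: ResponseFormat;
  temperature?: number;
  timestamp_granularities?: Array<'word' | 'segment'>;
}

function validateTranscriptionParams(p: TranscriptionParams): string[] {
  const problems: string[] = [];
  if (!p.model) {
    problems.push('model is required');
  }
  // `file` and `url` are each required only when the other is missing.
  if (!p.filePath && !p.url) {
    problems.push('provide either file or url');
  }
  // ISO-639-1 codes are two lowercase letters, e.g. "en".
  if (p.language !== undefined && !/^[a-z]{2}$/.test(p.language)) {
    problems.push('language should be a two-letter ISO-639-1 code');
  }
  if (p.temperature !== undefined && (p.temperature < 0 || p.temperature > 1)) {
    problems.push('temperature must be between 0 and 1');
  }
  const wantsTimestamps = (p.timestamp_granularities ?? []).length > 0;
  if (wantsTimestamps && p.response_format !== 'verbose_json') {
    problems.push('timestamp_granularities only applies with verbose_json');
  }
  return problems;
}
```

An empty array means the request matches the table; each string otherwise names one field to fix before sending.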

Example

curl https://api.aivene.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIVENE_API_KEY" \
  -F model="whisper-1" \
  -F file="@meeting.m4a"

Response:

{ "text": "Today we shipped the new dashboard..." }

For timestamps, use verbose_json:

curl https://api.aivene.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIVENE_API_KEY" \
  -F model="whisper-1" \
  -F file="@meeting.m4a" \
  -F response_format="verbose_json"

Response:

{
  "task": "transcribe",
  "language": "english",
  "duration": 92.4,
  "text": "Today we shipped...",
  "segments": [
    { "id": 0, "start": 0.0, "end": 4.2, "text": "Today we shipped" }
  ]
}
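
The `segments` array carries enough timing information to build subtitles client-side. The sketch below converts segments into SubRip (SRT) text, assuming every segment has the `start`, `end`, and `text` fields shown in the sample response; `Segment`, `toSrtTime`, and `segmentsToSrt` are illustrative names, not part of the API.

```typescript
// Assumed segment shape, taken from the sample verbose_json response above.
interface Segment {
  id: number;
  start: number; // seconds from the start of the audio
  end: number; // seconds from the start of the audio
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width: number): string => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build SRT cues: a 1-based index, a time range, the text, then a blank line.
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```

Passing the `segments` array of a parsed response to `segmentsToSrt` returns a string that can be written to a `.srt` file.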

You can also pass a URL instead of uploading the file:

curl https://api.aivene.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIVENE_API_KEY" \
  -F model="whisper-1" \
  -F url="https://example.com/meeting.m4a"

Translate speech to English

POST /v1/audio/translations

Works like transcription, but the output is always English regardless of the input language. Useful when you need English text from non-English audio, for example for cross-lingual summarisation.

Form fields

Field	Type	Required	Description
`model`	string	yes	Translation model, e.g. `whisper-1`.
`file`	file	yes	Audio file. Formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.
`prompt`	string	no	Bias the translation with context (names, jargon).
`response_format`	string	no	`'json'` (default), `'text'`, `'srt'`, `'verbose_json'`, or `'vtt'`.
`temperature`	number	no	`0` to `1`. Use `0` for the most deterministic output, higher values for more variation.

Example

curl https://api.aivene.com/v1/audio/translations \
  -H "Authorization: Bearer $AIVENE_API_KEY" \
  -F model="whisper-1" \
  -F file="@spanish-clip.mp3"

Set response_format to srt or vtt if you need subtitle files. In that case the response body is the raw subtitle text rather than JSON.
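
Because the body is JSON for some formats and raw text for others, client code has to branch on the requested format. The following is a minimal sketch: `extractTranslation` is a hypothetical helper, and it assumes the `json` and `verbose_json` bodies carry a top-level `text` field like the transcription responses shown earlier.

```typescript
type TranslationFormat = 'json' | 'text' | 'srt' | 'verbose_json' | 'vtt';

// Return the translated text (or the raw subtitle file) from a response body.
// Assumption: JSON bodies expose the translation under a top-level "text" key.
function extractTranslation(format: TranslationFormat, body: string): string {
  if (format === 'json' || format === 'verbose_json') {
    const parsed = JSON.parse(body) as { text?: unknown };
    if (typeof parsed.text !== 'string') {
      throw new Error('response JSON has no string "text" field');
    }
    return parsed.text;
  }
  // text, srt, and vtt: the body is already the final content.
  return body;
}
```

For `srt` and `vtt` the returned string can be saved directly with the matching file extension.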

Audio inside chat

Models that natively accept audio (e.g. gpt-4o and some Gemini variants) take an input_audio content part on a user message. The model reasons over the audio without a separate transcription step.

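import { readFile } from 'node:fs/promises';
import OpenAI from 'openai';

// Assumption: `client` is an OpenAI-compatible SDK client configured for this API.
// Adjust the setup to match the SDK you actually use.
const client = new OpenAI({
  apiKey: process.env.AIVENE_API_KEY,
  baseURL: 'https://api.aivene.com/v1',
});

// Read the clip and base64-encode it so it can travel in the JSON body.
const bytes = await readFile('voicenote.m4a');
const base64 = bytes.toString('base64');

const res = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      // The text part carries the instruction; the audio part carries the clip.
      { type: 'text', text: 'Summarise this voice note and list any action items.' },
      {
        type: 'input_audio',
        input_audio: { data: base64, format: 'm4a' }
      }
    ]
  }]
});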
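
Since the audio travels as base64 inside the JSON body, the request is larger than the file on disk: base64 emits 4 characters for every 3 input bytes. The sketch below estimates the encoded size before sending. `fitsInBody`, `maxBodyBytes`, and the default `overheadBytes` allowance are hypothetical; the actual body limit is not stated on this page and has to come from the caller.

```typescript
// Length of the base64 encoding (with padding) of `byteCount` raw bytes.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Hypothetical pre-flight check. `maxBodyBytes` is a caller-supplied budget,
// and `overheadBytes` is a rough allowance for the rest of the JSON payload.
function fitsInBody(audioBytes: number, maxBodyBytes: number, overheadBytes = 2048): boolean {
  return base64Length(audioBytes) + overheadBytes <= maxBodyBytes;
}
```

For example, a 3,000,000-byte clip encodes to 4,000,000 characters, so it needs a body budget somewhat above 4 MB once the prompt and JSON framing are included.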

When to use which

Need raw text or subtitles? Use /v1/audio/transcriptions. Need reasoning, summarisation, or tool calls on top of speech? Use chat with an input_audio part - one round trip instead of two.

File limits

Endpoint	Max file size	Max duration
`/v1/audio/transcriptions`	25 MB	provider-dependent
`/v1/audio/translations`	25 MB	provider-dependent
`/v1/chat/completions` (input_audio)	bound by request body size; keep clips short	under 5 min recommended

For longer recordings, split on silence first and stitch the transcripts client-side.
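
Stitching mostly means shifting each chunk's timestamps by the chunk's position in the original recording. The following is a minimal sketch, assuming each chunk was transcribed with verbose_json and that the caller recorded where each chunk starts; `ChunkResult`, `offsetSeconds`, and `stitchChunks` are hypothetical names, and the splitting itself is left to an audio tool of your choice.

```typescript
interface TimedSegment {
  id: number;
  start: number; // seconds
  end: number; // seconds
  text: string;
}

// One transcribed chunk plus its start position in the original recording.
interface ChunkResult {
  offsetSeconds: number;
  text: string;
  segments: TimedSegment[];
}

function stitchChunks(chunks: ChunkResult[]): { text: string; segments: TimedSegment[] } {
  // Order chunks by their position in the recording, without mutating the input.
  const ordered = [...chunks].sort((a, b) => a.offsetSeconds - b.offsetSeconds);
  const segments: TimedSegment[] = [];
  for (const chunk of ordered) {
    for (const seg of chunk.segments) {
      segments.push({
        id: segments.length, // renumber across the whole recording
        start: seg.start + chunk.offsetSeconds,
        end: seg.end + chunk.offsetSeconds,
        text: seg.text,
      });
    }
  }
  // Join chunk texts with single spaces, skipping empty chunks.
  const text = ordered
    .map((c) => c.text.trim())
    .filter((t) => t.length > 0)
    .join(' ');
  return { text, segments };
}
```

Splitting on silence keeps words from being cut at chunk boundaries, which is why plain concatenation is usually enough here.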

Cost notes

Transcription and translation bill per second of input audio. Chat with input_audio bills as audio input tokens, reported in usage.

Errors

Status	Reason
`400`	Unsupported format, corrupted file, or language not supported.
`413`	File exceeds 25 MB.
`415`	Chat model does not accept audio input.
`429`	Rate limit exceeded.
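
Each status in the table suggests a different client reaction. The following is a sketch of one possible mapping, with exponential backoff for rate limits. The action names, `actionForStatus`, `backoffDelayMs`, and the 500 ms base and 30 s cap are illustrative choices, not documented behaviour.

```typescript
type ErrorAction = 'fix_request' | 'split_file' | 'change_model' | 'retry' | 'fail';

// Map the documented status codes to a suggested client reaction.
function actionForStatus(status: number): ErrorAction {
  switch (status) {
    case 400:
      return 'fix_request'; // bad format, corrupted file, or unsupported language
    case 413:
      return 'split_file'; // file is over the 25 MB limit
    case 415:
      return 'change_model'; // chosen chat model does not accept audio
    case 429:
      return 'retry'; // rate limited: wait, then try again
    default:
      return 'fail'; // not covered by the table above
  }
}

// Exponential backoff: baseMs, 2x, 4x, ... capped at capMs. `attempt` starts at 0.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

A caller would wait `backoffDelayMs(attempt)` milliseconds before retrying a `429` and give up after a fixed number of attempts.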

Next steps

Text-to-Speech - generate spoken audio from text.
Chat Completions reference - the full message schema.