Transcriptions
Transcribe speech, translate it, and reason over audio inside chat.
Audio input shows up in three places:
- POST /v1/audio/transcriptions - speech to text.
- POST /v1/audio/translations - speech in any language to English text.
- POST /v1/chat/completions with an input_audio content part - have a chat model reason directly over the audio.
For audio output, see Text-to-Speech.
Transcribe speech
POST /v1/audio/transcriptions
The request body is multipart/form-data so the audio bytes can stream.
Form fields
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | yes | Transcription model, e.g. whisper-1. |
| file | file | conditional | Audio file. Common formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. Required if url is not provided. |
| url | string | conditional | URL to fetch the audio from. Required if file is not provided. |
| language | string | no | ISO-639-1 code (e.g. en). Skip for auto-detect. |
| prompt | string | no | Bias the transcription with context (names, jargon). |
| response_format | string | no | 'json' (default), 'text', or 'verbose_json'. |
| temperature | number | no | 0 to 1. 0 for most accurate, higher for variety. |
| timestamp_granularities | array | no | Array of 'word' and/or 'segment'. Only applies with verbose_json. |
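The field rules above can be checked client-side before anything is sent. The following is a minimal sketch, not an official client: `TranscriptionOptions` and `buildTranscriptionForm` are hypothetical names, and encoding the array field as `timestamp_granularities[]` is an assumption about how this API expects multipart arrays.

```typescript
// Hypothetical option shape mirroring the form fields table.
interface TranscriptionOptions {
  model: string;
  file?: Blob;
  fileName?: string;
  url?: string;
  language?: string;
  prompt?: string;
  responseFormat?: 'json' | 'text' | 'verbose_json';
  temperature?: number;
  timestampGranularities?: Array<'word' | 'segment'>;
}

// Validate the documented constraints and assemble the multipart body.
function buildTranscriptionForm(opts: TranscriptionOptions): FormData {
  // file and url are each "conditional": at least one must be present.
  if (!opts.file && !opts.url) {
    throw new Error('Provide either file or url.');
  }
  if (opts.temperature !== undefined && (opts.temperature < 0 || opts.temperature > 1)) {
    throw new Error('temperature must be between 0 and 1.');
  }
  // timestamp_granularities only applies with verbose_json.
  const granularities = opts.timestampGranularities ?? [];
  if (granularities.length > 0 && opts.responseFormat !== 'verbose_json') {
    throw new Error('timestamp_granularities requires response_format=verbose_json.');
  }

  const form = new FormData();
  form.append('model', opts.model);
  if (opts.file) form.append('file', opts.file, opts.fileName ?? 'audio');
  if (opts.url) form.append('url', opts.url);
  if (opts.language) form.append('language', opts.language);
  if (opts.prompt) form.append('prompt', opts.prompt);
  if (opts.responseFormat) form.append('response_format', opts.responseFormat);
  if (opts.temperature !== undefined) form.append('temperature', String(opts.temperature));
  // Assumption: array values are sent as repeated "name[]" parts.
  for (const g of granularities) form.append('timestamp_granularities[]', g);
  return form;
}
```

The resulting form can be passed as the `body` of a `fetch` POST to the transcriptions endpoint together with the Authorization header; `fetch` sets the multipart boundary itself, so no Content-Type header needs to be set by hand.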
Example
curl https://api.aivene.com/v1/audio/transcriptions \
-H "Authorization: Bearer $AIVENE_API_KEY" \
-F model="whisper-1" \
-F file="@meeting.m4a"
Response:
{ "text": "Today we shipped the new dashboard..." }
For timestamps, use verbose_json:
curl https://api.aivene.com/v1/audio/transcriptions \
-H "Authorization: Bearer $AIVENE_API_KEY" \
-F model="whisper-1" \
-F file="@meeting.m4a" \
-F response_format="verbose_json"
{
"task": "transcribe",
"language": "english",
"duration": 92.4,
"text": "Today we shipped...",
"segments": [
{ "id": 0, "start": 0.0, "end": 4.2, "text": "Today we shipped" }
]
}
You can also pass a URL instead of uploading the file:
curl https://api.aivene.com/v1/audio/transcriptions \
-H "Authorization: Bearer $AIVENE_API_KEY" \
-F model="whisper-1" \
-F url="https://example.com/meeting.m4a"
Translate speech to English
POST /v1/audio/translations
Similar to transcriptions, but the output is always English regardless of the input language. Useful when you only need the English text, for example for cross-lingual summarisation.
Form fields
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | yes | Translation model, e.g. whisper-1. |
| file | file | yes | Audio file. Formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. |
| prompt | string | no | Bias the translation with context (names, jargon). |
| response_format | string | no | 'json' (default), 'text', 'srt', 'verbose_json', or 'vtt'. |
| temperature | number | no | 0 to 1. 0 for most accurate, higher for variety. |
curl https://api.aivene.com/v1/audio/translations \
-H "Authorization: Bearer $AIVENE_API_KEY" \
-F model="whisper-1" \
-F file="@spanish-clip.mp3"
Use srt or vtt if you need subtitle files - the response body contains the raw subtitle text.
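The transcription field table lists only json, text, and verbose_json, so subtitles for untranslated audio may need to be built from verbose_json segments on the client. A minimal sketch, assuming segments have the `id`, `start`, `end`, `text` shape shown in the verbose_json example; `toSrtTimestamp` and `segmentsToSrt` are hypothetical helper names:

```typescript
// Segment shape taken from the verbose_json example response.
interface Segment {
  id: number;
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Emit one numbered SRT cue per segment, separated by blank lines.
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```

Cue numbering starts at 1 regardless of the segment `id`, since SRT players expect sequential indices.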
Audio inside chat
Models that natively accept audio (e.g. gpt-4o, some gemini variants)
take an input_audio content part on a user message. The model reasons
over the audio without a separate transcription step.
import { readFile } from 'node:fs/promises';
// `client` is assumed to be an SDK client configured elsewhere with your API key.
// Read the clip and base64-encode it for the input_audio part.
const bytes = await readFile('voicenote.m4a');
const base64 = bytes.toString('base64');
const res = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Summarise this voice note and list any action items.' },
      {
        type: 'input_audio',
        input_audio: { data: base64, format: 'm4a' }
      }
    ]
  }]
});
When to use which
Need raw text or subtitles? Use /v1/audio/transcriptions. Need
reasoning, summarisation, or tool calls on top of speech? Use chat with
an input_audio part - one round trip instead of two.
File limits
| Endpoint | Max file size | Max duration |
|---|---|---|
| /v1/audio/transcriptions | 25 MB | provider-dependent |
| /v1/audio/translations | 25 MB | provider-dependent |
| /v1/chat/completions (input_audio) | bound by body size; keep clips short | recommend < 5 min |
For longer recordings, split on silence first and stitch the transcripts client-side.
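Stitching is mostly a matter of shifting each chunk's timestamps by where that chunk started in the original recording. A minimal sketch, assuming each chunk was transcribed with verbose_json and that the caller recorded the chunk's start offset when splitting; `ChunkTranscript` and `stitchTranscripts` are hypothetical names, and the silence detection itself is out of scope here:

```typescript
// Segment shape as in the verbose_json example response.
interface ChunkSegment {
  id: number;
  start: number; // seconds, relative to the chunk
  end: number;   // seconds, relative to the chunk
  text: string;
}

// One transcribed chunk plus the offset at which it began in the full recording.
interface ChunkTranscript {
  offsetSeconds: number;
  text: string;
  segments: ChunkSegment[];
}

// Merge chunk transcripts into one transcript with absolute timestamps.
function stitchTranscripts(chunks: ChunkTranscript[]): { text: string; segments: ChunkSegment[] } {
  // Order by offset so chunks transcribed in parallel still come out in sequence.
  const ordered = [...chunks].sort((a, b) => a.offsetSeconds - b.offsetSeconds);
  const segments: ChunkSegment[] = [];
  for (const chunk of ordered) {
    for (const seg of chunk.segments) {
      segments.push({
        id: segments.length, // renumber across the whole recording
        start: chunk.offsetSeconds + seg.start,
        end: chunk.offsetSeconds + seg.end,
        text: seg.text,
      });
    }
  }
  const text = ordered
    .map((c) => c.text.trim())
    .filter((t) => t.length > 0)
    .join(' ');
  return { text, segments };
}
```

Joining chunk texts with a single space is a simplification; splitting on silence makes it unlikely that a word is cut in half, but it does not guarantee sentence boundaries.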
Cost notes
Transcription and translation bill per second of input audio. Chat with
input_audio bills as audio input tokens, reported in usage.
Errors
| Status | Reason |
|---|---|
| 400 | Unsupported format, corrupted file, or language not supported. |
| 413 | File exceeded 25 MB. |
| 415 | Chat model does not accept audio input. |
| 429 | Rate limit. |
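One way a client might act on these statuses is sketched below. The action names, `classifyAudioError`, and `backoffDelayMs` are hypothetical, and the backoff base and cap are illustrative values, not limits documented for this API.

```typescript
// Hypothetical client-side actions, one per documented status.
type AudioErrorAction =
  | 'fix_request'   // 400: unsupported format, corrupted file, or unsupported language
  | 'split_file'    // 413: file exceeded 25 MB
  | 'switch_model'  // 415: chat model does not accept audio input
  | 'retry_later'   // 429: rate limit
  | 'unknown';

// Map an HTTP status from the errors table to a suggested action.
function classifyAudioError(status: number): AudioErrorAction {
  switch (status) {
    case 400: return 'fix_request';
    case 413: return 'split_file';
    case 415: return 'switch_model';
    case 429: return 'retry_later';
    default: return 'unknown';
  }
}

// Exponential backoff delay for 429 retries; attempt is zero-based.
// baseMs and capMs are assumed defaults, not values from the API.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

Only `retry_later` is worth retrying unchanged; the other three need a different request (a re-encoded or split file, or a model that accepts audio) before a retry can succeed.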
Next steps
- Text-to-Speech - generate spoken audio from text.
- Chat Completions reference - the full message schema.