Speech
Generate spoken audio from text using OpenAI-compatible TTS models.
Synthesise speech from text. The endpoint returns the raw audio bytes as the response body - no JSON envelope.
POST /v1/audio/speech
Request body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | yes | TTS model, e.g. tts-1 (fast) or tts-1-hd (higher quality). |
| input | string | yes | The text to speak. |
| voice | string | yes | Voice id. Common: alloy, echo, fable, onyx, nova, shimmer. |
| instructions | string | no | Additional instructions for the TTS model (e.g. tone, pacing). |
| response_format | string | no | 'mp3' (default), 'opus', 'aac', 'flac', 'wav', 'pcm'. |
| speed | number | no | 0.25 to 4.0. Defaults to 1.0. |
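The table above can be mirrored as a type plus a small pre-flight check, so that obviously malformed requests fail locally instead of costing a round trip. This is a minimal sketch: `SpeechRequest` and `validateSpeechRequest` are hypothetical names, and the checks cover only what the table states (required fields, the listed formats, the speed range).

```typescript
// Field names, formats, and the speed range are taken from the request-body table.
type ResponseFormat = 'mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm';

interface SpeechRequest {
  model: string;
  input: string;
  voice: string;
  instructions?: string;
  response_format?: ResponseFormat;
  speed?: number;
}

const FORMATS: readonly string[] = ['mp3', 'opus', 'aac', 'flac', 'wav', 'pcm'];

// Hypothetical helper: returns a list of problems, empty when the body looks valid.
function validateSpeechRequest(body: Partial<SpeechRequest>): string[] {
  const problems: string[] = [];
  for (const field of ['model', 'input', 'voice'] as const) {
    const value = body[field];
    if (typeof value !== 'string' || value.length === 0) {
      problems.push(`${field} is required`);
    }
  }
  if (body.response_format !== undefined && FORMATS.indexOf(body.response_format) === -1) {
    problems.push(`unsupported response_format: ${body.response_format}`);
  }
  // Written as a negated range check so that NaN is rejected as well.
  if (body.speed !== undefined && !(body.speed >= 0.25 && body.speed <= 4.0)) {
    problems.push('speed must be between 0.25 and 4.0');
  }
  return problems;
}
```

The server remains the source of truth; for example, this check cannot know which voice ids a given model accepts.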
Response is audio, not JSON
Unlike most endpoints, /v1/audio/speech returns the audio file directly
in the response body. When you store or re-serve the audio, set its
Content-Type yourself based on the response_format you requested.
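One way to pick a Content-Type from response_format is a small lookup table. The sketch below is an assumption rather than documented behaviour: the MIME types are the commonly used ones for each format, and `contentTypeFor` is a hypothetical helper name. Raw PCM has no self-describing media type, so it falls back to a generic binary type here.

```typescript
// Assumed mapping: commonly used MIME types, not values confirmed by the API.
const CONTENT_TYPES: Record<string, string> = {
  mp3: 'audio/mpeg',
  opus: 'audio/ogg',
  aac: 'audio/aac',
  flac: 'audio/flac',
  wav: 'audio/wav',
  pcm: 'application/octet-stream', // raw samples carry no header or format info
};

// Hypothetical helper: defaults to mp3, matching the endpoint's default format.
function contentTypeFor(responseFormat: string = 'mp3'): string {
  if (!Object.prototype.hasOwnProperty.call(CONTENT_TYPES, responseFormat)) {
    throw new Error(`unknown response_format: ${responseFormat}`);
  }
  return CONTENT_TYPES[responseFormat];
}
```

If the response already carries an audio/* Content-Type, forwarding that header unchanged is simpler than recomputing it.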
Example
curl https://api.aivene.com/v1/audio/speech \
-H "Authorization: Bearer $AIVENE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"voice": "alloy",
"input": "The quick brown fox jumps over the lazy dog."
}' \
--output speech.mp3
With the OpenAI SDK:
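import { writeFile } from 'node:fs/promises';
import OpenAI from 'openai';

// Point the SDK at the same base URL and API key used in the curl example.
const client = new OpenAI({
  apiKey: process.env.AIVENE_API_KEY,
  baseURL: 'https://api.aivene.com/v1'
});
const res = await client.audio.speech.create({
  model: 'tts-1',
  voice: 'alloy',
  input: 'Hello from Aivene.'
});
// The response body is raw audio; read the bytes and write them to disk.
const buffer = Buffer.from(await res.arrayBuffer());
await writeFile('speech.mp3', buffer);
Picking a model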
| Model | Latency | Quality | Use for |
|---|---|---|---|
| tts-1 | low | good | Real-time UX, IVR, voice agents |
| tts-1-hd | medium | best | Pre-rendered narration, podcasts |
If you need to play audio before generation finishes, use tts-1 and
stream the bytes to your audio sink as they arrive.
Picking a voice
The voice ids above are deliberately generic. Their characteristics:
| Voice | Vibe |
|---|---|
| alloy | Neutral, balanced |
| echo | Calm, masculine |
| fable | Warm, expressive |
| onyx | Deep, authoritative |
| nova | Bright, feminine |
| shimmer | Soft, feminine |
Run a quick sample with each before committing - voice perception is subjective.
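To compare voices side by side, it can help to generate the same sentence once per voice. The sketch below only builds the request bodies and output filenames. `buildVoiceSamples` and the `sample-<voice>.mp3` naming are hypothetical, and the voice list is the common set named on this page.

```typescript
// The common voice ids listed in the request-body table.
const VOICES = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer'] as const;

interface SampleJob {
  filename: string;
  body: { model: string; voice: string; input: string };
}

// Hypothetical helper: one request body per voice, all speaking the same text.
function buildVoiceSamples(input: string, model: string = 'tts-1'): SampleJob[] {
  return VOICES.map((voice) => ({
    filename: `sample-${voice}.mp3`,
    body: { model, voice, input },
  }));
}
```

Each `body` can then be sent to POST /v1/audio/speech and the returned bytes saved under `filename`.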
Format and quality
| response_format | Container | Good for |
|---|---|---|
| mp3 | MPEG-1 Layer 3 | Web playback, default |
| opus | Ogg Opus | Low bandwidth, WebRTC |
| aac | AAC | iOS / Apple ecosystem |
| flac | FLAC | Lossless archival |
| wav | WAV | Local processing, editing |
| pcm | Raw 16-bit PCM | Pipe into audio processing |
speed affects the wall-clock duration but not the pitch.
Streaming playback
The endpoint streams audio bytes as they are generated. With fetch:
const res = await fetch('https://api.aivene.com/v1/audio/speech', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.AIVENE_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'tts-1',
voice: 'nova',
input: 'Streaming this sentence as it arrives.'
})
});
const reader = res.body?.getReader();
while (reader) {
const { done, value } = await reader.read();
if (done) break;
// Pipe `value` (a chunk of audio bytes) into your audio sink.
}
In the browser, feed the chunks into a MediaSource to play before the
whole clip lands.
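Chunks from the read loop arrive in arbitrary sizes. If you also want the complete clip once streaming ends (for example, to cache it), one option is to push each `value` into an array and join the pieces afterwards. `concatChunks` is a hypothetical helper, shown as a sketch.

```typescript
// Join streamed chunks into one contiguous byte array, preserving arrival order.
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  let total = 0;
  for (const chunk of chunks) total += chunk.length;

  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset); // copy this chunk at the current write position
    offset += chunk.length;
  }
  return out;
}
```

Collecting chunks this way holds the whole clip in memory, so it suits short clips better than long narration.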
Cost notes
TTS bills per character of input text, regardless of model or voice. The Console usage page shows the per-request character count.
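Because billing is per input character, a rough estimate can be computed before sending a request. The sketch below rests on two assumptions: characters are counted as JavaScript string length (UTF-16 code units), which may differ from how the Console counts them, and `ratePerMillionChars` is a hypothetical price parameter that you supply from your own plan.

```typescript
// Assumption: one billable character per UTF-16 code unit; the API may count differently.
function countBillableCharacters(input: string): number {
  return input.length;
}

// `ratePerMillionChars` is a hypothetical price; this page does not state one.
function estimateCost(input: string, ratePerMillionChars: number): number {
  return (countBillableCharacters(input) * ratePerMillionChars) / 1e6;
}
```

Compare the estimate with the per-request character count on the Console usage page before relying on it.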
Content policy
Input text passes through the provider safety filter. Disallowed
content (e.g. impersonation of real people) returns 400 with a reason
in error.message.
Errors
| Status | Reason |
|---|---|
| 400 | Unknown voice id, unsupported format, or rejected text. |
| 401 | Missing or invalid API key. |
| 429 | Rate limit exceeded. |
| 500 | Internal gateway error. |
| 502 / 503 | Upstream provider failure. |
Error bodies are always JSON, even though success bodies are audio. Check
Content-Type on the response: audio/* for success, application/json
for errors.
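The Content-Type check described above can be wrapped in a small classifier so that calling code decides between reading bytes and parsing JSON in one place. This is a sketch: `classifyResponse` and `isRetryable` are hypothetical helpers, and treating 429, 502, and 503 as retryable is an assumption about transient failures, not documented policy.

```typescript
type Outcome = 'audio' | 'error' | 'unexpected';

// Decide how to read the body from the status code and Content-Type header.
function classifyResponse(status: number, contentType: string | null): Outcome {
  const type = (contentType || '').toLowerCase();
  if (status >= 200 && status < 300 && type.indexOf('audio/') === 0) return 'audio';
  if (type.indexOf('application/json') === 0) return 'error';
  return 'unexpected';
}

// Assumption: rate limits and upstream failures are transient and worth retrying with backoff.
function isRetryable(status: number): boolean {
  return status === 429 || status === 502 || status === 503;
}
```

With fetch, this would be called as `classifyResponse(res.status, res.headers.get('content-type'))`. On `'error'`, parse the body with `res.json()` and read `error.message`.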
Next steps
- Audio - go the other direction with transcription.
- Chat Completions reference - generate the text first, then speak it.