Speech

Generate spoken audio from text using OpenAI-compatible TTS models.

Synthesise speech from text. On success, the endpoint returns the raw audio bytes as the response body, with no JSON envelope.

POST /v1/audio/speech

Request body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | yes | TTS model, e.g. tts-1 (fast) or tts-1-hd (higher quality). |
| input | string | yes | The text to speak. |
| voice | string | yes | Voice id. Common: alloy, echo, fable, onyx, nova, shimmer. |
| instructions | string | no | Additional instructions for the TTS model (e.g. tone, pacing). |
| response_format | string | no | 'mp3' (default), 'opus', 'aac', 'flac', 'wav', 'pcm'. |
| speed | number | no | 0.25 to 4.0. Defaults to 1.0. |
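
A client-side check can catch malformed bodies before they cost a round trip. The sketch below encodes only the constraints listed in the request-body fields above; `validateSpeechRequest` is a hypothetical helper, not part of any SDK, and treating an empty string as invalid is an assumption about how the API behaves.

```javascript
// Allowed values copied from the request-body fields above.
const RESPONSE_FORMATS = ['mp3', 'opus', 'aac', 'flac', 'wav', 'pcm'];

// Hypothetical helper: returns a list of problems, empty when the body looks valid.
function validateSpeechRequest(body) {
  const problems = [];
  // model, input and voice are the three required string fields.
  for (const field of ['model', 'input', 'voice']) {
    if (typeof body[field] !== 'string' || body[field].length === 0) {
      problems.push(`${field} is required and must be a non-empty string`);
    }
  }
  if (body.instructions !== undefined && typeof body.instructions !== 'string') {
    problems.push('instructions must be a string');
  }
  if (body.response_format !== undefined && !RESPONSE_FORMATS.includes(body.response_format)) {
    problems.push(`response_format must be one of: ${RESPONSE_FORMATS.join(', ')}`);
  }
  if (body.speed !== undefined) {
    const inRange = typeof body.speed === 'number' && body.speed >= 0.25 && body.speed <= 4.0;
    if (!inRange) problems.push('speed must be a number between 0.25 and 4.0');
  }
  return problems;
}
```

The server remains the source of truth: voice ids and model names are not checked here because the lists above are described as examples, not as exhaustive.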

Response is audio, not JSON

Unlike most endpoints, /v1/audio/speech returns the audio file directly in the response body on success. If you store the audio or serve it to your own clients, set the Content-Type yourself based on the response_format you requested.
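
One way to pick that Content-Type is a small lookup keyed by response_format. The MIME types below are conventional choices and an assumption on our part; this page does not list the exact values the API sends, and `contentTypeFor` is a hypothetical helper name.

```javascript
// Assumed MIME types for each response_format; not taken from the API itself.
const CONTENT_TYPES = new Map([
  ['mp3', 'audio/mpeg'],
  ['opus', 'audio/ogg'], // Ogg container carrying Opus
  ['aac', 'audio/aac'],
  ['flac', 'audio/flac'],
  ['wav', 'audio/wav'],
  ['pcm', 'application/octet-stream'], // raw samples carry no self-describing type
]);

// Defaults to mp3, matching the documented default for response_format.
function contentTypeFor(responseFormat = 'mp3') {
  const type = CONTENT_TYPES.get(responseFormat);
  if (!type) throw new Error(`Unsupported response_format: ${responseFormat}`);
  return type;
}
```

If the response already carries a Content-Type header, prefer forwarding that value over the lookup.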

Example

curl https://api.aivene.com/v1/audio/speech \
  -H "Authorization: Bearer $AIVENE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "alloy",
    "input": "The quick brown fox jumps over the lazy dog."
  }' \
  --output speech.mp3

With the OpenAI SDK:

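import OpenAI from 'openai';
import { writeFile } from 'node:fs/promises';

// Point the OpenAI SDK at the same base URL the curl example uses.
const client = new OpenAI({
  apiKey: process.env.AIVENE_API_KEY,
  baseURL: 'https://api.aivene.com/v1'
});

// The result is a fetch-style Response whose body is the audio itself.
const res = await client.audio.speech.create({
  model: 'tts-1',
  voice: 'alloy',
  input: 'Hello from Aivene.'
});

// Collect the bytes and write them to disk.
const buffer = Buffer.from(await res.arrayBuffer());
await writeFile('speech.mp3', buffer);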

Picking a model

| Model | Latency | Quality | Use for |
| --- | --- | --- | --- |
| tts-1 | low | good | Real-time UX, IVR, voice agents |
| tts-1-hd | medium | best | Pre-rendered narration, podcasts |

If you need to play audio before generation finishes, use tts-1 and stream the bytes to your audio sink as they arrive.

Picking a voice

The voice ids above are deliberately generic. Their characteristics:

| Voice | Vibe |
| --- | --- |
| alloy | Neutral, balanced |
| echo | Calm, masculine |
| fable | Warm, expressive |
| onyx | Deep, authoritative |
| nova | Bright, feminine |
| shimmer | Soft, feminine |

Run a quick sample with each voice before committing, since voice perception is subjective.
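
To compare voices on equal terms, render the same sentence with each one. The sketch below only builds the request bodies; `buildVoiceSamples` and the `sample-<voice>.mp3` filenames are hypothetical, chosen for illustration.

```javascript
// Voice ids from the list above.
const VOICES = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer'];

// Builds one request body per voice, all speaking the same text.
function buildVoiceSamples(input, model = 'tts-1') {
  return VOICES.map((voice) => ({
    filename: `sample-${voice}.mp3`,
    body: { model, voice, input },
  }));
}
```

POST each `body` to /v1/audio/speech and save the response under its `filename`, then listen to the six files side by side.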

Format and quality

| response_format | Container | Good for |
| --- | --- | --- |
| mp3 | MPEG-1 Layer 3 | Web playback, default |
| opus | Ogg Opus | Low bandwidth, WebRTC |
| aac | AAC | iOS / Apple ecosystem |
| flac | FLAC | Lossless archival |
| wav | WAV | Local processing, editing |
| pcm | Raw 16-bit PCM | Pipe into audio processing |

speed affects the wall-clock duration but not the pitch.
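
Raw pcm output has no header, so most players will not open it directly. A minimal Node sketch for wrapping it in a WAV header follows. The sample rate and channel count are assumptions: this page only says "16-bit", and the 24000 Hz mono defaults mirror what OpenAI documents for its own pcm output, which may not match every upstream provider. `wrapPcmAsWav` is a hypothetical helper.

```javascript
// Prepends a 44-byte RIFF/WAVE header to raw 16-bit little-endian PCM samples.
// sampleRate and channels are ASSUMED defaults; confirm them for your model.
function wrapPcmAsWav(pcm, sampleRate = 24000, channels = 1) {
  const bytesPerSample = 2; // 16-bit samples
  const blockAlign = channels * bytesPerSample;
  const header = Buffer.alloc(44);
  header.write('RIFF', 0, 'ascii');
  header.writeUInt32LE(36 + pcm.length, 4); // file size minus the first 8 bytes
  header.write('WAVE', 8, 'ascii');
  header.write('fmt ', 12, 'ascii');
  header.writeUInt32LE(16, 16); // size of the fmt chunk
  header.writeUInt16LE(1, 20); // format tag 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * blockAlign, 28); // bytes per second
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(16, 34); // bits per sample
  header.write('data', 36, 'ascii');
  header.writeUInt32LE(pcm.length, 40); // size of the sample data
  return Buffer.concat([header, pcm]);
}
```

If the resulting file plays too fast or too slow, the assumed sample rate is wrong. Requesting response_format wav avoids the question entirely.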

Streaming playback

The endpoint streams audio bytes as they are generated. With fetch:

const res = await fetch('https://api.aivene.com/v1/audio/speech', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.AIVENE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'tts-1',
    voice: 'nova',
    input: 'Streaming this sentence as it arrives.'
  })
});

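// Error responses are JSON, not audio, so bail out before reading chunks.
if (!res.ok || !res.body) {
  throw new Error(`Speech request failed: HTTP ${res.status}`);
}

const reader = res.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Pipe `value` (a Uint8Array chunk of audio bytes) into your audio sink.
}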

In the browser, feed the chunks into a MediaSource to play before the whole clip lands.
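
A rough browser-only sketch of that approach follows. It assumes the default mp3 format and a browser whose MediaSource implementation accepts audio/mpeg, which not every browser does; `playWhileStreaming` is a hypothetical helper, `audioEl` is an audio element you supply, and `res` is a successful fetch response from the endpoint.

```javascript
// Browser-only: appends audio chunks to a MediaSource as they arrive.
async function playWhileStreaming(audioEl, res) {
  const mime = 'audio/mpeg'; // matches the default mp3 response_format
  if (!('MediaSource' in globalThis) || !MediaSource.isTypeSupported(mime)) {
    throw new Error(`MediaSource cannot play ${mime} in this environment`);
  }
  const mediaSource = new MediaSource();
  audioEl.src = URL.createObjectURL(mediaSource);
  // The source buffer can only be created once the MediaSource has opened.
  await new Promise((resolve) =>
    mediaSource.addEventListener('sourceopen', resolve, { once: true })
  );

  const sourceBuffer = mediaSource.addSourceBuffer(mime);
  const reader = res.body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    sourceBuffer.appendBuffer(value);
    // appendBuffer is asynchronous; wait for it before appending the next chunk.
    await new Promise((resolve) =>
      sourceBuffer.addEventListener('updateend', resolve, { once: true })
    );
  }
  mediaSource.endOfStream();
}
```

Call `audioEl.play()` from a user gesture once the first chunk is appended; autoplay policies may otherwise block playback. Where MediaSource is unavailable, fall back to downloading the whole clip and playing it from a Blob URL.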

Cost notes

TTS is billed per character of input text, regardless of model or voice. The Console usage page shows the per-request character count.
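
For a rough estimate before sending a request, count the characters yourself. This page does not state a price, so the rate below is a parameter you supply; counting Unicode code points is also an assumption, and the Console figure is authoritative. `estimateTtsCost` is a hypothetical helper.

```javascript
// Hypothetical helper: pricePerMillionChars is NOT given on this page; pass your own rate.
function estimateTtsCost(input, pricePerMillionChars) {
  // Spreading a string iterates by code point, so an emoji counts once.
  // The billing system may count differently.
  const characters = [...input].length;
  return { characters, cost: (characters / 1e6) * pricePerMillionChars };
}
```

For example, `estimateTtsCost('Hello from Aivene.', rate).characters` is 18.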

Content policy

Input text passes through the provider safety filter. Disallowed content (e.g. impersonation of real people) returns 400 with a reason in error.message.

Errors

| Status | Reason |
| --- | --- |
| 400 | Unknown voice id, unsupported format, or rejected text. |
| 401 | Missing or invalid API key. |
| 429 | Rate limit exceeded. |
| 500 | Internal gateway error. |
| 502 / 503 | Upstream provider failure. |

Error bodies are always JSON, even though success bodies are audio. Check Content-Type on the response: audio/* for success, application/json for errors.
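
That check can be sketched as two small helpers: one that classifies a response by status and Content-Type, and one that returns bytes or throws. Both names are hypothetical, the `error.message` shape follows the content-policy note above, and treating 429, 502 and 503 as retryable is our assumption, not a documented guarantee.

```javascript
// Decides how to treat a response from its status and Content-Type header.
function classifySpeechResponse(status, contentType) {
  const type = (contentType || '').toLowerCase();
  if (status >= 200 && status < 300 && type.startsWith('audio/')) return 'audio';
  // Assumption: rate limits and upstream failures are worth retrying.
  if (status === 429 || status === 502 || status === 503) return 'retryable-error';
  return 'error';
}

// Returns the audio bytes on success; otherwise throws with the JSON error message.
async function readSpeechResponse(res) {
  const kind = classifySpeechResponse(res.status, res.headers.get('content-type'));
  if (kind === 'audio') return new Uint8Array(await res.arrayBuffer());
  // Fall back to an empty object if the error body is not valid JSON.
  const payload = await res.json().catch(() => ({}));
  const err = new Error(payload?.error?.message ?? `HTTP ${res.status}`);
  err.status = res.status;
  err.retryable = kind === 'retryable-error';
  throw err;
}
```

Pass the result of `fetch` straight to `readSpeechResponse`, and back off before retrying when `err.retryable` is true.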
