Speech

Generate spoken audio from text using OpenAI-compatible TTS models.

The endpoint returns the raw audio bytes as the response body, with no JSON envelope.

POST /v1/audio/speech

Request body

Field	Type	Required	Description
`model`	string	yes	TTS model, e.g. `tts-1` (fast) or `tts-1-hd` (higher quality).
`input`	string	yes	The text to speak.
`voice`	string	yes	Voice id. Common: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`.
`instructions`	string	no	Additional instructions for the TTS model (e.g. tone, pacing).
`response_format`	string	no	`mp3` (default), `opus`, `aac`, `flac`, `wav`, `pcm`.
`speed`	number	no	`0.25` to `4.0`. Defaults to `1.0`.
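
The constraints in the table can be checked on the client before a request is sent. The following is a minimal sketch: `buildSpeechRequest` is a hypothetical helper, not part of any SDK, and it does not validate `voice` against a list because the table only names the common ids.

```javascript
// Hypothetical helper: validates the documented fields before sending.
const RESPONSE_FORMATS = ['mp3', 'opus', 'aac', 'flac', 'wav', 'pcm'];

function buildSpeechRequest({ model, input, voice, instructions, response_format, speed }) {
  // `model`, `input`, and `voice` are the three required fields.
  for (const [name, value] of Object.entries({ model, input, voice })) {
    if (typeof value !== 'string' || value.length === 0) {
      throw new TypeError(`"${name}" is required and must be a non-empty string`);
    }
  }
  if (response_format !== undefined && !RESPONSE_FORMATS.includes(response_format)) {
    throw new RangeError(`Unsupported response_format: ${response_format}`);
  }
  if (speed !== undefined && !(typeof speed === 'number' && speed >= 0.25 && speed <= 4.0)) {
    throw new RangeError('speed must be a number between 0.25 and 4.0');
  }
  // Leave out unset optional fields so the server applies its own defaults.
  const body = { model, input, voice };
  if (instructions !== undefined) body.instructions = instructions;
  if (response_format !== undefined) body.response_format = response_format;
  if (speed !== undefined) body.speed = speed;
  return body;
}
```

The returned object can be passed to `JSON.stringify` and used as the POST body.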

Response is audio, not JSON

Unlike most endpoints, `/v1/audio/speech` returns the audio file directly in the response body. When you save or re-serve the audio, choose the file extension and `Content-Type` to match the `response_format` you requested.
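
One way to keep the extension and `Content-Type` consistent is a small lookup keyed by `response_format`. This is a sketch: `audioTypeFor` is a hypothetical helper, and the MIME types are the commonly registered ones for each format, which is an assumption about what suits your player rather than a statement of what the gateway sends.

```javascript
// ASSUMPTION: commonly used MIME types and extensions, not values confirmed by the gateway.
const AUDIO_TYPES = new Map([
  ['mp3', { contentType: 'audio/mpeg', extension: 'mp3' }],
  ['opus', { contentType: 'audio/ogg; codecs=opus', extension: 'opus' }],
  ['aac', { contentType: 'audio/aac', extension: 'aac' }],
  ['flac', { contentType: 'audio/flac', extension: 'flac' }],
  ['wav', { contentType: 'audio/wav', extension: 'wav' }],
  // Raw PCM has no header, so a generic binary type is used here.
  ['pcm', { contentType: 'application/octet-stream', extension: 'pcm' }],
]);

// Looks up the type for a format; defaults to mp3, the documented default.
function audioTypeFor(responseFormat = 'mp3') {
  const entry = AUDIO_TYPES.get(responseFormat);
  if (!entry) throw new RangeError(`Unsupported response_format: ${responseFormat}`);
  return entry;
}
```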

Example

curl https://api.aivene.com/v1/audio/speech \
  -H "Authorization: Bearer $AIVENE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "alloy",
    "input": "The quick brown fox jumps over the lazy dog."
  }' \
  --output speech.mp3

With the OpenAI SDK:

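import { writeFile } from 'node:fs/promises';
import OpenAI from 'openai';

// Point the SDK at the gateway instead of the default OpenAI base URL.
const client = new OpenAI({
  apiKey: process.env.AIVENE_API_KEY,
  baseURL: 'https://api.aivene.com/v1'
});

const res = await client.audio.speech.create({
  model: 'tts-1',
  voice: 'alloy',
  input: 'Hello from Aivene.'
});

// The body is raw audio, so read it as bytes rather than JSON.
const buffer = Buffer.from(await res.arrayBuffer());
await writeFile('speech.mp3', buffer);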

Picking a model

Model	Latency	Quality	Use for
`tts-1`	low	good	Real-time UX, IVR, voice agents
`tts-1-hd`	medium	best	Pre-rendered narration, podcasts

If you need to start playback before generation finishes, use `tts-1` and stream the bytes to your audio sink as they arrive.

Picking a voice

The voice ids above are deliberately generic. Their characteristics:

Voice	Vibe
`alloy`	Neutral, balanced
`echo`	Calm, masculine
`fable`	Warm, expressive
`onyx`	Deep, authoritative
`nova`	Bright, feminine
`shimmer`	Soft, feminine

Run a quick sample with each voice before committing, since voice perception is subjective.
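
A sketch of how such a comparison could be prepared: it builds one request body per voice from the table, all speaking the same text. `voiceSampleRequests` and the `sample-<voice>.mp3` file names are hypothetical.

```javascript
const VOICES = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer'];

// Builds one request body per voice so the samples differ only in `voice`.
function voiceSampleRequests(input, model = 'tts-1') {
  return VOICES.map((voice) => ({
    filename: `sample-${voice}.mp3`, // hypothetical naming scheme
    body: { model, voice, input },
  }));
}
```

POST each `body` to `/v1/audio/speech` and save the response under `filename`, then listen to the six files side by side.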

Format and quality

`response_format`	Encoding / container	Good for
`mp3`	MPEG-1 Layer 3	Web playback, default
`opus`	Ogg Opus	Low bandwidth, WebRTC
`aac`	AAC	iOS / Apple ecosystem
`flac`	FLAC	Lossless archival
`wav`	WAV	Local processing, editing
`pcm`	Raw 16-bit PCM	Pipe into audio processing
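
Because `pcm` output carries no header, most players cannot open it directly. The sketch below wraps raw 16-bit PCM in a minimal 44-byte WAV header. `pcm16ToWav` is a hypothetical helper, and it assumes little-endian samples. This page does not state the sample rate or channel count, so both must be supplied by the caller from the provider's documentation.

```javascript
// Wraps raw 16-bit little-endian PCM bytes (a Uint8Array or Buffer) in a WAV header.
// ASSUMPTION: sampleRate and channels are not documented here; pass the real values.
function pcm16ToWav(pcm, sampleRate, channels = 1) {
  const bytesPerSample = 2;
  const blockAlign = channels * bytesPerSample;
  const out = new Uint8Array(44 + pcm.byteLength);
  const view = new DataView(out.buffer);
  const ascii = (offset, text) => {
    for (let i = 0; i < text.length; i++) out[offset + i] = text.charCodeAt(i);
  };

  ascii(0, 'RIFF');
  view.setUint32(4, 36 + pcm.byteLength, true); // file size minus the first 8 bytes
  ascii(8, 'WAVE');
  ascii(12, 'fmt ');
  view.setUint32(16, 16, true); // size of the fmt chunk for linear PCM
  view.setUint16(20, 1, true); // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // byte rate
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, 8 * bytesPerSample, true); // bits per sample
  ascii(36, 'data');
  view.setUint32(40, pcm.byteLength, true);
  out.set(pcm, 44); // copy the samples after the header
  return out;
}
```

For example, `pcm16ToWav(bytes, 24000)` would produce a mono file at 24 kHz; that rate is only an illustration.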

`speed` changes how long the clip lasts but not its pitch.

Streaming playback

The endpoint streams audio bytes as they are generated. With fetch:

const res = await fetch('https://api.aivene.com/v1/audio/speech', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.AIVENE_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'tts-1',
    voice: 'nova',
    input: 'Streaming this sentence as it arrives.'
  })
});

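// On failure the body is JSON, not audio, so stop before reading chunks.
if (!res.ok || !res.body) {
  throw new Error(`Speech request failed: ${res.status} ${await res.text()}`);
}

const reader = res.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Pipe `value` (a Uint8Array chunk of audio bytes) into your audio sink.
}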

In the browser, append the chunks to a `MediaSource` `SourceBuffer` to start playback before the whole clip has arrived.

Cost notes

TTS bills per character of input text, regardless of model or voice. The Console usage page shows the per-request character count.
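
Since billing depends only on input length, a request's cost can be estimated before it is sent. This is a sketch under two assumptions that this page does not confirm: that one billed character is one Unicode code point, and a per-million-character rate passed in as the hypothetical `pricePerMillionChars` parameter.

```javascript
// ASSUMPTION: one billed character = one Unicode code point.
function countCharacters(input) {
  return Array.from(input).length;
}

// `pricePerMillionChars` is hypothetical; take the real rate from your pricing page.
function estimateCost(input, pricePerMillionChars) {
  return (countCharacters(input) / 1_000_000) * pricePerMillionChars;
}
```

Compare the result of `countCharacters` with the count on the Console usage page to confirm how characters are actually counted.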

Content policy

Input text passes through the provider safety filter. Disallowed content (e.g. impersonation of real people) returns `400` with a reason in `error.message`.

Errors

Status	Reason
`400`	Unknown voice id, unsupported format, or rejected text.
`401`	Missing or invalid API key.
`429`	Rate limit exceeded.
`500`	Internal gateway error.
`502` / `503`	Upstream provider failure.
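
The statuses above fall into two groups: those that need a change to the request (`400`, `401`) and those that may succeed on a later attempt. A sketch of that split with an exponential backoff follows. `isRetryable` and `backoffMs` are hypothetical helpers, the delays are illustrative, and treating `500` as retryable is an assumption.

```javascript
// ASSUMPTION: 500 is treated as transient here; this page does not say whether it is.
const RETRYABLE_STATUSES = new Set([429, 500, 502, 503]);

function isRetryable(status) {
  return RETRYABLE_STATUSES.has(status);
}

// Doubles the delay on each attempt (attempt 0 = baseMs) up to capMs.
function backoffMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```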

Error bodies are always JSON, even though success bodies are audio. Check `Content-Type` on the response: `audio/*` for success, `application/json` for errors.
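
A sketch of that check for a fetch-style `Response`. `isAudioResponse` and `readSpeechResponse` are hypothetical helpers, and the `{ "error": { "message": ... } }` shape is inferred from the reference to `error.message` on this page.

```javascript
// True when the Content-Type header marks the body as audio.
function isAudioResponse(contentType) {
  return typeof contentType === 'string' && contentType.toLowerCase().startsWith('audio/');
}

// Returns the audio bytes, or throws with the message from the JSON error body.
// ASSUMPTION: error bodies look like { "error": { "message": "..." } }.
async function readSpeechResponse(res) {
  const contentType = res.headers.get('content-type');
  if (res.ok && isAudioResponse(contentType)) {
    return new Uint8Array(await res.arrayBuffer());
  }
  // Fall back to a generic message if the body is not the expected JSON.
  const payload = await res.json().catch(() => null);
  const message = payload?.error?.message ?? `Unexpected response (HTTP ${res.status})`;
  throw new Error(message);
}
```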

Next steps

Audio - go the other direction with transcription.
Chat Completions reference - generate the text first, then speak it.