Multimodal
Send images, video, audio, and documents to Aivene models, and generate images and speech in return.
Multimodal means a single request can carry more than text (images, video, audio, documents) and a single response can be more than text too. Understanding tasks all go through the same chat completions endpoint. Generation tasks mostly use dedicated endpoints; native image output through chat completions is the exception.
At a glance
| Modality | Direction | Endpoint | Page |
|---|---|---|---|
| Images | input | /v1/chat/completions | Image Understanding |
| Images | output (native) | /v1/chat/completions | Image Generation |
| Images | output (dedicated) | /v1/images/generations | Image Generation |
| Video | input | /v1/chat/completions | Video Understanding |
| Documents | input | /v1/chat/completions | Documents |
| Audio | input (understanding) | /v1/chat/completions | Audio Understanding |
| Audio | input (transcription) | /v1/audio/transcriptions, /v1/audio/translations | Transcriptions |
| Audio | output | /v1/audio/speech | Text-to-Speech |
Same auth, same shape
Every endpoint takes the standard Authorization: Bearer header and
follows the OpenAI request/response envelope. If you can call chat
completions, you can call all of these.
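As a rough illustration of that shared shape, the sketch below builds the request for a chat call with a small helper that could be reused for other JSON endpoints. The base URL, the model name, and the helper name `buildRequest` are placeholders assumed for this example, not values taken from the Aivene docs.

```javascript
// Hypothetical base URL; substitute the one for your Aivene account.
const BASE_URL = 'https://api.aivene.example';

// Build the URL and fetch() options for a JSON POST.
// The Authorization: Bearer header is the part shared by every endpoint.
function buildRequest(path, body, apiKey) {
  return {
    url: `${BASE_URL}${path}`,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    },
  };
}

// Example: a plain-text chat request. 'example-model' is a placeholder name.
const chat = buildRequest(
  '/v1/chat/completions',
  { model: 'example-model', messages: [{ role: 'user', content: 'Hello' }] },
  'YOUR_API_KEY',
);

// To send it (needs a runtime with a global fetch, such as Node 18+):
// const res = await fetch(chat.url, chat.options);
// const data = await res.json();
```

In OpenAI-compatible APIs, endpoints that accept file uploads (transcriptions, for example) typically take multipart form data rather than JSON, so only the header portion of this helper would carry over there.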
How content parts work
For understanding tasks, the user message accepts an array of "parts"
instead of a plain string. Each part has a type that tells the model
what it is.
{
  role: 'user',
  content: [
    { type: 'text', text: 'What is in this image?' },
    { type: 'image_url', image_url: { url: 'https://example.com/cat.png' } }
  ]
}

Part types you will see across the docs:
- text - a string
- image_url - URL or data: base64 (Image Understanding)
- input_video - base64 video with format (Video Understanding)
- input_audio - base64 audio with format (Audio Understanding)
- file - documents (PDF, Word, Excel, PowerPoint, RTF, text files) by file_data, file_url, or a managed file_id from the Files API
Mix and match in a single message - models that support multiple modalities will reason over all of them together.
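A minimal sketch of mixing parts in one message, assuming a Node.js runtime (it relies on the global `Buffer`). The helper names `toDataUrl` and `buildUserMessage` are made up for this example, and the `input_audio` field names (`data`, `format`) mirror the OpenAI shape as an assumption; check the Audio Understanding page for the exact fields.

```javascript
// Wrap raw bytes as a data: URL suitable for an image_url part.
function toDataUrl(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

// Build a user message that mixes text, an image, and audio.
// ASSUMPTION: input_audio carries { data, format }, as in the OpenAI shape.
function buildUserMessage(question, imageBytes, audioBytes) {
  return {
    role: 'user',
    content: [
      { type: 'text', text: question },
      { type: 'image_url', image_url: { url: toDataUrl(imageBytes, 'image/png') } },
      {
        type: 'input_audio',
        input_audio: {
          data: Buffer.from(audioBytes).toString('base64'),
          format: 'wav',
        },
      },
    ],
  };
}

// Placeholder bytes stand in for real file contents.
const message = buildUserMessage(
  'Does the narration match the picture?',
  new Uint8Array([0x89, 0x50, 0x4e, 0x47]),
  new Uint8Array([0x52, 0x49, 0x46, 0x46]),
);
```

The resulting `message` object goes into the `messages` array of a chat completions request. In practice the bytes would come from reading a file rather than from literals.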
Picking the right endpoint
- Understanding (image, video, audio, document inputs) → /v1/chat/completions. You get back text (or text + tool calls), not media.
- Image generation → /v1/images/generations or /v1/images/edits.
- Speech synthesis → /v1/audio/speech returns an audio file.
- Speech to text → /v1/audio/transcriptions returns JSON or plain text.
- Speech translation → /v1/audio/translations returns English text.
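The routing above can be written down as a small lookup. This is only an illustrative sketch: the task labels and the `endpointFor` helper are invented for the example, while the paths are the ones listed on this page.

```javascript
// Task-to-endpoint lookup. The keys are hypothetical labels; the paths
// are the ones listed on this page.
const ENDPOINTS = new Map([
  ['understanding', '/v1/chat/completions'],
  ['image-generation', '/v1/images/generations'],
  ['image-edit', '/v1/images/edits'],
  ['speech-synthesis', '/v1/audio/speech'],
  ['speech-to-text', '/v1/audio/transcriptions'],
  ['speech-translation', '/v1/audio/translations'],
]);

// Return the path for a task, or throw if the task is not listed.
function endpointFor(task) {
  const path = ENDPOINTS.get(task);
  if (path === undefined) {
    throw new Error(`Unknown task: ${task}`);
  }
  return path;
}
```

For example, `endpointFor('speech-to-text')` returns `/v1/audio/transcriptions`, while an unlisted task such as `'video-generation'` throws, which matches the scope described under "What is not here".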
What is not here
This section is deliberately limited in scope: video generation and real-time streaming voice are not covered yet. The Console changelog is the source of truth for new modalities as they ship.
Next steps
- Image Understanding - send images to a chat model.
- Video Understanding - send videos to a chat model.
- Image Generation - create and edit images.
- Documents - send PDFs, Office files, and text documents inline, by URL, or by managed file_id.
- Audio Understanding - reason over audio directly in chat.
- Transcriptions - convert speech to text.
- Text-to-Speech - generate spoken audio from text.