Multimodal

Send images, video, audio, and documents to Aivene models, and generate images and speech in response.

Multimodal means a single request can carry more than text (images, video, audio, documents) and a single response can be more than text too. Understanding tasks share the chat completions endpoint. Generation tasks mostly use dedicated endpoints, with native image output through chat completions as the exception.

At a glance

Modality	Direction	Endpoint	Page
Images	input	`/v1/chat/completions`	Image Understanding
Images	output (native)	`/v1/chat/completions`	Image Generation
Images	output (dedicated)	`/v1/images/generations`	Image Generation
Video	input	`/v1/chat/completions`	Video Understanding
Documents	input	`/v1/chat/completions`	Documents
Audio	input (understanding)	`/v1/chat/completions`	Audio Understanding
Audio	input (transcription)	`/v1/audio/transcriptions`, `/v1/audio/translations`	Transcriptions
Audio	output	`/v1/audio/speech`	Text-to-Speech

Same auth, same shape

Every endpoint takes the standard `Authorization: Bearer` header and follows the OpenAI request/response envelope. If you can call chat completions, you can call all of these.
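
To illustrate the shared shape, here is a minimal sketch that prepares two requests differing only in path and payload. The base URL, the model names, and the `prepareRequest` helper are hypothetical placeholders, and the payload fields assume the OpenAI convention mentioned above rather than a confirmed Aivene schema.

```typescript
// Hypothetical base URL; substitute the host for your Aivene account.
const BASE_URL = "https://api.aivene.example";

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds a JSON POST for an endpoint path; only the path and payload vary.
function prepareRequest(path: string, apiKey: string, payload: unknown): PreparedRequest {
  return {
    url: `${BASE_URL}${path}`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  };
}

const chat = prepareRequest("/v1/chat/completions", "YOUR_API_KEY", {
  model: "example-chat-model", // placeholder model name
  messages: [{ role: "user", content: "Hello" }],
});

const speech = prepareRequest("/v1/audio/speech", "YOUR_API_KEY", {
  model: "example-tts-model", // placeholder model name
  input: "Hello",
});

// To send one of these:
// await fetch(chat.url, { method: chat.method, headers: chat.headers, body: chat.body });
```

In the OpenAI convention, audio upload endpoints take multipart form data rather than JSON, so this helper may not fit the transcription and translation endpoints as written.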

How content parts work

For understanding tasks, the user message accepts an array of "parts" instead of a plain string. Each part has a type that tells the model what it is.

{
  role: 'user',
  content: [
    { type: 'text', text: 'What is in this image?' },
    { type: 'image_url', image_url: { url: 'https://example.com/cat.png' } }
  ]
}

Part types you will see across the docs:

text - a string
image_url - a URL or a base64 data: URL (Image Understanding)
input_video - base64 video with format (Video Understanding)
input_audio - base64 audio with format (Audio Understanding)
file - documents (PDF, Word, Excel, PowerPoint, RTF, text files) by file_data, file_url, or a managed file_id from the Files API

Mix and match in a single message - models that support multiple modalities will reason over all of them together.
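
As a sketch of how the part types could be modeled on the client side, the following defines a discriminated union and builds one message that mixes text, an image, and a document. Only the `text` and `image_url` shapes come from the example above. The nested fields for `input_audio`, `input_video`, and `file` are assumptions modeled on the OpenAI convention, and `toDataUrl`, `userMessage`, and the `file-123` ID are hypothetical.

```typescript
// Shapes other than `text` and `image_url` are assumptions; check each
// modality's page for the exact schema.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }
  | { type: "input_audio"; input_audio: { data: string; format: string } }
  | { type: "input_video"; input_video: { data: string; format: string } }
  | { type: "file"; file: { file_data?: string; file_url?: string; file_id?: string } };

interface UserMessage {
  role: "user";
  content: ContentPart[];
}

// Wraps raw base64 in a data: URL so it can be used in an image_url part.
function toDataUrl(mimeType: string, base64: string): string {
  return `data:${mimeType};base64,${base64}`;
}

// Collects any number of parts into a single user message.
function userMessage(...parts: ContentPart[]): UserMessage {
  return { role: "user", content: parts };
}

const message = userMessage(
  { type: "text", text: "Does the chart in the image match the report?" },
  // Truncated placeholder bytes, not a complete PNG.
  { type: "image_url", image_url: { url: toDataUrl("image/png", "iVBORw0KGgo=") } },
  // Hypothetical managed file ID.
  { type: "file", file: { file_id: "file-123" } }
);
```

Typing `type` as a literal in each variant lets the compiler reject a part whose payload does not match its declared type.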

Picking the right endpoint

Understanding (image, video, audio, document inputs) → /v1/chat/completions. You get back text (or text + tool calls), not media.
Image generation → /v1/images/generations or /v1/images/edits, or natively through /v1/chat/completions (see Image Generation).
Speech synthesis → /v1/audio/speech returns an audio file.
Speech to text → /v1/audio/transcriptions returns JSON or plain text.
Speech translation → /v1/audio/translations returns English text.
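
The routing above can be captured as a small lookup table. This is only a sketch: the task names are hypothetical labels, and mapping image edits to /v1/images/edits is an assumption based on the endpoint's name.

```typescript
// Hypothetical task labels; the paths are the ones listed above.
type Task =
  | "understanding"
  | "image_generation"
  | "image_edit"
  | "speech_synthesis"
  | "speech_to_text"
  | "speech_translation";

const ENDPOINTS: Record<Task, string> = {
  understanding: "/v1/chat/completions",
  image_generation: "/v1/images/generations",
  image_edit: "/v1/images/edits", // assumed from the endpoint name
  speech_synthesis: "/v1/audio/speech",
  speech_to_text: "/v1/audio/transcriptions",
  speech_translation: "/v1/audio/translations",
};

// Returns the endpoint path for a task.
function endpointFor(task: Task): string {
  return ENDPOINTS[task];
}
```

Because `ENDPOINTS` is typed as `Record<Task, string>`, adding a new task label without a path becomes a compile-time error.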

What is not here

This section is deliberately scoped: video generation and real-time streaming voice are not covered yet. The Console changelog is the source of truth for new modalities as they ship.

Next steps

Image Understanding - send images to a chat model.
Video Understanding - send videos to a chat model.
Image Generation - create and edit images.
Documents - send PDFs, Office files, and text documents inline, by URL, or by managed file_id.
Audio Understanding - reason over audio directly in chat.
Transcriptions - convert speech to text.
Text-to-Speech - generate spoken audio from text.