Multimodal

Send images, audio, and documents to Aivene models, and generate them back.

Multimodal means a single request can carry more than text (images, audio, video, documents), and a single response can be more than text too. Aivene uses the same chat completions endpoint for understanding tasks and dedicated endpoints for most generation tasks; native image output also goes through chat completions.

At a glance

| Modality | Direction | Endpoint | Page |
| --- | --- | --- | --- |
| Images | input | /v1/chat/completions | Image Understanding |
| Images | output (native) | /v1/chat/completions | Image Generation |
| Images | output (dedicated) | /v1/images/generations | Image Generation |
| Video | input | /v1/chat/completions | Video Understanding |
| Documents | input | /v1/chat/completions | Documents |
| Audio | input (understanding) | /v1/chat/completions | Audio Understanding |
| Audio | input (transcription) | /v1/audio/transcriptions, /v1/audio/translations | Transcriptions |
| Audio | output | /v1/audio/speech | Text-to-Speech |

Same auth, same shape

Every endpoint takes the standard Authorization: Bearer header and follows the OpenAI request/response envelope. If you can call chat completions, you can call all of these.
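As a rough illustration of the shared auth and shape, the sketch below builds (but does not send) two JSON requests with the same helper. The base URL, the API key, the model names, and the `prepareRequest` helper are all hypothetical placeholders, and the speech body fields assume the OpenAI-style `model`/`input` convention rather than anything confirmed for Aivene.

```typescript
// Hypothetical base URL: substitute the one shown in your Aivene account.
const BASE_URL = "https://api.aivene.example";

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build a JSON POST request carrying the standard Bearer header.
// Nothing is sent here; pass the result to fetch() when ready.
function prepareRequest(path: string, apiKey: string, body: unknown): PreparedRequest {
  return {
    url: `${BASE_URL}${path}`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    },
  };
}

// Model names below are placeholders, not real Aivene model IDs.
const chat = prepareRequest("/v1/chat/completions", "sk-demo", {
  model: "example-model",
  messages: [{ role: "user", content: "Hello" }],
});
const speech = prepareRequest("/v1/audio/speech", "sk-demo", {
  model: "example-tts-model",
  input: "Hello",
});

// To send: const response = await fetch(chat.url, chat.init);
```

Only the path and the body differ between the two calls. Endpoints that take file uploads, such as transcriptions, may expect a multipart body instead of JSON, so check the relevant page before reusing this helper there.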

How content parts work

For understanding tasks, the user message accepts an array of "parts" instead of a plain string. Each part has a type that tells the model what it is.

// A user message whose content is an array of typed parts
const message = {
  role: 'user',
  content: [
    { type: 'text', text: 'What is in this image?' },
    { type: 'image_url', image_url: { url: 'https://example.com/cat.png' } }
  ]
};

Part types you will see across the docs:

Mix and match parts in a single message: models that support multiple modalities reason over all of them together.
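To make the mixing concrete, here is a minimal sketch that types the two part kinds shown in the example above (`text` and `image_url`) and combines one prompt with several images in a single user message. The `userMessage` helper, the type names, and the model name are illustrative assumptions, not part of the Aivene API; other part types are left out because they are not shown in this section.

```typescript
// Only the two part types from the example above are modeled here.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type ContentPart = TextPart | ImagePart;

interface UserMessage {
  role: "user";
  content: ContentPart[];
}

// Hypothetical helper: one text part followed by one image part per URL.
function userMessage(prompt: string, imageUrls: string[]): UserMessage {
  const imageParts: ImagePart[] = imageUrls.map((url) => ({
    type: "image_url",
    image_url: { url },
  }));
  return {
    role: "user",
    content: [{ type: "text", text: prompt }, ...imageParts],
  };
}

const comparison = userMessage("What differs between these two images?", [
  "https://example.com/cat.png",
  "https://example.com/dog.png",
]);

// Request body for /v1/chat/completions; the model name is a placeholder.
const requestBody = { model: "example-vision-model", messages: [comparison] };
```

Typing the parts as a discriminated union lets the compiler reject a part whose fields do not match its `type`, which is an easy mistake to make when assembling messages by hand.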

Picking the right endpoint

  • Understanding (image, audio, video, document inputs) → /v1/chat/completions. You get back text (or text + tool calls), not media.
  • Image generation → /v1/images/generations or /v1/images/edits.
  • Speech synthesis → /v1/audio/speech returns an audio file.
  • Speech to text → /v1/audio/transcriptions returns JSON or plain text.
  • Speech translation → /v1/audio/translations returns English text.
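The routing above can be captured in a small lookup table. This is only a sketch: the task names and the `endpointFor` function are hypothetical, and treating image edits as a separate task is an assumption based on the /v1/images/edits path listed above.

```typescript
// Hypothetical task labels; only the endpoint paths come from the list above.
type Task =
  | "understanding"
  | "image_generation"
  | "image_edit"
  | "speech_synthesis"
  | "speech_to_text"
  | "speech_translation";

const ENDPOINTS: Record<Task, string> = {
  understanding: "/v1/chat/completions",
  image_generation: "/v1/images/generations",
  image_edit: "/v1/images/edits",
  speech_synthesis: "/v1/audio/speech",
  speech_to_text: "/v1/audio/transcriptions",
  speech_translation: "/v1/audio/translations",
};

// Return the endpoint path for a given task.
function endpointFor(task: Task): string {
  return ENDPOINTS[task];
}

const understandingPath = endpointFor("understanding");
const translationPath = endpointFor("speech_translation");
```

Because `ENDPOINTS` is typed as `Record<Task, string>`, adding a new task label without a matching path fails at compile time.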

What is not here

We deliberately scoped this section. Video generation and real-time streaming voice are not covered yet. The Console changelog is the source of truth for new modalities as they ship.

Next steps