Image Understanding

Send images to a chat model and ask questions about them.

Vision models accept images as part of a user message and reason over them alongside any text. Use this for OCR, chart reading, UI inspection, moderation, alt-text generation, and visual Q&A.

POST /v1/chat/completions

Which models support images?

Check for vision capability in the response from GET /v1/models. Common picks: gpt-4o, gpt-4o-mini, claude-sonnet, gemini-pro. Embedding and TTS models do not accept images.
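As a rough sketch, filtering a model list for vision support could look like the following. The response shape is not documented on this page, so the `ModelEntry` interface and its `capabilities` field are hypothetical; adjust them to whatever GET /v1/models actually returns.

```typescript
// Hypothetical shape of one entry from GET /v1/models. The `capabilities`
// field is an assumption made for this sketch, not a documented field.
interface ModelEntry {
  id: string;
  capabilities?: string[];
}

// Return the ids of the models that advertise vision support.
function visionModelIds(models: ModelEntry[]): string[] {
  return models
    .filter((m) => m.capabilities?.includes('vision') ?? false)
    .map((m) => m.id);
}
```

Models without a `capabilities` field are treated as not supporting images, which is the safer default for routing.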

URL input

Cheapest path: host the image somewhere reachable and pass the URL.

curl https://api.aivene.com/v1/chat/completions \
  -H "Authorization: Bearer $AIVENE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image in one sentence." },
        { "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } }
      ]
    }]
  }'

The URL must be publicly fetchable - signed S3 URLs work, localhost does not.
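A cheap client-side check can catch the obvious failures before a request is sent. The sketch below is illustrative only: `looksPubliclyFetchable` is a hypothetical helper, and it cannot prove that a URL is reachable from the provider. It only rejects URLs that clearly are not, such as localhost or non-HTTP schemes.

```typescript
// Reject URLs that can never be fetched from outside your machine.
// Passing this check does not guarantee that the URL is reachable.
function looksPubliclyFetchable(rawUrl: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return false; // not a valid absolute URL
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return false; // file:, data:, and similar schemes are not remote URLs
  }
  const host = parsed.hostname.toLowerCase();
  const isLoopback =
    host === 'localhost' ||
    host.endsWith('.localhost') ||
    /^127(\.\d{1,3}){3}$/.test(host) ||
    host === '::1' ||
    host === '[::1]';
  return !isLoopback;
}
```

Private network ranges (for example 10.x.x.x) are not covered here; add them if your images may live behind a VPN.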

Base64 input

For private images, embed the bytes inline as a data: URL.

import { readFile } from 'node:fs/promises';

const bytes = await readFile('chart.png');
const dataUrl = `data:image/png;base64,${bytes.toString('base64')}`;

// `client` is an SDK client already configured for this API (setup not shown here).
const res = await client.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'What is the y-axis label?' },
      { type: 'image_url', image_url: { url: dataUrl } }
    ]
  }]
});

Inline images count toward request body size limits - keep under ~10 MB per image for safety.
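Base64 encoding grows the payload by about a third, so it helps to check the encoded size before building the request. A minimal sketch, assuming the ~10 MB guideline is measured on the encoded text (the page does not say whether it refers to raw or encoded bytes); `MAX_INLINE_BYTES` and both helpers are illustrative names.

```typescript
// ~10 MB guideline from the text above, applied here to the encoded size.
const MAX_INLINE_BYTES = 10 * 1024 * 1024;

// Length of the padded base64 text produced from `byteLength` raw bytes:
// every 3 input bytes become 4 output characters.
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

// True when the encoded image stays within the limit.
function fitsInline(byteLength: number, maxBytes: number = MAX_INLINE_BYTES): boolean {
  return base64Length(byteLength) <= maxBytes;
}
```

In the example above, `fitsInline(bytes.length)` could gate the call; images that fail the check are better hosted and passed by URL.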

Multiple images

Pass several image_url parts in the same message. The model treats them as related context.

{
  "role": "user",
  "content": [
    { "type": "text", "text": "Which of these screenshots has the bug?" },
    { "type": "image_url", "image_url": { "url": "https://.../v1.png" } },
    { "type": "image_url", "image_url": { "url": "https://.../v2.png" } },
    { "type": "image_url", "image_url": { "url": "https://.../v3.png" } }
  ]
}
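When the image list is dynamic, the message can be assembled programmatically. The helper below is a sketch; `buildImageMessage` and the `ContentPart` type are illustrative names that mirror the JSON shape shown above.

```typescript
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } };

// Build one user message with the prompt first, followed by one
// image_url part per URL, in the order given.
function buildImageMessage(
  prompt: string,
  imageUrls: string[],
): { role: 'user'; content: ContentPart[] } {
  const images = imageUrls.map(
    (url): ContentPart => ({ type: 'image_url', image_url: { url } }),
  );
  return { role: 'user', content: [{ type: 'text', text: prompt }, ...images] };
}
```

Keeping the images in a stable order makes it easier to refer to them in the prompt ("the first screenshot", "the second one").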

Detail level

Control how much visual detail the model processes with the detail parameter. This affects both quality and token cost.

{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/receipt.png",
    "detail": "high"
  }
}

Detail values and provider mapping

| Input detail | OpenAI | Gemini 2.5 | Gemini 3+ |
| --- | --- | --- | --- |
| low | low | low (~64 tokens) | low (~280 tokens) |
| medium | high | medium (~256 tokens) | medium (~560 tokens) |
| high | high | high (~2,000-2,300 tokens) | high (~1,120 tokens) |
| auto | auto | low (~64 tokens) | low (~280 tokens) |
| (unspecified) | auto | low (~64 tokens) | low (~280 tokens) |

How Gemini applies detail

Gemini applies a single resolution to every image in a request (per the mediaResolution API). When you mix detail values across image parts, we pick the highest one and apply it to the whole request, so no image gets less quality than what you asked for. OpenAI keeps detail per-image as-is.
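The "highest one wins" rule can be sketched as follows. This is an illustration of the behavior described above, not the service's actual implementation, and it assumes `auto` and unspecified values are mapped to low (per the mapping table) before the maximum is taken.

```typescript
type Detail = 'low' | 'medium' | 'high' | 'auto';
type GeminiDetail = 'low' | 'medium' | 'high';

// Per the mapping table, Gemini treats `auto` and a missing detail as low.
function toGeminiDetail(detail?: Detail): GeminiDetail {
  return detail === undefined || detail === 'auto' ? 'low' : detail;
}

const RANK: Record<GeminiDetail, number> = { low: 0, medium: 1, high: 2 };

// Pick the single resolution applied to every image in a Gemini request:
// the highest detail requested by any image part.
function effectiveGeminiDetail(details: Array<Detail | undefined>): GeminiDetail {
  let best: GeminiDetail = 'low';
  for (const d of details) {
    const mapped = toGeminiDetail(d);
    if (RANK[mapped] > RANK[best]) best = mapped;
  }
  return best;
}
```

One practical consequence: a single `high` image in a batch raises the token cost of every other image in that request, so it can be cheaper to send it separately.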

Cost optimization

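For Gemini models, we default to low resolution when you don't specify detail. Going by the per-image figures in the table above, low uses about 97% fewer image tokens than high on Gemini 2.5 (~64 vs ~2,000-2,300) and about 75% fewer on Gemini 3+ (~280 vs ~1,120). Use high only when you need to read fine text or analyze detailed visuals.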
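To see where the savings come from, the arithmetic can be written out using the approximate per-image token counts from the mapping table. The family keys (`gemini-2.5`, `gemini-3+`) are labels chosen for this sketch, and the Gemini 2.5 high figure uses the lower end of the ~2,000-2,300 range, so treat the results as estimates.

```typescript
// Approximate per-image token counts copied from the mapping table.
const APPROX_TOKENS = {
  'gemini-2.5': { low: 64, medium: 256, high: 2000 },
  'gemini-3+': { low: 280, medium: 560, high: 1120 },
} as const;

type Family = keyof typeof APPROX_TOKENS;

// Fraction of image tokens saved by using `detail` instead of high.
function savingsVsHigh(family: Family, detail: 'low' | 'medium'): number {
  const tokens = APPROX_TOKENS[family];
  return 1 - tokens[detail] / tokens.high;
}

// savingsVsHigh('gemini-2.5', 'low') is about 0.97
// savingsVsHigh('gemini-3+', 'low') is 0.75
```

With these figures, low saves roughly 97% on Gemini 2.5 and 75% on Gemini 3+ relative to high.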

Portable code

Providers that don't support a specific detail value will map it to the closest equivalent. Your code stays portable across providers.

You can also pass mime_type if the URL does not have an extension:

{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/image?id=123",
    "mime_type": "image/png"
  }
}
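One way to apply this automatically is to attach mime_type only when the URL path has no recognizable image extension. The helper below is a sketch; `imagePart` is a hypothetical name, and the extension list is an assumption, since this page does not enumerate supported formats.

```typescript
type ImagePart = {
  type: 'image_url';
  image_url: { url: string; mime_type?: string };
};

// Extensions treated as self-describing. This list is an assumption.
const IMAGE_EXTENSION = /\.(png|jpe?g|gif|webp)$/i;

// Build an image part, adding `mime_type` only when the URL path
// (ignoring the query string) does not end in a known extension.
function imagePart(url: string, fallbackMimeType: string): ImagePart {
  const path = new URL(url).pathname;
  const hasExtension = IMAGE_EXTENSION.test(path);
  return {
    type: 'image_url',
    image_url: hasExtension ? { url } : { url, mime_type: fallbackMimeType },
  };
}
```

For `https://example.com/image?id=123` the path is `/image`, so the fallback type is attached; for `https://example.com/cat.jpg` it is omitted.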

Structured output

Combine vision with response_format: { type: 'json_schema', ... } to get typed extraction directly.

const res = await client.chat.completions.create({
  model: 'gpt-4o-mini',
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'receipt',
      schema: {
        type: 'object',
        properties: {
          merchant: { type: 'string' },
          total: { type: 'number' },
          currency: { type: 'string' }
        },
        required: ['merchant', 'total', 'currency']
      }
    }
  },
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Extract receipt fields as JSON.' },
      { type: 'image_url', image_url: { url: receiptUrl } }
    ]
  }]
});

const data = JSON.parse(res.choices[0].message.content ?? '{}');
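Note that the `'{}'` fallback above yields an object with none of the required fields, so it is worth validating the parsed value before using it. A minimal sketch with a hand-written type guard; `Receipt`, `isReceipt`, and `parseReceipt` are illustrative names, and a schema library would serve equally well.

```typescript
interface Receipt {
  merchant: string;
  total: number;
  currency: string;
}

// Runtime check that mirrors the required fields of the JSON schema.
function isReceipt(value: unknown): value is Receipt {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.merchant === 'string' &&
    typeof v.total === 'number' &&
    typeof v.currency === 'string'
  );
}

// Parse message content into a Receipt, or null when the content is
// missing, not valid JSON, or does not match the expected shape.
function parseReceipt(content: string | null): Receipt | null {
  if (!content) return null;
  try {
    const parsed: unknown = JSON.parse(content);
    return isReceipt(parsed) ? parsed : null;
  } catch {
    return null;
  }
}
```

Usage would be `const receipt = parseReceipt(res.choices[0].message.content);` followed by a null check.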

Cost notes

Images are converted to tokens and counted as part of your prompt tokens in usage. For high-throughput pipelines, downsample on the client before sending - models do not gain accuracy from images larger than they natively process.
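Downsampling starts with picking target dimensions. The sketch below computes a size that fits within a maximum side length while preserving aspect ratio. The right `maxSide` depends on the model and is not specified on this page, so the value you pass is your own assumption; the actual resize is left to whatever image library you use.

```typescript
// Scale (width, height) down so the longer side is at most `maxSide`,
// preserving aspect ratio. Images already small enough are left unchanged.
function fitWithin(
  width: number,
  height: number,
  maxSide: number,
): { width: number; height: number } {
  const longest = Math.max(width, height);
  if (longest <= maxSide) return { width, height }; // never upscale
  const scale = maxSide / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}
```

For example, a 4000x3000 photo with `maxSide` set to 2000 becomes 2000x1500.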

Sensitive content

Image requests still pass through the provider's safety filter. Images containing faces, IDs, or explicit content may be refused with a 400 and a reason in error.message.

Errors

Same envelope as chat completions. Common ones for vision:

| Status | Reason |
| --- | --- |
| 400 | Image URL unreachable, unsupported format, or too large. |
| 413 | Inline base64 exceeded the body size limit. |
| 415 | Model does not support image input. |
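In client code, these statuses usually map to different recovery paths. The sketch below shows one possible mapping; the action names are made up for illustration and are not part of the API.

```typescript
type VisionErrorAction =
  | 'fix_image' // 400: check the URL, format, and size (also used for safety refusals)
  | 'shrink_or_use_url' // 413: downsample, or host the image and pass a URL
  | 'switch_model' // 415: choose a model with vision capability
  | 'other';

// Map an HTTP status from a vision request to a suggested next step.
function classifyVisionError(status: number): VisionErrorAction {
  switch (status) {
    case 400:
      return 'fix_image';
    case 413:
      return 'shrink_or_use_url';
    case 415:
      return 'switch_model';
    default:
      return 'other';
  }
}
```

Because a 400 covers several causes, read error.message before deciding whether a retry with a different image makes sense.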
