Image Understanding
Send images to a chat model and ask questions about them.
Vision models accept images as part of a user message and reason over them alongside any text. Use this for OCR, chart reading, UI inspection, moderation, alt-text generation, and visual Q&A.
POST /v1/chat/completions
Which models support images?
Look for vision capability on GET /v1/models. Common picks:
gpt-4o, gpt-4o-mini, claude-sonnet, gemini-pro. Embedding and
TTS models do not.
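A minimal sketch of filtering a model list for vision support. The response shape is an assumption: the `capabilities` field and the `'vision'` label are hypothetical, so check the actual GET /v1/models payload before relying on them.

```typescript
// Hypothetical shape of one entry in the GET /v1/models response.
// The `capabilities` field is an assumption, not a documented schema.
interface ModelEntry {
  id: string;
  capabilities?: string[];
}

// Return the ids of models that advertise vision support.
// Entries without a capabilities list are treated as text-only.
function visionModelIds(models: ModelEntry[]): string[] {
  return models
    .filter((m) => (m.capabilities ?? []).indexOf('vision') !== -1)
    .map((m) => m.id);
}
```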
URL input
Cheapest path: host the image somewhere reachable and pass the URL.
curl https://api.aivene.com/v1/chat/completions \
-H "Authorization: Bearer $AIVENE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [{
"role": "user",
"content": [
{ "type": "text", "text": "Describe this image in one sentence." },
{ "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } }
]
}]
}'
The URL must be publicly fetchable - signed S3 URLs work, localhost does not.
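Catching an unreachable URL before sending the request can save a round trip. The sketch below builds the user message from a prompt and one or more image URLs and rejects obviously local addresses. The helper names `looksPubliclyFetchable` and `buildImageMessage` are illustrative, and the check is only a heuristic: it cannot prove the provider can actually reach the host.

```typescript
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } };

// Heuristic only: requires an http(s) scheme and rejects loopback hosts.
function looksPubliclyFetchable(rawUrl: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return false;
  const host = parsed.hostname;
  return host !== 'localhost' && host !== '127.0.0.1' && host !== '[::1]';
}

// Build a user message with the text part first, then one image_url part per URL.
function buildImageMessage(
  text: string,
  urls: string[],
): { role: 'user'; content: ContentPart[] } {
  const content: ContentPart[] = [{ type: 'text', text }];
  for (const url of urls) {
    if (!looksPubliclyFetchable(url)) {
      throw new Error(`Image URL is not publicly fetchable: ${url}`);
    }
    content.push({ type: 'image_url', image_url: { url } });
  }
  return { role: 'user', content };
}
```

The returned object can be placed directly in the `messages` array of the request body.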
Base64 input
For private images, embed the bytes inline as a data: URL.
import { readFile } from 'node:fs/promises';
const bytes = await readFile('chart.png');
const dataUrl = `data:image/png;base64,${bytes.toString('base64')}`;
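// `client` is assumed to be an already-configured, OpenAI-compatible SDK client
// pointed at https://api.aivene.com/v1.
const res = await client.chat.completions.create({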
model: 'gpt-4o-mini',
messages: [{
role: 'user',
content: [
{ type: 'text', text: 'What is the y-axis label?' },
{ type: 'image_url', image_url: { url: dataUrl } }
]
}]
});
Inline images count toward request body size limits - keep under ~10 MB per image for safety.
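Base64 encoding inflates the payload by about a third, so the size on the wire is larger than the file on disk. The sketch below estimates the encoded size without encoding anything. It treats the ~10 MB guideline as a limit on the raw file size, which is an assumption; the helper names are illustrative.

```typescript
// ~10 MB guideline from the text above, interpreted here as raw bytes (assumption).
const MAX_IMAGE_BYTES = 10 * 1024 * 1024;

// Base64 emits 4 characters for every 3 input bytes, padded to a multiple of 4.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Length of the full data: URL, including the "data:<mime>;base64," prefix.
function dataUrlLength(byteCount: number, mimeType: string): number {
  return `data:${mimeType};base64,`.length + base64Length(byteCount);
}

// True when the raw image is within the per-image guideline.
function fitsInlineLimit(byteCount: number): boolean {
  return byteCount <= MAX_IMAGE_BYTES;
}
```

A 10 MB file, for example, occupies roughly 13.3 MB of the request body once encoded, which is worth keeping in mind when several images share one request.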
Multiple images
Pass several image_url parts in the same message. The model treats them
as related context.
{
"role": "user",
"content": [
{ "type": "text", "text": "Which of these screenshots has the bug?" },
{ "type": "image_url", "image_url": { "url": "https://.../v1.png" } },
{ "type": "image_url", "image_url": { "url": "https://.../v2.png" } },
{ "type": "image_url", "image_url": { "url": "https://.../v3.png" } }
]
}
Detail level
Control how much visual detail the model processes with the detail parameter. This affects both quality and token cost.
{
"type": "image_url",
"image_url": {
"url": "https://example.com/receipt.png",
"detail": "high"
}
}
Detail values and provider mapping
| Input detail | OpenAI | Gemini 2.5 | Gemini 3+ |
|---|---|---|---|
| low | low | low (~64 tokens) | low (~280 tokens) |
| medium | high | medium (~256 tokens) | medium (~560 tokens) |
| high | high | high (~2,000-2,300 tokens) | high (~1,120 tokens) |
| auto | auto | low (~64 tokens) | low (~280 tokens) |
| (unspecified) | auto | low (~64 tokens) | low (~280 tokens) |
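The mapping in the table can be expressed as a lookup, which is handy for logging or predicting what a provider will actually receive. This is a client-side sketch transcribed from the table, not the gateway's implementation; the provider keys (`'openai'`, `'gemini-2.5'`, `'gemini-3+'`) are illustrative labels.

```typescript
type Detail = 'low' | 'medium' | 'high' | 'auto';
type Provider = 'openai' | 'gemini-2.5' | 'gemini-3+';

// Transcribed from the table above: input detail -> value the provider receives.
const DETAIL_MAP: Record<Provider, Record<Detail, Detail>> = {
  openai: { low: 'low', medium: 'high', high: 'high', auto: 'auto' },
  'gemini-2.5': { low: 'low', medium: 'medium', high: 'high', auto: 'low' },
  'gemini-3+': { low: 'low', medium: 'medium', high: 'high', auto: 'low' },
};

// An unspecified detail behaves like "auto" for every provider in the table.
function resolveDetail(provider: Provider, detail?: Detail): Detail {
  return DETAIL_MAP[provider][detail ?? 'auto'];
}
```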
How Gemini applies detail
Gemini applies a single resolution to every image in a request (per
the mediaResolution API). When you mix detail values across image
parts, we pick the highest one and apply it to the whole request, so
no image gets less quality than what you asked for. OpenAI keeps detail
per-image as-is.
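The "highest wins" rule for Gemini can be sketched as a small reduction over the detail values in a request. This is an illustration of the described behavior, not the gateway's code. It assumes that parts with `auto` or no detail count as `low`, in line with the mapping table.

```typescript
type GeminiDetail = 'low' | 'medium' | 'high';

const DETAIL_RANK: Record<GeminiDetail, number> = { low: 0, medium: 1, high: 2 };

// Return the single resolution applied to every image in a Gemini request.
// "auto" and undefined are treated as "low" (assumption based on the table).
function effectiveGeminiDetail(
  details: Array<GeminiDetail | 'auto' | undefined>,
): GeminiDetail {
  let best: GeminiDetail = 'low';
  for (const d of details) {
    const resolved: GeminiDetail = d === undefined || d === 'auto' ? 'low' : d;
    if (DETAIL_RANK[resolved] > DETAIL_RANK[best]) best = resolved;
  }
  return best;
}
```

One practical consequence: a single `high` image in a batch raises the token cost of every other image in that request, so it can be cheaper to send high-detail images separately.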
Cost optimization
For Gemini models, we default to low resolution when you don't specify detail.
Based on the token estimates in the table above, this reduces image token costs by roughly 75% (Gemini 3+) to over 95% (Gemini 2.5) compared to high resolution.
Only use high when you need to read fine text or analyze detailed visuals.
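The saving is simple arithmetic on the per-image token estimates from the mapping table. The helper below is illustrative, and the token counts are the approximate figures from that table rather than exact billing values.

```typescript
// Percentage of image tokens saved by using `lowTokens` instead of `highTokens`,
// rounded to one decimal place.
function savingsPercent(lowTokens: number, highTokens: number): number {
  return Math.round((1 - lowTokens / highTokens) * 1000) / 10;
}

// Approximate per-image figures from the table above.
const gemini25Saving = savingsPercent(64, 2000); // low vs. high on Gemini 2.5
const gemini3Saving = savingsPercent(280, 1120); // low vs. high on Gemini 3+
```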
Portable code
Providers that don't support a specific detail value will map it to the closest equivalent. Your code stays portable across providers.
You can also pass mime_type if the URL does not have an extension:
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image?id=123",
"mime_type": "image/png"
}
}
Structured output
Combine vision with response_format: { type: 'json_schema', ... } to get
typed extraction directly.
const res = await client.chat.completions.create({
model: 'gpt-4o-mini',
response_format: {
type: 'json_schema',
json_schema: {
name: 'receipt',
schema: {
type: 'object',
properties: {
merchant: { type: 'string' },
total: { type: 'number' },
currency: { type: 'string' }
},
required: ['merchant', 'total', 'currency']
}
}
},
messages: [{
role: 'user',
content: [
{ type: 'text', text: 'Extract receipt fields as JSON.' },
{ type: 'image_url', image_url: { url: receiptUrl } }
]
}]
});
const data = JSON.parse(res.choices[0].message.content ?? '{}');
Cost notes
Images are converted to tokens and counted as part of your prompt tokens in
usage. For high-throughput pipelines, downsample on the client before
sending - models do not gain accuracy from images larger than they natively
process.
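Client-side downsampling starts with picking target dimensions. The sketch below scales an image so its longer side fits a chosen maximum while preserving aspect ratio. The `maxSide` value is the caller's choice, since this page does not state each model's native resolution. The actual resize would be done by whatever image library you already use.

```typescript
// Scale (width, height) so the longer side is at most maxSide, keeping the
// aspect ratio. Images that already fit are returned unchanged (no upscaling).
function fitWithin(
  width: number,
  height: number,
  maxSide: number,
): { width: number; height: number } {
  const longest = Math.max(width, height);
  if (longest <= maxSide) return { width, height };
  const scale = maxSide / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}
```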
Sensitive content
Vision models still pass through the provider safety filter. Faces,
IDs, and explicit content may be refused with a 400 and a reason in
error.message.
Errors
Same envelope as chat completions. Common ones for vision:
| Status | Reason |
|---|---|
| 400 | Image URL unreachable, unsupported format, or too large. |
| 413 | Inline base64 exceeded the body size limit. |
| 415 | Model does not support image input. |
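One way to act on these statuses is to map each to a follow-up step. The action names below are hypothetical labels for illustration, not values returned by the API. A 400 can also be a safety refusal, as noted under Sensitive content, so `error.message` is worth inspecting in that case.

```typescript
type VisionErrorAction =
  | 'check_image_or_content' // 400: bad URL, format, size, or a safety refusal
  | 'shrink_or_use_url' // 413: inline payload too large
  | 'switch_model' // 415: model has no image input
  | 'other';

// Map an HTTP status from a vision request to a suggested follow-up action.
function classifyVisionError(httpStatus: number): VisionErrorAction {
  switch (httpStatus) {
    case 400:
      return 'check_image_or_content';
    case 413:
      return 'shrink_or_use_url';
    case 415:
      return 'switch_model';
    default:
      return 'other';
  }
}
```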
Next steps
- Image Generation - create images instead of describing them.
- Documents - extract structured data from PDFs, Office files, and text documents.
- Chat Completions reference - full request schema.