Prompt Caching

Cache expensive prompts to reduce latency and cost.

Prompt caching lets providers reuse previously processed content instead of recomputing it on every request. This cuts both latency and cost, especially when your prompts contain large, repeated sections like system instructions, RAG context, or reference documents.

Why it matters

Faster responses - Cached tokens skip reprocessing, reducing time-to-first-token on subsequent requests.
Lower cost - Cache reads are billed at a fraction of the normal input price (as low as 0.1x depending on the provider).
Efficient for long context - Ideal for apps that reuse a large system prompt or document across many requests.

How caching works

Implicit caching

The provider handles everything automatically. Send a request with a long prompt and the provider caches the common prefix. Future requests sharing the same prefix will hit the cache without any changes to your code.

Explicit caching

You place cache_control markers directly in your request to tell the provider exactly what to cache. This gives precise control but requires you to structure your messages accordingly.

Support type	Description
Implicit only	Caching is automatic, no manual control
Explicit only	Must add `cache_control` to enable caching
Hybrid	Both modes available

Check model support

Visit Models and look at the pricing column. If a model shows a cache read/write price, it supports caching.

Reading cache usage

Every response includes a usage object. When caching is active, look inside prompt_tokens_details for cache metrics:

{
  "usage": {
    "prompt_tokens": 1500,
    "completion_tokens": 200,
    "total_tokens": 1700,
    "prompt_tokens_details": {
      "cached_tokens": 1200,
      "cache_write_tokens": 0
    }
  }
}

Field	What it means
`cached_tokens`	Tokens served from cache - billed at the discounted cache read rate
`cache_write_tokens`	Tokens written into cache for the first time - may have a write cost

If cached_tokens is 0 on every request, the cache isn't being hit. Check that your prompt prefix stays consistent across requests.
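
To monitor this over time, it can help to turn the usage object into a single hit ratio. The following is a minimal sketch, not part of any SDK: the `Usage` interface and the `cacheHitRatio` helper are hypothetical names, and it assumes `cached_tokens` is counted as part of `prompt_tokens`, as the sample response above suggests.

```typescript
// Shape of the usage object shown above; other fields are omitted.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
  prompt_tokens_details?: {
    cached_tokens?: number;
    cache_write_tokens?: number;
  };
}

// Hypothetical helper: fraction of prompt tokens that were served from cache.
// Assumes cached_tokens is a subset of prompt_tokens.
function cacheHitRatio(usage: Usage): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  if (usage.prompt_tokens === 0) return 0;
  return cached / usage.prompt_tokens;
}

// The sample response from above.
const sampleUsage: Usage = {
  prompt_tokens: 1500,
  completion_tokens: 200,
  total_tokens: 1700,
  prompt_tokens_details: { cached_tokens: 1200, cache_write_tokens: 0 },
};

console.log(cacheHitRatio(sampleUsage)); // 1200 / 1500 = 0.8
```

A ratio that stays at 0 across requests with an identical prefix is the signal to check prompt ordering and minimum token thresholds.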

Best practices

Static content first - System prompts, persona descriptions, and reference documents should come before any dynamic content. The cache is prefix-based, so anything that changes between requests breaks the match.
Dynamic content last - User messages and per-request variables go at the end of the conversation, after all the static context.
Reuse the same client instance - Some providers tie caching to the connection or session. Reusing the same client helps maintain cache locality.
Check minimum token thresholds - Each provider has a minimum number of tokens before a prompt qualifies for caching. Prompts below that threshold will never hit the cache.
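
The ordering rules above can be sketched as a small message builder. This is an illustration only: `buildMessages`, `Message`, and `STATIC_SYSTEM_PROMPT` are hypothetical names, and the prompt text is a placeholder. The point is that everything that never changes sits at the front, so two different requests share a byte-identical prefix.

```typescript
type Message = { role: 'system' | 'user' | 'assistant'; content: string };

// Static content: defined once and never interpolated with per-request values
// (no timestamps, user names, or request IDs in here).
const STATIC_SYSTEM_PROMPT = 'You are a customer support agent for Acme Corp. ...';

// Hypothetical helper: static prefix first, prior turns next, new input last.
function buildMessages(priorTurns: Message[], userInput: string): Message[] {
  return [
    { role: 'system', content: STATIC_SYSTEM_PROMPT },
    ...priorTurns,
    { role: 'user', content: userInput },
  ];
}

const first = buildMessages([], 'Can I return a product I bought 45 days ago?');
const second = buildMessages([], 'Where is my order?');

// The leading system message is identical across both requests,
// so a prefix-based cache can match it.
console.log(JSON.stringify(first[0]) === JSON.stringify(second[0])); // true
```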

Most models cache automatically

If a model is not listed in the sections below, it uses implicit caching - no cache_control or extra configuration needed. Just send your request normally and the provider handles caching automatically.

Anthropic Claude

Claude uses explicit (opt-in) caching: nothing is cached until you add cache_control to your request. Cache reads are billed at 0.1x the normal input price, making it very cheap to reuse long context across turns.

Type	Price
Cache write (5-minute TTL)	1.25x input price
Cache write (1-hour TTL)	2x input price
Cache read	0.1x input price

Claude supports two caching modes:

Automatic - Add cache_control at the top level of the request body. The API automatically places the cache breakpoint at the last cacheable block and shifts it forward as the conversation grows. Best for multi-turn chat.
Explicit breakpoints - Add cache_control directly inside individual content blocks. Up to four breakpoints per request. Best for caching specific large content like RAG chunks, documents, or character cards.
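
When breakpoints are added programmatically, a pre-flight check can catch requests that exceed the four-breakpoint limit before they are sent. The sketch below is an assumption-laden illustration: the types and the `countBreakpoints` and `assertBreakpointLimit` helpers are hypothetical, and it only inspects text blocks inside message content.

```typescript
type CacheControl = { type: 'ephemeral'; ttl?: '1h' };
type TextBlock = { type: 'text'; text: string; cache_control?: CacheControl };
type ChatMessage = {
  role: 'system' | 'user' | 'assistant';
  content: string | TextBlock[];
};

const MAX_BREAKPOINTS = 4; // limit stated above

// Hypothetical helper: count content blocks that carry cache_control.
function countBreakpoints(messages: ChatMessage[]): number {
  let count = 0;
  for (const message of messages) {
    if (typeof message.content === 'string') continue; // plain strings carry no breakpoint
    for (const block of message.content) {
      if (block.cache_control) count += 1;
    }
  }
  return count;
}

// Hypothetical helper: throw before sending a request that exceeds the limit.
function assertBreakpointLimit(messages: ChatMessage[]): void {
  const count = countBreakpoints(messages);
  if (count > MAX_BREAKPOINTS) {
    throw new Error(`Found ${count} cache breakpoints; at most ${MAX_BREAKPOINTS} are allowed per request.`);
  }
}

const sampleMessages: ChatMessage[] = [
  {
    role: 'system',
    content: [
      { type: 'text', text: 'You are a code review assistant.' },
      { type: 'text', text: '... (long static guidelines) ...', cache_control: { type: 'ephemeral' } },
    ],
  },
  { role: 'user', content: 'Review this pull request: ...' },
];

assertBreakpointLimit(sampleMessages); // one breakpoint, so this does not throw
```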

Responses API

The Responses API only supports automatic caching via top-level cache_control. Explicit per-block cache breakpoints are not exposed through the Responses API - use the Chat Completions or Anthropic Messages API if you need fine-grained breakpoints.

Minimum token requirements

Each model has a minimum cacheable prompt length. Prompts shorter than the minimum will not be cached.

Min Tokens	Models
4,096	Claude Opus 4.8, Claude Opus 4.7, Claude Opus 4.6, Claude Opus 4.5, Claude Haiku 4.5
1,024	Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4
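
A quick guard against these minimums might look like the sketch below. The lookup table copies a few rows from the table above, keyed by display name; the `meetsCacheMinimum` helper is hypothetical, and the token count is assumed to come from a tokenizer or from `prompt_tokens` on a previous response.

```typescript
// A subset of the table above, keyed by the display names it uses.
const MIN_CACHEABLE_TOKENS: Record<string, number> = {
  'Claude Opus 4.5': 4096,
  'Claude Haiku 4.5': 4096,
  'Claude Sonnet 4.5': 1024,
  'Claude Sonnet 4': 1024,
};

// Hypothetical helper: returns undefined when the model is not in the table.
function meetsCacheMinimum(model: string, promptTokens: number): boolean | undefined {
  const minimum = MIN_CACHEABLE_TOKENS[model];
  if (minimum === undefined) return undefined;
  return promptTokens >= minimum;
}

console.log(meetsCacheMinimum('Claude Sonnet 4.5', 1500)); // true  (1500 >= 1024)
console.log(meetsCacheMinimum('Claude Haiku 4.5', 1500));  // false (1500 < 4096)
```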

Cache TTL options

5 minutes (default): "cache_control": { "type": "ephemeral" }
1 hour: "cache_control": { "type": "ephemeral", "ttl": "1h" }

The 1-hour TTL has a higher write cost (2x vs 1.25x) but keeps the cache alive longer, useful for long sessions where users take breaks between turns. For short burst sessions, the default 5-minute TTL is usually cheaper overall.
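
The trade-off can be made concrete with a little arithmetic. The sketch below expresses the cost of the cached prefix in units of "one uncached pass over the prefix", using the multipliers from the pricing table. The `prefixCost` helper and both scenarios are hypothetical, and the sketch assumes a cache entry that is still alive is always hit.

```typescript
const READ_MULTIPLIER = 0.1;
const WRITE_MULTIPLIER = { '5m': 1.25, '1h': 2 } as const;

// Hypothetical helper: relative input cost of the cached prefix
// for a given number of cache writes and cache reads.
function prefixCost(ttl: '5m' | '1h', writes: number, reads: number): number {
  return writes * WRITE_MULTIPLIER[ttl] + reads * READ_MULTIPLIER;
}

// Scenario A: four requests spaced 15 minutes apart.
// The 5-minute cache has expired before each request, so every request is a write.
const spreadWith5m = prefixCost('5m', 4, 0);
// The 1-hour cache survives: one write, then three reads.
const spreadWith1h = prefixCost('1h', 1, 3);

// Scenario B: four requests within a couple of minutes.
// Both TTLs give one write and three reads; only the write price differs.
const burstWith5m = prefixCost('5m', 1, 3);
const burstWith1h = prefixCost('1h', 1, 3);

console.log({ spreadWith5m, spreadWith1h, burstWith5m, burstWith1h });
```

In scenario A the 1-hour TTL costs roughly 2.3 units against 5 for the default, while in scenario B the default is cheaper (about 1.55 against 2.3). Without any caching, four requests would cost 4 units in both scenarios.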

Examples

Automatic caching (recommended for multi-turn conversations)

Add cache_control at the top level of the request. The API will figure out where to place the breakpoint automatically and keep advancing it as the conversation grows - no changes needed per turn.

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.aivene.com/v1',
  apiKey: process.env.AIVENE_API_KEY
});

const response = await client.chat.completions.create({
  model: 'claude-sonnet-4-5',
  // @ts-expect-error - top-level cache_control is a Claude extension
  cache_control: { type: 'ephemeral' },
  messages: [
    {
      role: 'system',
      content: `You are a customer support agent for Acme Corp.
        
        ## Product Catalog
        ... (very long product list) ...
        
        ## Return Policy
        ... (detailed policy text) ...`
    },
    {
      role: 'user',
      content: 'Can I return a product I bought 45 days ago?'
    }
  ]
});

With 1-hour TTL for longer sessions:

const response = await client.chat.completions.create({
  model: 'claude-sonnet-4-5',
  // @ts-expect-error - top-level cache_control is a Claude extension
  cache_control: { type: 'ephemeral', ttl: '1h' },
  messages: [
    {
      role: 'system',
      content: 'You are a legal document assistant...'
    },
    {
      role: 'user',
      content: 'Summarize the liability section.'
    }
  ]
});

Explicit cache breakpoints (fine-grained control)

Place cache_control on the specific content block you want cached. The breakpoint tells Claude to cache everything up to and including that block.

Caching a large system prompt:

const response = await client.chat.completions.create({
  model: 'claude-sonnet-4-5',
  messages: [
    {
      role: 'system',
      content: [
        {
          type: 'text',
          text: 'You are a code review assistant. Follow these guidelines:'
        },
        {
          type: 'text',
          // This block is large and static - cache it
          text: `## Security
            - Never trust user input without validation
            ... (hundreds of lines of guidelines) ...
            
            ## Performance
            - Avoid N+1 queries
            ...`,
          // @ts-expect-error - cache_control is a Claude extension
          cache_control: { type: 'ephemeral' }
        }
      ]
    },
    {
      role: 'user',
      content: 'Review this pull request: ...'
    }
  ]
});

Caching a large user-provided document with 1-hour TTL:

const response = await client.chat.completions.create({
  model: 'claude-sonnet-4-5',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Based on the contract below:' },
        {
          type: 'text',
          text: contractText, // large document
          // @ts-expect-error - cache_control is a Claude extension
          cache_control: { type: 'ephemeral', ttl: '1h' }
        },
        { type: 'text', text: 'List all payment terms mentioned.' }
      ]
    }
  ]
});

Google Gemini

Gemini supports both implicit and explicit caching depending on the model.

Implicit caching

Gemini handles caching automatically - just send your request normally and the provider will cache the common prefix when the prompt is long enough.

No cache write cost
Cache reads billed at 0.25x input price
TTL averages 3–5 minutes

Explicit caching

For older Gemini models, you need to add cache_control breakpoints yourself. Only the last breakpoint in your request is used - additional ones are ignored but won't break anything (useful when sharing the same request format with Claude).

Cache write cost: input token price + 5 minutes of storage
Cache reads billed at 0.25x input price
TTL is 5 minutes
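
Because only the last breakpoint counts for Gemini, it can be useful to check which block that is when a request carries several (for example, a payload originally built for Claude). The helper below is a hypothetical sketch that scans messages in order and reports the position of the final block carrying cache_control. The types are illustrative and not taken from any SDK.

```typescript
type CacheControl = { type: 'ephemeral'; ttl?: '1h' };
type TextBlock = { type: 'text'; text: string; cache_control?: CacheControl };
type ChatMessage = {
  role: 'system' | 'user' | 'assistant';
  content: string | TextBlock[];
};
type BreakpointLocation = { messageIndex: number; blockIndex: number };

// Hypothetical helper: position of the last block with cache_control, or null.
function findLastBreakpoint(messages: ChatMessage[]): BreakpointLocation | null {
  let last: BreakpointLocation | null = null;
  messages.forEach((message, messageIndex) => {
    if (typeof message.content === 'string') return; // no blocks to inspect
    message.content.forEach((block, blockIndex) => {
      if (block.cache_control) last = { messageIndex, blockIndex };
    });
  });
  return last;
}

const sharedPayload: ChatMessage[] = [
  {
    role: 'system',
    content: [{ type: 'text', text: '... (static instructions) ...', cache_control: { type: 'ephemeral' } }],
  },
  {
    role: 'user',
    content: [
      { type: 'text', text: 'Using the data below:' },
      { type: 'text', text: '... (large RAG context) ...', cache_control: { type: 'ephemeral' } },
      { type: 'text', text: 'Which entries have a value above 1000?' },
    ],
  },
];

// Two breakpoints are present; the one in the user message is the last.
console.log(findLastBreakpoint(sharedPayload)); // { messageIndex: 1, blockIndex: 1 }
```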

System message caching limitation

Gemini has a single systemInstruction field which is treated as immutable when cached. If you need part of the system prompt to stay dynamic, move that content into a user message instead.

Supported models

Model	Min Tokens	Implicit	Explicit
gemini-2.5-flash	2,048	Yes	Yes
gemini-2.5-flash-lite	2,048	No	Low hit rate
gemini-2.5-pro	2,048	Yes	Yes
gemini-3.1-pro-preview	4,096	No	Yes
gemini-3.5-flash	4,096	No	Yes

Examples

System message caching

Cache a large static system prompt so it's not reprocessed on every request:

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.aivene.com/v1',
  apiKey: process.env.AIVENE_API_KEY
});

const response = await client.chat.completions.create({
  model: 'gemini-2.5-flash',
  messages: [
    {
      role: 'system',
      content: [
        {
          type: 'text',
          text: 'You are a document analysis assistant.'
        },
        {
          type: 'text',
          text: fullDocumentText, // large static document
          // @ts-expect-error - cache_control is not part of the OpenAI SDK types
          cache_control: { type: 'ephemeral' }
        }
      ]
    },
    {
      role: 'user',
      content: 'What are the key obligations in section 3?'
    }
  ]
});

User message caching

Useful for RAG scenarios where the retrieved context is large and reused across follow-up questions:

const response = await client.chat.completions.create({
  model: 'gemini-2.5-flash',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Using the data below:' },
        {
          type: 'text',
          text: retrievedChunks, // large RAG context
          // @ts-expect-error - cache_control is not part of the OpenAI SDK types
          cache_control: { type: 'ephemeral' }
        },
        { type: 'text', text: 'Which entries have a value above 1000?' }
      ]
    }
  ]
});

Alibaba Qwen

Qwen only supports explicit caching - you must add cache_control to each content block you want cached. Without it, no caching occurs at all.

Type	Price
Cache write	1.25x input price
Cache read	0.1x input price

Cache writes use a 5-minute TTL. The cache_control syntax is identical to Anthropic's explicit breakpoints.

Example

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.aivene.com/v1',
  apiKey: process.env.AIVENE_API_KEY
});

const response = await client.chat.completions.create({
  model: 'qwen3.7-max',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Use the reference below when answering.' },
        {
          type: 'text',
          text: referenceText, // large static content
          // @ts-expect-error - cache_control is not part of the OpenAI SDK types
          cache_control: { type: 'ephemeral' }
        },
        { type: 'text', text: 'Summarize the main implementation details.' }
      ]
    }
  ]
});