Prompt Caching
Cache expensive prompts to reduce latency and cost.
Prompt caching lets providers reuse previously processed content instead of recomputing it on every request. This cuts both latency and cost, especially when your prompts contain large, repeated sections like system instructions, RAG context, or reference documents.
Why it matters
- Faster responses - Cached tokens skip reprocessing, reducing time-to-first-token on subsequent requests.
- Lower cost - Cache reads are billed at a fraction of the normal input price (as low as 0.1x depending on the provider).
- Efficient for long context - Ideal for apps that reuse a large system prompt or document across many requests.
How caching works
Implicit caching
The provider handles everything automatically. Send a request with a long prompt and the provider caches the common prefix. Future requests sharing the same prefix will hit the cache without any changes to your code.
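To see why two requests do or do not share a cacheable prefix, it can help to compare them message by message. The sketch below is illustrative only: `ChatMessage` and `sharedPrefixLength` are hypothetical names, and real providers match on tokens rather than whole messages, so treat the result as an approximation.

```typescript
// Hypothetical message shape; only role and plain string content are modeled.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Count how many leading messages are identical in both requests.
// A prefix-based cache can only reuse content inside this shared prefix.
function sharedPrefixLength(a: ChatMessage[], b: ChatMessage[]): number {
  const limit = Math.min(a.length, b.length);
  for (let i = 0; i < limit; i++) {
    const x = a[i];
    const y = b[i];
    if (!x || !y || x.role !== y.role || x.content !== y.content) {
      return i;
    }
  }
  return limit;
}

const sharedSystem: ChatMessage = {
  role: 'system',
  content: 'You are a support agent. (long static instructions)',
};

const firstRequest: ChatMessage[] = [
  sharedSystem,
  { role: 'user', content: 'Where is my order?' },
];
const secondRequest: ChatMessage[] = [
  sharedSystem,
  { role: 'user', content: 'Can I get a refund?' },
];

// Only the system prompt is shared, so only that part can be served from cache.
console.log(sharedPrefixLength(firstRequest, secondRequest)); // 1
```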
Explicit caching
You place cache_control markers directly in your request to tell the provider
exactly what to cache. This gives precise control but requires you to structure
your messages accordingly.
| Support type | Description |
|---|---|
| Implicit only | Caching is automatic, no manual control |
| Explicit only | Must add cache_control to enable caching |
| Hybrid | Both modes available |
Check model support
Visit Models and look at the pricing column. If a model shows a cache read/write price, it supports caching.
Reading cache usage
Every response includes a usage object. When caching is active, look inside
prompt_tokens_details for cache metrics:
{
"usage": {
"prompt_tokens": 1500,
"completion_tokens": 200,
"total_tokens": 1700,
"prompt_tokens_details": {
"cached_tokens": 1200,
"cache_write_tokens": 0
}
}
}
| Field | What it means |
|---|---|
| cached_tokens | Tokens served from cache - billed at the discounted cache read rate |
| cache_write_tokens | Tokens written into cache for the first time - may have a write cost |
If cached_tokens is 0 on every request, the cache isn't being hit. Check
that your prompt prefix stays consistent across requests.
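A small helper can turn the usage object into a hit ratio and an approximate input cost. This is a sketch under stated assumptions: `summarizeCache` is a hypothetical name, the read and write multipliers are parameters you supply from your provider's pricing, and it assumes `prompt_tokens` already includes the cached and cache-write tokens.

```typescript
// Shape of the usage object shown above.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
  prompt_tokens_details?: {
    cached_tokens?: number;
    cache_write_tokens?: number;
  };
}

interface CacheSummary {
  cachedTokens: number;
  hitRatio: number;          // share of prompt tokens served from cache
  inputCostInTokens: number; // prompt cost expressed in full-price token units
}

// Assumption: prompt_tokens = uncached + cached + cache-write tokens.
function summarizeCache(
  usage: Usage,
  readMultiplier: number,
  writeMultiplier: number
): CacheSummary {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  const written = usage.prompt_tokens_details?.cache_write_tokens ?? 0;
  const uncached = usage.prompt_tokens - cached - written;
  return {
    cachedTokens: cached,
    hitRatio: usage.prompt_tokens > 0 ? cached / usage.prompt_tokens : 0,
    inputCostInTokens:
      uncached + cached * readMultiplier + written * writeMultiplier,
  };
}

// The sample response from above, with 0.1x reads and 1.25x writes as example rates.
const sampleUsage: Usage = {
  prompt_tokens: 1500,
  completion_tokens: 200,
  total_tokens: 1700,
  prompt_tokens_details: { cached_tokens: 1200, cache_write_tokens: 0 },
};

const cacheSummary = summarizeCache(sampleUsage, 0.1, 1.25);
console.log(cacheSummary.hitRatio);          // 0.8
console.log(cacheSummary.inputCostInTokens); // 420 (300 uncached + 1200 * 0.1)
```

Logging the hit ratio per request makes a silently broken prefix easy to spot: the ratio drops to 0 and stays there.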
Best practices
- Static content first - System prompts, persona descriptions, and reference documents should come before any dynamic content. The cache is prefix-based, so anything that changes between requests breaks the match from that point onward.
- Dynamic content last - User messages and per-request variables go at the end of the conversation, after all the static context.
- Reuse the same client instance - Some providers tie caching to the connection or session. Reusing the same client helps maintain cache locality.
- Check minimum token thresholds - Each provider requires a minimum number of tokens before a prompt qualifies for caching. Prompts below that threshold are never cached.
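The ordering rules above can be captured in one small builder. This is a minimal sketch, not part of any SDK: `buildMessages` and its parameters are hypothetical, and the dates are made-up sample values.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Hypothetical helper that keeps the cacheable prefix stable across requests.
function buildMessages(
  staticParts: string[],
  priorTurns: ChatMessage[],
  perRequestContext: string,
  userInput: string
): ChatMessage[] {
  return [
    // Static content first, always joined in the same order.
    { role: 'system', content: staticParts.join('\n\n') },
    // Earlier turns do not change between requests, so they extend the prefix.
    ...priorTurns,
    // Dynamic content last: per-request variables travel with the user message.
    { role: 'user', content: `${perRequestContext}\n\n${userInput}` },
  ];
}

const acmeStaticParts = [
  'You are a customer support agent for Acme Corp.',
  '## Return Policy\n(long policy text)',
];

const requestA = buildMessages(acmeStaticParts, [], 'Current date: 2025-01-01', 'Can I return this?');
const requestB = buildMessages(acmeStaticParts, [], 'Current date: 2025-01-02', 'Where is my order?');

// The system prompt is identical in both requests; only the final message differs.
console.log(requestA[0].content === requestB[0].content); // true
```

Putting the date in the system prompt instead would change the very first message on every request and defeat the prefix match.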
Most models cache automatically
If a model is not listed in the sections below, it uses implicit caching -
no cache_control or extra configuration needed. Just send your request
normally and the provider handles caching automatically.
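In client code, this default can be expressed as a lookup that falls back to implicit caching. The sketch below is an assumption-heavy illustration: `cacheModeFor` is a hypothetical helper, and matching on model ID patterns is a simplification of the per-model details given in the provider sections.

```typescript
type CacheMode = 'implicit' | 'explicit' | 'hybrid';

// Coarse, family-level rules based on the provider sections of this page.
// The ID patterns are assumptions; check each model's entry before relying on them.
const cacheModeRules: Array<{ pattern: RegExp; mode: CacheMode }> = [
  { pattern: /^claude-/, mode: 'explicit' },
  { pattern: /^qwen/, mode: 'explicit' },
  { pattern: /^gemini-/, mode: 'hybrid' }, // varies by model
];

// Models that match no rule are treated as implicit: no cache_control needed.
function cacheModeFor(model: string): CacheMode {
  for (const rule of cacheModeRules) {
    if (rule.pattern.test(model)) {
      return rule.mode;
    }
  }
  return 'implicit';
}

console.log(cacheModeFor('claude-sonnet-4-5')); // explicit
console.log(cacheModeFor('some-other-model'));  // implicit
```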
Anthropic Claude
Claude uses explicit caching - you opt in by adding cache_control to your
request. Cache reads are billed at 0.1x the normal input price, making it very
cheap to reuse long context across turns.
| Type | Price |
|---|---|
| Cache write (5-minute TTL) | 1.25x input price |
| Cache write (1-hour TTL) | 2x input price |
| Cache read | 0.1x input price |
Claude supports two caching modes:
- Automatic - Add cache_control at the top level of the request body. The API automatically places the cache breakpoint at the last cacheable block and shifts it forward as the conversation grows. Best for multi-turn chat.
- Explicit breakpoints - Add cache_control directly inside individual content blocks. Up to four breakpoints per request. Best for caching specific large content like RAG chunks, documents, or character cards.
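Because explicit breakpoints are capped at four per request, a pre-flight check can catch mistakes before the request is sent. The following is a sketch with hypothetical names (`countBreakpoints`, `assertBreakpointLimit`) and a simplified message type that models only text blocks.

```typescript
interface TextBlock {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral'; ttl?: '1h' };
}

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string | TextBlock[];
}

const MAX_BREAKPOINTS = 4; // limit stated above

// Count content blocks that carry a cache_control marker.
function countBreakpoints(messages: Message[]): number {
  let count = 0;
  for (const message of messages) {
    if (typeof message.content === 'string') continue; // plain strings carry no markers
    for (const block of message.content) {
      if (block.cache_control) count++;
    }
  }
  return count;
}

// Throw early instead of letting the request fail or behave unexpectedly.
function assertBreakpointLimit(messages: Message[]): void {
  const found = countBreakpoints(messages);
  if (found > MAX_BREAKPOINTS) {
    throw new Error(
      `Found ${found} cache breakpoints; at most ${MAX_BREAKPOINTS} are allowed per request`
    );
  }
}

const reviewRequest: Message[] = [
  {
    role: 'system',
    content: [
      { type: 'text', text: 'You are a code review assistant.' },
      { type: 'text', text: '(long guidelines)', cache_control: { type: 'ephemeral' } },
    ],
  },
  { role: 'user', content: 'Review this pull request: ...' },
];

console.log(countBreakpoints(reviewRequest)); // 1
assertBreakpointLimit(reviewRequest);         // does not throw
```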
Responses API
The Responses API only supports automatic caching via top-level cache_control.
Explicit per-block cache breakpoints are not exposed through the Responses API -
use the Chat Completions or Anthropic Messages API if you need fine-grained
breakpoints.
Minimum token requirements
Each model has a minimum cacheable prompt length. Prompts shorter than the minimum will not be cached.
| Min Tokens | Models |
|---|---|
| 4,096 | Claude Opus 4.8, Claude Opus 4.7, Claude Opus 4.6, Claude Opus 4.5, Claude Haiku 4.5 |
| 1,024 | Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4 |
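The table above can be turned into a quick eligibility check. In this sketch the map is keyed by the display names from the table, `isCacheable` is a hypothetical helper, and the token count is assumed to come from your own tokenizer or from a previous response's usage object.

```typescript
// Minimums transcribed from the table above, keyed by display name.
const minCacheableTokens: Record<string, number> = {
  'Claude Opus 4.8': 4096,
  'Claude Opus 4.7': 4096,
  'Claude Opus 4.6': 4096,
  'Claude Opus 4.5': 4096,
  'Claude Haiku 4.5': 4096,
  'Claude Sonnet 4.6': 1024,
  'Claude Sonnet 4.5': 1024,
  'Claude Opus 4.1': 1024,
  'Claude Opus 4': 1024,
  'Claude Sonnet 4': 1024,
};

// True when a prompt of this size meets the model's minimum cacheable length.
function isCacheable(model: string, promptTokens: number): boolean {
  const minimum = minCacheableTokens[model];
  if (minimum === undefined) {
    throw new Error(`No minimum token requirement known for ${model}`);
  }
  return promptTokens >= minimum;
}

// The same 1,500-token prompt qualifies on one model but not on another.
console.log(isCacheable('Claude Sonnet 4.5', 1500)); // true
console.log(isCacheable('Claude Haiku 4.5', 1500));  // false
```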
Cache TTL options
- 5 minutes (default): "cache_control": { "type": "ephemeral" }
- 1 hour: "cache_control": { "type": "ephemeral", "ttl": "1h" }
The 1-hour TTL has a higher write cost (2x vs 1.25x) but keeps the cache alive longer, useful for long sessions where users take breaks between turns. For short burst sessions, the default 5-minute TTL is usually cheaper overall.
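The trade-off can be checked with a rough simulation. The sketch below is a simplified model, not a billing calculator: `prefixCost` is a hypothetical helper, it assumes the TTL is refreshed on every cache hit, it counts only the cached prefix, and the gap patterns are made-up sample data.

```typescript
interface TtlOption {
  ttlMinutes: number;
  writeMultiplier: number;
}

const READ_MULTIPLIER = 0.1;
const fiveMinuteTtl: TtlOption = { ttlMinutes: 5, writeMultiplier: 1.25 };
const oneHourTtl: TtlOption = { ttlMinutes: 60, writeMultiplier: 2 };

// Cost of the cached prefix across a session, in full-price token units.
// gapsMinutes[i] is the idle time before each follow-up request.
function prefixCost(
  prefixTokens: number,
  gapsMinutes: number[],
  option: TtlOption
): number {
  let cost = prefixTokens * option.writeMultiplier; // the first request always writes
  for (const gap of gapsMinutes) {
    // Assumption: a hit resets the TTL; a gap at or past the TTL forces a rewrite.
    const hit = gap < option.ttlMinutes;
    cost += prefixTokens * (hit ? READ_MULTIPLIER : option.writeMultiplier);
  }
  return cost;
}

const burstGaps = [1, 1, 2, 1];     // quick back-and-forth
const breakGaps = [10, 20, 15, 30]; // user pauses between turns

console.log(prefixCost(10000, burstGaps, fiveMinuteTtl)); // 16500
console.log(prefixCost(10000, burstGaps, oneHourTtl));    // 24000
console.log(prefixCost(10000, breakGaps, fiveMinuteTtl)); // 62500
console.log(prefixCost(10000, breakGaps, oneHourTtl));    // 24000
```

In the burst session every follow-up hits the cache under either TTL, so the cheaper 5-minute write wins. With longer pauses, the 5-minute cache expires before every turn and each request pays the write price again.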
Examples
Automatic caching (recommended for multi-turn conversations)
Add cache_control at the top level of the request. The API will figure out
where to place the breakpoint automatically and keep advancing it as the
conversation grows - no changes needed per turn.
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'https://api.aivene.com/v1',
apiKey: process.env.AIVENE_API_KEY
});
const response = await client.chat.completions.create({
model: 'claude-sonnet-4-5',
// @ts-expect-error - top-level cache_control is a Claude extension
cache_control: { type: 'ephemeral' },
messages: [
{
role: 'system',
content: `You are a customer support agent for Acme Corp.
## Product Catalog
... (very long product list) ...
## Return Policy
... (detailed policy text) ...`
},
{
role: 'user',
content: 'Can I return a product I bought 45 days ago?'
}
]
});
With 1-hour TTL for longer sessions:
const response = await client.chat.completions.create({
model: 'claude-sonnet-4-5',
// @ts-expect-error - top-level cache_control is a Claude extension
cache_control: { type: 'ephemeral', ttl: '1h' },
messages: [
{
role: 'system',
content: 'You are a legal document assistant...'
},
{
role: 'user',
content: 'Summarize the liability section.'
}
]
});
Explicit cache breakpoints (fine-grained control)
Place cache_control on the specific content block you want cached. The
breakpoint tells Claude to cache everything up to and including that block.
Caching a large system prompt:
const response = await client.chat.completions.create({
model: 'claude-sonnet-4-5',
messages: [
{
role: 'system',
content: [
{
type: 'text',
text: 'You are a code review assistant. Follow these guidelines:'
},
{
type: 'text',
// This block is large and static - cache it
text: `## Security
- Never trust user input without validation
... (hundreds of lines of guidelines) ...
## Performance
- Avoid N+1 queries
...`,
// @ts-expect-error - cache_control is a Claude extension
cache_control: { type: 'ephemeral' }
}
]
},
{
role: 'user',
content: 'Review this pull request: ...'
}
]
});
Caching a large user-provided document with 1-hour TTL:
const response = await client.chat.completions.create({
model: 'claude-sonnet-4-5',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'Based on the contract below:' },
{
type: 'text',
text: contractText, // large document
// @ts-expect-error - cache_control is a Claude extension
cache_control: { type: 'ephemeral', ttl: '1h' }
},
{ type: 'text', text: 'List all payment terms mentioned.' }
]
}
]
});
Google Gemini
Gemini supports both implicit and explicit caching depending on the model.
Implicit caching
Gemini handles caching automatically - just send your request normally and the provider caches the common prefix when the prompt is long enough.
- No cache write cost
- Cache reads billed at 0.25x input price
- TTL averages 3–5 minutes
Explicit caching
For Gemini models without implicit caching (see the Supported models table), you need to add cache_control breakpoints yourself.
Only the last breakpoint in your request is used - additional ones are
ignored but won't break anything (useful when sharing the same request format
with Claude).
- Cache write cost: input token price + 5 minutes of storage
- Cache reads billed at 0.25x input price
- TTL is 5 minutes
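When a request built for Claude carries several breakpoints, it can be useful to know which one Gemini will actually honor. The sketch below uses a hypothetical helper, `effectiveBreakpoint`, and a simplified message type; it only locates the last marker and does not change the request.

```typescript
interface TextBlock {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
}

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string | TextBlock[];
}

interface BreakpointPosition {
  messageIndex: number;
  blockIndex: number;
}

// Position of the last block carrying cache_control, or null if there is none.
// Per the rule above, this is the only breakpoint that is used.
function effectiveBreakpoint(messages: Message[]): BreakpointPosition | null {
  let last: BreakpointPosition | null = null;
  for (let m = 0; m < messages.length; m++) {
    const content = messages[m].content;
    if (typeof content === 'string') continue; // plain strings carry no markers
    for (let b = 0; b < content.length; b++) {
      if (content[b].cache_control) {
        last = { messageIndex: m, blockIndex: b };
      }
    }
  }
  return last;
}

const sharedFormatRequest: Message[] = [
  {
    role: 'system',
    content: [
      { type: 'text', text: '(long instructions)', cache_control: { type: 'ephemeral' } },
    ],
  },
  {
    role: 'user',
    content: [
      { type: 'text', text: 'Using the data below:' },
      { type: 'text', text: '(large RAG context)', cache_control: { type: 'ephemeral' } },
      { type: 'text', text: 'Which entries have a value above 1000?' },
    ],
  },
];

// The marker on the RAG context wins; the one on the system prompt is ignored.
console.log(effectiveBreakpoint(sharedFormatRequest)); // { messageIndex: 1, blockIndex: 1 }
```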
System message caching limitation
Gemini has a single systemInstruction field which is treated as immutable
when cached. If you need part of the system prompt to stay dynamic, move that
content into a user message instead.
Supported models
| Model | Min Tokens | Implicit | Explicit |
|---|---|---|---|
| gemini-2.5-flash | 2,048 | Yes | Yes |
| gemini-2.5-flash-lite | 2,048 | No | Low hit rate |
| gemini-2.5-pro | 2,048 | Yes | Yes |
| gemini-3.1-pro-preview | 4,096 | No | Yes |
| gemini-3.5-flash | 4,096 | No | Yes |
Examples
System message caching
Cache a large static system prompt so it's not reprocessed on every request:
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'https://api.aivene.com/v1',
apiKey: process.env.AIVENE_API_KEY
});
const response = await client.chat.completions.create({
model: 'gemini-2.5-flash',
messages: [
{
role: 'system',
content: [
{
type: 'text',
text: 'You are a document analysis assistant.'
},
{
type: 'text',
text: fullDocumentText, // large static document
// @ts-expect-error - cache_control is a provider extension
cache_control: { type: 'ephemeral' }
}
]
},
{
role: 'user',
content: 'What are the key obligations in section 3?'
}
]
});
User message caching
Useful for RAG scenarios where the retrieved context is large and reused across follow-up questions:
const response = await client.chat.completions.create({
model: 'gemini-2.5-flash',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'Using the data below:' },
{
type: 'text',
text: retrievedChunks, // large RAG context
// @ts-expect-error - cache_control is a provider extension
cache_control: { type: 'ephemeral' }
},
{ type: 'text', text: 'Which entries have a value above 1000?' }
]
}
]
});
Alibaba Qwen
Qwen only supports explicit caching - you must add cache_control to each
content block you want cached. Without it, no caching occurs at all.
| Type | Price |
|---|---|
| Cache write | 1.25x input price |
| Cache read | 0.1x input price |
Cache writes use a 5-minute TTL. The cache_control syntax is identical to
Anthropic's explicit breakpoints.
Example
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'https://api.aivene.com/v1',
apiKey: process.env.AIVENE_API_KEY
});
const response = await client.chat.completions.create({
model: 'qwen3.7-max',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'Use the reference below when answering.' },
{
type: 'text',
text: referenceText, // large static content
// @ts-expect-error - cache_control is a provider extension
cache_control: { type: 'ephemeral' }
},
{ type: 'text', text: 'Summarize the main implementation details.' }
]
}
]
});