Pricing model
How a single request is priced - tokens, input vs output, prompt caching discounts, and built-in tool fees.
Every request through the gateway is priced from one thing: the token usage that the provider reports back. There is no per-request fee, no minimum charge, and no flat rate per model. Cost is computed from your actual input and output - nothing else.
What a token is
A token is roughly 4 characters of English text, or about three-quarters of a word. "Hello, world!" is around 4 tokens. A typical paragraph is 80-150 tokens. Code, JSON, and non-Latin scripts can be denser or sparser depending on the tokenizer the model uses.
Nearly every hosted text model is priced per token - this is not specific to Aivene.
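The 4-characters-per-token rule of thumb can be sketched as a quick estimator. This is only a heuristic: the function name `estimate_tokens` and the default ratio are illustrative assumptions, and a real tokenizer will give different counts for code, JSON, or non-Latin text.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English rule of thumb)."""
    if not text:
        return 0
    # Round up so short strings never estimate to zero tokens.
    return math.ceil(len(text) / chars_per_token)


# 13 characters / 4 = 3.25, rounded up to 4
print(estimate_tokens("Hello, world!"))
```

Treat the result as a ballpark for budgeting, not as the number you will be billed on.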
Input vs output
Each model has two prices, both quoted per million tokens (MTok):
| Direction | What it counts | Typical relative cost |
|---|---|---|
| Input | System prompt + user messages + tool definitions + history | Cheaper |
| Output | What the model writes back | Typically 2x to 5x the input price |
This is why a request that returns a long essay costs much more than one that returns a single sentence, even when the prompt is identical.
The formula
The base cost of a single request is:
cost = (input_tokens / 1,000,000) * input_price
     + (output_tokens / 1,000,000) * output_price

That is the entire model. Everything else - caching discounts, tool fees, discount tiers - is a modifier on top of that formula.
Worked example
Say a model is priced at $3.00 / MTok input and $15.00 / MTok output. You send a request with 2,000 input tokens, and the model replies with 500 output tokens:
input cost  = 2,000 / 1,000,000 * $3.00  = $0.006
output cost =   500 / 1,000,000 * $15.00 = $0.0075
total       = $0.0135

One request, just over a cent.
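The same arithmetic as a small function, in case you want to plug in your own numbers. This is a minimal sketch of the base formula only; `request_cost` is an illustrative name, not a gateway API, and prices are assumed to be in dollars per million tokens.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Base cost of one request; prices are in dollars per MTok."""
    input_cost = (input_tokens / 1_000_000) * input_price
    output_cost = (output_tokens / 1_000_000) * output_price
    return input_cost + output_cost


# The worked example: 2,000 input tokens, 500 output tokens, $3.00 / $15.00 per MTok
cost = request_cost(2_000, 500, input_price=3.00, output_price=15.00)
print(f"${cost:.4f}")  # $0.0135
```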
If your prompts and responses stay roughly that size, a $10 balance gives you around 740 requests on that model. Swap to a model priced at $0.20 / $0.80 per MTok and each request costs $0.0008, so the same $10 stretches to roughly 12,500 requests - same dollars, different model, ~17x the volume.
This is why we cannot publish a "requests per dollar" number. It depends on your prompts and the model you choose, not on the gateway.
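To see how far a balance goes for your own request shape, the base formula can be inverted. The sketch below assumes every request has the same token counts, which real traffic will not; `requests_per_balance` is a hypothetical helper, and `Decimal` is used only to keep the division free of floating-point rounding.

```python
from decimal import Decimal


def requests_per_balance(balance: str, input_tokens: int, output_tokens: int,
                         input_price: str, output_price: str) -> int:
    """Whole requests a balance covers, given fixed token counts per request.

    Prices are dollars per MTok, passed as strings so Decimal stays exact.
    """
    per_request = (
        Decimal(input_tokens) * Decimal(input_price)
        + Decimal(output_tokens) * Decimal(output_price)
    ) / Decimal(1_000_000)
    return int(Decimal(balance) / per_request)  # truncate partial requests


# Same request shape (2,000 in / 500 out), two different price points
print(requests_per_balance("10", 2_000, 500, "3.00", "15.00"))  # 740
print(requests_per_balance("10", 2_000, 500, "0.20", "0.80"))   # 12500
```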
Prompt caching: cheaper repeat input
When you send the same prefix to a model more than once, most providers cache it and bill those repeat input tokens at a steep discount. Cached input is reported in the response as:
{
  "usage": {
    "prompt_tokens": 2000,
    "completion_tokens": 500,
    "prompt_tokens_details": {
      "cached_tokens": 1800
    }
  }
}

Of the 2,000 input tokens above, 1,800 hit the cache and are charged at the discounted rate; only 200 are billed at the full input price. See Prompt Caching for which providers support this and how big the discount is.
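A sketch of how the cached and uncached portions split, reading the `usage` object shown above. The cached price of $0.30 per MTok is a made-up figure for illustration only; the actual discount varies by provider and model, and `input_cost_with_cache` is a hypothetical helper name.

```python
def input_cost_with_cache(usage: dict, input_price: float,
                          cached_price: float) -> float:
    """Input-side cost when part of the prompt was served from cache.

    Prices are dollars per MTok. cached_tokens is counted inside
    prompt_tokens, so the uncached share is the difference.
    """
    prompt = usage["prompt_tokens"]
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    uncached = prompt - cached
    return (uncached * input_price + cached * cached_price) / 1_000_000


usage = {
    "prompt_tokens": 2000,
    "completion_tokens": 500,
    "prompt_tokens_details": {"cached_tokens": 1800},
}

# Assumed prices: $3.00 / MTok full input, $0.30 / MTok cached (illustrative)
print(f"${input_cost_with_cache(usage, 3.00, 0.30):.5f}")  # $0.00114
```

Without the cache hit, the same 2,000 input tokens would cost $0.006 at the full rate.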
Built-in tool fees
If your request invokes one of our built-in tools - web search, web fetch, or code interpreter - that tool has its own per-call price that is added on top of the model tokens for that request.
total_cost = model_cost + tool_cost

Tool prices are listed on each tool's page under Built-in Tools. You only pay the tool fee when the model actually calls it; defining a tool but never invoking it costs nothing extra.
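As a sketch, tool fees can be modeled as a per-call price multiplied by the number of calls the model actually made. The tool keys and prices below are placeholders, not the published rates; check each tool's page for the real numbers.

```python
def total_cost(model_cost: float, tool_calls: dict[str, int],
               tool_prices: dict[str, float]) -> float:
    """Model token cost plus a per-call fee for each tool invocation."""
    tool_cost = sum(tool_prices[name] * count
                    for name, count in tool_calls.items())
    return model_cost + tool_cost


# Placeholder per-call prices in dollars (assumptions for illustration)
prices = {"web_search": 0.01, "web_fetch": 0.005, "code_interpreter": 0.03}

# The model called web search twice during this request
print(f"${total_cost(0.0135, {'web_search': 2}, prices):.4f}")  # $0.0335

# Tools defined but never invoked: no extra charge
print(f"${total_cost(0.0135, {}, prices):.4f}")  # $0.0135
```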
Other modalities
Tokens are not the only billable unit. Depending on the endpoint:
| Endpoint | Billable units |
|---|---|
| Chat & Completions | input tokens, output tokens, cached tokens, cache write tokens |
| Embeddings | input tokens |
| Images | per image (resolution and quality factored in) |
| Transcriptions | audio seconds |
| Speech | characters |
Whatever the unit, the cost still comes from numbers the provider returns in the response, multiplied by the model's published per-unit price.
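The "units times per-unit price" rule generalizes to any endpoint. The sketch below is one way to express it; the unit names and prices are invented for illustration, and token prices quoted per MTok would need dividing by 1,000,000 first.

```python
def cost_from_units(units: dict[str, float],
                    unit_prices: dict[str, float]) -> float:
    """Sum of (reported quantity * per-unit price) over every billable unit.

    Raises KeyError if a reported unit has no price, rather than
    silently treating it as free.
    """
    return sum(qty * unit_prices[name] for name, qty in units.items())


# Hypothetical transcription: 90 audio seconds at $0.0001 per second
print(f"${cost_from_units({'audio_seconds': 90}, {'audio_seconds': 0.0001}):.4f}")  # $0.0090

# Hypothetical speech request: 1,200 characters at $0.000015 per character
print(f"${cost_from_units({'characters': 1200}, {'characters': 0.000015}):.4f}")  # $0.0180
```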
Reading what you paid
Every response includes a usage object describing exactly what was
counted. In the Console:
- Logs shows the per-request token breakdown and the dollar amount the gateway subtracted from your balance.
- Usage rolls those numbers up by model, key, and date so you can spot which model or which key is eating the budget.
Estimate before you commit
Use the OpenAI tokenizer (or any per-provider tokenizer) to count
tokens on a sample prompt offline. Multiply by the model's per-token
price for a realistic pre-flight estimate. Then run 10-20 real
requests, look at the actual usage in the response, and extrapolate
from there - it is the only way to get a number that matches your
real prompts.
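The extrapolation step can be sketched as averaging the cost of your sample requests and scaling by expected volume. The sample `usage` values below are invented, the helper names are hypothetical, and the calculation ignores caching discounts and tool fees, so treat the result as a rough upper-level estimate.

```python
from statistics import mean


def average_request_cost(usages: list[dict], input_price: float,
                         output_price: float) -> float:
    """Mean base cost across sampled usage objects (prices in $ per MTok)."""
    costs = [
        (u["prompt_tokens"] * input_price
         + u["completion_tokens"] * output_price) / 1_000_000
        for u in usages
    ]
    return mean(costs)


def projected_spend(usages: list[dict], input_price: float,
                    output_price: float, expected_requests: int) -> float:
    """Scale the sampled average up to an expected request volume."""
    return average_request_cost(usages, input_price, output_price) * expected_requests


# Usage objects collected from a handful of real requests (made-up numbers)
samples = [
    {"prompt_tokens": 2000, "completion_tokens": 500},
    {"prompt_tokens": 1800, "completion_tokens": 650},
    {"prompt_tokens": 2200, "completion_tokens": 350},
]

# Projected cost of 10,000 similar requests at $3.00 / $15.00 per MTok
print(f"${projected_spend(samples, 3.00, 15.00, 10_000):.2f}")  # $135.00
```

The more varied your real prompts are, the more samples you need before the average is worth trusting.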