Pricing model

How a single request is priced - tokens, input vs output, prompt caching discounts, and built-in tool fees.

Every request through the gateway is priced from one thing: the usage that the provider reports back, which for text models means tokens. There is no per-request fee, no minimum charge, and no flat rate per model. Cost is computed from your actual input and output, plus a per-call fee only when the request invokes a built-in tool.

What a token is

A token is roughly 4 characters of English text, or about three-quarters of a word. "Hello, world!" is around 4 tokens. A typical paragraph is 80-150 tokens. Code, JSON, and non-Latin scripts can be denser or sparser depending on the tokenizer the model uses.
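For a quick sanity check before reaching for a real tokenizer, the 4-characters-per-token rule of thumb can be sketched as below. The function name `estimate_tokens` and the `chars_per_token` parameter are illustrative assumptions, and the result is only a rough guess for English text; actual counts depend on the tokenizer the model uses.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count; this is not a real tokenizer."""
    if not text:
        return 0
    # Round up so that any non-empty text counts as at least one token.
    return max(1, math.ceil(len(text) / chars_per_token))

print(estimate_tokens("Hello, world!"))  # prints 4 (13 characters / 4, rounded up)
```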

Token-based pricing is the norm for hosted text models - it is not specific to Aivene.

Input vs output

Each model has two prices, both quoted per million tokens (MTok):

Direction	What it counts	Typical relative cost
Input	System prompt + user messages + tool definitions + history	Cheaper
Output	What the model writes back	2x to 5x more than input

This is why a request that returns a long essay costs much more than one that returns a single sentence, even when the prompt is identical.

The formula

The base cost of a single request is:

cost = (input_tokens  / 1,000,000) * input_price
     + (output_tokens / 1,000,000) * output_price

That is the entire model. Everything else - caching discounts, tool fees, discount tiers - is a modifier on top of that formula.

Worked example

Say a model is priced at $3.00 / MTok input and $15.00 / MTok output. You send a request with 2,000 input tokens, and the model replies with 500 output tokens:

input cost  = 2,000 / 1,000,000 * $3.00  = $0.006
output cost = 500   / 1,000,000 * $15.00 = $0.0075
total       = $0.0135

One request, just over a cent.
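The formula and the worked example above translate directly into a few lines of Python. This is a minimal sketch: `request_cost` is a hypothetical helper name, not part of any SDK, and the prices are the example prices from this page.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Base cost in dollars; both prices are quoted per million tokens (MTok)."""
    input_cost = (input_tokens / 1_000_000) * input_price
    output_cost = (output_tokens / 1_000_000) * output_price
    return input_cost + output_cost

cost = request_cost(2_000, 500, input_price=3.00, output_price=15.00)
print(f"${cost:.4f}")  # prints $0.0135
```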

If your prompts and responses stay roughly that size, a $10 balance gives you around 740 requests on that model. Swap to a model priced at $0.20 / $0.80 per MTok and each request costs $0.0008, so the same $10 stretches to roughly 12,500 requests - same dollars, different model, ~17x the volume.

This is why we cannot publish a "requests per dollar" number. It is a property of your prompts and the model you pick, not of the gateway.
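To see how much the answer moves with the model, the budget arithmetic above can be reproduced with a short sketch. `requests_per_budget` is a hypothetical helper, the two price pairs are the example prices from this page, and `Decimal` is used only so the division is not skewed by floating-point rounding.

```python
from decimal import Decimal

MTOK = Decimal(1_000_000)

def requests_per_budget(budget, input_tokens, output_tokens,
                        input_price, output_price) -> int:
    """Whole requests a balance covers, assuming every request is the same size."""
    cost = ((Decimal(input_tokens) / MTOK) * Decimal(input_price)
            + (Decimal(output_tokens) / MTOK) * Decimal(output_price))
    return int(Decimal(budget) // cost)

# Prices are passed as strings so Decimal stores them exactly.
print(requests_per_budget("10", 2_000, 500, "3.00", "15.00"))  # prints 740
print(requests_per_budget("10", 2_000, 500, "0.20", "0.80"))   # prints 12500
```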

Prompt caching: cheaper repeat input

When you send the same prefix to a model more than once, most providers cache it and bill those repeat input tokens at a steep discount. Cached input is reported in the response as:

{
  "usage": {
    "prompt_tokens": 2000,
    "completion_tokens": 500,
    "prompt_tokens_details": {
      "cached_tokens": 1800
    }
  }
}

Of the 2,000 input tokens above, 1,800 hit the cache and are charged at the discounted rate; only 200 are billed at the full input price. See Prompt Caching for which providers support this and how big the discount is.
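As a rough sketch of how that usage object turns into a bill, the snippet below splits the prompt into cached and uncached tokens. The cached-input price of $0.30 / MTok is a made-up placeholder, not a published rate; `cost_with_cache` is a hypothetical helper; and cache write tokens, which are a separate billable unit, are left out.

```python
import json

response_body = """
{
  "usage": {
    "prompt_tokens": 2000,
    "completion_tokens": 500,
    "prompt_tokens_details": {"cached_tokens": 1800}
  }
}
"""

def cost_with_cache(usage, input_price, cached_input_price, output_price):
    """Dollar cost when part of the prompt is billed at a cached rate (per MTok)."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    uncached = usage["prompt_tokens"] - cached  # billed at the full input price
    total = (uncached * input_price
             + cached * cached_input_price
             + usage["completion_tokens"] * output_price)
    return total / 1_000_000

usage = json.loads(response_body)["usage"]
# 0.30 is a hypothetical cached-input price, chosen only for illustration.
cost = cost_with_cache(usage, input_price=3.00, cached_input_price=0.30,
                       output_price=15.00)
print(f"${cost:.5f}")  # prints $0.00864
```

At these example prices the same request with no cache hits would cost $0.0135, so the saving comes entirely from the 1,800 cached input tokens.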

Built-in tool fees

If your request invokes one of our built-in tools - web search, web fetch, or code interpreter - that tool has its own per-call price that is added on top of the model tokens for that request.

total_cost = model_cost + tool_cost

Tool prices are listed on each tool's page under Built-in Tools. You only pay the tool fee when the model actually calls it; defining a tool but never invoking it costs nothing extra.
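The additive rule can be sketched as follows. The tool identifiers and per-call prices here are placeholders invented for the example (the real prices are on each tool's page), and `total_cost` is a hypothetical helper.

```python
# Hypothetical per-call prices in dollars; look up the real ones under Built-in Tools.
TOOL_PRICES = {
    "web_search": 0.010,
    "web_fetch": 0.005,
    "code_interpreter": 0.030,
}

def total_cost(model_cost: float, tool_calls: dict) -> float:
    """model_cost plus a per-call fee for each tool the model actually invoked."""
    tool_cost = sum(TOOL_PRICES[name] * calls for name, calls in tool_calls.items())
    return model_cost + tool_cost

# Two web searches on top of a $0.0135 model cost.
with_search = total_cost(0.0135, {"web_search": 2})
print(f"${with_search:.4f}")  # prints $0.0335

# A tool that was defined but never called adds nothing.
unused_tool = total_cost(0.0135, {"web_search": 0})
print(f"${unused_tool:.4f}")  # prints $0.0135
```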

Other modalities

Tokens are not the only billable unit. Depending on the endpoint:

Endpoint	Billable units
Chat & Completions	input tokens, output tokens, cached tokens, cache write tokens
Embeddings	input tokens
Images	per image (resolution and quality factored in)
Transcriptions	audio seconds
Speech	characters

Whatever the unit, the cost still comes from numbers the provider returns in the response, multiplied by the model's published per-unit price.

Reading what you paid

Every response includes a usage object describing exactly what was counted. In the Console:

Logs shows the per-request token breakdown and the dollar amount the gateway subtracted from your balance.
Usage rolls those numbers up by model, key, and date so you can spot which model or which key is eating the budget.
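If you collect per-request costs yourself and want the same kind of rollup offline, the grouping can be sketched as below. The record fields (`model`, `key`, `date`, `cost`) and the sample values are assumptions made for the example, not the Console's actual schema.

```python
from collections import defaultdict

# Hypothetical log records; field names and values are illustrative only.
logs = [
    {"model": "model-a", "key": "key-1", "date": "2025-01-01", "cost": 0.0135},
    {"model": "model-a", "key": "key-2", "date": "2025-01-01", "cost": 0.0270},
    {"model": "model-b", "key": "key-1", "date": "2025-01-02", "cost": 0.0008},
]

def rollup(records, field):
    """Sum cost per distinct value of `field`, largest spender first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[field]] += record["cost"]
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))

for model, spend in rollup(logs, "model").items():
    print(f"{model}: ${spend:.4f}")
```

Passing `"key"` or `"date"` instead of `"model"` groups the same records along a different axis.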

Estimate before you commit

Use the OpenAI tokenizer (or any per-provider tokenizer) to count tokens on a sample prompt offline. Divide the count by 1,000,000 and multiply by the model's per-MTok input price for a pre-flight estimate of the input side; the output length is not known yet, so you have to guess it at this stage. Then run 10-20 real requests, look at the actual usage in the response, and extrapolate from there - it is the only way to get a number that matches your real prompts.
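Once you have a handful of real responses, the extrapolation step can be sketched like this. The sample usage numbers are invented, `projected_cost` is a hypothetical helper, and the prices are the example prices used earlier on this page.

```python
def projected_cost(samples, input_price, output_price, expected_requests):
    """Average the cost of sampled requests and scale it to an expected volume."""
    if not samples:
        raise ValueError("need at least one sample")
    per_request = [
        (u["prompt_tokens"] * input_price
         + u["completion_tokens"] * output_price) / 1_000_000
        for u in samples
    ]
    average = sum(per_request) / len(per_request)
    return average * expected_requests

# Invented usage objects standing in for real responses.
samples = [
    {"prompt_tokens": 1900, "completion_tokens": 450},
    {"prompt_tokens": 2100, "completion_tokens": 550},
    {"prompt_tokens": 2000, "completion_tokens": 500},
]
estimate = projected_cost(samples, 3.00, 15.00, expected_requests=10_000)
print(f"${estimate:.2f}")  # prints $135.00
```

The more the sampled requests resemble production traffic, the closer the projection will be; three samples are shown only to keep the example short.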