2026-05-17 HexSaga

How AI API Billing Works

A practical explanation of AI API usage billing, including input tokens, output tokens, model prices, relay multipliers, cache discounts, minimum charges, and batch-job estimates.

Using an AI API is different from paying for a web subscription. A subscription usually buys a product experience for a month. An API is closer to a meter: each request has input, output, a model, possible cache behavior, and a final usage record.

If tokens are still unclear, start with "What Is an AI Token? A Practical Guide." This article focuses on the billing side: how API usage turns into cost, and why the balance deduction you see may not match a quick mental estimate.

The short version is that API billing usually has several layers:

Layer            | What you may see                    | What affects cost
-----------------|-------------------------------------|------------------------------------------------------------------
Model usage      | Input tokens, output tokens         | How much context you send and how much the model generates
Model price      | Price per 1 million tokens          | Different models and token types have different prices
Platform rules   | Cache, minimum charge, rounding     | The posted formula may not be the whole ledger
Relay conversion | Balance, exchange rate, multiplier  | A relay station converts model usage into its own balance system

All prices in this article are hypothetical. They explain the calculation method only. They are not live prices for any model or provider.

Input tokens and output tokens are separate line items

An API request usually has two token counts:

  • Input tokens: everything sent to the model, including system prompts, user messages, chat history, document snippets, code context, retrieved passages, and tool results.
  • Output tokens: everything generated by the model, such as the answer, JSON, code, summaries, labels, or extracted fields.

Input tokens are easy to underestimate because the visible user message may be short. The real request can also contain a long system prompt, fixed formatting rules, earlier conversation history, retrieved documents, or logs returned by tools.

For example, in a customer-support bot, the user may only ask, "What is the refund policy?" The actual request may include:

  • the bot's role instructions
  • recent conversation history
  • retrieved refund-policy text
  • response format requirements
  • the current user question

All of that is input. The answer generated by the model is output.

The basic formula is usually:

Total cost = input tokens / 1,000,000 * input price
           + output tokens / 1,000,000 * output price

Some platforms show prices per 1,000 tokens, while others show prices per 1 million tokens. The math is the same; only the unit changes.
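The formula and the unit conversion can be sketched in a few lines of Python. This is a minimal illustration, not any provider's billing code: the function names and the prices in the example are hypothetical.

```python
TOKENS_PER_MILLION = 1_000_000

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimated cost of one request, with prices quoted per 1 million tokens."""
    input_cost = input_tokens / TOKENS_PER_MILLION * input_price_per_m
    output_cost = output_tokens / TOKENS_PER_MILLION * output_price_per_m
    return input_cost + output_cost

def per_thousand_to_per_million(price_per_k: float) -> float:
    """Convert a price quoted per 1,000 tokens into the per-million unit."""
    return price_per_k * 1_000

# Hypothetical prices: $0.002 per 1,000 input tokens, $6 per 1 million output tokens.
input_price = per_thousand_to_per_million(0.002)  # same price, now per 1 million
cost = request_cost(1_500, 500, input_price, 6.0)
print(f"Estimated cost: ${cost:.4f}")
```

Converting every price to one unit before doing the arithmetic avoids the most common estimation slip, which is mixing per-thousand and per-million figures.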

Why output tokens often cost more

Many models price output tokens higher than input tokens. The reason is practical: reading existing text and generating new text are not the same workload.

During input processing, the model reads text that already exists, and it can typically process those tokens in parallel. During output generation, the model has to produce one token after another. Each new token depends on the previous context and the tokens already generated. Longer answers keep the model busy for longer and usually increase latency.

That is why a common pricing pattern looks like this:

  • input is cheaper because the model is reading existing content
  • output is more expensive because the model is writing new content

This is not a universal rule. Each provider can price models differently. But for estimation, a useful rule is: long input raises input cost, long answers raise output cost, and output length is often the part worth controlling first.

A small calculation example

Assume a model has these hypothetical prices:

  • Input: $1 per 1 million tokens
  • Output: $4 per 1 million tokens

One request uses 20,000 input tokens and generates 2,000 output tokens. The estimated cost is:

Input cost = 20,000 / 1,000,000 * 1 = $0.020
Output cost = 2,000 / 1,000,000 * 4 = $0.008
Total cost = $0.028

In this example, output tokens are only one-tenth of the input tokens, but output still accounts for about 29% of the total cost ($0.008 of $0.028) because the output unit price is four times higher.

Using the same hypothetical prices:

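Task                  | Input tokens | Output tokens | Estimated cost | Main driver
----------------------|--------------|---------------|----------------|-----------------
Simple classification | 200          | 5             | $0.00022       | Prompt input
Long-text extraction  | 12,000       | 300           | $0.01320       | Source text
Long-form generation  | 800          | 3,000         | $0.01280       | Generated output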

The point is not the exact number. It is that different tasks spend money in different places. Classification may be dominated by the fixed prompt. Long-document work is often dominated by input. Writing a long answer is dominated by output.
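To see where each task spends its money, the three rows above can be recomputed with the input and output parts kept separate. This is a small sketch using the same hypothetical prices; the function name is illustrative.

```python
INPUT_PRICE_PER_M = 1.0   # hypothetical: $ per 1 million input tokens
OUTPUT_PRICE_PER_M = 4.0  # hypothetical: $ per 1 million output tokens

def cost_breakdown(input_tokens: int, output_tokens: int):
    """Return (input cost, output cost, which side dominates)."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    driver = "input" if input_cost >= output_cost else "output"
    return input_cost, output_cost, driver

tasks = {
    "Simple classification": (200, 5),
    "Long-text extraction": (12_000, 300),
    "Long-form generation": (800, 3_000),
}

for name, (input_tokens, output_tokens) in tasks.items():
    input_cost, output_cost, driver = cost_breakdown(input_tokens, output_tokens)
    total = input_cost + output_cost
    print(f"{name}: ${total:.5f} (mostly {driver})")
```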

Relay balances, exchange rates, and multipliers

If you call an official provider API directly, billing usually revolves around model prices and token usage. If you call models through an AI relay station, there is another conversion layer. For the broader relay-vs-subscription tradeoff, see "Why an AI Relay Station Can Be Cheaper Than AI Subscriptions."

Relay stations may display usage in several ways:

  • balance in local currency, dollars, credits, or points
  • model labels such as 1x, 0.5x, or 2x
  • different multipliers for different routes, quality levels, or supply channels
  • usage shown as platform credits rather than raw provider cost

A multiplier is best understood as the relay's conversion rule. It is not necessarily the official model price, and it is not necessarily an exchange rate. More precisely, it answers this question:

Given this model usage, how many units should be deducted from this platform balance?

Continue with the earlier example. Suppose a request would cost $0.028 under the provider-style calculation. If a relay labels that model as 0.5x, and its balance is displayed as "API dollar credits," the deduction may be close to:

Balance deduction ≈ 0.028 * 0.5 = 0.014 credits

If the balance is displayed in local currency, points, or internal credits, the platform may also apply its own recharge ratio, exchange rate, bonus-credit rule, or discount logic. Relay platforms are not all the same, so do not judge only by the multiplier. Check whether the final usage record is transparent.
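As a rough sketch, the conversion can be modeled as a provider-style cost multiplied by the model multiplier and by one extra factor. The `conversion_factor` parameter is an assumption that stands in for whatever recharge ratio or exchange rate a platform applies; real relays may use rules that do not reduce to a single number.

```python
def relay_deduction(provider_cost: float, multiplier: float,
                    conversion_factor: float = 1.0) -> float:
    """Approximate balance deduction on a relay.

    provider_cost:     cost under the provider-style token calculation
    multiplier:        the relay's label for the model, e.g. 0.5 for "0.5x"
    conversion_factor: hypothetical stand-in for recharge ratio or exchange rate
    """
    return provider_cost * multiplier * conversion_factor

# The 0.5x example from the text, with balance shown as dollar-style credits.
print(f"{relay_deduction(0.028, 0.5):.3f} credits")

# Same usage on a hypothetical platform where 1 dollar-style credit = 10 points.
print(f"{relay_deduction(0.028, 0.5, conversion_factor=10):.2f} points")
```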

A practical way to verify the math is to take one real request and record the model, input tokens, output tokens, cache fields, multiplier, and final deduction. If those fields reconcile, the ledger is understandable. If they consistently do not, be careful.

Cache and read discounts can help, but do not assume them

Some models and platforms support prompt caching. In simple terms, if many requests share the same long prefix, such as a fixed system prompt, fixed document, or fixed code context, the platform may cache that prefix. Later requests that read the same cached content may be billed at a lower rate.

You may see fields such as:

  • cached input
  • cache read
  • prompt cache
  • cache creation
  • cache hit

The names vary by provider, but the idea is similar. The first request may create the cache under one billing rule. Later requests may get a cheaper read price if the cache is hit.

There are several catches:

  • Not every model supports cache discounts.
  • Not every request automatically hits cache.
  • A prefix may need to match exactly.
  • Cache behavior may have minimum length, expiration, and model restrictions.
  • A relay station may or may not pass the discount through clearly.

Caching is useful for batch jobs, long system prompts, and repeated document prefixes. It should not be blindly assumed in a budget. Estimate with non-cached prices first, then treat cache hits as upside.
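One way to express "budget without cache, treat hits as upside" is to compute both numbers side by side. The sketch below assumes cached tokens are a subset of the input tokens and are billed at a separate, lower read price; it ignores any cache-creation charge. The prices and the function name are hypothetical.

```python
def input_cost_with_cache(input_tokens: int, cached_tokens: int,
                          input_price_per_m: float,
                          cached_price_per_m: float) -> float:
    """Input cost when part of the input is read from cache at a lower price."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    return (uncached_tokens * input_price_per_m
            + cached_tokens * cached_price_per_m) / 1_000_000

# Hypothetical: $1 per 1M regular input tokens, $0.25 per 1M cached reads.
budget = input_cost_with_cache(20_000, 0, 1.0, 0.25)         # plan with this
best_case = input_cost_with_cache(20_000, 16_000, 1.0, 0.25)  # if the prefix hits
print(f"budget: ${budget:.3f}, with cache hit: ${best_case:.3f}")
```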

Minimum charges and rounding matter for tiny requests

The clean formula is not always the full bill. Real platforms may have minimum billing units, precision limits, rounding, or a minimum charge per request. For very small requests, your calculated value may be tiny, but the recorded deduction may use the platform's smallest ledger unit.

For example, suppose a platform has a hypothetical minimum charge of 0.0001 credits per request. If a very short request calculates to only 0.000028 credits, the platform may still record 0.0001 credits. This is only an example; it is not a universal rule.
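A minimal sketch of that effect, assuming the platform simply records the larger of the calculated value and a per-request floor. The function name and the 0.0001 floor are hypothetical, and real platforms may round in other ways.

```python
def recorded_deduction(calculated: float, minimum_charge: float = 0.0001) -> float:
    """Deduction recorded under a hypothetical per-request minimum charge."""
    return max(calculated, minimum_charge)

requests = 50_000
calculated_each = 0.000028  # what the clean formula gives per request

calculated_total = calculated_each * requests
recorded_total = recorded_deduction(calculated_each) * requests
print(f"calculated: {calculated_total:.2f} credits, "
      f"recorded: {recorded_total:.2f} credits")
```

Under these assumed numbers, the recorded total is several times the calculated one, even though each individual difference looks negligible.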

This matters for two types of workloads:

  • High-frequency small requests: for example, classifying tens of thousands of short texts one by one. Per-request minimums and overhead can become visible.
  • Failed requests and retries: a request that fails after reaching the model may still be billed for partial usage, and an automatic retry is then billed again.

The usual optimization is not to cram everything into one giant prompt. It is to batch carefully when quality allows: process 10, 20, or 50 items at a time and require a strict JSON array response. That reduces per-request overhead while keeping output length manageable.
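The saving from batching comes mostly from sending the fixed prompt fewer times. The sketch below compares one-by-one processing with batches of 20 under hypothetical token counts and prices; it assumes output length per item stays the same and ignores the small overhead of JSON array syntax.

```python
import math

def job_cost(items: int, batch_size: int, prompt_tokens: int,
             item_input_tokens: int, item_output_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimated cost of a job where each request repeats the fixed prompt once."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    requests = math.ceil(items / batch_size)
    input_tokens = requests * prompt_tokens + items * item_input_tokens
    output_tokens = items * item_output_tokens
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical workload: 10,000 short texts, 300-token fixed prompt,
# 40 input tokens and 10 output tokens per item, $1 / $4 per 1M tokens.
one_by_one = job_cost(10_000, 1, 300, 40, 10, 1.0, 4.0)
batched = job_cost(10_000, 20, 300, 40, 10, 1.0, 4.0)
print(f"one by one: ${one_by_one:.2f}, batches of 20: ${batched:.2f}")
```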

Usage records are the source of truth

Local estimates can get close, but they do not replace the provider's usage records. There are many reasons:

  • Different models use different tokenizers.
  • The app may add hidden system prompts or formatting instructions.
  • Tool calls, retrieved passages, and function results may enter the input.
  • If streaming is stopped early, actual output depends on where it stopped.
  • Cache hits are only visible from provider records.
  • Relay stations may apply multipliers, exchange rates, bonus credits, rounding, and minimum charges.

When investigating a billing question, the useful data is not "the prompt felt short." The useful fields are:

  • request time
  • request id or trace id
  • model name
  • input tokens
  • output tokens
  • cached tokens
  • multiplier or route
  • final deduction

If those fields line up, the cost can usually be explained. If the platform only shows that the balance went down, without a detailed usage record, it is hard to know whether the cause was model price, long output, a cache miss, or an opaque conversion rule.
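A reconciliation check over those fields might look like the sketch below. The record layout, field names, and prices are all hypothetical, and the calculation assumes cached tokens are counted inside input tokens and that the multiplier is the only conversion step.

```python
from dataclasses import dataclass

@dataclass
class UsageRecord:
    request_time: str
    request_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    multiplier: float
    final_deduction: float

def expected_deduction(record: UsageRecord, input_price_per_m: float,
                       output_price_per_m: float,
                       cached_price_per_m: float) -> float:
    """Deduction implied by the record's token counts and multiplier."""
    uncached = record.input_tokens - record.cached_tokens
    provider_cost = (uncached * input_price_per_m
                     + record.cached_tokens * cached_price_per_m
                     + record.output_tokens * output_price_per_m) / 1_000_000
    return provider_cost * record.multiplier

def reconciles(record: UsageRecord, input_price_per_m: float,
               output_price_per_m: float, cached_price_per_m: float,
               tolerance: float = 1e-6) -> bool:
    """True if the recorded deduction matches the estimate within tolerance."""
    expected = expected_deduction(record, input_price_per_m,
                                  output_price_per_m, cached_price_per_m)
    return abs(expected - record.final_deduction) <= tolerance

record = UsageRecord("2026-05-17T10:00:00Z", "req-123", "example-model",
                     input_tokens=20_000, output_tokens=2_000, cached_tokens=0,
                     multiplier=0.5, final_deduction=0.014)
print(reconciles(record, 1.0, 4.0, 0.25))
```

A record that fails this check is not proof of a billing error. It may only mean that a rule such as rounding or a minimum charge is missing from the estimate.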

How to estimate before a batch job

Batch jobs are where small mistakes become expensive. One item may look cheap, but 100,000 items change the picture. The safest method is to run a real sample before scaling up.

A practical workflow:

  1. Pick 20 to 100 realistic samples, not just the shortest ones.
  2. Use the real prompt, real model, and real output format.
  3. Record input tokens, output tokens, cache behavior, and actual deduction for each item.
  4. Look at the average, but also inspect P90 or long-tail examples.
  5. Estimate total cost and add a 20% to 30% buffer for retries and unusual cases.
  6. Set maximum output length, concurrency limits, and daily budget limits.
  7. Run 1% or 5% of the data first, compare the invoice, then scale.
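Step 4 of the workflow above can be sketched as a short summary over the sampled deductions. The sample values are made up for illustration, and the nearest-rank method is only one of several common percentile definitions.

```python
import math
from statistics import mean

def percentile_nearest_rank(values, percent: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(percent/100 * n)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]

def summarize_sample(costs):
    """Mean, P90, and maximum per-item cost for a sample run."""
    return {
        "mean": mean(costs),
        "p90": percentile_nearest_rank(costs, 90),
        "max": max(costs),
    }

# Hypothetical per-item deductions recorded from a 10-item sample.
sample_costs = [0.0002, 0.0002, 0.0003, 0.0002, 0.0021,
                0.0003, 0.0002, 0.0004, 0.0002, 0.0009]
summary = summarize_sample(sample_costs)
print({name: round(value, 6) for name, value in summary.items()})
```

In this made-up sample, most items cost about 0.0002, but two long-tail items pull the mean well above that, which is why the average alone can mislead.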

The estimate can be written as:

Cost per item ≈ input cost + output cost
Batch cost ≈ cost per item * number of items * safety factor
Relay balance usage ≈ batch cost * model multiplier * platform conversion factor
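The same estimate as a small function, for plugging in numbers from a sample run. The default safety factor of 1.25 sits inside the 20% to 30% buffer suggested above; the parameter names and the example values are hypothetical.

```python
def estimate_batch(cost_per_item: float, items: int,
                   safety_factor: float = 1.25,
                   multiplier: float = 1.0,
                   conversion_factor: float = 1.0):
    """Return (batch cost, relay balance usage) for a planned batch job."""
    batch_cost = cost_per_item * items * safety_factor
    relay_usage = batch_cost * multiplier * conversion_factor
    return batch_cost, relay_usage

# Hypothetical: 100,000 items at $0.00022 each, through a relay labeled 0.5x.
batch_cost, relay_usage = estimate_batch(0.00022, 100_000, multiplier=0.5)
print(f"batch cost: ${batch_cost:.2f}, relay balance usage: {relay_usage:.2f}")
```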

For classification, tagging, and field extraction, output should be short and strict, ideally only the JSON you need. For summarization, rewriting, and long-form generation, output length is the main variable to control.

How to tell whether the cost is reasonable

Do not evaluate AI API cost by model price alone. A better checklist is:

  • Is this task mostly input-heavy or output-heavy?
  • Is the model stronger than necessary?
  • Can the fixed prompt be shortened or cached?
  • Can the output format be made more compact?
  • Are the relay multiplier and balance conversion transparent?
  • Can each balance deduction be explained from usage records?
  • Was a real sample run before the batch job?

One-sentence summary: AI API billing starts with tokens, usage is split into input and output, and relay balances depend on conversion rules. The most reliable cost source is always the provider's usage record or invoice.
