Tokens, context, costs: understanding LLM billing

LLM & Modèles 🟢 Beginner ⏱️ 16 min read 📅 2026-02-24

🔤 What is a token?

The basic building block of LLMs

A token is not a word. Nor is it a character. It is an intermediate unit that the model uses to "read" and "write" text.

Specifically, a tokenizer (splitting algorithm) divides the text into pieces called subwords. Here is how it works:

Original text	Tokens (approximate)	Count
"Hello world"	["Hello", " world"]	2
"Bonjour le monde"	["Bon", "jour", " le", " monde"]	4
"anticonstitutionnellement"	["anti", "constit", "ution", "nelle", "ment"]	5
"GPT-4"	["G", "PT", "-", "4"]	4

Basic rules to remember

In English: 1 token ≈ 4 characters ≈ 0.75 word
In French: 1 token ≈ 3 characters ≈ 0.5 word (French is ~30% more "expensive" in tokens)
Spaces count (often attached to the following word)
Numbers are often split individually
Code is generally token-efficient (short keywords)
Special characters and emojis can cost several tokens each

Why French is more expensive

Tokenizers are primarily trained on English text. As a result: common English words are often a single token, whereas their French equivalents are split into multiple pieces. For example, with the GPT-4 tokenizer, the sentence "The cat is on the table" counts as 7 tokens, compared to 8 for "Le chat est sur la table". Similarly, "Understanding" fits in 1 token, whereas "Compréhension" requires 3. This gap is found across most common expressions: "I need to automate this" (6 tokens) versus "J'ai besoin d'automatiser" (9 tokens).

💡 Tip: to count your tokens precisely, use OpenAI's tiktoken library (to install via pip and import in Python to encode your text and get the exact token count) or the online counting tool provided in Anthropic's documentation (which allows you to paste text and get the count without writing code).

The different tokenizers

Each model family uses its own tokenizer:

Model	Tokenizer	Vocabulary
GPT-4 / GPT-4o	cl100k_base / o200k_base	100K-200K tokens
Claude 3/4	Anthropic Proprietary	~100K tokens
Llama 3	SentencePiece	128K tokens
Gemini	SentencePiece	256K tokens
Mistral	SentencePiece	32K tokens

The same text can therefore yield a different number of tokens depending on the model used. This is important for estimating costs!

📐 The context window: your working memory

What is context?

The context window represents the maximum amount of text a model can "see" at one time. It includes:

The system prompt (base instructions)
The conversation history (previous messages)
The user message (your question)
The model response (what it generates)

All of this must fit within the window. If it exceeds the limit → the model "forgets" the oldest messages.

The spectacular evolution of context windows

Year	Model	Context	Text equivalent
2022	GPT-3.5	4K tokens	~3,000 words
2023	GPT-4	8K-32K tokens	~6,000-24,000 words
2023	Claude 2	100K tokens	~75,000 words
2024	GPT-4 Turbo	128K tokens	~96,000 words
2024	Claude 3	200K tokens	~150,000 words
2024	Gemini 1.5 Pro	1M tokens	~750,000 words
2025	Gemini 2.0	2M tokens	~1,500,000 words
2025	Claude 3.5/4	200K tokens	~150,000 words
2026	Claude Opus 4	200K tokens	~150,000 words

To put things into perspective: 200K tokens is roughly a 500-page novel. 1M tokens is the entire Harry Potter series.

Long context: advantages and pitfalls

✅ Advantages of a large context:
- Analyze entire documents (contracts, reports, source code)
- Maintain long conversations without memory loss
- Provide plenty of examples (few-shot learning)

⚠️ Pitfalls to avoid:
- More context = more expensive (you pay for ALL input tokens)
- "Lost in the middle": models tend to use information in the middle of a long context less effectively
- Latency: the longer the context, the slower the response
- It's not memory: each query re-processes the entire context from scratch

The impact on cost is direct with a model like Claude Opus 4 (at $15/million input tokens). A short conversation of 2K tokens costs about $0.03, but add a 50-page document (30K tokens) and the cost rises to $0.45. With a long history reaching 100K tokens, you jump to $1.50. And if you fill the maximum 200K token window, each message costs you $3.00 — and that's just for the input, every single time you send a message.

Strategies for managing context

Sliding summary: periodically summarize the conversation to free up space
RAG (Retrieval Augmented Generation): search only for relevant info instead of sending everything
Smart chunking: split documents and only send the useful parts
Concise system prompts: every word in the system prompt is re-sent with every message

💰 How costs are calculated

The input/output model

LLM API billing is based on a simple principle:

Input tokens (what you send): price X per million tokens
Output tokens (what the model generates): price Y per million tokens
Output is always more expensive than input (often 3 to 5x more)

Why does output cost more? Because generation requires much more computation than simply "reading" the input.

Calculation formula

The total cost is calculated in two steps. First, multiply the number of input tokens by the price per million, then divide by a million. Next, do the same for the output tokens. The sum of the two gives the total cost. Let's take an example with Claude Sonnet 4 ($3 input / $15 output per million): if you send 2,000 tokens and the model responds with 500 tokens, the input cost is (2,000 × 3 / 1,000,000) = $0.006, and the output cost is (500 × 15 / 1,000,000) = $0.0075. The total cost is therefore $0.0135, or about 1.3 cents.

Hidden costs to watch out for

System prompt: sent with EVERY request, it adds up quickly
Conversation history: grows with every exchange
Retries: if your app automatically retries, you pay double
Streaming: same cost as non-streaming, but perceived latency decreases
Images/files: converted into tokens (an image can cost 1K-10K tokens)
Thinking/reasoning tokens: "thinker" models (o1, Claude with thinking) generate billable reflection tokens

📊 2026 Price Comparison Table

Here are the current prices for the main models (in dollars per million tokens):

Premium Models (advanced reasoning)

Model	Input $/M	Output $/M	Context	Ideal for
Claude Opus 4	$15	$75	200K	Complex tasks, code, analysis
GPT-4.5	$75	$150	128K	Creativity, nuance
o3 (OpenAI)	$10	$40	200K	Reasoning, math, code
Gemini 2.0 Ultra	$7	$21	2M	Large contexts, multimodal

Intermediate Models (best value for money)

Model	Input $/M	Output $/M	Context	Ideal for
Claude Sonnet 4	$3	$15	200K	Daily use, code
GPT-4o	$2.50	$10	128K	Versatile, fast
Gemini 2.0 Flash	$0.10	$0.40	1M	Volume, long context
Llama 3.3 70B	$0.40	$0.40	128K	Self-hosted, private

Budget Models (high volume)

Model	Input $/M	Output $/M	Context	Ideal for
Claude Haiku 3.5	$0.80	$4	200K	Classification, extraction
GPT-4o mini	$0.15	$0.60	128K	Simple tasks, volume
Gemini 2.0 Flash Lite	$0.02	$0.10	1M	Ultra-volume
Mistral Small	$0.10	$0.30	128K	Europe, GDPR

⚠️ Note: these prices evolve rapidly. Always check current prices on official websites. The prices above reflect early 2026 rates.

To choose the model best suited to your needs and budget, check out our guide Claude, GPT, Gemini, Llama : which model to choose in 2026 ?.

Cost per typical task

To better visualize, here is the estimated cost of common tasks:

Task	Tokens (in+out)	Budget model	Mid model	Premium model
Simple question	500+200	$0.0002	$0.005	$0.02
Article summary	3K+500	$0.001	$0.02	$0.08
20-page document analysis	15K+2K	$0.004	$0.08	$0.37
Long article generation	2K+4K	$0.003	$0.07	$0.33
Coding session (1h)	50K+20K	$0.02	$0.45	$1.75
Autonomous agent (task)	200K+50K	$0.07	$1.35	$6.75

If these costs seem high for your usage, our article on using free models without sacrificing quality will help you reduce the bill.

🧮 Calculating your monthly budget

Simple method

To estimate your monthly budget, proceed step by step. First, calculate the cost of a single request: multiply your average number of input tokens by the price per million, divide by a million, then do the same for the output. Add the two amounts together. Next, multiply this cost per request by the number of requests per day, then by 30 (days). For example, a developer using Claude Sonnet 4 with 50 requests per day, 3,000 input tokens and 1,000 output tokens, would pay (3,000 × 3 / 1,000,000) + (1,000 × 15 / 1,000,000) = $0.024 per request, or $0.024 × 50 × 30 = $36.00 per month.

Typical consumption profiles

Profile	Requests/day	Typical model	Budget/month
Occasional explorer	5-10	GPT-4o mini	$0.50-2
Daily professional	30-50	Claude Sonnet 4	$20-40
Intensive developer	100-200	Mix Sonnet/Haiku	$30-80
Startup / Product	1K-10K	GPT-4o mini + Opus	$50-500
Enterprise	10K+	Multi-model mix	$500+

💡 12 tips to reduce your costs

1. Choose the right model for each task

Don't take a Ferrari to buy bread. Use an economical model for simple tasks and save premium models for complex cases.

Intelligent routing consists of classifying your tasks into three categories and assigning a different model to each. Simple tasks (classification, extraction, rephrasing) are routed to an economical model like Claude Haiku 3.5 ($0.80/$4 per million). Intermediate tasks (summarization, writing, simple code) go to Claude Sonnet 4 ($3/$15 per million). Finally, complex tasks (in-depth analysis, reasoning, complex code) are sent to Claude Opus 4 ($15/$75 per million). This single strategy can divide your bill by 3 to 5.

2. Optimize your prompts

A concise but precise prompt costs less than a verbose one. Eliminate repetitions and unnecessary instructions. To go further, advanced prompting really makes a difference.

3. Use caching (prompt caching)

Claude and GPT support prompt caching: if you send the same system prompt or context repeatedly, cached tokens cost up to 90% less. Without cache, input on Claude is billed at $3/million. With cache enabled, already seen tokens only cost $0.30/million (with a slight extra cost of $3.75/million for writing to the cache). On repetitive requests with a long system prompt, the savings can reach 90%.

4. Limit conversation history

Don't keep 100 messages of history. Summarize regularly or only keep the last N exchanges.

5. Use open-source models in self-hosting

For high volume, hosting Llama 3.3 or Qwen3.6 : Alibaba débarque avec une nouvelle famille de modèles LLM on your own GPU can be much more cost-effective than APIs.

6. Batching (grouping requests)

OpenAI and Anthropic offer batch APIs with a 50% discount for non-urgent tasks.

7. Structured output to reduce output

Asking for structured JSON instead of free text significantly reduces output tokens. As a comparison, a verbose response like "The sentiment of this text is positive. Indeed, the author uses words like 'excellent', 'wonderful'..." consumes about 200 tokens. The same result in structured JSON — {"sentiment": "positif", "score": 0.92, "mots_cles": ["excellent", "formidable"]} — only consumes about 20 tokens, a 90% reduction.

8. Pre-filter with lightweight models

Use an economical model to filter/classify, then only send relevant cases to the premium model.

9. Set spending limits

All APIs offer spending limits. Configure them to avoid surprises.

10. Monitor your consumption

Use your provider's dashboards or tools like OpenRouter to track your spending in real time.

11. Compress documents before sending

Summarize or extract the relevant parts of a document before sending it to the LLM, rather than sending the entire document.

12. Use a model router

Services like OpenRouter make it easy to switch between models and compare prices. This is particularly useful with new models like DeepSeek V4 which are changing the game on pricing.

🔧 Handy tools for managing your tokens

Counting your tokens

To count your tokens with tiktoken (OpenAI's library), install it via pip, then import it into your Python script. Use tiktoken.encoding_for_model("gpt-4o") to get the encoder suited to your model, then call enc.encode(votre_texte) to retrieve the list of tokens and len() to find out their number. You can also decode each token individually with enc.decode([token]) to see how the text is split up.

On the Anthropic side, install the official Python SDK, initialize an Anthropic() client, then call client.count_tokens("Votre texte ici") to directly get the number of tokens exactly as the Claude model would count them. This method is more reliable than tiktoken if you are specifically targeting the Claude API, since each model has its own tokenizer.

Monitoring your spending with OpenRouter

OpenRouter is a multi-model router that offers a centralized dashboard to track your costs across all providers. To check your consumption via its API, send a GET request to https://openrouter.ai/api/v1/auth/key with your API key in the header (Authorization: Bearer VOTRE_CLE). The response includes your current usage, your limits, and the rate limits associated with your key.

❌ Common mistakes

Ignoring system prompt tokens: your system prompt is sent back with every message. A 2,000-token prompt sent 100 times a day = 200K tokens paid just for the instructions, without counting the rest.
Forgetting the history: in a 50-message conversation, the history can account for 80% of the billed tokens. Summarize or truncate it regularly.
Using a premium model for everything: a binary classification or a name extraction doesn't need Claude Opus 4. Haiku or GPT-4o mini are sufficient.
Not enabling caching: if your system prompt or reference document is identical from one request to another, caching can reduce input by 90%.
Comparing prices without normalizing: a model at $0.10/M that consumes 3x more tokens for the same result isn't necessarily cheaper than a model at $0.30/M.

❓ FAQ

Is a token a word?
No. In English, a token ≈ 0.75 word. In French, it's closer to 0.5 words. Long or rare words are split into multiple tokens.

Why does output cost more than input?
Generating text (output) requires much more computing power than simply reading (input). The model has to predict each next token one by one.

Does caching work on all models?
No. Claude and GPT support it, but not all providers and models implement it. Check your model's documentation.

How much does an image cost in tokens?
It depends on the resolution and the model. Generally, an image costs between 1,000 and 10,000 tokens in input. High-resolution images can go even higher.

Is self-hosting always cheaper?
Not necessarily. Self-hosting involves GPU costs (rental or purchase), maintenance, and energy. It becomes profitable from a certain volume, generally several million tokens per day.

🛠️ Recommended tools

Tool	Usage	Link
tiktoken	Count tokens (OpenAI models)	github.com/openai/tiktoken
Anthropic Tokenizer	Count tokens (Claude models)	Documentation Anthropic
OpenRouter	Route between models, compare prices	openrouter.ai
LiteLLM	Unified proxy to multiply providers	github.com/BerriAI/litellm

🎯 The essentials

A token ≈ 3-4 characters, not a whole word
The context window encompasses everything the model "sees": prompt, history, and response
Input (what you send) is cheaper than output (what the model generates), often 3 to 5x cheaper
The three most powerful optimization levers: choose the right model for each task, enable prompt caching, and favor structured output
Don't pay for power you don't need: a well-prompted budget model often beats a poorly used premium model

Conclusion

LLM billing may seem complex at first glance, but it relies on simple and predictable mechanisms. By understanding what a token is, how the context window impacts your costs, and how input/output pricing works, you take back control of your budget.

The room for maneuver is real. Between choosing the right model for each task, caching repetitive prompts, reducing output via structured responses, and actively monitoring your consumption, it is entirely possible to divide your costs by 5 to 10 without sacrificing the quality of the results. The key is to treat your token expenses the same way you would treat any cost item in a project: with measurement, common sense, and the right tools.

#Billing #Costs #Tokens #llm

📚 Related articles

LLM & Modèles 🟢 Débutant 11 min

Gemini 3.5 Flash : the fast model that beats Opus 4.7 and GPT-5.5 on agent benchmarks — 289 tokens/second

Discover Gemini 3.5 Flash: the ultra-fast model at 289 tokens/sec beating Claude Opus 4.7 and GPT-5.5 on agent benchmarks.

2026-05-20 14:09

LLM & Modèles 🟢 Débutant 14 min

General Preference RL: this paper unifies reinforcement learning and preference optimization for LLMs

Discover the General Preference RL paper unifying reinforcement learning and preference optimization to solve LLM post-training.

2026-05-19 18:01

LLM & Modèles 🟢 Débutant 12 min

OpenAI Parameter Golf: The challenge that proves small models are the future of AI

Discover the OpenAI Parameter Golf challenge: why compressing an LLM into 16 MB proves small models are the future of AI.

2026-05-18 17:02

📑 Table of contents