🔤 What is a token?
The basic building block of LLMs
A token is not a word. Nor is it a character. It is an intermediate unit that the model uses to "read" and "write" text.
Specifically, a tokenizer (splitting algorithm) divides the text into pieces called subwords. Here is how it works:
| Original text | Tokens (approximate) | Count |
|---|---|---|
| "Hello world" | ["Hello", " world"] | 2 |
| "Bonjour le monde" | ["Bon", "jour", " le", " monde"] | 4 |
| "anticonstitutionnellement" | ["anti", "constit", "ution", "nelle", "ment"] | 5 |
| "GPT-4" | ["G", "PT", "-", "4"] | 4 |
Basic rules to remember
- In English: 1 token ≈ 4 characters ≈ 0.75 word
- In French: 1 token ≈ 3 characters ≈ 0.5 word (French is ~30% more "expensive" in tokens)
- Spaces count (often attached to the following word)
- Numbers are often split individually
- Code is generally token-efficient (short keywords)
- Special characters and emojis can cost several tokens each
Why French is more expensive
Tokenizers are primarily trained on English text. As a result: common English words are often a single token, whereas their French equivalents are split into multiple pieces. For example, with the GPT-4 tokenizer, the sentence "The cat is on the table" counts as 7 tokens, compared to 8 for "Le chat est sur la table". Similarly, "Understanding" fits in 1 token, whereas "Compréhension" requires 3. This gap is found across most common expressions: "I need to automate this" (6 tokens) versus "J'ai besoin d'automatiser" (9 tokens).
💡 Tip: to count your tokens precisely, use OpenAI's tiktoken library (to install via pip and import in Python to encode your text and get the exact token count) or the online counting tool provided in Anthropic's documentation (which allows you to paste text and get the count without writing code).
The different tokenizers
Each model family uses its own tokenizer:
| Model | Tokenizer | Vocabulary |
|---|---|---|
| GPT-4 / GPT-4o | cl100k_base / o200k_base | 100K-200K tokens |
| Claude 3/4 | Anthropic Proprietary | ~100K tokens |
| Llama 3 | SentencePiece | 128K tokens |
| Gemini | SentencePiece | 256K tokens |
| Mistral | SentencePiece | 32K tokens |
The same text can therefore yield a different number of tokens depending on the model used. This is important for estimating costs!
📐 The context window: your working memory
What is context?
The context window represents the maximum amount of text a model can "see" at one time. It includes:
- The system prompt (base instructions)
- The conversation history (previous messages)
- The user message (your question)
- The model response (what it generates)
All of this must fit within the window. If it exceeds the limit → the model "forgets" the oldest messages.
The spectacular evolution of context windows
| Year | Model | Context | Text equivalent |
|---|---|---|---|
| 2022 | GPT-3.5 | 4K tokens | ~3,000 words |
| 2023 | GPT-4 | 8K-32K tokens | ~6,000-24,000 words |
| 2023 | Claude 2 | 100K tokens | ~75,000 words |
| 2024 | GPT-4 Turbo | 128K tokens | ~96,000 words |
| 2024 | Claude 3 | 200K tokens | ~150,000 words |
| 2024 | Gemini 1.5 Pro | 1M tokens | ~750,000 words |
| 2025 | Gemini 2.0 | 2M tokens | ~1,500,000 words |
| 2025 | Claude 3.5/4 | 200K tokens | ~150,000 words |
| 2026 | Claude Opus 4 | 200K tokens | ~150,000 words |
To put things into perspective: 200K tokens is roughly a 500-page novel. 1M tokens is the entire Harry Potter series.
Long context: advantages and pitfalls
✅ Advantages of a large context:
- Analyze entire documents (contracts, reports, source code)
- Maintain long conversations without memory loss
- Provide plenty of examples (few-shot learning)
⚠️ Pitfalls to avoid:
- More context = more expensive (you pay for ALL input tokens)
- "Lost in the middle": models tend to use information in the middle of a long context less effectively
- Latency: the longer the context, the slower the response
- It's not memory: each query re-processes the entire context from scratch
The impact on cost is direct with a model like Claude Opus 4 (at $15/million input tokens). A short conversation of 2K tokens costs about $0.03, but add a 50-page document (30K tokens) and the cost rises to $0.45. With a long history reaching 100K tokens, you jump to $1.50. And if you fill the maximum 200K token window, each message costs you $3.00 — and that's just for the input, every single time you send a message.
Strategies for managing context
- Sliding summary: periodically summarize the conversation to free up space
- RAG (Retrieval Augmented Generation): search only for relevant info instead of sending everything
- Smart chunking: split documents and only send the useful parts
- Concise system prompts: every word in the system prompt is re-sent with every message
💰 How costs are calculated
The input/output model
LLM API billing is based on a simple principle:
- Input tokens (what you send): price X per million tokens
- Output tokens (what the model generates): price Y per million tokens
- Output is always more expensive than input (often 3 to 5x more)
Why does output cost more? Because generation requires much more computation than simply "reading" the input.
Calculation formula
The total cost is calculated in two steps. First, multiply the number of input tokens by the price per million, then divide by a million. Next, do the same for the output tokens. The sum of the two gives the total cost. Let's take an example with Claude Sonnet 4 ($3 input / $15 output per million): if you send 2,000 tokens and the model responds with 500 tokens, the input cost is (2,000 × 3 / 1,000,000) = $0.006, and the output cost is (500 × 15 / 1,000,000) = $0.0075. The total cost is therefore $0.0135, or about 1.3 cents.
Hidden costs to watch out for
- System prompt: sent with EVERY request, it adds up quickly
- Conversation history: grows with every exchange
- Retries: if your app automatically retries, you pay double
- Streaming: same cost as non-streaming, but perceived latency decreases
- Images/files: converted into tokens (an image can cost 1K-10K tokens)
- Thinking/reasoning tokens: "thinker" models (o1, Claude with thinking) generate billable reflection tokens
📊 2026 Price Comparison Table
Here are the current prices for the main models (in dollars per million tokens):
Premium Models (advanced reasoning)
| Model | Input $/M | Output $/M | Context | Ideal for |
|---|---|---|---|---|
| Claude Opus 4 | $15 | $75 | 200K | Complex tasks, code, analysis |
| GPT-4.5 | $75 | $150 | 128K | Creativity, nuance |
| o3 (OpenAI) | $10 | $40 | 200K | Reasoning, math, code |
| Gemini 2.0 Ultra | $7 | $21 | 2M | Large contexts, multimodal |
Intermediate Models (best value for money)
| Model | Input $/M | Output $/M | Context | Ideal for |
|---|---|---|---|---|
| Claude Sonnet 4 | $3 | $15 | 200K | Daily use, code |
| GPT-4o | $2.50 | $10 | 128K | Versatile, fast |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M | Volume, long context |
| Llama 3.3 70B | $0.40 | $0.40 | 128K | Self-hosted, private |
Budget Models (high volume)
| Model | Input $/M | Output $/M | Context | Ideal for |
|---|---|---|---|---|
| Claude Haiku 3.5 | $0.80 | $4 | 200K | Classification, extraction |
| GPT-4o mini | $0.15 | $0.60 | 128K | Simple tasks, volume |
| Gemini 2.0 Flash Lite | $0.02 | $0.10 | 1M | Ultra-volume |
| Mistral Small | $0.10 | $0.30 | 128K | Europe, GDPR |
⚠️ Note: these prices evolve rapidly. Always check current prices on official websites. The prices above reflect early 2026 rates.
To choose the model best suited to your needs and budget, check out our guide Claude, GPT, Gemini, Llama : which model to choose in 2026 ?.
Cost per typical task
To better visualize, here is the estimated cost of common tasks:
| Task | Tokens (in+out) | Budget model | Mid model | Premium model |
|---|---|---|---|---|
| Simple question | 500+200 | $0.0002 | $0.005 | $0.02 |
| Article summary | 3K+500 | $0.001 | $0.02 | $0.08 |
| 20-page document analysis | 15K+2K | $0.004 | $0.08 | $0.37 |
| Long article generation | 2K+4K | $0.003 | $0.07 | $0.33 |
| Coding session (1h) | 50K+20K | $0.02 | $0.45 | $1.75 |
| Autonomous agent (task) | 200K+50K | $0.07 | $1.35 | $6.75 |
If these costs seem high for your usage, our article on using free models without sacrificing quality will help you reduce the bill.
🧮 Calculating your monthly budget
Simple method
To estimate your monthly budget, proceed step by step. First, calculate the cost of a single request: multiply your average number of input tokens by the price per million, divide by a million, then do the same for the output. Add the two amounts together. Next, multiply this cost per request by the number of requests per day, then by 30 (days). For example, a developer using Claude Sonnet 4 with 50 requests per day, 3,000 input tokens and 1,000 output tokens, would pay (3,000 × 3 / 1,000,000) + (1,000 × 15 / 1,000,000) = $0.024 per request, or $0.024 × 50 × 30 = $36.00 per month.
Typical consumption profiles
| Profile | Requests/day | Typical model | Budget/month |
|---|---|---|---|
| Occasional explorer | 5-10 | GPT-4o mini | $0.50-2 |
| Daily professional | 30-50 | Claude Sonnet 4 | $20-40 |
| Intensive developer | 100-200 | Mix Sonnet/Haiku | $30-80 |
| Startup / Product | 1K-10K | GPT-4o mini + Opus | $50-500 |
| Enterprise | 10K+ | Multi-model mix | $500+ |
💡 12 tips to reduce your costs
1. Choose the right model for each task
Don't take a Ferrari to buy bread. Use an economical model for simple tasks and save premium models for complex cases.
Intelligent routing consists of classifying your tasks into three categories and assigning a different model to each. Simple tasks (classification, extraction, rephrasing) are routed to an economical model like Claude Haiku 3.5 ($0.80/$4 per million). Intermediate tasks (summarization, writing, simple code) go to Claude Sonnet 4 ($3/$15 per million). Finally, complex tasks (in-depth analysis, reasoning, complex code) are sent to Claude Opus 4 ($15/$75 per million). This single strategy can divide your bill by 3 to 5.
2. Optimize your prompts
A concise but precise prompt costs less than a verbose one. Eliminate repetitions and unnecessary instructions. To go further, advanced prompting really makes a difference.
3. Use caching (prompt caching)
Claude and GPT support prompt caching: if you send the same system prompt or context repeatedly, cached tokens cost up to 90% less. Without cache, input on Claude is billed at $3/million. With cache enabled, already seen tokens only cost $0.30/million (with a slight extra cost of $3.75/million for writing to the cache). On repetitive requests with a long system prompt, the savings can reach 90%.
4. Limit conversation history
Don't keep 100 messages of history. Summarize regularly or only keep the last N exchanges.
5. Use open-source models in self-hosting
For high volume, hosting Llama 3.3 or Qwen3.6 : Alibaba débarque avec une nouvelle famille de modèles LLM on your own GPU can be much more cost-effective than APIs.
6. Batching (grouping requests)
OpenAI and Anthropic offer batch APIs with a 50% discount for non-urgent tasks.
7. Structured output to reduce output
Asking for structured JSON instead of free text significantly reduces output tokens. As a comparison, a verbose response like "The sentiment of this text is positive. Indeed, the author uses words like 'excellent', 'wonderful'..." consumes about 200 tokens. The same result in structured JSON — {"sentiment": "positif", "score": 0.92, "mots_cles": ["excellent", "formidable"]} — only consumes about 20 tokens, a 90% reduction.
8. Pre-filter with lightweight models
Use an economical model to filter/classify, then only send relevant cases to the premium model.
9. Set spending limits
All APIs offer spending limits. Configure them to avoid surprises.
10. Monitor your consumption
Use your provider's dashboards or tools like OpenRouter to track your spending in real time.
11. Compress documents before sending
Summarize or extract the relevant parts of a document before sending it to the LLM, rather than sending the entire document.
12. Use a model router
Services like OpenRouter make it easy to switch between models and compare prices. This is particularly useful with new models like DeepSeek V4 which are changing the game on pricing.
🔧 Handy tools for managing your tokens
Counting your tokens
To count your tokens with tiktoken (OpenAI's library), install it via pip, then import it into your Python script. Use tiktoken.encoding_for_model("gpt-4o") to get the encoder suited to your model, then call enc.encode(votre_texte) to retrieve the list of tokens and len() to find out their number. You can also decode each token individually with enc.decode([token]) to see how the text is split up.
On the Anthropic side, install the official Python SDK, initialize an Anthropic() client, then call client.count_tokens("Votre texte ici") to directly get the number of tokens exactly as the Claude model would count them. This method is more reliable than tiktoken if you are specifically targeting the Claude API, since each model has its own tokenizer.
Monitoring your spending with OpenRouter
OpenRouter is a multi-model router that offers a centralized dashboard to track your costs across all providers. To check your consumption via its API, send a GET request to https://openrouter.ai/api/v1/auth/key with your API key in the header (Authorization: Bearer VOTRE_CLE). The response includes your current usage, your limits, and the rate limits associated with your key.
❌ Common mistakes
- Ignoring system prompt tokens: your system prompt is sent back with every message. A 2,000-token prompt sent 100 times a day = 200K tokens paid just for the instructions, without counting the rest.
- Forgetting the history: in a 50-message conversation, the history can account for 80% of the billed tokens. Summarize or truncate it regularly.
- Using a premium model for everything: a binary classification or a name extraction doesn't need Claude Opus 4. Haiku or GPT-4o mini are sufficient.
- Not enabling caching: if your system prompt or reference document is identical from one request to another, caching can reduce input by 90%.
- Comparing prices without normalizing: a model at $0.10/M that consumes 3x more tokens for the same result isn't necessarily cheaper than a model at $0.30/M.
❓ FAQ
Is a token a word?
No. In English, a token ≈ 0.75 word. In French, it's closer to 0.5 words. Long or rare words are split into multiple tokens.
Why does output cost more than input?
Generating text (output) requires much more computing power than simply reading (input). The model has to predict each next token one by one.
Does caching work on all models?
No. Claude and GPT support it, but not all providers and models implement it. Check your model's documentation.
How much does an image cost in tokens?
It depends on the resolution and the model. Generally, an image costs between 1,000 and 10,000 tokens in input. High-resolution images can go even higher.
Is self-hosting always cheaper?
Not necessarily. Self-hosting involves GPU costs (rental or purchase), maintenance, and energy. It becomes profitable from a certain volume, generally several million tokens per day.
🛠️ Recommended tools
| Tool | Usage | Link |
|---|---|---|
| tiktoken | Count tokens (OpenAI models) | github.com/openai/tiktoken |
| Anthropic Tokenizer | Count tokens (Claude models) | Documentation Anthropic |
| OpenRouter | Route between models, compare prices | openrouter.ai |
| LiteLLM | Unified proxy to multiply providers | github.com/BerriAI/litellm |
🎯 The essentials
- A token ≈ 3-4 characters, not a whole word
- The context window encompasses everything the model "sees": prompt, history, and response
- Input (what you send) is cheaper than output (what the model generates), often 3 to 5x cheaper
- The three most powerful optimization levers: choose the right model for each task, enable prompt caching, and favor structured output
- Don't pay for power you don't need: a well-prompted budget model often beats a poorly used premium model
Conclusion
LLM billing may seem complex at first glance, but it relies on simple and predictable mechanisms. By understanding what a token is, how the context window impacts your costs, and how input/output pricing works, you take back control of your budget.
The room for maneuver is real. Between choosing the right model for each task, caching repetitive prompts, reducing output via structured responses, and actively monitoring your consumption, it is entirely possible to divide your costs by 5 to 10 without sacrificing the quality of the results. The key is to treat your token expenses the same way you would treat any cost item in a project: with measurement, common sense, and the right tools.