Tokens, context, costs: understanding LLM billing

LLM & Models 🟢 Beginner ⏱️ 14 min read 📅 2026-02-24

Are you using ChatGPT, Claude, or another AI model and seeing terms like "tokens," "200K context," or "$3/million tokens" without really understanding what they mean? You're not alone. LLM (Large Language Model) pricing rests on a few simple concepts, but they remain obscure to many users.

In this comprehensive guide, we'll break it all down: what a token is, how the context window works, how costs are calculated, and most importantly, how to optimize your spending to get the most out of AI without blowing your budget.


🔤 What is a token?

The basic building block of LLMs

A token is not a word. Nor is it a character. It's an intermediate unit that the model uses to "read" and "write" text.

Concretely, a tokenizer (splitting algorithm) divides text into chunks called subwords. Here's how it works:

| Original text | Tokens (approximate) | Count |
|---|---|---|
| "Hello world" | ["Hello", " world"] | 2 |
| "Bonjour le monde" | ["Bon", "jour", " le", " monde"] | 4 |
| "anticonstitutionnellement" | ["anti", "constit", "ution", "nelle", "ment"] | 5 |
| "GPT-4" | ["G", "PT", "-", "4"] | 4 |

Key rules to remember

  • In English: 1 token ≈ 4 characters ≈ 0.75 words
  • In French: 1 token ≈ 3 characters ≈ 0.5 words (French is ~30% more "expensive" in tokens)
  • Spaces count (often attached to the next word)
  • Numbers are often split individually
  • Code is generally token-efficient (short keywords)
  • Special characters and emojis can cost multiple tokens each

Why French costs more

Tokenizers are primarily trained on English text. Result: common English words are often a single token, while their French equivalents are split into multiple pieces.

```text
# Example with GPT-4 tokenizer
"The cat is on the table"      7 tokens
"Le chat est sur la table"     8 tokens

"Understanding"                1 token
"Compréhension"                3 tokens

"I need to automate this"      6 tokens
"J'ai besoin d'automatiser"    9 tokens
```

💡 Tip: To precisely count your tokens, use OpenAI's tiktoken tool or Anthropic's online tokenizer.
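For quick back-of-the-envelope estimates, the rules of thumb above can be turned into a tiny estimator (a heuristic sketch only, not a tokenizer; the `estimate_tokens` name is ours):

```python
def estimate_tokens(text: str, lang: str = "en") -> int:
    """Rough token estimate from character count, using the rules
    of thumb above: ~4 chars/token in English, ~3 in French.
    Heuristic only -- use a real tokenizer (e.g. tiktoken)
    for billing-accurate counts."""
    chars_per_token = 4 if lang == "en" else 3
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("The cat is on the table"))         # → 6 (actual: 7)
print(estimate_tokens("Le chat est sur la table", "fr"))  # → 8 (actual: 8)
```

Close enough for budgeting; off by a token or two on short strings, which is exactly why billing dashboards use the real tokenizer.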

Different tokenizers

Each model family uses its own tokenizer:

| Model | Tokenizer | Vocabulary |
|---|---|---|
| GPT-4 / GPT-4o | cl100k_base / o200k_base | 100K-200K tokens |
| Claude 3/4 | Anthropic proprietary | ~100K tokens |
| Llama 3 | SentencePiece | 128K tokens |
| Gemini | SentencePiece | 256K tokens |
| Mistral | SentencePiece | 32K tokens |

The same text can therefore yield a different token count depending on the model used. This is important for cost estimation!


📐 Context window: your working memory

What is context?

The context window represents the maximum amount of text a model can "see" at once. It includes:

  1. The system prompt (basic instructions)
  2. Conversation history (previous messages)
  3. User message (your question)
  4. Model response (what it generates)

All of this must fit within the window. If the conversation exceeds it, the model "forgets" the oldest messages: they are dropped or truncated.
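With a token-counting function in hand (real or heuristic), checking whether a request fits is straightforward. This `fits_context` helper is an illustrative sketch, not any provider's API:

```python
def fits_context(system: str, history: list[str], user_msg: str,
                 reserve_output: int, window: int, count) -> bool:
    """Check whether a request fits a model's context window.

    `count` is any token-counting callable (tokenizer or heuristic).
    `reserve_output` leaves room for the model's response, which
    also lives inside the window. Illustrative sketch only."""
    used = count(system) + sum(count(m) for m in history) + count(user_msg)
    return used + reserve_output <= window

count = lambda text: len(text) // 4  # crude ~4 chars/token heuristic
print(fits_context("x" * 400, ["y" * 2000], "z" * 400,
                   reserve_output=300, window=1000, count=count))  # → True
```

Swap `window=1000` for `window=900` and the same request no longer fits: the reserved output budget is part of the window too.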

The spectacular evolution of context windows

| Year | Model | Context | Text equivalent |
|---|---|---|---|
| 2022 | GPT-3.5 | 4K tokens | ~3,000 words |
| 2023 | GPT-4 | 8K-32K tokens | ~6,000-24,000 words |
| 2023 | Claude 2 | 100K tokens | ~75,000 words |
| 2024 | GPT-4 Turbo | 128K tokens | ~96,000 words |
| 2024 | Claude 3 | 200K tokens | ~150,000 words |
| 2024 | Gemini 1.5 Pro | 1M tokens | ~750,000 words |
| 2025 | Gemini 2.0 | 2M tokens | ~1,500,000 words |
| 2025 | Claude 3.5/4 | 200K tokens | ~150,000 words |
| 2026 | Claude Opus 4 | 200K tokens | ~150,000 words |

For perspective: 200K tokens is about a 500-page novel. 1M tokens is the entire Harry Potter series.

Long context: advantages and pitfalls

✅ Advantages of large context:
- Analyzing entire documents (contracts, reports, source code)
- Maintaining long conversations without memory loss
- Providing many examples (few-shot learning)

⚠️ Pitfalls to avoid:
- More context = more expensive (you pay for ALL input tokens)
- "Lost in the middle": models tend to use information in the middle of long contexts less effectively
- Latency: longer context = slower response
- No persistent memory: each request reprocesses the entire context from scratch

```text
# Impact of context on cost (Claude Opus 4)
# Input price: $15/million tokens

Short conversation (2K tokens)   $0.03
+ 50-page document (30K tokens)  $0.45
+ Long history (100K tokens)     $1.50
Max context (200K tokens)        $3.00

# And that's JUST the input, for EVERY message!
```

Strategies for managing context

  1. Sliding summary: periodically summarize the conversation to free up space
  2. RAG (Retrieval Augmented Generation): retrieve only relevant information instead of sending everything
  3. Smart chunking: split documents and only send useful parts
  4. Concise system prompts: every word in the system prompt is resent with each message
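The simplest of these strategies, keeping only the most recent exchanges that fit a token budget, can be sketched in a few lines (the `trim_history` helper and its message format are illustrative, not any provider's API):

```python
def trim_history(messages, max_tokens, count_tokens):
    """Keep the most recent messages that fit within max_tokens.

    messages: list of (role, text) tuples, oldest first.
    count_tokens: callable returning the token count of a string
    (real tokenizer or heuristic). Illustrative sketch only."""
    kept, total = [], 0
    for role, text in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(text)
        if total + cost > max_tokens:
            break  # everything older is dropped
        kept.append((role, text))
        total += cost
    return list(reversed(kept))  # restore chronological order

history = [("user", "a" * 400), ("assistant", "b" * 400), ("user", "c" * 40)]
trimmed = trim_history(history, max_tokens=120, count_tokens=lambda t: len(t) // 4)
print(len(trimmed))  # → 2 (the oldest 100-token message no longer fits)
```

For a sliding summary, you would replace the dropped messages with a model-generated recap instead of discarding them outright.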

💰 How costs are calculated

The input/output model

LLM API pricing is based on a simple principle:

  • Input tokens (what you send): price X per million tokens
  • Output tokens (what the model generates): price Y per million tokens
  • Output almost always costs more than input (often 3-5x more)

Why does output cost more? Because generation requires much more computation than simply "reading" the input.

Cost calculation formula

Total cost = (input_tokens × input_price / 1M) + (output_tokens × output_price / 1M)

```text
# Concrete example:
# Model: Claude Sonnet 4 ($3 input / $15 output per million)
# You send: 2,000 tokens (question + context)
# Model responds: 500 tokens

Cost = (2000 × 3 / 1,000,000) + (500 × 15 / 1,000,000)
Cost = $0.006 + $0.0075
Cost = $0.0135 (≈1.35 cents)
```

Hidden costs to watch out for

  • System prompt: sent with EVERY request, accumulates quickly
  • Conversation history: grows with each exchange
  • Retries: if your app automatically retries, you pay double
  • Streaming: same cost as non-streaming, but perceived latency decreases
  • Images/files: converted to tokens (an image can cost 1K-10K tokens)
  • Thinking/reasoning tokens: "thinking" models (o1, Claude with thinking) generate reasoning tokens that are billed

📊 2026 Price Comparison Table

Here are the current prices for major models (in dollars per million tokens):

Premium Models (advanced reasoning)

| Model | Input $/M | Output $/M | Context | Ideal for |
|---|---|---|---|---|
| Claude Opus 4 | $15 | $75 | 200K | Complex tasks, code, analysis |
| GPT-4.5 | $75 | $150 | 128K | Creativity, nuance |
| o3 (OpenAI) | $10 | $40 | 200K | Reasoning, math, code |
| Gemini 2.0 Ultra | $7 | $21 | 2M | Large contexts, multimodal |

Mid-range Models (best value)

| Model | Input $/M | Output $/M | Context | Ideal for |
|---|---|---|---|---|
| Claude Sonnet 4 | $3 | $15 | 200K | Daily use, code |
| GPT-4o | $2.50 | $10 | 128K | Versatile, fast |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M | Volume, long context |
| Llama 3.3 70B | $0.40 | $0.40 | 128K | Self-hosted, private |

Budget Models (high volume)

| Model | Input $/M | Output $/M | Context | Ideal for |
|---|---|---|---|---|
| Claude Haiku 3.5 | $0.80 | $4 | 200K | Classification, extraction |
| GPT-4o mini | $0.15 | $0.60 | 128K | Simple tasks, volume |
| Gemini 2.0 Flash Lite | $0.02 | $0.10 | 1M | Ultra-high volume |
| Mistral Small | $0.10 | $0.30 | 128K | Europe, GDPR |

⚠️ Note: These prices change rapidly. Always check current prices on official websites. The above prices reflect early 2026 rates.

Cost per typical task

To better visualize, here's the estimated cost of common tasks:

| Task | Tokens (in+out) | Budget model | Mid-range model | Premium model |
|---|---|---|---|---|
| Simple question | 500+200 | $0.0002 | $0.005 | $0.02 |
| Article summary | 3K+500 | $0.001 | $0.02 | $0.08 |
| 20-page document analysis | 15K+2K | $0.004 | $0.08 | $0.37 |
| Long article generation | 2K+4K | $0.003 | $0.07 | $0.33 |
| Coding session (1h) | 50K+20K | $0.02 | $0.45 | $1.75 |
| Autonomous agent (task) | 200K+50K | $0.07 | $1.35 | $6.75 |
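These figures fall straight out of the cost formula from the previous section; a one-liner (the `task_cost` name is ours) makes any row easy to reproduce:

```python
def task_cost(tokens_in: int, tokens_out: int,
              price_in: float, price_out: float) -> float:
    """Cost of one task in dollars (prices in $/M tokens)."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

# "Article summary" row with mid-range (Claude Sonnet 4) prices:
print(round(task_cost(3_000, 500, 3.0, 15.0), 4))  # → 0.0165, i.e. ~$0.02
```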

🧮 Calculating your monthly budget

Simple method

```python
# Quick monthly budget calculation
def calculate_budget(
    requests_per_day: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float
) -> float:
    """Returns estimated monthly cost in dollars."""
    cost_per_request = (
        (avg_input_tokens * input_price_per_m / 1_000_000) +
        (avg_output_tokens * output_price_per_m / 1_000_000)
    )
    return cost_per_request * requests_per_day * 30

# Example: Developer using Claude Sonnet 4
budget = calculate_budget(
    requests_per_day=50,
    avg_input_tokens=3000,
    avg_output_tokens=1000,
    input_price_per_m=3.0,    # Claude Sonnet 4
    output_price_per_m=15.0
)
print(f"Estimated monthly budget: ${budget:.2f}")
# → Estimated monthly budget: $36.00
```

Typical consumption profiles

| Profile | Requests/day | Typical model | Budget/month |
|---|---|---|---|
| Occasional user | 5-10 | GPT-4o mini | $0.50-2 |
| Daily professional | 30-50 | Claude Sonnet 4 | $20-40 |
| Intensive developer | 100-200 | Sonnet/Haiku mix | $30-80 |
| Startup/Product | 1K-10K | GPT-4o mini + Opus | $50-500 |
| Enterprise | 10K+ | Multi-model mix | $500+ |

💡 12 tips to reduce your costs

1. Choose the right model for each task

Don't use a Ferrari to buy bread. Use a budget model for simple tasks and reserve premium models for complex cases.

```python
# Smart routing by complexity
def choose_model(task: str) -> str:
    simple_tasks = ["classification", "extraction", "rephrasing"]
    medium_tasks = ["summary", "writing", "simple_code"]
    complex_tasks = ["analysis", "reasoning", "complex_code"]

    if task in simple_tasks:
        return "haiku-3.5"      # $0.80/$4 per M
    elif task in medium_tasks:
        return "sonnet-4"       # $3/$15 per M
    else:
        return "opus-4"         # $15/$75 per M
```

2. Optimize your prompts

A concise but precise prompt costs less than a verbose one. Eliminate repetitions and unnecessary instructions.

3. Use caching (prompt caching)

Claude and GPT support prompt caching: if you send the same system prompt or context repeatedly, cached tokens cost up to 90% less.

```text
# Pricing with caching (Claude)
Without cache: $3/M input
With cache:    $0.30/M (cached reads) + $3.75/M (cache write)
Savings on repetitive requests: up to 90%
```
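To see where the "up to 90%" comes from, compare the input cost of resending a 10K-token system prompt 100 times, with and without caching, at the Claude-style rates above. The helper is an illustrative sketch; real prompt caching also involves cache TTLs and minimum cacheable sizes:

```python
def repeated_prompt_cost(prompt_tokens: int, calls: int,
                         input_price: float = 3.0,
                         cache_read: float = 0.30,
                         cache_write: float = 3.75) -> tuple[float, float]:
    """Input cost of resending the same prompt `calls` times,
    without caching vs with caching (prices in $/M tokens).
    Simplified sketch: one cache write, then cached reads."""
    without = calls * prompt_tokens * input_price / 1_000_000
    with_cache = (prompt_tokens * cache_write / 1_000_000 +
                  (calls - 1) * prompt_tokens * cache_read / 1_000_000)
    return without, with_cache

no_cache, cached = repeated_prompt_cost(prompt_tokens=10_000, calls=100)
print(f"${no_cache:.2f} vs ${cached:.2f}")  # → $3.00 vs $0.33
```

That is roughly an 89% saving on this workload, and the more often the prompt is reused, the closer you get to the 90% ceiling.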

4. Limit conversation history

Don't keep a 100-message history. Regularly summarize it or keep only the last N exchanges.

5. Use open-source models in self-hosting

For high volume, hosting Llama 3.3 or Mistral on your own GPU can be much more cost-effective than APIs.

6. Batching (request grouping)

Group multiple requests into a single API call when possible to reduce overhead.

7. Implement RAG (Retrieval Augmented Generation)

Instead of sending entire documents, retrieve only relevant chunks when needed.
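A toy version of the retrieval step shows the cost logic, with simple keyword overlap standing in for real embedding search (the `top_chunks` helper is illustrative only):

```python
def top_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Naive retrieval sketch: rank chunks by word overlap with
    the query and return the k best. Real RAG uses embeddings,
    but the cost logic is the same: send k relevant chunks
    instead of the whole document."""
    query_words = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(query_words & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

docs = [
    "the invoice total is 40 dollars",
    "the weather was nice yesterday",
    "tokens are billed per million",
]
print(top_chunks("how are tokens billed", docs, k=1)[0])
# → tokens are billed per million
```

With a 100-page document split into 1K-token chunks, sending the top 3 chunks instead of all 100 cuts input tokens by ~97% per request.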

8. Monitor token usage

Use API response headers that show token counts to track your spending in real-time.
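A minimal tracker might look like this, assuming each API response exposes an OpenAI-style usage dict with `prompt_tokens` and `completion_tokens` (field names vary by provider; the `UsageTracker` class is our sketch):

```python
class UsageTracker:
    """Accumulate token usage across requests and estimate spend.

    Assumes each response exposes a usage dict with
    'prompt_tokens' and 'completion_tokens' (OpenAI-style;
    adapt the keys to your provider)."""

    def __init__(self, input_price: float, output_price: float):
        # Prices in $/M tokens
        self.input_price, self.output_price = input_price, output_price
        self.input_tokens = 0
        self.output_tokens = 0

    def record(self, usage: dict) -> None:
        self.input_tokens += usage["prompt_tokens"]
        self.output_tokens += usage["completion_tokens"]

    @property
    def cost(self) -> float:
        return (self.input_tokens * self.input_price +
                self.output_tokens * self.output_price) / 1_000_000

tracker = UsageTracker(input_price=3.0, output_price=15.0)  # Sonnet 4 rates
tracker.record({"prompt_tokens": 2000, "completion_tokens": 500})
print(f"${tracker.cost:.4f}")  # → $0.0135
```

Feed it every response and you get a running spend estimate without waiting for the provider's billing dashboard to update.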

9. Compress your context

Remove unnecessary whitespace, format consistently, and use shorter variable names in code.

10. Use async processing

For non-critical tasks, process requests asynchronously during off-peak hours when some providers offer discounts.

11. Negotiate enterprise contracts

For very high volume, contact providers directly for customized pricing.

12. Implement rate limiting

Prevent accidental cost spikes by setting hard limits on your API usage.
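A hard daily spend cap can be as simple as this sketch (the `SpendGuard` class is illustrative; a production system would persist its state and track per-user limits):

```python
import time

class SpendGuard:
    """Hard daily spend cap: refuse requests once the estimated
    daily cost would exceed the limit. Illustrative sketch only;
    state is in-memory and resets when the process restarts."""

    def __init__(self, daily_limit_usd: float):
        self.daily_limit = daily_limit_usd
        self.spent = 0.0
        self.day = time.strftime("%Y-%m-%d")

    def allow(self, estimated_cost: float) -> bool:
        today = time.strftime("%Y-%m-%d")
        if today != self.day:           # new day: reset the counter
            self.day, self.spent = today, 0.0
        if self.spent + estimated_cost > self.daily_limit:
            return False                # would blow the cap: refuse
        self.spent += estimated_cost
        return True

guard = SpendGuard(daily_limit_usd=1.00)
print(guard.allow(0.60))  # → True
print(guard.allow(0.60))  # → False -- would exceed the $1 cap
```

Estimate the cost of each request before sending it (using the cost formula above) and gate the API call on `guard.allow(...)`.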