If you use ChatGPT, Claude, or another AI model, you've probably seen terms like "tokens," "200K context," or "$3/million tokens" without really knowing what they mean. You're not alone. LLM (Large Language Model) pricing rests on concepts that are simple once mastered, yet it remains opaque to many users.
In this comprehensive guide, we'll break it all down: what a token is, how the context window works, how costs are calculated, and most importantly, how to optimize your spending to get the most out of AI without blowing your budget.
🔤 What is a token?
The basic building block of LLMs
A token is not a word. Nor is it a character. It's an intermediate unit that the model uses to "read" and "write" text.
Concretely, a tokenizer (splitting algorithm) divides text into chunks called subwords. Here's how it works:
| Original text | Tokens (approximate) | Count |
|---|---|---|
| "Hello world" | ["Hello", " world"] | 2 |
| "Bonjour le monde" | ["Bon", "jour", " le", " monde"] | 4 |
| "anticonstitutionnellement" | ["anti", "constit", "ution", "nelle", "ment"] | 5 |
| "GPT-4" | ["G", "PT", "-", "4"] | 4 |
Key rules to remember
- In English: 1 token ≈ 4 characters ≈ 0.75 words
- In French: 1 token ≈ 3 characters ≈ 0.5 words (French is ~30% more "expensive" in tokens)
- Spaces count (often attached to the next word)
- Numbers are often split individually
- Code is generally token-efficient (short keywords)
- Special characters and emojis can cost multiple tokens each
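These rules of thumb can be turned into a quick back-of-the-envelope estimator. The sketch below uses only the approximate ratios from this section, not a real tokenizer, so treat its output as a rough estimate:

```python
def estimate_tokens(text: str, lang: str = "en") -> int:
    """Rough token estimate from character count.

    Uses the rule-of-thumb ratios above (~4 chars/token in English,
    ~3 chars/token in French). For exact counts, use a real
    tokenizer such as tiktoken.
    """
    chars_per_token = {"en": 4, "fr": 3}
    return max(1, round(len(text) / chars_per_token.get(lang, 4)))

print(estimate_tokens("Hello world"))             # 11 chars / 4 → ~3 tokens
print(estimate_tokens("Bonjour le monde", "fr"))  # 16 chars / 3 → ~5 tokens
```

Good enough for budgeting; never rely on it for hard context-window limits.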
Why French costs more
Tokenizers are primarily trained on English text. Result: common English words are often a single token, while their French equivalents are split into multiple pieces.
```
# Example with the GPT-4 tokenizer (counts are approximate)
"The cat is on the table"    → 7 tokens
"Le chat est sur la table"   → 8 tokens
"Understanding"              → 1 token
"Compréhension"              → 3 tokens
"I need to automate this"    → 6 tokens
"J'ai besoin d'automatiser"  → 9 tokens
```
💡 Tip: To count your tokens precisely, use OpenAI's tiktoken library or Anthropic's token-counting API.
Different tokenizers
Each model family uses its own tokenizer:
| Model | Tokenizer | Vocabulary |
|---|---|---|
| GPT-4 / GPT-4o | cl100k_base / o200k_base | 100K-200K tokens |
| Claude 3/4 | Anthropic proprietary | ~100K tokens |
| Llama 3 | BPE (tiktoken-based) | 128K tokens |
| Gemini | SentencePiece | 256K tokens |
| Mistral | SentencePiece | 32K tokens |
The same text can therefore yield a different token count depending on the model used. This is important for cost estimation!
📐 Context window: your working memory
What is context?
The context window represents the maximum amount of text a model can "see" at once. It includes:
- The system prompt (basic instructions)
- Conversation history (previous messages)
- User message (your question)
- Model response (what it generates)
All of this must fit within the window. If it doesn't, the oldest messages are truncated and the model "forgets" them.
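That accounting can be sketched in a few lines: sum up every piece of the request, reserve room for the reply, and compare against the model's window. This is a minimal illustration (token counts are approximated at ~4 characters per token, and the function name is ours, not a provider API):

```python
def fits_in_context(system_prompt: str, history: list[str], user_msg: str,
                    max_output: int, window: int = 200_000,
                    chars_per_token: int = 4) -> bool:
    """Check whether a request fits in the context window.

    Approximates tokens at ~4 chars each; a real client would use
    the provider's tokenizer. Room is reserved for the reply
    (max_output), since the response must fit in the window too.
    """
    used = sum(len(t) for t in [system_prompt, user_msg, *history]) // chars_per_token
    return used + max_output <= window

history = ["Hi", "Hello! How can I help?"]
print(fits_in_context("You are helpful.", history, "Summarize this.", max_output=1000))
```

When this returns False, you either trim the history or switch to a larger-window model.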
The spectacular evolution of context windows
| Year | Model | Context | Text equivalent |
|---|---|---|---|
| 2022 | GPT-3.5 | 4K tokens | ~3,000 words |
| 2023 | GPT-4 | 8K-32K tokens | ~6,000-24,000 words |
| 2023 | Claude 2 | 100K tokens | ~75,000 words |
| 2024 | GPT-4 Turbo | 128K tokens | ~96,000 words |
| 2024 | Claude 3 | 200K tokens | ~150,000 words |
| 2024 | Gemini 1.5 Pro | 1M tokens | ~750,000 words |
| 2025 | Gemini 2.0 | 2M tokens | ~1,500,000 words |
| 2025 | Claude 4 (Sonnet/Opus) | 200K tokens | ~150,000 words |
For perspective: 200K tokens is about a 500-page novel. 1M tokens is the entire Harry Potter series.
Long context: advantages and pitfalls
✅ Advantages of large context:
- Analyzing entire documents (contracts, reports, source code)
- Maintaining long conversations without memory loss
- Providing many examples (few-shot learning)
⚠️ Pitfalls to avoid:
- More context = more expensive (you pay for ALL input tokens)
- "Lost in the middle": models tend to use information in the middle of long contexts less effectively
- Latency: longer context = slower response
- No persistent memory: each request reprocesses the entire context from scratch
```
# Impact of context size on cost (Claude Opus 4)
# Input price: $15/million tokens
Short conversation (2K tokens)    → $0.03
+ 50-page document (30K tokens)   → $0.45
+ Long history (100K tokens)      → $1.50
Max context (200K tokens)         → $3.00
# And that's JUST the input, paid on EVERY message!
```
Strategies for managing context
- Sliding summary: periodically summarize the conversation to free up space
- RAG (Retrieval Augmented Generation): retrieve only relevant information instead of sending everything
- Smart chunking: split documents and only send useful parts
- Concise system prompts: every word in the system prompt is resent with each message
💰 How costs are calculated
The input/output model
LLM API pricing is based on a simple principle:
- Input tokens (what you send): price X per million tokens
- Output tokens (what the model generates): price Y per million tokens
- Output almost always costs more than input (often 3-5x more)
Why does output cost more? Because generation requires much more computation than simply "reading" the input.
Cost calculation formula
Total cost = (input_tokens × input_price / 1M) + (output_tokens × output_price / 1M)
```
# Concrete example:
# Model: Claude Sonnet 4 ($3 input / $15 output per million)
# You send: 2,000 tokens (question + context)
# The model responds: 500 tokens

Cost = (2,000 × 3 / 1,000,000) + (500 × 15 / 1,000,000)
     = $0.006 + $0.0075
     = $0.0135 (≈ 1.4 cents)
```
Hidden costs to watch out for
- System prompt: sent with EVERY request, accumulates quickly
- Conversation history: grows with each exchange
- Retries: if your app automatically retries, you pay double
- Streaming: same cost as non-streaming, but perceived latency decreases
- Images/files: converted to tokens (an image can cost 1K-10K tokens)
- Thinking/reasoning tokens: "thinking" models (o1, Claude with thinking) generate reasoning tokens that are billed
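The first two items compound: because the system prompt and the full history are resent on every turn, input cost grows roughly quadratically with conversation length. A sketch with illustrative message sizes and the Claude Sonnet 4 rates quoted in this article:

```python
def conversation_cost(turns: int, system_tokens: int = 500,
                      msg_tokens: int = 200, reply_tokens: int = 300,
                      in_price: float = 3.0, out_price: float = 15.0) -> float:
    """Total cost of a multi-turn chat where every request resends
    the system prompt plus the whole history (prices in $/M tokens)."""
    total = 0.0
    history = 0  # tokens of all prior messages and replies
    for _ in range(turns):
        input_tokens = system_tokens + history + msg_tokens
        total += input_tokens * in_price / 1e6 + reply_tokens * out_price / 1e6
        history += msg_tokens + reply_tokens
    return total

print(f"10 turns: ${conversation_cost(10):.4f}")
print(f"50 turns: ${conversation_cost(50):.4f}")
```

With these assumed sizes, 5x the turns costs roughly 16x as much, which is exactly why the history-trimming strategies above matter.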
📊 2026 Price Comparison Table
Here are the current prices for major models (in dollars per million tokens):
Premium Models (advanced reasoning)
| Model | Input $/M | Output $/M | Context | Ideal for |
|---|---|---|---|---|
| Claude Opus 4 | $15 | $75 | 200K | Complex tasks, code, analysis |
| GPT-4.5 | $75 | $150 | 128K | Creativity, nuance |
| o3 (OpenAI) | $10 | $40 | 200K | Reasoning, math, code |
| Gemini 2.0 Ultra | $7 | $21 | 2M | Large contexts, multimodal |
Mid-range Models (best value)
| Model | Input $/M | Output $/M | Context | Ideal for |
|---|---|---|---|---|
| Claude Sonnet 4 | $3 | $15 | 200K | Daily use, code |
| GPT-4o | $2.50 | $10 | 128K | Versatile, fast |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M | Volume, long context |
| Llama 3.3 70B | $0.40 | $0.40 | 128K | Self-hosted, private |
Budget Models (high volume)
| Model | Input $/M | Output $/M | Context | Ideal for |
|---|---|---|---|---|
| Claude Haiku 3.5 | $0.80 | $4 | 200K | Classification, extraction |
| GPT-4o mini | $0.15 | $0.60 | 128K | Simple tasks, volume |
| Gemini 2.0 Flash Lite | $0.02 | $0.10 | 1M | Ultra-high volume |
| Mistral Small | $0.10 | $0.30 | 128K | Europe, GDPR |
⚠️ Note: These prices change rapidly. Always check current prices on official websites. The above prices reflect early 2026 rates.
Cost per typical task
To better visualize, here's the estimated cost of common tasks:
| Task | Tokens (in+out) | Budget model | Mid-range model | Premium model |
|---|---|---|---|---|
| Simple question | 500+200 | $0.0002 | $0.005 | $0.02 |
| Article summary | 3K+500 | $0.001 | $0.02 | $0.08 |
| 20-page document analysis | 15K+2K | $0.004 | $0.08 | $0.37 |
| Long article generation | 2K+4K | $0.003 | $0.07 | $0.33 |
| Coding session (1h) | 50K+20K | $0.02 | $0.45 | $2.25 |
| Autonomous agent (task) | 200K+50K | $0.06 | $1.35 | $6.75 |
🧮 Calculating your monthly budget
Simple method
```python
# Quick monthly budget calculation
def calculate_budget(
    requests_per_day: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Returns estimated monthly cost in dollars."""
    cost_per_request = (
        (avg_input_tokens * input_price_per_m / 1_000_000)
        + (avg_output_tokens * output_price_per_m / 1_000_000)
    )
    return cost_per_request * requests_per_day * 30

# Example: a developer using Claude Sonnet 4
budget = calculate_budget(
    requests_per_day=50,
    avg_input_tokens=3000,
    avg_output_tokens=1000,
    input_price_per_m=3.0,   # Claude Sonnet 4
    output_price_per_m=15.0,
)
print(f"Estimated monthly budget: ${budget:.2f}")
# → Estimated monthly budget: $36.00
```
Typical consumption profiles
| Profile | Requests/day | Typical model | Budget/month |
|---|---|---|---|
| Occasional user | 5-10 | GPT-4o mini | $0.50-2 |
| Daily professional | 30-50 | Claude Sonnet 4 | $20-40 |
| Intensive developer | 100-200 | Sonnet/Haiku mix | $30-80 |
| Startup/Product | 1K-10K | GPT-4o mini + Opus | $50-500 |
| Enterprise | 10K+ | Multi-model mix | $500+ |
💡 12 tips to reduce your costs
1. Choose the right model for each task
Don't use a Ferrari to buy bread. Use a budget model for simple tasks and reserve premium models for complex cases.
```python
# Smart routing by complexity
def choose_model(task: str) -> str:
    simple_tasks = ["classification", "extraction", "rephrasing"]
    medium_tasks = ["summary", "writing", "simple_code"]
    if task in simple_tasks:
        return "haiku-3.5"   # $0.80/$4 per M
    elif task in medium_tasks:
        return "sonnet-4"    # $3/$15 per M
    else:
        # analysis, reasoning, complex_code, and anything unrecognized
        return "opus-4"      # $15/$75 per M
```
2. Optimize your prompts
A concise but precise prompt costs less than a verbose one. Eliminate repetitions and unnecessary instructions.
3. Use caching (prompt caching)
Claude and GPT support prompt caching: if you send the same system prompt or context repeatedly, cached tokens cost up to 90% less.
```
# Pricing with caching (Claude)
Without cache: $3.00/M input tokens
With cache:    $0.30/M (cached reads) + $3.75/M (cache writes)
Savings on repetitive requests: up to 90%
```
4. Limit conversation history
Don't keep 100 message histories. Regularly summarize or keep only the last N exchanges.
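A minimal sketch of that trimming, assuming the usual role-tagged message format; a fancier version would replace the dropped turns with a model-generated summary instead of discarding them:

```python
def trim_history(messages: list[dict], keep_last: int = 10) -> list[dict]:
    """Keep the system message plus the last `keep_last` exchanges.

    Drops the oldest user/assistant turns instead of resending the
    whole conversation on every request.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last * 2:]  # 2 messages per exchange

msgs = [{"role": "system", "content": "Be concise."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg {i}"}
    for i in range(40)
]
print(len(trim_history(msgs)))  # 1 system message + 20 recent messages
```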
5. Use open-source models in self-hosting
For high volume, hosting Llama 3.3 or Mistral on your own GPU can be much more cost-effective than APIs.
6. Batching (request grouping)
Group multiple requests into a single API call when possible to reduce overhead.
7. Implement RAG (Retrieval Augmented Generation)
Instead of sending entire documents, retrieve only relevant chunks when needed.
8. Monitor token usage
Use API response headers that show token counts to track your spending in real-time.
9. Compress your context
Remove unnecessary whitespace, format consistently, and use shorter variable names in code.
10. Use async processing
For non-critical tasks, process requests asynchronously during off-peak hours when some providers offer discounts.
11. Negotiate enterprise contracts
For very high volume, contact providers directly for customized pricing.
12. Implement rate limiting
Prevent accidental cost spikes by setting hard limits on your API usage.
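One simple way to enforce such a limit is a guard checked before every request. This is an illustrative sketch (class name and behavior are ours, not a provider feature): estimate the request's cost, charge it against a monthly cap, and refuse once the cap would be exceeded.

```python
class SpendGuard:
    """Hard monthly spending cap checked before each request (sketch)."""

    def __init__(self, monthly_cap: float):
        self.cap = monthly_cap
        self.spent = 0.0

    def charge(self, cost: float) -> None:
        """Record a request's estimated cost; raise if it would bust the cap."""
        if self.spent + cost > self.cap:
            raise RuntimeError(
                f"Budget cap ${self.cap:.2f} reached (spent ${self.spent:.2f})"
            )
        self.spent += cost

guard = SpendGuard(monthly_cap=50.0)
guard.charge(0.0135)  # one Claude Sonnet 4 request from the earlier example
print(f"Spent so far: ${guard.spent:.4f}")
```

In production you would persist the counter and reset it monthly; the point is that the check happens before the API call, not after the bill arrives.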