When facing an AI project, the same question always arises: should you fine-tune a model, implement RAG, or simply improve your prompting? The answer isn't "it depends"—it's "here's how to decide." This guide provides a clear decision tree, including the costs, complexity, and expected results of each approach.
Whether you're a developer, product manager, or entrepreneur, you'll know exactly which method to choose for your use case.
🌳 The Decision Tree: Where to Start?
Before diving into technical details, here’s the fundamental question:
Does the base model already provide decent results with a good prompt?
- Yes, but not precise enough → Improve your prompting (Section 1)
- No, it lacks specific knowledge → Implement RAG (Section 2)
- No, it doesn’t understand the style/format/domain → Consider fine-tuning (Section 3)
```
         ┌─────────────────────┐
         │    Your use case    │
         └──────────┬──────────┘
                    │
         ┌──────────▼──────────┐
         │ Does the base model │
         │ understand the task?│
         └──────────┬──────────┘
               ┌────┴──────┐
              YES          NO
               │           │
         ┌─────▼──────┐ ┌──▼─────────┐
         │  Results   │ │  Missing   │
         │ sufficient?│ │ knowledge? │
         └─────┬──────┘ └──┬─────────┘
           ┌───┴───┐   ┌───┴───┐
          YES      NO YES      NO
           │       │   │       │
       ┌───▼──┐  ┌─▼───▼─┐ ┌───▼─────────┐
       │ STOP │  │  RAG  │ │ FINE-TUNING │
       └──────┘  └───────┘ └─────────────┘
```
The Golden Rule: ALWAYS Start with Prompting
Advanced prompting is free (in terms of development), instantaneous, and often sufficient. Don’t skip this step.
| Approach | Start here if... |
|---|---|
| Prompting | Always. This is your starting point. |
| RAG | The model needs information it doesn’t have (internal docs, recent data) |
| Fine-tuning | The model must adopt a very specific style/behavior at scale |
🎯 Advanced Prompting: The Art of Asking Well
Why Prompting Is Underrated
80% of projects that think they need fine-tuning actually need a better prompt. Advanced prompting is far more than "ask your question clearly"—it’s an engineering discipline in its own right.
System Prompt: Your Foundation
The system prompt defines the model’s baseline behavior. It’s the most powerful and underused lever.
```
# ❌ Weak system prompt
You are a helpful assistant.

# ✅ Structured system prompt
You are an expert in French tax law specializing in SMEs.

## Your Role
- Answer tax-related questions for French SMEs
- Cite relevant legal articles (CGI, BOFiP)
- Flag when a question is outside your scope

## Response Format
1. Brief answer (2-3 sentences)
2. Legal basis (law articles)
3. Key considerations
4. Recommendation (consult an expert if needed)

## Strict Rules
- NEVER invent a law article
- Say "I’m not sure" rather than guessing
- Always mention the last update date of your knowledge
```
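In practice, the system prompt is just the first message in the conversation. A minimal sketch of how it is wired up (the prompt text is abbreviated here; `build_messages` is an illustrative helper name, not a library function):

```python
# The system prompt sets baseline behavior; every user turn comes after it.
SYSTEM_PROMPT = (
    "You are an expert in French tax law specializing in SMEs.\n"
    "NEVER invent a law article. Say \"I'm not sure\" rather than guessing."
)

def build_messages(user_question: str) -> list[dict]:
    """Assemble the messages payload expected by chat-style APIs."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]

messages = build_messages("What VAT rate applies to restaurant meals?")
# This list is what you would pass to e.g. client.chat.completions.create(...)
print(messages[0]["role"])  # → system
```

Keeping the system prompt in one constant makes it easy to version and A/B test it like any other piece of code.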
Few-Shot Prompting: Learning by Example
Few-shot prompting involves providing examples of input/output pairs to guide the model:
```
# Task: Classify support tickets

## Examples

Ticket: "My payment was debited twice"
→ Category: billing
→ Priority: high
→ Sentiment: frustrated

Ticket: "How do I change my password?"
→ Category: account
→ Priority: low
→ Sentiment: neutral

Ticket: "Your app crashes every time I open it since the update"
→ Category: bug
→ Priority: critical
→ Sentiment: dissatisfied

## New ticket to classify

Ticket: "I can’t download my January invoice"
→
```
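Few-shot examples are easier to maintain if you keep them as data and render the prompt programmatically. A sketch, using two of the tickets above (`render_few_shot` is an illustrative helper name):

```python
# Few-shot examples kept as structured data, rendered into the prompt on demand.
EXAMPLES = [
    ("My payment was debited twice", "billing", "high", "frustrated"),
    ("How do I change my password?", "account", "low", "neutral"),
]

def render_few_shot(new_ticket: str) -> str:
    lines = ["# Task: Classify support tickets", "", "## Examples", ""]
    for text, category, priority, sentiment in EXAMPLES:
        lines += [f'Ticket: "{text}"',
                  f"→ Category: {category}",
                  f"→ Priority: {priority}",
                  f"→ Sentiment: {sentiment}", ""]
    lines += ["## New ticket to classify", "", f'Ticket: "{new_ticket}"', "→"]
    return "\n".join(lines)

prompt = render_few_shot("I can't download my January invoice")
```

This way, adding or rotating examples is a one-line change instead of editing a long prompt string.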
Chain-of-Thought (CoT): Making the Model Reason
```
# Without CoT
Q: If a train leaves at 2:30 PM and takes 2 hours and 47 minutes, what time does it arrive?
A: 5:17 PM

# With CoT
Q: If a train leaves at 2:30 PM and takes 2 hours and 47 minutes, what time does it arrive?
Think step by step before answering.

A: Let’s break it down:
- Departure: 2:30 PM
- Duration: 2h47
- 2:30 PM + 2h = 4:30 PM
- 4:30 PM + 47min = 5:17 PM

The train arrives at 5:17 PM.
```
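The same arithmetic the CoT answer walks through can be verified in a couple of lines (the date is arbitrary; only the time matters):

```python
from datetime import datetime, timedelta

# 2:30 PM departure plus a 2h47 journey
departure = datetime(2026, 1, 15, 14, 30)
arrival = departure + timedelta(hours=2, minutes=47)
print(arrival.strftime("%I:%M %p"))  # → 05:17 PM
```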
Structured Output: Controlling the Format
Requesting structured output (JSON, XML) improves reliability and reduces output tokens:
```python
# Prompt for structured extraction
prompt = """Extract the information from this invoice in JSON format.

Invoice:
ABC Company - Invoice #2024-0892
Date: 01/15/2026
Client: Martin Dupont
Amount (excl. tax): €1,500.00
VAT (20%): €300.00
Total (incl. tax): €1,800.00

Respond ONLY with the JSON, no text before/after.
Expected format:
{
  "number": "string",
  "date": "YYYY-MM-DD",
  "client": "string",
  "amount_excl_tax": number,
  "vat": number,
  "total_incl_tax": number
}"""
```
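Even with a strict "JSON only" instruction, it is worth validating the model's reply before trusting it downstream. A defensive parser sketch (`parse_invoice_json` is a hypothetical helper; the required field names match the expected format above):

```python
import json

REQUIRED_FIELDS = {"number", "date", "client",
                   "amount_excl_tax", "vat", "total_incl_tax"}

def parse_invoice_json(raw: str) -> dict:
    """Parse the model's reply and check the expected fields are present."""
    # Models sometimes add stray text despite instructions: keep only the
    # outermost JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in reply")
    data = json.loads(raw[start:end + 1])  # raises on malformed JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    return data

reply = ('{"number": "2024-0892", "date": "2026-01-15", '
         '"client": "Martin Dupont", "amount_excl_tax": 1500.0, '
         '"vat": 300.0, "total_incl_tax": 1800.0}')
invoice = parse_invoice_json(reply)
print(invoice["total_incl_tax"])  # → 1800.0
```

Failing loudly on a missing field is usually better than silently shipping an incomplete record.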
When Prompting Isn’t Enough
Prompting reaches its limits when:
- The model simply lacks the necessary knowledge (internal data, information newer than its training cutoff)
- You need a highly specific style at scale (thousands of queries)
- The latency of few-shot is too high (too many examples = too many tokens)
- Consistency across thousands of queries is insufficient
This is where RAG and fine-tuning come into play.
📚 RAG: Giving the Model Knowledge
The Principle of RAG
RAG (Retrieval-Augmented Generation) involves searching for relevant information in a database and then injecting it into the prompt before generating the response.
```
 User question
       │
       ▼
┌───────────────┐
│   Retrieval   │ ← Searches your documents
│   (search)    │
└───────┬───────┘
        │ Relevant documents
        ▼
┌───────────────┐
│ Augmentation  │ ← Adds to the prompt
│   (context)   │
└───────┬───────┘
        │ Enriched prompt
        ▼
┌───────────────┐
│  Generation   │ ← The LLM responds
│   (answer)    │
└───────────────┘
```
Embeddings: Turning Text into Vectors
To search efficiently, text is converted into vectors (embeddings)—numerical representations that capture meaning.
```python
from openai import OpenAI

client = OpenAI()

# Create an embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How to set up OpenClaw on a VPS?"
)

vector = response.data[0].embedding
# → [0.0123, -0.0456, 0.0789, ...] (1536 dimensions)

# Two semantically similar texts
# will have close vectors (high cosine similarity)
```
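"Close vectors" is usually measured with cosine similarity. A self-contained sketch with toy 3-dimensional vectors (real embeddings have 1536 dimensions, but the math is identical; the vector values are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of three sentences
install_doc = [0.9, 0.1, 0.0]
setup_doc   = [0.8, 0.2, 0.1]   # similar topic → similar direction
recipe_doc  = [0.0, 0.1, 0.9]   # unrelated topic

print(round(cosine_similarity(install_doc, setup_doc), 3))   # high (~0.98)
print(round(cosine_similarity(install_doc, recipe_doc), 3))  # low  (~0.01)
```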
Vector Databases: Storing and Searching
Vector databases allow storing millions of embeddings and finding the closest ones in milliseconds:
| Vector Database | Type | Price | Ideal For |
|---|---|---|---|
| Chroma | Open-source, local | Free | Prototyping, small projects |
| Pinecone | Managed cloud | Usage-based (storage + queries) | Production, scalability |
| Weaviate | Open-source + cloud | Free → paid | Multimodal, flexible |
| pgvector | PostgreSQL extension | Free | If you already use PostgreSQL |
| Qdrant | Open-source + cloud | Free → paid | Performance, Rust-based |
| FAISS | Meta library | Free | Brute-force search, large volumes |
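Under the hood, the core operation every vector store performs is nearest-neighbor search. A brute-force sketch over toy vectors shows the idea (production stores use index structures such as HNSW to avoid scanning everything; `top_k` and the document ids are illustrative):

```python
import math

def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2):
    """Return the k vector ids closest to the query by cosine similarity."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    # Score every stored vector against the query, best first
    scored = sorted(vectors.items(), key=lambda kv: cos(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

store = {
    "doc_install": [0.9, 0.1, 0.0],
    "doc_config":  [0.7, 0.3, 0.1],
    "doc_billing": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.0], store, k=2))  # → ['doc_install', 'doc_config']
```

This linear scan is exactly what FAISS's flat index does; the other databases trade a little accuracy for sub-millisecond lookups at scale.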
Complete RAG Pipeline
```python
# Simplified RAG pipeline
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("my_docs")

# 1. INDEXING (one-time)
documents = [
    "OpenClaw is an autonomous AI agent...",
    "To install OpenClaw on a VPS...",
    "Configuration is done via config.yaml...",
]

for i, doc in enumerate(documents):
    # Create embedding
    resp = client.embeddings.create(
        model="text-embedding-3-small", input=doc
    )
    # Store in ChromaDB
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[resp.data[0].embedding],
        documents=[doc],
    )

# 2. RETRIEVAL (per query)
question = "How to install OpenClaw?"
q_embedding = client.embeddings.create(
    model="text-embedding-3-small", input=question
).data[0].embedding

results = collection.query(
    query_embeddings=[q_embedding],
    n_results=3,  # Top 3 relevant documents
)

# 3. GENERATION (augmented response)
context = "\n\n".join(results["documents"][0])

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"""Answer based
SOLELY on the provided context. If the info isn’t
in the context, say so.
Context:\n{context}"""},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```
Chunking: Smart Document Splitting
RAG quality heavily depends on chunking (how documents are split):
```python
# Chunking strategies

# 1. Fixed size (simple but suboptimal)
def chunk_fixed(text, size=500, overlap=50):
    chunks = []
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i:i + size])
    return chunks

# 2. By paragraphs/sections (smarter)
def chunk_by_sections(text):
    sections = text.split("\n## ")
    return [s.strip() for s in sections if len(s.strip()) > 50]

# 3. Semantic (best, more complex)
# Uses a model to detect topic shifts
```
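To see what the fixed-size strategy actually produces, here it is run on a toy string with tiny sizes so the overlap is visible (the function is repeated so the snippet is self-contained):

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Each chunk repeats the last `overlap` characters of the previous one,
    # so a sentence split at a boundary still appears whole somewhere.
    chunks = []
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i:i + size])
    return chunks

text = "abcdefghij" * 3          # 30 characters
chunks = chunk_fixed(text, size=10, overlap=3)
print(chunks[0])  # → abcdefghij
print(chunks[1])  # → hijabcdefg  (starts 3 chars before the previous end)
```

The overlap costs some extra storage and tokens, which is the price of not losing context at chunk boundaries.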
| Method | Quality | Complexity | Use Case |
|---|---|---|---|
| Fixed size | ⭐⭐ | Very simple | Prototyping |
| By paragraph | ⭐⭐⭐ | Simple | Structured documents |
| By section/title | ⭐⭐⭐⭐ | Medium | Documentation, articles |
| Semantic | ⭐⭐⭐⭐⭐ | Complex | Production, max quality |
When to Use RAG
✅ RAG is ideal when:
- Your data changes frequently (news, product docs)
- You need to cite sources (traceability)
- Your documents are too large for the context
- You want to keep the base model (no fine-tuning)
❌ RAG is NOT ideal when:
- The issue is response style, not knowledge
- You need ultra-low latency
- Your data is simple and fits in the prompt
🔧 Fine-Tuning: Customizing the Model
What Is Fine-Tuning?
Fine-tuning involves continuing the training of an existing model on your own data so it adopts specific behaviors. It’s like giving the model private lessons.
```
Base model (GPT-4o, Claude, Llama)
              │
              ▼  + your training data
    ┌───────────────────┐
    │    Fine-tuning    │
    │   (re-training)   │
    └─────────┬─────────┘
              │
              ▼
      Customized model
 (understands your domain/style)
```
Required Data
Fine-tuning requires example pairs (input → expected output):
```jsonl
{"messages": [{"role": "system", "content": "You are XYZ Shop’s assistant."}, {"role": "user", "content": "What are your delivery times?"}, {"role": "assistant", "content": "At XYZ, we deliver within 24-48 hours in mainland France, and 3-5 days for overseas territories. Express delivery (24h) is free for orders over €50!"}]}
{"messages": [{"role": "system", "content": "You are XYZ Shop’s assistant."}, {"role": "user", "content": "I want to return an item"}, {"role": "assistant", "content": "Of course! At XYZ, you have 30 days to return an item. Go to Your Account > Orders > Return. Returns are free with the prepaid label provided."}]}
```
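A single malformed line can fail an entire upload, so it pays to sanity-check the training file first. A minimal validator sketch, assuming the chat-format JSONL shown above (`validate_jsonl_line` is a hypothetical helper name):

```python
import json

def validate_jsonl_line(line: str) -> bool:
    """Check one training example: valid JSON, a 'messages' list,
    and at least one user turn plus one assistant turn."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list):
        return False
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    return "user" in roles and "assistant" in roles

good = ('{"messages": [{"role": "user", "content": "Hi"}, '
        '{"role": "assistant", "content": "Hello!"}]}')
bad = '{"prompt": "Hi", "completion": "Hello"}'
print(validate_jsonl_line(good))  # → True
print(validate_jsonl_line(bad))   # → False
```

Run it over every line of the file before uploading; a real validator would also check token counts and empty contents.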
How Much Data Is Needed?
| Objective | Data Required | Quality |
|---|---|---|
| Light customization (tone, style) | 50–100 examples | Medium |
| Domain adaptation (e.g., legal) | 100–500 examples | High |
| Complex behavior (e.g., multi-step) | 500+ examples | Very high |
Fine-Tuning Process
1. Prepare the data: clean it and format it as JSONL.
2. Upload it to a platform (OpenAI, Hugging Face, etc.).
3. Launch training (cost is billed per training token and varies by model; check your provider’s current pricing).
4. Evaluate: test on unseen data.
5. Deploy: use your custom model.
When to Fine-Tune
✅ Fine-tuning is ideal when:
- You need a consistent style/behavior at scale
- The model must specialize in a niche domain
- Prompting/RAG can’t achieve the desired quality
- You have high-quality training data
❌ Fine-tuning is NOT ideal when:
- Your data changes frequently (retraining needed)
- You lack technical resources (data prep, evaluation)
- The use case is too broad (better to prompt)
- Cost is a blocker (~$100–$1,000 per training run)
📊 Comparison Table: Prompting vs RAG vs Fine-Tuning
| Criteria | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| Cost | Free | Low to medium | High |
| Implementation Time | Instant | Hours to days | Days to weeks |
| Data Needed | None | Documents | Labeled examples |
| Latency | Low | Medium | Low (after training) |
| Maintenance | None | Update documents | Retrain model |
| Best For | Quick tests, style | Knowledge gaps | Domain specialization |
| Scalability | High | Medium | High |
| Consistency | Medium | High | Very high |
🚀 Decision Flowchart (Summary)
- Start with prompting (always!).
- If results are almost there but inconsistent → refine prompts (CoT, few-shot, system prompt).
- If the model lacks knowledge → implement RAG.
- If you need domain-specific behavior at scale → fine-tune.
- If data changes frequently → prefer RAG over fine-tuning.
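The summary above can even be written down as a literal decision function, handy as a checklist in design reviews (the function and its flags are illustrative, not a library API):

```python
def choose_approach(understands_task: bool,
                    results_sufficient: bool,
                    missing_knowledge: bool,
                    data_changes_often: bool = False) -> str:
    """Mirror of the decision tree: always try prompting first."""
    if understands_task and results_sufficient:
        return "prompting"      # you're done, stop here
    if missing_knowledge or data_changes_often:
        return "rag"            # inject the knowledge at query time
    return "fine-tuning"        # style/behavior problem at scale

print(choose_approach(True, True, False))    # → prompting
print(choose_approach(True, False, True))    # → rag
print(choose_approach(False, False, False))  # → fine-tuning
```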
Pro Tips
- Combine approaches: Use RAG + fine-tuning for complex cases.
- Monitor costs: RAG scales with document size; fine-tuning with training runs.
- Iterate: Start small (prompting), then scale up (RAG → fine-tuning).
Final Rule: The best approach is the simplest one that meets your needs. 90% of projects can succeed with prompting + RAG alone.