🧠 What is RAG? (In simple terms)
The problem
Imagine a brilliant expert who has read millions of books... but who cannot consult YOUR documents. You ask them "What is my company's revenue in Q3?" and they answer "I don't have that information."
This is exactly what happens with standard LLMs (ChatGPT, Claude, Gemini):
- They possess general knowledge (what they learned during training)
- They do NOT know your data (documents, notes, databases)
- Their knowledge is frozen in time (knowledge cutoff)
The RAG solution
RAG = Retrieval-Augmented Generation.
In simple terms: before answering, the AI searches for relevant information in YOUR documents, then uses it to formulate its response.
Without RAG, the question goes straight to the LLM, which answers from its general knowledge alone. With RAG, the question first triggers a search in your documents to retrieve relevant excerpts. These excerpts are then added to the question before it is sent to the LLM, which can then produce a more accurate response that cites its sources.
Simple analogy
Think of a student taking an exam:
- Without RAG: they answer from memory (sometimes they make mistakes or make things up)
- With RAG: they are allowed to consult their revision notes before answering
RAG doesn't make the AI smarter -- it gives it access to your information.
🔢 Embeddings: transforming text into vectors
The key concept
For AI to be able to "search" through your documents, the text first needs to be transformed into something the computer can efficiently compare: vectors (lists of numbers).
Conceptually, an embedding transforms text into numbers. For example, "The cat is sleeping on the couch" and "The feline is resting in the living room" yield very close vectors because they share the same meaning, even though they have almost no words in common. A sentence like "Python is a language", on the other hand, produces a distant vector because the topic is completely different.
How does it work?
An embedding model (like text-embedding-3-small from OpenAI) has been trained on billions of texts to understand semantics. It does not compare words one by one -- it understands the meaning.
To generate an embedding, the text is sent to the OpenAI API via an authenticated HTTP POST request. The call specifies the embedding model used and the input text. In return, the API returns an array of numbers (the vector) that captures the semantic meaning of the text.
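As a rough illustration of the request described above, here is a minimal sketch using only the Python standard library. The helper names `build_embedding_request` and `parse_embedding_response` are hypothetical, and the API key is assumed to come from your own environment. The sketch builds the request without sending it, so it can be inspected offline.

```python
import json
import urllib.request

OPENAI_EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def build_embedding_request(
    text: str, api_key: str, model: str = "text-embedding-3-small"
) -> urllib.request.Request:
    """Build (but do not send) the authenticated POST request."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        OPENAI_EMBEDDINGS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embedding_response(raw: bytes) -> list[float]:
    """Extract the vector of the first input from the JSON response."""
    payload = json.loads(raw)
    return payload["data"][0]["embedding"]


# Sending the request performs a real, billed network call,
# so it is left commented out here:
# import os
# request = build_embedding_request("hello", os.environ["OPENAI_API_KEY"])
# with urllib.request.urlopen(request) as response:
#     vector = parse_embedding_response(response.read())
```

In practice most projects use the official client library instead; the raw request is shown only to make the mechanics visible.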
Similarity comparison
Once you have vectors, comparing two texts comes down to measuring how close their vectors are. For this, we use cosine similarity: the cosine of the angle between the two vectors in a multi-dimensional space. Mathematically the score ranges from -1 to 1; with text embeddings, a score close to 1 means the texts are semantically very similar, while a score close to 0 indicates they are unrelated. This makes it possible to rank documents by relevance to a question. The bands below are a rough guide only: the actual thresholds depend on the embedding model.
| Similarity score | Meaning |
|---|---|
| 0.90 - 1.00 | Almost identical |
| 0.75 - 0.90 | Very similar |
| 0.50 - 0.75 | Related to the topic |
| 0.25 - 0.50 | Barely related |
| 0.00 - 0.25 | Unrelated |
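The cosine similarity calculation can be sketched in a few lines of plain Python. The three-dimensional vectors in the usage example are made-up toy values, not real embeddings, and `rank_by_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: list[float], documents: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Return (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (illustrative values only).
cat_on_couch = [0.9, 0.1, 0.0]
toy_documents = {
    "feline in living room": [0.8, 0.2, 0.1],
    "python is a language": [0.0, 0.1, 0.9],
}
ranking = rank_by_similarity(cat_on_couch, toy_documents)
```

With these toy values, the sentence about the feline ranks ahead of the sentence about Python.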
🗄️ Vector databases: where to store embeddings
Once your texts have been transformed into vectors, you need to store them somewhere and be able to search them efficiently. This is the role of vector databases.
Comparison of solutions
| Solution | Type | Complexity | Performance | Self-hosted | Ideal for |
|---|---|---|---|---|---|
| FAISS | Python Library | Easy | Excellent | Yes | Prototyping, small datasets |
| ChromaDB | Embedded database | Easy | Good | Yes | Side projects, <1M docs |
| Qdrant | Dedicated server | Medium | Excellent | Yes | Production, large volumes |
| pgvector | Postgres Extension | Medium | Good | Yes | If you already have Postgres |
| Pinecone | Managed cloud | Easy | Excellent | No | If you don't want to manage infra |
| Weaviate | Dedicated server | Complex | Excellent | Yes | Advanced use cases, multimodal |
FAISS: the simplest to get started
FAISS works like an optimized in-memory index. You define the dimension of your vectors (1536 for text-embedding-3-small), then add your vectors to the index. To search for the most relevant documents, you provide the vector of your question and FAISS uses a distance metric (such as L2, the Euclidean distance) to return the k closest vectors, along with their distances. Everything happens locally, with no server to manage.
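To make the idea concrete without installing anything, here is a plain-Python stand-in for an exhaustive L2 index. It is not FAISS and shares none of its optimizations; the class name `FlatL2Index` is hypothetical and only mimics the add-then-search workflow described above.

```python
import heapq


class FlatL2Index:
    """Brute-force nearest-neighbor search using squared L2 distance."""

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension
        self.vectors: list[list[float]] = []

    def add(self, vectors: list[list[float]]) -> None:
        for vector in vectors:
            if len(vector) != self.dimension:
                raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")
            self.vectors.append(list(vector))

    def search(self, query: list[float], k: int) -> list[tuple[float, int]]:
        """Return up to k (squared_distance, position) pairs, closest first."""
        if len(query) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(query)}")
        distances = (
            (sum((q - x) ** 2 for q, x in zip(query, vector)), position)
            for position, vector in enumerate(self.vectors)
        )
        return heapq.nsmallest(k, distances)


index = FlatL2Index(dimension=2)
index.add([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
nearest = index.search([0.9, 0.9], k=2)
```

Squared distances are enough for ranking, since taking the square root would not change the order of the results.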
ChromaDB: the user-friendly embedded database
ChromaDB is used as a persistent local database. You create a collection with a distance space (for example, cosine), then you add your documents to it. ChromaDB can generate embeddings automatically or use your own. Each document is associated with a unique identifier and metadata. The search is performed by providing a query text: ChromaDB vectorizes the question, compares the vectors, and returns the most relevant documents.
Qdrant: for production
Qdrant works as a dedicated server accessible via API. You first create a collection by specifying the size of the vectors and the distance metric (cosine, dot product, Euclidean). Insertion is done via "points", each containing an identifier, a vector, and a payload (the metadata like the source text). The search is performed by sending a query vector: Qdrant returns the closest points with their similarity score, which allows you to filter and refine the results on the application side.
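The application-side filtering mentioned above might look like the following sketch. `Hit` is a hypothetical stand-in for a search result (identifier, similarity score, payload), not a class from the Qdrant client, and `refine_hits` is an illustrative helper.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """Hypothetical search result: identifier, similarity score, payload."""
    id: int
    score: float
    payload: dict = field(default_factory=dict)


def refine_hits(hits: list[Hit], min_score: float = 0.0, **payload_filters) -> list[Hit]:
    """Keep hits above a score threshold whose payload matches every filter."""
    kept = [
        hit
        for hit in hits
        if hit.score >= min_score
        and all(hit.payload.get(key) == value for key, value in payload_filters.items())
    ]
    return sorted(kept, key=lambda hit: hit.score, reverse=True)


hits = [
    Hit(1, 0.62, {"source": "faq.md", "text": "..."}),
    Hit(2, 0.91, {"source": "handbook.md", "text": "..."}),
    Hit(3, 0.88, {"source": "faq.md", "text": "..."}),
]
faq_hits = refine_hits(hits, min_score=0.7, source="faq.md")
```

Filtering on the server side is usually preferable when the database supports it, because fewer results travel over the network.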
🔄 Complete pipeline: from ingestion to response
Here is the complete RAG pipeline, step by step:
Step 1 - Ingestion: Raw documents (PDF, Markdown, TXT, web pages) are retrieved from their sources and loaded into memory with their metadata.
Step 2 - Chunking: Large documents are split into regular chunks of around 500 tokens, with an overlap between chunks to avoid losing context.
Step 3 - Embedding: Each chunk is sent to an embedding model to be transformed into a numerical vector capturing its semantic meaning.
Step 4 - Storage: The chunks and their vectors are inserted into a vector database (like Qdrant or ChromaDB), which indexes them for fast searches.
Step 5 - Retrieval: When a question arrives, it is vectorized in turn, then the vector database searches for the most semantically similar chunks.
Step 6 - Generation: The user's question and the relevant retrieved chunks are combined into an enriched prompt, sent to the LLM to produce an accurate and sourced response.
Step 1: Ingestion
The ingestion step involves retrieving your documents from all available sources: local files (Markdown, text, PDF), databases, web pages, or APIs. Each document is loaded into memory with its raw content and its metadata (source, file type, date). The goal is to build a raw and structured corpus ready to be split.
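For local text and Markdown files, ingestion can be sketched with the standard library alone. The `Document` dataclass and `load_documents` function are hypothetical names; PDFs, databases, and web pages would each need their own loader.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Document:
    content: str        # raw text of the file
    source: str         # path the text was loaded from
    file_type: str      # extension without the leading dot
    modified: datetime  # last modification time (UTC)


def load_documents(root: str, extensions: tuple[str, ...] = (".md", ".txt")) -> list[Document]:
    """Recursively load every matching file under root, in sorted path order."""
    documents = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            documents.append(
                Document(
                    content=path.read_text(encoding="utf-8"),
                    source=str(path),
                    file_type=path.suffix.lower().lstrip("."),
                    modified=datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc),
                )
            )
    return documents
```

Calling `load_documents("docs")` would return one `Document` per `.md` or `.txt` file found under a `docs` folder (an assumed directory name).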
Step 2: Chunking (splitting)
LLMs have a limited context window. The chunking step splits large documents into digestible pieces of regular size (generally 300 to 800 tokens). An overlap of 50 to 100 tokens between consecutive chunks avoids splitting a piece of information across a chunk boundary. Separators are chosen with care: cuts between paragraphs are preferred, then between sentences, to preserve the semantic coherence of each piece.
| Parameter | Recommended value | Impact |
|---|---|---|
| chunk_size | 300-800 tokens | Too small = loss of context, too large = noise |
| overlap | 50-100 tokens | Ensures continuity between chunks |
| Separator | Paragraphs > sentences > words | Respects the text structure |
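Here is a minimal sliding-window sketch of chunking with overlap. It makes two simplifying assumptions: whitespace-separated words stand in for tokens, and the separator preference (paragraphs, then sentences) is ignored. The function name `chunk_words` is hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of chunk_size words, each sharing
    `overlap` words with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
small_chunks = chunk_words(sample, chunk_size=4, overlap=1)
```

In the usage example, the last word of each chunk is repeated as the first word of the next one, which is the continuity the overlap parameter is meant to provide.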
Step 3: Embedding
The batch embedding step involves sending the chunks to the embedding API in groups (generally of 100) to respect request limits and optimize costs. Each batch is transformed into vectors in a single HTTP request. The returned embeddings are collected and associated with their original chunk. This batch approach drastically reduces processing time compared to individual calls.
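The batching logic itself is independent of any provider. In the sketch below, `embed_batch` is an assumed callable that takes a list of texts and returns one vector per text (for instance, a wrapper around your embedding API); both function names are hypothetical.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    chunks: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[tuple[str, list[float]]]:
    """Embed chunks one batch per call and pair each chunk with its vector."""
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        batch_vectors = embed_batch(batch)  # one request per batch
        if len(batch_vectors) != len(batch):
            raise ValueError("the embedding call must return one vector per chunk")
        vectors.extend(batch_vectors)
    return list(zip(chunks, vectors))
```

With 250 chunks and the default batch size, `embed_batch` would be called three times (100, 100, and 50 chunks) instead of 250 times.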
Step 4: Storage
The storage step involves inserting the chunks and their embeddings into the chosen vector database. Each chunk is saved with its vector, a unique identifier (for example chunk_0, chunk_1), and useful metadata (source file, chunk index, date). The vector database immediately indexes these vectors to enable fast searches.
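To show the shape of a stored record, here is a sketch that uses SQLite from the standard library as a stand-in. Unlike a real vector database it builds no vector index, so it only illustrates what is saved: an identifier, the text, the vector, and the metadata. The table layout and function names are assumptions.

```python
import json
import sqlite3


def create_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and make sure the chunks table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id TEXT PRIMARY KEY, text TEXT NOT NULL, "
        "vector TEXT NOT NULL, metadata TEXT NOT NULL)"
    )
    return conn


def store_chunks(
    conn: sqlite3.Connection,
    source: str,
    chunks: list[str],
    vectors: list[list[float]],
) -> list[str]:
    """Insert chunks with their vectors and metadata; return the new ids."""
    if len(chunks) != len(vectors):
        raise ValueError("each chunk needs exactly one vector")
    # Continue numbering after the rows already stored so ids stay unique.
    offset = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
    rows = []
    for i, (text, vector) in enumerate(zip(chunks, vectors)):
        metadata = {"source": source, "chunk_index": i}
        rows.append((f"chunk_{offset + i}", text, json.dumps(vector), json.dumps(metadata)))
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO chunks (id, text, vector, metadata) VALUES (?, ?, ?, ?)", rows
        )
    return [row[0] for row in rows]
```

A dedicated vector database replaces the JSON-encoded vector column with a native vector type and an index built for similarity search.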
Step 5: Retrieval (search)
The retrieval step is triggered when a user asks a question. We start by vectorizing the question with the same embedding model used during ingestion. This vector is sent to the vector database, which compares the question vector to all stored vectors and returns the k most semantically similar chunks (generally the top 5). These chunks constitute the relevant context for answering.
Step 6: Generation
The generation step involves building a complete prompt containing the user's question AND the relevant chunks retrieved during retrieval. This enriched prompt is sent to the LLM via a chat API (like OpenRouter or Anthropic's direct API), explicitly asking it to base its response solely on the provided context. The LLM can then formulate an accurate, sourced response, with a much lower risk of hallucination.
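Building the enriched prompt could look like the sketch below. The wording of the instructions and the chunk format (a dict with `text` and `source` keys) are assumptions to adapt to your own setup; the resulting string would then be sent as the user message of whichever chat API you use.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Combine the question and retrieved chunks into a single prompt.

    Each chunk is expected to be a dict with 'text' and 'source' keys.
    """
    context = "\n\n".join(
        f"[{number}] (source: {chunk['source']})\n{chunk['text']}"
        for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you don't know. "
        "Cite the passages you used by their [number].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "What was the Q3 revenue?",
    [{"text": "Q3 revenue reached the target.", "source": "report.md"}],
)
```

Numbering the chunks gives the model something concrete to cite, which makes the answer easier to verify against the sources.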
🪶 Simple alternatives to RAG
Full RAG (embeddings + vector DB + pipeline) is not always necessary. Here are simpler alternatives that work surprisingly well.
1. Markdown files in the context
The simplest method is to inject your files directly into the prompt: read the content of your Markdown files (memory, project notes, documentation) and concatenate them into the prompt sent to the LLM. Each file is introduced by a title to structure the context. The LLM thus receives all the information at once, without any intermediate infrastructure.
Advantages: zero infrastructure, works immediately
Limitations: limited by the LLM's context window (~100-200k tokens)
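A minimal sketch of this concatenation, assuming the files are UTF-8 Markdown and using the file name as the title of each section (`build_context` is a hypothetical helper name):

```python
from pathlib import Path


def build_context(paths: list[str]) -> str:
    """Concatenate files into one context string, one titled section per file."""
    sections = []
    for path in map(Path, paths):
        body = path.read_text(encoding="utf-8").strip()
        sections.append(f"## {path.name}\n\n{body}")
    return "\n\n".join(sections)
```

The returned string can be placed at the top of the prompt, followed by the user's question.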
2. MEMORY.md: the file memory
This is the approach used by OpenClaw: a Markdown file that serves as persistent memory.
This MEMORY.md file contains, for example, user preferences (concise answers, Python version, hosting), current projects with their tech stack, and important dated decisions. The AI reads this file at each conversation and updates it when there is new information to remember.
3. Keyword search (memory_search)
Rather than embeddings, you can use a simple text search. Keyword search works by extracting the terms from the user's question, then scanning each document to count how many of these keywords appear in it. The documents are then ranked in descending order of match count, and the top k are returned. This approach is entirely local, requires no external API, and executes almost instantaneously.
Advantages: no embedding API needed, free, fast
Limitations: no semantic understanding ("car" will not match "automobile")
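The keyword approach fits in a few lines. This sketch assumes documents are held in a dict mapping a name to its text, and it counts distinct question terms found in each document; the function names are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_search(question: str, documents: dict[str, str], k: int = 5) -> list[tuple[str, int]]:
    """Return up to k (name, match_count) pairs, best match first.

    Documents that share no term with the question are left out.
    """
    terms = tokenize(question)
    scored = []
    for name, text in documents.items():
        matches = len(terms & tokenize(text))
        if matches > 0:
            scored.append((name, matches))
    # Sort by descending match count, then by name for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


notes = {
    "a": "The car is red",
    "b": "The automobile is red",
    "c": "Bananas are yellow",
}
results = keyword_search("red car", notes)
```

Document "b" matches only on "red": the search has no way of knowing that "automobile" means "car", which is exactly the limitation noted above.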
Comparison table
| Method | Complexity | Cost | Accuracy | Scale |
|---|---|---|---|---|
| Files in the context | None | Free | Good (if everything fits) | <100 pages |
| MEMORY.md | None | Free | Good | <50 pages |
| Keyword search | Low | Free | Average | <10k docs |
| Full RAG (embeddings) | Medium | ~$0.01/1k chunks | Excellent | Unlimited |
| RAG + reranking | High | ~$0.02/1k chunks | Optimal | Unlimited |
⚖️ When RAG is overkill vs essential
RAG is OVERKILL when...
| Situation | Recommended alternative |
|---|---|
| Less than 50 pages of context | Files in the prompt |
| Data that rarely changes | MEMORY.md |
| Simple and predictable questions | Keyword search |
| Prototype / MVP | Context injection |
| Zero budget | Markdown files |
RAG is ESSENTIAL when...
| Situation | Why |
|---|---|
| Thousands of documents | No longer fits in context |
| Evolving knowledge base | Incremental updates |
| Unpredictable questions | Need for semantic search |
| Critical accuracy | Verifiable sources |
| Multi-user | Everyone has their own documents |
| Dense technical data | AI needs to find the needle in the haystack |
The simple test
Ask yourself this question:
"Does ALL of my data fit in the LLM's context?"
- Yes (< 100k tokens, or ~75k words) --> No need for RAG, inject directly
- No --> RAG necessary
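This test can be automated with a rough estimate. The sketch below reuses the ratio implied above (100k tokens for about 75k words, so roughly 0.75 words per token); real tokenizers vary by model and language, so treat the result as an approximation. The function names are hypothetical.

```python
WORDS_PER_TOKEN = 0.75  # rough ratio implied by "100k tokens, or ~75k words"


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the whitespace-separated word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_context(texts: list[str], budget_tokens: int = 100_000) -> bool:
    """True if the combined texts stay under the token budget."""
    return sum(estimate_tokens(text) for text in texts) < budget_tokens
```

For a precise answer, count tokens with the tokenizer of the model you actually use, and leave room in the budget for the question and the response.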
🏗️ RAG in OpenClaw
OpenClaw natively integrates memory mechanisms that use RAG principles without the complexity:
- MEMORY.md: long-term memory, automatically injected
- memory_search: search through daily notes
- Context files: AGENTS.md, TOOLS.md, etc.
For most personal use cases, this files + search approach is more than enough. Full RAG with a vector database becomes useful when you scale up.
🎯 Summary: where to start?
| Your situation | Our recommendation |
|---|---|
| Beginner, small project | MEMORY.md + context files |
| Medium project, <1000 docs | ChromaDB + embeddings |
| Production, high volume | Qdrant + full pipeline |
| OpenClaw user | Use native memory first |
Concrete steps
- Start simple: MEMORY.md and context files
- When that's no longer enough: add ChromaDB for semantic search
- When you scale up: migrate to Qdrant with an ingestion pipeline
- Optimize: add reranking, intelligent chunking, metadata
RAG isn't magic -- it's just an elegant way to give your AI access to YOUR information. Start simple, only add complexity when necessary.
The essentials
- RAG allows an LLM to answer based on your own documents, not just on its training knowledge.
- Embeddings transform text into numerical vectors capturing semantic meaning, which makes it possible to compare texts by their mathematical proximity.
- Vector databases (FAISS, ChromaDB, Qdrant) store and index these vectors for ultra-fast searches.
- The RAG pipeline follows 6 steps: ingestion, chunking, embedding, storage, retrieval, and generation.
- RAG is not always necessary: for fewer than 50 pages, direct injection into the context is often enough.
Recommended tools
| Tool | Usage | Link |
|---|---|---|
| text-embedding-3-small | OpenAI embedding model, good price/performance ratio | openai.com |
| ChromaDB | Embedded vector database, ideal for prototyping | chromadb.ai |
| Qdrant | Production vector database, high performance | qdrant.tech |
| OpenRouter | Proxy to access Claude, GPT and other LLMs | openrouter.ai |
| LangChain / LlamaIndex | Python frameworks for orchestrating a RAG pipeline | python.langchain.com / llamaindex.ai |
Common mistakes
- Chunk too small: a chunk of 100 tokens loses the context around the information. Aim for 300-800 tokens.
- No overlap: without overlap, you risk splitting an idea across two adjacent chunks.
- Wrong embedding model: using a different embedding model between ingestion and retrieval completely skews the similarity. Stay consistent.
- Too many results returned: retrieving 20 chunks drowns the LLM in noise. 3 to 5 relevant chunks are better than 20 moderately relevant ones.
- Ignoring metadata: storing only text without metadata (source, date, category) prevents filtering results during retrieval.
FAQ
Does RAG work with PDFs?
Yes, but the ingestion step is more complex. You need to extract the text from the PDF (with tools like pdfplumber or PyMuPDF), then clean up artifacts (headers, footers, page numbers) before chunking.
What is the difference between RAG and fine-tuning?
Fine-tuning modifies the model's weights so it learns a specific style or behavior. RAG does not modify the model: it dynamically provides information to it at the time of the query. To inject factual knowledge, RAG is almost always the right choice.
How much does a RAG pipeline cost?
Most of the cost comes from embeddings (~$0.02 per million tokens with text-embedding-3-small in 2025) and LLM calls for generation. For 10,000 documents of 500 tokens each (5 million tokens in total), the initial embedding cost is about $0.10.
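The embedding estimate is simple arithmetic, sketched here so you can plug in your own volumes. The price is the 2025 figure quoted above and should be checked against current pricing; `embedding_cost` is a hypothetical helper name.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, figure quoted above for text-embedding-3-small


def embedding_cost(
    num_documents: int,
    tokens_per_document: int,
    price_per_million: float = PRICE_PER_MILLION_TOKENS,
) -> float:
    """Estimated one-off cost in USD of embedding the whole corpus."""
    total_tokens = num_documents * tokens_per_document
    return total_tokens / 1_000_000 * price_per_million


cost = embedding_cost(10_000, 500)  # 5 million tokens, about $0.10
```

Generation costs are not included: they depend on the LLM chosen and on how many chunks are sent with each question.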
Can RAG and keyword search be combined?
Yes, this is called hybrid search. A vector search and a keyword search are run side by side, then the results are fused and reranked. Solutions like Qdrant support this approach, and their payload filters can narrow the results further.
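One common way to fuse the two result lists is reciprocal rank fusion (RRF), sketched below. The text above does not name a fusion method, so RRF is an assumption chosen for illustration; the constant `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by several searches rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_results = ["a", "b", "c"]   # from the vector search
keyword_results = ["b", "d", "a"]  # from the keyword search
fused = reciprocal_rank_fusion([vector_results, keyword_results])
```

RRF only needs the rank of each result, not its raw score, which makes it convenient when the two searches produce scores on different scales.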
What tools should be used to orchestrate a RAG pipeline with function calls?
Modern RAG often goes beyond simple retrieval: we want the AI to be able to decide to search documents, but also to call external tools. To understand how to connect an LLM to tools and databases, our guide MCP, Function Calling, Tool Use : le guide complet details the underlying mechanisms.
Conclusion
RAG is a fundamental building block of AI in 2025, but it's not a miracle solution to apply everywhere. The most common mistake is over-engineering: setting up a Qdrant stack + ingestion pipeline when three Markdown files in the context would do the job.
The right approach is progressive scaling: start with direct context injection and a MEMORY.md. When your data exceeds the context window, move to ChromaDB. When you have thousands of documents and production needs, migrate to Qdrant. At each step, you only add complexity because the previous solution has reached its limits.
If you want to dive deeper into smart agent architecture, check out our article on Mémoire IA : comment faire en sorte que votre agent se souvienne de tout. And if your project involves multiple agents that need to collaborate and share a common memory, ruflo : la plateforme d'orchestration multi-agent qui explose sur GitHub will give you the keys to architect all of this.