RAG for Dummies: Giving Memory to Your AI

AI Agents 🟡 Intermediate ⏱️ 15 min read 📅 2026-02-24

🧠 What is RAG? (In simple terms)

The problem

Imagine a brilliant expert who has read millions of books... but who cannot consult YOUR documents. You ask them "What is my company's revenue in Q3?" and they answer "I don't have that information."

This is exactly what happens with standard LLMs (ChatGPT, Claude, Gemini):
- They possess general knowledge (what they learned during training)
- They do NOT know your data (documents, notes, databases)
- Their knowledge is frozen in time (knowledge cutoff)

The RAG solution

RAG = Retrieval-Augmented Generation.

In simple terms: before answering, the AI searches for relevant information in YOUR documents, then uses it to formulate its response.

Without RAG, the question goes straight to the LLM, which answers based solely on its general knowledge. With RAG, the question first triggers a search in your documents to retrieve relevant excerpts. These excerpts are then added to the question before it is sent to the LLM, which can then produce an accurate and sourced response.

Simple analogy

Think of a student taking an exam:
- Without RAG: they answer from memory (sometimes they make mistakes or make things up)
- With RAG: they are allowed to consult their revision notes before answering

RAG doesn't make the AI smarter -- it gives it access to your information.


🔢 Embeddings: transforming text into vectors

The key concept

For AI to be able to "search" through your documents, the text first needs to be transformed into something the computer can efficiently compare: vectors (lists of numbers).

Conceptually, an embedding transforms text into numbers. For example, "The cat is sleeping on the sofa" and "The feline is resting in the living room" yield very close vectors because they mean nearly the same thing, even though they share almost no words. A sentence like "Python is a language", on the other hand, produces a distant vector because its topic is completely different.

How does it work?

An embedding model (like text-embedding-3-small from OpenAI) has been trained on billions of texts to understand semantics. It does not compare words one by one -- it understands the meaning.

To generate an embedding, the text is sent to the OpenAI API via an authenticated HTTP POST request. The call specifies the embedding model used and the input text. In return, the API returns an array of numbers (the vector) that captures the semantic meaning of the text.
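As a rough sketch of that call, here is what the request could look like using only the Python standard library. The endpoint and the `model`/`input` fields follow OpenAI's public embeddings API; the helper names `build_embedding_request` and `embed` are made up for this example, and the API key is assumed to come from your own configuration.

```python
import json
import urllib.request

OPENAI_EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def build_embedding_request(
    text: str, api_key: str, model: str = "text-embedding-3-small"
) -> urllib.request.Request:
    """Build (but do not send) the authenticated POST request."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        OPENAI_EMBEDDINGS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def embed(text: str, api_key: str) -> list[float]:
    """Send the request and return the vector found in the JSON response."""
    request = build_embedding_request(text, api_key)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # The vector for the first (and only) input is in data[0].embedding
    return payload["data"][0]["embedding"]
```

Calling `embed("some text", api_key)` performs a real, billed network request, so keep the key out of your source code (an environment variable is a common choice).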

Similarity comparison

Once you have vectors, comparing two texts comes down to measuring how close their vectors are. For this, we use cosine similarity: we measure the angle between the two vectors in a multi-dimensional space. A score close to 1 means the texts are very semantically similar, while a score close to 0 indicates they are essentially unrelated. This calculation makes it possible to rank documents by relevance to a question.

| Similarity score | Meaning |
|---|---|
| 0.90 - 1.00 | Almost identical |
| 0.75 - 0.90 | Very similar |
| 0.50 - 0.75 | Related to the topic |
| 0.25 - 0.50 | Barely related |
| 0.00 - 0.25 | Unrelated |
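To make the cosine similarity calculation concrete, here is a minimal pure-Python version. The three-dimensional vectors are invented toy values for illustration only; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Toy vectors (made up): two sentences about a resting cat, one about Python
cat_on_sofa = [0.9, 0.1, 0.0]
feline_resting = [0.8, 0.2, 0.1]
python_language = [0.0, 0.1, 0.9]

print(cosine_similarity(cat_on_sofa, feline_resting))   # high score
print(cosine_similarity(cat_on_sofa, python_language))  # low score
```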

🗄️ Vector databases: where to store embeddings

Once your texts have been transformed into vectors, you need to store them somewhere and be able to search them efficiently. This is the role of vector databases.

Comparison of solutions

| Solution | Type | Complexity | Performance | Self-hosted | Ideal for |
|---|---|---|---|---|---|
| FAISS | Python library | Easy | Excellent | Yes | Prototyping, small datasets |
| ChromaDB | Embedded database | Easy | Good | Yes | Side projects, <1M docs |
| Qdrant | Dedicated server | Medium | Excellent | Yes | Production, large volumes |
| pgvector | Postgres extension | Medium | Good | Yes | If you already have Postgres |
| Pinecone | Managed cloud | Easy | Excellent | No | If you don't want to manage infra |
| Weaviate | Dedicated server | Complex | Excellent | Yes | Advanced use cases, multimodal |

FAISS: the simplest to get started

FAISS works like an optimized index in RAM. You define the dimension of your vectors (1536 for OpenAI's text-embedding-3-small), then you add your vectors to the index. To search for the most relevant documents, you provide the vector of your question and FAISS uses a distance metric (such as L2, the Euclidean distance) to return the k closest vectors, along with their distances. Everything happens locally, with no server to manage.
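The sketch below is not FAISS itself. It is a tiny pure-Python stand-in that mimics the same workflow (fix a dimension, add vectors, search for the k nearest by L2 distance) so you can see what such an index does conceptually. The class name `FlatL2Index` is hypothetical, and a real library replaces this full scan with heavily optimized code.

```python
import math


class FlatL2Index:
    """Toy in-memory index: stores vectors and scans all of them on search."""

    def __init__(self, dimension: int):
        self.dimension = dimension
        self.vectors: list[list[float]] = []

    def add(self, vectors: list[list[float]]) -> None:
        for vector in vectors:
            if len(vector) != self.dimension:
                raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")
            self.vectors.append(list(vector))

    def search(self, query: list[float], k: int) -> list[tuple[float, int]]:
        """Return (distance, position) pairs for the k nearest vectors."""
        scored = [(math.dist(query, v), i) for i, v in enumerate(self.vectors)]
        return sorted(scored)[:k]


index = FlatL2Index(dimension=2)
index.add([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
for distance, position in index.search([0.9, 0.9], k=2):
    print(position, round(distance, 3))
```

The positions returned are the order in which vectors were added, so you need to keep your own list mapping each position back to its source text.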

ChromaDB: the user-friendly embedded database

ChromaDB is used as a persistent local database. You create a collection with a distance space (for example, cosine), then you add your documents to it. ChromaDB can generate embeddings automatically or use your own. Each document is associated with a unique identifier and metadata. The search is performed by providing a query text: ChromaDB vectorizes the question, compares the vectors, and returns the most relevant documents.

Qdrant: for production

Qdrant works as a dedicated server accessible via API. You first create a collection by specifying the size of the vectors and the distance metric (cosine, dot product, Euclidean). Insertion is done via "points", each containing an identifier, a vector, and a payload (the metadata like the source text). The search is performed by sending a query vector: Qdrant returns the closest points with their similarity score, which allows you to filter and refine the results on the application side.


🔄 Complete pipeline: from ingestion to response

Here is the complete RAG pipeline, step by step:

Step 1 - Ingestion: Raw documents (PDF, Markdown, TXT, web pages) are retrieved from their sources and loaded into memory with their metadata.

Step 2 - Chunking: Large documents are split into regular chunks of around 500 tokens, with an overlap between chunks to avoid losing context.

Step 3 - Embedding: Each chunk is sent to an embedding model to be transformed into a numerical vector capturing its semantic meaning.

Step 4 - Storage: The chunks and their vectors are inserted into a vector database (like Qdrant or ChromaDB), which indexes them for fast searches.

Step 5 - Retrieval: When a question arrives, it is vectorized in turn, then the vector database searches for the most semantically similar chunks.

Step 6 - Generation: The user's question and the relevant retrieved chunks are combined into an enriched prompt, sent to the LLM to produce an accurate and sourced response.

Step 1: Ingestion

The ingestion step involves retrieving your documents from all available sources: local files (Markdown, text, PDF), databases, web pages, or APIs. Each document is loaded into memory with its raw content and its metadata (source, file type, date). The goal is to build a raw and structured corpus ready to be split.
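For local files, ingestion can be as small as the sketch below. It only handles plain-text formats (PDFs need a dedicated extraction library), and the function name `load_documents` and the dictionary layout are choices made for this example.

```python
from datetime import datetime, timezone
from pathlib import Path


def load_documents(folder: str, extensions: tuple[str, ...] = (".md", ".txt")) -> list[dict]:
    """Load every matching file under `folder` with its raw content and metadata."""
    documents = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            documents.append(
                {
                    "content": path.read_text(encoding="utf-8"),
                    "metadata": {
                        "source": str(path),
                        "file_type": path.suffix.lower().lstrip("."),
                        "date": modified.isoformat(),
                    },
                }
            )
    return documents
```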

Step 2: Chunking (splitting)

LLMs have a context limit. The chunking step involves splitting large documents into digestible pieces of regular size (generally 300 to 800 tokens). An overlap of 50 to 100 tokens between consecutive chunks avoids splitting a piece of information across two chunks. Separators are chosen intelligently: cuts between paragraphs are preferred, then between sentences, to preserve the semantic coherence of each piece.

| Parameter | Recommended value | Impact |
|---|---|---|
| chunk_size | 300-800 tokens | Too small = loss of context, too large = noise |
| overlap | 50-100 tokens | Ensures continuity between chunks |
| Separator | Paragraphs > sentences > words | Respects the text structure |
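Here is a simplified sliding-window chunker to illustrate `chunk_size` and `overlap`. Two assumptions to keep in mind: it counts whitespace-separated words as a stand-in for tokens, and it does not try to cut on paragraph or sentence boundaries as recommended above.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `chunk_size` words sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `overlap=1`, the last word of each chunk is repeated as the first word of the next one, which is exactly the continuity the overlap is meant to provide.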

Step 3: Embedding

The batch embedding step involves sending the chunks to the embedding API in groups (generally of 100) to respect request limits and optimize costs. Each batch is transformed into vectors in a single HTTP request. The returned embeddings are collected and associated with their original chunk. This batch approach drastically reduces processing time compared to individual calls.
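The batching logic itself is independent of any provider. In this sketch, `embed_batch` is a hypothetical callable that you supply: it takes a list of texts and returns one vector per text, typically by making a single API request.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(
    chunks: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[tuple[str, list[float]]]:
    """Embed chunks batch by batch and pair each chunk with its vector."""
    embeddings: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        vectors = embed_batch(batch)  # one request per batch
        if len(vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input text")
        embeddings.extend(vectors)
    return list(zip(chunks, embeddings))
```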

Step 4: Storage

The storage step involves inserting the chunks and their embeddings into the chosen vector database. Each chunk is saved with its vector, a unique identifier (for example chunk_0, chunk_1), and useful metadata (source file, chunk index, date). The vector database immediately indexes these vectors to enable fast searches.
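Whatever database you pick, each stored record has roughly the same shape. The sketch below builds those records in a plain dictionary rather than a real vector database; `StoredChunk` and `build_records` are illustrative names, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class StoredChunk:
    chunk_id: str
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def build_records(
    chunks: list[str], vectors: list[list[float]], source: str
) -> dict[str, StoredChunk]:
    """Pair each chunk with its vector, an id like chunk_0, and metadata."""
    records = {}
    for index, (text, vector) in enumerate(zip(chunks, vectors)):
        chunk_id = f"chunk_{index}"
        records[chunk_id] = StoredChunk(
            chunk_id=chunk_id,
            text=text,
            vector=vector,
            metadata={"source": source, "chunk_index": index},
        )
    return records
```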

Step 5: Retrieval

The retrieval step is triggered when a user asks a question. We start by vectorizing the question with the same embedding model used during ingestion. This vector is sent to the vector database, which compares it to the stored vectors and returns the k most semantically similar chunks (generally the top 5). These chunks constitute the relevant context for answering.
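Stripped of the database, retrieval is a top-k ranking by similarity. This sketch assumes the question has already been vectorized and that the stored chunks fit in a Python list. A real vector database uses an index so it does not have to score every vector.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def retrieve(
    question_vector: list[float],
    stored: list[tuple[str, list[float]]],
    k: int = 5,
) -> list[tuple[float, str]]:
    """Return the k (score, chunk_text) pairs most similar to the question."""
    scored = (
        (cosine_similarity(question_vector, vector), text)
        for text, vector in stored
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```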

Step 6: Generation

The generation step involves building a complete prompt containing the user's question AND the relevant chunks retrieved in the previous step. This enriched prompt is sent to the LLM via a chat API (like OpenRouter or Anthropic's direct API), explicitly asking it to base its response solely on the provided context. The LLM then formulates a sourced response, and grounding it in the provided context greatly reduces the risk of hallucination.
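The enriched prompt can be assembled with plain string formatting. The wording of the instructions below is only one possible phrasing, and `build_prompt` is a name chosen for this example. The resulting string is what you would send as the user message to your chat API.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Combine the question with retrieved (source, text) excerpts."""
    context = "\n\n".join(
        f"[{number}] (source: {source})\n{text}"
        for number, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so. "
        "Cite the bracketed numbers of the excerpts you used.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the excerpts gives the model something to cite, which is what makes the final answer verifiable.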


🪶 Simple alternatives to RAG

Full RAG (embeddings + vector DB + pipeline) is not always necessary. Here are simpler alternatives that work surprisingly well.

1. Markdown files in the context

The simplest method: read the content of your Markdown files (memory, project notes, documentation) and concatenate them directly into the prompt sent to the LLM. Each file is preceded by a title to structure the context. The LLM thus receives all the information at once, without any intermediate infrastructure.

Advantages: zero infrastructure, works immediately
Limitations: limited by the LLM's context window (~100-200k tokens)
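A minimal sketch of this concatenation, assuming UTF-8 Markdown files on disk. The function name `build_context` is illustrative.

```python
from pathlib import Path


def build_context(paths: list[str]) -> str:
    """Concatenate files into one context string, each under its own title."""
    sections = []
    for path in map(Path, paths):
        content = path.read_text(encoding="utf-8").strip()
        sections.append(f"# {path.name}\n\n{content}")
    return "\n\n".join(sections)
```

The returned string is simply placed at the top of the prompt, before the user's question.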

2. MEMORY.md: file-based memory

This is the approach used by OpenClaw: a Markdown file that serves as persistent memory.

This MEMORY.md file contains, for example, user preferences (concise answers, Python version, hosting), current projects with their tech stack, and important dated decisions. The AI reads this file at each conversation and updates it when there is new information to remember.

3. Keyword search

Rather than embeddings, you can use a simple text search. Keyword search works by extracting the terms from the user's question, then scanning each document to count how many of these keywords appear in it. The documents are then ranked in descending order of match count, and the top k are returned. This approach is entirely local, requires no external API, and executes almost instantaneously.

Advantages: no embedding API needed, free, fast
Limitations: no semantic understanding ("car" will not match "automobile")
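The keyword approach fits in a few lines. This sketch treats every word of the question as a keyword (no stop-word removal or stemming) and takes the documents as a simple `{id: text}` dictionary, both simplifications made for the example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its distinct words."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_search(question: str, documents: dict[str, str], k: int = 5) -> list[str]:
    """Rank documents by how many question keywords they contain."""
    keywords = tokenize(question)
    scored = []
    for doc_id, text in documents.items():
        matches = len(keywords & tokenize(text))
        if matches:
            scored.append((matches, doc_id))
    # Most matches first; ties broken alphabetically by id
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


docs = {
    "a": "My car is red",
    "b": "The automobile is red and fast",
    "c": "Nothing relevant here",
}
print(keyword_search("red car", docs))
print(keyword_search("car", docs))  # "b" is missed: no semantic matching
```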

Comparison table

| Method | Complexity | Cost | Accuracy | Scale |
|---|---|---|---|---|
| Files in the context | None | Free | Good (if everything fits) | <100 pages |
| MEMORY.md | None | Free | Good | <50 pages |
| Keyword search | Low | Free | Average | <10k docs |
| Full RAG (embeddings) | Medium | ~$0.01/1k chunks | Excellent | Unlimited |
| RAG + reranking | High | ~$0.02/1k chunks | Optimal | Unlimited |

⚖️ When RAG is overkill vs essential

RAG is OVERKILL when...

| Situation | Recommended alternative |
|---|---|
| Less than 50 pages of context | Files in the prompt |
| Data that rarely changes | MEMORY.md |
| Simple and predictable questions | Keyword search |
| Prototype / MVP | Context injection |
| Zero budget | Markdown files |

RAG is ESSENTIAL when...

| Situation | Why |
|---|---|
| Thousands of documents | No longer fits in context |
| Evolving knowledge base | Incremental updates |
| Unpredictable questions | Need for semantic search |
| Critical accuracy | Verifiable sources |
| Multi-user | Everyone has their own documents |
| Dense technical data | AI needs to find the needle in the haystack |

The simple test

Ask yourself this question:

"Does ALL of my data fit in the LLM's context?"

  • Yes (< 100k tokens, or ~75k words) --> No need for RAG, inject directly
  • No --> RAG necessary
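You can automate this test with a rough estimate. The sketch below converts words to tokens using the ratio implied above (100k tokens for about 75k words). That ratio is an approximation which varies by language and tokenizer, so treat the result as an order of magnitude.

```python
CONTEXT_BUDGET_TOKENS = 100_000
TOKENS_PER_WORD = 100_000 / 75_000  # ratio taken from the rule of thumb above


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on the word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def needs_rag(texts: list[str], budget: int = CONTEXT_BUDGET_TOKENS) -> bool:
    """True when the combined documents no longer fit in the token budget."""
    total = sum(estimate_tokens(text) for text in texts)
    return total >= budget
```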

🏗️ RAG in OpenClaw

OpenClaw natively integrates memory mechanisms that use RAG principles without the complexity:

  1. MEMORY.md: long-term memory, automatically injected
  2. memory_search: search through daily notes
  3. Context files: AGENTS.md, TOOLS.md, etc.

For most personal use cases, this files + search approach is more than enough. Full RAG with a vector database becomes useful when you scale up.


🎯 Summary: where to start?

| Your situation | Our recommendation |
|---|---|
| Beginner, small project | MEMORY.md + context files |
| Medium project, <1000 docs | ChromaDB + embeddings |
| Production, high volume | Qdrant + full pipeline |
| OpenClaw user | Use native memory first |

Concrete steps

  1. Start simple: MEMORY.md and context files
  2. When that's no longer enough: add ChromaDB for semantic search
  3. When you scale up: migrate to Qdrant with an ingestion pipeline
  4. Optimize: add reranking, intelligent chunking, metadata

RAG isn't magic -- it's just an elegant way to give your AI access to YOUR information. Start simple, only add complexity when necessary.


The essentials

  • RAG allows an LLM to answer based on your own documents, not just on its training knowledge.
  • Embeddings transform text into numerical vectors capturing semantic meaning, which makes it possible to compare texts by their mathematical proximity.
  • Vector databases (FAISS, ChromaDB, Qdrant) store and index these vectors for ultra-fast searches.
  • The RAG pipeline follows 6 steps: ingestion, chunking, embedding, storage, retrieval, and generation.
  • RAG is not always necessary: for fewer than 50 pages, direct injection into the context is often enough.

| Tool | Usage | Link |
|---|---|---|
| text-embedding-3-small | OpenAI embedding model, good price/performance ratio | openai.com |
| ChromaDB | Embedded vector database, ideal for prototyping | chromadb.ai |
| Qdrant | Production vector database, high performance | qdrant.tech |
| OpenRouter | Proxy to access Claude, GPT and other LLMs | openrouter.ai |
| LangChain / LlamaIndex | Python frameworks for orchestrating a RAG pipeline | python.langchain.com / llamaindex.ai |

Common mistakes

  • Chunk too small: a chunk of 100 tokens loses the context around the information. Aim for 300-800 tokens.
  • No overlap: without overlap, you risk splitting an idea across two adjacent chunks, so that neither chunk contains it in full.
  • Wrong embedding model: using a different embedding model between ingestion and retrieval completely skews the similarity. Stay consistent.
  • Too many results returned: retrieving 20 chunks drowns the LLM in noise. 3 to 5 relevant chunks are better than 15 moderately relevant ones.
  • Ignoring metadata: storing only text without metadata (source, date, category) prevents filtering results during retrieval.

FAQ

Does RAG work with PDFs?
Yes, but the ingestion step is more complex. You need to extract the text from the PDF (with tools like pdfplumber or PyMuPDF), then clean up artifacts (headers, footers, page numbers) before chunking.

What is the difference between RAG and fine-tuning?
Fine-tuning modifies the model's weights so it learns a specific style or behavior. RAG does not modify the model: it dynamically provides information to it at the time of the query. To inject factual knowledge, RAG is almost always the right choice.

How much does a RAG pipeline cost?
Most of the cost comes from embeddings (~$0.02 per million tokens with text-embedding-3-small in 2025) and LLM calls for generation. For 10,000 documents of 500 tokens each (5 million tokens in total), the initial embedding cost is about $0.10.
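The embedding estimate is simple arithmetic that you can adapt to your own corpus. The default price below is the 2025 figure quoted in this answer and may have changed since, so check current pricing before relying on it.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, figure quoted above; verify before use


def embedding_cost(
    document_count: int,
    tokens_per_document: int,
    price_per_million: float = PRICE_PER_MILLION_TOKENS,
) -> float:
    """One-off cost in USD of embedding the whole corpus once."""
    total_tokens = document_count * tokens_per_document
    return total_tokens / 1_000_000 * price_per_million


print(f"${embedding_cost(10_000, 500):.2f}")
```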

Can RAG and keyword search be combined?
Yes, this is called hybrid search. A vector search AND a keyword search are performed, then the results are fused and reranked. Qdrant, for example, supports hybrid queries that combine dense and sparse vectors, and its payload filters can further narrow the results by metadata.
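One common way to fuse the two result lists is reciprocal rank fusion (RRF). That is a choice made for this example, not the only option: each document scores 1 / (k + rank) in every list where it appears, and the scores are summed. The constant k = 60 is a conventional default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically by id
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_results = ["a", "b", "c"]   # ranked by semantic similarity
keyword_results = ["b", "d", "a"]  # ranked by keyword matches
print(reciprocal_rank_fusion([vector_results, keyword_results]))
```

Documents that rank well in both lists rise to the top, while a document found by only one method still gets a chance to appear.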

What tools should be used to orchestrate a RAG pipeline with function calls?
Modern RAG often goes beyond simple retrieval: we want the AI to be able to decide to search documents, but also to call external tools. To understand how to connect an LLM to tools and databases, our guide MCP, Function Calling, Tool Use : le guide complet details the underlying mechanisms.


Conclusion

RAG is a fundamental building block of AI in 2025, but it's not a miracle solution to apply everywhere. The most common mistake is over-engineering: setting up a Qdrant stack + ingestion pipeline when three Markdown files in the context would do the job.

The right approach is progressive scaling: start with direct context injection and a MEMORY.md. When your data exceeds the context window, move to ChromaDB. When you have thousands of documents and production needs, migrate to Qdrant. At each step, you only add complexity because the previous solution has reached its limits.

If you want to dive deeper into smart agent architecture, check out our article on Mémoire IA : comment faire en sorte que votre agent se souvienne de tout. And if your project involves multiple agents that need to collaborate and share a common memory, ruflo : la plateforme d'orchestration multi-agent qui explose sur GitHub will give you the keys to architect all of this.