You have a brilliant AI assistant, but it forgets everything after each conversation and knows nothing about your documents. Frustrating, right? This is a fundamental limitation of Large Language Models (LLMs): they have no persistent, private knowledge. Retrieval-Augmented Generation (RAG) is the most widely used way to give your AI access to your own information.
In this guide, we'll demystify RAG in simple terms, understand how it works under the hood, and most importantly, know when it's useful and when it's overkill.
🧠 What is RAG? (In Simple Terms)
The Problem
Imagine a brilliant expert who has read millions of books... but can't consult YOUR documents. You ask, "What is my company's Q3 revenue?" and they reply, "I don't have that information."
This is exactly what happens with classic LLMs (ChatGPT, Claude, Gemini):
- They know general knowledge (what they learned during training)
- They don't know your data (documents, notes, database)
- Their knowledge is frozen in time (cutoff date)
The RAG Solution
RAG = Retrieval-Augmented Generation.
Put simply: before answering, the AI searches for relevant information in YOUR documents, then uses it to formulate its response.
Without RAG:
Question --> LLM --> Response (based on general knowledge)
With RAG:
Question --> Search in your docs --> Relevant documents
                                            |
                                            v
        Question + Relevant documents --> LLM --> Precise and sourced response
Simple Analogy
Think of a student taking an exam:
- Without RAG: they answer from memory (sometimes they get it wrong or make it up)
- With RAG: they're allowed to consult their revision notes before answering
RAG doesn't make the AI more intelligent -- it gives it access to your information.
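Concretely, the "with RAG" flow boils down to pasting the retrieved documents into the prompt before calling the LLM. A minimal sketch of that assembly step (the template wording and the `build_rag_prompt` helper are illustrative, not a fixed standard):

```python
def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Assembles the question and the retrieved documents into one prompt."""
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "What is my company's Q3 revenue?",
    ["Q3 revenue: 1.2M EUR, up 8% year over year."],
)
print(prompt)
```

The resulting string is what actually gets sent to the LLM in place of the bare question.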
🔢 Embeddings: Transforming Text into Vectors
The Key Concept
For the AI to "search" in your documents, you first need to transform the text into something a computer can compare efficiently: vectors (lists of numbers).
# Conceptually, an embedding transforms text into numbers
"The cat sleeps on the couch" --> [0.23, -0.45, 0.78, 0.12, ...]
"The feline rests in the living room" --> [0.21, -0.43, 0.76, 0.14, ...]
"Python is a language" --> [-0.67, 0.34, -0.12, 0.89, ...]
The first two sentences have close vectors (they're about the same topic). The third is distant (different topic).
How it Works
An embedding model (like text-embedding-3-small from OpenAI) is trained on billions of texts to understand semantics. It doesn't compare words one by one -- it understands the meaning.
import httpx
import os
async def get_embedding(text: str) -> list[float]:
    """Transforms text into a vector via OpenAI."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/embeddings",
            headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
            json={
                "model": "text-embedding-3-small",
                "input": text,
            },
        )
        response.raise_for_status()
        return response.json()["data"][0]["embedding"]
Similarity Comparison
Once you have vectors, comparing two texts becomes calculating the distance between their vectors:
import numpy as np
def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """Calculates cosine similarity between two vectors."""
    a = np.array(vec_a)
    b = np.array(vec_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Example
sim = cosine_similarity(
embedding_cat, # "The cat sleeps on the couch"
embedding_feline # "The feline rests in the living room"
)
# sim ≈ 0.95 (very similar!)
sim = cosine_similarity(
embedding_cat, # "The cat sleeps on the couch"
embedding_python # "Python is a language"
)
# sim ≈ 0.12 (not similar at all)
| Similarity Score | Meaning |
|---|---|
| 0.90 - 1.00 | Almost identical |
| 0.75 - 0.90 | Very similar |
| 0.50 - 0.75 | Related to the topic |
| 0.25 - 0.50 | Somewhat related |
| 0.00 - 0.25 | Not related |
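In practice these scores are typically used as a cutoff: only results above a threshold are fed to the LLM, so irrelevant chunks don't pollute the context. A minimal sketch (the 0.5 default is an assumption to tune per use case, and the scores here are invented for illustration):

```python
def filter_by_score(
    results: list[tuple[float, str]], threshold: float = 0.5
) -> list[str]:
    """Keeps only documents whose similarity score passes the threshold."""
    return [doc for score, doc in results if score >= threshold]

# Invented (score, document) pairs mimicking a similarity search result
results = [
    (0.93, "The feline rests in the living room"),
    (0.41, "Couches come in many colors"),
    (0.12, "Python is a language"),
]
print(filter_by_score(results))
# --> only the first document passes the default 0.5 threshold
```

Too high a threshold returns nothing on vague questions; too low drowns the LLM in noise.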
🗄️ Vector Databases: Where to Store Embeddings
Once you've transformed your texts into vectors, you need to store them somewhere and be able to search them efficiently. This is the role of vector databases.
Comparison of Solutions
| Solution | Type | Complexity | Performance | Self-hosted | Ideal for |
|---|---|---|---|---|---|
| FAISS | Python library | Easy | Excellent | Yes | Prototyping, small datasets |
| ChromaDB | Embedded database | Easy | Good | Yes | Side projects, <1M docs |
| Qdrant | Dedicated server | Medium | Excellent | Yes | Production, large volumes |
| pgvector | Postgres extension | Medium | Good | Yes | If you already have Postgres |
| Pinecone | Managed cloud | Easy | Excellent | No | If you don't want to manage infrastructure |
| Weaviate | Dedicated server | Complex | Excellent | Yes | Advanced cases, multimodal |
FAISS: The Simplest to Start
import faiss
import numpy as np
# Create a FAISS index
dimension = 1536 # Size of OpenAI embeddings
index = faiss.IndexFlatL2(dimension)
# Add vectors
vectors = np.array(embeddings_list).astype('float32')
index.add(vectors)
# Search for the 5 nearest neighbors
query_vector = np.array([query_embedding]).astype('float32')
distances, indices = index.search(query_vector, k=5)
print(f"Most relevant documents: {indices[0]}")
print(f"Distances: {distances[0]}")
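Note that `IndexFlatL2` ranks by Euclidean distance, not by the cosine similarity discussed above. The two rankings agree once vectors are L2-normalized; here is a numpy-only sketch of that equivalence (with FAISS you would normalize the same way and use `IndexFlatIP` instead):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scales a vector to unit length."""
    return v / np.linalg.norm(v)

# Toy 3-dimensional vectors standing in for real embeddings
a = normalize(np.array([0.23, -0.45, 0.78]))
b = normalize(np.array([0.21, -0.43, 0.76]))

cosine = float(np.dot(a, b))              # inner product of unit vectors
l2_squared = float(np.sum((a - b) ** 2))  # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b),
# so sorting by L2 distance equals sorting by cosine similarity
assert abs(l2_squared - (2 - 2 * cosine)) < 1e-9
print(f"cosine={cosine:.4f}, squared L2={l2_squared:.6f}")
```

So for already-normalized embeddings (which many models produce), `IndexFlatL2` and cosine similarity give the same ranking.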
ChromaDB: The User-Friendly Embedded Database
import chromadb
# Create/open a collection
client = chromadb.PersistentClient(path="/home/deploy/data/chroma")
collection = client.get_or_create_collection(
    name="my_documents",
    metadata={"hnsw:space": "cosine"},
)
# Add documents (ChromaDB generates embeddings automatically)
collection.add(
    documents=[
        "Complete guide to self-hosting",
        "How to configure a reverse proxy",
        "Basics of server security",
    ],
    ids=["doc1", "doc2", "doc3"],
    metadatas=[
        {"source": "blog", "date": "2025-01"},
        {"source": "blog", "date": "2025-02"},
        {"source": "wiki", "date": "2025-01"},
    ],
)
# Search
results = collection.query(
    query_texts=["how to secure my server"],
    n_results=3,
)
print(results["documents"])
# --> [["Basics of server security", ...]]
Qdrant: For Production
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
# Connection
client = QdrantClient(host="localhost", port=6333)
# Create a collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
    ),
)
# Add documents
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding_1,
            payload={"text": "Self-hosting guide", "source": "blog"},
        ),
        PointStruct(
            id=2,
            vector=embedding_2,
            payload={"text": "Reverse proxy config", "source": "blog"},
        ),
    ],
)
# Search
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
)
for result in results:
    print(f"Score: {result.score:.3f} | {result.payload['text']}")
🔄 Complete Pipeline: From Ingestion to Response
Here's the complete RAG pipeline, step by step:
1. INGESTION          2. CHUNKING           3. EMBEDDING
┌───────────┐         ┌───────────┐         ┌───────────┐
│ Documents │────────>│ Split into│────────>│ Vectorize │
│ (PDF, MD, │         │ chunks of │         │ each      │
│ TXT...)   │         │ ~500      │         │ chunk     │
└───────────┘         │ tokens    │         └─────┬─────┘
                      └───────────┘               │
                                                  v
4. STORAGE            5. RETRIEVAL          6. GENERATION
┌───────────┐         ┌───────────┐         ┌───────────┐
│ Vector DB │<────────│ Search for│────────>│ LLM +     │
│ (Qdrant,  │         │ most      │         │ context   │
│ Chroma)   │────────>│ relevant  │         │ = Response│
└───────────┘         └───────────┘         └───────────┘
Step 1: Ingestion
Retrieve your documents from all sources:
from pathlib import Path
def load_documents(directory: str) -> list[dict]:
    """Loads all documents from a directory."""
    docs = []
    for path in Path(directory).rglob("*"):
        if path.suffix in (".md", ".txt", ".py", ".json"):
            content = path.read_text(encoding="utf-8")
            docs.append({
                "content": content,
                "source": str(path),
                "type": path.suffix,
            })
    return docs
documents = load_documents("/home/deploy/knowledge")
print(f"{len(documents)} documents loaded")
Step 2: Chunking
LLMs have context limits. Split large documents into digestible chunks:
def chunk_text(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50,
) -> list[str]:
    """Splits text into chunks with overlap.

    Sizes are counted in words, a rough approximation of tokens.
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks
# Example
text = "A very long document..."
chunks = chunk_text(text, chunk_size=500, overlap=50)
print(f"{len(chunks)} chunks created")
| Parameter | Recommended value | Impact |
|---|---|---|
| chunk_size | 300-800 tokens | Too small = context loss, too large = noise |
| overlap | 50-100 tokens | Ensures continuity between chunks |
| Separator | Paragraphs > sentences > words | Respects text structure |
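As the table suggests, splitting on paragraphs before falling back to words respects the text's structure better than a blind word split. One possible sketch, using greedy paragraph packing (the `chunk_by_paragraphs` helper is an assumption, sizes are counted in words as an approximation of tokens, and an oversized single paragraph is kept whole):

```python
def chunk_by_paragraphs(text: str, max_words: int = 500) -> list[str]:
    """Packs whole paragraphs into chunks, starting a new chunk on overflow."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in text.split("\n\n"):
        words = len(para.split())
        # Flush the current chunk if adding this paragraph would overflow it
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Two 60-word paragraphs with a 50-word budget end up in separate chunks
doc = "First paragraph. " * 30 + "\n\n" + "Second paragraph. " * 30
print(len(chunk_by_paragraphs(doc, max_words=50)))
```

A production version would add overlap and a sentence-level fallback for oversized paragraphs, but the structure-first idea is the same.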
Step 3: Embedding
Transform each chunk into a vector:
async def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Generates embeddings for a list of chunks."""
    embeddings = []
    # Process in batches of 100 to stay well under the API's input limit
    for i in range(0, len(chunks), 100):
        batch = chunks[i:i + 100]
        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.openai.com/v1/embeddings",
                headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
                json={
                    "model": "text-embedding-3-small",
                    "input": batch,
                },
                timeout=30.0,
            )
            response.raise_for_status()
            batch_embeddings = [
                item["embedding"]
                for item in response.json()["data"]
            ]
            embeddings.extend(batch_embeddings)
    return embeddings
Step 4: Storage
Insert into the vector database (here ChromaDB):
# chunk_metadata is assumed to hold one entry per chunk (with its source
# document), built alongside the chunks during the chunking step
collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    metadatas=[
        {"source": doc["source"], "chunk_index": i}
        for i, doc in enumerate(chunk_metadata)
    ],
)
Step 5: Retrieval
When the user asks a question, search for relevant chunks:
async def retrieve(question: str, n_results: int = 5) -> list[str]:
    """Retrieves the most relevant chunks for a question."""
    results = collection.query(
        query_texts=[question],
        n_results=n_results,
    )
    return results["documents"][0]