
RAG for Beginners: Giving Your AI a Memory

AI Productivity 🟡 Intermediate ⏱️ 16 min read 📅 2026-02-24

You have a brilliant AI assistant, but it forgets everything after each conversation. Frustrating, right? This is the fundamental problem with Large Language Models (LLMs): they lack persistent memory. Retrieval-Augmented Generation (RAG) is THE solution to give memory to your AI.

In this guide, we'll demystify RAG in simple terms, look under the hood at how it works, and, most importantly, see when it's useful and when it's overkill.

🧠 What is RAG? (In Simple Terms)

The Problem

Imagine a brilliant expert who has read millions of books... but can't consult YOUR documents. You ask, "What is my company's Q3 revenue?" and they reply, "I don't have that information."

This is exactly what happens with classic LLMs (ChatGPT, Claude, Gemini):
- They know general knowledge (what they learned during training)
- They don't know your data (documents, notes, database)
- Their knowledge is frozen in time (cutoff date)

The RAG Solution

RAG = Retrieval-Augmented Generation.

In plain terms: before answering, the AI searches for relevant information in YOUR documents, then uses it to formulate its response.

Without RAG:
   Question --> LLM --> Response (based on general knowledge)

With RAG:
   Question --> Search in your docs --> Relevant documents
                                                    |
                                                    v
   Question + Relevant documents --> LLM --> Precise and sourced response
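
The "With RAG" flow above boils down to prepending retrieved documents to the prompt before calling the LLM. A minimal sketch, where `search_docs` and `ask_llm` are hypothetical placeholders for your document search and your LLM API call:

```python
def build_prompt(question: str, documents: list[str]) -> str:
    """Assembles the augmented prompt: retrieved documents + question."""
    context = "\n\n".join(documents)
    return (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def answer_with_rag(question: str) -> str:
    """Question --> search --> augmented prompt --> LLM (placeholders)."""
    documents = search_docs(question)  # hypothetical retrieval step
    return ask_llm(build_prompt(question, documents))  # hypothetical LLM call
```

Whatever retrieval and LLM you plug in, the shape stays the same: the model only ever sees the question plus the documents you hand it.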

Simple Analogy

Think of a student taking an exam:
- Without RAG: they answer from memory (sometimes they get it wrong or make it up)
- With RAG: they're allowed to consult their revision notes before answering

RAG doesn't make the AI more intelligent -- it gives it access to your information.

🔢 Embeddings: Transforming Text into Vectors

The Key Concept

For the AI to "search" in your documents, you first need to transform the text into something a computer can compare efficiently: vectors (lists of numbers).

# Conceptually, an embedding transforms text into numbers
"The cat sleeps on the couch"  -->  [0.23, -0.45, 0.78, 0.12, ...]
"The feline rests in the living room" -->  [0.21, -0.43, 0.76, 0.14, ...]
"Python is a language"       -->  [-0.67, 0.34, -0.12, 0.89, ...]

The first two sentences have close vectors (they're about the same topic). The third is distant (different topic).

How it Works

An embedding model (like text-embedding-3-small from OpenAI) is trained on billions of texts to understand semantics. It doesn't compare words one by one -- it understands the meaning.

import httpx
import os

async def get_embedding(text: str) -> list[float]:
    """Transforms text into a vector via OpenAI."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/embeddings",
            headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
            json={
                "model": "text-embedding-3-small",
                "input": text,
            },
        )
        response.raise_for_status()
        return response.json()["data"][0]["embedding"]

Similarity Comparison

Once you have vectors, comparing two texts becomes calculating the distance between their vectors:

import numpy as np

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """Calculates cosine similarity between two vectors."""
    a = np.array(vec_a)
    b = np.array(vec_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example
sim = cosine_similarity(
    embedding_cat,     # "The cat sleeps on the couch"
    embedding_feline     # "The feline rests in the living room"
)
# sim ≈ 0.95 (very similar!)

sim = cosine_similarity(
    embedding_cat,     # "The cat sleeps on the couch"
    embedding_python    # "Python is a language"
)
# sim ≈ 0.12 (not similar at all)

| Similarity score | Meaning |
|------------------|---------|
| 0.90 - 1.00 | Almost identical |
| 0.75 - 0.90 | Very similar |
| 0.50 - 0.75 | Related to the topic |
| 0.25 - 0.50 | Somewhat related |
| 0.00 - 0.25 | Not related |
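
To make the buckets concrete, here is a tiny helper. The cut-offs and labels are just this table's rough rules of thumb, not any standard:

```python
def similarity_label(score: float) -> str:
    """Maps a cosine similarity score to the rough buckets above."""
    if score >= 0.90:
        return "Almost identical"
    if score >= 0.75:
        return "Very similar"
    if score >= 0.50:
        return "Related to the topic"
    if score >= 0.25:
        return "Somewhat related"
    return "Not related"

print(similarity_label(0.95))  # "Almost identical" (cat vs. feline)
print(similarity_label(0.12))  # "Not related" (cat vs. Python)
```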

🗄️ Vector Databases: Where to Store Embeddings

Once you've transformed your texts into vectors, you need to store them somewhere and be able to search them efficiently. This is the role of vector databases.

Comparison of Solutions

| Solution | Type | Complexity | Performance | Self-hosted | Ideal for |
|----------|------|------------|-------------|-------------|-----------|
| FAISS | Python library | Easy | Excellent | Yes | Prototyping, small datasets |
| ChromaDB | Embedded database | Easy | Good | Yes | Side projects, <1M docs |
| Qdrant | Dedicated server | Medium | Excellent | Yes | Production, large volumes |
| pgvector | Postgres extension | Medium | Good | Yes | If you already have Postgres |
| Pinecone | Managed cloud | Easy | Excellent | No | If you don't want to manage infrastructure |
| Weaviate | Dedicated server | Complex | Excellent | Yes | Advanced cases, multimodal |

FAISS: The Simplest to Start

import faiss
import numpy as np

# Create a FAISS index
dimension = 1536  # Size of OpenAI embeddings
index = faiss.IndexFlatL2(dimension)

# Add vectors
vectors = np.array(embeddings_list).astype('float32')
index.add(vectors)

# Search for the 5 nearest neighbors
query_vector = np.array([query_embedding]).astype('float32')
distances, indices = index.search(query_vector, k=5)

print(f"Most relevant documents: {indices[0]}")
print(f"Distances: {distances[0]}")
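
Note that IndexFlatL2 ranks by Euclidean distance, while the rest of this guide uses cosine similarity. The two rankings agree once vectors are normalized to unit length, because the inner product of unit vectors IS their cosine similarity; this is why the usual FAISS pattern for cosine search is `faiss.normalize_L2` plus an `IndexFlatIP` index. A quick numpy check of that equivalence (no FAISS needed):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Cosine similarity, as defined in the previous section
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product of the unit-normalized vectors
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = float(np.dot(a_unit, b_unit))

assert abs(cosine - inner) < 1e-9  # identical up to rounding
```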

ChromaDB: The User-Friendly Embedded Database

import chromadb

# Create/open a collection
client = chromadb.PersistentClient(path="/home/deploy/data/chroma")
collection = client.get_or_create_collection(
    name="my_documents",
    metadata={"hnsw:space": "cosine"}
)

# Add documents (ChromaDB generates embeddings automatically)
collection.add(
    documents=[
        "Complete guide to self-hosting",
        "How to configure a reverse proxy",
        "Basics of server security",
    ],
    ids=["doc1", "doc2", "doc3"],
    metadatas=[
        {"source": "blog", "date": "2025-01"},
        {"source": "blog", "date": "2025-02"},
        {"source": "wiki", "date": "2025-01"},
    ],
)

# Search
results = collection.query(
    query_texts=["how to secure my server"],
    n_results=3,
)
print(results["documents"])
# --> [["Basics of server security", ...]]

Qdrant: For Production

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

# Connection
client = QdrantClient(host="localhost", port=6333)

# Create a collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
    ),
)

# Add documents
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding_1,
            payload={"text": "Self-hosting guide", "source": "blog"},
        ),
        PointStruct(
            id=2,
            vector=embedding_2,
            payload={"text": "Reverse proxy config", "source": "blog"},
        ),
    ],
)

# Search
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
)
for result in results:
    print(f"Score: {result.score:.3f} | {result.payload['text']}")

🔄 Complete Pipeline: From Ingestion to Response

Here's the complete RAG pipeline, step by step:

1. INGESTION         2. CHUNKING          3. EMBEDDING
┌───────────┐       ┌───────────┐       ┌───────────┐
│ Documents │──────>│ Split into│──────>│ Vectorize │
│ (PDF, MD, │       │ chunks of │       │ each      │
│  TXT...)  │       │ ~500      │       │ chunk     │
└───────────┘       │ tokens    │       └─────┬─────┘
                    └───────────┘             │
                                              v
4. STORAGE           5. RETRIEVAL         6. GENERATION
┌───────────┐       ┌───────────┐       ┌───────────┐
│ Vector DB │<──────│ Search for│──────>│ LLM +     │
│ (Qdrant,  │       │ the most  │       │ context   │
│  Chroma)  │──────>│ relevant  │       │ = Response│
└───────────┘       └───────────┘       └───────────┘

Step 1: Ingestion

Retrieve your documents from all sources:

from pathlib import Path

def load_documents(directory: str) -> list[dict]:
    """Loads all documents from a directory."""
    docs = []
    for path in Path(directory).rglob("*"):
        if path.suffix in (".md", ".txt", ".py", ".json"):
            content = path.read_text(encoding="utf-8")
            docs.append({
                "content": content,
                "source": str(path),
                "type": path.suffix,
            })
    return docs

documents = load_documents("/home/deploy/knowledge")
print(f"{len(documents)} documents loaded")

Step 2: Chunking

LLMs have context limits. Split large documents into digestible chunks:

def chunk_text(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50
) -> list[str]:
    """Splits text into chunks with overlap."""
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)

    return chunks

# Example
text = "A very long document..."
chunks = chunk_text(text, chunk_size=500, overlap=50)
print(f"{len(chunks)} chunks created")

| Parameter | Recommended value | Impact |
|-----------|-------------------|--------|
| chunk_size | 300-800 tokens | Too small = context loss, too large = noise |
| overlap | 50-100 tokens | Ensures continuity between chunks |
| Separator | Paragraphs > sentences > words | Respects text structure |
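
Following the "paragraphs > sentences > words" hierarchy, a common refinement is to split on paragraph boundaries first and pack whole paragraphs into each chunk. A minimal character-based sketch (sizing by characters rather than tokens, for simplicity):

```python
def chunk_by_paragraphs(text: str, max_chars: int = 2000) -> list[str]:
    """Packs whole paragraphs into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue  # skip empty paragraphs
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # current chunk is full, start a new one
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than max_chars still comes out whole here; for those, fall back to word-level splitting like the chunk_text function above.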

Step 3: Embedding

Transform each chunk into a vector:

async def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Generates embeddings for a list of chunks."""
    embeddings = []
    # Process in batches of 100 to stay well under the API's per-request input limit
    for i in range(0, len(chunks), 100):
        batch = chunks[i:i+100]
        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.openai.com/v1/embeddings",
                headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
                json={
                    "model": "text-embedding-3-small",
                    "input": batch,
                },
                timeout=30.0,
            )
            response.raise_for_status()
            batch_embeddings = [
                item["embedding"]
                for item in response.json()["data"]
            ]
            embeddings.extend(batch_embeddings)
    return embeddings

Step 4: Storage

Insert into the vector database (here ChromaDB):

collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    metadatas=[
        {"source": doc["source"], "chunk_index": i}
        for i, doc in enumerate(chunk_metadata)
    ],
)

Step 5: Retrieval

When the user asks a question, search for relevant chunks:

def retrieve(question: str, n_results: int = 5) -> list[str]:
    """Retrieves the most relevant chunks for a question (ChromaDB queries are synchronous)."""
    results = collection.query(
        query_texts=[question],
        n_results=n_results,
    )
    return results["documents"][0]