You have a brilliant AI assistant, but it forgets everything after each conversation and knows nothing about your documents. Frustrating, right? This is a fundamental limitation of Large Language Models (LLMs): they have no persistent, private knowledge. Retrieval-Augmented Generation (RAG) is the most widely used way to give your AI access to your own information.
In this guide, we'll demystify RAG in simple terms, understand how it works under the hood, and most importantly, know when it's useful and when it's overkill.
🧠 What is RAG? (In Simple Terms)
The Problem
Imagine a brilliant expert who has read millions of books... but can't consult YOUR documents. You ask, "What is my company's Q3 revenue?" and they reply, "I don't have that information."
This is exactly what happens with classic LLMs (ChatGPT, Claude, Gemini):
- They know general knowledge (what they learned during training)
- They don't know your data (documents, notes, database)
- Their knowledge is frozen in time (cutoff date)
The RAG Solution
RAG = Retrieval-Augmented Generation.
Put simply: before answering, the AI searches for relevant information in YOUR documents, then uses it to formulate its response.
Without RAG:
Question --> LLM --> Response (based on general knowledge)
With RAG:
Question --> Search in your docs --> Relevant documents
                                            |
                                            v
        Question + Relevant documents --> LLM --> Precise and sourced response
Simple Analogy
Think of a student taking an exam:
- Without RAG: they answer from memory (sometimes they get it wrong or make it up)
- With RAG: they're allowed to consult their revision notes before answering
RAG doesn't make the AI more intelligent -- it gives it access to your information.
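Concretely, the "with RAG" flow boils down to pasting the retrieved documents into the prompt before calling the LLM. A minimal sketch of that assembly step (the template wording and the `build_rag_prompt` helper are illustrative, not a fixed standard):

```python
def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Assembles the question and the retrieved documents into one prompt."""
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "What is my company's Q3 revenue?",
    ["Q3 revenue: 1.2M EUR, up 8% year over year."],
)
print(prompt)
```

The resulting string is what actually gets sent to the LLM in place of the bare question.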
🔢 Embeddings: Transforming Text into Vectors
The Key Concept
For the AI to "search" in your documents, you first need to transform the text into something a computer can compare efficiently: vectors (lists of numbers).
# Conceptually, an embedding transforms text into numbers
"The cat sleeps on the couch" --> [0.23, -0.45, 0.78, 0.12, ...]
"The feline rests in the living room" --> [0.21, -0.43, 0.76, 0.14, ...]
"Python is a language" --> [-0.67, 0.34, -0.12, 0.89, ...]
The first two sentences have close vectors (they're about the same topic). The third is distant (different topic).
How it Works
An embedding model (like text-embedding-3-small from OpenAI) is trained on billions of texts to understand semantics. It doesn't compare words one by one -- it understands the meaning.
import httpx
import os
async def get_embedding(text: str) -> list[float]:
    """Transforms text into a vector via OpenAI."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/embeddings",
            headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
            json={
                "model": "text-embedding-3-small",
                "input": text,
            },
        )
        response.raise_for_status()
        return response.json()["data"][0]["embedding"]
Similarity Comparison
Once you have vectors, comparing two texts becomes calculating the distance between their vectors:
import numpy as np
def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """Calculates cosine similarity between two vectors."""
    a = np.array(vec_a)
    b = np.array(vec_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Example
sim = cosine_similarity(
embedding_cat, # "The cat sleeps on the couch"
embedding_feline # "The feline rests in the living room"
)
# sim ≈ 0.95 (very similar!)
sim = cosine_similarity(
embedding_cat, # "The cat sleeps on the couch"
embedding_python # "Python is a language"
)
# sim ≈ 0.12 (not similar at all)
| Similarity Score | Meaning |
|---|---|
| 0.90 - 1.00 | Almost identical |
| 0.75 - 0.90 | Very similar |
| 0.50 - 0.75 | Related to the topic |
| 0.25 - 0.50 | Somewhat related |
| 0.00 - 0.25 | Not related |
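In practice these scores are typically used as a cutoff: only results above a threshold are fed to the LLM, so irrelevant chunks don't pollute the context. A minimal sketch (the 0.5 default is an assumption to tune per use case, and the scores here are invented for illustration):

```python
def filter_by_score(
    results: list[tuple[float, str]], threshold: float = 0.5
) -> list[str]:
    """Keeps only documents whose similarity score passes the threshold."""
    return [doc for score, doc in results if score >= threshold]

# Invented (score, document) pairs mimicking a similarity search result
results = [
    (0.93, "The feline rests in the living room"),
    (0.41, "Couches come in many colors"),
    (0.12, "Python is a language"),
]
print(filter_by_score(results))
# --> only the first document passes the default 0.5 threshold
```

Too high a threshold returns nothing on vague questions; too low drowns the LLM in noise.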
🗄️ Vector Databases: Where to Store Embeddings
Once you've transformed your texts into vectors, you need to store them somewhere and be able to search them efficiently. This is the role of vector databases.
Comparison of Solutions
| Solution | Type | Complexity | Performance | Self-hosted | Ideal for |
|---|---|---|---|---|---|
| FAISS | Python library | Easy | Excellent | Yes | Prototyping, small datasets |
| ChromaDB | Embedded database | Easy | Good | Yes | Side projects, <1M docs |
| Qdrant | Dedicated server | Medium | Excellent | Yes | Production, large volumes |
| pgvector | Postgres extension | Medium | Good | Yes | If you already have Postgres |
| Pinecone | Managed cloud | Easy | Excellent | No | If you don't want to manage infrastructure |
| Weaviate | Dedicated server | Complex | Excellent | Yes | Advanced cases, multimodal |
FAISS: The Simplest to Start
import faiss
import numpy as np
# Create a FAISS index
dimension = 1536 # Size of OpenAI embeddings
index = faiss.IndexFlatL2(dimension)
# Add vectors
vectors = np.array(embeddings_list).astype('float32')
index.add(vectors)
# Search for the 5 nearest neighbors
query_vector = np.array([query_embedding]).astype('float32')
distances, indices = index.search(query_vector, k=5)
print(f"Most relevant documents: {indices[0]}")
print(f"Distances: {distances[0]}")
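Note that `IndexFlatL2` ranks by Euclidean distance, not by the cosine similarity discussed above. The two rankings agree once vectors are L2-normalized; here is a numpy-only sketch of that equivalence (with FAISS you would normalize the same way and use `IndexFlatIP` instead):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scales a vector to unit length."""
    return v / np.linalg.norm(v)

# Toy 3-dimensional vectors standing in for real embeddings
a = normalize(np.array([0.23, -0.45, 0.78]))
b = normalize(np.array([0.21, -0.43, 0.76]))

cosine = float(np.dot(a, b))              # inner product of unit vectors
l2_squared = float(np.sum((a - b) ** 2))  # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b),
# so sorting by L2 distance equals sorting by cosine similarity
assert abs(l2_squared - (2 - 2 * cosine)) < 1e-9
print(f"cosine={cosine:.4f}, squared L2={l2_squared:.6f}")
```

So for already-normalized embeddings (which many models produce), `IndexFlatL2` and cosine similarity give the same ranking.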
ChromaDB: The User-Friendly Embedded Database
import chromadb
# Create/open a collection
client = chromadb.PersistentClient(path="/home/deploy/data/chroma")
collection = client.get_or_create_collection(
    name="my_documents",
    metadata={"hnsw:space": "cosine"},
)
# Add documents (ChromaDB generates embeddings automatically)
collection.add(
    documents=[
        "Complete guide to self-hosting",
        "How to configure a reverse proxy",
        "Basics of server security",
    ],
    ids=["doc1", "doc2", "doc3"],
    metadatas=[
        {"source": "blog", "date": "2025-01"},
        {"source": "blog", "date": "2025-02"},
        {"source": "wiki", "date": "2025-01"},
    ],
)
# Search
results = collection.query(
    query_texts=["how to secure my server"],
    n_results=3,
)
print(results["documents"])
# --> [["Basics of server security", ...]]
Qdrant: For Production
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
# Connection
client = QdrantClient(host="localhost", port=6333)
# Create a collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
    ),
)
# Add documents
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=embedding_1,
            payload={"text": "Self-hosting guide", "source": "blog"},
        ),
        PointStruct(
            id=2,
            vector=embedding_2,
            payload={"text": "Reverse proxy config", "source": "blog"},
        ),
    ],
)
# Search
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
)
for result in results:
    print(f"Score: {result.score:.3f} | {result.payload['text']}")
🔄 Complete Pipeline: From Ingestion to Response
Here's the complete RAG pipeline, step by step:
1. INGESTION          2. CHUNKING           3. EMBEDDING
┌───────────┐         ┌───────────┐         ┌───────────┐
│ Documents │────────>│ Split into│────────>│ Vectorize │
│ (PDF, MD, │         │ chunks of │         │ each      │
│ TXT...)   │         │ ~500      │         │ chunk     │
└───────────┘         │ tokens    │         └─────┬─────┘
                      └───────────┘               │
                                                  v
4. STORAGE            5. RETRIEVAL          6. GENERATION
┌───────────┐         ┌───────────┐         ┌───────────┐
│ Vector DB │<────────│ Search for│────────>│ LLM +     │
│ (Qdrant,  │         │ most      │         │ context   │
│ Chroma)   │────────>│ relevant  │         │ = Response│
└───────────┘         └───────────┘         └───────────┘
Step 1: Ingestion
Retrieve your documents from all sources:
from pathlib import Path
def load_documents(directory: str) -> list[dict]:
    """Loads all documents from a directory."""
    docs = []
    for path in Path(directory).rglob("*"):
        if path.suffix in (".md", ".txt", ".py", ".json"):
            content = path.read_text(encoding="utf-8")
            docs.append({
                "content": content,
                "source": str(path),
                "type": path.suffix,
            })
    return docs
documents = load_documents("/home/deploy/knowledge")
print(f"{len(documents)} documents loaded")
Step 2: Chunking
LLMs have context limits. Split large documents into digestible chunks:
def chunk_text(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50,
) -> list[str]:
    """Splits text into chunks with overlap.

    Sizes are counted in words, a rough approximation of tokens.
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks
# Example
text = "A very long document..."
chunks = chunk_text(text, chunk_size=500, overlap=50)
print(f"{len(chunks)} chunks created")
| Parameter | Recommended value | Impact |
|---|---|---|
| chunk_size | 300-800 tokens | Too small = context loss, too large = noise |
| overlap | 50-100 tokens | Ensures continuity between chunks |
| Separator | Paragraphs > sentences > words | Respects text structure |
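As the table suggests, splitting on paragraphs before falling back to words respects the text's structure better than a blind word split. One possible sketch, using greedy paragraph packing (the `chunk_by_paragraphs` helper is an assumption, sizes are counted in words as an approximation of tokens, and an oversized single paragraph is kept whole):

```python
def chunk_by_paragraphs(text: str, max_words: int = 500) -> list[str]:
    """Packs whole paragraphs into chunks, starting a new chunk on overflow."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in text.split("\n\n"):
        words = len(para.split())
        # Flush the current chunk if adding this paragraph would overflow it
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Two 60-word paragraphs with a 50-word budget end up in separate chunks
doc = "First paragraph. " * 30 + "\n\n" + "Second paragraph. " * 30
print(len(chunk_by_paragraphs(doc, max_words=50)))
```

A production version would add overlap and a sentence-level fallback for oversized paragraphs, but the structure-first idea is the same.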
Step 3: Embedding
Transform each chunk into a vector:
async def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Generates embeddings for a list of chunks."""
    embeddings = []
    # Process in batches of 100 to stay well under the API's input limit
    for i in range(0, len(chunks), 100):
        batch = chunks[i:i + 100]
        async with httpx.AsyncClient() as client:
            response = await client.post(
                "https://api.openai.com/v1/embeddings",
                headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
                json={
                    "model": "text-embedding-3-small",
                    "input": batch,
                },
                timeout=30.0,
            )
            response.raise_for_status()
            batch_embeddings = [
                item["embedding"]
                for item in response.json()["data"]
            ]
            embeddings.extend(batch_embeddings)
    return embeddings
Step 4: Storage
Insert into the vector database (here ChromaDB):
# chunk_metadata is assumed to hold one entry per chunk (with its source
# document), built alongside the chunks during the chunking step
collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    metadatas=[
        {"source": doc["source"], "chunk_index": i}
        for i, doc in enumerate(chunk_metadata)
    ],
)
Step 5: Retrieval
When the user asks a question, search for relevant chunks:
async def retrieve(question: str, n_results: int = 5) -> list[str]:
    """Retrieves the most relevant chunks for a question."""
    results = collection.query(
        query_texts=[question],
        n_results=n_results,
    )
    return results["documents"][0]