🎯 Why a generic avatar isn't enough
A language model like Claude d'Anthropic is brilliant at general knowledge. But ask it a question about your internal billing process, your industry jargon, or your customers' preferences: it will make up a plausible but false answer.
The fundamental problem: LLMs don't know YOUR data. They were trained on the internet, not on your company.
A truly useful AI avatar must:
- Know your context: history, customers, products, processes
- Adopt your tone: formal, casual, technical — your own style
- Respond accurately: cite your documents, not hallucinate
- Evolve: integrate your new data over time
The good news? Three approaches can get you there, depending on your budget and technical skills. To understand how to make this avatar persistent, feel free to check out our guide on comment donner une mémoire long-terme à son avatar IA.
🔀 The 3 approaches: prompting, RAG, and fine-tuning
Before diving into the details, here is an overview of the three strategies for customizing an AI avatar.
Advanced prompting (easy level)
You inject your data directly into the prompt (system message). The model uses this context to respond. No additional infrastructure required.
RAG — Retrieval-Augmented Generation (intermediate level)
Your documents are chunked, vectorized, and stored in a vector database. With each question, relevant passages are retrieved and injected into the prompt. The model responds based on these excerpts.
Fine-tuning (advanced level)
You retrain (partially) the model on your data. The knowledge is integrated into the network's weights. More expensive, but the model "knows" natively.
📊 Comparison table of the 3 approaches
| Criterion | Advanced Prompting | RAG | Fine-tuning |
|---|---|---|---|
| Difficulty | ⭐ Easy | ⭐⭐ Medium | ⭐⭐⭐ Advanced |
| Initial cost | ~0 € | 50-200 € | 500-5 000 € |
| Recurring cost | Tokens (long context) | Vector DB hosting | Periodic retraining |
| Data volume | < 50 pages | 50 to 100,000+ docs | 1,000+ structured examples |
| Response quality | Good if sufficient context | Very good | Excellent on the domain |
| Data freshness | Immediate (copy-paste) | Near real-time | Requires retraining |
| Hallucinations | Medium risk | Low (sources cited) | Low but possible |
| Maintenance | Manual | Automatable | Heavy |
| Latency | Low | Medium (+retrieval) | Low |
| Ideal for | Prototyping, small volumes | Production, evolving docs | Specific tone/style, specialized domain |
💡 Advanced prompting: techniques and examples
Advanced prompting is the most accessible entry point. Three techniques stand out.
Few-shot prompting
Provide examples of ideal conversations in the system prompt. The goal is to show the AI the exact tone, the expected level of detail, and the structure of your typical responses (greeting, pitch, call to action).
Chain-of-thought (CoT)
Ask the model to reason step by step before answering. This technique consists of providing a sequence of reasoning in the prompt: identify the real need, search for relevant information, formulate the response, and then propose a next step.
Complete system prompt template
# IDENTITY
You are the AI avatar of [NAME], [TITLE] at [COMPANY].
# STYLE
- Tone: professional but approachable
- Length: concise responses (3-5 sentences), elaborate if asked
- Signature: always end with a question or a CTA
# KNOWLEDGE (injected)
[Paste your FAQs, pricing, processes here — up to ~30 pages]
# RULES
- Never invent figures. If you don't know, say so.
- Always cite the source when using a document.
- Redirect to a human if: legal, medical, serious complaint.
Limitations: the context window is limited (200K tokens for Claude, i.e., ~150,000 words). Beyond that, you need to switch to RAG.
🔍 RAG in detail: the complete pipeline
RAG is the most popular approach in production in 2025. Here is the complete pipeline.
Pipeline architecture
Documents → Chunking → Embeddings → Vector Store
↓
User question → Embedding → Similarity search → Top-K chunks
↓
Prompt + chunks → LLM → Response
Step 1: Document chunking
Split your documents into chunks of 500-1000 tokens with overlap. Tools like LangChain or LlamaIndex automate this splitting with a RecursiveCharacterTextSplitter that intelligently separates text by paragraphs, then sentences, while maintaining an overlap to avoid losing context between two chunks.
Step 2: Generating embeddings
Transform each chunk into a numerical vector using an embedding model. You can use OpenRouter to access different embedding models through a single API, for example OpenAI's text-embedding-3-small model.
Step 3: Storage in a vector database
Store the resulting vectors in a vector database like ChromaDB, Qdrant, or Pinecone. ChromaDB is an excellent option to get started: it installs locally, uses cosine similarity for searches, and allows you to associate metadata (source, document type) with each vector.
Step 4: Retrieval and generation
For each user question, the question's vector is compared to those in the database to retrieve the most relevant chunks (the Top-K). These excerpts are injected into a system prompt instructing the model to answer only based on this context. The final response is generated by the LLM, which drastically reduces hallucinations.
Key RAG optimizations
| Technique | Impact | Complexity |
|---|---|---|
| Hybrid search (BM25 + vectors) | +15-20% relevance | Medium |
| Reranking (Cohere, cross-encoder) | +10-15% relevance | Low |
| Semantic chunking | Better coherence | Medium |
| Metadata filtering | Targeted responses | Low |
| Query expansion | Better recall | Low |
| Parent-child chunks | Richer context | Medium |
To dive deeper into this architecture and understand how to make your avatar's memory persistent, check out our article on how to give long-term memory to your AI avatar.
🧬 Fine-tuning: when and how
Fine-tuning modifies the model's weights. It is the heaviest but most powerful approach for style and tone.
When fine-tuning is justified
- Your avatar must adopt a very specific style (technical jargon, particular tone)
- You have 1,000+ examples of ideal conversations
- RAG is not enough to capture complex reasoning patterns
- You want to reduce latency (no need for retrieval)
Preparing a JSONL dataset
The fine-tuning dataset comes in the form of a JSONL file where each line contains a complete conversation. Each exchange must follow a strict user / assistant alternation, with an initial system message defining the avatar's role.
Dataset preparation script
To prepare this file, a Python script goes through a folder of conversations in JSON format, validates that each message has a correct role (system, user or assistant), checks for the presence of at least one exchange, and then exports everything in the JSONL format specific to fine-tuning APIs.
Estimated fine-tuning costs
| Model | Training cost | Inference cost | Technique |
|---|---|---|---|
| GPT-4o mini fine-tuned | ~$3 / 1M tokens | $0.30 / 1M tokens | Full fine-tune |
| Llama 3.1 8B (LoRA) | ~$20 on RunPod | Self-hosted | LoRA / QLoRA |
| Mistral 7B (LoRA) | ~$15 on RunPod | Self-hosted | LoRA / QLoRA |
| Claude (via API) | Not available | Standard API | Prompting/RAG only |
Note: Anthropic's Claude does not offer public fine-tuning. Prefer RAG with Claude for excellent results without fine-tuning.
LoRA: lightweight fine-tuning
LoRA (Low-Rank Adaptation) allows you to fine-tune a model by modifying only a fraction of the weights. With Hugging Face's PEFT library, we target only certain layers (like q_proj and v_proj) with a reduced decomposition rank (e.g.: r=16). This makes it possible to train only 0.05% of the parameters of an 8-billion model, making fine-tuning possible on a single consumer GPU. To discover how to configure your AI's character, our guide on personality and convictions: configuring your AI's character perfectly complements this approach.
📁 Usable Data Types
Your avatar can learn from a wide variety of sources. Here is what you can leverage:
| Source | Format | Preprocessing | Value |
|---|---|---|---|
| Emails | .eml, .mbox | Extract body, remove auto signatures | ⭐⭐⭐ Personal style |
| Documents | .pdf, .docx, .md | OCR if scanned, text extraction | ⭐⭐⭐ Business knowledge |
| Slack/Teams | JSON export | Filter noise, keep useful threads | ⭐⭐ Informal tone |
| Notes | Notion, Obsidian | Markdown export | ⭐⭐⭐ Raw thoughts |
| Code | .py, .js, .ts | Keep comments | ⭐⭐ Technical style |
| Transcriptions | .srt, .txt (Whisper) | Clean up disfluencies | ⭐⭐⭐ Authentic voice |
| FAQ/Support | CSV, JSON | Structure into Q&A | ⭐⭐⭐ Direct answers |
| Presentations | .pptx | Extract text + notes | ⭐⭐ Key messages |
🧹 Preparing your data: the cleaning pipeline
Data quality is the determining factor. Garbage in, garbage out.
Cleaning pipeline
A good automated cleaning pipeline performs several sequential operations: normalizing spaces and removing visual separators, anonymizing personal data (replacing emails, phone numbers, and postal codes with generic tags using regular expressions), deduplication via MD5 hashing of the text to eliminate duplicates, and filtering out documents that are too short (fewer than 20 words).
Preparation checklist
- ✅ Cleaning: remove repetitive headers/footers, automatic signatures
- ✅ Deduplication: eliminate copies (forwarded emails, versioned docs)
- ✅ Anonymization: mask emails, phones, addresses, customer names
- ✅ Structuring: convert to a uniform format (markdown recommended)
- ✅ Validation: review a 5% sample to check quality
- ✅ Metadata: date, source, category — for downstream filtering
📏 Evaluation: Has your avatar learned well?
Training is good, measuring is better. Here is how to evaluate your avatar.
Key metrics
| Metric | How to measure | Target |
|---|---|---|
| Factual accuracy | % of verifiable responses in the sources | > 90% |
| Hallucination rate | Made-up responses per 100 test questions | < 5% |
| Relevance | Human score 1-5 on 50 questions | > 4.0 |
| Tone consistency | Blind evaluation vs original responses | > 80% similarity |
| Response time | P95 latency | < 3s |
Automated A/B testing
To automate evaluation, you create a test set containing typical questions and their reference answers. A script sends each question to the avatar, then a "judge" LLM (like Claude Sonnet) compares the generated response to the reference. The judge assigns a score of 1 to 5 and categorizes the response (correct, partial, wrong, hallucination), which provides a reliable statistical report on the avatar's quality.
🔄 Continuous updating: an evolving avatar
A static avatar becomes obsolete. Set up an update pipeline.
Refresh strategy
| Approach | Frequency | Automatable | Effort |
|---|---|---|---|
| Prompting | On every change | ✅ Yes | Low |
| RAG | Daily / weekly | ✅ Yes (cron) | Low |
| Fine-tuning | Monthly / quarterly | ⚠️ Semi-auto | High |
Continuous ingestion pipeline for RAG
For RAG, continuous updating is set up via a scheduled script (using a tool like schedule in Python) that runs every night. The script retrieves new documents added in the last 24 hours, passes them through the cleaning pipeline, chunks them, generates their embeddings, and inserts them directly into the vector database. This process is completely transparent to the end user.
⚠️ Common mistakes
1. Overfitting on training data
The fine-tuned model recites your documents word for word instead of synthesizing them. Solution: reduce the number of epochs, increase the diversity of examples.
2. Hallucinations on outdated data
Your avatar quotes a price from 2023 when you updated it in 2025. Solution: version your data, remove outdated chunks from the vector store, add date metadata.
3. Selection bias
If you only feed it your successes (positive case studies), the avatar won't know how to handle objections. Solution: include difficult conversations, refusals, and edge cases.
4. Sensitive data leakage
The avatar reveals confidential information to unauthorized users. Solution: upstream anonymization, output filtering, access levels.
5. Dependence on a single model
Your fine-tuning works on GPT-4 but OpenAI changes its terms. Solution: favor RAG (portable) or fine-tune open source models via OpenRouter.
6. Neglecting personality
You focus solely on factual knowledge but the avatar sounds robotic. Solution: work on the personality system prompt in parallel, or even specifically configure your AI's character so that it adopts your convictions and tone.
📋 Which approach based on your situation?
| Situation | Data volume | Budget | Recommended approach |
|---|---|---|---|
| Freelance, startup | < 50 docs | €0 | Advanced Prompting |
| SMB, document base | 50-500 docs | €50-200/month | RAG with ChromaDB |
| SMB, critical production | 500-5,000 docs | €200-500/month | Optimized RAG + reranking |
| Enterprise, specialized domain | 1,000+ conversations | €1,000+ | Fine-tuning + RAG |
| AI startup, avatar product | Unlimited | Variable | RAG + LoRA fine-tuning |
To host your RAG stack (vector DB, API, backend), a dedicated VPS is recommended. Hostinger offers powerful solutions with a 20% discount — sufficient for ChromaDB + a Python API.
🏗️ Complete example: consultant avatar with 500 docs
Let's put it all together with a concrete use case. Marie is a digital transformation consultant. She has 500 documents: sales proposals, client emails, blog articles, webinar transcripts.
Step 1: Inventory and collection
The first step is to list and catalog all available files by category (PDF sales proposals, .eml emails, Markdown blog articles, plain text transcripts) to get a complete inventory before extraction.
Step 2: Extraction and cleaning
Each file type requires specific processing: PDFs are read using a library like PyMuPDF to extract the text from each page, emails are parsed with Python's standard email module to isolate the message body, and text/markdown files are read directly. Everything then goes through the cleaning and anonymization pipeline seen earlier (typically going from 500 to ~420 useful documents).
Step 3: Complete RAG pipeline
The cleaned documents are split into chunks via the splitter, vectorized with the embeddings model, and then inserted in batches into the ChromaDB collection with their metadata (source and document type). This generally yields several thousand chunks for 500 documents.
Step 4: Marie's custom system prompt
# IDENTITY
You are the AI avatar of Marie Dupont, a digital transformation consultant
for 12 years. Founder of the DigitalShift consulting firm.
# STYLE
- Direct and pragmatic tone, no unnecessary jargon
- Always provide concrete figures when possible
- End with an actionable next step
- Use "tu" for recurring contacts, "vous" for new ones
# EXPERTISE
SME/Mid-cap digital transformation, change management, IT audit,
team training, generative AI applied to business.
# RULES
- Cite the source of the document used in [brackets]
- If the question is outside your expertise, redirect to a partner
- Never communicate personalized rates without validation
- Maximum 200 words unless explicit detail is requested
Step 5: Testing and iteration
The avatar is subjected to a battery of test questions covering the main use cases (rates, methodology, training) along with reference answers. The automated evaluation script provides an initial score (typically 85% correct answers on the first try), which rises to 95% after adjusting the prompt and the retrieval settings.
Result: Marie's avatar correctly answers 95% of common questions, cites its sources, and maintains its direct and pragmatic tone. All of this for approximately €100/month in infrastructure (VPS + embeddings API + LLM tokens).
The Essentials
- Three approaches to train an avatar: advanced prompting (beginner), RAG (production), fine-tuning (expert).
- RAG is the sweet spot for 90% of use cases: scalable, low-cost, and drastically reduces hallucinations.
- Data quality matters more than quantity: 50 well-prepared documents beat 5,000 poorly cleaned ones.
- Start simple with prompting, switch to RAG when you exceed 50 pages, reserve fine-tuning for style and tone.
- Measure systematically with a test set before considering your avatar production-ready.
Recommended Tools
- Claude d'Anthropic: the best model for RAG in 2025, huge context window (200K tokens), excellent at factual accuracy.
- OpenRouter: API aggregator to access multiple embedding and LLM models via a single key.
- ChromaDB: local vector database, ideal for prototyping and deploying a RAG system without complex infrastructure.
- LangChain / LlamaIndex: Python frameworks to orchestrate the RAG pipeline (chunking, embeddings, retrieval).
- PyMuPDF: reliable text extraction from PDFs, including scanned documents via OCR.
- Hostinger: affordable VPS hosting to deploy your RAG stack in production.
- OpenClaw: orchestrator to connect your AI avatar to your daily tools.
🚀 Conclusion: take action
Training an AI avatar with your data is no longer reserved for data scientists. With advanced prompting, you can get started in 30 minutes. With RAG, you can go into production in a few days. Fine-tuning remains the nuclear option for the most demanding use cases.
The key to success? Start simple, measure, iterate. An avatar fed with 50 well-prepared documents will always beat a model fine-tuned on 5,000 poorly cleaned documents.
Explore OpenClaw to orchestrate your AI avatar with tools like Claude and OpenRouter. The source code is available on GitHub. If you want to go further in creating your digital twin, our guide to creating an expert AI avatar in your field will guide you step by step. For a concrete business use case, discover how to use an AI avatar for customer service: replacing without losing the human touch. Finally, if you are looking to maximize your personal productivity, the article on the AI avatar + personal assistant combo: the ultimate productivity combo is made for you.
```