# RAG Fundamentals
> [!IMPORTANT]
> Retrieval-Augmented Generation (RAG) is the architecture that bridges the gap between an LLM's frozen training data and your dynamic, proprietary data. It is the standard approach for building production AI applications.
## 1. The Problem with LLMs
Large Language Models (LLMs) like GPT-4 are incredibly powerful, but they have two fatal flaws when used in isolation:
- Hallucination: They confidently make up facts when they don’t know the answer.
- Knowledge Cutoff: Their training data is frozen in time. They don’t know about events that happened yesterday, or about your private company data.
Imagine an LLM as a brilliant scholar who has been locked in a library for 2 years. They know everything in those books, but nothing about the outside world since then. RAG is like giving that scholar an internet connection and a search engine.
## 2. What is RAG?
RAG is a technique that retrieves relevant information from an external knowledge base and provides it to the LLM as context before asking it to generate an answer.
### The RAG Triad
- Retriever: Finds the most relevant documents for the user’s query from a database.
- Augmenter: Combines the user’s query with the retrieved documents into a single prompt.
- Generator: The LLM takes the augmented prompt and generates a grounded response.
## 3. Interactive: RAG Simulator
Experience how RAG works step-by-step. Enter a query to see how the system retrieves data and generates an answer.
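The simulator's loop can also be sketched in plain Python with no external dependencies. Note that the keyword-overlap retriever and the stub generator below are illustrative stand-ins for embedding search and a real LLM, not production components:

```python
# Toy RAG loop: retrieve -> augment -> generate.
# The retriever and generator are deliberately simplistic stand-ins.

DOCS = [
    "The refund policy allows returns within 30 days.",
    "Shipping is free for orders over $50.",
    "Support is available 24/7 via email.",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Score each doc by word overlap with the query; return the best match."""
    q_words = set(query.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def augment(query: str, context: str) -> str:
    """Combine the query and retrieved context into one grounded prompt."""
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Stub generator: a real system would send the prompt to an LLM here."""
    return "Based on the context: " + prompt.split("\n")[1]

query = "Is shipping free?"
context = retrieve(query, DOCS)
print(generate(augment(query, context)))
```

Swapping the toy `retrieve` for a vector-database query and the stub `generate` for an LLM call yields the real system shown in the next section.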
## 4. Basic RAG Implementation
Here is a minimal example of a RAG system in Python. We use `chromadb` as our vector database (it embeds documents with its built-in default model) and OpenAI for generation.
```python
import chromadb
from openai import OpenAI

# 1. Initialize Vector DB
client = chromadb.Client()
collection = client.create_collection("knowledge_base")

# 2. Add Documents (Ingestion)
documents = [
    "The refund policy allows returns within 30 days.",
    "Shipping is free for orders over $50.",
    "Support is available 24/7 via email.",
]
collection.add(
    documents=documents,
    ids=["doc1", "doc2", "doc3"],
)

# 3. Retrieval Function
def retrieve(query, n_results=1):
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
    )
    return results["documents"][0]

# 4. Generation Function
llm_client = OpenAI(api_key="sk-...")

def generate_answer(query):
    # Retrieve context
    context_docs = retrieve(query)
    context = "\n".join(context_docs)

    # Augment prompt
    prompt = f"""
Answer the question based ONLY on the context below.

Context:
{context}

Question: {query}
"""

    # Generate
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Usage
print(generate_answer("How long do I have to return an item?"))
# Example output: "You have 30 days to return an item based on the refund policy."
```
## 5. Why Not Just Fine-Tuning?
A common misconception is that you should fine-tune an LLM to teach it new knowledge. In practice, the two approaches solve different problems:
| Feature | RAG | Fine-Tuning |
|---|---|---|
| Goal | Connect LLM to dynamic data | Change LLM behavior/style |
| Knowledge Update | Instant (add doc to DB) | Slow (re-train model) |
| Accuracy | High (grounded in retrieved docs) | Lower (can still hallucinate) |
| Cost | Low (Vector DB + Inference) | High (Training compute) |
| Privacy | High (Data stays in DB) | Low (Data baked into model) |
> [!TIP]
> Use RAG for knowledge (facts, data). Use fine-tuning for behavior (tone, format, a specific coding style).
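The "Knowledge Update: Instant" row is worth seeing concretely. In a RAG system, adding a document to the store makes it retrievable on the very next query, with no retraining. The toy in-memory store below (with an assumed keyword-overlap retriever standing in for embedding search) illustrates the point:

```python
# Toy illustration of RAG's instant knowledge update: a single insert
# makes new information immediately retrievable, with no retraining.

class ToyStore:
    def __init__(self):
        self.docs: dict[str, str] = {}

    def add(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text  # "ingestion" is just an insert

    def query(self, query: str) -> str:
        # Return the stored doc sharing the most words with the query.
        q = set(query.lower().split())
        return max(self.docs.values(), key=lambda d: len(q & set(d.lower().split())))

store = ToyStore()
store.add("doc1", "The refund policy allows returns within 30 days.")

# A new policy announced today -- one insert and it is live:
store.add("doc2", "Gift cards never expire.")
print(store.query("Do gift cards expire?"))  # retrieves the new doc instantly
```

Achieving the same update with fine-tuning would mean assembling training data and re-running a training job; here it is a dictionary insert.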
## 6. Next Steps
In the next chapter, we will dive deep into the engine that powers retrieval: Vector Databases.