RAG Fundamentals

[!IMPORTANT] Retrieval-Augmented Generation (RAG) is the architecture that bridges the gap between an LLM’s frozen training data and your dynamic, proprietary data. It is the standard for building production AI applications.

1. The Problem with LLMs

Large Language Models (LLMs) like GPT-4 are incredibly powerful, but they have two fatal flaws when used in isolation:

  1. Hallucination: They confidently make up facts when they don’t know the answer.
  2. Knowledge Cutoff: Their training data is frozen in time. They don’t know about events that happened yesterday, or about your private company data.

Imagine an LLM as a brilliant scholar who has been locked in a library for two years: they know everything in those books, but nothing about the outside world since. RAG is like giving that scholar an internet connection and a search engine.

2. What is RAG?

RAG is a technique that retrieves relevant information from an external knowledge base and provides it to the LLM as context before asking it to generate an answer.

The RAG Triad

  1. Retriever: Finds the most relevant documents for the user’s query from a database.
  2. Augmenter: Combines the user’s query with the retrieved documents into a single prompt.
  3. Generator: The LLM takes the augmented prompt and generates a grounded response.
User Query → Retriever (search vector DB) → Context (top-K documents) → Generator (LLM + prompt) → Response
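The Retriever stage in the pipeline above typically ranks documents by vector similarity and keeps the top K. A minimal sketch of that ranking step, using hand-made toy vectors in place of real model embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" -- a real system would use an embedding model
docs = {
    "refunds allowed within 30 days": [0.9, 0.1, 0.0],
    "free shipping over $50":         [0.1, 0.9, 0.1],
    "support available 24/7":         [0.0, 0.2, 0.9],
}

def top_k(query_vec, k=2):
    # Rank documents by similarity to the query vector, keep the best k
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

print(top_k([0.8, 0.2, 0.1]))  # refund doc ranks first
```

In production the query vector comes from the same embedding model used at ingestion time, and the scan is replaced by an approximate nearest-neighbor index.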

3. Interactive: RAG Simulator

Enter a query to see how the system retrieves data and generates an answer. The simulator walks through four stages:

  1. User Query
  2. Retrieval (Vector DB)
  3. Augmented Prompt
  4. LLM Response
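The four stages above can also be traced end-to-end in a toy script. The keyword-overlap retriever and the stubbed response below are stand-ins for a real vector search and a real LLM call:

```python
documents = [
    "The refund policy allows returns within 30 days.",
    "Shipping is free for orders over $50.",
    "Support is available 24/7 via email.",
]

def words(text):
    # Lowercase and strip surrounding punctuation for crude matching
    return set(w.strip(".,?") for w in text.lower().split())

def simulate_rag(query):
    # 1. User Query
    print(f"1. User Query: {query}")

    # 2. Retrieval: pick the document sharing the most words with the query
    # (a real system ranks by embedding similarity instead)
    context = max(documents, key=lambda d: len(words(query) & words(d)))
    print(f"2. Retrieved: {context}")

    # 3. Augmented Prompt: combine context and question
    prompt = f"Answer using ONLY this context.\nContext: {context}\nQuestion: {query}"
    print(f"3. Augmented Prompt:\n{prompt}")

    # 4. LLM Response (stubbed; a real system sends the prompt to the model)
    print(f"4. LLM Response: (model answers grounded in: {context!r})")
    return context

simulate_rag("Is shipping free for orders over $50?")
```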

4. Basic RAG Implementation

Here is a minimal example of a RAG system. The Python version uses chromadb as the vector database and OpenAI for embeddings and generation; equivalent sketches in Java (LangChain4j) and Go (LangChainGo) follow.

```python
import chromadb
from openai import OpenAI

# 1. Initialize Vector DB
client = chromadb.Client()
collection = client.create_collection("knowledge_base")

# 2. Add Documents (Ingestion)
documents = [
    "The refund policy allows returns within 30 days.",
    "Shipping is free for orders over $50.",
    "Support is available 24/7 via email."
]
collection.add(
    documents=documents,
    ids=["doc1", "doc2", "doc3"]
)

# 3. Retrieval Function
def retrieve(query, n_results=1):
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    return results['documents'][0]

# 4. Generation Function
llm_client = OpenAI(api_key="sk-...")

def generate_answer(query):
    # Retrieve context
    context_docs = retrieve(query)
    context = "\n".join(context_docs)

    # Augment Prompt
    prompt = f"""
    Answer the question based ONLY on the context below.

    Context: {context}

    Question: {query}
    """

    # Generate
    response = llm_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Usage
print(generate_answer("How long do I have to return an item?"))
# Output: "You have 30 days to return an item based on the refund policy."
```
```java
// Java Implementation Using LangChain4j
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import dev.langchain4j.rag.content.retriever.ContentRetriever;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

public class BasicRAG {

    // Define our RAG Service interface
    interface Assistant {
        String answer(String query);
    }

    public static void main(String[] args) {
        ChatLanguageModel model = OpenAiChatModel.withApiKey("sk-...");

        // Abstracting Store logic
        var store = new InMemoryEmbeddingStore<TextSegment>();
        // ... insert docs into store ...
        ContentRetriever retriever = EmbeddingStoreContentRetriever.from(store);

        Assistant assistant = AiServices.builder(Assistant.class)
                .chatLanguageModel(model)
                .contentRetriever(retriever)
                .build();

        String response = assistant.answer("How long do I have to return an item?");
        System.out.println(response);
    }
}
```
```go
// Go Implementation Using LangChainGo
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/tmc/langchaingo/chains"
	"github.com/tmc/langchaingo/llms/openai"
	"github.com/tmc/langchaingo/vectorstores"
	"github.com/tmc/langchaingo/vectorstores/pinecone"
)

func main() {
	ctx := context.Background()

	llm, err := openai.New()
	if err != nil {
		log.Fatal(err)
	}

	// Setup store and retriever
	store, err := pinecone.New(pinecone.WithProjectName("demo"))
	if err != nil {
		log.Fatal(err)
	}
	retriever := vectorstores.ToRetriever(store, 1)

	// Retrieval QA chain: stuffs retrieved docs into the prompt
	chain := chains.NewRetrievalQAFromLLM(llm, retriever)

	answer, err := chains.Run(ctx, chain, "How long do I have to return an item?")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(answer)
}
```

5. Why Not Just Fine-Tuning?

A common misconception is that you should fine-tune an LLM to teach it new knowledge.

| Feature | RAG | Fine-Tuning |
| --- | --- | --- |
| Goal | Connect LLM to dynamic data | Change LLM behavior/style |
| Knowledge Update | Instant (add doc to DB) | Slow (re-train model) |
| Accuracy | High (grounded in retrieved docs) | Lower (can still hallucinate) |
| Cost | Low (vector DB + inference) | High (training compute) |
| Privacy | High (data stays in DB) | Low (data baked into model) |
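The "Knowledge Update" row is the operational heart of the comparison: with RAG, teaching the system a new fact is a single insert into the store, with no retraining. A minimal sketch, using a plain list and keyword-overlap lookup as stand-ins for a vector DB:

```python
knowledge_base = [
    "The refund policy allows returns within 30 days.",
    "Shipping is free for orders over $50.",
]

def retrieve(query):
    # Return the document sharing the most words with the query
    # (stand-in for an embedding similarity search)
    q = set(query.lower().strip("?").split())
    return max(knowledge_base, key=lambda d: len(q & set(d.lower().strip(".").split())))

query = "Do you offer gift wrapping"
print(retrieve(query))  # no relevant doc exists yet

# Instant knowledge update: append a document -- no retraining needed
knowledge_base.append("Gift wrapping is offered for $5 per item.")
print(retrieve(query))  # the new doc is immediately retrievable
```

A fine-tuned model would need a new training run (and redeployment) to absorb the same fact.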

[!TIP] Use RAG for knowledge (facts, data). Use Fine-Tuning for behavior (tone, format, specific coding style).

6. Next Steps

In the next chapter, we will dive deep into the engine that powers retrieval: Vector Databases.