Module Review: RAG
This chapter reviews the key concepts, architectures, and implementation details of Retrieval-Augmented Generation (RAG).
Key Takeaways
- RAG = Retrieval + Generation: It mitigates LLM hallucinations and knowledge cutoffs by grounding answers in external context.
- Embeddings: Vectors that represent semantic meaning. Similar concepts are close in vector space.
- Vector Databases: Specialized stores (Pinecone, ChromaDB) optimized for high-dimensional similarity search using Approximate Nearest Neighbor (ANN) algorithms such as HNSW.
- Chunking Matters: How you split text affects retrieval quality. Recursive chunking is generally better than fixed-size.
- Hybrid Search: Combining keyword search (BM25) with vector search typically improves recall over either method alone.
- Re-ranking: A second pass using a Cross-Encoder drastically improves precision.
- Production RAG: Not a linear pipeline but a complex system with query expansion, routing, and self-correction.
Flashcards
Test your knowledge with the questions and answers below.
What are the two main problems RAG solves?
1. Hallucinations (making up facts)
2. Knowledge Cutoffs (outdated data)
What is an Embedding?
A vector (list of numbers) representing the semantic meaning of text.
Which distance metric is most common for text similarity?
Cosine Similarity (measures the angle between vectors).
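As a concrete check, cosine similarity can be computed directly from the dot product and the vector norms. A minimal sketch using only the standard library:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))            # same direction -> 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 4))  # orthogonal -> 0.0
```

Because the metric depends only on the angle, two documents of very different lengths can still score as highly similar if they point in the same semantic direction.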
What is the trade-off of Re-ranking?
It improves accuracy (precision) but increases latency (slower) and cost.
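The re-ranking step itself is just a second scoring pass over the already-retrieved candidates. In the sketch below, `cross_encoder_score` is a stand-in lexical-overlap scorer so the example stays self-contained; a real system would call a cross-encoder model (which reads the query and document jointly, hence the extra latency and cost):

```python
def cross_encoder_score(query, doc):
    # Stand-in scorer: fraction of query words that appear in the document.
    # A real cross-encoder would jointly encode (query, doc) and output a relevance score.
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def rerank(query, candidates, top_n=2):
    # Second pass over the retriever's candidates: re-score and keep the best.
    scored = sorted(candidates, key=lambda d: cross_encoder_score(query, d), reverse=True)
    return scored[:top_n]

candidates = [
    "vector databases store embeddings",
    "rag reduces hallucinations by grounding answers",
    "hallucinations happen when the model invents facts",
]
print(rerank("why does rag reduce hallucinations", candidates, top_n=1))
```

The latency cost comes from scoring every (query, candidate) pair with a full model forward pass, which is why re-ranking is applied only to the top-k retrieved documents, not the whole corpus.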
What does HNSW stand for?
Hierarchical Navigable Small World (an algorithm for fast approximate nearest neighbor search).
RAG Cheat Sheet
Common Hyperparameters
| Parameter | Recommended Start | Description |
|---|---|---|
| Chunk Size | 512 - 1024 tokens | Size of each text block. |
| Chunk Overlap | 10% - 20% | Tokens shared between adjacent chunks to preserve context. |
| Top K | 3 - 5 | Number of documents to retrieve. |
| Temperature | 0.0 - 0.3 | Lower temperature reduces hallucinations in RAG. |
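The chunk-size and overlap settings above can be illustrated with a minimal fixed-size chunker. It splits on words as a rough stand-in for tokens; a real pipeline would use the embedding model's tokenizer:

```python
def chunk_text(words, chunk_size=8, overlap=2):
    """Split a word list into fixed-size chunks that share `overlap` words with their neighbor."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

words = "retrieval augmented generation grounds the model in external documents at query time".split()
for chunk in chunk_text(words, chunk_size=8, overlap=2):
    print(chunk)
```

Note how the last two words of each chunk reappear at the start of the next one: that shared window is what keeps a sentence spanning a chunk boundary from being cut off in both halves.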
RAG Components
| Component | Popular Tools |
|---|---|
| Orchestration | LangChain, LlamaIndex |
| Vector DB | Pinecone, ChromaDB, Weaviate, pgvector |
| Embeddings | OpenAI text-embedding-3, HuggingFace all-MiniLM-L6-v2 |
| Evaluation | RAGAS, TruLens |
Quick Revision
- RAG pipeline: Retrieve, Augment, Generate.
- Vector Search: Uses cosine similarity to find semantically related documents in an N-dimensional space.
- Chunking: Breaking documents into optimal sizes. Recursive is preferred over fixed size.
- Hybrid Search: Combining keyword search (BM25) with vector search and fusing results (RRF).
- Re-ranking: An essential step to increase precision by scoring top-k results with a cross-encoder model.
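Reciprocal Rank Fusion (RRF), mentioned above for hybrid search, merges ranked lists by giving each document a score of 1 / (k + rank) per list and summing, with k = 60 as the commonly cited default. A minimal sketch:

```python
def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs; documents ranked well in any list rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results   = ["doc_a", "doc_b", "doc_c"]  # keyword (BM25) ranking
vector_results = ["doc_b", "doc_c", "doc_a"]  # vector-search ranking
print(rrf_fuse([bm25_results, vector_results]))
```

Because RRF only uses ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.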
Next Steps
Now that you understand how to augment LLMs with external data, let’s learn how to permanently teach them new skills.
Module 04: Fine-Tuning (Coming Soon)