Advanced RAG Architectures

> [!WARNING]
> Naive RAG (simple retrieve-then-generate) often fails in production. It struggles with complex queries, long documents, and conflicting information.

1. The Chunking Problem

The first step in RAG is splitting your documents into smaller pieces (“chunks”). How you do this drastically affects performance.

  1. Fixed-Size Chunking: Split every N characters. Fast but breaks sentences.
  2. Recursive Chunking: Split by paragraphs, then sentences. Preserves structure.
  3. Semantic Chunking: Split when the topic changes (using embeddings).
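The first two strategies can be sketched in plain Python. This is a minimal illustration, not a production splitter (real pipelines typically use a library splitter, and the sentence split here naively drops the `". "` separator):

```python
def fixed_size_chunks(text: str, size: int = 200) -> list[str]:
    """Split every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def recursive_chunks(text: str, max_size: int = 200) -> list[str]:
    """Split on paragraphs first, then sentences, only when a piece is too big."""
    if len(text) <= max_size:
        return [text]
    for sep in ("\n\n", ". "):
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_chunks(part, max_size))
            return chunks
    # No natural boundary left: fall back to fixed-size splitting.
    return fixed_size_chunks(text, max_size)
```

Note how the recursive version only falls back to hard character splits when no paragraph or sentence boundary is available, which is why it preserves structure better.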

2. Interactive: Chunking Visualizer

Paste text below to see how different chunking strategies affect the context windows.

3. Improving Retrieval

Once the data is chunked, retrieval needs to surface the most relevant chunks for each query.

1. Hybrid Search (Keyword + Vector)

Vector search can miss exact keyword matches (e.g., it may conflate near-identical product part numbers like “X-99” and “X-98”).

  • Solution: Run BM25 (Keyword) and Vector Search in parallel.
  • Fusion: Combine results using Reciprocal Rank Fusion (RRF).
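RRF is simple enough to sketch directly. Each document's fused score is the sum of `1 / (k + rank)` over every ranked list it appears in; `k = 60` is the constant from the original RRF paper. The document ids below are made up for illustration:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in any list accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]     # keyword ranking
vector_hits = ["doc1", "doc5", "doc3"]   # vector ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.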

2. Query Expansion

Users ask vague questions (“Refund?”).

  • Solution: Use an LLM to rewrite the query into multiple variations.
  • “What is the refund policy?”
  • “How do I get my money back?”
  • “Return process steps.”
  • Search for all variations and deduplicate results.
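A sketch of the expand-search-deduplicate loop, assuming `llm` and `search` are placeholder callables for your own model client and retrieval backend (neither is a real library API):

```python
def expand_and_search(query, llm, search, k=10):
    """Rewrite the query into variations, search each, dedupe by doc id."""
    prompt = f"Rewrite this search query 3 different ways, one per line: {query}"
    # Keep the original query as well; LLM rewrites can drift off-topic.
    variations = [query] + [v.strip() for v in llm(prompt).splitlines() if v.strip()]
    seen, results = set(), []
    for variation in variations:
        for doc in search(variation, k=k):
            if doc["id"] not in seen:   # first occurrence wins
                seen.add(doc["id"])
                results.append(doc)
    return results
```

Deduplicating by document id matters because the variations are deliberately paraphrases of each other, so their result lists overlap heavily.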

3. Re-ranking (The “Secret Sauce”)

Vector DB retrieval is fast but approximate: the bi-encoder embeds the query and each document independently, so it cannot model fine-grained query-document interactions. A cross-encoder reads both together and scores them jointly.

  • Step 1: Retrieve Top 50 documents using Vector Search (Fast).
  • Step 2: Use a Cross-Encoder model (Slow but accurate) to score each document against the query.
  • Step 3: Pass the Top 5 from the re-ranked list to the LLM.
```python
# Re-ranking with a cross-encoder. `vector_db` and `query` are placeholders
# for your own retrieval client and input.
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# 1. Fast retrieval: over-fetch candidates with approximate vector search
hits = vector_db.search(query, k=50)

# 2. Re-rank: score each (query, document) pair jointly
pairs = [[query, doc.text] for doc in hits]
scores = cross_encoder.predict(pairs)

# 3. Sort by cross-encoder score and keep only the best documents
ranked_hits = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)
top_5 = [doc for doc, score in ranked_hits[:5]]
```

4. Architecture Diagram: Advanced RAG

User Query
  → Query Expander (LLM)
  → Hybrid Search (BM25 + Vector)
  → Fusion (RRF)
  → Re-ranker (Cross-Encoder)
  → Generator (LLM)

5. Modular RAG

In production, RAG is not a linear pipeline; it’s a DAG (Directed Acyclic Graph).

  • Routing: “Is this query about math?” → Route to Wolfram Alpha tool. “Is it about history?” → Route to Vector DB.
  • Self-RAG: The LLM generates an answer, then critiques itself. If the confidence is low, it searches again or says “I don’t know.”
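The routing node above can be as simple as a one-word classification call. A minimal sketch, where `llm` is a placeholder callable and the tool names are hypothetical (in practice routing is often done with a cheap model or an embedding classifier instead):

```python
def route(query: str, llm) -> str:
    """Ask the LLM to classify the query, then dispatch to a tool name."""
    category = llm(
        "Classify this query as 'math', 'history', or 'other'. "
        f"Reply with one word.\nQuery: {query}"
    ).strip().lower()  # tolerate stray whitespace / capitalization
    if category == "math":
        return "wolfram_alpha"
    if category == "history":
        return "vector_db"
    return "direct_llm"   # fall back to answering without retrieval
```

Each branch of the DAG then runs its own retrieval (or none), which is what distinguishes Modular RAG from a fixed linear pipeline.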

6. Next Steps

Review everything you’ve learned in the Module Review.