RAG — Retrieval-Augmented Generation

Giving LLMs access to external knowledge so they can answer accurately

The Knowledge Problem

Large language models are impressively capable, but they have a fundamental limitation: their knowledge is frozen at training time. The original GPT-4, for example, had a training cutoff of September 2021, and GPT-4 Turbo launched with a cutoff of April 2023. Such a model cannot tell you what happened yesterday, look up a company's current stock price, or access your private documents. Every fact it knows was baked into its parameters during training, and those parameters do not change until the next training run, which can cost millions of dollars.

This leads to three problems. First, staleness: the model's knowledge becomes outdated the moment it finishes training. Second, hallucination: when the model does not know something, it often confidently makes up an answer rather than admitting ignorance. Third, lack of specificity: the model has general knowledge but cannot access domain-specific or private information such as your company's internal docs, a patient's medical records, or a legal case file. RAG (Retrieval-Augmented Generation) is designed to address all three.

What Is RAG?

RAG is a technique that gives language models access to external knowledge at inference time. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents from a database and feeds them into the model's context window alongside the user's question. The model then generates an answer grounded in those retrieved documents.

The process works in three steps. First, when a user asks a question, the system converts that question into a vector (embedding) and searches a vector database for the most similar document chunks. Second, the top-K retrieved chunks are combined with the original question into an augmented prompt. Third, the LLM reads this augmented prompt and generates an answer that draws from the provided context. The result is a response that is more accurate, more current, and citable, because you can trace it back to the source documents.
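The second step, building the augmented prompt, can be sketched in a few lines. This is a minimal illustration, not a prescribed template: the function name `build_augmented_prompt`, the instruction wording, and the sample chunks are all assumptions made for the example.

```python
def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt."""
    # Number each chunk so the model can cite its sources as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks, standing in for real search results.
retrieved = [
    "The refund window is 30 days from delivery.",
    "Refunds are issued to the original payment method.",
]
prompt = build_augmented_prompt("How long do I have to request a refund?", retrieved)
print(prompt)
```

Numbering the chunks is one simple way to make answers traceable: the bracketed indices give the model something concrete to cite.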

Think of RAG as giving the LLM an open-book exam. Instead of answering from memory alone (which might be wrong or outdated), the model gets to look up the answer in a reference library before responding. This substantially reduces hallucination and keeps the answer anchored to actual documents, provided the retriever surfaces the right ones.

The RAG Pipeline

The diagram below shows the complete RAG pipeline. A user query flows through retrieval (finding relevant documents), context assembly (combining retrieved chunks with the query), and generation (the LLM produces a grounded answer). The vector database sits at the heart of the retrieval step.

The RAG pipeline: Query → Retrieve → Context → LLM → Grounded Answer

Document Indexing: Chunking, Embedding, Vector DB

Before RAG can retrieve anything, you need to build an index of your documents. This happens in three stages. First, chunking: large documents are split into smaller pieces (typically 200-1000 tokens each). If chunks are too large, the retrieval loses precision — you fetch irrelevant text along with the good stuff. If chunks are too small, you lose context and may miss important connections.

Second, each chunk is passed through an embedding model (like OpenAI's text-embedding-3-small or BGE) that converts it into a high-dimensional vector: a list of numbers that captures the semantic meaning of the text. Similar text produces similar vectors. Third, these vectors are stored in a vector store that supports fast similarity search, either a vector database such as Pinecone, Weaviate, or Chroma, or a similarity-search library such as FAISS.
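The embed-and-store stages can be sketched with a toy in-memory index. Everything here is a stand-in chosen for illustration: `toy_embed` is a hashed bag-of-words vector, not a semantic embedding model, and `InMemoryIndex` is a hypothetical class, not the API of any of the vector stores named above.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in embedding: hashed bag-of-words, L2-normalized.

    NOT semantic; a real system would call an embedding model here.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        # Stable hash so the same word always lands in the same bucket.
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class InMemoryIndex:
    """Hypothetical minimal vector store: a list of (vector, chunk) pairs."""

    def __init__(self) -> None:
        self.entries: list[tuple[list[float], str]] = []

    def add(self, chunk: str) -> None:
        self.entries.append((toy_embed(chunk), chunk))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        q = toy_embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(q, vec)), chunk)
            for vec, chunk in self.entries
        ]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


index = InMemoryIndex()
for chunk in [
    "Refunds are accepted within 30 days of delivery.",
    "Shipping takes 3 to 5 business days.",
    "Our office is closed on public holidays.",
]:
    index.add(chunk)

print(index.search("refunds within 30 days", k=1))
```

The brute-force scan in `search` is fine for a handful of chunks; dedicated vector stores exist because this linear scan stops being practical at scale.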

Chunking Strategies Matter

Fixed-size chunking (e.g., 512 tokens with 50-token overlap) is the simplest approach. Semantic chunking splits at natural boundaries (paragraphs, sections) for better coherence. Recursive chunking splits on a hierarchy of separators (sections, then paragraphs, then sentences), descending to the next level only when a piece is still larger than the target size. The right strategy depends on your documents: legal text benefits from paragraph-aware chunking, while code might be chunked by function.
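Fixed-size chunking with overlap is easy to sketch as a sliding window. The function name `chunk_fixed` is hypothetical, and the example treats whitespace-separated words as tokens, which is an assumption; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens
    with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Tiny sizes so the overlap is visible; placeholder "tokens" w0..w9.
words = [f"w{i}" for i in range(10)]
for chunk in chunk_fixed(words, size=4, overlap=1):
    print(chunk)
```

With `size=4` and `overlap=1`, each window starts three tokens after the previous one, so the last token of one chunk reappears as the first token of the next. That shared token is what keeps a sentence split across a boundary partially visible in both chunks.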

Documents are split into chunks, each chunk is embedded into a vector, and all vectors are stored in a vector database for fast similarity search

Vector Similarity Search

When a user query arrives, it is converted into the same vector space as the document chunks using the same embedding model. The system then finds the chunks whose vectors are closest to the query vector. The most common similarity metric is cosine similarity, which measures the angle between two vectors:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

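A cosine similarity of 1.0 means the vectors point in exactly the same direction, so the embedding model treats the two texts as equivalent in meaning. A score of 0.0 means they are orthogonal (unrelated), and negative values down to -1.0 indicate opposing directions. What counts as a good match depends on the embedding model: scores of 0.7–0.95 are common for strong matches, but thresholds should be calibrated on your own data. The system retrieves the top-K chunks (usually K=3 to K=10) with the highest similarity scores and passes them to the LLM as context.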
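The formula and the top-K selection translate directly into code. The sketch below uses hand-made 3-dimensional vectors purely for illustration; real embeddings have hundreds or thousands of dimensions, and the chunk names are placeholders.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """(A · B) / (||A|| × ||B||), as in the formula above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunk_vectors: dict[str, list[float]], k: int = 3):
    """Return the k (score, name) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in chunk_vectors.items()]
    return sorted(scored, reverse=True)[:k]


chunk_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],  # same direction as the query
    "chunk_b": [0.6, 0.8, 0.0],  # partially aligned
    "chunk_c": [0.0, 0.0, 1.0],  # orthogonal to the query
}
query = [1.0, 0.0, 0.0]
print(top_k(query, chunk_vectors, k=2))
```

Here `chunk_a` scores 1.0, `chunk_b` scores about 0.6, and `chunk_c` scores 0.0, so the top-2 result keeps the first two and drops the orthogonal chunk.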

3D Visualization: Embedding Retrieval

The 3D scene below visualizes how retrieval works in embedding space. Each sphere represents a document chunk. The bright green sphere is the query. Lines connect the query to its top-5 most similar chunks; thicker, brighter lines indicate higher similarity scores.

Query sphere (green) connects to the top-5 most similar document chunks. Line thickness = similarity score.

Context Window Management

Every LLM has a maximum context window: the total number of tokens it can process at once. GPT-4 Turbo supports 128K tokens, and the Claude 3 models support 200K. In RAG, this window must hold the system prompt, the retrieved chunks, and the user's query, leaving room for the answer. If the retrieved chunks are too large, you either truncate them (losing information) or send fewer chunks (reducing coverage).

Several strategies help manage this constraint. Chunk ordering: place the most relevant chunks at the edges of the context, for example directly before the query, because models tend to use information at the start and end of a long context more reliably than information buried in the middle. Compression: use a smaller model to summarize or extract only the relevant parts of each chunk before sending them to the main LLM. Selective retrieval: use a re-ranker to filter out chunks that are marginally relevant, keeping only the highest-quality context.
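The budgeting arithmetic can be made concrete with a greedy packer: subtract the fixed costs from the window, then admit chunks in score order while they still fit. This is a sketch under stated assumptions. `count_tokens` approximates tokens by whitespace-separated words (a real system would use the model's tokenizer), and `pack_context` and its parameters are hypothetical names.

```python
def count_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pack_context(
    scored_chunks: list[tuple[float, str]],
    window: int,
    system_prompt: str,
    query: str,
    answer_reserve: int,
) -> list[str]:
    """Greedily keep the highest-scoring chunks that fit the remaining budget."""
    budget = window - count_tokens(system_prompt) - count_tokens(query) - answer_reserve
    kept = []
    for _score, chunk in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = count_tokens(chunk)
        if cost <= budget:
            kept.append(chunk)
            budget -= cost
    return kept


# Placeholder chunks made of repeated words so their sizes are obvious.
scored = [
    (0.91, " ".join(["refund"] * 8)),     # 8 "tokens"
    (0.84, " ".join(["shipping"] * 15)),  # 15 "tokens"
    (0.72, " ".join(["warranty"] * 6)),   # 6 "tokens"
]
kept = pack_context(
    scored,
    window=40,
    system_prompt="Answer using only the context.",  # 5 words
    query="What is the refund window?",              # 5 words
    answer_reserve=10,
)
print([chunk.split()[0] for chunk in kept])
```

The budget works out to 40 - 5 - 5 - 10 = 20 tokens. The 8-token chunk fits, the 15-token chunk no longer does, and the 6-token chunk is admitted in its place. The returned list is ordered most relevant first; a caller that wants the best chunk directly before the query can reverse it.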

Advanced RAG Techniques

Basic RAG (query → retrieve → generate) works but has limitations. The query might be ambiguous. The retriever might miss relevant documents. The retrieved chunks might be redundant. Advanced RAG techniques address these issues with multi-stage pipelines.

Query rewriting transforms the user's vague question into a more specific search query. HyDE (Hypothetical Document Embeddings) goes further: it asks the LLM to generate a hypothetical answer, then uses that answer's embedding to search the database. The idea is that a plausible answer is more similar to real documents than the question itself. Hybrid search combines vector similarity (semantic match) with BM25 (keyword match) for better recall.
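Hybrid search needs some way to merge the vector ranking with the BM25 ranking. The text does not specify one; the sketch below uses reciprocal rank fusion (RRF), a common choice that is an assumption here rather than this article's method. It only needs each retriever's ranked list of IDs, so the two score scales never have to be reconciled. The document IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document earns 1 / (k + rank) per list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical semantic-search ranking
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # hypothetical keyword-search ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents found by both retrievers (`doc_a`, `doc_c`) rise above those found by only one, which is the recall benefit hybrid search is after. The constant `k=60` is the value conventionally used with RRF; it damps the advantage of the very top ranks.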

Re-ranking adds a second pass: after the initial retrieval fetches, say, 20 candidates, a cross-encoder re-ranker scores each one for relevance to the specific query and selects the best K. This two-stage approach (fast retrieval + accurate ranking) gives much better precision than vector search alone.
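The two-stage shape can be sketched with the scorer passed in as a function. In the example, `overlap_score` is a deliberately crude, hypothetical stand-in for a cross-encoder: it measures word overlap, whereas a real cross-encoder is a trained model that reads the query and passage together. The candidate passages are invented for the example.

```python
from typing import Callable


def overlap_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    k: int = 3,
) -> list[str]:
    """Second stage: score every (query, candidate) pair and keep the k best."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:k]


# Pretend these came back from the fast first-stage vector search.
candidates = [
    "shipping times for international orders",
    "how to reset your account password",
    "password rules and account security tips",
]
best = rerank("reset account password", candidates, overlap_score, k=2)
print(best)
```

Keeping the scorer as a parameter mirrors the design trade-off described above: the first stage stays cheap because it compares precomputed vectors, while the expensive pairwise scorer only runs on the short candidate list.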

Naive RAG vs Advanced RAG

The comparison below illustrates the difference. Naive RAG passes the raw query through a single vector search. Advanced RAG uses query rewriting, hybrid search, and re-ranking to produce higher-quality context. As a rough illustration rather than a measured result, naive RAG might reach 60-70% accuracy on a benchmark where advanced RAG reaches 85-95%; actual numbers depend heavily on the dataset, retriever, and model.

Advanced RAG adds query rewriting, hybrid search, and re-ranking to dramatically improve retrieval quality.

Evaluating RAG Systems

How do you know if your RAG system is working well? The RAGAS framework (Retrieval Augmented Generation Assessment) provides four key metrics. Faithfulness measures whether the generated answer is supported by the retrieved context (no hallucination). Answer Relevance checks if the answer actually addresses the question. Context Precision evaluates whether the retrieved chunks are relevant. Context Recall assesses whether all necessary information was retrieved.

RAGAS evaluates both retrieval quality (context metrics) and generation quality (faithfulness, relevance).

In practice, you compute these metrics on a test set of question-answer pairs with ground-truth contexts. As rules of thumb, faithfulness below 0.8 suggests the model is hallucinating: either the context is insufficient or the model is ignoring it. Low context recall (< 0.7) suggests your retrieval is missing important documents, and you should consider better chunking or hybrid search.
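To make the two context metrics concrete, here is a simplified, ID-based version computed over a tiny test set. This is not how the RAGAS library computes its metrics (those rely on LLM judgments over text); it is a set-overlap analogue that assumes each test case lists the IDs of the chunks that should have been retrieved. All names and data are hypothetical.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of the needed chunks that were retrieved."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical test set: retrieved chunk IDs vs ground-truth chunk IDs.
test_set = [
    {"retrieved": ["d1", "d2", "d3", "d4"], "relevant": {"d1", "d3"}},
    {"retrieved": ["d5", "d6"], "relevant": {"d5", "d7", "d8"}},
]

precisions = [context_precision(c["retrieved"], c["relevant"]) for c in test_set]
recalls = [context_recall(c["retrieved"], c["relevant"]) for c in test_set]
mean_precision = sum(precisions) / len(precisions)
mean_recall = sum(recalls) / len(recalls)
print(f"precision={mean_precision:.2f} recall={mean_recall:.2f}")
```

In this toy set the second case retrieves only one of three needed chunks, which pulls mean recall to about 0.67, just under the 0.7 rule of thumb mentioned above.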

RAG vs Fine-Tuning

A common question: should you use RAG or fine-tuning to adapt an LLM to your use case? The answer is usually both, but for different reasons. RAG is best when knowledge changes frequently, you need citations and source tracking, or you want to access domain-specific documents without retraining. Fine-tuning is better for shaping the model's behavior, style, and output format, or when you need low-latency responses without the overhead of retrieval.

RAG for knowledge, Fine-Tuning for behavior. Production systems use both.

Most production systems combine both: fine-tune the model on your domain's interaction patterns (how to answer, what tone to use), then use RAG to ground each response in up-to-date documents. This gives you the best of both worlds — consistent behavior from fine-tuning and accurate, current knowledge from RAG.

Key Takeaways

1. RAG addresses the knowledge problem by retrieving relevant documents at inference time, so new information does not require retraining.

2. The pipeline has three core steps: convert the query to a vector, find the most similar document chunks, then feed them as context to the LLM.

3. Advanced RAG techniques (query rewriting, hybrid search, re-ranking) can raise accuracy substantially, on the order of ~65% to ~90% in the illustrative comparison above, though real gains depend on the data and models.

4. Chunking strategy and embedding model choice have an outsized impact on retrieval quality, so invest time in getting these right.

5. RAG and fine-tuning are complementary: use fine-tuning for behavior and style, RAG for knowledge and accuracy.
