Retrieval-Augmented Generation

A technique that enhances LLM outputs by first retrieving relevant documents from an external knowledge base before generating a response.

Retrieval-Augmented Generation (RAG) is an architecture that improves the accuracy and relevance of LLM responses by grounding them in retrieved information. Instead of relying solely on the model's internal knowledge, which may be stale or incorrect, RAG first queries an external knowledge base for relevant documents, then passes them to the LLM as context when generating the answer.

RAG solves two critical problems: hallucination and knowledge cutoff. Because the model can see the actual source documents, it is less likely to fabricate information. And because the knowledge base can be updated independently of the model, RAG systems stay current without expensive retraining.

RAG flow: User query → embed query → search vector DB → retrieve top-K chunks → inject into prompt → LLM generates grounded answer with citations.
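The flow above can be sketched end to end. This is a minimal toy illustration, not a production implementation: a bag-of-words counter stands in for a real embedding model, a plain list stands in for a vector database, and the final prompt is printed rather than sent to an LLM.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: a bag-of-words term-frequency vector.
    # A real system would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # "Search vector DB → retrieve top-K chunks", here a linear scan.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Inject the retrieved chunks as numbered sources the LLM can cite.
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only the sources below.\n\n{context}\n\nQuestion: {query}"

docs = [
    "RAG grounds LLM answers in retrieved documents.",
    "Vector databases store embeddings for similarity search.",
    "Bananas are a good source of potassium.",
]
top = retrieve("How does RAG ground answers?", docs)
print(build_prompt("How does RAG ground answers?", top))
```

In a real deployment, `embed` would be an embedding model, `retrieve` would hit a vector database's nearest-neighbor index, and the prompt would be sent to the generator LLM.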

RAG Components

  • Chunking — splitting documents into appropriately sized pieces
  • Embedding model — converts text to vectors for similarity search
  • Vector database — stores and retrieves document embeddings
  • Reranker — re-scores retrieved chunks for relevance
  • Generator (LLM) — synthesizes retrieved context into an answer
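The chunking component can be as simple as fixed-size windows with overlap, so that a sentence split at a chunk boundary still appears whole in at least one chunk. A minimal sketch; the 200-character size and 50-character overlap are illustrative defaults, not recommendations:

```python
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Fixed-size character chunking. Each chunk starts `size - overlap`
    # characters after the previous one, so consecutive chunks share
    # `overlap` characters of context.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

pieces = chunk("RAG systems split long documents before embedding them. " * 10)
print(len(pieces), "chunks")
```

Production systems often chunk on semantic boundaries (paragraphs, headings, sentences) and count tokens rather than characters, but the size/overlap trade-off is the same.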

RAG is the most widely deployed AI architecture in enterprise applications. It powers customer support bots, internal knowledge bases, legal research tools, and medical information systems. Advanced variants include multi-hop RAG (chaining multiple retrievals), agentic RAG (letting the model decide when and what to retrieve), and hybrid search (combining vector similarity with keyword matching).
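Hybrid search needs a way to merge the keyword ranking and the vector ranking into one list; a common choice is Reciprocal Rank Fusion (RRF), which combines ranks rather than raw scores and so avoids tuning score scales against each other. A minimal sketch, with hypothetical document IDs and the conventional k = 60 smoothing constant:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each ranker contributes 1 / (k + rank)
    # to a document's fused score; documents ranked highly by several
    # rankers rise to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_ranking = ["doc_a", "doc_b", "doc_c"]  # e.g. from BM25
vector_ranking = ["doc_a", "doc_d", "doc_b"]   # e.g. from a vector DB
fused = rrf([keyword_ranking, vector_ranking])
print(fused)
```

Here `doc_a` wins because both rankers place it first, while `doc_b` beats `doc_d` by appearing in both lists rather than just one.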
