RAG: Retrieval-Augmented Generation
Overview
Retrieval-Augmented Generation (RAG) is a technique that enhances LLM outputs by retrieving relevant information from external knowledge bases before generating a response [18].
Traditional RAG Pipeline
Components
| Component | Function |
|---|---|
| Document Loader | Ingests documents from various sources |
| Text Splitter | Chunks documents into manageable pieces |
| Embedding Model | Converts text to vector representations |
| Vector Store | Stores and indexes embeddings for retrieval |
| Retriever | Finds relevant documents based on query |
| Generator | LLM that produces final response |
Pipeline Flow
- Indexing: Documents are loaded, chunked, embedded, and stored
- Retrieval: The user query is embedded and the most similar chunks are retrieved
- Generation: The retrieved context is combined with the query in the prompt to the LLM
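The sketch below walks these three stages end to end without a framework, using sentence-transformers for embeddings and brute-force cosine similarity in NumPy; the corpus, query, and model name are illustrative assumptions, and the final prompt would be handed to whatever LLM serves as the generator.

```python
# Minimal sketch of the three RAG stages (corpus, query, and model are illustrative).
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Indexing: embed the chunks and keep them in an in-memory "store".
chunks = [
    "RAG retrieves relevant documents before generation.",
    "Vector stores index embeddings for fast similarity search.",
    "Chunk overlap preserves context across boundaries.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
index = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, dim)

# 2. Retrieval: embed the query and take the top-k chunks by cosine similarity.
query = "Why do vector stores matter for RAG?"
q = model.encode([query], normalize_embeddings=True)[0]
scores = index @ q                    # dot product = cosine (unit-norm vectors)
top_k = np.argsort(scores)[::-1][:2]  # indices of the two best chunks

# 3. Generation: splice the retrieved context into the prompt.
context = "\n".join(chunks[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` would now go to the generator LLM.
```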
Agentic RAG
Agentic RAG extends traditional RAG by giving an LLM agent control over the retrieval process, enabling more sophisticated information-gathering strategies.
Key Differences
| Aspect | Traditional RAG | Agentic RAG |
|---|---|---|
| Retrieval Control | Fixed pipeline | Agent decides when/what to retrieve |
| Query Formulation | Direct user query | Agent reformulates queries |
| Iteration | Single retrieval | Multiple retrieval rounds |
| Source Selection | Predefined sources | Dynamic source selection |
Agentic RAG Capabilities
- Query Decomposition: Break complex queries into sub-queries
- Source Routing: Select appropriate knowledge bases
- Iterative Refinement: Retrieve additional information as needed
- Result Validation: Verify retrieved information quality
- Multi-hop Reasoning: Chain multiple retrievals for complex questions
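A minimal sketch of that control loop is shown below, assuming two hypothetical helpers: `llm_decide` stands in for the agent's planning call and `search` for querying a knowledge base; both are stubbed with dummy logic here, not a real library API.

```python
# Agentic retrieval loop; llm_decide and search are hypothetical stand-ins.
def llm_decide(question, evidence):
    # In a real system this is an LLM call that inspects the evidence gathered
    # so far and plans the next step. Stub: stop once any evidence exists.
    if evidence:
        return {"done": True}
    return {"done": False, "source": "docs", "query": f"rephrased: {question}"}

def search(source, query):
    # Stand-in for querying the selected knowledge base.
    return [f"[{source}] result for: {query}"]

def agentic_rag(question, max_rounds=3):
    evidence = []
    for _ in range(max_rounds):                # iterative refinement
        plan = llm_decide(question, evidence)  # agent decides when/what to retrieve
        if plan["done"]:                       # agent judges evidence sufficient
            break
        # dynamic source selection + query reformulation
        evidence += search(plan["source"], plan["query"])
    return evidence

print(agentic_rag("What changed between API v1 and v2?"))
```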
Vector Stores
Vector stores are specialized databases optimized for storing and querying vector embeddings.
Popular Options
| Vector Store | Type | Key Features |
|---|---|---|
| Pinecone | Managed | Fully managed, scalable, fast |
| Weaviate | Open-source | GraphQL API, hybrid search |
| Chroma | Open-source | Lightweight, easy to use |
| Milvus | Open-source | Highly scalable, GPU support |
| Qdrant | Open-source | Rust-based, filtering support |
| pgvector | PostgreSQL extension | SQL integration, familiar tooling |
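For a concrete feel of the interface these stores expose, here is a minimal sketch using Chroma's native in-memory client (the documents and query are made up); most of the options above offer a similar add/query surface.

```python
# Minimal Chroma usage; documents and query are illustrative.
import chromadb

client = chromadb.Client()  # ephemeral, in-memory store
collection = client.create_collection(name="docs")

# Chroma embeds the documents with its default embedding function.
collection.add(
    ids=["a", "b", "c"],
    documents=[
        "Pinecone is a managed vector database.",
        "pgvector adds vector search to PostgreSQL.",
        "Chunk overlap preserves boundary context.",
    ],
)

# Nearest-neighbor query over the stored embeddings.
results = collection.query(query_texts=["vector search in SQL"], n_results=1)
print(results["documents"])  # -> the pgvector document
```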
Chunking Strategies
How documents are split significantly impacts retrieval quality.
| Strategy | Description | Best For |
|---|---|---|
| Fixed Size | Split by character/token count | Simple documents |
| Recursive | Split by separators hierarchically | Structured text |
| Semantic | Split by meaning/topic | Complex documents |
| Document-Aware | Respect document structure | PDFs, HTML, Markdown |
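To make the first two strategies concrete, the sketch below runs LangChain's fixed-size and recursive splitters over the same text (the chunk size is deliberately tiny for illustration); the recursive splitter tries separators in order (paragraphs, then lines, then spaces), so its chunks tend to end at natural boundaries.

```python
# Fixed-size vs. recursive splitting; chunk sizes are illustrative.
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)

text = "First paragraph about RAG.\n\nSecond paragraph about chunking.\n\nThird one."

# Fixed size: cuts purely by character count, may break mid-word.
fixed = CharacterTextSplitter(separator="", chunk_size=40, chunk_overlap=0)
# Recursive: tries "\n\n", then "\n", then " " before cutting mid-word.
recursive = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=0)

print(fixed.split_text(text))      # arbitrary 40-character cuts
print(recursive.split_text(text))  # one chunk per paragraph
```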
Retrieval Strategies
Basic Retrieval
- Similarity Search: Find most similar vectors
- MMR (Maximal Marginal Relevance): Balance relevance and diversity
- Threshold-based: Only return results above similarity threshold
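With a LangChain vector store (such as the Chroma instance built in the implementation example below), these three modes map onto `as_retriever` arguments; the parameter values here are illustrative, not recommendations.

```python
# Basic retrieval modes as LangChain retriever configs (values illustrative).
# Assumes `vectorstore` is an existing vector store, e.g. Chroma.

# Plain similarity search: the top-k nearest vectors.
sim = vectorstore.as_retriever(search_kwargs={"k": 4})

# MMR: fetch a larger candidate pool, then trade relevance against diversity.
mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5},
)

# Threshold-based: only return hits scoring above the cutoff.
thresh = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.8},
)
```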
Advanced Retrieval
- Hybrid Search: Combine vector and keyword search
- Re-ranking: Use cross-encoder to re-rank initial results
- Query Expansion: Generate multiple query variants
- Hypothetical Document Embeddings (HyDE): Generate hypothetical answer, then search
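Re-ranking is the easiest of these to show in isolation: the sketch below rescores an initial candidate set with a cross-encoder from sentence-transformers and reorders it (the query, candidates, and model choice are illustrative).

```python
# Cross-encoder re-ranking of first-stage results; data is illustrative.
from sentence_transformers import CrossEncoder

query = "How does chunk overlap help retrieval?"
candidates = [  # e.g. the top-k documents from a vector search
    "Overlap repeats text at chunk boundaries so context is not lost.",
    "Pinecone is a managed vector database.",
    "MMR balances relevance and diversity.",
]

# A cross-encoder scores each (query, document) pair jointly, which is
# slower but more accurate than comparing precomputed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Reorder candidates by descending relevance score.
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```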
Implementation Example
```python
# Basic RAG with LangChain
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load and split documents
loader = TextLoader("knowledge_base.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(
        search_kwargs={"k": 4}
    )
)

# Query
response = qa_chain.run("What are the key features?")
print(response)
```
Best Practices
- Chunk Size Optimization: Test different sizes for your use case
- Overlap: Include overlap to preserve context at boundaries
- Metadata: Store rich metadata for filtering
- Evaluation: Measure retrieval quality with metrics like recall@k (a sketch follows this list)
- Hybrid Approaches: Combine multiple retrieval strategies
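Recall@k from the evaluation point above is simple to compute once each query has a labeled set of relevant documents; a minimal sketch with made-up IDs:

```python
# recall@k: fraction of relevant documents that appear in the top-k results.
def recall_at_k(retrieved_ids, relevant_ids, k):
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Made-up example: 2 of the 3 relevant docs show up in the top 4.
print(recall_at_k(["d3", "d7", "d1", "d9"], ["d1", "d3", "d5"], k=4))  # ~0.667
```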