LLMs don't know YOUR data. They can't access your company docs, product specs, or that critical decision from last Tuesday. That's not a bug, it's a feature, and it's exactly why RAG exists.
The Knowledge Gap
Ask ChatGPT about your API rate limits? Creative fiction. Your pricing tiers? More fiction. Even the best prompting and agents can't help if the knowledge isn't there.
Traditional search? Returns 500 documents with the word "refund". Your actual policy lives under "Service Level Agreements", so there's no match.
Fine-tuning? Teaches behavior, not facts. Your data changes daily, and retraining is expensive. Save your money.
So how do we bridge this knowledge gap? Enter RAG, the pattern that actually works.
1. RAG Fundamentals
RAG (Retrieval-Augmented Generation) isn't just "search + LLM". It's a coordinated process that combines smart retrieval with intelligent generation. Here's how it works:
RAG Pipeline in Action
Here's what happens when you ask a RAG system a question:
User Question ("What's our policy?")
    → Embed Question (vector)
    → Vector Database Search
    → Retrieve Docs (top matches)
    → LLM Generate with Context
    → Final Answer ("Based on our policy...")
The key components are:
- Embedding strategy: convert text to vectors that capture semantic meaning
- Vector database: store and efficiently search millions of embeddings
- Similarity search: find the most relevant documents using cosine similarity
Here's a simple illustration in code:
import chromadb
from litellm import completion, embedding

class SimpleRAG:
    def __init__(self):
        # Assumes ChromaDB as the vector store (matches the collection.query call below)
        self.collection = chromadb.Client().create_collection(name="docs")

    def add_documents(self, documents: list[str]) -> None:
        # Embed each document and store it alongside its text
        vectors = [
            embedding(model="text-embedding-3-small", input=doc).data[0]["embedding"]
            for doc in documents
        ]
        ids = [str(i) for i in range(len(documents))]
        self.collection.add(documents=documents, embeddings=vectors, ids=ids)

    def query(self, question: str) -> str:
        # Step 1: Embed the question
        query_embedding = embedding(
            model="text-embedding-3-small", input=question
        ).data[0]["embedding"]
        # Step 2: Retrieve relevant documents
        results = self.collection.query(query_embeddings=[query_embedding], n_results=3)
        # Step 3: Build context from retrieved documents
        context = "\n\n".join(results["documents"][0])
        # Step 4: Generate an answer using the LLM with that context
        response = completion(
            model="openrouter/openai/gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Answer based on context."},
                {"role": "user", "content": f"Context: {context}\nQ: {question}"},
            ],
        )
        return response.choices[0].message.content

# Usage
rag = SimpleRAG()
rag.add_documents(["Enterprise refund policy: 90 days with approval"])
answer = rag.query("What's the refund window for enterprise?")
Run this example yourself
Script:
1_simple_rag.py
Command:
uv run 1_simple_rag.py
Expected Output:
=== Simple RAG System Demo with DuckDB ===
Loading embedding model... ✓
RAG system initialized with DuckDB vector search ✓
Generating embeddings for 6 documents... ✓
Added 6 documents to the vector store ✓
=== Testing RAG Queries ===
Q: What's the refund window for enterprise clients?
Generating embedding for question: 'What's the refund window for enterprise clients?'
Retrieved 3 relevant documents:
1. Similarity: 0.673 - Enterprise customers get priority handling for all refund requests...
2. Similarity: 0.667 - Our refund policy for enterprise customers: 90-day refund window...
3. Similarity: 0.561 - All refunds must be requested through the customer portal...
A: The refund window for enterprise clients is 90 days with manager approval required.
--------------------------------------------------
Q: How quickly do enterprise customers get support?
Generating embedding for question: 'How quickly do enterprise customers get support?'
Retrieved 3 relevant documents:
1. Similarity: 0.586 - Support SLA for enterprise: 1-hour response time, 4-hour resolution...
2. Similarity: 0.577 - Enterprise customers get priority handling for all requests...
3. Similarity: 0.539 - Standard support offers 24-hour response time...
A: Enterprise customers get 1-hour response time and 4-hour resolution for critical issues, much faster than the standard 24-hour response time.
--------------------------------------------------
Q: What are the benefits of premium tier?
Generating embedding for question: 'What are the benefits of premium tier?'
Retrieved 3 relevant documents:
1. Similarity: 0.765 - Premium tier includes 24/7 phone support, dedicated account manager...
2. Similarity: 0.271 - Enterprise customers get priority handling...
3. Similarity: 0.162 - Support SLA for enterprise: 1-hour response time...
A: Premium tier includes 24/7 phone support, dedicated account manager, and custom integrations for comprehensive enterprise support.
=== Key Features Demonstrated ===
✓ Local embeddings with sentence-transformers (all-MiniLM-L6-v2)
✓ DuckDB vector similarity search with cosine similarity
✓ Accurate semantic matching (0.7+ similarity scores)
✓ Graceful fallback when LLM rate limits hit
✓ Clean document retrieval with relevance ranking
Retrieval-Generation Balance
Too much context? The LLM gets confused. Too little? It starts making things up. Here's how to get it right:
from typing import List

def smart_retrieval(question: str, documents: List[str]) -> str:
    # retrieve_documents, rerank_by_relevance, and generate_answer are
    # placeholders for your retriever, re-ranker, and LLM call
    # Retrieve more than you need
    initial_results = retrieve_documents(question, n=10)
    # Re-rank by relevance
    reranked = rerank_by_relevance(initial_results, question)
    # Keep only highly relevant documents (threshold filtering)
    relevant_docs = [doc for doc in reranked if doc.score > 0.7]
    return generate_answer(question, relevant_docs[:3])  # Top 3
The key is balance. Cast a wide net, filter smartly, then serve only the best knowledge to your LLM.
But here's the thing: none of this works without turning your text into numbers first.
2. Embeddings: Text Becomes Math
Think of embeddings as GPS coordinates for meaning. Just like latitude and longitude tell you where something is physically, embedding vectors tell you where text sits in "meaning space". This lets computers understand that "car" and "automobile" mean the same thing.
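Here's a minimal sketch of that idea using sentence-transformers with the all-MiniLM-L6-v2 model from the table below (the example words are just an illustration):
from sentence_transformers import SentenceTransformer, util

# Embed a few words and compare them with cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["car", "automobile", "banana"])  # each row is a 384-dim vector

# Closer to 1.0 means closer together in "meaning space"
print(util.cos_sim(vectors[0], vectors[1]).item())  # car vs automobile -> high
print(util.cos_sim(vectors[0], vectors[2]).item())  # car vs banana -> low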
Choosing the Right Model
Not all embedding models are created equal. With dozens available, picking the wrong one can waste weeks. Here are the ones that actually matter (in my experience):
| Model | Dimensions | Best For | Speed | Quality |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | General purpose, fast prototypes | ⚡⚡⚡ | ★★★ |
| text-embedding-3-small | 1536 | Production apps, balanced cost/quality | ⚡⚡ | ★★★★ |
| text-embedding-3-large | 3072 | High-accuracy search, enterprise | ⚡ | ★★★★★ |

Rule of thumb: start with all-MiniLM-L6-v2 for prototypes, upgrade to text-embedding-3-small for production.
Once you've got your embeddings, you need somewhere to store them, and that's where vector databases come in.
Found a better embedding model for your use case? Drop your recommendations in the comments below; the community learns from real, battle-tested experience!
3. Vector Database
Your vector database choice depends on where you want to run it and how much control you need.
For local development and prototypes where you want full control, your best bets are ChromaDB (free, dead simple setup, handles 1M+ vectors locally) or DuckDB + VSS (free, SQL-native, great for analytics teams who love SQL).
When you're ready for production scale with zero ops overhead, consider managed cloud services like Pinecone or Snowflake.
For the demos in this article, let's get quacking with DuckDB: its vector similarity search (VSS) extension offers a compelling middle ground:
import duckdb
from sentence_transformers import SentenceTransformer

# Initialize DuckDB with the VSS extension
conn = duckdb.connect(":memory:")
conn.execute("INSTALL vss")
conn.execute("LOAD vss")

# Create a table with a fixed-size embedding column
conn.execute("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        content TEXT,
        embedding FLOAT[384]
    )
""")

# Add documents with embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Python is great for data science", "SQL handles structured data well"]
for i, doc in enumerate(docs):
    embedding = model.encode([doc])[0].tolist()
    conn.execute("""
        INSERT INTO documents (id, content, embedding)
        VALUES (?, ?, ?)
    """, [i, doc, embedding])

# Semantic search using SQL (cast the query parameter to a fixed-size array)
query_embedding = model.encode(["data analysis"])[0].tolist()
results = conn.execute("""
    SELECT content, array_cosine_similarity(embedding, ?::FLOAT[384]) AS similarity
    FROM documents
    ORDER BY similarity DESC
    LIMIT 3
""", [query_embedding]).fetchall()
Perfect for teams that live in SQL and want vector search without leaving their comfort zone.
Feel free to circle back to the RAG pipeline code example above and swap in a different vector database if you'd like to experiment.
Now, even with perfect embeddings and a blazing-fast vector database, there's one thing that kills most RAG systems before they even start.
4. Chunking: Where 90% of RAG Systems Fail
Three approaches, pick wisely:
- Fixed-size: split every 1000 characters. Simple, but it breaks sentences
- Content-aware: split by paragraphs/sections. Preserves meaning
- Semantic: group by meaning. Smart, but slow
Pro tip: match chunking to the content type (see the sketch after this list):
- Code: respect function boundaries (1500 chars)
- Legal: keep sections intact (2000 chars)
- General: recursive splitting with 200-char overlap
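As a hedged sketch of the general-purpose case, here's what recursive splitting looks like with LangChain's RecursiveCharacterTextSplitter (assuming the langchain-text-splitters package is installed; the sizes and the long_document_text variable are illustrative, not tuned values):
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # target chunk size in characters
    chunk_overlap=200,     # overlap preserves context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # try paragraphs first, then sentences, then words
)
chunks = splitter.split_text(long_document_text)  # long_document_text: your raw document string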
Run this example yourself
Script:
4_smart_chunking.py
Command:
uv run 4_smart_chunking.py
Expected Output:
=== LangChain Text Splitter Comparison ===
SmartChunker initialized with LangChain text splitters
✓ Embedding model available for semantic chunking
Original text: 844 characters
CharacterTextSplitter (fixed_size):
------------------------------------------------------------
Chunking general document using fixed_size strategy
Number of chunks: 2
Average chunk size: 414 chars
Chunk boundaries:
1. 'Natural Language Processing (NLP) is a s...' to '...ention mechanisms to understand context.'
2. 'Applications of NLP are widespread. They...' to '...ted language understanding capabilities.'
RecursiveCharacterTextSplitter (content_aware):
------------------------------------------------------------
Chunking general document using content_aware strategy
Number of chunks: 1
Average chunk size: 834 chars
Chunk boundaries:
1. 'Natural Language Processing (NLP) is a s...' to '...ted language understanding capabilities.'
TokenTextSplitter (token_based):
------------------------------------------------------------
Chunking general document using token_based strategy
Number of chunks: 1
Average chunk size: 844 chars
Chunk boundaries:
1. 'Natural Language Processing (NLP) i...' to '...anguage understanding capabilities.'
SemanticChunker (semantic):
------------------------------------------------------------
Chunking general document using semantic strategy
Number of chunks: 13
Average chunk size: 60 chars
Chunk boundaries:
1. 'Natural Language Processing (NLP) is a s...' to '...is a subfield of artificial intelligence'
2. 'It focuses on enabling computers to unde...' to '...to understand and process human language'
3. 'Key NLP techniques include tokenization,...' to '...ch tagging, and named entity recognition'
4. 'These form the foundation for more compl...' to '...rm the foundation for more complex tasks'
5. 'Modern NLP relies heavily on transformer...' to '...NLP relies heavily on transformer models'
6. 'Models like BERT and GPT have revolution...' to '...RT and GPT have revolutionized the field'
7. 'They use attention mechanisms to underst...' to '...tention mechanisms to understand context'
8. 'Applications of NLP are widespread...' to '...Applications of NLP are widespread'
9. 'They include machine translation, sentim...' to '...lation, sentiment analysis, and chatbots'
10. 'These tools help bridge the gap between...' to '...communication and computer understanding'
11. 'The future of NLP looks promising...' to '...The future of NLP looks promising'
12. 'Advances in deep learning continue to im...' to '...learning continue to improve performance'
13. 'We can expect even more sophisticated la...' to '...ated language understanding capabilities'
=== Document Type-Specific Chunking ===
SmartChunker initialized with LangChain text splitters
✓ Embedding model available for semantic chunking
CODE Document:
==================================================
Document type: code
Total chunks: 1
Chunk sizes: [1106] characters
Average size: 1106 chars
First chunk preview: def fibonacci(n): """Calculate nth Fibonacci number using recursion""" if n <= 0: return 0 elif n == 1: return 1 else:...
MARKDOWN Document:
==================================================
Document type: markdown
Total chunks: 8
Chunk sizes: [81, 74, 176, 56, 75, 83, 58, 168] characters
Average size: 96 chars
First chunk preview: The first step in any ML pipeline is collecting high-quality data. This involves:...
LEGAL Document:
==================================================
Document type: legal
Total chunks: 1
Chunk sizes: [1474] characters
Average size: 1474 chars
First chunk preview: ## Terms of Service Agreement ### Section 1: Acceptance of Terms By accessing and using this service, you accept and agree to be bound by the terms a...
You've chunked your documents perfectly. Now let's make sure you're actually finding the right ones.
5. Retrieval Optimization
Two techniques that actually matter:
Hybrid Search: Best of Both Worlds
Combine keyword (BM25) and semantic search. Let keywords find specifics ("ORDER-12345"), semantics find concepts ("refund process").
def hybrid_search(query: str, alpha: float = 0.5):  # alpha: 0 = keyword only, 1 = semantic only
    # has_specific_terms, is_conceptual, bm25_search, vector_search, and
    # combine_scores are placeholders for your own implementations
    # Auto-adjust based on query type
    if has_specific_terms(query):  # IDs, error codes
        alpha = 0.3                # Favor keyword matching
    elif is_conceptual(query):     # "explain", "how"
        alpha = 0.8                # Favor semantic matching
    keyword_scores = bm25_search(query)
    semantic_scores = vector_search(query)
    return combine_scores(keyword_scores, semantic_scores, alpha)
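The combine_scores step is where the blending happens. Here's a minimal sketch of one reasonable implementation, min-max normalization plus a weighted sum (the dict-of-scores shape is my assumption, not the demo script's exact code):
def combine_scores(keyword_scores: dict, semantic_scores: dict, alpha: float) -> dict:
    # Min-max normalize each score set to [0, 1] so BM25 and cosine scores are comparable
    def normalize(scores: dict) -> dict:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc_id: (score - lo) / span for doc_id, score in scores.items()}

    kw, sem = normalize(keyword_scores), normalize(semantic_scores)
    # alpha = 0 -> pure keyword, alpha = 1 -> pure semantic
    return {
        doc_id: (1 - alpha) * kw.get(doc_id, 0.0) + alpha * sem.get(doc_id, 0.0)
        for doc_id in set(kw) | set(sem)
    }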
Run this example yourself
Script:
5_hybrid_search.py
Command:
uv run 5_hybrid_search.py
Expected Output:
======================================================================
HYBRID SEARCH DEMONSTRATION
======================================================================
Documents in collection:
1. USER-12345 encountered authentication error at 10:30 AM
2. The authentication system uses OAuth 2.0 for secure verification
3. Error code AUTH-500 indicates server-side authentication failure
4. Understanding how authentication works is crucial for security
5. Database query optimization improves application performance
6. ORDER-67890 was processed successfully at 11:45 AM
7. Explain the relationship between caching and database performance
8. API-KEY-789 expired and needs renewal
======================================================================
SEARCH RESULTS
======================================================================
Query: 'USER-12345 error'
--------------------------------------------------
Detected specific terms, using alpha=0.3 (keyword-focused)
Top 3 Results:
1. Score: 1.000
   USER-12345 encountered authentication error at 10:30 AM
2. Score: 0.398
   Error code AUTH-500 indicates server-side authentication failure
3. Score: 0.111
   ORDER-67890 was processed successfully at 11:45 AM
Query: 'explain authentication security'
--------------------------------------------------
Detected conceptual query, using alpha=0.8 (semantic-focused)
Top 3 Results:
1. Score: 1.000
   Understanding how authentication works is crucial for security
2. Score: 0.479
   The authentication system uses OAuth 2.0 for secure verification
3. Score: 0.385
   Explain the relationship between caching and database performance
Query: 'authentication OAuth'
--------------------------------------------------
Using balanced search, alpha=0.5
Top 3 Results:
1. Score: 1.000
   The authentication system uses OAuth 2.0 for secure verification
2. Score: 0.399
   Understanding how authentication works is crucial for security
3. Score: 0.295
   Error code AUTH-500 indicates server-side authentication failure
Query: 'ORDER-67890'
--------------------------------------------------
Detected specific terms, using alpha=0.3 (keyword-focused)
Top 3 Results:
1. Score: 1.000
   ORDER-67890 was processed successfully at 11:45 AM
2. Score: 0.131
   API-KEY-789 expired and needs renewal
3. Score: 0.088
   USER-12345 encountered authentication error at 10:30 AM
Query: 'how does caching work'
--------------------------------------------------
Detected conceptual query, using alpha=0.8 (semantic-focused)
Top 3 Results:
1. Score: 1.000
   Explain the relationship between caching and database performance
2. Score: 0.496
   Understanding how authentication works is crucial for security
3. Score: 0.319
   Database query optimization improves application performance
======================================================================
MANUAL ALPHA CONTROL COMPARISON
======================================================================
Query: 'authentication system security'
Keyword-only (alpha=0.0):
------------------------------
1. [1.000] Understanding how authentication works is crucial for securi...
2. [0.944] The authentication system uses OAuth 2.0 for secure verifica...
Balanced (alpha=0.5):
------------------------------
1. [1.000] Understanding how authentication works is crucial for securi...
2. [0.843] The authentication system uses OAuth 2.0 for secure verifica...
Semantic-only (alpha=1.0):
------------------------------
1. [1.000] Understanding how authentication works is crucial for securi...
2. [0.742] The authentication system uses OAuth 2.0 for secure verifica...
Re-ranking & Metadata
Re-ranking: the first retrieval pass grabs 10 documents, then a cross-encoder picks the best 3. More accurate than a single pass (see the sketch below).
Metadata filtering: don't search everything. Filter by date, type, or complexity BEFORE the similarity search.
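Here's a minimal re-ranking sketch with a cross-encoder from sentence-transformers; the model name and the candidates list are illustrative assumptions, not the demo script's exact setup:
from sentence_transformers import CrossEncoder

# Cross-encoders score each (question, document) pair jointly, which is more
# accurate than comparing pre-computed embeddings but slower, so apply it only
# to a short candidate list (e.g. the top 10 from the first retrieval pass)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidates: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(question, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]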
That covers traditional RAG. But what if your data isn't just documents? What if it's a web of connections?
6. Graph RAG
Traditional RAG treats documents as isolated chunks. Graph RAG understands relationships.
Example: "Who approved the budget increase that led to Q3 hiring?"
- Traditional RAG: Searches "budget", "increase", "Q3", "hiring" separately
- Graph RAG: Follows connections: budget → approval → personnel changes
Graph RAG works best for rich relationships (org charts, research papers), multi-hop questions ("Who worked on X that influenced Y?"), and causality tracing, but skip it for simple Q&A, real-time needs (graphs add latency), or teams lacking graph database experience.
Graph RAG essentially steps through four stages:
1. Entity extraction: pull people, companies, and events from documents
2. Relationship mapping: connect how Sarah → hired Mike → built SmartAnalytics
3. Graph storage: store entities as nodes and relationships as edges
4. Graph traversal/query: follow connections to answer multi-hop questions like "Who's responsible for the revenue impact of the Series B funding?"
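To make the traversal step concrete, here's a toy sketch using networkx (not LightRAG's internals); the hard-coded entities mirror the Sarah → Mike → SmartAnalytics example above, whereas a real system extracts them with an LLM:
import networkx as nx

# Entities become nodes, relationships become labeled edges
graph = nx.DiGraph()
graph.add_edge("Sarah", "Mike", relation="hired")
graph.add_edge("Mike", "SmartAnalytics", relation="built")

# Multi-hop question: how is Sarah connected to SmartAnalytics?
for path in nx.all_simple_paths(graph, source="Sarah", target="SmartAnalytics"):
    print(" -> ".join(path))  # Sarah -> Mike -> SmartAnalytics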
Building graph databases used to be complex, but frameworks like LightRAG now handle the heavy lifting automatically.
Run this example yourself
Script:
6_graph_rag.py
Command:
uv run 6_graph_rag.py
Expected Output:
LIGHTRAG GRAPH RAG DEMONSTRATION
Using DeepSeek API for LLM and Local Embeddings
INFO:nano-vectordb:Init {'embedding_dim': 384, 'metric': 'cosine', 'storage_file': './lightrag_demo/vdb_entities.json'} 0 data
INFO:nano-vectordb:Init {'embedding_dim': 384, 'metric': 'cosine', 'storage_file': './lightrag_demo/vdb_relationships.json'} 0 data
INFO:nano-vectordb:Init {'embedding_dim': 384, 'metric': 'cosine', 'storage_file': './lightrag_demo/vdb_chunks.json'} 0 data
Rerank is enabled but no rerank_model_func provided. Reranking will be skipped.
✓ LightRAG initialized at ./lightrag_demo
Building knowledge graph from 5 documents...
✓ Knowledge graph built
Testing Graph RAG Query Modes:
Query (local): 'Who developed the core algorithms for SmartAnalytics?'
Answer: Lisa Wang developed the core machine learning algorithms for SmartAnalytics. She serves as the Lead ML Engineer on the SmartAnalytics team and her algorithmic innovations have been instrumental in the platform's success, contributing significantly to TechCorp's revenue growth and reducing data processing time for major clients.
**References:**
[KG] unknown_source
[DC] unknown_source
Query (global): 'How did the Series B funding impact TechCorp's growth strategy?'
Answer: Based on the provided knowledge base, the Series B funding had a significant impact on TechCorp's growth strategy across multiple dimensions:
## Strategic Expansion and Development
The $50 million Series B funding, secured in Q3 2023 and led by VentureCapital Partners, was critical for TechCorp's growth strategy. This funding enabled the company to pursue strategic expansion initiatives that were central to their development plans.
## Engineering Capability Enhancement
A major impact was the significant expansion of TechCorp's engineering capabilities. The funding allowed the company to grow its engineering team from 15 to 45 engineers under VP of Engineering Mike Johnson's leadership. This enhanced technical capacity directly supported the company's product development goals.
## Market Expansion and Product Development
The Series B funding is being strategically invested to expand SmartAnalytics into European markets. Additionally, the funding supports the development of SmartAnalytics 2.0 with advanced predictive analytics features, led by Lisa Wang's team. This product evolution is projected to double the company's revenue by 2025.
## Supporting Revenue Growth Trajectory
The funding builds upon TechCorp's impressive revenue growth, which increased from $5 million in 2021 to $25 million in 2023. By enabling both market expansion and product innovation, the Series B funding positions the company to continue this strong growth trajectory.
**References:**
[KG] unknown_source (Growth Strategy-Series B Funding relationship)
[KG] unknown_source (Engineering Capabilities expansion)
[DC] unknown_source (European market expansion and SmartAnalytics 2.0 development)
[DC] unknown_source (Engineering team growth details)
[KG] unknown_source (Revenue growth context)
Query (hybrid): 'What's the connection between Sarah Chen and the company's revenue growth?'
Answer: Based on the provided knowledge base, Sarah Chen's connection to TechCorp's revenue growth is primarily through her leadership role and strategic initiatives that have enabled the company's expansion and financial success.
Sarah Chen serves as CEO of TechCorp and has been instrumental in securing critical funding for the company's growth. She successfully secured $50 million in Series B funding led by VentureCapital Partners in Q3 2023, which was described as "critical for the company's growth strategy." This funding has enabled significant expansion of engineering capabilities and is being strategically invested to expand SmartAnalytics into European markets.
While the knowledge base doesn't directly attribute specific revenue growth numbers to Sarah Chen's actions, it shows she created the conditions for growth by recruiting key talent (including Mike Johnson from CloudNet in 2021) and securing the substantial funding that supports the company's expansion plans. The revenue growth itself (from $5 million in 2021 to $25 million in 2023) is attributed to algorithmic innovations in SmartAnalytics developed by Lisa Wang's team.
**References:**
[KG] unknown_source (Sarah Chen entity description and relationships)
[KG] unknown_source (Series B Funding entity description)
[DC] unknown_source (Document chunk 1 - Sarah Chen's leadership and funding)
[DC] unknown_source (Document chunk 3 - Strategic investment of funding)
[KG] unknown_source (Revenue Growth entity description)
Comparing Query Modes:
Question: 'How did Sarah Chen's leadership decisions impact TechCorp's success?'
Local: Based on the provided knowledge base, Sarah Chen's leadership decisions had significant positive impacts on TechCorp's success through strategic hiring, funding acquisition, and product development.
## Strategic Recruitment and Team Building
Sarah Chen recruited Mike Johnson from CloudNet in 2021 ...
Global: Based on the provided knowledge base, Sarah Chen's leadership decisions significantly impacted TechCorp's success through strategic hiring, funding acquisition, and overall company direction.
**Strategic Hiring and Team Building**
Sarah Chen recruited Mike Johnson from CloudNet in 2021 to lead Tech...
Hybrid: Based on the provided knowledge base, Sarah Chen's leadership decisions had significant positive impacts on TechCorp's success through strategic hiring, funding acquisition, and product development.
**Strategic Recruitment and Team Building**
Sarah Chen recruited Mike Johnson from CloudNet in 2021 ...
Key Takeaways
- RAG > fine-tuning for facts: fine-tuning teaches behavior, RAG injects knowledge. Use RAG for dynamic, factual information
- Hybrid search wins: combine keyword (BM25) and semantic search. Let keywords find specifics, semantics find concepts
- Graph RAG for relationships: when your questions involve "who", "how", and "why" across multiple documents, graphs beat flat vectors
What's Next?
Great, your agents can think and access knowledge. But how do they actually DO things? Time to talk about action tools...
Technical deep dive series, Part 3 of 5
← Part 2: Agents & Reasoning | Part 4: Action Tools →
Related Articles in This Series
Context Engineering Deep Dive Series:
- User Intent & Prompting: The Art of Making LLMs Understand What You Really Want
- Agents & Reasoning: When LLMs Learn to Think Before They Speak
- RAG Systems (You are here)
- Action Tools: How LLMs Finally Learned to Stop Talking and Start Doing
- Memory Systems: Teaching LLMs to Remember (Without Going Broke)
Start with the overview: Context Engineering: How RAG, agents, and memory make LLMs actually useful