RAG Pipeline with LangChain and ChromaDB: A Practical Tutorial
You asked your LLM a question about your company's internal docs, and it confidently made something up. Classic. The model has no idea what's in your files -- it only knows what it was trained on, and that training data stopped months (or years) ago. That gap between "what the model knows" and "what you need it to know" is exactly what RAG solves.
Retrieval-Augmented Generation (RAG) is the pattern where you fetch relevant documents first, then hand them to the LLM as context alongside the user's question. The model doesn't need to memorize your data -- it just needs to read it at query time. It's the difference between asking someone a question from memory versus handing them the reference manual and saying "answer from this."
This tutorial builds a complete RAG pipeline from scratch using LangChain for orchestration, ChromaDB as the vector store, and OpenAI (or any compatible LLM) for generation. By the end, you'll have a working system that ingests documents, chunks them, embeds them, stores them, retrieves relevant pieces, and generates grounded answers. All in under 100 lines of Python.
🛠️ What You'll Need
- Python 3.10+ -- LangChain's latest packages require it
- An OpenAI API key -- for embeddings and the chat model (you can swap for Ollama or any local LLM later)
- Basic Python knowledge -- functions, pip, virtual environments
- Some documents to query -- PDFs, text files, or markdown files
- 10 minutes -- this isn't a weekend project, it's a lunch break project
🧠 How RAG Actually Works
Before writing code, you need to understand the two-phase architecture. Every RAG system has an indexing phase (done once) and a query phase (done every time someone asks a question).
INDEXING PHASE (one-time)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Your Docs │────►│ Chunk │────►│ Embed │────►│ ChromaDB │
│ (PDF, TXT, │ │ (split │ │ (convert │ │ (vector │
│ MD, HTML) │ │ text) │ │ to vecs) │ │ store) │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
QUERY PHASE (every question)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ User Query │────►│ Embed │────►│ Retrieve │────►│ LLM │
│ │ │ the query │ │ top-k │ │ generates │
│ │ │ │ │ chunks │ │ answer │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
The key insight: you're converting text into numbers (vectors) so you can find "similar" content mathematically. When someone asks "What's our refund policy?", the embedding model turns that into a vector, ChromaDB finds the document chunks with the closest vectors, and the LLM reads those chunks to generate an answer.
Why Not Just Stuff Everything Into the Prompt?
You could dump all your documents into the LLM's context window. Modern models like GPT-4o support 128K tokens. But here's why that's a bad idea:
| Approach | Pros | Cons |
|---|---|---|
| Full context stuffing | Simple, no vector DB needed | ❌ Expensive (you pay per token), ❌ Slow, ❌ Breaks at scale, ❌ Model loses focus in long contexts |
| RAG with retrieval | ✅ Cheap (only relevant chunks sent), ✅ Fast, ✅ Scales to millions of docs | Requires setup, chunking strategy matters |
| Fine-tuning | Good for style/format changes | ❌ Expensive, ❌ Doesn't help with factual recall, ❌ Training data goes stale |
RAG wins for factual question-answering over your own data. It's not even close.
📦 Project Setup
Let's get the dependencies installed and the project structure created.
Install the Packages
pip install langchain langchain-openai langchain-chroma \
langchain-text-splitters langchain-community \
python-dotenv unstructured
Here's what each package does:
| Package | Purpose |
|---|---|
| langchain | Core framework -- chains, prompts, orchestration |
| langchain-openai | OpenAI LLM and embedding integrations |
| langchain-chroma | ChromaDB vector store integration |
| langchain-text-splitters | Document chunking utilities |
| langchain-community | Community document loaders (PDF, HTML, etc.) |
| python-dotenv | Load API keys from .env files |
| unstructured | Parse PDFs and other document formats |
Project Structure
rag-project/
├── .env # Your API keys
├── rag_pipeline.py # The main RAG script
├── documents/ # Your source documents go here
│ ├── report.pdf
│ ├── notes.md
│ └── policy.txt
└── chroma_store/ # ChromaDB persists data here (auto-created)
Set Up Your API Key
Create a .env file in the project root:
OPENAI_API_KEY=sk-proj-your-actual-key-here
# Load it at the top of your script
from dotenv import load_dotenv
load_dotenv()
That's it. LangChain's OpenAI integrations automatically read OPENAI_API_KEY from the environment.
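If you'd rather fail fast when the key is missing (instead of hitting a cryptic authentication error on the first API call), a small optional check after load_dotenv() does the trick. This is a sketch, not part of the main script:
import os
from dotenv import load_dotenv

load_dotenv()

# Fail fast if the key didn't load, rather than surfacing a confusing auth error later
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY not set -- check your .env file")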
📄 Step 1: Load and Chunk Your Documents
This is the indexing phase. You take raw documents and prepare them for embedding.
Loading Documents
LangChain has loaders for practically every document format. Here are the ones you'll use most:
from langchain_community.document_loaders import (
TextLoader,
DirectoryLoader,
PyPDFLoader,
UnstructuredMarkdownLoader,
)
# Load a single PDF (PyPDFLoader also requires: pip install pypdf)
pdf_docs = PyPDFLoader("documents/report.pdf").load()
# Load a single text file
txt_docs = TextLoader("documents/notes.txt").load()
# Load all files in a directory (auto-detects format)
all_docs = DirectoryLoader("documents/", glob="**/*.*").load()
print(f"Loaded {len(all_docs)} document pages")
Each loader returns a list of Document objects, each containing page_content (the text) and metadata (source file, page number, etc.). That metadata travels with the chunks all the way to the final answer -- useful for source attribution.
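To see that structure for yourself, print one of the loaded documents. The exact metadata keys depend on the loader (PyPDFLoader adds a page number, TextLoader just the source path), so treat the values below as illustrative:
# Peek at the first loaded Document
doc = all_docs[0]
print(doc.metadata)            # e.g. {'source': 'documents/report.pdf', 'page': 0}
print(doc.page_content[:300])  # first 300 characters of the text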
Chunking Strategy
Raw documents are too long to embed as-is. You need to split them into chunks that are small enough to be meaningful but large enough to preserve context. The RecursiveCharacterTextSplitter is the go-to choice for most use cases in 2026.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " ", ""],
length_function=len,
)
chunks = text_splitter.split_documents(all_docs)
print(f"Split {len(all_docs)} documents into {len(chunks)} chunks")
How this works: the splitter tries to split on double newlines first (paragraph breaks). If a chunk is still too large, it falls back to single newlines, then sentences, then words. The chunk_overlap=200 means each chunk shares 200 characters with its neighbor -- this prevents context from being lost at chunk boundaries.
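You can verify the overlap yourself by comparing the tail of one chunk with the head of the next (assuming both chunks came from the same source document). The repeated text won't be exactly 200 characters, since splits snap to the nearest separator:
# Compare the end of chunk 0 with the start of chunk 1 -- part of the text should repeat
print(chunks[0].page_content[-200:])
print("\n--- next chunk ---\n")
print(chunks[1].page_content[:200])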
Choosing Chunk Size
The February 2026 FloTorch benchmark study tested seven chunking strategies across thousands of academic papers. The surprising result: simple recursive splitting at 512-1000 tokens consistently outperformed fancy semantic chunking methods. (Keep in mind that chunk_size above counts characters, not tokens, because length_function=len.) Start simple.
| Document Type | chunk_size | chunk_overlap | Why |
|---|---|---|---|
| General articles, docs | 1000 | 200 | ✅ Good default balance |
| Technical documentation | 600 | 100 | Smaller chunks = more precise retrieval |
| Legal or dense prose | 1500 | 400 | Larger chunks preserve legal context |
| Code files | 800 | 100 | Split on function/class boundaries |
| Q&A / FAQ content | 500 | 50 | Each Q&A pair should be its own chunk |
Start with chunk_size=1000 and chunk_overlap=200, build the pipeline, and evaluate the results. Optimize chunk size only after you have a working system and actual queries to test against.
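If you want a feel for how chunk_size changes the corpus before spending anything on embeddings, split at a few candidate sizes and compare the counts. A rough sketch that reuses all_docs from the loading step (pure string work, no API calls):
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Compare chunk counts and average chunk length at a few candidate sizes
for size in (500, 1000, 1500):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 5)
    candidate_chunks = splitter.split_documents(all_docs)
    avg_len = sum(len(c.page_content) for c in candidate_chunks) / len(candidate_chunks)
    print(f"chunk_size={size}: {len(candidate_chunks)} chunks, avg {avg_len:.0f} chars")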
🗃️ Step 2: Embed and Store in ChromaDB
Now you convert those text chunks into vectors and store them in ChromaDB.
What Are Embeddings?
An embedding turns text into a list of numbers (a vector) that captures its meaning. Similar texts produce similar vectors. When you search for "refund policy", the embedding is close to chunks containing "return items within 30 days" -- even though those words don't overlap.
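You can see this numerically with a couple of throwaway sentences. A toy sketch (the example sentences are made up, and the three embedding calls cost a fraction of a cent):
import numpy as np
from langchain_openai import OpenAIEmbeddings

emb = OpenAIEmbeddings(model="text-embedding-3-small")

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means identical direction, near 0 means unrelated
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_query = emb.embed_query("What's our refund policy?")
v_related = emb.embed_query("Customers may return items within 30 days for a full refund.")
v_unrelated = emb.embed_query("The cafeteria serves tacos on Tuesdays.")

print(cosine_similarity(v_query, v_related))    # expect a noticeably higher score ...
print(cosine_similarity(v_query, v_unrelated))  # ... than this one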
Create the Vector Store
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Initialize the embedding model
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
# Build the vector store from your chunks
vector_store = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
collection_name="my_documents",
persist_directory="./chroma_store",
)
print(f"Stored {len(chunks)} chunks in ChromaDB")
That's three things happening in one call:
1. Each chunk's text is sent to OpenAI's embedding API and converted to a 1536-dimensional vector
2. The vectors (plus the original text and metadata) are stored in ChromaDB
3. ChromaDB persists everything to ./chroma_store/ on disk
Loading an Existing Store
You only build the vector store once (or when you add new documents). After that, load from disk:
# Load existing vector store (no re-embedding needed)
vector_store = Chroma(
collection_name="my_documents",
embedding_function=embedding_model,
persist_directory="./chroma_store",
)
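A quick way to confirm the store loaded correctly is to run a raw similarity search against it, before wiring up any chain. The query string here is just an example:
# Smoke test: query ChromaDB directly, no LLM involved
hits = vector_store.similarity_search("refund policy", k=2)
for doc in hits:
    print(doc.metadata.get("source", "unknown"), "->", doc.page_content[:80])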
Embedding Model Options
OpenAI's text-embedding-3-small is the practical default in 2026 -- fast, cheap ($0.02 per million tokens), and good enough for most use cases. But you have options:
| Embedding Model | Provider | Dimensions | Cost | Quality |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | $0.02/1M tokens | ✅ Great default |
| text-embedding-3-large | OpenAI | 3072 | $0.13/1M tokens | Better for large corpora |
| nomic-embed-text | Ollama (local) | 768 | Free | ✅ Best local option |
| mxbai-embed-large | Ollama (local) | 1024 | Free | Higher quality, slower |
To use a local embedding model with Ollama instead of OpenAI (the integration ships as a separate package: pip install langchain-ollama):
from langchain_ollama import OllamaEmbeddings
# Pull the model first: ollama pull nomic-embed-text
embedding_model = OllamaEmbeddings(model="nomic-embed-text")
Everything else in the pipeline stays the same. That's the beauty of LangChain's abstractions.
🔍 Step 3: Build the Retrieval Chain
This is where everything comes together. You take a user's question, retrieve relevant chunks, and generate an answer.
The Retriever
The retriever is ChromaDB's search interface, wrapped in a LangChain-compatible object:
retriever = vector_store.as_retriever(
search_type="similarity",
search_kwargs={"k": 5},
)
k=5 means "return the 5 most similar chunks." This is the most important tuning parameter -- too few and you miss relevant context, too many and you dilute the signal with noise.
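If you'd rather filter by relevance than by a fixed count, LangChain retrievers also support a score-threshold mode. A variant to experiment with -- the name threshold_retriever and the 0.3 cutoff are just a starting guess, since the right threshold depends on your embedding model and data:
# Alternative: only return chunks whose relevance score clears a threshold
threshold_retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.3},
)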
The Prompt Template
The prompt tells the LLM how to use the retrieved context:
from langchain_core.prompts import ChatPromptTemplate
rag_prompt = ChatPromptTemplate.from_template(
"You are a helpful assistant. Answer the question based ONLY on "
"the following context. If the context doesn't contain enough "
"information to answer fully, say what you can and note what's "
"missing.\n\n"
"Context:\n{context}\n\n"
"Question: {question}\n\n"
"Answer:"
)
The "ONLY" constraint is critical. Without it, the model will happily blend retrieved context with its training data, and you'll get plausible-sounding answers that are half-grounded and half-hallucinated. You want the model to stay in its lane.
The Full RAG Chain (LCEL)
LangChain Expression Language (LCEL) lets you compose the entire chain with pipe operators:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Format retrieved docs into a single string
def format_docs(docs):
return "\n\n---\n\n".join(
f"Source: {doc.metadata.get('source', 'unknown')}\n{doc.page_content}"
for doc in docs
)
# Build the RAG chain
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| rag_prompt
| llm
| StrOutputParser()
)
Here's what happens when you invoke this chain:
- The user's question flows into two paths simultaneously
- Left path: the question goes to the retriever, which embeds it and searches ChromaDB for the 5 closest chunks, then format_docs joins them into a single string
- Right path: the question passes through unchanged via RunnablePassthrough()
- Both feed into the prompt template as {context} and {question}
- The filled prompt goes to the LLM
- The LLM's response is parsed to a plain string
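If the pipe syntax feels opaque, here is the same flow written out step by step, reusing the retriever, rag_prompt, llm, format_docs, and StrOutputParser defined above:
# The explicit equivalent of rag_chain.invoke(question)
question = "What is our refund policy?"
docs = retriever.invoke(question)                   # embed the query, search ChromaDB
prompt_value = rag_prompt.invoke(
    {"context": format_docs(docs), "question": question}
)
message = llm.invoke(prompt_value)                  # ChatOpenAI returns an AIMessage
print(StrOutputParser().invoke(message))            # extract the plain string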
Query It
# Ask a question
response = rag_chain.invoke("What is our refund policy?")
print(response)
Based on the provided documents, your refund policy allows customers to
return items within 30 days of purchase for a full refund. Items must be
in original packaging and unused. Digital products are non-refundable
after download. Refunds are processed within 5-7 business days to the
original payment method.
Source: policy.txt
That answer came from your documents. The model read your policy file and synthesized an answer from it.
🏗️ The Complete Pipeline
Here's the full script -- copy it, replace the documents directory, and run it:
"""
rag_pipeline.py -- Complete RAG pipeline with LangChain and ChromaDB.
Usage:
python rag_pipeline.py --build # Index documents (run once)
python rag_pipeline.py # Interactive query mode
"""
import os
import sys
from dotenv import load_dotenv
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
load_dotenv()
# --- Configuration ---
DOCS_DIR = "./documents"
CHROMA_DIR = "./chroma_store"
COLLECTION_NAME = "my_documents"
EMBEDDING_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o-mini"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
RETRIEVAL_K = 5
def get_embedding_model():
return OpenAIEmbeddings(model=EMBEDDING_MODEL)
def build_index():
"""Load documents, chunk them, embed, and store in ChromaDB."""
print(f"Loading documents from {DOCS_DIR}/ ...")
loader = DirectoryLoader(
DOCS_DIR,
glob="**/*.*",
show_progress=True,
use_multithreading=True,
)
documents = loader.load()
print(f"Loaded {len(documents)} document pages.")
if not documents:
print("No documents found. Add files to the documents/ directory.")
return
splitter = RecursiveCharacterTextSplitter(
chunk_size=CHUNK_SIZE,
chunk_overlap=CHUNK_OVERLAP,
separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks.")
print("Embedding and storing in ChromaDB...")
Chroma.from_documents(
documents=chunks,
embedding=get_embedding_model(),
collection_name=COLLECTION_NAME,
persist_directory=CHROMA_DIR,
)
print(f"Done. Vector store saved to {CHROMA_DIR}/")
def load_store():
"""Load existing ChromaDB vector store from disk."""
return Chroma(
collection_name=COLLECTION_NAME,
embedding_function=get_embedding_model(),
persist_directory=CHROMA_DIR,
)
def format_docs(docs):
"""Format retrieved documents into a context string."""
return "\n\n---\n\n".join(
f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
for doc in docs
)
def build_rag_chain(vector_store):
"""Assemble the RAG chain: retriever -> prompt -> LLM -> output."""
retriever = vector_store.as_retriever(
search_type="similarity",
search_kwargs={"k": RETRIEVAL_K},
)
prompt = ChatPromptTemplate.from_template(
"You are a helpful assistant. Answer the question based ONLY on "
"the following context. If the context doesn't contain enough "
"information, say so.\n\n"
"Context:\n{context}\n\n"
"Question: {question}\n\n"
"Answer:"
)
llm = ChatOpenAI(model=CHAT_MODEL, temperature=0)
chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
return chain
def interactive_query(chain):
"""Run an interactive query loop."""
print(f"\nRAG pipeline ready. Model: {CHAT_MODEL}")
print("Type your questions (Ctrl+C to exit):\n")
while True:
try:
question = input("Q: ").strip()
if not question:
continue
print("\nSearching and generating...\n")
answer = chain.invoke(question)
print(f"A: {answer}\n")
        except (KeyboardInterrupt, EOFError):
print("\nGoodbye.")
break
if __name__ == "__main__":
if not os.path.exists(DOCS_DIR):
os.makedirs(DOCS_DIR)
print(f"Created {DOCS_DIR}/. Add your documents, then run:")
print(" python rag_pipeline.py --build")
sys.exit(0)
if "--build" in sys.argv:
build_index()
else:
store = load_store()
chain = build_rag_chain(store)
interactive_query(chain)
Run It
# 1. Add your documents
mkdir -p documents
cp ~/my-docs/*.pdf documents/
cp ~/my-docs/*.md documents/
# 2. Build the index
python rag_pipeline.py --build
# 3. Start querying
python rag_pipeline.py
RAG pipeline ready. Model: gpt-4o-mini
Type your questions (Ctrl+C to exit):
Q: What were the Q4 revenue numbers?
Searching and generating...
A: According to the Q4 financial report, total revenue was $4.2M,
representing a 23% increase over Q3. The primary growth drivers
were enterprise subscriptions (+31%) and professional services (+18%).
🔧 Troubleshooting
"No documents found" after running --build
- Make sure your files are in the documents/ directory (not nested too deep)
- Check file extensions -- the DirectoryLoader with glob="**/*.*" catches most formats, but binary files like .docx need the unstructured package with extra dependencies: pip install "unstructured[all-docs]"
- Verify permissions: ls -la documents/
Retrieval returns irrelevant chunks
- Reduce chunk size to 500-600 characters for more precise matching
- Increase k to 8-10 to cast a wider net, then let the LLM filter relevance
- Check your chunks manually by adding a debug print:
# Debug: see what gets retrieved, along with raw distance scores (lower = more similar)
results = vector_store.similarity_search_with_score("your question here", k=5)
for i, (doc, score) in enumerate(results, 1):
    print(f"\n--- Chunk {i} (distance: {score:.4f}) ---")
    print(doc.page_content[:200])
ChromaDB "collection already exists" error
This happens when you re-run --build without clearing the old data. Either delete the chroma_store/ directory first, or modify the build step to clear the existing collection:
# Delete existing collection before rebuilding
import chromadb
client = chromadb.PersistentClient(path=CHROMA_DIR)
try:
client.delete_collection(COLLECTION_NAME)
except Exception:
    pass  # Collection doesn't exist yet (the exact exception type varies by chromadb version)
OpenAI rate limit or quota errors
- Check your API key is valid: echo $OPENAI_API_KEY
- For large document sets, add a delay between embedding batches or reduce batch size
- Switch to a local embedding model (Ollama) to avoid API costs entirely
"Hallucinated" answers that aren't in the documents
- Strengthen the prompt: add "If you cannot answer from the provided context alone, respond with 'I don't have enough information to answer that'"
- Lower the LLM temperature to 0 (deterministic output)
- Check whether the relevant information is actually making it into the context -- log the retrieved chunks
⚡ Beyond the Basics
Once your pipeline is working, here are the upgrades that actually matter.
Add Metadata Filtering
ChromaDB supports filtering by metadata fields. If your documents have categories, dates, or departments:
retriever = vector_store.as_retriever(
search_type="similarity",
search_kwargs={
"k": 5,
"filter": {"source": "documents/policy.pdf"},
},
)
This restricts retrieval to chunks from a specific file. Useful when users ask about a particular document.
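ChromaDB's where-filter syntax also supports operators such as $eq, $gte, and $and. A hypothetical example, assuming your chunks carry a custom year metadata field (they won't unless you add one during indexing):
# Hypothetical: combine filters -- assumes a "year" field was added to chunk metadata
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 5,
        "filter": {
            "$and": [
                {"source": {"$eq": "documents/policy.pdf"}},
                {"year": {"$gte": 2024}},
            ]
        },
    },
)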
Hybrid Search with MMR
Maximal Marginal Relevance (MMR) balances similarity with diversity. Instead of returning 5 chunks that all say the same thing, MMR ensures variety:
retriever = vector_store.as_retriever(
search_type="mmr",
search_kwargs={
"k": 5,
"fetch_k": 20, # Fetch 20 candidates
"lambda_mult": 0.7, # 0 = max diversity, 1 = max similarity
},
)
Add Source Citations
Modify format_docs and the prompt to include source references:
def format_docs_with_sources(docs):
formatted = []
for i, doc in enumerate(docs, 1):
source = doc.metadata.get("source", "unknown")
formatted.append(f"[{i}] (Source: {source})\n{doc.page_content}")
return "\n\n".join(formatted)
# Update the prompt to instruct citation
rag_prompt = ChatPromptTemplate.from_template(
"Answer the question using ONLY the provided context. "
"Cite your sources using [1], [2], etc.\n\n"
"Context:\n{context}\n\n"
"Question: {question}\n\n"
"Answer (with citations):"
)
Conversation Memory
For a multi-turn chatbot, add conversation history to the chain:
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
prompt_with_history = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant. Use the context to answer."),
MessagesPlaceholder(variable_name="chat_history"),
("human", "Context:\n{context}\n\nQuestion: {question}"),
])
# Maintain history in your query loop
chat_history = []
def ask_with_history(chain, question, history):
response = chain.invoke({
"question": question,
"chat_history": history,
})
history.append(HumanMessage(content=question))
history.append(AIMessage(content=response))
return response
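The chain from Step 3 takes a plain string, so it won't accept this dictionary as-is. A minimal sketch of a variant that does, assuming the retriever, llm, format_docs, and prompt_with_history defined above:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

# Chain variant that accepts {"question": ..., "chat_history": [...]}
rag_chain_with_history = (
    {
        "context": itemgetter("question") | retriever | format_docs,
        "question": itemgetter("question"),
        "chat_history": itemgetter("chat_history"),
    }
    | prompt_with_history
    | llm
    | StrOutputParser()
)

# Usage with the helper above
answer = ask_with_history(rag_chain_with_history, "What is our refund policy?", chat_history)
print(answer)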
🚀 What's Next
- Add a web interface -- wrap the pipeline with FastAPI or Streamlit for a chat UI that your team can use without touching the terminal
- Go fully local -- replace OpenAI with Ollama for embeddings and generation, zero API costs and full data privacy (Local LLM + Ollama RAG Guide)
- Scale to production -- swap ChromaDB for a managed vector database like Pinecone or Weaviate when you outgrow single-machine storage
- Add document watching -- set up a file watcher or cron job to automatically re-index when new documents land in the directory
- Evaluate and iterate -- build a test set of 20-50 questions with expected answers, then use LangSmith or RAGAS to measure retrieval and generation quality systematically (a minimal hand-rolled starting point is sketched below)
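Before reaching for a full evaluation framework, even a crude substring check over a handful of hand-written questions will tell you whether retrieval is pulling the right material. A rough sketch reusing rag_chain from Step 3 -- the questions and expected phrases below are placeholders based on the example policy document:
# Crude evaluation loop -- swap in questions and expected phrases for your own documents
test_set = [
    {"question": "What is our refund policy?", "expect": "30 days"},
    {"question": "Are digital products refundable?", "expect": "non-refundable"},
]

passed = 0
for case in test_set:
    answer = rag_chain.invoke(case["question"])
    ok = case["expect"].lower() in answer.lower()
    passed += ok
    print(f"{'PASS' if ok else 'FAIL'}: {case['question']}")

print(f"{passed}/{len(test_set)} checks passed")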
The RAG pattern is one of the most practical things happening in AI engineering right now. It bridges the gap between "general-purpose LLM" and "actually useful tool for your specific data." Master it, and you'll find uses everywhere -- internal knowledge bases, customer support, code documentation search, compliance document Q&A, and more.
Want to see how AI is changing the developer role? Read The Rise of the AI Engineer for the bigger picture, or check out AI Coding Agents Compared to see how RAG fits into the broader AI-assisted development landscape.