RAG Pipeline with LangChain and ChromaDB: A Practical Tutorial
You asked your LLM a question about your company's internal docs, and it confidently made something up. Classic. The model has no idea what's in your files -- it only knows what it was trained on, and that training data stopped months (or years) ago. That gap between "what the model knows" and "what you need it to know" is exactly what RAG solves.
Retrieval-Augmented Generation (RAG) is the pattern where you fetch relevant documents first, then hand them to the LLM as context alongside the user's question. The model doesn't need to memorize your data -- it just needs to read it at query time. It's the difference between asking someone a question from memory versus handing them the reference manual and saying "answer from this."
This tutorial builds a complete RAG pipeline from scratch using LangChain for orchestration, ChromaDB as the vector store, and OpenAI (or any compatible LLM) for generation. By the end, you'll have a working system that ingests documents, chunks them, embeds them, stores them, retrieves relevant pieces, and generates grounded answers. All in under 100 lines of Python.
🛠️ What You'll Need
- Python 3.10+ -- LangChain's latest packages require it
- An OpenAI API key -- for embeddings and the chat model (you can swap for Ollama or any local LLM later)
- Basic Python knowledge -- functions, pip, virtual environments
- Some documents to query -- PDFs, text files, or markdown files
- 10 minutes -- this isn't a weekend project, it's a lunch break project
🧠 How RAG Actually Works
Before writing code, you need to understand the two-phase architecture. Every RAG system has an indexing phase (done once) and a query phase (done every time someone asks a question).
INDEXING PHASE (one-time)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Your Docs │────►│ Chunk │────►│ Embed │────►│ ChromaDB │
│ (PDF, TXT, │ │ (split │ │ (convert │ │ (vector │
│ MD, HTML) │ │ text) │ │ to vecs) │ │ store) │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
QUERY PHASE (every question)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ User Query │────►│ Embed │────►│ Retrieve │────►│ LLM │
│ │ │ the query │ │ top-k │ │ generates │
│ │ │ │ │ chunks │ │ answer │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
The key insight: you're converting text into numbers (vectors) so you can find "similar" content mathematically. When someone asks "What's our refund policy?", the embedding model turns that into a vector, ChromaDB finds the document chunks with the closest vectors, and the LLM reads those chunks to generate an answer.
Why Not Just Stuff Everything Into the Prompt?
You could dump all your documents into the LLM's context window. Modern models like GPT-4o support 128K tokens. But here's why that's a bad idea:
| Approach | Pros | Cons |
|---|---|---|
| Full context stuffing | Simple, no vector DB needed | ❌ Expensive (you pay per token), ❌ Slow, ❌ Breaks at scale, ❌ Model loses focus in long contexts |
| RAG with retrieval | ✅ Cheap (only relevant chunks sent), ✅ Fast, ✅ Scales to millions of docs | Requires setup, chunking strategy matters |
| Fine-tuning | Good for style/format changes | ❌ Expensive, ❌ Doesn't help with factual recall, ❌ Training data goes stale |
RAG wins for factual question-answering over your own data. It's not even close.
📦 Project Setup
Let's get the dependencies installed and the project structure created.
Install the Packages
pip install langchain langchain-openai langchain-chroma \
langchain-text-splitters langchain-community \
python-dotenv unstructured
Here's what each package does:
| Package | Purpose |
|---|---|
| langchain | Core framework -- chains, prompts, orchestration |
| langchain-openai | OpenAI LLM and embedding integrations |
| langchain-chroma | ChromaDB vector store integration |
| langchain-text-splitters | Document chunking utilities |
| langchain-community | Community document loaders (PDF, HTML, etc.) |
| python-dotenv | Load API keys from .env files |
| unstructured | Parse PDFs and other document formats |
Project Structure
rag-project/
├── .env # Your API keys
├── rag_pipeline.py # The main RAG script
├── documents/ # Your source documents go here
│ ├── report.pdf
│ ├── notes.md
│ └── policy.txt
└── chroma_store/ # ChromaDB persists data here (auto-created)
Set Up Your API Key
Create a .env file in the project root:
OPENAI_API_KEY=sk-proj-your-actual-key-here
# Load it at the top of your script
from dotenv import load_dotenv
load_dotenv()
That's it. LangChain's OpenAI integrations automatically read OPENAI_API_KEY from the environment.
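If you'd rather fail fast when the key is missing (instead of hitting a cryptic authentication error on the first API call), a small optional check after load_dotenv() does the trick. This is a sketch, not part of the main script:
import os
from dotenv import load_dotenv

load_dotenv()

# Fail fast if the key didn't load, rather than surfacing a confusing auth error later
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY not set -- check your .env file")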
📄 Step 1: Load and Chunk Your Documents
This is the indexing phase. You take raw documents and prepare them for embedding.
Loading Documents
LangChain has loaders for practically every document format. Here are the ones you'll use most:
from langchain_community.document_loaders import (
TextLoader,
DirectoryLoader,
PyPDFLoader,
UnstructuredMarkdownLoader,
)
# Load a single PDF (PyPDFLoader also requires: pip install pypdf)
pdf_docs = PyPDFLoader("documents/report.pdf").load()
# Load a single text file
txt_docs = TextLoader("documents/notes.txt").load()
# Load all files in a directory (auto-detects format)
all_docs = DirectoryLoader("documents/", glob="**/*.*").load()
print(f"Loaded {len(all_docs)} document pages")
Each loader returns a list of Document objects, each containing page_content (the text) and metadata (source file, page number, etc.). That metadata travels with the chunks all the way to the final answer -- useful for source attribution.
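To see that structure for yourself, print one of the loaded documents. The exact metadata keys depend on the loader (PyPDFLoader adds a page number, TextLoader just the source path), so treat the values below as illustrative:
# Peek at the first loaded Document
doc = all_docs[0]
print(doc.metadata)            # e.g. {'source': 'documents/report.pdf', 'page': 0}
print(doc.page_content[:300])  # first 300 characters of the text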
Chunking Strategy
Raw documents are too long to embed as-is. You need to split them into chunks that are small enough to be meaningful but large enough to preserve context. The RecursiveCharacterTextSplitter is the go-to choice for most use cases in 2026.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " ", ""],
length_function=len,
)
chunks = text_splitter.split_documents(all_docs)
print(f"Split {len(all_docs)} documents into {len(chunks)} chunks")
How this works: the splitter tries to split on double newlines first (paragraph breaks). If a chunk is still too large, it falls back to single newlines, then sentences, then words. The chunk_overlap=200 means each chunk shares 200 characters with its neighbor -- this prevents context from being lost at chunk boundaries.
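You can verify the overlap yourself by comparing the tail of one chunk with the head of the next (assuming both chunks came from the same source document). The repeated text won't be exactly 200 characters, since splits snap to the nearest separator:
# Compare the end of chunk 0 with the start of chunk 1 -- part of the text should repeat
print(chunks[0].page_content[-200:])
print("\n--- next chunk ---\n")
print(chunks[1].page_content[:200])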
Choosing Chunk Size
The February 2026 FloTorch benchmark study tested seven chunking strategies across thousands of academic papers. The surprising result: simple recursive splitting at 512-1000 tokens consistently outperformed fancy semantic chunking methods. (Keep in mind that chunk_size above counts characters, not tokens, because length_function=len.) Start simple.
| Document Type | chunk_size | chunk_overlap | Why |
|---|---|---|---|
| General articles, docs | 1000 | 200 | ✅ Good default balance |
| Technical documentation | 600 | 100 | Smaller chunks = more precise retrieval |
| Legal or dense prose | 1500 | 400 | Larger chunks preserve legal context |
| Code files | 800 | 100 | Split on function/class boundaries |
| Q&A / FAQ content | 500 | 50 | Each Q&A pair should be its own chunk |
Start with chunk_size=1000 and chunk_overlap=200, build the pipeline, and evaluate the results. Optimize chunk size only after you have a working system and actual queries to test against.
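If you want a feel for how chunk_size changes the corpus before spending anything on embeddings, split at a few candidate sizes and compare the counts. A rough sketch that reuses all_docs from the loading step (pure string work, no API calls):
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Compare chunk counts and average chunk length at a few candidate sizes
for size in (500, 1000, 1500):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 5)
    candidate_chunks = splitter.split_documents(all_docs)
    avg_len = sum(len(c.page_content) for c in candidate_chunks) / len(candidate_chunks)
    print(f"chunk_size={size}: {len(candidate_chunks)} chunks, avg {avg_len:.0f} chars")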
🗃️ Step 2: Embed and Store in ChromaDB
Now you convert those text chunks into vectors and store them in ChromaDB.
What Are Embeddings?
An embedding turns text into a list of numbers (a vector) that captures its meaning. Similar texts produce similar vectors. When you search for "refund policy", the embedding is close to chunks containing "return items within 30 days" -- even though those words don't overlap.
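You can see this numerically with a couple of throwaway sentences. A toy sketch (the example sentences are made up, and the three embedding calls cost a fraction of a cent):
import numpy as np
from langchain_openai import OpenAIEmbeddings

emb = OpenAIEmbeddings(model="text-embedding-3-small")

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means identical direction, near 0 means unrelated
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_query = emb.embed_query("What's our refund policy?")
v_related = emb.embed_query("Customers may return items within 30 days for a full refund.")
v_unrelated = emb.embed_query("The cafeteria serves tacos on Tuesdays.")

print(cosine_similarity(v_query, v_related))    # expect a noticeably higher score ...
print(cosine_similarity(v_query, v_unrelated))  # ... than this one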
Create the Vector Store
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Initialize the embedding model
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
# Build the vector store from your chunks
vector_store = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
collection_name="my_documents",
persist_directory="./chroma_store",
)
print(f"Stored {len(chunks)} chunks in ChromaDB")
That's three things happening in one call:
1. Each chunk's text is sent to OpenAI's embedding API and converted to a 1536-dimensional vector
2. The vectors (plus the original text and metadata) are stored in ChromaDB
3. ChromaDB persists everything to ./chroma_store/ on disk
Loading an Existing Store
You only build the vector store once (or when you add new documents). After that, load from disk:
# Load existing vector store (no re-embedding needed)
vector_store = Chroma(
collection_name="my_documents",
embedding_function=embedding_model,
persist_directory="./chroma_store",
)
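A quick way to confirm the store loaded correctly is to run a raw similarity search against it, before wiring up any chain. The query string here is just an example:
# Smoke test: query ChromaDB directly, no LLM involved
hits = vector_store.similarity_search("refund policy", k=2)
for doc in hits:
    print(doc.metadata.get("source", "unknown"), "->", doc.page_content[:80])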
Embedding Model Options
OpenAI's text-embedding-3-small is the practical default in 2026 -- fast, cheap ($0.02 per million tokens), and good enough for most use cases. But you have options:
| Embedding Model | Provider | Dimensions | Cost | Quality |
|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | $0.02/1M tokens | ✅ Great default |
| text-embedding-3-large | OpenAI | 3072 | $0.13/1M tokens | Better for large corpora |
| nomic-embed-text | Ollama (local) | 768 | Free | ✅ Best local option |
| mxbai-embed-large | Ollama (local) | 1024 | Free | Higher quality, slower |
To use a local embedding model with Ollama instead of OpenAI (the integration ships as a separate package: pip install langchain-ollama):
from langchain_ollama import OllamaEmbeddings
# Pull the model first: ollama pull nomic-embed-text
embedding_model = OllamaEmbeddings(model="nomic-embed-text")
Everything else in the pipeline stays the same. That's the beauty of LangChain's abstractions.
🔍 Step 3: Build the Retrieval Chain
This is where everything comes together. You take a user's question, retrieve relevant chunks, and generate an answer.
The Retriever
The retriever is ChromaDB's search interface, wrapped in a LangChain-compatible object:
retriever = vector_store.as_retriever(
search_type="similarity",
search_kwargs={"k": 5},
)
k=5 means "return the 5 most similar chunks." This is the most important tuning parameter -- too few and you miss relevant context, too many and you dilute the signal with noise.
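If you'd rather filter by relevance than by a fixed count, LangChain retrievers also support a score-threshold mode. A variant to experiment with -- the name threshold_retriever and the 0.3 cutoff are just a starting guess, since the right threshold depends on your embedding model and data:
# Alternative: only return chunks whose relevance score clears a threshold
threshold_retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.3},
)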
The Prompt Template
The prompt tells the LLM how to use the retrieved context:
from langchain_core.prompts import ChatPromptTemplate
rag_prompt = ChatPromptTemplate.from_template(
"You are a helpful assistant. Answer the question based ONLY on "
"the following context. If the context doesn't contain enough "
"information to answer fully, say what you can and note what's "
"missing.\n\n"
"Context:\n{context}\n\n"
"Question: {question}\n\n"
"Answer:"
)
The "ONLY" constraint is critical. Without it, the model will happily blend retrieved context with its training data, and you'll get plausible-sounding answers that are half-grounded and half-hallucinated. You want the model to stay in its lane.
The Full RAG Chain (LCEL)
LangChain Expression Language (LCEL) lets you compose the entire chain with pipe operators:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Format retrieved docs into a single string
def format_docs(docs):
return "\n\n---\n\n".join(
f"Source: {doc.metadata.get('source', 'unknown')}\n{doc.page_content}"
for doc in docs
)
# Build the RAG chain
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| rag_prompt
| llm
| StrOutputParser()
)
Here's what happens when you invoke this chain:
- The user's question flows into two paths simultaneously
- Left path: the question goes to the retriever, which embeds it and searches ChromaDB for the 5 closest chunks, then format_docs joins them into a single string
- Right path: the question passes through unchanged via RunnablePassthrough()
- Both feed into the prompt template as {context} and {question}
- The filled prompt goes to the LLM
- The LLM's response is parsed to a plain string
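If the pipe syntax feels opaque, here is the same flow written out step by step, reusing the retriever, rag_prompt, llm, format_docs, and StrOutputParser defined above:
# The explicit equivalent of rag_chain.invoke(question)
question = "What is our refund policy?"
docs = retriever.invoke(question)                   # embed the query, search ChromaDB
prompt_value = rag_prompt.invoke(
    {"context": format_docs(docs), "question": question}
)
message = llm.invoke(prompt_value)                  # ChatOpenAI returns an AIMessage
print(StrOutputParser().invoke(message))            # extract the plain string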
Query It
# Ask a question
response = rag_chain.invoke("What is our refund policy?")
print(response)
Based on the provided documents, your refund policy allows customers to
return items within 30 days of purchase for a full refund. Items must be
in original packaging and unused. Digital products are non-refundable
after download. Refunds are processed within 5-7 business days to the
original payment method.
Source: policy.txt
That answer came from your documents. The model read your policy file and synthesized an answer from it.
🏗️ The Complete Pipeline
Here's the full script -- copy it, replace the documents directory, and run it:
"""
rag_pipeline.py -- Complete RAG pipeline with LangChain and ChromaDB.
Usage:
python rag_pipeline.py --build # Index documents (run once)
python rag_pipeline.py # Interactive query mode
"""
import os
import sys
from dotenv import load_dotenv
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
load_dotenv()
# --- Configuration ---
DOCS_DIR = "./documents"
CHROMA_DIR = "./chroma_store"
COLLECTION_NAME = "my_documents"
EMBEDDING_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o-mini"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
RETRIEVAL_K = 5
def get_embedding_model():
return OpenAIEmbeddings(model=EMBEDDING_MODEL)
def build_index():
"""Load documents, chunk them, embed, and store in ChromaDB."""
print(f"Loading documents from {DOCS_DIR}/ ...")
loader = DirectoryLoader(
DOCS_DIR,
glob="**/*.*",
show_progress=True,
use_multithreading=True,
)
documents = loader.load()
print(f"Loaded {len(documents)} document pages.")
if not documents:
print("No documents found. Add files to the documents/ directory.")
return
splitter = RecursiveCharacterTextSplitter(
chunk_size=CHUNK_SIZE,
chunk_overlap=CHUNK_OVERLAP,
separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks.")
print("Embedding and storing in ChromaDB...")
Chroma.from_documents(
documents=chunks,
embedding=get_embedding_model(),
collection_name=COLLECTION_NAME,
persist_directory=CHROMA_DIR,
)
print(f"Done. Vector store saved to {CHROMA_DIR}/")
def load_store():
"""Load existing ChromaDB vector store from disk."""
return Chroma(
collection_name=COLLECTION_NAME,
embedding_function=get_embedding_model(),
persist_directory=CHROMA_DIR,
)
def format_docs(docs):
"""Format retrieved documents into a context string."""
return "\n\n---\n\n".join(
f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
for doc in docs
)
def build_rag_chain(vector_store):
"""Assemble the RAG chain: retriever -> prompt -> LLM -> output."""
retriever = vector_store.as_retriever(
search_type="similarity",
search_kwargs={"k": RETRIEVAL_K},
)
prompt = ChatPromptTemplate.from_template(
"You are a helpful assistant. Answer the question based ONLY on "
"the following context. If the context doesn't contain enough "
"information, say so.\n\n"
"Context:\n{context}\n\n"
"Question: {question}\n\n"
"Answer:"
)
llm = ChatOpenAI(model=CHAT_MODEL, temperature=0)
chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
return chain
def interactive_query(chain):
"""Run an interactive query loop."""
print(f"\nRAG pipeline ready. Model: {CHAT_MODEL}")
print("Type your questions (Ctrl+C to exit):\n")
while True:
try:
question = input("Q: ").strip()
if not question:
continue
print("\nSearching and generating...\n")
answer = chain.invoke(question)
print(f"A: {answer}\n")
        except (KeyboardInterrupt, EOFError):
print("\nGoodbye.")
break
if __name__ == "__main__":
if not os.path.exists(DOCS_DIR):
os.makedirs(DOCS_DIR)
print(f"Created {DOCS_DIR}/. Add your documents, then run:")
print(" python rag_pipeline.py --build")
sys.exit(0)
if "--build" in sys.argv:
build_index()
else:
store = load_store()
chain = build_rag_chain(store)
interactive_query(chain)
Run It
# 1. Add your documents
mkdir -p documents
cp ~/my-docs/*.pdf documents/
cp ~/my-docs/*.md documents/
# 2. Build the index
python rag_pipeline.py --build
# 3. Start querying
python rag_pipeline.py
RAG pipeline ready. Model: gpt-4o-mini
Type your questions (Ctrl+C to exit):
Q: What were the Q4 revenue numbers?
Searching and generating...
A: According to the Q4 financial report, total revenue was $4.2M,
representing a 23% increase over Q3. The primary growth drivers
were enterprise subscriptions (+31%) and professional services (+18%).
🔧 Troubleshooting
"No documents found" after running --build
- Make sure your files are in the documents/ directory (not nested too deep)
- Check file extensions -- the DirectoryLoader with glob="**/*.*" catches most formats, but binary files like .docx need the unstructured package with extra dependencies: pip install "unstructured[all-docs]"
- Verify permissions: ls -la documents/
Retrieval returns irrelevant chunks
- Reduce chunk size to 500-600 characters for more precise matching
- Increase k to 8-10 to cast a wider net, then let the LLM filter relevance
- Check your chunks manually by adding a debug print:
# Debug: see what gets retrieved, along with raw distance scores (lower = more similar)
results = vector_store.similarity_search_with_score("your question here", k=5)
for i, (doc, score) in enumerate(results, 1):
    print(f"\n--- Chunk {i} (distance: {score:.4f}) ---")
    print(doc.page_content[:200])
ChromaDB "collection already exists" error
This happens when you re-run --build without clearing the old data. Either delete the chroma_store/ directory first, or modify the build step to clear the existing collection:
# Delete existing collection before rebuilding
import chromadb
client = chromadb.PersistentClient(path=CHROMA_DIR)
try:
client.delete_collection(COLLECTION_NAME)
except Exception:
    pass  # Collection doesn't exist yet (the exact exception type varies by chromadb version)
OpenAI rate limit or quota errors
- Check your API key is valid: echo $OPENAI_API_KEY
- For large document sets, add a delay between embedding batches or reduce batch size
- Switch to a local embedding model (Ollama) to avoid API costs entirely
"Hallucinated" answers that aren't in the documents
- Strengthen the prompt: add "If you cannot answer from the provided context alone, respond with 'I don't have enough information to answer that'"
- Lower the LLM temperature to 0 (deterministic output)
- Check whether the relevant information is actually making it into the context -- log the retrieved chunks
⚡ Beyond the Basics
Once your pipeline is working, here are the upgrades that actually matter.
Add Metadata Filtering
ChromaDB supports filtering by metadata fields. If your documents have categories, dates, or departments:
retriever = vector_store.as_retriever(
search_type="similarity",
search_kwargs={
"k": 5,
"filter": {"source": "documents/policy.pdf"},
},
)
This restricts retrieval to chunks from a specific file. Useful when users ask about a particular document.
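ChromaDB's where-filter syntax also supports operators such as $eq, $gte, and $and. A hypothetical example, assuming your chunks carry a custom year metadata field (they won't unless you add one during indexing):
# Hypothetical: combine filters -- assumes a "year" field was added to chunk metadata
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 5,
        "filter": {
            "$and": [
                {"source": {"$eq": "documents/policy.pdf"}},
                {"year": {"$gte": 2024}},
            ]
        },
    },
)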
Hybrid Search with MMR
Maximal Marginal Relevance (MMR) balances similarity with diversity. Instead of returning 5 chunks that all say the same thing, MMR ensures variety:
retriever = vector_store.as_retriever(
search_type="mmr",
search_kwargs={
"k": 5,
"fetch_k": 20, # Fetch 20 candidates
"lambda_mult": 0.7, # 0 = max diversity, 1 = max similarity
},
)
Add Source Citations
Modify format_docs and the prompt to include source references:
def format_docs_with_sources(docs):
formatted = []
for i, doc in enumerate(docs, 1):
source = doc.metadata.get("source", "unknown")
formatted.append(f"[{i}] (Source: {source})\n{doc.page_content}")
return "\n\n".join(formatted)
# Update the prompt to instruct citation
rag_prompt = ChatPromptTemplate.from_template(
"Answer the question using ONLY the provided context. "
"Cite your sources using [1], [2], etc.\n\n"
"Context:\n{context}\n\n"
"Question: {question}\n\n"
"Answer (with citations):"
)
Conversation Memory
For a multi-turn chatbot, add conversation history to the chain:
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
prompt_with_history = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant. Use the context to answer."),
MessagesPlaceholder(variable_name="chat_history"),
("human", "Context:\n{context}\n\nQuestion: {question}"),
])
# Maintain history in your query loop
chat_history = []
def ask_with_history(chain, question, history):
response = chain.invoke({
"question": question,
"chat_history": history,
})
history.append(HumanMessage(content=question))
history.append(AIMessage(content=response))
return response
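The chain from Step 3 takes a plain string, so it won't accept this dictionary as-is. A minimal sketch of a variant that does, assuming the retriever, llm, format_docs, and prompt_with_history defined above:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

# Chain variant that accepts {"question": ..., "chat_history": [...]}
rag_chain_with_history = (
    {
        "context": itemgetter("question") | retriever | format_docs,
        "question": itemgetter("question"),
        "chat_history": itemgetter("chat_history"),
    }
    | prompt_with_history
    | llm
    | StrOutputParser()
)

# Usage with the helper above
answer = ask_with_history(rag_chain_with_history, "What is our refund policy?", chat_history)
print(answer)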
🚀 What's Next
- Add a web interface -- wrap the pipeline with FastAPI or Streamlit for a chat UI that your team can use without touching the terminal
- Go fully local -- replace OpenAI with Ollama for embeddings and generation, zero API costs and full data privacy (Local LLM + Ollama RAG Guide)
- Scale to production -- swap ChromaDB for a managed vector database like Pinecone or Weaviate when you outgrow single-machine storage
- Add document watching -- set up a file watcher or cron job to automatically re-index when new documents land in the directory
- Evaluate and iterate -- build a test set of 20-50 questions with expected answers, then use LangSmith or RAGAS to measure retrieval and generation quality systematically (a minimal hand-rolled starting point is sketched below)
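Before reaching for a full evaluation framework, even a crude substring check over a handful of hand-written questions will tell you whether retrieval is pulling the right material. A rough sketch reusing rag_chain from Step 3 -- the questions and expected phrases below are placeholders based on the example policy document:
# Crude evaluation loop -- swap in questions and expected phrases for your own documents
test_set = [
    {"question": "What is our refund policy?", "expect": "30 days"},
    {"question": "Are digital products refundable?", "expect": "non-refundable"},
]

passed = 0
for case in test_set:
    answer = rag_chain.invoke(case["question"])
    ok = case["expect"].lower() in answer.lower()
    passed += ok
    print(f"{'PASS' if ok else 'FAIL'}: {case['question']}")

print(f"{passed}/{len(test_set)} checks passed")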
The RAG pattern is one of the most practical things happening in AI engineering right now. It bridges the gap between "general-purpose LLM" and "actually useful tool for your specific data." Master it, and you'll find uses everywhere -- internal knowledge bases, customer support, code documentation search, compliance document Q&A, and more.
Want to see how AI is changing the developer role? Read The Rise of the AI Engineer for the bigger picture, or check out AI Coding Agents Compared to see how RAG fits into the broader AI-assisted development landscape.