Your Own Private AI: Run a Local LLM with Ollama and Build a RAG System Over Your Documents
Every time you paste code into ChatGPT or Claude, that code leaves your machine and lands on someone else's server. For personal projects, that's fine. But for proprietary algorithms, internal business logic, client data, or anything subject to HIPAA, GDPR, or SOC 2 obligations -- it's a real problem.
The good news: in 2026, you can run genuinely capable AI models entirely on your own hardware. No API keys. No cloud. No data leaving your machine. And with a RAG (Retrieval-Augmented Generation) pipeline, you can make that local model answer questions grounded in your own documents -- PDFs, markdown files, code, internal wikis, whatever you have.
This guide walks you through the entire setup: installing Ollama, picking the right model for your hardware, setting up a ChatGPT-like web interface, and building a working RAG system that queries your own documents. All of it runs locally.
What You'll Need
- A reasonably modern computer -- 8 GB RAM minimum (16 GB recommended) with an SSD
- macOS, Linux, or Windows -- Ollama runs natively on all three
- Docker -- for Open WebUI (the chat interface)
- Python 3.10+ -- for the RAG pipeline
- A GPU is helpful but not required -- Ollama falls back to CPU inference automatically
Part 1: Install Ollama and Run Your First Model
Installation
macOS:
brew install ollama
Or download directly from ollama.com.
Linux:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
sudo systemctl enable ollama
Windows:
Download the installer from ollama.com and run it. The CLI gets added to your PATH automatically.
Run Your First Model
ollama run llama3.1:8b
That's it. Ollama downloads the model (about 4.9 GB) and drops you into an interactive chat. Type a question, get an answer -- all running on your machine.
Press Ctrl+D or type /bye to exit.
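The interactive prompt is just one way in. The same Ollama process also serves a local HTTP API on port 11434, which Part 4 builds on. As a quick sanity check, here's a minimal sketch that sends a single prompt to that API with Python's `requests` library -- it assumes the default port and that `llama3.1:8b` has already been pulled:
import requests

# One-shot, non-streaming request to Ollama's local API
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "In one sentence, what is a local LLM?",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])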
Essential Ollama Commands
| Command | What It Does |
|---|---|
| `ollama pull <model>` | Download a model without starting a chat |
| `ollama run <model>` | Download (if needed) and start an interactive session |
| `ollama list` | Show all downloaded models |
| `ollama ps` | Show currently loaded models and memory usage |
| `ollama stop <model>` | Unload a model from memory |
| `ollama rm <model>` | Delete a model from disk |
| `ollama show <model>` | Show model details and parameters |
Part 2: Picking the Right Model for Your Hardware
This is where most guides fail -- they recommend a model without considering what you're actually running it on. Here's a practical breakdown based on real RAM constraints:
If You Have 8 GB RAM (Entry Level)
| Model | Tag | Size | Best For |
|---|---|---|---|
| Llama 3.2 | `llama3.2:3b` | 2 GB | Fast chat, lightweight tasks |
| DeepSeek-R1 (distill) | `deepseek-r1:1.5b` | 1 GB | Basic reasoning on minimal hardware |
| Llama 3.1 | `llama3.1:8b` | 4.9 GB | Solid all-rounder (tight fit at 8 GB) |
ollama pull llama3.1:8b
If You Have 16 GB RAM (Sweet Spot)
| Model | Tag | Size | Best For |
|---|---|---|---|
| DeepSeek-R1 (distill) | `deepseek-r1:14b` | 9 GB | Strong reasoning and math |
| Qwen 2.5 Coder | `qwen2.5-coder:7b` | 4.7 GB | Code completion and generation |
| Phi-4 | `phi4:14b` | 9 GB | Reasoning, compact but powerful |
| Mistral Small 3.1 | `mistral-small:24b` | 15 GB | Fast general purpose (tight fit at 16 GB) |
ollama pull deepseek-r1:14b
If You Have 32 GB RAM (Power User)
| Model | Tag | Size | Best For |
|---|---|---|---|
| DeepSeek-R1 (distill) | `deepseek-r1:32b` | 20 GB | Near o1-mini reasoning quality |
| Qwen 2.5 Coder | `qwen2.5-coder:32b` | 20 GB | GPT-4o level coding |
| Gemma 3 | `gemma3:27b` | 17 GB | Multimodal, Google-quality |
ollama pull deepseek-r1:32b
If You Have 64 GB+ RAM (Workstation)
| Model | Tag | Size | Best For |
|---|---|---|---|
| DeepSeek-R1 (distill) | `deepseek-r1:70b` | 43 GB | Top-tier local reasoning |
| Llama 3.3 | `llama3.3:70b` | 43 GB | Best general-purpose local model |
ollama pull llama3.3:70b
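Not sure which tier you're in? Here's a rough sketch that maps total system RAM to the picks in the tables above -- it assumes `psutil` is installed (`pip install psutil`), and the thresholds are this guide's recommendations, not anything Ollama enforces:
import psutil

# Map total system RAM to a reasonable starting model from the tables above
ram_gb = psutil.virtual_memory().total / (1024 ** 3)

if ram_gb >= 64:
    suggestion = "llama3.3:70b"
elif ram_gb >= 32:
    suggestion = "deepseek-r1:32b"
elif ram_gb >= 16:
    suggestion = "deepseek-r1:14b"
else:
    suggestion = "llama3.2:3b"

print(f"{ram_gb:.0f} GB RAM detected -> try: ollama pull {suggestion}")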
My Recommendation
Start with `deepseek-r1:14b` if you have 16 GB RAM. It punches well above its weight on reasoning tasks and code generation. If you have 32 GB, `deepseek-r1:32b` is a clear step further -- DeepSeek's published benchmarks show the 32B distill matching or beating OpenAI's o1-mini, and it runs entirely on your laptop.

For coding specifically, `qwen2.5-coder:7b` is excellent as a secondary model for fast inline suggestions.
A Note on Quantization
When you pull a model from Ollama, you're getting a quantized version -- the model weights are compressed from 16-bit to 4-bit, which dramatically reduces file size and RAM usage with minimal quality loss. The default quantization (Q4_K_M) is the sweet spot for almost everyone:
| Format | RAM Savings | Quality Impact |
|---|---|---|
| Q4_K_M (default) | ~75% less than full | Small, excellent balance |
| Q5_K_M | ~65% less | Very small impact |
| Q8_0 | ~50% less | Nearly indistinguishable from full |
You don't need to think about this unless you want to -- the defaults are good. But if you have extra RAM and want better quality:
ollama pull llama3.1:8b-instruct-q8_0
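To see what's actually on disk -- model names, sizes, and the reported quantization level -- you can query Ollama's local API. A small sketch against the `/api/tags` endpoint, assuming Ollama is running on the default port:
import requests

# List locally installed models with their size and quantization level
models = requests.get("http://localhost:11434/api/tags", timeout=10).json()["models"]
for m in models:
    size_gb = m["size"] / (1024 ** 3)
    quant = m.get("details", {}).get("quantization_level", "unknown")
    print(f"{m['name']:<35} {size_gb:5.1f} GB  {quant}")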
Part 3: Set Up Open WebUI (The Chat Interface)
Running models from the terminal is fine for quick queries, but for daily use you'll want a proper interface. Open WebUI gives you a self-hosted ChatGPT-like experience that connects directly to your local Ollama models.
Install with Docker
Make sure Ollama is running, then:
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser. Create an account on first visit (the first account becomes admin). All data stays local in the Docker volume.
Single-User Mode (Skip Login)
If this is a personal setup and you don't want to deal with accounts:
docker run -d -p 3000:8080 \
-e WEBUI_AUTH=False \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
What You Get
- Model switching -- toggle between all your downloaded Ollama models from a dropdown
- Conversation history -- all chats are saved locally
- Document upload -- drag PDFs or text files into a chat for in-session RAG
- System prompt editor -- customize each model's behavior
- Multiple users -- if you want to share the setup with family or teammates
Docker Compose (Ollama + Open WebUI Together)
If you want both services managed together:
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - openwebui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: always

volumes:
  ollama_data:
  openwebui_data:
docker compose up -d
If you have an NVIDIA GPU and want GPU acceleration inside Docker, add this to the ollama service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
Part 4: Connect Ollama to Your Dev Tools
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. This means any tool that supports the OpenAI API can use your local models as a drop-in replacement.
Connect to Cursor
- Open Settings > Models > OpenAI API
- Set Base URL: `http://localhost:11434/v1` (the `/v1` is required)
- Set API Key: `ollama` (any non-empty string works -- Ollama ignores it)
- Select your model
Now Cursor uses your local model instead of sending code to the cloud.
Connect to VS Code (Continue Extension)
Install the Continue extension, then edit ~/.continue/config.yaml:
models:
  - name: DeepSeek R1 14B
    provider: ollama
    model: deepseek-r1:14b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
  - name: Qwen Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - autocomplete
The trick is to use a small, fast model for autocomplete (inline suggestions need to be instant) and a larger model for chat and editing (where you can wait a few seconds).
Use From Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what RAG is in one paragraph."},
    ],
)

print(response.choices[0].message.content)
Because Ollama speaks the OpenAI protocol, any library or tool built for OpenAI works with zero code changes -- just point the base URL to localhost.
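Streaming works through the same compatibility layer. A short sketch, again with the OpenAI SDK pointed at localhost, that prints tokens as they arrive:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")

# stream=True yields chunks as the model generates them
stream = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()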
Part 5: Build a RAG Pipeline Over Your Own Documents
This is where it gets powerful. RAG (Retrieval-Augmented Generation) lets you ask an LLM questions that it answers using your own documents as the source of truth -- not its training data, not the internet, your files.
The pipeline looks like this:
┌──────────────┐     ┌──────────┐     ┌──────────────┐     ┌──────────┐
│  Your Docs   │────►│  Chunk   │────►│   Embed &    │────►│ ChromaDB │
│  (PDF, MD)   │     │  Text    │     │  Vectorize   │     │ (Vector  │
└──────────────┘     └──────────┘     └──────────────┘     │  Store)  │
                                                           └────┬─────┘
                                                                │
┌──────────────┐     ┌──────────┐     ┌──────────────┐          │
│    Answer    │◄────│   LLM    │◄────│   Relevant   │◄─────────┘
│              │     │ (Ollama) │     │   Chunks     │
└──────────────┘     └──────────┘     └──────────────┘
                          ▲
                          │
                     ┌────┴─────┐
                     │   Your   │
                     │ Question │
                     └──────────┘
- Load your documents (PDFs, markdown, text files)
- Chunk them into smaller pieces (paragraphs or sections)
- Embed each chunk into a vector (a numeric representation of meaning)
- Store the vectors in ChromaDB
- At query time: embed your question, find the most similar chunks, pass them to the LLM along with your question
Step 1: Install Dependencies
pip install langchain langchain-ollama langchain-community \
langchain-text-splitters chromadb \
unstructured "unstructured[all-docs]"
Step 2: Pull an Embedding Model
You need a separate small model specifically for turning text into vectors:
ollama pull nomic-embed-text
This is a fast, 137M-parameter embedding model that produces 768-dimensional vectors. It beats OpenAI's text-embedding-ada-002 on standard retrieval benchmarks and runs entirely locally.
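If you want to see what the embedding model produces before wiring up the full pipeline, here's a quick sketch using the same `OllamaEmbeddings` class the script below relies on. It embeds three strings and compares them with cosine similarity -- the same kind of comparison ChromaDB performs for you during retrieval:
import math

from langchain_ollama import OllamaEmbeddings

embedder = OllamaEmbeddings(model="nomic-embed-text")

a = embedder.embed_query("How do I rotate my API keys?")
b = embedder.embed_query("Steps for credential rotation")
c = embedder.embed_query("Best pizza toppings")

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm

print(len(a))        # vector dimension
print(cosine(a, b))  # related questions -> higher similarity
print(cosine(a, c))  # unrelated question -> lower similarity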
Step 3: The Complete RAG Script
Create a file called rag.py:
import os
import sys

from langchain_community.document_loaders import DirectoryLoader, UnstructuredPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# --- Configuration ---
DOCS_DIR = "./documents" # Put your PDFs and files here
CHROMA_DIR = "./chroma_db" # Where vectors are persisted
EMBED_MODEL = "nomic-embed-text" # Embedding model
CHAT_MODEL = "llama3.1:8b" # Chat model (adjust to your hardware)
CHUNK_SIZE = 1000 # Characters per chunk
CHUNK_OVERLAP = 200 # Overlap between chunks

def build_vector_store():
    """Load documents, chunk, embed, and store in ChromaDB."""
    print(f"Loading documents from {DOCS_DIR}/ ...")

    # Load PDFs
    pdf_loader = DirectoryLoader(
        DOCS_DIR,
        glob="**/*.pdf",
        loader_cls=UnstructuredPDFLoader,
    )

    # Load text and markdown files (DirectoryLoader globs don't support
    # brace expansion, so use one loader per extension)
    txt_loader = DirectoryLoader(DOCS_DIR, glob="**/*.txt")
    md_loader = DirectoryLoader(DOCS_DIR, glob="**/*.md")

    documents = pdf_loader.load() + txt_loader.load() + md_loader.load()
    print(f"Loaded {len(documents)} documents.")

    if not documents:
        print("No documents found. Add files to the documents/ directory.")
        return None

    # Chunk the documents
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
    )
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks.")

    # Embed and store
    print("Embedding chunks (this may take a few minutes on first run)...")
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=OllamaEmbeddings(model=EMBED_MODEL),
        collection_name="my_documents",
        persist_directory=CHROMA_DIR,
    )
    print("Vector store built and persisted.")
    return vector_store

def load_vector_store():
    """Load an existing vector store from disk."""
    return Chroma(
        collection_name="my_documents",
        embedding_function=OllamaEmbeddings(model=EMBED_MODEL),
        persist_directory=CHROMA_DIR,
    )

def format_docs(docs):
    """Join retrieved chunks into a single context string for the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)

def query_documents(vector_store, question):
    """Run a RAG query against the vector store."""
    # Set up retriever (find the 5 most relevant chunks)
    retriever = vector_store.as_retriever(
        search_kwargs={"k": 5}
    )

    # Set up the LLM
    llm = ChatOllama(model=CHAT_MODEL)

    # The prompt template
    prompt = ChatPromptTemplate.from_template(
        "You are a helpful assistant. Answer the question using ONLY the "
        "provided context. If the context doesn't contain enough information "
        "to answer, say so.\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}"
    )

    # Build the RAG chain: retrieve -> format -> prompt -> LLM -> plain text
    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain.invoke(question)
if __name__ == "__main__":
import os
if not os.path.exists(DOCS_DIR):
os.makedirs(DOCS_DIR)
print(f"Created {DOCS_DIR}/ directory. Add your documents there.")
print("Then run: python rag.py --build")
sys.exit(0)
if "--build" in sys.argv:
build_vector_store()
else:
# Load existing store and enter query loop
store = load_vector_store()
print(f"\nRAG system ready. Using {CHAT_MODEL} for answers.")
print("Type your questions (Ctrl+C to exit):\n")
while True:
try:
question = input("Q: ").strip()
if not question:
continue
print("\nSearching documents and generating answer...\n")
answer = query_documents(store, question)
print(f"A: {answer}\n")
except KeyboardInterrupt:
print("\nGoodbye.")
break
Step 4: Use It
# 1. Create the documents directory and add your files
mkdir -p documents
cp ~/my-notes/*.pdf documents/
cp ~/my-notes/*.md documents/
# 2. Build the vector store (one-time, or when you add new documents)
python rag.py --build
# 3. Start asking questions
python rag.py
RAG system ready. Using llama3.1:8b for answers.
Type your questions (Ctrl+C to exit):
Q: What were the key decisions from last quarter's architecture review?
Searching documents and generating answer...
A: Based on the architecture review document from Q4, the key decisions were:
1. Migration from REST to GraphQL for the mobile API...
The model is answering from your documents, not from its training data. And nothing left your machine.
Tuning Tips
Chunk size matters. Smaller chunks (500-800 characters) give more precise retrieval but less context per chunk. Larger chunks (1200-2000) give more context but can dilute relevance. Start with 1000 and adjust based on your document type:
| Document Type | chunk_size | chunk_overlap |
|---|---|---|
| General articles, notes | 1000 | 200 |
| Technical docs, code | 600 | 100 |
| Legal or dense prose | 1500 | 400 |
More retrieved chunks = more context, but slower generation. The `search_kwargs={"k": 5}` setting means we retrieve the 5 most relevant chunks. For complex questions spanning multiple documents, try `k=10`. For simple factual lookups, `k=3` is enough.
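When retrieval looks off, inspect what's actually coming back before blaming the model. A small debugging sketch using Chroma's `similarity_search_with_score`, assuming the vector store from `rag.py` has already been built:
from rag import load_vector_store

store = load_vector_store()

# Chroma returns a distance score: lower means a closer match
results = store.similarity_search_with_score(
    "What were the key decisions from the architecture review?", k=5
)
for doc, score in results:
    source = doc.metadata.get("source", "unknown")
    print(f"{score:.3f}  {source}  {doc.page_content[:80]!r}")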
Rebuild when you add documents. Run python rag.py --build again whenever you add new files. ChromaDB persists to disk, so you only need to rebuild when your source documents change.
Part 6: Performance Tuning
Free Up GPU Memory After One-Off Queries
Ollama keeps models loaded in memory for 5 minutes by default. If you need that memory back immediately:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "keep_alive": 0}'
Increase Context Window
By default, most models use a 2048 or 4096 token context. For RAG, you often need more:
export OLLAMA_CONTEXT_LENGTH=16384
Or set it per-model in a Modelfile:
FROM llama3.1:8b
PARAMETER num_ctx 16384
ollama create llama3.1-16k -f ./Modelfile
ollama run llama3.1-16k
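If you only need the bigger window for the RAG pipeline, you can also set it per model instance instead of baking it into a Modelfile -- `ChatOllama` accepts a `num_ctx` option. A one-line sketch of the change in `rag.py`:
from langchain_ollama import ChatOllama

# In rag.py, replace ChatOllama(model=CHAT_MODEL) with something like:
llm = ChatOllama(model="llama3.1:8b", num_ctx=16384)  # 16K context for this instance only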
KV Cache Quantization (Save Memory on Long Contexts)
Context memory grows linearly with context length. An 8B model uses roughly 0.3 GB of KV cache at a 2K context but around 5 GB at 32K. KV cache quantization cuts this roughly in half (note that it currently requires flash attention, enabled with `OLLAMA_FLASH_ATTENTION=1`):
export OLLAMA_KV_CACHE_TYPE=q8_0
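Those numbers are easy to sanity-check. A rough back-of-the-envelope sketch, assuming Llama 3.1 8B's layout (32 layers, 8 KV heads, head dimension 128) and an fp16 cache -- real usage varies with Ollama's settings:
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # bytes cached per token

for ctx in (2048, 8192, 32768):
    gb = per_token * ctx / (1024 ** 3)
    print(f"{ctx:>6} tokens: {gb:.2f} GB fp16, {gb / 2:.2f} GB with q8_0")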
Move Model Storage to a Larger Drive
Models live at ~/.ollama/models by default. If your boot drive is small:
export OLLAMA_MODELS=/Volumes/BigDrive/ollama-models
Create Custom Model Presets
A Modelfile lets you bake in a system prompt and parameters:
FROM deepseek-r1:14b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM """You are a senior Python engineer. Always provide working code
with type hints and docstrings. Be concise."""
ollama create python-expert -f ./Modelfile
ollama run python-expert
Cost Breakdown
| Component | Cost | Notes |
|---|---|---|
| Ollama | Free | MIT licensed, open source |
| Open WebUI | Free | Open source, self-hosted |
| LangChain + ChromaDB | Free | Open source |
| Models | Free | Open-weight models, no API fees |
| Hardware | What you already own | Runs on any 8 GB+ machine |
| Total | $0 | Forever. No subscriptions, no per-token fees |
The only cost is the electricity to run inference and the initial time to set it up. Once running, you have a private AI system with zero ongoing costs.
When to Use Local vs Cloud AI
Local LLMs aren't a replacement for Claude or GPT-4o in every scenario. Here's a practical decision framework:
| Scenario | Use Local | Use Cloud |
|---|---|---|
| Querying proprietary documents | Yes | Risky |
| Code with trade secrets | Yes | Risky |
| General coding assistance | Either | Yes (better quality) |
| Complex reasoning, long analysis | Only with 32B+ models | Yes |
| Working offline or on planes | Yes | No |
| Regulated data (HIPAA, GDPR) | Yes | Requires enterprise agreements |
| Quick throwaway scripts | Either | Yes (faster, smarter) |
| Learning and experimentation | Yes (free, unlimited) | Limited by cost |
The ideal setup for most developers: use cloud AI (Claude Code, Cursor) for general work, and local models for anything sensitive. They complement each other.
Troubleshooting
Model runs very slowly
- Check `ollama ps` -- if the model is running on CPU only, the processor column will show CPU instead of GPU
- On a Mac with Apple Silicon: Ollama uses Metal (GPU) automatically. If it's slow, check Activity Monitor for memory pressure
- On Linux/Windows with NVIDIA: ensure you have CUDA drivers installed (`nvidia-smi` should work)
- Try a smaller model -- 7B models are 5-10x faster than 70B on the same hardware
Out of memory errors
- Use a smaller model or a more aggressively quantized version
- Close other memory-hungry applications
- Reduce context length: `export OLLAMA_CONTEXT_LENGTH=4096`
- Enable KV cache quantization: `export OLLAMA_KV_CACHE_TYPE=q4_0`
Open WebUI can't connect to Ollama
- Make sure Ollama is running: `ollama ps` or `curl http://localhost:11434`
- The `--add-host=host.docker.internal:host-gateway` flag in the Docker command is essential -- it lets the container reach Ollama on the host machine
- If using Docker Compose, use the service name (`http://ollama:11434`), not localhost
RAG returns irrelevant results
- Your chunks might be too large -- try reducing `chunk_size` to 500-800
- Increase `k` (the number of retrieved chunks) to cast a wider net
- Make sure your documents actually contain the information you're asking about
- Try a different embedding model: `ollama pull mxbai-embed-large` (slower but higher quality) -- update `EMBED_MODEL` in the script and rebuild the vector store afterwards
What's Next
Once your local AI setup is running:
- Add more document types -- LangChain supports Word docs, HTML pages, Notion exports, and dozens more formats
- Build a web frontend -- Use Streamlit or Gradio to create a custom interface for your RAG system (see the sketch below)
- Experiment with different models -- Swap `CHAT_MODEL` in the RAG script to compare how different models answer the same questions
- Set up scheduled ingestion -- Cron job to re-index documents from a shared folder, keeping your RAG system current
- Connect to your dev tools -- Use Ollama as the backend for Cursor, Continue, or any OpenAI-compatible tool for fully private coding assistance
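For the web frontend idea above, here's a minimal Gradio sketch that wraps the functions from `rag.py` -- it assumes `pip install gradio` and an already-built vector store:
import gradio as gr

from rag import load_vector_store, query_documents

store = load_vector_store()

def ask(question):
    """Run a RAG query and return the answer text."""
    return query_documents(store, question)

demo = gr.Interface(
    fn=ask,
    inputs=gr.Textbox(label="Ask your documents"),
    outputs=gr.Textbox(label="Answer"),
    title="Local RAG",
)
demo.launch()  # serves on http://127.0.0.1:7860 by default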
The era of "you need a cloud API for useful AI" is over. A 14B-parameter model running on a $1,000 laptop in 2026 rivals what a paid GPT-4 subscription delivered two years ago. The models are only getting better, and the hardware requirements keep coming down.
Looking for a cloud-based AI agent instead? Check out our OpenClaw Setup Guide for a 24/7 personal AI assistant on EC2, or our Claude Code Workflow Guide for terminal-based AI coding.