Your Own Private AI: Run a Local LLM with Ollama and Build a RAG System Over Your Documents
Every time you paste code into ChatGPT or Claude, that code leaves your machine and lands on someone else's server. For personal projects, that's fine. But for proprietary algorithms, internal business logic, client data, or anything subject to HIPAA, GDPR, or SOC 2 obligations -- it's a real problem.
The good news: in 2026, you can run genuinely capable AI models entirely on your own hardware. No API keys. No cloud. No data leaving your machine. And with a RAG (Retrieval-Augmented Generation) pipeline, you can make that local model answer questions grounded in your own documents -- PDFs, markdown files, code, internal wikis, whatever you have.
This guide walks you through the entire setup: installing Ollama, picking the right model for your hardware, setting up a ChatGPT-like web interface, and building a working RAG system that queries your own documents. All of it runs locally.
What You'll Need
- A reasonably modern computer -- 8 GB RAM minimum (16 GB recommended) with an SSD
- macOS, Linux, or Windows -- Ollama runs natively on all three
- Docker -- for Open WebUI (the chat interface)
- Python 3.10+ -- for the RAG pipeline
- A GPU is helpful but not required -- Ollama falls back to CPU inference automatically
Part 1: Install Ollama and Run Your First Model
Installation
macOS:
brew install ollama
Or download directly from ollama.com.
Linux:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
sudo systemctl enable ollama
Windows:
Download the installer from ollama.com and run it. The CLI gets added to your PATH automatically.
Run Your First Model
ollama run llama3.1:8b
That's it. Ollama downloads the model (about 4.9 GB) and drops you into an interactive chat. Type a question, get an answer -- all running on your machine.
Press Ctrl+D or type /bye to exit.
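The interactive prompt is just one way in. The same Ollama process also serves a local HTTP API on port 11434, which Part 4 builds on. As a quick sanity check, here's a minimal sketch that sends a single prompt to that API with Python's `requests` library -- it assumes the default port and that `llama3.1:8b` has already been pulled:
import requests

# One-shot, non-streaming request to Ollama's local API
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "In one sentence, what is a local LLM?",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])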
Essential Ollama Commands
| Command | What It Does |
|---|---|
| `ollama pull <model>` | Download a model without starting a chat |
| `ollama run <model>` | Download (if needed) and start an interactive session |
| `ollama list` | Show all downloaded models |
| `ollama ps` | Show currently loaded models and memory usage |
| `ollama stop <model>` | Unload a model from memory |
| `ollama rm <model>` | Delete a model from disk |
| `ollama show <model>` | Show model details and parameters |
Part 2: Picking the Right Model for Your Hardware
This is where most guides fail -- they recommend a model without considering what you're actually running it on. Here's a practical breakdown based on real RAM constraints:
If You Have 8 GB RAM (Entry Level)
| Model | Tag | Size | Best For |
|---|---|---|---|
| Llama 3.2 | `llama3.2:3b` | 2 GB | Fast chat, lightweight tasks |
| DeepSeek-R1 (distill) | `deepseek-r1:1.5b` | 1 GB | Basic reasoning on minimal hardware |
| Llama 3.1 | `llama3.1:8b` | 4.9 GB | Solid all-rounder (tight fit at 8 GB) |
ollama pull llama3.1:8b
If You Have 16 GB RAM (Sweet Spot)
| Model | Tag | Size | Best For |
|---|---|---|---|
| DeepSeek-R1 (distill) | `deepseek-r1:14b` | 9 GB | Strong reasoning and math |
| Qwen 2.5 Coder | `qwen2.5-coder:7b` | 4.7 GB | Code completion and generation |
| Phi-4 | `phi4:14b` | 9 GB | Reasoning, compact but powerful |
| Mistral Small 3.1 | `mistral-small:24b` | 15 GB | Fast general purpose (tight fit at 16 GB) |
ollama pull deepseek-r1:14b
If You Have 32 GB RAM (Power User)
| Model | Tag | Size | Best For |
|---|---|---|---|
| DeepSeek-R1 (distill) | `deepseek-r1:32b` | 20 GB | Near o1-mini reasoning quality |
| Qwen 2.5 Coder | `qwen2.5-coder:32b` | 20 GB | GPT-4o level coding |
| Gemma 3 | `gemma3:27b` | 17 GB | Multimodal, Google-quality |
ollama pull deepseek-r1:32b
If You Have 64 GB+ RAM (Workstation)
| Model | Tag | Size | Best For |
|---|---|---|---|
| DeepSeek-R1 (distill) | `deepseek-r1:70b` | 43 GB | Top-tier local reasoning |
| Llama 3.3 | `llama3.3:70b` | 43 GB | Best general-purpose local model |
ollama pull llama3.3:70b
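Not sure which tier you're in? Here's a rough sketch that maps total system RAM to the picks in the tables above -- it assumes `psutil` is installed (`pip install psutil`), and the thresholds are this guide's recommendations, not anything Ollama enforces:
import psutil

# Map total system RAM to a reasonable starting model from the tables above
ram_gb = psutil.virtual_memory().total / (1024 ** 3)

if ram_gb >= 64:
    suggestion = "llama3.3:70b"
elif ram_gb >= 32:
    suggestion = "deepseek-r1:32b"
elif ram_gb >= 16:
    suggestion = "deepseek-r1:14b"
else:
    suggestion = "llama3.2:3b"

print(f"{ram_gb:.0f} GB RAM detected -> try: ollama pull {suggestion}")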
My Recommendation
Start with `deepseek-r1:14b` if you have 16 GB RAM. It punches well above its weight on reasoning tasks and code generation. If you have 32 GB, `deepseek-r1:32b` is a clear step further -- DeepSeek's published benchmarks show the 32B distill matching or beating OpenAI's o1-mini, and it runs entirely on your laptop.

For coding specifically, `qwen2.5-coder:7b` is excellent as a secondary model for fast inline suggestions.
A Note on Quantization
When you pull a model from Ollama, you're getting a quantized version -- the model weights are compressed from 16-bit to 4-bit, which dramatically reduces file size and RAM usage with minimal quality loss. The default quantization (Q4_K_M) is the sweet spot for almost everyone:
| Format | RAM Savings | Quality Impact |
|---|---|---|
| Q4_K_M (default) | ~75% less than full | Small, excellent balance |
| Q5_K_M | ~65% less | Very small impact |
| Q8_0 | ~50% less | Nearly indistinguishable from full |
You don't need to think about this unless you want to -- the defaults are good. But if you have extra RAM and want better quality:
ollama pull llama3.1:8b-instruct-q8_0
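To see what's actually on disk -- model names, sizes, and the reported quantization level -- you can query Ollama's local API. A small sketch against the `/api/tags` endpoint, assuming Ollama is running on the default port:
import requests

# List locally installed models with their size and quantization level
models = requests.get("http://localhost:11434/api/tags", timeout=10).json()["models"]
for m in models:
    size_gb = m["size"] / (1024 ** 3)
    quant = m.get("details", {}).get("quantization_level", "unknown")
    print(f"{m['name']:<35} {size_gb:5.1f} GB  {quant}")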
Part 3: Set Up Open WebUI (The Chat Interface)
Running models from the terminal is fine for quick queries, but for daily use you'll want a proper interface. Open WebUI gives you a self-hosted ChatGPT-like experience that connects directly to your local Ollama models.
Install with Docker
Make sure Ollama is running, then:
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser. Create an account on first visit (the first account becomes admin). All data stays local in the Docker volume.
Single-User Mode (Skip Login)
If this is a personal setup and you don't want to deal with accounts:
docker run -d -p 3000:8080 \
-e WEBUI_AUTH=False \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
What You Get
- Model switching -- toggle between all your downloaded Ollama models from a dropdown
- Conversation history -- all chats are saved locally
- Document upload -- drag PDFs or text files into a chat for in-session RAG
- System prompt editor -- customize each model's behavior
- Multiple users -- if you want to share the setup with family or teammates
Docker Compose (Ollama + Open WebUI Together)
If you want both services managed together:
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - openwebui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: always

volumes:
  ollama_data:
  openwebui_data:
docker compose up -d
If you have an NVIDIA GPU and want GPU acceleration inside Docker, add this to the ollama service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
Part 4: Connect Ollama to Your Dev Tools
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. This means any tool that supports the OpenAI API can use your local models as a drop-in replacement.
Connect to Cursor
- Open Settings > Models > OpenAI API
- Set Base URL: `http://localhost:11434/v1` (the `/v1` is required)
- Set API Key: `ollama` (any non-empty string works -- Ollama ignores it)
- Select your model
Now Cursor uses your local model instead of sending code to the cloud.
Connect to VS Code (Continue Extension)
Install the Continue extension, then edit ~/.continue/config.yaml:
models:
  - name: DeepSeek R1 14B
    provider: ollama
    model: deepseek-r1:14b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
  - name: Qwen Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - autocomplete
The trick is to use a small, fast model for autocomplete (inline suggestions need to be instant) and a larger model for chat and editing (where you can wait a few seconds).
Use From Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what RAG is in one paragraph."},
    ],
)

print(response.choices[0].message.content)
Because Ollama speaks the OpenAI protocol, any library or tool built for OpenAI works with zero code changes -- just point the base URL to localhost.
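Streaming works through the same compatibility layer. A short sketch, again with the OpenAI SDK pointed at localhost, that prints tokens as they arrive:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")

# stream=True yields chunks as the model generates them
stream = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()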
Part 5: Build a RAG Pipeline Over Your Own Documents
This is where it gets powerful. RAG (Retrieval-Augmented Generation) lets you ask an LLM questions that it answers using your own documents as the source of truth -- not its training data, not the internet, your files.
The pipeline looks like this:
┌──────────────┐     ┌──────────┐     ┌──────────────┐     ┌──────────┐
│  Your Docs   │────►│  Chunk   │────►│   Embed &    │────►│ ChromaDB │
│  (PDF, MD)   │     │  Text    │     │  Vectorize   │     │ (Vector  │
└──────────────┘     └──────────┘     └──────────────┘     │  Store)  │
                                                           └────┬─────┘
                                                                │
┌──────────────┐     ┌──────────┐     ┌──────────────┐          │
│    Answer    │◄────│   LLM    │◄────│   Relevant   │◄─────────┘
│              │     │ (Ollama) │     │   Chunks     │
└──────────────┘     └──────────┘     └──────────────┘
                          ▲
                          │
                     ┌────┴─────┐
                     │   Your   │
                     │ Question │
                     └──────────┘
- Load your documents (PDFs, markdown, text files)
- Chunk them into smaller pieces (paragraphs or sections)
- Embed each chunk into a vector (a numeric representation of meaning)
- Store the vectors in ChromaDB
- At query time: embed your question, find the most similar chunks, pass them to the LLM along with your question
Step 1: Install Dependencies
pip install langchain langchain-ollama langchain-community \
langchain-text-splitters chromadb \
unstructured "unstructured[all-docs]"
Step 2: Pull an Embedding Model
You need a separate small model specifically for turning text into vectors:
ollama pull nomic-embed-text
This is a fast, 137M-parameter embedding model that produces 768-dimensional vectors. It beats OpenAI's text-embedding-ada-002 on standard retrieval benchmarks and runs entirely locally.
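If you want to see what the embedding model produces before wiring up the full pipeline, here's a quick sketch using the same `OllamaEmbeddings` class the script below relies on. It embeds three strings and compares them with cosine similarity -- the same kind of comparison ChromaDB performs for you during retrieval:
import math

from langchain_ollama import OllamaEmbeddings

embedder = OllamaEmbeddings(model="nomic-embed-text")

a = embedder.embed_query("How do I rotate my API keys?")
b = embedder.embed_query("Steps for credential rotation")
c = embedder.embed_query("Best pizza toppings")

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm

print(len(a))        # vector dimension
print(cosine(a, b))  # related questions -> higher similarity
print(cosine(a, c))  # unrelated question -> lower similarity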
Step 3: The Complete RAG Script
Create a file called rag.py:
import os
import sys

from langchain_community.document_loaders import DirectoryLoader, UnstructuredPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# --- Configuration ---
DOCS_DIR = "./documents" # Put your PDFs and files here
CHROMA_DIR = "./chroma_db" # Where vectors are persisted
EMBED_MODEL = "nomic-embed-text" # Embedding model
CHAT_MODEL = "llama3.1:8b" # Chat model (adjust to your hardware)
CHUNK_SIZE = 1000 # Characters per chunk
CHUNK_OVERLAP = 200 # Overlap between chunks

def build_vector_store():
    """Load documents, chunk, embed, and store in ChromaDB."""
    print(f"Loading documents from {DOCS_DIR}/ ...")

    # Load PDFs
    pdf_loader = DirectoryLoader(
        DOCS_DIR,
        glob="**/*.pdf",
        loader_cls=UnstructuredPDFLoader,
    )

    # Load text and markdown files (DirectoryLoader globs don't support
    # brace expansion, so use one loader per extension)
    txt_loader = DirectoryLoader(DOCS_DIR, glob="**/*.txt")
    md_loader = DirectoryLoader(DOCS_DIR, glob="**/*.md")

    documents = pdf_loader.load() + txt_loader.load() + md_loader.load()
    print(f"Loaded {len(documents)} documents.")

    if not documents:
        print("No documents found. Add files to the documents/ directory.")
        return None

    # Chunk the documents
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
    )
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks.")

    # Embed and store
    print("Embedding chunks (this may take a few minutes on first run)...")
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=OllamaEmbeddings(model=EMBED_MODEL),
        collection_name="my_documents",
        persist_directory=CHROMA_DIR,
    )
    print("Vector store built and persisted.")
    return vector_store

def load_vector_store():
    """Load an existing vector store from disk."""
    return Chroma(
        collection_name="my_documents",
        embedding_function=OllamaEmbeddings(model=EMBED_MODEL),
        persist_directory=CHROMA_DIR,
    )

def format_docs(docs):
    """Join retrieved chunks into a single context string for the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)

def query_documents(vector_store, question):
    """Run a RAG query against the vector store."""
    # Set up retriever (find the 5 most relevant chunks)
    retriever = vector_store.as_retriever(
        search_kwargs={"k": 5}
    )

    # Set up the LLM
    llm = ChatOllama(model=CHAT_MODEL)

    # The prompt template
    prompt = ChatPromptTemplate.from_template(
        "You are a helpful assistant. Answer the question using ONLY the "
        "provided context. If the context doesn't contain enough information "
        "to answer, say so.\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}"
    )

    # Build the RAG chain: retrieve -> format -> prompt -> LLM -> plain text
    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain.invoke(question)
if __name__ == "__main__":
import os
if not os.path.exists(DOCS_DIR):
os.makedirs(DOCS_DIR)
print(f"Created {DOCS_DIR}/ directory. Add your documents there.")
print("Then run: python rag.py --build")
sys.exit(0)
if "--build" in sys.argv:
build_vector_store()
else:
# Load existing store and enter query loop
store = load_vector_store()
print(f"\nRAG system ready. Using {CHAT_MODEL} for answers.")
print("Type your questions (Ctrl+C to exit):\n")
while True:
try:
question = input("Q: ").strip()
if not question:
continue
print("\nSearching documents and generating answer...\n")
answer = query_documents(store, question)
print(f"A: {answer}\n")
except KeyboardInterrupt:
print("\nGoodbye.")
break
Step 4: Use It
# 1. Create the documents directory and add your files
mkdir -p documents
cp ~/my-notes/*.pdf documents/
cp ~/my-notes/*.md documents/
# 2. Build the vector store (one-time, or when you add new documents)
python rag.py --build
# 3. Start asking questions
python rag.py
RAG system ready. Using llama3.1:8b for answers.
Type your questions (Ctrl+C to exit):
Q: What were the key decisions from last quarter's architecture review?
Searching documents and generating answer...
A: Based on the architecture review document from Q4, the key decisions were:
1. Migration from REST to GraphQL for the mobile API...
The model is answering from your documents, not from its training data. And nothing left your machine.
Tuning Tips
Chunk size matters. Smaller chunks (500-800 characters) give more precise retrieval but less context per chunk. Larger chunks (1200-2000) give more context but can dilute relevance. Start with 1000 and adjust based on your document type:
| Document Type | chunk_size | chunk_overlap |
|---|---|---|
| General articles, notes | 1000 | 200 |
| Technical docs, code | 600 | 100 |
| Legal or dense prose | 1500 | 400 |
More retrieved chunks = more context, but slower generation. The `search_kwargs={"k": 5}` setting means we retrieve the 5 most relevant chunks. For complex questions spanning multiple documents, try `k=10`. For simple factual lookups, `k=3` is enough.
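When retrieval looks off, inspect what's actually coming back before blaming the model. A small debugging sketch using Chroma's `similarity_search_with_score`, assuming the vector store from `rag.py` has already been built:
from rag import load_vector_store

store = load_vector_store()

# Chroma returns a distance score: lower means a closer match
results = store.similarity_search_with_score(
    "What were the key decisions from the architecture review?", k=5
)
for doc, score in results:
    source = doc.metadata.get("source", "unknown")
    print(f"{score:.3f}  {source}  {doc.page_content[:80]!r}")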
Rebuild when you add documents. Run python rag.py --build again whenever you add new files. ChromaDB persists to disk, so you only need to rebuild when your source documents change.
Part 6: Performance Tuning
Free Up GPU Memory After One-Off Queries
Ollama keeps models loaded in memory for 5 minutes by default. If you need that memory back immediately:
curl http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "keep_alive": 0}'
Increase Context Window
By default, most models use a 2048 or 4096 token context. For RAG, you often need more:
export OLLAMA_CONTEXT_LENGTH=16384
Or set it per-model in a Modelfile:
FROM llama3.1:8b
PARAMETER num_ctx 16384
ollama create llama3.1-16k -f ./Modelfile
ollama run llama3.1-16k
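If you only need the bigger window for the RAG pipeline, you can also set it per model instance instead of baking it into a Modelfile -- `ChatOllama` accepts a `num_ctx` option. A one-line sketch of the change in `rag.py`:
from langchain_ollama import ChatOllama

# In rag.py, replace ChatOllama(model=CHAT_MODEL) with something like:
llm = ChatOllama(model="llama3.1:8b", num_ctx=16384)  # 16K context for this instance only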
KV Cache Quantization (Save Memory on Long Contexts)
Context memory grows linearly with context length. An 8B model uses roughly 0.3 GB of KV cache at a 2K context but around 5 GB at 32K. KV cache quantization cuts this roughly in half (note that it currently requires flash attention, enabled with `OLLAMA_FLASH_ATTENTION=1`):
export OLLAMA_KV_CACHE_TYPE=q8_0
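Those numbers are easy to sanity-check. A rough back-of-the-envelope sketch, assuming Llama 3.1 8B's layout (32 layers, 8 KV heads, head dimension 128) and an fp16 cache -- real usage varies with Ollama's settings:
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # bytes cached per token

for ctx in (2048, 8192, 32768):
    gb = per_token * ctx / (1024 ** 3)
    print(f"{ctx:>6} tokens: {gb:.2f} GB fp16, {gb / 2:.2f} GB with q8_0")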
Move Model Storage to a Larger Drive
Models live at ~/.ollama/models by default. If your boot drive is small:
export OLLAMA_MODELS=/Volumes/BigDrive/ollama-models
Create Custom Model Presets
A Modelfile lets you bake in a system prompt and parameters:
FROM deepseek-r1:14b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM """You are a senior Python engineer. Always provide working code
with type hints and docstrings. Be concise."""
ollama create python-expert -f ./Modelfile
ollama run python-expert
Cost Breakdown
| Component | Cost | Notes |
|---|---|---|
| Ollama | Free | MIT licensed, open source |
| Open WebUI | Free | Open source, self-hosted |
| LangChain + ChromaDB | Free | Open source |
| Models | Free | Open-weight models, no API fees |
| Hardware | What you already own | Runs on any 8 GB+ machine |
| Total | $0 | Forever. No subscriptions, no per-token fees |
The only cost is the electricity to run inference and the initial time to set it up. Once running, you have a private AI system with zero ongoing costs.
When to Use Local vs Cloud AI
Local LLMs aren't a replacement for Claude or GPT-4o in every scenario. Here's a practical decision framework:
| Scenario | Use Local | Use Cloud |
|---|---|---|
| Querying proprietary documents | Yes | Risky |
| Code with trade secrets | Yes | Risky |
| General coding assistance | Either | Yes (better quality) |
| Complex reasoning, long analysis | Only with 32B+ models | Yes |
| Working offline or on planes | Yes | No |
| Regulated data (HIPAA, GDPR) | Yes | Requires enterprise agreements |
| Quick throwaway scripts | Either | Yes (faster, smarter) |
| Learning and experimentation | Yes (free, unlimited) | Limited by cost |
The ideal setup for most developers: use cloud AI (Claude Code, Cursor) for general work, and local models for anything sensitive. They complement each other.
Troubleshooting
Model runs very slowly
- Check `ollama ps` -- if the model is running on CPU only, the processor column will show CPU instead of GPU
- On a Mac with Apple Silicon: Ollama uses Metal (GPU) automatically. If it's slow, check Activity Monitor for memory pressure
- On Linux/Windows with NVIDIA: ensure you have CUDA drivers installed (`nvidia-smi` should work)
- Try a smaller model -- 7B models are 5-10x faster than 70B on the same hardware
Out of memory errors
- Use a smaller model or a more aggressively quantized version
- Close other memory-hungry applications
- Reduce context length: `export OLLAMA_CONTEXT_LENGTH=4096`
- Enable KV cache quantization: `export OLLAMA_KV_CACHE_TYPE=q4_0`
Open WebUI can't connect to Ollama
- Make sure Ollama is running: `ollama ps` or `curl http://localhost:11434`
- The `--add-host=host.docker.internal:host-gateway` flag in the Docker command is essential -- it lets the container reach Ollama on the host machine
- If using Docker Compose, use the service name (`http://ollama:11434`), not localhost
RAG returns irrelevant results
- Your chunks might be too large -- try reducing `chunk_size` to 500-800
- Increase `k` (the number of retrieved chunks) to cast a wider net
- Make sure your documents actually contain the information you're asking about
- Try a different embedding model: `ollama pull mxbai-embed-large` (slower but higher quality) -- update `EMBED_MODEL` in the script and rebuild the vector store afterwards
What's Next
Once your local AI setup is running:
- Add more document types -- LangChain supports Word docs, HTML pages, Notion exports, and dozens more formats
- Build a web frontend -- Use Streamlit or Gradio to create a custom interface for your RAG system (see the sketch below)
- Experiment with different models -- Swap `CHAT_MODEL` in the RAG script to compare how different models answer the same questions
- Set up scheduled ingestion -- Cron job to re-index documents from a shared folder, keeping your RAG system current
- Connect to your dev tools -- Use Ollama as the backend for Cursor, Continue, or any OpenAI-compatible tool for fully private coding assistance
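For the web frontend idea above, here's a minimal Gradio sketch that wraps the functions from `rag.py` -- it assumes `pip install gradio` and an already-built vector store:
import gradio as gr

from rag import load_vector_store, query_documents

store = load_vector_store()

def ask(question):
    """Run a RAG query and return the answer text."""
    return query_documents(store, question)

demo = gr.Interface(
    fn=ask,
    inputs=gr.Textbox(label="Ask your documents"),
    outputs=gr.Textbox(label="Answer"),
    title="Local RAG",
)
demo.launch()  # serves on http://127.0.0.1:7860 by default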
The era of "you need a cloud API for useful AI" is over. A 14B-parameter model running on a $1,000 laptop in 2026 rivals what a paid GPT-4 subscription delivered two years ago. The models are only getting better, and the hardware requirements keep coming down.
Looking for a cloud-based AI agent instead? Check out our OpenClaw Setup Guide for a 24/7 personal AI assistant on EC2, or our Claude Code Workflow Guide for terminal-based AI coding.