Your Own Private AI: Run a Local LLM with Ollama and Build a RAG System Over Your Documents

 

Every time you paste code into ChatGPT or Claude, that code leaves your machine and lands on someone else's server. For personal projects, that's fine. But for proprietary algorithms, internal business logic, client data, or anything regulated by HIPAA, GDPR, or SOC 2 -- it's a real problem.

The good news: in 2026, you can run genuinely capable AI models entirely on your own hardware. No API keys. No cloud. No data leaving your machine. And with a RAG (Retrieval-Augmented Generation) pipeline, you can make that local model answer questions grounded in your own documents -- PDFs, markdown files, code, internal wikis, whatever you have.

This guide walks you through the entire setup: installing Ollama, picking the right model for your hardware, setting up a ChatGPT-like web interface, and building a working RAG system that queries your own documents. All of it runs locally.


What You'll Need

  • A reasonably modern computer -- 8 GB RAM minimum (16 GB recommended) with an SSD
  • macOS, Linux, or Windows -- Ollama runs natively on all three
  • Docker -- for Open WebUI (the chat interface)
  • Python 3.10+ -- for the RAG pipeline
  • A GPU is helpful but not required -- Ollama falls back to CPU inference automatically

Part 1: Install Ollama and Run Your First Model

Installation

macOS:

brew install ollama

Or download directly from ollama.com.

Linux:

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
sudo systemctl enable ollama

Windows:

Download the installer from ollama.com and run it. The CLI gets added to your PATH automatically.

Run Your First Model

ollama run llama3.1:8b

That's it. Ollama downloads the model (about 4.9 GB) and drops you into an interactive chat. Type a question, get an answer -- all running on your machine.

Press Ctrl+D or type /bye to exit.

Essential Ollama Commands

Command               What It Does
ollama pull <model>   Download a model without starting a chat
ollama run <model>    Download (if needed) and start an interactive session
ollama list           Show all downloaded models
ollama ps             Show currently loaded models and memory usage
ollama stop <model>   Unload a model from memory
ollama rm <model>     Delete a model from disk
ollama show <model>   Show model details and parameters
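
The same information the CLI shows is also available over Ollama's local HTTP API on port 11434, which is handy for scripting. A small Python sketch (using only the requests library) that mirrors ollama list:

import requests

# /api/tags lists locally downloaded models -- the same data as `ollama list`
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()

for model in resp.json().get("models", []):
    print(f"{model['name']}  {model['size'] / 1e9:.1f} GB")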

Part 2: Picking the Right Model for Your Hardware

This is where most guides fail -- they recommend a model without considering what you're actually running it on. Here's a practical breakdown based on real RAM constraints:

If You Have 8 GB RAM (Entry Level)

Model                   Tag                Size     Best For
Llama 3.2               llama3.2:3b        2 GB     Fast chat, lightweight tasks
DeepSeek-R1 (distill)   deepseek-r1:1.5b   1 GB     Basic reasoning on minimal hardware
Llama 3.1               llama3.1:8b        4.9 GB   Solid all-rounder (tight fit at 8 GB)

ollama pull llama3.1:8b

If You Have 16 GB RAM (Sweet Spot)

Model                   Tag                 Size     Best For
DeepSeek-R1 (distill)   deepseek-r1:14b     9 GB     Strong reasoning and math
Qwen 2.5 Coder          qwen2.5-coder:7b    4.7 GB   Code completion and generation
Phi-4                   phi4:14b            9 GB     Reasoning, compact but powerful
Mistral Small 3.1       mistral-small:24b   15 GB    Fast general purpose

ollama pull deepseek-r1:14b

If You Have 32 GB RAM (Power User)

Model                   Tag                 Size    Best For
DeepSeek-R1 (distill)   deepseek-r1:32b     20 GB   Near o1-mini reasoning quality
Qwen 2.5 Coder          qwen2.5-coder:32b   20 GB   GPT-4o level coding
Gemma 3                 gemma3:27b          17 GB   Multimodal, Google-quality

ollama pull deepseek-r1:32b

If You Have 64 GB+ RAM (Workstation)

Model                   Tag               Size    Best For
DeepSeek-R1 (distill)   deepseek-r1:70b   43 GB   Top-tier local reasoning
Llama 3.3               llama3.3:70b      43 GB   Best general-purpose local model

ollama pull llama3.3:70b

My Recommendation

Start with deepseek-r1:14b if you have 16 GB RAM. It punches well above its weight on reasoning tasks and code generation. If you have 32 GB, deepseek-r1:32b is extraordinary -- DeepSeek reports it matching or beating OpenAI's o1-mini on most reasoning benchmarks, while running entirely on your laptop.

For coding specifically, qwen2.5-coder:7b is excellent as a secondary model for fast inline suggestions.

A Note on Quantization

When you pull a model from Ollama, you're getting a quantized version -- the model weights are compressed from 16-bit to 4-bit, which dramatically reduces file size and RAM usage with minimal quality loss. The default quantization (Q4_K_M) is the sweet spot for almost everyone:

Format             RAM Savings           Quality Impact
Q4_K_M (default)   ~75% less than full   Small, excellent balance
Q5_K_M             ~65% less             Very small impact
Q8_0               ~50% less             Nearly indistinguishable from full

You don't need to think about this unless you want to -- the defaults are good. But if you have extra RAM and want better quality, pull a higher-precision tag (each model's page on ollama.com lists the available quantizations):

ollama pull llama3.1:8b-instruct-q8_0

Part 3: Set Up Open WebUI (The Chat Interface)

Running models from the terminal is fine for quick queries, but for daily use you'll want a proper interface. Open WebUI gives you a self-hosted ChatGPT-like experience that connects directly to your local Ollama models.

Install with Docker

Make sure Ollama is running, then:

docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. Create an account on first visit (the first account becomes admin). All data stays local in the Docker volume.

Single-User Mode (Skip Login)

If this is a personal setup and you don't want to deal with accounts:

docker run -d -p 3000:8080 \
  -e WEBUI_AUTH=False \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

What You Get

  • Model switching -- toggle between all your downloaded Ollama models from a dropdown
  • Conversation history -- all chats are saved locally
  • Document upload -- drag PDFs or text files into a chat for in-session RAG
  • System prompt editor -- customize each model's behavior
  • Multiple users -- if you want to share the setup with family or teammates

Docker Compose (Ollama + Open WebUI Together)

If you want both services managed together:

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - openwebui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: always

volumes:
  ollama_data:
  openwebui_data:

docker compose up -d

If you have an NVIDIA GPU and want GPU acceleration inside Docker, add this to the ollama service:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Part 4: Connect Ollama to Your Dev Tools

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. This means any tool that supports the OpenAI API can use your local models as a drop-in replacement.

Connect to Cursor

  1. Open Settings > Models > OpenAI API
  2. Set Base URL: http://localhost:11434/v1 (the /v1 is required)
  3. Set API Key: ollama (any non-empty string works -- Ollama ignores it)
  4. Select your model

Now Cursor uses your local model instead of sending code to the cloud.

Connect to VS Code (Continue Extension)

Install the Continue extension, then edit ~/.continue/config.yaml:

models:
  - name: DeepSeek R1 14B
    provider: ollama
    model: deepseek-r1:14b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit

  - name: Qwen Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - autocomplete

The trick is to use a small, fast model for autocomplete (inline suggestions need to be instant) and a larger model for chat and editing (where you can wait a few seconds).

Use From Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what RAG is in one paragraph."},
    ]
)
print(response.choices[0].message.content)

Because Ollama speaks the OpenAI protocol, any library or tool built for OpenAI works with zero code changes -- just point the base URL to localhost.
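
Streaming works too. The same client, asked to stream, prints tokens as the local model generates them (the model tag is whatever you have pulled):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")

# stream=True yields partial chunks instead of one final message
stream = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Name three uses for a local LLM."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()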


Part 5: Build a RAG Pipeline Over Your Own Documents

This is where it gets powerful. RAG (Retrieval-Augmented Generation) lets you ask an LLM questions that it answers using your own documents as the source of truth -- not its training data, not the internet, your files.

The pipeline looks like this:

┌──────────────┐     ┌──────────┐     ┌──────────────┐     ┌──────────┐
│  Your Docs   │────►│  Chunk   │────►│   Embed &    │────►│ ChromaDB │
│  (PDF, MD)   │     │  Text    │     │   Vectorize  │     │ (Vector  │
└──────────────┘     └──────────┘     └──────────────┘     │  Store)  │
                                                           └────┬─────┘
                                                                │
┌──────────────┐     ┌──────────┐     ┌──────────────┐         │
│   Answer     │◄────│  LLM     │◄────│  Relevant    │◄────────┘
│              │     │ (Ollama) │     │  Chunks      │
└──────────────┘     └──────────┘     └──────────────┘
                                           ▲
                                           │
                                      ┌────┴─────┐
                                      │  Your    │
                                      │  Question│
                                      └──────────┘

  1. Load your documents (PDFs, markdown, text files)
  2. Chunk them into smaller pieces (paragraphs or sections)
  3. Embed each chunk into a vector (a numeric representation of meaning)
  4. Store the vectors in ChromaDB
  5. At query time: embed your question, find the most similar chunks, pass them to the LLM along with your question

Step 1: Install Dependencies

pip install langchain langchain-ollama langchain-community \
  langchain-text-splitters chromadb \
  "unstructured[all-docs]"

Step 2: Pull an Embedding Model

You need a separate small model specifically for turning text into vectors:

ollama pull nomic-embed-text

This is a fast, 137M-parameter embedding model that produces 768-dimensional vectors. It beats OpenAI's text-embedding-ada-002 on standard benchmarks and runs entirely locally.
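
You can sanity-check the embedding model before building the full pipeline. A quick sketch using langchain-ollama (installed in Step 1); it should print a 768-element vector:

from langchain_ollama import OllamaEmbeddings

embedder = OllamaEmbeddings(model="nomic-embed-text")

# Each piece of text becomes a fixed-length vector of floats
vector = embedder.embed_query("Ollama runs language models locally.")
print(len(vector))   # 768 for nomic-embed-text
print(vector[:5])    # first few components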

Step 3: The Complete RAG Script

Create a file called rag.py:

import sys
from langchain_community.document_loaders import DirectoryLoader, UnstructuredPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


# --- Configuration ---
DOCS_DIR = "./documents"          # Put your PDFs and files here
CHROMA_DIR = "./chroma_db"        # Where vectors are persisted
EMBED_MODEL = "nomic-embed-text"  # Embedding model
CHAT_MODEL = "llama3.1:8b"        # Chat model (adjust to your hardware)
CHUNK_SIZE = 1000                 # Characters per chunk
CHUNK_OVERLAP = 200               # Overlap between chunks


def build_vector_store():
    """Load documents, chunk, embed, and store in ChromaDB."""

    print(f"Loading documents from {DOCS_DIR}/ ...")

    # Load PDFs
    pdf_loader = DirectoryLoader(
        DOCS_DIR,
        glob="**/*.pdf",
        loader_cls=UnstructuredPDFLoader,
    )
    # Load text and markdown files (the glob pattern doesn't support brace
    # expansion like {txt,md}, so use one loader per extension)
    txt_loader = DirectoryLoader(
        DOCS_DIR,
        glob="**/*.txt",
    )
    md_loader = DirectoryLoader(
        DOCS_DIR,
        glob="**/*.md",
    )

    documents = pdf_loader.load() + txt_loader.load() + md_loader.load()
    print(f"Loaded {len(documents)} documents.")

    if not documents:
        print("No documents found. Add files to the documents/ directory.")
        return None

    # Chunk the documents
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
    )
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks.")

    # Embed and store
    print("Embedding chunks (this may take a few minutes on first run)...")
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=OllamaEmbeddings(model=EMBED_MODEL),
        collection_name="my_documents",
        persist_directory=CHROMA_DIR,
    )
    print("Vector store built and persisted.")
    return vector_store


def load_vector_store():
    """Load an existing vector store from disk."""
    return Chroma(
        collection_name="my_documents",
        embedding_function=OllamaEmbeddings(model=EMBED_MODEL),
        persist_directory=CHROMA_DIR,
    )


def query_documents(vector_store, question):
    """Run a RAG query against the vector store."""

    # Set up retriever (find the 5 most relevant chunks)
    retriever = vector_store.as_retriever(
        search_kwargs={"k": 5}
    )

    # Set up the LLM
    llm = ChatOllama(model=CHAT_MODEL)

    # The prompt template
    prompt = ChatPromptTemplate.from_template(
        "You are a helpful assistant. Answer the question using ONLY the "
        "provided context. If the context doesn't contain enough information "
        "to answer, say so.\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}"
    )

    # Join the retrieved chunks into plain text before they go into the prompt
    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    # Build the RAG chain
    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    return chain.invoke(question)


if __name__ == "__main__":
    import os

    if not os.path.exists(DOCS_DIR):
        os.makedirs(DOCS_DIR)
        print(f"Created {DOCS_DIR}/ directory. Add your documents there.")
        print("Then run: python rag.py --build")
        sys.exit(0)

    if "--build" in sys.argv:
        build_vector_store()
    else:
        # Load existing store and enter query loop
        store = load_vector_store()
        print(f"\nRAG system ready. Using {CHAT_MODEL} for answers.")
        print("Type your questions (Ctrl+C to exit):\n")

        while True:
            try:
                question = input("Q: ").strip()
                if not question:
                    continue
                print("\nSearching documents and generating answer...\n")
                answer = query_documents(store, question)
                print(f"A: {answer}\n")
            except KeyboardInterrupt:
                print("\nGoodbye.")
                break

Step 4: Use It

# 1. Create the documents directory and add your files
mkdir -p documents
cp ~/my-notes/*.pdf documents/
cp ~/my-notes/*.md documents/

# 2. Build the vector store (one-time, or when you add new documents)
python rag.py --build

# 3. Start asking questions
python rag.py
RAG system ready. Using llama3.1:8b for answers.
Type your questions (Ctrl+C to exit):

Q: What were the key decisions from last quarter's architecture review?

Searching documents and generating answer...

A: Based on the architecture review document from Q4, the key decisions were:
1. Migration from REST to GraphQL for the mobile API...

The model is answering from your documents, not from its training data. And nothing left your machine.
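
If you also want to see which files an answer came from, query the retriever directly and print each chunk's source metadata. A small sketch reusing load_vector_store from rag.py (the question is just an example):

from rag import load_vector_store

# Show which chunks (and source files) would be handed to the LLM
store = load_vector_store()
retriever = store.as_retriever(search_kwargs={"k": 5})

for doc in retriever.invoke("What were the key decisions from the architecture review?"):
    source = doc.metadata.get("source", "unknown")
    preview = doc.page_content[:80].replace("\n", " ")
    print(f"{source}: {preview}...")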

Tuning Tips

Chunk size matters. Smaller chunks (500-800) give more precise retrieval but less context per chunk. Larger chunks (1200-2000) give more context but might dilute relevance. Start with 1000 and adjust based on your document type:

Document Type             chunk_size   chunk_overlap
General articles, notes   1000         200
Technical docs, code      600          100
Legal or dense prose      1500         400

Retrieving more chunks gives the model more context but makes queries slower. The search_kwargs={"k": 5} setting retrieves the 5 most relevant chunks. For complex questions spanning multiple documents, try k=10; for simple factual lookups, k=3 is enough.
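
Both knobs live in the as_retriever call inside query_documents. A couple of variants (the MMR option and the values shown are just starting points):

# Cast a wider net for questions that span several documents
retriever = vector_store.as_retriever(search_kwargs={"k": 10})

# Maximal Marginal Relevance: fetch 20 candidates, keep the 5 most diverse
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20},
)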

Rebuild when you add documents. Run python rag.py --build again whenever you add new files (delete the chroma_db/ directory first if you want a clean rebuild -- otherwise the new chunks are appended to whatever is already in the collection). ChromaDB persists to disk, so you only need to touch it when your source documents change.
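
A full rebuild is the simplest approach. For a large corpus you can also append only the new files to the existing store -- a sketch assuming new_chunks holds documents you've already loaded and split the same way build_vector_store does:

from rag import load_vector_store

# Append new chunks to the existing Chroma collection instead of rebuilding
store = load_vector_store()
store.add_documents(new_chunks)   # new_chunks: Documents already loaded and split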


Part 6: Performance Tuning

Free Up GPU Memory After One-Off Queries

Ollama keeps models loaded in memory for 5 minutes by default. If you need that memory back immediately, run ollama stop <model>, or do it over the API:

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "keep_alive": 0}'

Increase Context Window

By default, most models run with a 2048 or 4096 token context window. For RAG, you often need more. The variable below is read by the Ollama server, so set it in the server's environment and restart:

export OLLAMA_CONTEXT_LENGTH=16384

Or set it per-model in a Modelfile:

FROM llama3.1:8b
PARAMETER num_ctx 16384

ollama create llama3.1-16k -f ./Modelfile
ollama run llama3.1-16k
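
If you're using the RAG script from Part 5, a third option is to set the window per model instance by passing num_ctx to ChatOllama (a sketch; make sure your RAM can hold the larger cache):

from langchain_ollama import ChatOllama

# Larger context window so more retrieved chunks fit into a single prompt
llm = ChatOllama(model="llama3.1:8b", num_ctx=16384)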

KV Cache Quantization (Save Memory on Long Contexts)

Context memory grows linearly with conversation length. An 8B model uses ~0.3 GB at 2K context but ~5 GB at 32K. Quantizing the KV cache roughly cuts this in half (it takes effect only when flash attention is enabled):

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0

Move Model Storage to a Larger Drive

Models live at ~/.ollama/models by default. If your boot drive is small:

export OLLAMA_MODELS=/Volumes/BigDrive/ollama-models

Create Custom Model Presets

A Modelfile lets you bake in a system prompt and parameters:

FROM deepseek-r1:14b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192

SYSTEM """You are a senior Python engineer. Always provide working code
with type hints and docstrings. Be concise."""

ollama create python-expert -f ./Modelfile
ollama run python-expert

Cost Breakdown

Component              Cost                   Notes
Ollama                 Free                   MIT licensed, open source
Open WebUI             Free                   Open source, self-hosted
LangChain + ChromaDB   Free                   Open source
Models                 Free                   Open-weight models, no API fees
Hardware               What you already own   Runs on any 8 GB+ machine
Total                  $0                     Forever. No subscriptions, no per-token fees

The only cost is the electricity to run inference and the initial time to set it up. Once running, you have a private AI system with zero ongoing costs.


When to Use Local vs Cloud AI

Local LLMs aren't a replacement for Claude or GPT-4o in every scenario. Here's a practical decision framework:

Scenario                           Use Local               Use Cloud
Querying proprietary documents     Yes                     Risky
Code with trade secrets            Yes                     Risky
General coding assistance          Either                  Yes (better quality)
Complex reasoning, long analysis   Only with 32B+ models   Yes
Working offline or on planes       Yes                     No
Regulated data (HIPAA, GDPR)       Yes                     Requires enterprise agreements
Quick throwaway scripts            Either                  Yes (faster, smarter)
Learning and experimentation       Yes (free, unlimited)   Limited by cost

The ideal setup for most developers: use cloud AI (Claude Code, Cursor) for general work, and local models for anything sensitive. They complement each other.


Troubleshooting

Model runs very slowly

  • Check ollama ps -- the PROCESSOR column shows whether the model is running on CPU, GPU, or split across both
  • On Mac with Apple Silicon: Ollama uses Metal (GPU) automatically. If slow, check Activity Monitor for memory pressure
  • On Linux/Windows with NVIDIA: ensure you have CUDA drivers installed (nvidia-smi should work)
  • Try a smaller model -- the 7B models are 5-10x faster than 70B on the same hardware

Out of memory errors

  • Use a smaller model or a more aggressively quantized version
  • Close other memory-hungry applications
  • Reduce context length: export OLLAMA_CONTEXT_LENGTH=4096
  • Enable KV cache quantization: export OLLAMA_KV_CACHE_TYPE=q4_0

Open WebUI can't connect to Ollama

  • Make sure Ollama is running: ollama ps or curl http://localhost:11434
  • The --add-host=host.docker.internal:host-gateway flag in the Docker command is essential -- it lets the container reach Ollama on the host machine
  • If using Docker Compose, use the service name (http://ollama:11434) not localhost

RAG returns irrelevant results

  • Your chunks might be too large -- try reducing chunk_size to 500-800
  • Increase k (number of retrieved chunks) to cast a wider net
  • Make sure your documents actually contain the information you're asking about
  • Try a different embedding model: ollama pull mxbai-embed-large (slower but higher quality) -- then rebuild the vector store, since embeddings from different models aren't interchangeable
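
When retrieval looks off, it also helps to inspect the raw similarity scores Chroma assigns. A minimal sketch against the same store rag.py uses (scores are distances, so lower generally means a closer match):

from rag import load_vector_store

# Print the chunks Chroma considers closest to a question, with distance scores
store = load_vector_store()
for doc, score in store.similarity_search_with_score("your question here", k=5):
    print(f"score={score:.3f}  source={doc.metadata.get('source')}")
    print(doc.page_content[:120], "\n")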

What's Next

Once your local AI setup is running:

  • Add more document types -- LangChain supports Word docs, HTML pages, Notion exports, and dozens more formats
  • Build a web frontend -- Use Streamlit or Gradio to create a custom interface for your RAG system (a minimal sketch follows this list)
  • Experiment with different models -- Swap CHAT_MODEL in the RAG script to compare how different models answer the same questions
  • Set up scheduled ingestion -- Cron job to re-index documents from a shared folder, keeping your RAG system current
  • Connect to your dev tools -- Use Ollama as the backend for Cursor, Continue, or any OpenAI-compatible tool for fully private coding assistance
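
For the web frontend idea above, here's a minimal Gradio sketch that wraps the functions from rag.py (gradio is an extra dependency: pip install gradio; the import works because rag.py keeps its CLI code behind the __main__ guard):

import gradio as gr
from rag import load_vector_store, query_documents

store = load_vector_store()

def answer(question):
    # Reuse the same RAG chain the command-line script uses
    return query_documents(store, question)

# Serves a simple question box at http://localhost:7860 -- still fully local
gr.Interface(fn=answer, inputs="text", outputs="text", title="Ask my documents").launch()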

The era of "you need a cloud API for useful AI" is over. A 14B parameter model running on a $1,000 laptop in 2026 outperforms what a $100/month GPT-4 subscription could do two years ago. The models are only getting better, and the hardware requirements are only going down.

Looking for a cloud-based AI agent instead? Check out our OpenClaw Setup Guide for a 24/7 personal AI assistant on EC2, or our Claude Code Workflow Guide for terminal-based AI coding.




