DeepSeek R1 Local Setup: Run It with Ollama in Minutes
DeepSeek R1 matches OpenAI's o1 on math and reasoning benchmarks. It's open-weight. It runs on your laptop. And it costs exactly zero dollars per month.
That's not marketing copy -- it's the state of open-source AI in 2026. While the rest of the industry debates API pricing tiers and rate limits, DeepSeek dropped a 671-billion-parameter reasoning model under an MIT license and said "have fun." The distilled versions -- ranging from 1.5B to 70B parameters -- bring that same reasoning DNA to machines as modest as a MacBook Air.
This guide gets you from zero to a working DeepSeek R1 setup in under ten minutes. We'll cover which model size fits your hardware, how to use it from the terminal and Python, how to connect it to your dev tools, and what to do when things go sideways.
🧰 What You'll Need
- A computer with 8 GB+ RAM -- 16 GB is the sweet spot, 32 GB unlocks the best distilled models
- macOS, Linux, or Windows -- Ollama runs natively on all three
- ~5-50 GB free disk space -- depending on which model size you choose
- A GPU is nice but not required -- Apple Silicon (M1/M2/M3/M4) works great; NVIDIA GPUs with 6 GB+ VRAM accelerate inference; CPU-only works, just slower
- Python 3.10+ -- if you want to use the API programmatically (optional)
🤖 What Is DeepSeek R1 (And Why Should You Care)?
DeepSeek R1 is a reasoning model -- think "chain-of-thought on steroids." Unlike standard chat models that generate an answer in one shot, R1 thinks step by step, showing its reasoning process before arriving at a conclusion. It was trained using reinforcement learning to develop these reasoning abilities, which is a fundamentally different approach from the supervised fine-tuning that most models use.
The Numbers That Matter
| Benchmark | DeepSeek R1 | OpenAI o1 | GPT-4o |
|---|---|---|---|
| AIME 2024 (math) | 79.8% | 79.2% | 9.3% |
| MATH-500 | 97.3% | 96.4% | 76.6% |
| Codeforces (coding, percentile) | 96.3 | 96.6 | -- |
| MMLU (general knowledge) | 90.8% | 91.8% | 87.2% |
That's not "close enough for open source" -- that's genuine parity with the best proprietary models on reasoning tasks.
The R1 Family
The full DeepSeek R1 is a 671B-parameter Mixture-of-Experts model. You're not running that on your laptop (it needs ~400 GB just for the quantized weights). But DeepSeek distilled its reasoning capabilities into smaller models by fine-tuning popular open-source architectures with 800,000 high-quality reasoning samples from the full R1:
| Model | Base Architecture | Parameters | Ollama Tag |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen 2.5 | 1.5B | deepseek-r1:1.5b |
| DeepSeek-R1-Distill-Qwen-7B | Qwen 2.5 | 7B | deepseek-r1:7b |
| DeepSeek-R1-Distill-Llama-8B | Llama 3.1 | 8B | deepseek-r1:8b |
| DeepSeek-R1-Distill-Qwen-14B | Qwen 2.5 | 14B | deepseek-r1:14b |
| DeepSeek-R1-Distill-Qwen-32B | Qwen 2.5 | 32B | deepseek-r1:32b |
| DeepSeek-R1-Distill-Llama-70B | Llama 3.3 | 70B | deepseek-r1:70b |
The distilled models aren't just "smaller R1" -- they outperform their base models significantly on reasoning tasks. The 32B distilled version competes with OpenAI's o1-mini. The 70B version beats it on most benchmarks.
⚡ Install Ollama and Pull DeepSeek R1
Step 1: Install Ollama
macOS:
brew install ollama
Or download directly from ollama.com.
Linux:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
sudo systemctl enable ollama
Windows:
Download the installer from ollama.com and run it. The CLI gets added to your PATH automatically.
Verify the installation:
ollama --version
Step 2: Pull DeepSeek R1
Pick your model size based on your available RAM:
# The sweet spot for most laptops (16 GB RAM)
ollama pull deepseek-r1:14b
# Lightweight option (8 GB RAM)
ollama pull deepseek-r1:7b
# Best quality you can run locally (32 GB RAM)
ollama pull deepseek-r1:32b
Step 3: Run It
ollama run deepseek-r1:14b
That's it. You'll see a prompt. Ask it something:
>>> What is the sum of the first 100 prime numbers?
<think>
I need to find the sum of the first 100 prime numbers. Let me start by
listing primes systematically...
The first few primes are: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29...
I'll continue generating primes up to the 100th one...
</think>
The sum of the first 100 prime numbers is **24,133**.
The 100th prime number is 541, and adding all primes from 2 to 541
gives us this result.
Notice the <think> block -- that's R1's chain-of-thought reasoning. It works through the problem step by step before giving you the answer. This is what makes it different from a standard chat model.
Press Ctrl+D or type /bye to exit.
📐 Pick the Right Model Size for Your Hardware
This is the section that saves you from pulling a 43 GB model onto a machine with 16 GB of RAM and then wondering why your laptop sounds like a jet engine.
Model Size Reference
| Model | Download Size | RAM Needed | Speed (Apple M-series) | Speed (CPU-only) | Best For |
|---|---|---|---|---|---|
| deepseek-r1:1.5b | ~1.1 GB | 4 GB | ~80 tok/s | ~30 tok/s | Quick tests, minimal hardware |
| deepseek-r1:7b | ~4.7 GB | 8 GB | ~40 tok/s | ~12 tok/s | Light reasoning, side model |
| deepseek-r1:8b | ~5.2 GB | 8 GB | ~35 tok/s | ~10 tok/s | Coding tasks, Llama-based |
| deepseek-r1:14b | ~9 GB | 16 GB | ~25 tok/s | ~6 tok/s | Best bang for buck ✅ |
| deepseek-r1:32b | ~20 GB | 32 GB | ~12 tok/s | ~3 tok/s | Near o1-mini quality ✅ |
| deepseek-r1:70b | ~43 GB | 64 GB | ~5 tok/s | ~1 tok/s | Top-tier local reasoning |
| deepseek-r1:671b | ~404 GB | 512 GB+ | Server only | Don't | You need a cluster |
My Recommendations
16 GB RAM (most laptops): Go with deepseek-r1:14b. It's the best balance of quality and speed. Strong enough for real reasoning tasks, fast enough for interactive use.
32 GB RAM (power users): The deepseek-r1:32b is the standout. It outperforms OpenAI's o1-mini on most benchmarks and runs comfortably on a MacBook Pro with M2/M3/M4 Pro.
8 GB RAM (budget setup): Use deepseek-r1:7b. It won't match the larger models on complex reasoning, but it handles basic chain-of-thought tasks better than any non-reasoning 7B model.
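Not sure where your machine lands? Here's a rough, hypothetical helper (not an official Ollama tool) that maps total RAM to a distill tag using the table above -- it assumes psutil is installed (pip install psutil):
# pick_model.py -- rough heuristic, not an official Ollama tool
import psutil

def suggest_tag() -> str:
    """Map total system RAM to a DeepSeek R1 distill tag (rule of thumb)."""
    ram_gb = psutil.virtual_memory().total / 1024**3
    if ram_gb >= 64:
        return "deepseek-r1:70b"
    if ram_gb >= 32:
        return "deepseek-r1:32b"
    if ram_gb >= 16:
        return "deepseek-r1:14b"
    if ram_gb >= 8:
        return "deepseek-r1:7b"
    return "deepseek-r1:1.5b"

if __name__ == "__main__":
    print(f"Suggested model: {suggest_tag()}")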
7B vs 8B: What's the Difference?
The 7B model is distilled from Qwen 2.5 and the 8B is distilled from Llama 3.1. In practice:
- 7B (Qwen-based): Slightly better at math and structured reasoning
- 8B (Llama-based): Slightly better at code generation and English fluency
Both are close enough that the choice barely matters. If you're already using Llama-family models and want consistency, go 8B. Otherwise, 7B is marginally smaller.
🔌 Use DeepSeek R1 From Python
Ollama exposes an OpenAI-compatible API at http://localhost:11434. This means you can use the official OpenAI Python library, the Ollama library, or plain HTTP requests. Here are all three approaches.
Option 1: OpenAI-Compatible Client (Recommended)
This is the best approach because the same code works with Ollama and any OpenAI-compatible cloud API -- the only thing you change is the base_url (and the API key).
pip install openai
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1/",
api_key="ollama", # required by the SDK but Ollama ignores it
)
response = client.chat.completions.create(
model="deepseek-r1:14b",
messages=[
{"role": "user", "content": "Explain the difference between a stack and a queue using a real-world analogy."}
],
)
print(response.choices[0].message.content)
Option 2: Ollama Native Library
pip install ollama
from ollama import chat
response = chat(
model="deepseek-r1:14b",
messages=[
{"role": "user", "content": "Write a Python function to check if a binary tree is balanced."}
],
)
print(response.message.content)
Option 3: Raw HTTP (No Dependencies)
import requests
import json
response = requests.post(
"http://localhost:11434/api/chat",
json={
"model": "deepseek-r1:14b",
"messages": [
{"role": "user", "content": "What is the time complexity of mergesort and why?"}
],
"stream": False,
},
)
result = response.json()
print(result["message"]["content"])
Streaming Responses
For real-time output (especially useful with reasoning models that can think for a while):
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1/",
api_key="ollama",
)
stream = client.chat.completions.create(
model="deepseek-r1:14b",
messages=[
{"role": "user", "content": "Solve this step by step: If 3x + 7 = 22, what is x?"}
],
stream=True,
)
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
print()
<think> reasoning blocks are included in the response by default. If you want to parse the final answer separately, split on </think> and take everything after it.
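For example, a minimal helper for that tag-splitting approach (a sketch, not an official API):
def split_reasoning(text: str) -> tuple[str, str]:
    """Split an R1 response into (reasoning, final_answer) on the </think> tag."""
    if "</think>" in text:
        reasoning, _, answer = text.partition("</think>")
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

# Works on any R1 output, e.g. response.choices[0].message.content from Option 1
sample = "<think>3x = 15, so x = 5.</think>\nx = 5"
reasoning, answer = split_reasoning(sample)
print(answer)  # -> x = 5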
🛠️ Connect to Your Dev Tools
Because Ollama speaks the OpenAI protocol, anything that supports "custom OpenAI endpoint" works.
Cursor
- Open Settings > Models > OpenAI API
- Set Base URL: http://localhost:11434/v1
- Set API Key: ollama (any non-empty string works)
- Add deepseek-r1:14b as a model
VS Code with Continue
Install the Continue extension, then edit ~/.continue/config.yaml:
models:
- name: DeepSeek R1 14B
provider: ollama
model: deepseek-r1:14b
apiBase: http://localhost:11434
roles:
- chat
- edit
Open WebUI (ChatGPT-like Interface)
If you want a browser-based chat interface instead of the terminal:
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000, select deepseek-r1:14b from the model dropdown, and start chatting. All data stays on your machine.
Custom Modelfile (Bake In System Prompt + Parameters)
Create a file called Modelfile:
FROM deepseek-r1:14b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM """You are a senior software engineer. Think through problems carefully
before coding. Write clean, well-documented Python with type hints. When
reviewing code, explain what's wrong and why before suggesting fixes."""
ollama create deepseek-code-reviewer -f ./Modelfile
ollama run deepseek-code-reviewer
Now you have a custom model preset with your preferred system prompt and settings, available anytime by name.
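The custom name works anywhere the base model's name does -- for example, with the OpenAI-compatible client from earlier (a sketch assuming you created deepseek-code-reviewer as shown above):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")

response = client.chat.completions.create(
    model="deepseek-code-reviewer",  # the custom preset created above
    messages=[{"role": "user", "content": "Review this: def add(a, b): return a - b"}],
)
print(response.choices[0].message.content)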
🔧 Performance Tuning
Increase Context Window
The default context for most Ollama models is 2048 or 4096 tokens. For longer conversations or complex reasoning chains, R1 benefits from more room to think:
# Set globally
export OLLAMA_CONTEXT_LENGTH=16384
Or per-model via a Modelfile:
FROM deepseek-r1:14b
PARAMETER num_ctx 16384
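You can also raise the context for a single request instead of globally -- the native Ollama API accepts an options object per call. A minimal sketch against /api/chat:
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:14b",
        "messages": [{"role": "user", "content": "Summarize the proof that there are infinitely many primes."}],
        "options": {"num_ctx": 16384},  # context window for this request only
        "stream": False,
    },
)
print(response.json()["message"]["content"])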
Free GPU Memory After Use
Ollama keeps models loaded in memory for 5 minutes by default. To unload immediately:
ollama stop deepseek-r1:14b
Or via the API:
curl http://localhost:11434/api/generate \
-d '{"model": "deepseek-r1:14b", "keep_alive": 0}'
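Going the other way, if you call the model often and want to avoid reload latency, the API accepts a keep_alive duration per request (and the OLLAMA_KEEP_ALIVE environment variable sets the default). A minimal sketch with the native Python library:
from ollama import chat

# keep_alive takes a duration string ("30m", "1h") or -1 to keep the model loaded indefinitely
response = chat(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "ping"}],
    keep_alive="30m",
)
print(response.message.content)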
KV Cache Quantization (More Context, Less Memory)
Long reasoning chains eat memory fast. Quantizing the KV cache to 8-bit roughly halves its memory footprint; note that this requires flash attention to be enabled:
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
Move Models to Another Drive
Models live at ~/.ollama/models by default. If your boot drive is tight:
# macOS / Linux
export OLLAMA_MODELS=/Volumes/ExternalDrive/ollama-models
# Windows (PowerShell)
$env:OLLAMA_MODELS = "D:\ollama-models"
Quantization Variants
Ollama defaults to Q4_K_M quantization, which cuts model size by ~75% with minimal quality loss. If you have extra RAM and want better quality:
# Higher quality quantization (larger file, better output) -- exact variant
# tag names vary, so check the deepseek-r1 page on ollama.com first
ollama pull deepseek-r1:14b-q8_0
# Inspect the quantization of a model you already have
ollama show deepseek-r1:14b --modelfile
| Quantization | Size vs Full | Quality Impact | When to Use |
|---|---|---|---|
| Q4_K_M (default) | ~25% of full | Small | Most users ✅ |
| Q5_K_M | ~35% of full | Very small | Extra RAM available |
| Q8_0 | ~50% of full | Negligible | Maximum quality |
🆚 DeepSeek R1 vs Other Local Models
How does R1 stack up against other popular models you can run with Ollama?
| Model | Parameters | Reasoning | Code | General Chat | Unique Strength |
|---|---|---|---|---|---|
| DeepSeek R1 14B | 14B | ✅ Excellent | ✅ Strong | ⚠️ Good | Chain-of-thought reasoning |
| Llama 3.1 8B | 8B | ⚠️ Basic | ✅ Strong | ✅ Excellent | All-around balance |
| Qwen 2.5 Coder 7B | 7B | ⚠️ Basic | ✅ Excellent | ⚠️ Average | Code-specific tasks |
| Phi-4 14B | 14B | ✅ Strong | ✅ Strong | ✅ Strong | Microsoft's compact powerhouse |
| Gemma 3 27B | 27B | ✅ Strong | ✅ Strong | ✅ Excellent | Multimodal support |
| Mistral Small 24B | 24B | ⚠️ Good | ✅ Strong | ✅ Excellent | Fast inference |
When to pick DeepSeek R1: You need step-by-step reasoning -- math problems, logic puzzles, debugging complex code, analyzing algorithms, architectural decisions. R1 doesn't just give you an answer; it shows its work.
When to pick something else: If you need fast general chat (Llama 3.1), code-specific completions (Qwen 2.5 Coder), or multimodal capabilities (Gemma 3), those models are better choices for those specific tasks.
The power move is running two models: DeepSeek R1 for reasoning-heavy tasks and a smaller, faster model for autocomplete and quick queries.
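Here's a minimal sketch of that two-model setup. The keyword-based routing rule is purely illustrative (in practice your IDE or agent framework decides which model to call), and it assumes both models are already pulled:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")

# Illustrative routing rule: send reasoning-heavy prompts to R1,
# everything else to a smaller, faster general model.
REASONING_HINTS = ("prove", "step by step", "debug", "why", "complexity")

def ask(prompt: str) -> str:
    heavy = any(hint in prompt.lower() for hint in REASONING_HINTS)
    model = "deepseek-r1:14b" if heavy else "llama3.1:8b"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("Why is my recursive function hitting the stack limit? Debug it step by step."))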
🐛 Troubleshooting
"Model not found" when running ollama run deepseek-r1:14b
Make sure you've pulled the model first:
ollama pull deepseek-r1:14b
ollama list # Verify it appears
Extremely slow generation (< 1 token/second)
- Check memory pressure: Run ollama ps to see memory usage. If the model is being loaded into swap, you need a smaller model.
- Apple Silicon: Ollama uses Metal (GPU) automatically. Close memory-hungry apps (Chrome, Docker, Slack) to free up unified memory.
- NVIDIA GPU: Make sure CUDA drivers are installed. Run nvidia-smi to verify. Ollama should detect your GPU automatically.
- CPU-only: This is normal for larger models on CPU. The 7B model at ~12 tok/s on CPU is about as fast as it gets without a GPU.
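To see which of these cases you're in, measure your actual throughput. Running ollama run deepseek-r1:7b --verbose prints timing stats after each response; the sketch below gets the same numbers through the API, using the eval_count and eval_duration fields it returns:
import requests

# Quick throughput check: the non-streaming /api/chat response includes timing stats.
result = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:7b",
        "messages": [{"role": "user", "content": "Count from 1 to 20."}],
        "stream": False,
    },
).json()

tokens = result["eval_count"]
seconds = result["eval_duration"] / 1e9  # eval_duration is reported in nanoseconds
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tok/s")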
Model loads but gives nonsensical output
- Temperature too high: If you're getting creative but incoherent responses, lower the temperature inside the interactive session with /set parameter temperature 0.3, or bake PARAMETER temperature 0.3 into a Modelfile (see the custom Modelfile section above).
- Context overflow: If the conversation is long and responses degrade, start a new session. The model's context window has limits.
Can't connect from Python / API tools
Make sure Ollama is running:
# Check if the API responds
curl http://localhost:11434/api/tags
# If not running, start it
ollama serve
On Linux, if Ollama was installed as a system service:
sudo systemctl status ollama
sudo systemctl start ollama
Runs out of disk space mid-download
Models are downloaded to ~/.ollama/models. Check available space before pulling:
# Check disk space
df -h ~
# Move model storage to a larger drive
export OLLAMA_MODELS=/path/to/bigger/drive/ollama-models
🚀 What's Next
- Build a RAG pipeline -- make DeepSeek R1 answer questions from your own documents. See our Local LLM + Ollama RAG Guide for the full walkthrough
- Compare AI coding tools -- see how local models stack up against cloud-based agents in our AI Coding Agents Compared breakdown
- Try R1 for code review -- create a custom Modelfile with a code review system prompt and point your editor at it
- Experiment with multi-model setups -- use R1 for reasoning and a fast 7B model for autocomplete in your IDE
- Explore the R1-0528 update -- if you're on an older pull, run ollama pull deepseek-r1:14b again to get the latest weights with improved math and reduced hallucinations
The reasoning model gap between open-source and proprietary has effectively closed. DeepSeek R1 running on a MacBook Pro produces outputs that would have required a $200/month API budget eighteen months ago. The only question left is which model size fits your hardware.
Want to go deeper with local AI? Read our Local LLM + Ollama RAG Guide for building a private document Q&A system, or check out The Rise of the AI Engineer to see how these tools are reshaping the developer role.