DeepSeek R1 Local Setup: Run It with Ollama in Minutes
DeepSeek R1 matches OpenAI's o1 on math and reasoning benchmarks. It's open-weight. It runs on your laptop. And it costs exactly zero dollars per month.
That's not marketing copy -- it's the state of open-source AI in 2026. While the rest of the industry debates API pricing tiers and rate limits, DeepSeek dropped a 671-billion-parameter reasoning model under an MIT license and said "have fun." The distilled versions -- ranging from 1.5B to 70B parameters -- bring that same reasoning DNA to machines as modest as a MacBook Air.
This guide gets you from zero to a working DeepSeek R1 setup in under ten minutes. We'll cover which model size fits your hardware, how to use it from the terminal and Python, how to connect it to your dev tools, and what to do when things go sideways.
🧰 What You'll Need
- A computer with 8 GB+ RAM -- 16 GB is the sweet spot, 32 GB unlocks the best distilled models
- macOS, Linux, or Windows -- Ollama runs natively on all three
- ~5-50 GB free disk space -- depending on which model size you choose
- A GPU is nice but not required -- Apple Silicon (M1/M2/M3/M4) works great; NVIDIA GPUs with 6 GB+ VRAM accelerate inference; CPU-only works, just slower
- Python 3.10+ -- if you want to use the API programmatically (optional)
🤖 What Is DeepSeek R1 (And Why Should You Care)?
DeepSeek R1 is a reasoning model -- think "chain-of-thought on steroids." Unlike standard chat models that generate an answer in one shot, R1 thinks step by step, showing its reasoning process before arriving at a conclusion. It was trained using reinforcement learning to develop these reasoning abilities, which is a fundamentally different approach from the supervised fine-tuning that most models use.
The Numbers That Matter
| Benchmark | DeepSeek R1 | OpenAI o1 | GPT-4o |
|---|---|---|---|
| AIME 2024 (math) | 79.8% | 79.2% | 9.3% |
| MATH-500 | 97.3% | 96.4% | 76.6% |
| Codeforces (coding, percentile) | 96.3 | 96.6 | -- |
| MMLU (general knowledge) | 90.8% | 91.8% | 87.2% |
That's not "close enough for open source" -- that's genuine parity with the best proprietary models on reasoning tasks.
The R1 Family
The full DeepSeek R1 is a 671B-parameter Mixture-of-Experts model. You're not running that on your laptop (it needs ~400 GB just for the quantized weights). But DeepSeek distilled its reasoning capabilities into smaller models by fine-tuning popular open-source architectures with 800,000 high-quality reasoning samples from the full R1:
| Model | Base Architecture | Parameters | Ollama Tag |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen 2.5 | 1.5B | deepseek-r1:1.5b |
| DeepSeek-R1-Distill-Qwen-7B | Qwen 2.5 | 7B | deepseek-r1:7b |
| DeepSeek-R1-Distill-Llama-8B | Llama 3.1 | 8B | deepseek-r1:8b |
| DeepSeek-R1-Distill-Qwen-14B | Qwen 2.5 | 14B | deepseek-r1:14b |
| DeepSeek-R1-Distill-Qwen-32B | Qwen 2.5 | 32B | deepseek-r1:32b |
| DeepSeek-R1-Distill-Llama-70B | Llama 3.3 | 70B | deepseek-r1:70b |
The distilled models aren't just "smaller R1" -- they outperform their base models significantly on reasoning tasks. The 32B distilled version competes with OpenAI's o1-mini. The 70B version beats it on most benchmarks.
⚡ Install Ollama and Pull DeepSeek R1
Step 1: Install Ollama
macOS:
brew install ollama
Or download directly from ollama.com.
Linux:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
sudo systemctl enable ollama
Windows:
Download the installer from ollama.com and run it. The CLI gets added to your PATH automatically.
Verify the installation:
ollama --version
Step 2: Pull DeepSeek R1
Pick your model size based on your available RAM:
# The sweet spot for most laptops (16 GB RAM)
ollama pull deepseek-r1:14b
# Lightweight option (8 GB RAM)
ollama pull deepseek-r1:7b
# Best quality you can run locally (32 GB RAM)
ollama pull deepseek-r1:32b
Step 3: Run It
ollama run deepseek-r1:14b
That's it. You'll see a prompt. Ask it something:
>>> What is the sum of the first 100 prime numbers?
<think>
I need to find the sum of the first 100 prime numbers. Let me start by
listing primes systematically...
The first few primes are: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29...
I'll continue generating primes up to the 100th one...
</think>
The sum of the first 100 prime numbers is **24,133**.
The 100th prime number is 541, and adding all primes from 2 to 541
gives us this result.
Notice the <think> block -- that's R1's chain-of-thought reasoning. It works through the problem step by step before giving you the answer. This is what makes it different from a standard chat model.
Press Ctrl+D or type /bye to exit.
📐 Pick the Right Model Size for Your Hardware
This is the section that saves you from pulling a 43 GB model onto a machine with 16 GB of RAM and then wondering why your laptop sounds like a jet engine.
Model Size Reference
| Model | Download Size | RAM Needed | Speed (Apple M-series) | Speed (CPU-only) | Best For |
|---|---|---|---|---|---|
| deepseek-r1:1.5b | ~1.1 GB | 4 GB | ~80 tok/s | ~30 tok/s | Quick tests, minimal hardware |
| deepseek-r1:7b | ~4.7 GB | 8 GB | ~40 tok/s | ~12 tok/s | Light reasoning, side model |
| deepseek-r1:8b | ~5.2 GB | 8 GB | ~35 tok/s | ~10 tok/s | Coding tasks, Llama-based |
| deepseek-r1:14b | ~9 GB | 16 GB | ~25 tok/s | ~6 tok/s | Best bang for buck ✅ |
| deepseek-r1:32b | ~20 GB | 32 GB | ~12 tok/s | ~3 tok/s | Near o1-mini quality ✅ |
| deepseek-r1:70b | ~43 GB | 64 GB | ~5 tok/s | ~1 tok/s | Top-tier local reasoning |
| deepseek-r1:671b | ~404 GB | 512 GB+ | Server only | Don't | You need a cluster |
My Recommendations
16 GB RAM (most laptops): Go with deepseek-r1:14b. It's the best balance of quality and speed. Strong enough for real reasoning tasks, fast enough for interactive use.
32 GB RAM (power users): The deepseek-r1:32b is the standout. It outperforms OpenAI's o1-mini on most benchmarks and runs comfortably on a MacBook Pro with M2/M3/M4 Pro.
8 GB RAM (budget setup): Use deepseek-r1:7b. It won't match the larger models on complex reasoning, but it handles basic chain-of-thought tasks better than any non-reasoning 7B model.
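Not sure where your machine lands? Here's a rough, hypothetical helper (not an official Ollama tool) that maps total RAM to a distill tag using the table above -- it assumes psutil is installed (pip install psutil):
# pick_model.py -- rough heuristic, not an official Ollama tool
import psutil

def suggest_tag() -> str:
    """Map total system RAM to a DeepSeek R1 distill tag (rule of thumb)."""
    ram_gb = psutil.virtual_memory().total / 1024**3
    if ram_gb >= 64:
        return "deepseek-r1:70b"
    if ram_gb >= 32:
        return "deepseek-r1:32b"
    if ram_gb >= 16:
        return "deepseek-r1:14b"
    if ram_gb >= 8:
        return "deepseek-r1:7b"
    return "deepseek-r1:1.5b"

if __name__ == "__main__":
    print(f"Suggested model: {suggest_tag()}")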
7B vs 8B: What's the Difference?
The 7B model is distilled from Qwen 2.5 and the 8B is distilled from Llama 3.1. In practice:
- 7B (Qwen-based): Slightly better at math and structured reasoning
- 8B (Llama-based): Slightly better at code generation and English fluency
Both are close enough that the choice barely matters. If you're already using Llama-family models and want consistency, go 8B. Otherwise, 7B is marginally smaller.
🔌 Use DeepSeek R1 From Python
Ollama exposes an OpenAI-compatible API at http://localhost:11434. This means you can use the official OpenAI Python library, the Ollama library, or plain HTTP requests. Here are all three approaches.
Option 1: OpenAI-Compatible Client (Recommended)
This is the best approach because the same code works with Ollama and any OpenAI-compatible cloud API -- the only thing you change is the base_url (and the API key).
pip install openai
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1/",
api_key="ollama", # required by the SDK but Ollama ignores it
)
response = client.chat.completions.create(
model="deepseek-r1:14b",
messages=[
{"role": "user", "content": "Explain the difference between a stack and a queue using a real-world analogy."}
],
)
print(response.choices[0].message.content)
Option 2: Ollama Native Library
pip install ollama
from ollama import chat
response = chat(
model="deepseek-r1:14b",
messages=[
{"role": "user", "content": "Write a Python function to check if a binary tree is balanced."}
],
)
print(response.message.content)
Option 3: Raw HTTP (No Dependencies)
import requests
import json
response = requests.post(
"http://localhost:11434/api/chat",
json={
"model": "deepseek-r1:14b",
"messages": [
{"role": "user", "content": "What is the time complexity of mergesort and why?"}
],
"stream": False,
},
)
result = response.json()
print(result["message"]["content"])
Streaming Responses
For real-time output (especially useful with reasoning models that can think for a while):
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1/",
api_key="ollama",
)
stream = client.chat.completions.create(
model="deepseek-r1:14b",
messages=[
{"role": "user", "content": "Solve this step by step: If 3x + 7 = 22, what is x?"}
],
stream=True,
)
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
print()
<think> reasoning blocks are included in the response by default. If you want to parse the final answer separately, split on </think> and take everything after it.
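For example, a minimal helper for that tag-splitting approach (a sketch, not an official API):
def split_reasoning(text: str) -> tuple[str, str]:
    """Split an R1 response into (reasoning, final_answer) on the </think> tag."""
    if "</think>" in text:
        reasoning, _, answer = text.partition("</think>")
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

# Works on any R1 output, e.g. response.choices[0].message.content from Option 1
sample = "<think>3x = 15, so x = 5.</think>\nx = 5"
reasoning, answer = split_reasoning(sample)
print(answer)  # -> x = 5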
🛠️ Connect to Your Dev Tools
Because Ollama speaks the OpenAI protocol, anything that supports "custom OpenAI endpoint" works.
Cursor
- Open Settings > Models > OpenAI API
- Set Base URL: http://localhost:11434/v1
- Set API Key: ollama (any non-empty string works)
- Add deepseek-r1:14b as a model
VS Code with Continue
Install the Continue extension, then edit ~/.continue/config.yaml:
models:
- name: DeepSeek R1 14B
provider: ollama
model: deepseek-r1:14b
apiBase: http://localhost:11434
roles:
- chat
- edit
Open WebUI (ChatGPT-like Interface)
If you want a browser-based chat interface instead of the terminal:
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000, select deepseek-r1:14b from the model dropdown, and start chatting. All data stays on your machine.
Custom Modelfile (Bake In System Prompt + Parameters)
Create a file called Modelfile:
FROM deepseek-r1:14b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM """You are a senior software engineer. Think through problems carefully
before coding. Write clean, well-documented Python with type hints. When
reviewing code, explain what's wrong and why before suggesting fixes."""
ollama create deepseek-code-reviewer -f ./Modelfile
ollama run deepseek-code-reviewer
Now you have a custom model preset with your preferred system prompt and settings, available anytime by name.
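The custom name works anywhere the base model's name does -- for example, with the OpenAI-compatible client from earlier (a sketch assuming you created deepseek-code-reviewer as shown above):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")

response = client.chat.completions.create(
    model="deepseek-code-reviewer",  # the custom preset created above
    messages=[{"role": "user", "content": "Review this: def add(a, b): return a - b"}],
)
print(response.choices[0].message.content)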
🔧 Performance Tuning
Increase Context Window
The default context for most Ollama models is 2048 or 4096 tokens. For longer conversations or complex reasoning chains, R1 benefits from more room to think:
# Set globally
export OLLAMA_CONTEXT_LENGTH=16384
Or per-model via a Modelfile:
FROM deepseek-r1:14b
PARAMETER num_ctx 16384
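You can also raise the context for a single request instead of globally -- the native Ollama API accepts an options object per call. A minimal sketch against /api/chat:
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:14b",
        "messages": [{"role": "user", "content": "Summarize the proof that there are infinitely many primes."}],
        "options": {"num_ctx": 16384},  # context window for this request only
        "stream": False,
    },
)
print(response.json()["message"]["content"])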
Free GPU Memory After Use
Ollama keeps models loaded in memory for 5 minutes by default. To unload immediately:
ollama stop deepseek-r1:14b
Or via the API:
curl http://localhost:11434/api/generate \
-d '{"model": "deepseek-r1:14b", "keep_alive": 0}'
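Going the other way, if you call the model often and want to avoid reload latency, the API accepts a keep_alive duration per request (and the OLLAMA_KEEP_ALIVE environment variable sets the default). A minimal sketch with the native Python library:
from ollama import chat

# keep_alive takes a duration string ("30m", "1h") or -1 to keep the model loaded indefinitely
response = chat(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "ping"}],
    keep_alive="30m",
)
print(response.message.content)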
KV Cache Quantization (More Context, Less Memory)
Long reasoning chains eat memory fast. Quantizing the KV cache to 8-bit roughly halves its memory footprint; note that this requires flash attention to be enabled:
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
Move Models to Another Drive
Models live at ~/.ollama/models by default. If your boot drive is tight:
# macOS / Linux
export OLLAMA_MODELS=/Volumes/ExternalDrive/ollama-models
# Windows (PowerShell)
$env:OLLAMA_MODELS = "D:\ollama-models"
Quantization Variants
Ollama defaults to Q4_K_M quantization, which cuts model size by ~75% with minimal quality loss. If you have extra RAM and want better quality:
# Higher quality quantization (larger file, better output) -- exact variant
# tag names vary, so check the deepseek-r1 page on ollama.com first
ollama pull deepseek-r1:14b-q8_0
# Inspect the quantization of a model you already have
ollama show deepseek-r1:14b --modelfile
| Quantization | Size vs Full | Quality Impact | When to Use |
|---|---|---|---|
| Q4_K_M (default) | ~25% of full | Small | Most users ✅ |
| Q5_K_M | ~35% of full | Very small | Extra RAM available |
| Q8_0 | ~50% of full | Negligible | Maximum quality |
🆚 DeepSeek R1 vs Other Local Models
How does R1 stack up against other popular models you can run with Ollama?
| Model | Parameters | Reasoning | Code | General Chat | Unique Strength |
|---|---|---|---|---|---|
| DeepSeek R1 14B | 14B | ✅ Excellent | ✅ Strong | ⚠️ Good | Chain-of-thought reasoning |
| Llama 3.1 8B | 8B | ⚠️ Basic | ✅ Strong | ✅ Excellent | All-around balance |
| Qwen 2.5 Coder 7B | 7B | ⚠️ Basic | ✅ Excellent | ⚠️ Average | Code-specific tasks |
| Phi-4 14B | 14B | ✅ Strong | ✅ Strong | ✅ Strong | Microsoft's compact powerhouse |
| Gemma 3 27B | 27B | ✅ Strong | ✅ Strong | ✅ Excellent | Multimodal support |
| Mistral Small 24B | 24B | ⚠️ Good | ✅ Strong | ✅ Excellent | Fast inference |
When to pick DeepSeek R1: You need step-by-step reasoning -- math problems, logic puzzles, debugging complex code, analyzing algorithms, architectural decisions. R1 doesn't just give you an answer; it shows its work.
When to pick something else: If you need fast general chat (Llama 3.1), code-specific completions (Qwen 2.5 Coder), or multimodal capabilities (Gemma 3), those models are better choices for those specific tasks.
The power move is running two models: DeepSeek R1 for reasoning-heavy tasks and a smaller, faster model for autocomplete and quick queries.
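Here's a minimal sketch of that two-model setup. The keyword-based routing rule is purely illustrative (in practice your IDE or agent framework decides which model to call), and it assumes both models are already pulled:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")

# Illustrative routing rule: send reasoning-heavy prompts to R1,
# everything else to a smaller, faster general model.
REASONING_HINTS = ("prove", "step by step", "debug", "why", "complexity")

def ask(prompt: str) -> str:
    heavy = any(hint in prompt.lower() for hint in REASONING_HINTS)
    model = "deepseek-r1:14b" if heavy else "llama3.1:8b"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("Why is my recursive function hitting the stack limit? Debug it step by step."))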
🐛 Troubleshooting
"Model not found" when running ollama run deepseek-r1:14b
Make sure you've pulled the model first:
ollama pull deepseek-r1:14b
ollama list # Verify it appears
Extremely slow generation (< 1 token/second)
- Check memory pressure: Run ollama ps to see memory usage. If the model is being loaded into swap, you need a smaller model.
- Apple Silicon: Ollama uses Metal (GPU) automatically. Close memory-hungry apps (Chrome, Docker, Slack) to free up unified memory.
- NVIDIA GPU: Make sure CUDA drivers are installed. Run nvidia-smi to verify. Ollama should detect your GPU automatically.
- CPU-only: This is normal for larger models on CPU. The 7B model at ~12 tok/s on CPU is about as fast as it gets without a GPU.
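To see which of these cases you're in, measure your actual throughput. Running ollama run deepseek-r1:7b --verbose prints timing stats after each response; the sketch below gets the same numbers through the API, using the eval_count and eval_duration fields it returns:
import requests

# Quick throughput check: the non-streaming /api/chat response includes timing stats.
result = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:7b",
        "messages": [{"role": "user", "content": "Count from 1 to 20."}],
        "stream": False,
    },
).json()

tokens = result["eval_count"]
seconds = result["eval_duration"] / 1e9  # eval_duration is reported in nanoseconds
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tok/s")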
Model loads but gives nonsensical output
- Temperature too high: If you're getting creative but incoherent responses, lower the temperature inside the interactive session with /set parameter temperature 0.3, or bake PARAMETER temperature 0.3 into a Modelfile (see the custom Modelfile section above).
- Context overflow: If the conversation is long and responses degrade, start a new session. The model's context window has limits.
Can't connect from Python / API tools
Make sure Ollama is running:
# Check if the API responds
curl http://localhost:11434/api/tags
# If not running, start it
ollama serve
On Linux, if Ollama was installed as a system service:
sudo systemctl status ollama
sudo systemctl start ollama
Runs out of disk space mid-download
Models are downloaded to ~/.ollama/models. Check available space before pulling:
# Check disk space
df -h ~
# Move model storage to a larger drive
export OLLAMA_MODELS=/path/to/bigger/drive/ollama-models
🚀 What's Next
- Build a RAG pipeline -- make DeepSeek R1 answer questions from your own documents. See our Local LLM + Ollama RAG Guide for the full walkthrough
- Compare AI coding tools -- see how local models stack up against cloud-based agents in our AI Coding Agents Compared breakdown
- Try R1 for code review -- create a custom Modelfile with a code review system prompt and point your editor at it
- Experiment with multi-model setups -- use R1 for reasoning and a fast 7B model for autocomplete in your IDE
- Explore the R1-0528 update -- if you're on an older pull, run ollama pull deepseek-r1:14b again to get the latest weights with improved math and reduced hallucinations
The reasoning model gap between open-source and proprietary has effectively closed. DeepSeek R1 running on a MacBook Pro produces outputs that would have required a $200/month API budget eighteen months ago. The only question left is which model size fits your hardware.
Want to go deeper with local AI? Read our Local LLM + Ollama RAG Guide for building a private document Q&A system, or check out The Rise of the AI Engineer to see how these tools are reshaping the developer role.