Fine-Tuning Open-Source LLMs on Your Codebase: A Practical Guide
You've tried RAG. You've tried prompt engineering. You've written system prompts so detailed they could pass a bar exam. And yet your AI coding assistant still generates AbstractFactoryProviderManagerBean when your entire codebase uses plain functions and snake_case. The model doesn't know your code conventions, your internal libraries, or your team's patterns -- because it was never trained on them.
Fine-tuning fixes that. Instead of shoving context into a prompt window and hoping the model pays attention, you bake your codebase's patterns directly into the model's weights. The result is a model that writes code the way your team writes code, without needing 50 lines of system prompt to get there.
But fine-tuning has a reputation problem. People think it requires a cluster of A100s, a PhD in machine learning, and three months of spare time. In 2026, that's no longer true. With LoRA and QLoRA, you can fine-tune a 7B-parameter code model on a single consumer GPU in a few hours. This guide shows you exactly how.
🧰 What You'll Need
- A GPU with at least 16 GB VRAM -- RTX 4090 (24 GB), A100, or similar. QLoRA can squeeze onto 12 GB for 7B models
- Python 3.10+ with pip or conda
- A codebase -- the code you want the model to learn from (any language, any size)
- Basic Python and command-line fluency -- you don't need ML experience, but you should be comfortable with pip and Jupyter notebooks
- 30-120 minutes -- for dataset prep and a training run on a 7B model
- Optional: Hugging Face account -- for downloading gated models and pushing your fine-tuned adapter
🤔 Fine-Tuning vs RAG: When to Use Which
Before you start fine-tuning, make sure it's actually the right tool for the problem. This is the single most common mistake -- people fine-tune when RAG would have been cheaper, faster, and better.
| Factor | Fine-Tuning | RAG |
|---|---|---|
| Best for | Teaching style, conventions, patterns | Answering questions about specific content |
| Data freshness | Frozen at training time | Always up-to-date |
| Setup effort | Hours to days | Minutes to hours |
| Inference cost | ✅ Zero extra cost (no retrieval step) | ⚠️ Retrieval adds latency |
| Data needed | 500-10,000+ examples | Any amount of documents |
| When data changes | ❌ Must retrain | ✅ Just re-index |
| Learns new APIs | ✅ Deeply internalizes patterns | ⚠️ Surface-level, depends on retrieved context |
| Hallucination risk | ⚠️ Can hallucinate learned patterns | ✅ Grounded in retrieved documents |
Use fine-tuning when:
- Your model needs to consistently follow coding conventions (naming, structure, patterns) across your entire codebase
- You have internal libraries or frameworks the base model has never seen
- You want the model to generate boilerplate that matches your architecture without specifying it every time
- Inference latency matters and you can't afford a retrieval step
Use RAG when:
- You need to query specific documents for factual answers (docs, wikis, specs)
- Your knowledge base changes frequently (API docs, runbooks)
- You want to cite sources for generated answers
- You don't have enough structured training data for fine-tuning
The hybrid approach (best of both)
The real power move in 2026 is combining both: fine-tune a base model so it understands your conventions and style, then augment it with RAG so it can reference current documentation. A lightly fine-tuned Qwen2.5-Coder with a RAG layer over your docs gives you a model that writes code in your style while staying grounded in your latest specs.
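To make that concrete, here's a minimal sketch of the prompt-assembly half of the hybrid pattern. The retrieve_docs() helper is a hypothetical stand-in for whatever retriever you use; the assembled messages then go to your fine-tuned model through Ollama or vLLM (both covered in the deployment section below).
# Sketch of the hybrid pattern: retrieved documentation goes into the prompt,
# while your fine-tuned model supplies the conventions and style.
# retrieve_docs() is a hypothetical stand-in for your own retriever.

def retrieve_docs(query: str, k: int = 4) -> list[str]:
    """Return the k most relevant doc chunks from your vector store."""
    raise NotImplementedError  # e.g., a similarity search over embedded docs

def build_messages(task: str) -> list[dict]:
    """Assemble a chat request that grounds the fine-tuned model in current docs."""
    context = "\n\n".join(retrieve_docs(task))
    return [
        {"role": "system", "content": "Follow the project's conventions. Prefer the documentation below over memory."},
        {"role": "user", "content": f"Documentation:\n{context}\n\nTask: {task}"},
    ]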
If you're new to RAG, check out the Local LLM + Ollama RAG Guide or the RAG Pipeline Tutorial first. Come back here when RAG alone isn't enough.
🧠 Picking Your Base Model
Not all open-source models are created equal for code tasks. Here's what you should consider in 2026:
| Model | Params | License | Strengths | Fine-Tune Friendly? |
|---|---|---|---|---|
| Qwen2.5-Coder | 1.5B-32B | Apache 2.0 | Best open-source code model, multilingual | ✅ Excellent |
| DeepSeek-Coder-V2 | 16B/236B | MIT | Strong reasoning, fill-in-middle support | ✅ Good |
| CodeLlama | 7B-34B | Llama 2 License | Mature ecosystem, well-documented fine-tuning | ✅ Good |
| StarCoder2 | 3B-15B | BigCode OpenRAIL-M | Trained on 600+ languages, strong completion | ✅ Good |
| Llama 3.1/3.3 | 8B-70B | Llama 3 License | General-purpose but solid at code | ✅ Good |
My recommendation: Start with Qwen/Qwen2.5-Coder-7B-Instruct. It punches far above its weight on coding benchmarks for a 7B model, it's Apache 2.0 licensed (use it commercially without worry), and it fine-tunes beautifully with LoRA. If you have a bigger GPU budget, Qwen2.5-Coder-32B-Instruct -- the variant that goes toe-to-toe with GPT-4o on coding benchmarks -- is the current state of the art among open-source code models.
For the rest of this guide, we'll use the 7B variant. Everything scales to larger models -- you just need more VRAM and patience.
📦 Preparing Your Dataset
This is the step that makes or breaks your fine-tune. A well-prepared dataset of 1,000 examples will outperform a sloppy dataset of 50,000. Quality is everything.
Dataset format
Fine-tuning for code tasks typically uses instruction-completion pairs in JSONL format. Each example has a prompt (what you'd ask the model) and a completion (what you want it to generate):
{"instruction": "Write a function that validates user input for the payment form", "output": "def validate_payment_input(form_data: dict) -> ValidationResult:\n \"\"\"Validate payment form fields against business rules.\"\"\"\n errors = []\n \n if not form_data.get('amount') or form_data['amount'] <= 0:\n errors.append(ValidationError('amount', 'Must be positive'))\n \n if not is_valid_currency(form_data.get('currency', '')):\n errors.append(ValidationError('currency', 'Unsupported currency'))\n \n return ValidationResult(valid=len(errors) == 0, errors=errors)"}
Extracting training data from your codebase
Here's a Python script that crawls your codebase and generates instruction-completion pairs using an existing LLM (Claude, GPT-4o, or a local model) to create the "instruction" half:
import os
import json
from pathlib import Path
from openai import OpenAI
# Point this at your LLM of choice (cloud or local via Ollama)
client = OpenAI(
api_key=os.environ.get("OPENAI_API_KEY", "ollama"),
base_url=os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
)
TEACHER_MODEL = os.environ.get("TEACHER_MODEL", "gpt-4o")
# File extensions to include
CODE_EXTENSIONS = {".py", ".ts", ".js", ".go", ".rs", ".java", ".rb"}
MAX_FILE_SIZE = 8000 # characters -- skip very large files
def extract_functions(file_path: str) -> list[str]:
"""Read a file and split it into individual functions/classes."""
content = Path(file_path).read_text(encoding="utf-8", errors="ignore")
if len(content) > MAX_FILE_SIZE:
return [] # Skip massive files
return [content] # Simplified; use tree-sitter for precise extraction
def generate_instruction(code_snippet: str, file_path: str) -> str:
"""Use a teacher model to generate a natural instruction for code."""
response = client.chat.completions.create(
model=TEACHER_MODEL,
messages=[
{
"role": "system",
"content": (
"You generate concise coding instructions. Given a code "
"snippet, write a short instruction that a developer might "
"give to produce this code. Be specific about the task, "
"not generic. One sentence only."
),
},
{
"role": "user",
"content": f"File: {file_path}\n\n```\n{code_snippet}\n```",
},
],
max_tokens=150,
temperature=0.3,
)
return response.choices[0].message.content.strip()
def build_dataset(codebase_dir: str, output_file: str = "training_data.jsonl"):
"""Walk a codebase and generate instruction-completion pairs."""
pairs = []
codebase = Path(codebase_dir)
for file_path in codebase.rglob("*"):
if file_path.suffix not in CODE_EXTENSIONS:
continue
if any(skip in str(file_path) for skip in ["node_modules", ".venv", "__pycache__", ".git"]):
continue
snippets = extract_functions(str(file_path))
for snippet in snippets:
instruction = generate_instruction(snippet, str(file_path.relative_to(codebase)))
pairs.append({
"instruction": instruction,
"output": snippet,
})
print(f" Generated pair for {file_path.name} ({len(pairs)} total)")
with open(output_file, "w") as f:
for pair in pairs:
f.write(json.dumps(pair) + "\n")
print(f"\nDataset saved: {output_file} ({len(pairs)} examples)")
return pairs
if __name__ == "__main__":
import sys
codebase_dir = sys.argv[1] if len(sys.argv) > 1 else "."
build_dataset(codebase_dir)
# Using a cloud model as the teacher
export OPENAI_API_KEY="sk-..."
export LLM_BASE_URL="https://api.openai.com/v1"
export TEACHER_MODEL="gpt-4o"
python build_dataset.py /path/to/your/codebase
# Or using a local model via Ollama (free, private)
export TEACHER_MODEL="llama3.1:8b"
python build_dataset.py /path/to/your/codebase
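The extract_functions() stub above returns whole files, which works but produces long, noisy examples. Here's a minimal sketch of per-function extraction for Python files using only the standard-library ast module -- tree-sitter is the equivalent route if you need other languages -- intended as a drop-in replacement for the stub:
# Sketch of per-function extraction for Python files using the standard
# library. Swap this in for the simplified extract_functions() above; for
# other languages, tree-sitter gives you the same idea with per-language grammars.
import ast
from pathlib import Path

def extract_functions(file_path: str) -> list[str]:
    """Return the source of each top-level function and class in a .py file."""
    if not file_path.endswith(".py"):
        return []
    source = Path(file_path).read_text(encoding="utf-8", errors="ignore")
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []  # Skip files that don't parse cleanly
    lines = source.splitlines()
    snippets = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1
            if node.decorator_list:  # Include decorators above the definition
                start = node.decorator_list[0].lineno - 1
            snippets.append("\n".join(lines[start:node.end_lineno]))
    return snippets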
Dataset quality checklist
Before training, audit your dataset:
- Remove duplicates. Near-identical examples waste compute and can cause overfitting
- Remove trivial examples. One-line getters and empty `__init__` methods don't teach useful patterns
- Balance your data. If 80% of your examples are API handlers and 5% are database queries, the model will learn API handlers and forget everything else
- Verify formatting. Broken JSON lines will crash training. Validate with `python -c "import json; [json.loads(l) for l in open('training_data.jsonl')]"`
- Aim for 500-5,000 examples. Below 500, the model barely shifts. Above 5,000, you get diminishing returns for LoRA fine-tuning
🔧 Fine-Tuning with QLoRA (Step-by-Step)
Time to actually train. We'll use Unsloth, which wraps Hugging Face's PEFT library with significant speed and memory optimizations. It's free, open-source, and makes QLoRA fine-tuning about 2x faster than vanilla implementations.
Understanding LoRA and QLoRA
Before we run the code, here's what these acronyms mean in plain English:
LoRA (Low-Rank Adaptation) freezes all of the original model's weights and injects small trainable matrices into specific layers. Instead of updating 7 billion parameters, you train roughly 10-50 million. The result is a tiny "adapter" file (typically 50-200 MB) that sits on top of the frozen base model.
QLoRA goes one step further: it loads the frozen base model in 4-bit quantized format, cutting memory usage by ~75%. The trainable LoRA matrices stay in 16-bit for precision. This is what lets you fine-tune a 7B model on a 16 GB GPU.
┌─────────────────────────────────────────────┐
│ Base Model (Frozen) │
│ 7B params in 4-bit (QLoRA) │
│ ~4 GB VRAM │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ LoRA Adapter │ │ LoRA Adapter │ ... │
│ │ (Trainable) │ │ (Trainable) │ │
│ │ ~30M params │ │ ~30M params │ │
│ │ 16-bit │ │ 16-bit │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Total: ~6-8 GB VRAM │
└─────────────────────────────────────────────┘
Step 1: Install dependencies
pip install torch torchvision torchaudio
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes xformers
pip install datasets
Or on Google Colab:
%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes xformers
Step 2: Load the base model with QLoRA
from unsloth import FastLanguageModel
import torch
# --- Configuration ---
MODEL_NAME = "Qwen/Qwen2.5-Coder-7B-Instruct"
MAX_SEQ_LENGTH = 4096
LORA_RANK = 32 # Higher = more capacity, more VRAM
LORA_ALPHA = 64 # Usually 2x rank
LORA_DROPOUT = 0.05
# Load model in 4-bit (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=MODEL_NAME,
max_seq_length=MAX_SEQ_LENGTH,
dtype=None, # Auto-detect (float16 on GPU)
load_in_4bit=True, # QLoRA: 4-bit quantization
)
# Add LoRA adapters to the model
model = FastLanguageModel.get_peft_model(
model,
r=LORA_RANK,
lora_alpha=LORA_ALPHA,
lora_dropout=LORA_DROPOUT,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
bias="none",
use_gradient_checkpointing="unsloth", # Memory optimization
)
print(f"Trainable parameters: {model.print_trainable_parameters()}")
# Output: trainable params: 41,943,040 || all params: 7,657,018,368 || trainable%: 0.5476
Step 3: Format and load your dataset
from datasets import load_dataset
# Load your JSONL dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
# Define the chat template for Qwen2.5-Coder
def format_instruction(example):
"""Convert instruction-output pairs to the model's chat format."""
messages = [
{
"role": "system",
"content": "You are an expert software engineer. Write clean, "
"well-documented code following the project's conventions.",
},
{"role": "user", "content": example["instruction"]},
{"role": "assistant", "content": example["output"]},
]
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=False
)
return {"text": text}
# Apply formatting
dataset = dataset.map(format_instruction)
print(f"Dataset size: {len(dataset)} examples")
print(f"Sample:\n{dataset[0]['text'][:500]}...")
Step 4: Configure training
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=MAX_SEQ_LENGTH,
packing=True, # Pack short examples together for efficiency
args=TrainingArguments(
# --- Core ---
output_dir="./output",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size: 16
# --- Optimizer ---
learning_rate=2e-4,
weight_decay=0.01,
lr_scheduler_type="cosine",
warmup_ratio=0.05,
# --- Memory ---
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
gradient_checkpointing=True,
# --- Logging ---
logging_steps=10,
save_strategy="epoch",
# --- Reproducibility ---
seed=42,
),
)
Step 5: Train
print("Starting fine-tuning...")
trainer_stats = trainer.train()
print(f"\nTraining complete!")
print(f" Total steps: {trainer_stats.global_step}")
print(f" Training loss: {trainer_stats.training_loss:.4f}")
print(f" Runtime: {trainer_stats.metrics['train_runtime']:.0f} seconds")
On an RTX 4090 with 2,000 examples and 3 epochs, expect roughly 20-40 minutes of training time. On a T4 (Colab free tier), more like 60-90 minutes.
Step 6: Save the adapter
# Save the LoRA adapter (small, ~100-200 MB)
model.save_pretrained("./qwen-coder-finetuned-lora")
tokenizer.save_pretrained("./qwen-coder-finetuned-lora")
# Optional: Push to Hugging Face Hub
# model.push_to_hub("your-username/qwen-coder-finetuned-lora")
# Optional: Merge adapter into base model and save full weights
# (Larger file, but easier to deploy)
model.save_pretrained_merged(
"./qwen-coder-finetuned-merged",
tokenizer,
save_method="merged_16bit", # or "merged_4bit" for quantized
)
✅ Testing and Evaluating Your Fine-Tuned Model
Training loss going down doesn't mean your model is useful. You need to actually test it on real tasks.
Quick inference test
from unsloth import FastLanguageModel
# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="./qwen-coder-finetuned-lora",
max_seq_length=4096,
load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
# Test with a prompt
messages = [
{"role": "system", "content": "You are an expert software engineer."},
{"role": "user", "content": "Write a function to process incoming webhook events"},
]
inputs = tokenizer.apply_chat_template(
messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(
input_ids=inputs,
max_new_tokens=512,
temperature=0.2,
do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
What to check in the output
- Naming conventions -- Does it use `snake_case` or `camelCase` like your codebase?
- Import patterns -- Does it import from your internal modules or hallucinate library names?
- Architecture patterns -- Does it follow your project's structure (e.g., service layer, repository pattern)?
- Error handling -- Does it use your team's error handling conventions?
- Documentation style -- Does it match your docstring format?
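A couple of these checks are easy to automate with simple heuristics. A rough sketch for Python output, assuming a hypothetical internal package name (yourcompany) to test imports against:
# Rough, automatable spot-checks for generated Python code. The internal
# package name "yourcompany" and the snake_case expectation are assumptions --
# adjust them to your own conventions.
import re

def check_conventions(code: str) -> dict:
    """Flag obvious style mismatches in a generated snippet."""
    camelcase_defs = re.findall(r"def ([a-z]+[A-Z]\w*)\(", code)
    imports = re.findall(r"^\s*(?:from|import)\s+([\w.]+)", code, re.MULTILINE)
    return {
        "camelCase_functions": camelcase_defs,  # expect [] in a snake_case codebase
        "imported_modules": imports,            # eyeball these for hallucinated libraries
        "uses_internal_modules": any(m.startswith("yourcompany") for m in imports),
    }

# Example: run it on the decoded output from the inference test above
# print(check_conventions(tokenizer.decode(outputs[0], skip_special_tokens=True)))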
Building an eval set
Create a separate file eval_set.jsonl with 50-100 examples you held out from training. Then compute a pass rate:
import json
from difflib import SequenceMatcher
def similarity(a: str, b: str) -> float:
"""Simple string similarity score between 0 and 1."""
return SequenceMatcher(None, a, b).ratio()
def evaluate_model(model, tokenizer, eval_file: str) -> dict:
"""Run the fine-tuned model against held-out examples."""
results = {"total": 0, "high_similarity": 0, "scores": []}
with open(eval_file) as f:
for line in f:
example = json.loads(line)
# Generate output
messages = [
{"role": "user", "content": example["instruction"]},
]
inputs = tokenizer.apply_chat_template(
messages, tokenize=True,
add_generation_prompt=True, return_tensors="pt",
).to("cuda")
outputs = model.generate(
input_ids=inputs, max_new_tokens=512,
temperature=0.1, do_sample=True,
)
            # Slice off the prompt tokens so we score only the generated completion
            generated = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
score = similarity(generated, example["output"])
results["scores"].append(score)
results["total"] += 1
if score > 0.6:
results["high_similarity"] += 1
avg = sum(results["scores"]) / len(results["scores"])
results["average_similarity"] = avg
results["pass_rate"] = results["high_similarity"] / results["total"]
return results
results = evaluate_model(model, tokenizer, "eval_set.jsonl")
print(f"Average similarity: {results['average_similarity']:.2%}")
print(f"Pass rate (>0.6): {results['pass_rate']:.2%}")
A good fine-tune should hit 0.5-0.7 average similarity on held-out examples from the same codebase. Don't chase 1.0 -- that means you're overfitting.
🚀 Deploying Your Fine-Tuned Model
Once you're happy with the results, you need to actually serve the model somewhere useful.
Option 1: Export to Ollama (Recommended)
The simplest path. Export your merged model to GGUF format, then load it in Ollama:
# Save as GGUF for Ollama
model.save_pretrained_gguf(
"./qwen-coder-finetuned-gguf",
tokenizer,
quantization_method="q4_k_m", # Good balance of size and quality
)
Then create an Ollama Modelfile:
FROM ./qwen-coder-finetuned-gguf/unsloth.Q4_K_M.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM """You are an expert software engineer. Write clean, well-documented
code following the project's conventions."""
ollama create my-coder -f Modelfile
ollama run my-coder
Now your fine-tuned model runs through Ollama, which means it works with every tool that speaks the OpenAI protocol -- Cursor, Continue, VS Code, or your own scripts.
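For your own scripts, here's a minimal sketch against Ollama's OpenAI-compatible endpoint -- the model name is whatever you passed to ollama create:
# Call the fine-tuned model through Ollama's OpenAI-compatible endpoint.
# Ollama ignores the API key, but the client requires one to be set.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="my-coder",  # the name from `ollama create my-coder -f Modelfile`
    messages=[{"role": "user", "content": "Write a function to process incoming webhook events"}],
    temperature=0.2,
)
print(response.choices[0].message.content)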
Option 2: Serve with vLLM (Production)
For team-wide or production deployments, vLLM gives you high-throughput serving with LoRA adapter hot-swapping:
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-Coder-7B-Instruct \
--enable-lora \
--lora-modules my-coder=./qwen-coder-finetuned-lora \
--port 8000
This serves an OpenAI-compatible API where you can switch between the base model and your fine-tuned adapter per request -- useful when you have multiple teams with different fine-tunes.
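Requests select the adapter by name. A minimal sketch, assuming the server command above:
# Query the vLLM OpenAI-compatible server. Passing the adapter name registered
# via --lora-modules routes the request through the fine-tuned weights; passing
# the base model name skips the adapter.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # any key works unless --api-key is set

response = client.chat.completions.create(
    model="my-coder",  # or "Qwen/Qwen2.5-Coder-7B-Instruct" for the base model
    messages=[{"role": "user", "content": "Add a retry wrapper around the payment client"}],
    temperature=0.2,
)
print(response.choices[0].message.content)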
Hardware costs for serving
| Setup | Hardware | Monthly Cost | Throughput |
|---|---|---|---|
| Local (Ollama) | Your own GPU | $0 (electricity) | 30-60 tok/s on RTX 4090 |
| Cloud GPU (RunPod) | RTX 4090 24GB | ~$300/month | 30-60 tok/s |
| Cloud GPU (Lambda) | A10G 24GB | ~$400/month | 40-80 tok/s |
| Managed (Together.ai) | Serverless | ~$0.20/M tokens | 50-100 tok/s |
🛠️ Troubleshooting
CUDA out of memory
The most common error. Solutions in order of impact:
- Reduce `per_device_train_batch_size` to 1 or 2
- Enable gradient checkpointing (already on in our config)
- Reduce `max_seq_length` from 4096 to 2048
- Lower `LORA_RANK` from 32 to 16 (less adapter capacity but less memory)
- Use a smaller base model -- Qwen2.5-Coder-1.5B fits on almost any GPU
Training loss doesn't decrease
- Learning rate too low. Try `5e-4` instead of `2e-4`
- Dataset too small. Below ~200 examples, the model barely learns anything meaningful
- Data quality issue. Garbage in, garbage out. Check your instruction-output pairs manually
Model outputs gibberish after fine-tuning
- Overfitting. Reduce epochs from 3 to 1-2, or add more training data
- Wrong chat template. Each model family has its own template. Using Llama's template on a Qwen model produces nonsense. Always use `tokenizer.apply_chat_template()`
- Corrupted save. Re-run training and save again. Check disk space
Fine-tuned model ignores its training
- LoRA rank too low. Bump from 16 to 32 or 64
- Not enough epochs. Try 3-5 epochs for small datasets (<1,000 examples)
- Base model too large for the task. A 32B model needs more examples to shift its behavior than a 7B model
Import errors with Unsloth
- Unsloth is sensitive to version combinations. Use the auto-installer:
wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -
🔮 What's Next
Once you have your first fine-tuned model running:
- Experiment with different LoRA ranks -- try r=16, 32, and 64 to find the sweet spot between adapter size and quality for your specific codebase
- Build a hybrid pipeline -- pair your fine-tuned model with a RAG layer over your documentation using the RAG Pipeline Tutorial
- Set up continuous fine-tuning -- retrain monthly as your codebase evolves, treating your adapter like a dependency that gets updated
- Compare multiple base models -- fine-tune both Qwen2.5-Coder-7B and DeepSeek-Coder-6.7B on the same dataset and benchmark them side-by-side
- Explore reinforcement learning from human feedback (RLHF) -- use DPO or ORPO to further align your model using developer preferences from code reviews
The gap between "generic AI coding assistant" and "AI that knows your codebase" is exactly what fine-tuning closes. It's no longer a technique reserved for ML teams with dedicated GPU clusters. A single developer with a weekend and a consumer GPU can build a model that writes code indistinguishable from the rest of the team. That's the real shift in 2026 -- not bigger models, but personalized ones.
Want to run your fine-tuned model locally? See our Local LLM + Ollama RAG Guide for the complete local setup. For an overview of AI coding tools that can use your fine-tuned model, check out AI Coding Agents Compared and The Rise of the AI Engineer.