Fine-Tuning Open-Source LLMs on Your Codebase: A Practical Guide

 

You've tried RAG. You've tried prompt engineering. You've written system prompts so detailed they could pass a bar exam. And yet your AI coding assistant still generates AbstractFactoryProviderManagerBean when your entire codebase uses plain functions and snake_case. The model doesn't know your code conventions, your internal libraries, or your team's patterns -- because it was never trained on them.

Fine-tuning fixes that. Instead of shoving context into a prompt window and hoping the model pays attention, you bake your codebase's patterns directly into the model's weights. The result is a model that writes code the way your team writes code, without needing 50 lines of system prompt to get there.

But fine-tuning has a reputation problem. People think it requires a cluster of A100s, a PhD in machine learning, and three months of spare time. In 2026, that's no longer true. With LoRA and QLoRA, you can fine-tune a 7B-parameter code model on a single consumer GPU in a few hours. This guide shows you exactly how.


🧰 What You'll Need

  • A GPU with at least 16 GB VRAM -- RTX 4090 (24 GB), A100, or similar. QLoRA can squeeze onto 12 GB for 7B models
  • Python 3.10+ with pip or conda
  • A codebase -- the code you want the model to learn from (any language, any size)
  • Basic Python and command-line fluency -- you don't need ML experience, but you should be comfortable with pip and Jupyter notebooks
  • 30-120 minutes -- for dataset prep and a training run on a 7B model
  • Optional: Hugging Face account -- for downloading gated models and pushing your fine-tuned adapter
Tip: No local GPU? Google Colab's free tier gives you a T4 (16 GB), which is enough for QLoRA fine-tuning of 7B models. A paid Colab plan gets you access to beefier GPUs (including A100s) for larger models.

🤔 Fine-Tuning vs RAG: When to Use Which

Before you start fine-tuning, make sure it's actually the right tool for the problem. This is the single most common mistake -- people fine-tune when RAG would have been cheaper, faster, and better.

| Factor | Fine-Tuning | RAG |
|---|---|---|
| Best for | Teaching style, conventions, patterns | Answering questions about specific content |
| Data freshness | Frozen at training time | Always up-to-date |
| Setup effort | Hours to days | Minutes to hours |
| Inference cost | ✅ Zero extra cost (no retrieval step) | ⚠️ Retrieval adds latency |
| Data needed | 500-10,000+ examples | Any amount of documents |
| When data changes | ❌ Must retrain | ✅ Just re-index |
| Learns new APIs | ✅ Deeply internalizes patterns | ⚠️ Surface-level, depends on retrieved context |
| Hallucination risk | ⚠️ Can hallucinate learned patterns | ✅ Grounded in retrieved documents |

Use fine-tuning when:

  • Your model needs to consistently follow coding conventions (naming, structure, patterns) across your entire codebase
  • You have internal libraries or frameworks the base model has never seen
  • You want the model to generate boilerplate that matches your architecture without specifying it every time
  • Inference latency matters and you can't afford a retrieval step

Use RAG when:

  • You need to query specific documents for factual answers (docs, wikis, specs)
  • Your knowledge base changes frequently (API docs, runbooks)
  • You want to cite sources for generated answers
  • You don't have enough structured training data for fine-tuning

The hybrid approach (best of both)

The real power move in 2026 is combining both: fine-tune a base model so it understands your conventions and style, then augment it with RAG so it can reference current documentation. A lightly fine-tuned Qwen2.5-Coder with a RAG layer over your docs gives you a model that writes code in your style while staying grounded in your latest specs.
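Here's what that hybrid wiring can look like in practice, as a minimal sketch. It assumes the fine-tuned model is already served behind an OpenAI-compatible endpoint (the Ollama setup covered later in this guide) under the name my-coder, and retrieve_docs is a stand-in for whatever retrieval layer you already have.

# Minimal sketch of the hybrid pattern: the fine-tuned model supplies style
# and conventions, retrieved docs supply current facts. Assumes an
# OpenAI-compatible endpoint (e.g. Ollama, set up later in this guide)
# serving the fine-tuned model as "my-coder"; retrieve_docs() is a
# placeholder for your existing retriever.
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")


def retrieve_docs(query: str, k: int = 3) -> list[str]:
    """Placeholder -- plug in your vector search / RAG retriever here."""
    return []


def generate_code(task: str) -> str:
    context = "\n\n".join(retrieve_docs(task))
    response = client.chat.completions.create(
        model="my-coder",
        messages=[
            {"role": "system", "content": "Follow the project's coding conventions."},
            {"role": "user", "content": f"Relevant docs:\n{context}\n\nTask: {task}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content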

If you're new to RAG, check out the Local LLM + Ollama RAG Guide or the RAG Pipeline Tutorial first. Come back here when RAG alone isn't enough.


🧠 Picking Your Base Model

Not all open-source models are created equal for code tasks. Here's what you should consider in 2026:

| Model | Params | License | Strengths | Fine-Tune Friendly? |
|---|---|---|---|---|
| Qwen2.5-Coder | 1.5B-32B | Apache 2.0 | Best open-source code model, multilingual | ✅ Excellent |
| DeepSeek-Coder-V2 | 16B/236B | MIT | Strong reasoning, fill-in-middle support | ✅ Good |
| CodeLlama | 7B-34B | Llama 2 License | Mature ecosystem, well-documented fine-tuning | ✅ Good |
| StarCoder2 | 3B-15B | BigCode OpenRAIL-M | Trained on 600+ languages, strong completion | ✅ Good |
| Llama 3.1/3.3 | 8B-70B | Llama 3 License | General-purpose but solid at code | ✅ Good |

My recommendation: Start with Qwen/Qwen2.5-Coder-7B-Instruct. It punches well above its weight on coding benchmarks, it's Apache 2.0 licensed (use it commercially without worry), and it fine-tunes beautifully with LoRA. If you have a bigger GPU budget, Qwen2.5-Coder-32B-Instruct -- which Qwen reports as rivaling GPT-4o on code benchmarks -- is the current state of the art for open-source code models.

For the rest of this guide, we'll use the 7B variant. Everything scales to larger models -- you just need more VRAM and patience.


📦 Preparing Your Dataset

This is the step that makes or breaks your fine-tune. A well-prepared dataset of 1,000 examples will outperform a sloppy dataset of 50,000. Quality is everything.

Dataset format

Fine-tuning for code tasks typically uses instruction-completion pairs in JSONL format. Each example has an instruction (what you'd ask the model) and an output (what you want it to generate):

{"instruction": "Write a function that validates user input for the payment form", "output": "def validate_payment_input(form_data: dict) -> ValidationResult:\n    \"\"\"Validate payment form fields against business rules.\"\"\"\n    errors = []\n    \n    if not form_data.get('amount') or form_data['amount'] <= 0:\n        errors.append(ValidationError('amount', 'Must be positive'))\n    \n    if not is_valid_currency(form_data.get('currency', '')):\n        errors.append(ValidationError('currency', 'Unsupported currency'))\n    \n    return ValidationResult(valid=len(errors) == 0, errors=errors)"}

Extracting training data from your codebase

Here's a Python script that crawls your codebase and generates instruction-completion pairs using an existing LLM (Claude, GPT-4o, or a local model) to create the "instruction" half:

import os
import json
from pathlib import Path
from openai import OpenAI

# Point this at your LLM of choice (cloud or local via Ollama)
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "ollama"),
    base_url=os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
)
TEACHER_MODEL = os.environ.get("TEACHER_MODEL", "gpt-4o")

# File extensions to include
CODE_EXTENSIONS = {".py", ".ts", ".js", ".go", ".rs", ".java", ".rb"}
MAX_FILE_SIZE = 8000  # characters -- skip very large files


def extract_functions(file_path: str) -> list[str]:
    """Read a file and split it into individual functions/classes."""
    content = Path(file_path).read_text(encoding="utf-8", errors="ignore")
    if len(content) > MAX_FILE_SIZE:
        return []  # Skip massive files
    return [content]  # Simplified; use tree-sitter for precise extraction


def generate_instruction(code_snippet: str, file_path: str) -> str:
    """Use a teacher model to generate a natural instruction for code."""
    response = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "You generate concise coding instructions. Given a code "
                    "snippet, write a short instruction that a developer might "
                    "give to produce this code. Be specific about the task, "
                    "not generic. One sentence only."
                ),
            },
            {
                "role": "user",
                "content": f"File: {file_path}\n\n```\n{code_snippet}\n```",
            },
        ],
        max_tokens=150,
        temperature=0.3,
    )
    return response.choices[0].message.content.strip()


def build_dataset(codebase_dir: str, output_file: str = "training_data.jsonl"):
    """Walk a codebase and generate instruction-completion pairs."""
    pairs = []
    codebase = Path(codebase_dir)

    for file_path in codebase.rglob("*"):
        if file_path.suffix not in CODE_EXTENSIONS:
            continue
        if any(skip in str(file_path) for skip in ["node_modules", ".venv", "__pycache__", ".git"]):
            continue

        snippets = extract_functions(str(file_path))
        for snippet in snippets:
            instruction = generate_instruction(snippet, str(file_path.relative_to(codebase)))
            pairs.append({
                "instruction": instruction,
                "output": snippet,
            })
            print(f"  Generated pair for {file_path.name} ({len(pairs)} total)")

    with open(output_file, "w") as f:
        for pair in pairs:
            f.write(json.dumps(pair) + "\n")

    print(f"\nDataset saved: {output_file} ({len(pairs)} examples)")
    return pairs


if __name__ == "__main__":
    import sys
    codebase_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    build_dataset(codebase_dir)
Run it from the command line, pointing the teacher at either a cloud model or a local one:

# Using a cloud model as the teacher
export OPENAI_API_KEY="sk-..."
export LLM_BASE_URL="https://api.openai.com/v1"
export TEACHER_MODEL="gpt-4o"
python build_dataset.py /path/to/your/codebase

# Or using a local model via Ollama (free, private)
export TEACHER_MODEL="llama3.1:8b"
python build_dataset.py /path/to/your/codebase

Dataset quality checklist

Before training, audit your dataset (a cleanup script follows this checklist):

  • Remove duplicates. Near-identical examples waste compute and can cause overfitting
  • Remove trivial examples. One-line getters and empty __init__ methods don't teach useful patterns
  • Balance your data. If 80% of your examples are API handlers and 5% are database queries, the model will learn API handlers and forget everything else
  • Verify formatting. Broken JSON lines will crash training. Validate with python -c "import json; [json.loads(l) for l in open('training_data.jsonl')]"
  • Aim for 500-5,000 examples. Below 500, the model barely shifts. Above 5,000, you get diminishing returns for LoRA fine-tuning
Warning: Don't include secrets, API keys, passwords, or PII in your training data. The model will memorize them and potentially reproduce them during inference. Scrub your dataset before training.
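Here's a minimal cleanup script that handles the mechanical parts of this checklist: exact-duplicate removal, trivial-example filtering, JSON validation, and a basic secret scan. The secret patterns and the three-line threshold are illustrative heuristics, not exhaustive rules, so keep a manual review pass.

import json
import re

# Illustrative secret patterns -- extend these for your own environment.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|secret|password|token)\s*[:=]\s*['\"][^'\"]{8,}"),
    re.compile(r"sk-[A-Za-z0-9]{20,}"),  # OpenAI-style keys
]


def audit_dataset(path: str = "training_data.jsonl", out: str = "training_data.clean.jsonl"):
    seen, kept, dropped = set(), [], 0
    with open(path) as f:
        for i, line in enumerate(f, 1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                print(f"Line {i}: invalid JSON, dropping")
                dropped += 1
                continue
            output = example.get("output", "")
            if len(output.strip().splitlines()) < 3:
                dropped += 1  # trivial example (heuristic threshold)
                continue
            if any(p.search(output) for p in SECRET_PATTERNS):
                print(f"Line {i}: possible secret, dropping")
                dropped += 1
                continue
            if output.strip() in seen:
                dropped += 1  # exact duplicate
                continue
            seen.add(output.strip())
            kept.append(example)

    with open(out, "w") as f:
        for example in kept:
            f.write(json.dumps(example) + "\n")
    print(f"Kept {len(kept)}, dropped {dropped} -> {out}")


if __name__ == "__main__":
    audit_dataset()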

🔧 Fine-Tuning with QLoRA (Step-by-Step)

Time to actually train. We'll use Unsloth, which wraps Hugging Face's PEFT library with significant speed and memory optimizations. It's free, open-source, and makes QLoRA fine-tuning about 2x faster than vanilla implementations.

Understanding LoRA and QLoRA

Before we run the code, here's what these acronyms mean in plain English:

LoRA (Low-Rank Adaptation) freezes all of the original model's weights and injects small trainable matrices into specific layers. Instead of updating 7 billion parameters, you train roughly 10-50 million. The result is a tiny "adapter" file (typically 50-200 MB) that sits on top of the frozen base model.

QLoRA goes one step further: it loads the frozen base model in 4-bit quantized format, cutting memory usage by ~75%. The trainable LoRA matrices stay in 16-bit for precision. This is what lets you fine-tune a 7B model on a 16 GB GPU.

┌─────────────────────────────────────────────┐
│            Base Model (Frozen)              │
│         7B params in 4-bit (QLoRA)          │
│              ~4 GB VRAM                     │
│                                             │
│   ┌──────────────┐  ┌──────────────┐        │
│   │ LoRA Adapter │  │ LoRA Adapter │  ...   │
│   │  (Trainable) │  │  (Trainable) │        │
│   │  ~30M params │  │  ~30M params │        │
│   │   16-bit     │  │   16-bit     │        │
│   └──────────────┘  └──────────────┘        │
│                                             │
│          Total: ~6-8 GB VRAM                │
└─────────────────────────────────────────────┘
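You can sanity-check the adapter size yourself: each LoRA-targeted weight matrix of shape (d_out × d_in) adds r × (d_in + d_out) trainable parameters. Here's a back-of-the-envelope estimate using approximate dimensions for a Qwen2.5-7B-class model -- treat the numbers as assumptions and check the model's config.json for the real values.

# Back-of-the-envelope LoRA size estimate. Dimensions are approximate for a
# Qwen2.5-7B-class model (hidden ~3584, MLP ~18944, grouped-query KV dim ~512,
# 28 layers) -- verify against the model's config.json.
rank = 32
hidden, kv_dim, mlp, layers = 3584, 512, 18944, 28

# (d_in, d_out) for each targeted projection in one transformer layer
projections = [
    (hidden, hidden),  # q_proj
    (hidden, kv_dim),  # k_proj
    (hidden, kv_dim),  # v_proj
    (hidden, hidden),  # o_proj
    (hidden, mlp),     # gate_proj
    (hidden, mlp),     # up_proj
    (mlp, hidden),     # down_proj
]

per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections)
total = per_layer * layers
print(f"~{total / 1e6:.0f}M trainable params, ~{total * 2 / 1e6:.0f} MB at 16-bit")
# Roughly 80M parameters here -- about 1% of the 7B base model.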

Step 1: Install dependencies

pip install torch torchvision torchaudio
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes xformers
pip install datasets

Or on Google Colab:

%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes xformers

Step 2: Load the base model with QLoRA

from unsloth import FastLanguageModel
import torch

# --- Configuration ---
MODEL_NAME = "Qwen/Qwen2.5-Coder-7B-Instruct"
MAX_SEQ_LENGTH = 4096
LORA_RANK = 32          # Higher = more capacity, more VRAM
LORA_ALPHA = 64         # Usually 2x rank
LORA_DROPOUT = 0.05

# Load model in 4-bit (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,          # Auto-detect (float16 on GPU)
    load_in_4bit=True,   # QLoRA: 4-bit quantization
)

# Add LoRA adapters to the model
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    use_gradient_checkpointing="unsloth",  # Memory optimization
)

print(f"Trainable parameters: {model.print_trainable_parameters()}")
# Output: trainable params: 41,943,040 || all params: 7,657,018,368 || trainable%: 0.5476

Step 3: Format and load your dataset

from datasets import load_dataset

# Load your JSONL dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Define the chat template for Qwen2.5-Coder
def format_instruction(example):
    """Convert instruction-output pairs to the model's chat format."""
    messages = [
        {
            "role": "system",
            "content": "You are an expert software engineer. Write clean, "
                       "well-documented code following the project's conventions.",
        },
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    return {"text": text}

# Apply formatting
dataset = dataset.map(format_instruction)

print(f"Dataset size: {len(dataset)} examples")
print(f"Sample:\n{dataset[0]['text'][:500]}...")

Step 4: Configure training

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    packing=True,  # Pack short examples together for efficiency
    args=TrainingArguments(
        # --- Core ---
        output_dir="./output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,   # Effective batch size: 16

        # --- Optimizer ---
        learning_rate=2e-4,
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,

        # --- Memory ---
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        gradient_checkpointing=True,

        # --- Logging ---
        logging_steps=10,
        save_strategy="epoch",

        # --- Reproducibility ---
        seed=42,
    ),
)
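These settings give an effective batch size of 4 × 4 = 16. Before launching, it's worth a quick sanity check of how many optimizer steps the run will take; the sketch below is a rough estimate where num_examples is a placeholder for your dataset size, and packing will reduce the real count.

# Rough step-count estimate before launching the run. num_examples is a
# placeholder; packing merges short examples, so the real count will be lower.
num_examples = 2000
per_device_batch = 4
grad_accum = 4
epochs = 3

effective_batch = per_device_batch * grad_accum
steps_per_epoch = -(-num_examples // effective_batch)  # ceiling division
print(f"~{steps_per_epoch} steps/epoch, ~{steps_per_epoch * epochs} steps total")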

Step 5: Train

print("Starting fine-tuning...")
trainer_stats = trainer.train()

print(f"\nTraining complete!")
print(f"  Total steps: {trainer_stats.global_step}")
print(f"  Training loss: {trainer_stats.training_loss:.4f}")
print(f"  Runtime: {trainer_stats.metrics['train_runtime']:.0f} seconds")

On an RTX 4090 with 2,000 examples and 3 epochs, expect roughly 20-40 minutes of training time. On a T4 (Colab free tier), more like 60-90 minutes.

Step 6: Save the adapter

# Save the LoRA adapter (small, ~100-200 MB)
model.save_pretrained("./qwen-coder-finetuned-lora")
tokenizer.save_pretrained("./qwen-coder-finetuned-lora")

# Optional: Push to Hugging Face Hub
# model.push_to_hub("your-username/qwen-coder-finetuned-lora")

# Optional: Merge adapter into base model and save full weights
# (Larger file, but easier to deploy)
model.save_pretrained_merged(
    "./qwen-coder-finetuned-merged",
    tokenizer,
    save_method="merged_16bit",  # or "merged_4bit" for quantized
)
Tip: The LoRA adapter is just ~100-200 MB. You can keep multiple adapters for different projects and swap them at inference time without reloading the base model. Think of them like lightweight "plugins" for your LLM.
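If you go that route, adapter swapping with plain PEFT looks roughly like this. It's a sketch: the second adapter directory and the adapter names are placeholders.

# Rough sketch of keeping several project adapters on one loaded base model
# with plain PEFT. The second adapter directory and adapter names are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Attach the adapter trained above, then register a hypothetical second one
model = PeftModel.from_pretrained(
    base_model, "./qwen-coder-finetuned-lora", adapter_name="backend"
)
model.load_adapter("./qwen-coder-frontend-lora", adapter_name="frontend")

model.set_adapter("backend")   # generate with backend conventions
# ... run inference ...
model.set_adapter("frontend")  # switch adapters without reloading the 7B base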

✅ Testing and Evaluating Your Fine-Tuned Model

Training loss going down doesn't mean your model is useful. You need to actually test it on real tasks.

Quick inference test

from unsloth import FastLanguageModel

# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./qwen-coder-finetuned-lora",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Test with a prompt
messages = [
    {"role": "system", "content": "You are an expert software engineer."},
    {"role": "user", "content": "Write a function to process incoming webhook events"},
]

inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    temperature=0.2,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What to check in the output

  • Naming conventions -- Does it use snake_case or camelCase like your codebase? (a quick spot-check sketch follows this list)
  • Import patterns -- Does it import from your internal modules or hallucinate library names?
  • Architecture patterns -- Does it follow your project's structure (e.g., service layer, repository pattern)?
  • Error handling -- Does it use your team's error handling conventions?
  • Documentation style -- Does it match your docstring format?
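The first two checks are easy to automate as a rough filter. Here's a minimal sketch; the regexes, internal package prefixes, and stdlib allowlist are illustrative placeholders, not a complete linter.

# Quick, illustrative spot-checks for naming style and imports in a generated
# snippet. INTERNAL_PREFIXES and KNOWN_STDLIB are placeholders -- adapt them
# to your codebase.
import re

INTERNAL_PREFIXES = ("myapp", "internal_lib")        # placeholder package names
KNOWN_STDLIB = {"os", "json", "re", "typing", "pathlib"}


def check_conventions(code: str) -> dict:
    camel = re.findall(r"\bdef\s+[a-z]+[A-Z]\w*", code)
    snake = re.findall(r"\bdef\s+[a-z]+(?:_[a-z0-9]+)+", code)
    imports = re.findall(r"^\s*(?:from|import)\s+([\w.]+)", code, re.MULTILINE)
    suspicious = [
        m for m in imports
        if m.split(".")[0] not in INTERNAL_PREFIXES
        and m.split(".")[0] not in KNOWN_STDLIB
    ]
    return {
        "snake_case_defs": len(snake),
        "camelCase_defs": len(camel),
        "imports_to_review": suspicious,  # possible hallucinated libraries
    }


sample = (
    "from myapp.payments import charge\n\n"
    "def process_webhook_event(payload):\n"
    "    return charge(payload)\n"
)
print(check_conventions(sample))
# {'snake_case_defs': 1, 'camelCase_defs': 0, 'imports_to_review': []}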

Building an eval set

Create a separate file eval_set.jsonl with 50-100 examples you held out from training. Then compute a pass rate:

import json
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Simple string similarity score between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()

def evaluate_model(model, tokenizer, eval_file: str) -> dict:
    """Run the fine-tuned model against held-out examples."""
    results = {"total": 0, "high_similarity": 0, "scores": []}

    with open(eval_file) as f:
        for line in f:
            example = json.loads(line)
            # Generate output
            messages = [
                {"role": "user", "content": example["instruction"]},
            ]
            inputs = tokenizer.apply_chat_template(
                messages, tokenize=True,
                add_generation_prompt=True, return_tensors="pt",
            ).to("cuda")
            outputs = model.generate(
                input_ids=inputs, max_new_tokens=512,
                temperature=0.1, do_sample=True,
            )
            generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

            score = similarity(generated, example["output"])
            results["scores"].append(score)
            results["total"] += 1
            if score > 0.6:
                results["high_similarity"] += 1

    avg = sum(results["scores"]) / len(results["scores"])
    results["average_similarity"] = avg
    results["pass_rate"] = results["high_similarity"] / results["total"]
    return results

results = evaluate_model(model, tokenizer, "eval_set.jsonl")
print(f"Average similarity: {results['average_similarity']:.2%}")
print(f"Pass rate (>0.6): {results['pass_rate']:.2%}")

A good fine-tune should hit 0.5-0.7 average similarity on held-out examples from the same codebase. Don't chase 1.0 -- that means you're overfitting.


🚀 Deploying Your Fine-Tuned Model

Once you're happy with the results, you need to actually serve the model somewhere useful.

Option 1: Run with Ollama (Local)

The simplest path: export your merged model to GGUF format, then load it in Ollama:

# Save as GGUF for Ollama
model.save_pretrained_gguf(
    "./qwen-coder-finetuned-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Good balance of size and quality
)

Then create an Ollama Modelfile:

FROM ./qwen-coder-finetuned-gguf/unsloth.Q4_K_M.gguf

PARAMETER temperature 0.2
PARAMETER num_ctx 4096

SYSTEM """You are an expert software engineer. Write clean, well-documented
code following the project's conventions."""
Build and run it:

ollama create my-coder -f Modelfile
ollama run my-coder

Now your fine-tuned model runs through Ollama, which means it works with every tool that speaks the OpenAI protocol -- Cursor, Continue, VS Code, or your own scripts.

Option 2: Serve with vLLM (Production)

For team-wide or production deployments, vLLM gives you high-throughput serving with LoRA adapter hot-swapping:

pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-Coder-7B-Instruct \
    --enable-lora \
    --lora-modules my-coder=./qwen-coder-finetuned-lora \
    --port 8000

This serves an OpenAI-compatible API where you can switch between the base model and your fine-tuned adapter per request -- useful when you have multiple teams with different fine-tunes.
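For example, a client can pick the adapter or the base model per request just by changing the model name. This is a sketch against the server command above; the adapter name my-coder matches the --lora-modules flag.

# Pick the adapter or the base model per request by changing the model name.
# "my-coder" matches the name registered via --lora-modules above.
from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:8000/v1")

for model_name in ("my-coder", "Qwen/Qwen2.5-Coder-7B-Instruct"):
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": "Write a function to process incoming webhook events"}],
        temperature=0.2,
        max_tokens=256,
    )
    print(f"--- {model_name} ---")
    print(response.choices[0].message.content)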

Hardware costs for serving

| Setup | Hardware | Monthly Cost | Throughput |
|---|---|---|---|
| Local (Ollama) | Your own GPU | $0 (electricity) | 30-60 tok/s on RTX 4090 |
| Cloud GPU (RunPod) | RTX 4090 (24 GB) | ~$300/month | 30-60 tok/s |
| Cloud GPU (Lambda) | A10G (24 GB) | ~$400/month | 40-80 tok/s |
| Managed (Together.ai) | Serverless | ~$0.20/M tokens | 50-100 tok/s |

🛠️ Troubleshooting

CUDA out of memory

The most common error. Solutions in order of impact:

  1. Reduce per_device_train_batch_size to 1 or 2
  2. Enable gradient checkpointing (already on in our config)
  3. Reduce max_seq_length from 4096 to 2048
  4. Lower LORA_RANK from 32 to 16 (less adapter capacity but less memory)
  5. Use a smaller base model -- Qwen2.5-Coder-1.5B fits on almost any GPU

Training loss doesn't decrease

  • Learning rate too low. Try 5e-4 instead of 2e-4
  • Dataset too small. Below ~200 examples, the model barely learns anything meaningful
  • Data quality issue. Garbage in, garbage out. Check your instruction-output pairs manually

Model outputs gibberish after fine-tuning

  • Overfitting. Reduce epochs from 3 to 1-2, or add more training data
  • Wrong chat template. Each model family has its own template. Using Llama's template on a Qwen model produces nonsense. Always use tokenizer.apply_chat_template()
  • Corrupted save. Re-run training and save again. Check disk space

Fine-tuned model ignores its training

  • LoRA rank too low. Bump from 16 to 32 or 64
  • Not enough epochs. Try 3-5 epochs for small datasets (<1,000 examples)
  • Base model too large for the task. A 32B model needs more examples to shift its behavior than a 7B model

Import errors with Unsloth

  • Unsloth is sensitive to version combinations. Use the auto-installer:
wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -
Important: If you're fine-tuning on proprietary code, keep your adapter weights private. A LoRA adapter trained on your codebase contains learned representations of your code patterns. Treat it like source code -- don't push it to a public Hugging Face repo unless you're okay with that.

🔮 What's Next

Once you have your first fine-tuned model running:

  • Experiment with different LoRA ranks -- try r=16, 32, and 64 to find the sweet spot between adapter size and quality for your specific codebase
  • Build a hybrid pipeline -- pair your fine-tuned model with a RAG layer over your documentation using the RAG Pipeline Tutorial
  • Set up continuous fine-tuning -- retrain monthly as your codebase evolves, treating your adapter like a dependency that gets updated
  • Compare multiple base models -- fine-tune both Qwen2.5-Coder-7B and DeepSeek-Coder-6.7B on the same dataset and benchmark them side-by-side
  • Explore reinforcement learning from human feedback (RLHF) -- use DPO or ORPO to further align your model using developer preferences from code reviews

The gap between "generic AI coding assistant" and "AI that knows your codebase" is exactly what fine-tuning closes. It's no longer a technique reserved for ML teams with dedicated GPU clusters. A single developer with a weekend and a consumer GPU can build a model that writes code indistinguishable from the rest of the team. That's the real shift in 2026 -- not bigger models, but personalized ones.

Want to run your fine-tuned model locally? See our Local LLM + Ollama RAG Guide for the complete local setup. For an overview of AI coding tools that can use your fine-tuned model, check out AI Coding Agents Compared and The Rise of the AI Engineer.




