Fine-Tuning Transformer Models with Low-Rank Adaptation (LoRA)
Learn LoRA fine-tuning step by step: the math behind low-rank adaptation, QLoRA quantization, Unsloth training, hyperparameter selection, and practical code for consumer GPUs.

Introduction: Why Fine-Tuning Still Matters
Large language models are remarkably capable out of the box. GPT-5, Claude, Qwen 3.5, and Mistral can follow instructions, write code, and reason through complex problems without any task-specific training. So why would you fine-tune one?
Because general capability is not the same as specialized performance. A base model can write SQL, but it does not know your company's schema conventions. It can summarize text, but it does not match the tone your editorial team requires. It can extract entities, but it has never seen your proprietary taxonomy. Fine-tuning bridges the gap between general intelligence and domain-specific precision.
But fine-tuning is not always the right answer. Before you commit GPU hours, you need a clear decision framework.
The Decision Tree: Prompt Engineering vs. RAG vs. Fine-Tuning
Start with prompt engineering. If you can solve the problem by writing a better system prompt, few-shot examples, or structured output instructions, do that. It is the cheapest, fastest, and most reversible approach. Prompt engineering is sufficient for most formatting and style adjustments, simple classification tasks, and problems where the model already has the knowledge but needs better elicitation.
Move to retrieval-augmented generation (RAG) when the model lacks knowledge. If the problem is that the model does not know about your internal documents, product catalog, or recent events, give it that knowledge at inference time through retrieval. RAG is the right choice when the information changes frequently, when you need citations and provenance, or when the knowledge base is too large to embed in model weights.
Fine-tune when you need to change behavior, not just knowledge. Fine-tuning is the right tool when you need the model to adopt a specific style, follow a complex output format reliably, perform a specialized task with higher accuracy than prompting achieves, or respond in a domain-specific way that cannot be captured in a prompt. It is also the right choice when you need to reduce inference cost: a fine-tuned smaller model can often match a larger model prompted for the same task, at a fraction of the serving cost.
The key insight: these approaches are not mutually exclusive. A fine-tuned model can also use RAG. A RAG pipeline benefits from a model fine-tuned on your retrieval-and-answer format. The most effective production systems layer all three.
The Problem with Full Fine-Tuning
Full fine-tuning means updating every parameter in the model. For a 7B parameter model in FP16, the weights alone occupy 14 GB. But training requires substantially more memory than inference:
- Model weights: 14 GB (FP16)
- Gradients: 14 GB (same size as weights)
- Optimizer states: 28 GB for Adam (first- and second-moment estimates for every parameter, counted here at 16-bit precision; 56 GB if they are kept in FP32)
- Activations: Variable, often 10-30 GB depending on batch size and sequence length
A full fine-tune of a 7B model requires roughly 70-100 GB of GPU memory. For a 70B model, you need 700+ GB, a cluster of 8-10 A100 80GB GPUs with model parallelism. This puts full fine-tuning beyond the reach of most practitioners and organizations.
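The breakdown above is plain arithmetic over bytes per parameter. A rough sketch, assuming 2 bytes per value for weights, gradients, and each of Adam's two state tensors (as in the list), with activations supplied as a separate guess. The function name and defaults are illustrative, not taken from any library:

```python
def full_finetune_memory_gb(n_params, bytes_per_value=2, activations_gb=(10, 30)):
    """Rough (low, high) GPU memory estimate in GB for full fine-tuning with Adam."""
    gb = 1e9
    weights = n_params * bytes_per_value / gb
    gradients = n_params * bytes_per_value / gb        # one gradient per weight
    optimizer = 2 * n_params * bytes_per_value / gb    # Adam keeps two state tensors per weight
    fixed = weights + gradients + optimizer
    return fixed + activations_gb[0], fixed + activations_gb[1]


low, high = full_finetune_memory_gb(7e9)
print(f"7B model: roughly {low:.0f}-{high:.0f} GB")
```

Under these assumptions the estimate lands slightly below the 70-100 GB range quoted above. Mixed-precision setups that keep FP32 optimizer states or an FP32 master copy of the weights push the total higher.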
Beyond hardware cost, full fine-tuning introduces catastrophic forgetting. When you update all parameters on a narrow dataset, the model can lose its general capabilities. A model fine-tuned aggressively on medical Q&A might become worse at basic conversation, code, or reasoning. Regularization techniques help, but they add complexity and do not eliminate the problem.
What if you could fine-tune a model by updating less than 1% of its parameters, using a single consumer GPU, while preserving most of its general knowledge?
That is exactly what LoRA does.
LoRA Explained: The Low-Rank Decomposition Insight
Low-Rank Adaptation (LoRA), introduced by Hu et al. in 2021, is built on a simple but powerful observation: the weight updates during fine-tuning have low intrinsic dimensionality. You do not need to modify the full weight matrix to adapt a model: a low-rank approximation of the update is sufficient.
The Core Idea
In standard fine-tuning, you start with a pre-trained weight matrix $W_0$ of dimensions $d \times k$ and learn an update $\Delta W$, so the new weight is:
$$W = W_0 + \Delta W$$
In LoRA, instead of learning the full $\Delta W$ (which has $d \times k$ parameters), you decompose it into two smaller matrices:
$$\Delta W = BA$$
where $B$ has dimensions $d \times r$ and $A$ has dimensions $r \times k$. The rank $r$ is much smaller than both $d$ and $k$, typically 8, 16, 32, or 64, versus $d$ and $k$ values of 4096 or more.
The forward pass becomes:
$$h = W_0 x + BAx$$
The original weights $W_0$ are frozen: they receive no gradient updates. Only $A$ and $B$ are trained. This reduces the number of trainable parameters from $d \times k$ to $r \times (d + k)$. For a weight matrix of size $4096 \times 4096$ with rank $r = 16$, that is a reduction from 16.8 million parameters to 131,072, a 128x reduction.
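The parameter saving is easy to verify by hand. A minimal sketch, assuming a weight matrix of shape d x k adapted with a d x r matrix and an r x k matrix (the helper name is made up for illustration):

```python
def lora_param_counts(d, k, r):
    """Return (dense_update_params, lora_params) for a d x k weight matrix."""
    dense = d * k          # parameters in a full update matrix
    lora = r * (d + k)     # one d x r factor plus one r x k factor
    return dense, lora


dense, lora = lora_param_counts(4096, 4096, 16)
print(dense, lora, dense // lora)  # 16777216 131072 128
```

Because the LoRA count grows linearly in the rank, doubling the rank doubles the adapter size, while the dense count is unaffected.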
Initialization Matters
LoRA initializes $A$ with a random Gaussian distribution and $B$ with zeros. This means that at the start of training, $\Delta W = BA = 0$, so the model begins with exactly the pre-trained weights. Training then gradually learns the low-rank update. This zero-initialization is not just convenient; it ensures training stability and makes LoRA a strict generalization of the pre-trained model.
The Scaling Factor
LoRA introduces a scaling factor $\alpha / r$ that is applied to the low-rank update:
$$h = W_0 x + \frac{\alpha}{r} BAx$$
The hyperparameter $\alpha$ controls the magnitude of the adaptation relative to the pre-trained weights. When $\alpha$ equals $r$, the scaling factor is 1 and the update is applied at full strength. In practice, $\alpha$ is often set equal to $r$ or to twice $r$. The ratio $\alpha / r$ acts as an effective learning rate modifier for the LoRA weights.
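Initialization and scaling can be seen together in a toy layer. The sketch below is pure Python rather than a real training implementation: it stores a frozen weight, a Gaussian-initialized down-projection `A`, a zero-initialized up-projection `B`, and applies the update scaled by alpha / r. The class name, the 0.02 standard deviation, and the tiny matrices are assumptions made for illustration.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W0 (d x k) plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, W0, r, alpha, seed=0):
        rng = random.Random(seed)
        d, k = len(W0), len(W0[0])
        self.W0 = W0                                   # frozen, never updated
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(k)] for _ in range(r)]  # r x k
        self.B = [[0.0] * r for _ in range(d)]         # d x r, starts at zero
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W0, x)
        update = matvec(self.B, matvec(self.A, x))     # B (A x): the d x k product is never formed
        return [b + self.scale * u for b, u in zip(base, update)]


W0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
layer = LoRALinear(W0, r=2, alpha=4)
x = [1.0, 0.0, -1.0]

same_at_init = layer.forward(x) == matvec(W0, x)   # B is zero, so the base output is unchanged
layer.B[0][0] = 1.0                                # pretend a training step moved B
same_after_update = layer.forward(x) == matvec(W0, x)
print(same_at_init, same_after_update)
```

Computing `B (A x)` instead of `(B A) x` is the usual design choice: it keeps the extra work proportional to the rank rather than to the full matrix size.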
Why It Works: The Intrinsic Dimensionality Hypothesis
Why should a low-rank approximation work at all? The theoretical justification comes from research on the intrinsic dimensionality of neural network optimization landscapes.
Aghajanyan et al. (2020) showed that pre-trained language models have a surprisingly low intrinsic dimensionality: the optimization problem of fine-tuning can be solved in a much lower-dimensional subspace than the full parameter space suggests. In their experiments, RoBERTa-Large (355 million parameters) reached 90% of full fine-tuning performance on the MRPC task while optimizing only about 200 parameters projected back into the full space. In other words, moving along roughly 200 directions in parameter space was enough to reach a good solution on that task.
LoRA operationalizes this insight. By restricting the update to a rank-r subspace, it forces the optimization to stay within a low-dimensional manifold. The pre-trained model has already learned a rich representation of language; fine-tuning only needs to make a small, structured adjustment to specialize that representation.
Understanding the base architecture is essential (see Transformer Architectures from Scratch for the foundation these adaptations build on).
Key Hyperparameters
Getting LoRA to work well requires understanding five key hyperparameters.
Rank (r)
The rank determines the expressiveness of the adaptation. Higher rank means more parameters and more capacity to learn complex task-specific patterns. Lower rank means fewer parameters, faster training, and stronger regularization.
Practical guidance:
- $r = 8$: Good starting point for simple tasks like classification or straightforward style transfer
- $r = 16$ to $r = 32$: The sweet spot for most instruction tuning and domain adaptation tasks
- $r = 64$: Appropriate for complex tasks requiring significant behavioral changes or large, diverse datasets
- $r > 64$: Rarely needed; if you need this much capacity, consider whether LoRA is the right approach
The rank-performance curve typically shows diminishing returns. Going from r = 4 to r = 16 often yields significant improvements. Going from r = 64 to r = 256 rarely does.
Alpha
As discussed above, alpha scales the LoRA update. Common practice is to set alpha = r (scaling factor of 1) or alpha = 2 * r (scaling factor of 2). Some practitioners set alpha to a fixed value like 16 or 32 regardless of rank, then adjust the learning rate accordingly.
The effective learning rate for LoRA weights is proportional to $\alpha / r$. If you change rank, consider whether you also need to change $\alpha$ or the learning rate to maintain the same effective update magnitude.
Target Modules
LoRA can be applied to any linear layer in the model. The original paper applied it only to the query and value projection matrices in attention ($W_q$ and $W_v$). Subsequent research has shown that targeting all linear layers (including key projections $W_k$, output projections $W_o$, and the MLP layers: gate, up, and down projections) consistently improves results.
With modern frameworks like PEFT and Unsloth, targeting all linear layers is the default recommendation. The additional parameter cost is modest (you are still training less than 2-3% of total parameters), and the performance gain is meaningful.
MoE architectures need a more deliberate choice. Models like Qwen3-30B-A3B, Mixtral, and DeepSeek-V3 split their MLPs into many expert modules (each with its own gate_proj, up_proj, down_proj) plus a router. Targeting all experts blows up the adapter size by the expert count (e.g. 64x or 128x larger than the equivalent dense LoRA) and most experts are rarely activated, so most of those parameters never train meaningfully. The pragmatic recipes:
- Default for MoE: target attention linears (`q_proj`, `k_proj`, `v_proj`, `o_proj`) plus the router (`gate`) and any always-on "shared" experts. This adapts which experts get selected without touching what each one computes. Cheap, stable, and usually enough for domain adaptation.
- When you need stronger specialization: additionally target a small subset of the most-activated experts (measure activation frequency on a sample of your training data first). Skip the long tail.
- Avoid: targeting every expert by default. The cost-to-benefit ratio is poor unless you are explicitly trying to retrain expert capacity.
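Measuring activation frequency can be as simple as counting routing decisions. The sketch below assumes you have already collected a flat log of selected expert indices (one entry per token-expert assignment) from a sample of your training data; how to capture that log is model-specific and not shown, and the helper name and the 80% coverage threshold are illustrative choices.

```python
from collections import Counter


def top_experts(routing_log, coverage=0.8):
    """Return the most-used experts that together cover `coverage` of routed tokens."""
    counts = Counter(routing_log)
    total = sum(counts.values())
    selected, covered = [], 0
    for expert, n in counts.most_common():
        if covered / total >= coverage:
            break
        selected.append(expert)
        covered += n
    return selected


# Hypothetical routing decisions: expert 3 handles half of the sampled tokens
log = [3] * 50 + [7] * 30 + [1] * 15 + [0] * 5
print(top_experts(log))  # [3, 7]
```

The experts returned here are the candidates for additional LoRA targets; the rest form the long tail to skip.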
Hybrid Mamba-attention architectures (NVIDIA's Nemotron Nano / Nemotron-H, Jamba, Zamba) interleave SSM blocks with attention blocks. LoRA still works, but you need to know which sublayers are linear:
- Attention blocks: same as standard transformer (`q_proj`, `k_proj`, `v_proj`, `o_proj`).
- Mamba/SSM blocks: target the linear projections (`in_proj`, `out_proj`, `x_proj`, `dt_proj`). These are matmuls and behave like ordinary linear layers under LoRA.
- Skip the structured SSM parameters (`A_log`, `D`) and the `conv1d` weights. They are not matrix multiplications, and standard LoRA does not decompose them meaningfully. Either freeze them or, if your task truly needs them updated, use a full-parameter update on those small tensors alongside the LoRA adapters.
In PEFT, pass the module names as target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "in_proj", "out_proj", "x_proj", "dt_proj"] for a hybrid model, and verify the layer names against the loaded model's named_modules() before training (model-specific naming varies).
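A quick way to do that verification is to compare the requested names against the module names before building the config. This is a minimal sketch that mirrors suffix-style matching (a target matches a module whose dotted name ends in it); the module names below are hypothetical, and in practice you would pass `[name for name, _ in model.named_modules()]`.

```python
def check_target_modules(module_names, target_modules):
    """Split requested targets into those present as a name suffix and those missing."""
    found, missing = [], []
    for target in target_modules:
        hit = any(name == target or name.endswith("." + target) for name in module_names)
        (found if hit else missing).append(target)
    return found, missing


# Hypothetical names standing in for model.named_modules() output
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.mixer.in_proj",
    "model.layers.1.mixer.out_proj",
]
found, missing = check_target_modules(names, ["q_proj", "v_proj", "in_proj", "x_proj"])
print("found:", found)      # found: ['q_proj', 'v_proj', 'in_proj']
print("missing:", missing)  # missing: ['x_proj']
```

A non-empty `missing` list usually means the architecture names that sublayer differently or does not have it at all, which is worth resolving before training starts.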
Dropout
LoRA supports dropout on the low-rank matrices, typically set between 0 and 0.1. For most fine-tuning tasks with reasonable dataset sizes, a dropout of 0.05 works well. For very small datasets where overfitting is a concern, increase to 0.1. For large datasets, you can often set it to 0.
Learning Rate
LoRA adapters typically require a higher learning rate than full fine-tuning. Where full fine-tuning might use 1e-5 to 5e-5, LoRA commonly uses 1e-4 to 3e-4. The small number of trainable parameters means each parameter needs to change more to have the same overall effect.
QLoRA: Quantization Meets LoRA
QLoRA, introduced by Dettmers et al. in 2023, made a crucial observation: you can quantize the frozen base model weights to 4-bit precision while training LoRA adapters in higher precision. This cuts the memory needed for the base weights by roughly 4x compared to standard LoRA with FP16 weights (total training memory shrinks by less, because adapters, optimizer states, and activations stay in higher precision), making it possible to fine-tune a 7B model on a single 16 GB GPU or a 70B model on a single 48 GB GPU.
The Three Innovations of QLoRA
NormalFloat4 (NF4) quantization. Standard 4-bit quantization maps values uniformly across the representable range. But neural network weights are approximately normally distributed, so uniform quantization wastes precision on the tails where few values exist. NF4 is an information-theoretically optimal data type for normally distributed data: it spaces quantization levels more densely near zero where most weight values concentrate, reducing quantization error by roughly 25% compared to standard INT4.
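The idea of spacing levels by the normal distribution can be illustrated with the standard library. This is a simplified sketch, not the actual NF4 codebook: it places 16 levels at evenly spaced quantiles of a standard normal and rescales them to [-1, 1], whereas the real data type uses a slightly different, asymmetric construction that includes an exact zero. All function names are made up for the example.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    probs = [(i + 0.5) / n for i in range(n)]       # stay away from the infinite 0 and 1 quantiles
    q = [NormalDist().inv_cdf(p) for p in probs]
    top = max(abs(v) for v in q)
    return [v / top for v in q]


def quantize_block(weights, levels):
    """Absmax-scale one block to [-1, 1], then snap each value to the nearest level."""
    scale = max(abs(w) for w in weights) or 1.0
    idx = [min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale)) for w in weights]
    return idx, scale


def dequantize_block(idx, scale, levels):
    """Map level indices back to approximate weight values."""
    return [levels[i] * scale for i in idx]


levels = normal_quantile_levels()
block = [0.02, -0.15, 0.4, -0.01]
idx, scale = quantize_block(block, levels)
restored = dequantize_block(idx, scale, levels)
print(idx, scale)
print([round(v, 3) for v in restored])
```

The gap between neighbouring levels is smallest around zero and widest at the extremes, which is the property the paragraph above describes.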
Double quantization. Quantization requires storing scaling factors (one per block of, say, 64 weights). These scaling factors are themselves stored in FP32, and for large models they add up. Double quantization quantizes the scaling factors themselves to 8-bit, saving an additional 0.37 bits per parameter. This sounds small, but for a 70B model it saves roughly 3 GB.
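The 0.37-bit figure can be reproduced with a few lines of arithmetic. The sketch assumes the block sizes described in the QLoRA paper: one scale per 64 weights, and a second-level FP32 constant per 256 first-level scales. The helper name is illustrative.

```python
def scale_overhead_bits(block=64, second_block=256, double_quant=True):
    """Bits of quantization-constant storage per model parameter."""
    if not double_quant:
        return 32 / block                       # one FP32 scale per block of weights
    # 8-bit scales per block, plus one FP32 second-level constant per `second_block` scales
    return 8 / block + 32 / (block * second_block)


saved_bits = scale_overhead_bits(double_quant=False) - scale_overhead_bits()
saved_gb_70b = 70e9 * saved_bits / 8 / 1e9
print(f"{saved_bits:.3f} bits/param, {saved_gb_70b:.2f} GB for 70B parameters")
# 0.373 bits/param, 3.26 GB for 70B parameters
```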
Paged optimizers. During training, optimizer states can cause GPU out-of-memory errors when processing longer sequences or larger batches. QLoRA uses NVIDIA's unified memory to page optimizer states to CPU RAM when GPU memory is tight, then page them back when needed. This prevents OOM errors at the cost of occasional latency spikes during training.
QLoRA Memory Comparison
For a 7B parameter model:
| Approach | Weight Memory | Total Training Memory |
|---|---|---|
| Full fine-tuning (FP16) | 14 GB | ~70-100 GB |
| LoRA (FP16 base) | 14 GB | ~18-24 GB |
| QLoRA (NF4 base) | ~3.5 GB | ~6-10 GB |
QLoRA brings fine-tuning within reach of consumer hardware: an RTX 3090 (24 GB) or even an RTX 4070 Ti Super (16 GB) can handle 7B-13B models comfortably.
The quality cost is minimal. The original QLoRA paper showed that QLoRA fine-tuning matches full FP16 fine-tuning on benchmarks, with the quantization of the base weights introducing negligible degradation when the LoRA adapters compensate during training.
Beyond LoRA: DoRA, AdaLoRA, and Modern Variants
The success of LoRA has spawned a family of improved variants.
DoRA: Weight-Decomposed Low-Rank Adaptation
DoRA (Liu et al., 2024) decomposes the weight matrix into magnitude and direction components, then applies LoRA only to the directional component:
$$W' = m \cdot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}$$
where $m$ is a learnable magnitude vector and $\lVert \cdot \rVert_c$ is a vector-wise norm, so each weight vector is rescaled to unit length before $m$ sets its magnitude. The insight is that during full fine-tuning, the magnitude and direction of weight updates exhibit different learning patterns. Standard LoRA couples these, which can limit its expressiveness. By decomposing them, DoRA achieves performance closer to full fine-tuning while maintaining LoRA's parameter efficiency.
In practice, DoRA consistently outperforms LoRA by 1-3% on benchmarks across a range of tasks and model sizes, with only a small increase in trainable parameters (the magnitude vector adds d parameters per adapted layer). It is supported in Hugging Face PEFT and Unsloth.
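The magnitude/direction split can be shown on a tiny matrix. This is a toy sketch, not the PEFT implementation: it keeps one magnitude per row of the weight (one per weight vector, an assumption about layout), normalizes the updated weight row by row, and rescales by the magnitude. The helper names are made up for the example.

```python
import math


def row_norms(W):
    """Euclidean norm of each row of a matrix given as a list of rows."""
    return [math.sqrt(sum(v * v for v in row)) for row in W]


def dora_weight(W0, delta, m):
    """Recompose W' = m * (W0 + delta) / ||W0 + delta||, with one magnitude per row."""
    V = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, delta)]
    return [[m_i * v / n for v in row] for row, n, m_i in zip(V, row_norms(V), m)]


W0 = [[3.0, 4.0], [0.0, 2.0]]
m = row_norms(W0)                    # magnitude initialized from the pre-trained weight
zero = [[0.0, 0.0], [0.0, 0.0]]      # the low-rank update is zero at initialization
W_init = dora_weight(W0, zero, m)    # reproduces W0

delta = [[1.0, 0.0], [0.0, 0.0]]     # a change to the first row's direction
W_new = dora_weight(W0, delta, m)
print(W_init)
print(row_norms(W_new))              # row magnitudes are still set by m
```

Because the direction is renormalized, the low-rank update can only rotate each weight vector; changing its length is left entirely to the magnitude vector.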
AdaLoRA: Adaptive Rank Allocation
AdaLoRA (Zhang et al., 2023) observes that not all weight matrices need the same rank. In the paper's experiments, feed-forward modules and upper layers tended to matter more than the rest, so a uniform rank wastes part of the budget. AdaLoRA starts with a higher rank and dynamically prunes singular values during training, allocating rank budget where it matters most. This achieves better performance than fixed-rank LoRA with the same total parameter budget.
Newer Variants Worth Knowing
Three more recent additions are worth knowing, because they ship in mainstream PEFT and require almost no extra code:
- LoRA+ (Hayou et al., 2024): assigns different learning rates to the $A$ and $B$ matrices (typically a higher LR for $B$). Recovers most of the convergence-speed gap with full fine-tuning at zero parameter cost. Available in PEFT through the `create_loraplus_optimizer` helper, which takes a `loraplus_lr_ratio` argument.
- rsLoRA (rank-stabilized LoRA, Kalajdzievski 2023): replaces the scaling factor $\alpha / r$ with $\alpha / \sqrt{r}$, which keeps the magnitude of the LoRA update stable as you increase rank. Without rsLoRA, high-rank LoRA often fails to improve on low-rank because the update and its gradients shrink as $r$ grows. Set `use_rslora=True` in PEFT's `LoraConfig`.
- PiSSA (Principal Singular Values and Singular Vectors Adaptation, Meng et al., 2024): initializes $A$ and $B$ from the top-$r$ singular vectors of the pre-trained weight matrix instead of from random Gaussian / zero. Converges faster and reaches lower loss than vanilla LoRA on the same compute budget. Set `init_lora_weights="pissa"` in PEFT.
These are not mutually exclusive with each other or with DoRA. A common modern stack is DoRA + rsLoRA at rank 32-64, with PiSSA initialization for tasks that need fast convergence on a tight compute budget.
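The difference between the standard scaling (alpha divided by the rank) and the rank-stabilized scaling (alpha divided by the square root of the rank) is easiest to see numerically. A small sketch with alpha held fixed while the rank changes; the helper name is illustrative:

```python
import math


def lora_scale(alpha, r, rank_stabilized=False):
    """Multiplier applied to the low-rank update."""
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r


alpha = 16  # held fixed while the rank changes
for r in (8, 16, 64, 256):
    standard = lora_scale(alpha, r)
    stabilized = lora_scale(alpha, r, rank_stabilized=True)
    print(f"r={r:>3}  standard={standard:.4f}  rsLoRA={stabilized:.4f}")
```

Going from rank 8 to rank 256 shrinks the standard multiplier by 32x but the rank-stabilized one by only about 5.7x. The two conventions also give different multipliers for the same alpha, so an alpha tuned under one does not transfer directly to the other.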
Rank Selection Strategies
If you want to go beyond fixed-rank LoRA without adopting AdaLoRA, consider these strategies:
- Start with rank 32 and measure. Train with r = 32, then analyze the singular values of the learned update (the product of the B and A matrices) in each adapted layer. If many singular values are near zero, you are using more rank than needed.
- Task complexity scaling. Simple classification: r = 8-16. Instruction following: r = 16-32. Complex domain adaptation: r = 32-64.
- Dataset size scaling. Small datasets (< 1,000 examples) benefit from lower rank as regularization. Large datasets (> 50,000 examples) can exploit higher rank without overfitting.
Practical Guide: Fine-Tuning with Unsloth
Unsloth is an open-source library that provides optimized LoRA and QLoRA training. The project reports roughly 2x faster training and about 60% less memory than standard Hugging Face implementations, which it achieves through custom Triton kernels for the LoRA forward and backward passes, fused operations, and careful memory management.
Let us walk through a complete fine-tuning workflow.
Step 1: Environment Setup
Pin your dependencies. The PEFT / TRL / transformers stack moves fast, and small version mismatches will produce confusing import errors or silently wrong training. The following set is known to work together as of early 2026:
```text
# requirements.txt
unsloth>=2026.1.1
torch>=2.5.0
transformers>=4.49.0
peft>=0.14.0
trl>=0.13.0
bitsandbytes>=0.45.0
accelerate>=1.2.0
datasets>=3.2.0
```
Install with the right CUDA wheels for your GPU. Unsloth ships per-CUDA build hints in its README; for most current setups `pip install -r requirements.txt` followed by Unsloth's CUDA-specific install command is sufficient.
```python
from unsloth import FastLanguageModel
import torch

# Configuration
max_seq_length = 2048
dtype = None          # Auto-detect: Float16 for Tesla T4/V100, BFloat16 for Ampere+
load_in_4bit = True   # QLoRA: quantize base model to 4-bit
```
Step 2: Model Selection and Loading
Choosing the right base model is critical. The open-source models best suited for fine-tuning are covered in The Open-Source LLM Power Shift. For most tasks, start with a model that already performs well on your general domain.
You usually have two repo options for any popular base model:
- The official upstream repo (e.g. `Qwen/Qwen3.5-9B-Instruct`) ships FP16/BF16 weights. Use this when you want a known-canonical artifact and are willing to let `load_in_4bit=True` quantize on the fly during the first load. Slower first load, larger download.
- The Unsloth-prepared repo (e.g. `unsloth/Qwen3.5-9B-Instruct-bnb-4bit`) ships pre-quantized 4-bit weights and Unsloth-specific patches baked in. Use this for fastest cold start on a constrained machine: download is roughly 4x smaller and there is no on-the-fly quantization step. Output quality is identical for QLoRA workflows.
Pick one and pass it as model_name. Do not mix: a 4-bit Unsloth repo with load_in_4bit=False is not a valid configuration.
```python
model, tokenizer = FastLanguageModel.from_pretrained(
    # Option A: official upstream, quantized on the fly
    # model_name="Qwen/Qwen3.5-9B-Instruct",
    # Option B: Unsloth pre-quantized (recommended for QLoRA)
    model_name="unsloth/Qwen3.5-9B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                                   # Rank: 32 is a strong default
    target_modules=[                        # Target all linear layers
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=32,                          # Alpha = rank for scaling factor of 1
    lora_dropout=0.05,                      # Light dropout for regularization
    bias="none",                            # Do not train bias terms
    use_gradient_checkpointing="unsloth",   # Memory-efficient checkpointing
    random_state=42,
)
```
After applying LoRA, the model will report the number of trainable parameters. For Qwen 3.5 9B with rank 32 targeting all linear layers, expect roughly 90-110 million trainable parameters out of 9 billion total, about 1% of the model.
Step 3: Dataset Preparation
The dataset format depends on your task. For instruction tuning, the standard approach is to format examples in the model's chat template.
For this tutorial we will use a concrete, runnable target: specialize the base model on grade-school math reasoning, then verify the improvement on GSM8K. This gives you a hard number at the end (pass rate before vs. after), not just a vibe check. The training data is `meta-math/MetaMathQA` (Yu et al., 2023), a 395K-example dataset of math problems with step-by-step solutions. We will use a 10K subset to keep the demo tractable on a single consumer GPU.
```python
from datasets import load_dataset

# Concrete follow-along dataset: MetaMathQA, 10K subset for a single-GPU demo
dataset = load_dataset("meta-math/MetaMathQA", split="train[:10000]")

# MetaMathQA columns: type, query, original_question, response
# Map to a chat-formatted SFT example.
MATH_SYSTEM_PROMPT = (
    "You are a careful math tutor. "
    "Solve each problem step by step and end with the final answer on its own line."
)

def format_example(example):
    """Convert a MetaMathQA row into the model's chat format."""
    messages = [
        {"role": "system", "content": MATH_SYSTEM_PROMPT},
        {"role": "user", "content": example["query"]},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_example, remove_columns=dataset.column_names)
```
If you are bringing your own data instead, swap the `load_dataset(...)` line for a JSONL load and rename the fields:
```python
# BYO data: JSONL with instruction / response fields
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
# Then in format_example, use example["instruction"] and example["response"].
```
Dataset quality matters far more than quantity. A carefully curated dataset of 1,000-5,000 high-quality examples will typically outperform a noisy dataset of 100,000 examples. Focus on:
- Consistency: Every example should demonstrate exactly the behavior you want
- Diversity: Cover the full range of inputs the model will encounter
- Quality: Each response should be the ideal output. The model will learn to mimic your data exactly.
- Deduplication: Remove near-duplicates, which cause overfitting on repeated patterns
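For the deduplication point, a cheap first pass is to fingerprint a normalized form of each example. The sketch below only catches copies that differ in case or whitespace; true near-duplicate detection (for example MinHash or embedding similarity) needs more machinery. The function names and the `text` field are assumptions for illustration.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and strip all whitespace so trivially different copies collide."""
    return re.sub(r"\s+", "", text.lower())


def deduplicate(examples, key="text"):
    """Keep the first example seen for each normalized-text fingerprint."""
    seen, kept = set(), []
    for ex in examples:
        fingerprint = hashlib.sha256(normalize(ex[key]).encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(ex)
    return kept


rows = [
    {"text": "What is 2 + 2?"},
    {"text": "what is  2 + 2 ?"},   # same question, different case and spacing
    {"text": "What is 2 - 2?"},     # different question, must survive
]
unique_rows = deduplicate(rows)
print(len(unique_rows))  # 2
```

Punctuation and operators are deliberately left intact, since in math data a single symbol can change the problem.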
One honest caveat about this specific demo: training on MetaMathQA narrows the model toward math reasoning. You should see GSM8K go up; you may see general-conversation quality drift slightly. That tradeoff is the whole point of specialization. For a general-purpose assistant tune, you would pick a broader dataset like HuggingFaceH4/no_robots and accept that the evaluation story becomes softer (LLM-as-judge on a held-out split rather than a benchmark number).
Step 4: Training Configuration
TRL 0.13 consolidated the SFT configuration: SFT-specific kwargs (max_seq_length, dataset_text_field, packing) moved into SFTConfig, and tokenizer was replaced by processing_class. The example below uses the current API.
```python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        # Output
        output_dir="./output",
        # SFT-specific
        max_seq_length=max_seq_length,
        dataset_text_field="text",
        packing=True,                    # Pack short examples for GPU utilization
        # Training duration
        num_train_epochs=3,              # 2-4 epochs for most tasks
        # Batch size (effective = per_device * gradient_accumulation)
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,   # Effective batch size: 16
        # Learning rate
        learning_rate=2e-4,              # Higher than full fine-tuning
        lr_scheduler_type="cosine",      # Cosine decay works well
        warmup_ratio=0.05,               # 5% warmup
        # Precision
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        # Logging
        logging_steps=10,
        # Saving
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,              # Keep only 3 checkpoints
        # Optimization
        optim="adamw_8bit",              # 8-bit Adam saves memory
        weight_decay=0.01,
        max_grad_norm=1.0,
        # Seed
        seed=42,
    ),
)

# Start training
trainer.train()
```
Key configuration decisions:
- Batch size: Start with 4 per device and increase gradient accumulation to reach an effective batch size of 16-32. If you hit OOM errors, reduce per-device batch size and increase accumulation steps.
- Epochs: 2-4 epochs is typical. Monitor validation loss. If it starts climbing while training loss drops, you are overfitting.
- Packing: When enabled, multiple short examples are concatenated into a single sequence up to the max length. This dramatically improves GPU utilization when your examples are much shorter than the max sequence length.
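It helps to know how many optimizer steps a configuration implies before launching it. The sketch below is an approximation that assumes one example per sequence (packing reduces the number of sequences, so real step counts will be lower) and a linear-warmup-then-cosine schedule decaying to zero. The helper names are illustrative, not part of TRL.

```python
import math


def training_steps(n_examples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Effective batch size and optimizer steps, assuming no packing."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


def cosine_lr(step, total_steps, peak_lr, warmup_ratio):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))


batch, total = training_steps(10_000, per_device_batch=4, grad_accum=4, epochs=3)
print(batch, total)  # 16 1875
for step in (0, 47, 94, 1000, total):
    print(step, f"{cosine_lr(step, total, 2e-4, 0.05):.2e}")
```

With `save_steps=100`, a run of this length would write about 18 checkpoints, of which `save_total_limit=3` keeps the most recent three.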
Step 5: Evaluation
Training loss alone does not tell you if the model is good. You need task-specific evaluation.
```python
# Save the fine-tuned adapter
model.save_pretrained("./fine-tuned-adapter")
tokenizer.save_pretrained("./fine-tuned-adapter")

# For inference, switch to optimized mode
FastLanguageModel.for_inference(model)

# Test with representative examples
test_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Your test prompt here"},
]
inputs = tokenizer.apply_chat_template(
    test_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    do_sample=True,      # Sampling must be on for temperature / top_p to take effect
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
For systematic evaluation, create a test set of 50-200 examples that the model never saw during training. Evaluate on task-specific metrics: accuracy for classification, ROUGE or BERTScore for generation, exact match for structured outputs. Human evaluation remains the gold standard for open-ended generation quality.
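Exact match on a final numeric answer is simple enough to sketch by hand, which is useful for spot checks on your own held-out set. The scorer below takes the last number in each string as the answer; that extraction rule is an assumption for illustration (established benchmark harnesses use their own, more careful rules), and the `#### answer` reference format follows the GSM8K convention.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text):
    """Return the last number in a string as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def exact_match(predictions, references):
    """Fraction of predictions whose final number equals the reference's final number."""
    hits = 0
    for pred, ref in zip(predictions, references):
        p, r = final_number(pred), final_number(ref)
        hits += p is not None and p == r
    return hits / len(references)


preds = ["She has 3 + 4 = 7 apples.\n7", "The total is 1,250 dollars.", "I am not sure."]
refs = ["#### 7", "#### 1250", "#### 12"]
score = exact_match(preds, refs)
print(f"{score:.3f}")  # 0.667
```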
For the MetaMathQA demo, the natural benchmark is GSM8K. Run it with lm-evaluation-harness, which handles few-shot prompting, answer extraction, and exact-match scoring:
```shell
pip install lm-eval

# Baseline: the un-adapted base model
lm_eval --model hf \
  --model_args pretrained=unsloth/Qwen3.5-9B-Instruct-bnb-4bit,load_in_4bit=True \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 8

# Fine-tuned: same base + your LoRA adapter
lm_eval --model hf \
  --model_args pretrained=unsloth/Qwen3.5-9B-Instruct-bnb-4bit,peft=./fine-tuned-adapter,load_in_4bit=True \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 8
```
Compare the two exact_match scores. The honest expectation: when starting from a strong instruction-tuned base like Qwen 3.5 9B Instruct (which has already seen heavy math during its own SFT), expect a modest improvement of 2-5 percentage points from a 10K MetaMathQA tune. When starting from a non-instruct base model (e.g. `Qwen/Qwen3.5-9B`), the gap is much larger, often 20+ points, because there is more headroom. If you want a more dramatic before/after for the writeup, swap the base to the non-instruct variant and re-run.
Step 6: Export and Deployment
LoRA adapters are small. For dense models in the 2-13B parameter range at rank 32 targeting all linear layers, expect roughly 100-300 MB at BF16. Larger dense models scale roughly linearly: a 70B dense model at rank 32 lands around 800-900 MB. MoE adapters depend on whether you target expert weights or only the routing and shared layers; targeting all experts can grow the adapter by one to two orders of magnitude. You have several deployment options:
```python
# Option 1: Save adapter separately (recommended for multi-adapter serving)
model.save_pretrained("./lora-adapter")

# Option 2: Merge adapter into base model for single-model deployment
# This creates a full model with the LoRA weights baked in
model.save_pretrained_merged(
    "./merged-model",
    tokenizer,
    save_method="merged_16bit",     # Or "merged_4bit" for quantized export
)

# Option 3: Export to GGUF for llama.cpp / Ollama deployment
model.save_pretrained_gguf(
    "./gguf-model",
    tokenizer,
    quantization_method="q4_k_m",   # Good balance of quality and size
)
```
For production serving, modern inference engines (vLLM and SGLang) load LoRA adapters dynamically and serve many adapters concurrently on top of a single base model in GPU memory. This means you can ship one base-model deployment and route per-tenant or per-task requests to different adapters at request time, instead of holding a full merged model per variant. See LLM Inference Optimization for the broader serving picture.
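The adapter sizes quoted at the start of this step follow from the per-matrix parameter count: rank times the sum of input and output dimensions. A rough sketch for a 70B dense model, assuming Llama-2-70B-style dimensions (hidden size 8192, MLP size 28672, 80 layers, 8 key/value heads of size 128); other 70B architectures will differ somewhat, and the helper name is illustrative.

```python
def lora_adapter_params(shapes, r, n_layers):
    """Total LoRA parameters: r * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


# Assumed Llama-2-70B-style projection shapes as (d_in, d_out)
shapes_70b = {
    "q_proj": (8192, 8192), "k_proj": (8192, 1024),
    "v_proj": (8192, 1024), "o_proj": (8192, 8192),
    "gate_proj": (8192, 28672), "up_proj": (8192, 28672), "down_proj": (28672, 8192),
}
params = lora_adapter_params(shapes_70b, r=32, n_layers=80)
size_mb = params * 2 / 1e6   # 2 bytes per parameter at BF16
print(f"{params / 1e6:.0f}M parameters, about {size_mb:.0f} MB")
# 414M parameters, about 828 MB
```

The same function shows why targeting every expert in an MoE model is expensive: the MLP entries are multiplied by the expert count.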
Best Practices
These guidelines come from extensive community experience and ablation studies.
Target all linear layers. The original LoRA paper only adapted attention projections, but subsequent work consistently shows that including MLP layers (gate_proj, up_proj, down_proj in modern decoder architectures) improves results. The parameter cost is modest and the gains are real.
Use rank 16-64 for most tasks. Rank 16 is sufficient for many fine-tuning scenarios. Rank 32 is a safe default that works well across tasks. Only go above 64 if you have evidence that lower ranks underperform on your specific task.
Set alpha equal to rank. This gives a scaling factor of 1 and is the simplest starting point. If you find the model is adapting too aggressively (losing general capabilities), reduce alpha. If adaptation is too weak, increase it.
Learning rate between 1e-4 and 3e-4. Start with 2e-4 and adjust based on training dynamics. If loss drops too slowly, increase. If loss is unstable or spikes, decrease.
Use cosine learning rate scheduling with warmup. A 3-5% warmup period followed by cosine decay is robust across tasks and models.
Enable gradient checkpointing. This trades a modest amount of compute for significant memory savings, allowing you to use larger batch sizes or sequence lengths.
Monitor for overfitting aggressively. LoRA fine-tuning can overfit quickly, especially on small datasets. Use a validation set. If validation loss increases for more than one epoch while training loss continues to drop, stop training and use an earlier checkpoint.
Invest in data quality over quantity. A curated dataset of 1,000-5,000 examples consistently outperforms a noisy dataset 10-100x larger. Every example in your training set is a lesson; make each one count.
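The overfitting guideline above amounts to a patience rule on validation loss. A minimal sketch, assuming you record one validation loss per evaluation; the function name and the example losses are hypothetical:

```python
def best_checkpoint(val_losses, patience=1):
    """Return (index of best evaluation, index at which to stop training).

    Stops once validation loss has failed to improve for more than `patience`
    consecutive evaluations.
    """
    best_idx, bad = 0, 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx, bad = i, 0
        elif i > 0:
            bad += 1
            if bad > patience:
                return best_idx, i
    return best_idx, len(val_losses) - 1


# Hypothetical per-epoch validation losses: improvement stalls after the third epoch
best, stop = best_checkpoint([1.20, 0.95, 0.91, 0.94, 0.99, 1.05])
print(best, stop)  # 2 4
```

For this to be usable in practice, checkpoints must be saved at least as often as evaluations run, so that the best evaluation has a matching checkpoint to roll back to.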
Model Merging After Fine-Tuning
Once you have fine-tuned adapters, model merging opens up powerful possibilities. You can combine multiple LoRA adapters or merge fine-tuned models to create a model that inherits capabilities from several specializations.
Key Merging Methods
Linear merging averages the weights of two or more models with configurable weights. Simple but effective when the models are not too different.
SLERP (Spherical Linear Interpolation) interpolates between two models along the curved surface of the weight manifold rather than in a straight line. This often produces better results than linear merging because it better preserves the geometry of the weight space. Limited to merging exactly two models.
TIES (Trim, Elect Sign, and Merge) addresses the problem of interference between task vectors. It trims small-magnitude parameters, resolves sign conflicts by electing, for each parameter, the sign with the larger total magnitude across models, and then merges only the values that agree with the elected sign. TIES is particularly effective when combining models fine-tuned on different tasks that might have conflicting parameter updates.
DARE (Drop And REscale) randomly drops a fraction of delta parameters (the difference from the base model) and rescales the remaining ones. This acts as a sparsification step that reduces interference between merged models. DARE is often combined with TIES for best results.
Model merging is a deep topic with active research. For practical merging, tools like mergekit provide a straightforward YAML-based workflow for combining models using any of these methods.
Common Pitfalls and Debugging
Training Loss Does Not Decrease
- Learning rate too low. Try increasing to 3e-4 or even 5e-4.
- Rank too low. The adapter may not have enough capacity. Increase r.
- Data formatting error. The most common issue. Print a few formatted examples and verify they look correct. Check that the loss mask is applied only to the assistant's response, not the entire sequence.
- Wrong chat template. Each model family has its own chat template. Using the wrong one produces garbled training data. Always use `tokenizer.apply_chat_template()` rather than manual formatting.
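The loss-mask check from the list above can be done by hand on one example. The sketch below assumes the common convention of setting prompt labels to -100, the default `ignore_index` of PyTorch's cross-entropy loss. The token ids and the helper name `mask_prompt_labels` are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels, masking the prompt so only the response is trained on."""
    labels = list(input_ids)
    n_masked = min(prompt_length, len(labels))
    labels[:n_masked] = [IGNORE_INDEX] * n_masked
    return labels

prompt_ids = [101, 7592, 2129]    # hypothetical ids for the system + user turns
response_ids = [2023, 2003, 102]  # hypothetical assistant ids, ending in an EOS id
input_ids = prompt_ids + response_ids
labels = mask_prompt_labels(input_ids, len(prompt_ids))

# Tokens that actually contribute to the loss; decode these to inspect them.
trained = [tok for tok, lab in zip(input_ids, labels) if lab != IGNORE_INDEX]
```

If `trained` contains prompt tokens, or is missing the final EOS id, the formatting or masking step is the first place to look.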
Model Outputs Garbled Text After Fine-Tuning
- Overfitting. Training too long on too little data. Reduce epochs or use an earlier checkpoint.
- Corrupted dataset. Check for encoding issues, extremely long examples that get truncated mid-token, or HTML/markdown artifacts in training text.
- EOS token issues. Ensure the end-of-sequence token is properly included in training examples. Without it, the model never learns when to stop generating.
GPU Out of Memory
- Reduce `per_device_train_batch_size` to 1 or 2 and increase `gradient_accumulation_steps` to compensate.
- Enable gradient checkpointing (`use_gradient_checkpointing=True`).
- Reduce `max_seq_length`. With standard attention, memory in the attention layers grows quadratically with sequence length, so sequences over 2048 tokens become expensive quickly.
- Ensure you are loading the base model in 4-bit (`load_in_4bit=True`).
- Close other GPU processes. Run `nvidia-smi` to check.
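The batch-size advice above relies on simple arithmetic: the effective batch size is the per-device batch size times the number of accumulation steps times the number of devices. The helper below is a hypothetical convenience for that calculation, not part of any library.

```python
def accumulation_steps(target_effective_batch, per_device_batch, num_devices=1):
    """Gradient accumulation steps needed to reach a target effective batch size."""
    per_step = per_device_batch * num_devices
    return -(-target_effective_batch // per_step)  # ceiling division

# Dropping the per-device batch from 8 to 2 on one GPU, keeping an effective batch of 16.
steps_before = accumulation_steps(16, per_device_batch=8)
steps_after = accumulation_steps(16, per_device_batch=2)
```

The optimizer sees the same number of examples per update in both cases. Only the peak activation memory per forward pass changes.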
Fine-Tuned Model Loses General Capabilities
- Rank or alpha too high. The adaptation is overpowering the pre-trained weights. Reduce alpha or rank.
- Too many epochs. Catastrophic forgetting can occur even with LoRA if you train too aggressively. Two to three epochs is usually sufficient.
- Dataset too narrow. If all your examples are one type of task, the model may "forget" how to do other things. Consider mixing in a small percentage of general instruction-following data.
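Mixing in general data, as suggested in the last bullet, can be sketched as follows. The 10% share, the helper name `mix_datasets`, and the placeholder examples are assumptions. The right proportion depends on the task and should be validated empirically.

```python
import random

def mix_datasets(task_examples, general_examples, general_fraction=0.1, seed=0):
    """Blend general examples into a task dataset so they form `general_fraction` of the mix."""
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed

task_data = [{"source": "task", "id": i} for i in range(900)]
general_data = [{"source": "general", "id": i} for i in range(5000)]
mixed = mix_datasets(task_data, general_data, general_fraction=0.1)
n_general_in_mix = sum(1 for ex in mixed if ex["source"] == "general")
```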
Adapter Not Loading or Producing Wrong Outputs
- Base model mismatch. The adapter must be loaded onto the exact same base model it was trained on. Different quantizations of the same model can also cause issues.
- Missing tokenizer changes. If you added special tokens during training, those must be present when loading the adapter for inference.
- PEFT version mismatch. Save the version of `peft` used during training and use the same version for inference.
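One lightweight way to record versions is to capture them with the standard library when training finishes. This is a sketch. The package list is an assumption about a typical LoRA stack, and packages that are not installed are recorded as `None`.

```python
import json
from importlib import metadata

def library_versions(packages=("peft", "transformers", "trl", "torch")):
    """Return installed versions for the given packages (None if not installed)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions

versions = library_versions()
manifest = json.dumps(versions, indent=2)
```

Writing `manifest` to a file next to the saved adapter makes it possible to recreate the same environment at inference time.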
Key Takeaways
LoRA makes fine-tuning accessible. By learning low-rank updates to frozen weights, LoRA reduces trainable parameters by 100x+ and memory requirements by 4-8x, bringing fine-tuning to consumer hardware.
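The parameter reduction follows from the shapes involved: a full update to a d_out x d_in weight matrix has d_out * d_in entries, while a rank-r LoRA update stores two factors with r * (d_in + d_out) entries in total. The worked example below assumes a 4096 x 4096 projection and rank 16, both chosen for illustration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare trainable parameters for a full update versus a rank-`rank` LoRA update."""
    full = d_out * d_in            # every entry of the weight matrix
    lora = rank * (d_in + d_out)   # factor A (rank x d_in) plus factor B (d_out x rank)
    return full, lora, full / lora

full, lora, ratio = lora_param_counts(4096, 4096, rank=16)
```

For this single matrix the adapter trains 131,072 parameters instead of 16,777,216, a 128x reduction. The overall ratio for a model depends on which layers are targeted.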
QLoRA pushes the boundary further. Combining 4-bit quantization of the base model with LoRA adapters, QLoRA enables fine-tuning 7B models on a 16 GB GPU and 70B models on a single 48 GB GPU with minimal quality loss.
Hyperparameters have sensible defaults. Rank 16-32, alpha equal to rank, learning rate 2e-4, all linear layers targeted. Start there and adjust based on your validation metrics.
Data quality dominates. No hyperparameter tuning or architectural innovation will compensate for a poor training dataset. Invest your time in curating, cleaning, and validating your data before optimizing your training configuration.
The ecosystem is maturing. Tools like Unsloth, PEFT, and TRL have made LoRA fine-tuning a production-ready workflow. DoRA and AdaLoRA push performance further. Model merging with TIES and DARE enables combining specializations without retraining.
Fine-tuning is one piece of the puzzle. The most effective AI systems combine fine-tuned models with RAG for knowledge, prompt engineering for task framing, and inference optimization for serving. LoRA makes the fine-tuning piece fast, cheap, and practical, which is why it has become the default approach to model customization.
The barrier to fine-tuning your own model has never been lower. A single GPU, a curated dataset, and the techniques in this tutorial are all you need to turn a general-purpose model into a specialized tool for your domain.