LoRA (Low-Rank Adaptation)
Definition
Low-Rank Adaptation (LoRA), introduced by Hu et al. in 2021, addresses the prohibitive cost of fully fine-tuning large language models by exploiting a key insight: the weight updates learned during fine-tuning have low intrinsic rank. Instead of updating all billions of model parameters, LoRA freezes the pre-trained weights and injects pairs of small trainable matrices (A and B) into each targeted transformer layer. For a weight matrix W (d × d), LoRA replaces W with W + AB, where A is d × r and B is r × d, with r ≪ d (ranks of 4–64 are typical). Only A and B are trained; W stays frozen. The number of trainable parameters per matrix is reduced from d² to 2dr: for a 70B model with d = 8192 and r = 8, from ~67M to ~131K parameters per layer, a reduction of over 99%.
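The parameter arithmetic above can be checked in a few lines. This is a plain-Python sketch using the dimensions quoted in the definition (d = 8192, r = 8); `lora_param_counts` is a hypothetical helper, not a library function:

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Trainable parameters for one d×d weight matrix:
    full fine-tuning updates all d*d entries; LoRA trains
    only A (d×r) plus B (r×d), i.e. 2*d*r entries."""
    full = d * d
    lora = 2 * d * r
    return full, lora

full, lora = lora_param_counts(d=8192, r=8)
print(full)   # 67108864  (~67M per layer)
print(lora)   # 131072    (~131K per layer)
print(f"{1 - lora / full:.2%} fewer trainable parameters")  # 99.80% fewer trainable parameters
```

Doubling the rank doubles the adapter size but leaves it tiny relative to the frozen weights, which is why rank can be tuned generously.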
Why It Matters
LoRA makes fine-tuning large models accessible to teams without massive GPU clusters. Fully fine-tuning a 70B model requires 8+ A100 80GB GPUs and weeks of training; LoRA fine-tuning the same model requires 1–2 A100s and hours to days. This democratization of fine-tuning is why the open-source LLM community produces hundreds of specialized LoRA adapters monthly. For 99helpers customers, LoRA enables domain-specific fine-tuning of capable open-source models (Llama-3, Mistral) to match or exceed frontier API quality for their specific use case at a fraction of ongoing API cost.
How It Works
LoRA can be applied with the Hugging Face PEFT library:

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                     # rank of the update matrices
    lora_alpha=32,            # scaling factor, typically 2r
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # ~0.1% of parameters are trainable
```

Key hyperparameters: r (rank, 4–64; higher is more expressive but adds parameters), lora_alpha (scaling factor, typically 2r), and target_modules (which layers to adapt, typically all attention projections). After training, LoRA weights can be merged back into the base model for zero-overhead inference, or kept separate for easy switching between adapters.
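The lora_alpha hyperparameter deserves a note: PEFT applies the low-rank update scaled by lora_alpha / r, so the common choice alpha = 2r doubles the learned update's magnitude. A toy plain-Python illustration with a d = 2, r = 1 adapter (the matrix values here are made up for illustration, not taken from any real model):

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_delta(A, B, r, lora_alpha):
    """Effective LoRA update: (lora_alpha / r) * (A @ B)."""
    scale = lora_alpha / r
    return [[scale * v for v in row] for row in matmul(A, B)]

A = [[1.0], [2.0]]    # d×r  (toy values)
B = [[3.0, 4.0]]      # r×d  (toy values)

# With alpha = 2r, the raw product A·B = [[3, 4], [6, 8]] is doubled:
delta = lora_delta(A, B, r=1, lora_alpha=2)
print(delta)  # [[6.0, 8.0], [12.0, 16.0]]
```

Because the scale is alpha / r, raising the rank while holding alpha fixed quietly shrinks the update; keeping alpha proportional to r (e.g. alpha = 2r) keeps the update magnitude comparable across ranks.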
[Diagram] LoRA — Low-Rank Adaptation: frozen base-model weights plus small trainable ΔW matrices. Each adapted layer learns ΔW = A · B, with A: d×r, B: r×d, and rank r ≪ d. Typical figures: trainable parameters are 0.1–1% of the base model's size; VRAM use drops roughly 10× versus full fine-tuning; the rank r (4–64) controls adapter capacity.
At inference time, ΔW is merged into W (W' = W + ΔW) — no extra latency compared to the base model. Multiple LoRA adapters can be swapped at runtime to serve different fine-tuned behaviors.
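The merge step is plain linear algebra: because ΔW = A · B, adding it into W once yields a matrix W' whose forward pass exactly matches the unmerged path Wx + A(Bx). A small stdlib-Python sketch on a toy 2×2 weight (all values illustrative only):

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy identity)
A = [[1.0], [2.0]]             # d×r adapter half, r = 1
B = [[0.5, 0.5]]               # r×d adapter half

# Merge once, offline: W' = W + A·B
delta = matmul(A, B)
W_merged = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, delta)]

x = [1.0, 2.0]
merged_out = matvec(W_merged, x)                  # one matmul at inference
unmerged_out = [wv + av for wv, av in
                zip(matvec(W, x), matvec(A, matvec(B, x)))]  # extra adapter path
print(merged_out == unmerged_out)  # True
```

In PEFT, this merge is what `merge_and_unload()` performs on a trained adapter; keeping adapters unmerged trades a little per-query compute for the ability to hot-swap behaviors over one shared base model.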
Real-World Example
A 99helpers team fine-tunes Llama-3-8B on their support conversation dataset (10,000 (query, response) pairs). Full fine-tuning would require 4 A100 80GB GPUs for 20 hours. With LoRA (r=16, targeting all attention layers), they fine-tune on a single RTX 4090 (24GB VRAM) in 3 hours. Trainable parameters: 8.4M out of 8B total (0.1%). The resulting LoRA adapter file is 67MB—the base model remains unchanged and shared across multiple adapters for different customer domains. Quality on their support benchmark: 87% accuracy, versus 72% for the base model and 84% for full fine-tuning with far more resources.
Common Mistakes
- ✕ Choosing rank r too low (r=1–2) to save compute: very low ranks limit the adapter's expressiveness and can produce poor quality on complex tasks.
- ✕ Not targeting all attention projection layers (q, k, v, o) when adapting language: targeting only q and v (a common default) is often suboptimal for instruction tuning.
- ✕ Forgetting to merge LoRA weights before deploying for inference: unmerged LoRA adds a small computational overhead at every forward pass that compounds over many queries.
Related Terms
Parameter-Efficient Fine-Tuning (PEFT)
PEFT encompasses techniques like LoRA, prefix tuning, and adapters that fine-tune only a small fraction of LLM parameters, achieving comparable quality to full fine-tuning at dramatically reduced compute and memory cost.
QLoRA
QLoRA (Quantized Low-Rank Adaptation) combines 4-bit model quantization with LoRA fine-tuning, enabling fine-tuning of large LLMs on consumer-grade hardware by dramatically reducing memory requirements.
Fine-Tuning
Fine-tuning adapts a pre-trained LLM to a specific task or domain by continuing training on a smaller, curated dataset, improving performance on targeted use cases while preserving general language capabilities.
Model Quantization
Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
Model Distillation
Model distillation trains a smaller 'student' model to mimic a larger 'teacher' model's outputs, producing a compact model that approximates the teacher's capabilities at a fraction of the compute cost.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →