Langtrain

LoRA & QLoRA Fine-tuning

Master Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) techniques for parameter-efficient fine-tuning. Achieve professional results with up to 90% fewer trainable parameters and dramatically reduced memory usage.

Key Features

Parameter Efficient

Reduce trainable parameters by up to 90% while maintaining model performance through low-rank matrix decomposition.

🧠 Memory Optimized

QLoRA enables fine-tuning of 65B models on a single 48GB GPU with 4-bit quantization and optimized memory management.

🚀 Fast Training

Achieve 3-5x faster training compared to full fine-tuning while using significantly fewer computational resources.

🔧 Production Ready

Easy deployment and integration with existing model serving infrastructure. Merge adapters or use them independently.

What is LoRA?

**Low-Rank Adaptation (LoRA)** is a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, significantly reducing the number of trainable parameters for downstream tasks.

For a pre-trained weight matrix **W₀** of size d×k, LoRA represents the weight update **ΔW** as the product of two low-rank matrices **B** (d×r) and **A** (r×k), with rank r ≪ min(d, k): **ΔW = BA**. During training, **W₀** remains frozen and only **A** and **B** are updated. The modified forward pass becomes:

**h = W₀x + ΔWx = W₀x + BAx**
Code Example
# Basic LoRA implementation concept
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1):
        super().__init__()
        self.rank = rank
        self.alpha = alpha

        # Frozen pre-trained weights (randomly initialized here for illustration)
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.weight.requires_grad = False

        # LoRA matrices: A initialized with Kaiming-uniform, B with zeros,
        # so ΔW = BA starts at zero and training begins exactly from W₀
        self.lora_A = nn.Parameter(torch.empty(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x):
        # Original forward pass + scaled LoRA adaptation
        result = F.linear(x, self.weight)
        lora_result = F.linear(F.linear(x, self.lora_A), self.lora_B)
        return result + (self.alpha / self.rank) * lora_result
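
As a quick sanity check, you can count frozen versus trainable parameters in the `LoRALinear` class above to see where the "up to 90% fewer trainable parameters" figure comes from; a minimal sketch for a single hypothetical 4096×4096 projection is shown below (for realistic layer shapes the reduction is typically far larger, and the exact ratio depends on layer sizes and the chosen rank).

# Illustrative parameter count for one 4096x4096 projection with rank 8.
# Numbers apply to this toy layer only; real savings depend on the model
# and on which modules are adapted.
layer = LoRALinear(in_features=4096, out_features=4096, rank=8)

frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)

print(f"frozen:    {frozen:,}")       # 16,777,216 (the original weight)
print(f"trainable: {trainable:,}")    # 65,536 (A and B together)
print(f"ratio:     {trainable / frozen:.4f}")  # ~0.004, i.e. <1% of the frozen weight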

QLoRA: Quantized LoRA

**QLoRA (Quantized LoRA)** extends LoRA by quantizing the pre-trained model to **4-bit precision** while keeping the LoRA adapters in **16-bit**. This enables fine-tuning of massive models on consumer hardware.

Key innovations in QLoRA:

- **4-bit NormalFloat (NF4)**: Information-theoretically optimal quantization for normally distributed weights
- **Double Quantization**: Quantize the quantization constants to save additional memory
- **Paged Optimizers**: Handle memory spikes during gradient computation

QLoRA makes it possible to fine-tune a **65B parameter model on a single 48GB GPU**, democratizing access to large language model customization.
Code Example
# QLoRA fine-tuning with LangTrain
import torch

from langtrain import QLoRATrainer
from langtrain.models import AutoModelForCausalLM
from langtrain.datasets import load_dataset
from transformers import AutoTokenizer

# Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/DialoGPT-large",
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-based tokenizers define no pad token by default

# Configure QLoRA parameters
qlora_config = {
    "r": 64,  # Rank
    "lora_alpha": 16,  # Scaling parameter
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "lora_dropout": 0.1,
    "bias": "none",
    "task_type": "CAUSAL_LM"
}

# Load and prepare dataset
dataset = load_dataset("your_dataset.jsonl")
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, padding=True))

# Initialize trainer
trainer = QLoRATrainer(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    qlora_config=qlora_config,
    output_dir="./qlora_results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    save_steps=500,
    logging_steps=10,
)

# Start training
trainer.train()
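
To see why a 48GB card is enough for a 65B-parameter model, a rough weight-memory estimate helps. The sketch below is back-of-the-envelope arithmetic only, with an assumed adapter size; it ignores activations, adapter gradients, and optimizer state, which QLoRA's paged optimizers help keep in check.

# Rough weight-memory estimate for QLoRA on a 65B-parameter model.
# Illustrative arithmetic only -- real usage also includes activations,
# adapter gradients, and framework overhead.
base_params = 65e9        # frozen base model parameters
nf4_bits = 4              # 4-bit NormalFloat storage for the base weights
adapter_params = 0.8e9    # hypothetical adapter size (rank dependent)
bf16_bits = 16            # adapters kept in 16-bit

base_gb = base_params * nf4_bits / 8 / 1e9          # ~32.5 GB
adapter_gb = adapter_params * bf16_bits / 8 / 1e9   # ~1.6 GB

print(f"4-bit base weights:   {base_gb:.1f} GB")
print(f"16-bit LoRA adapters: {adapter_gb:.1f} GB")
print(f"Total (weights only): {base_gb + adapter_gb:.1f} GB of the 48 GB budget")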

Advanced Configuration

Fine-tune your LoRA/QLoRA setup with advanced parameters for optimal performance. The choice of **rank (r)**, **alpha**, and **target modules** significantly impacts model quality and training efficiency.

- **Rank Selection**: Higher ranks capture more information but increase the number of trainable parameters. Start with r=8-16 for most tasks; use r=64+ for complex domains (see the parameter-count sketch after the configuration example below).
- **Alpha Scaling**: Controls the magnitude of LoRA updates. Use alpha=2*r as a starting point, then adjust based on validation performance.
- **Target Modules**: Apply LoRA to the attention projections (q_proj, v_proj) for most tasks. Include the MLP layers for domain-specific knowledge.
Code Example
# Advanced LoRA configuration
advanced_config = {
    # Core LoRA parameters
    "r": 32,  # Rank - balance between efficiency and capacity
    "lora_alpha": 64,  # Scaling factor (typically 2*r)
    "lora_dropout": 0.05,  # Regularization
    
    # Target modules - customize based on model architecture
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj"      # MLP (for Llama-like models)
    ],
    
    # Advanced options
    "bias": "lora_only",  # Train bias in LoRA layers
    "modules_to_save": ["embed_tokens", "lm_head"],  # Additional modules
    "init_lora_weights": True,  # Proper initialization
    
    # QLoRA specific
    "load_in_4bit": True,
    "bnb_4bit_compute_dtype": torch.bfloat16,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}

# Training hyperparameters
training_args = {
    "output_dir": "./advanced_lora_results",
    "num_train_epochs": 5,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1e-4,
    "weight_decay": 0.01,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "cosine",
    "save_strategy": "steps",
    "save_steps": 250,
    "eval_strategy": "steps",
    "eval_steps": 250,
    "logging_steps": 10,
    "fp16": False,
    "bf16": True,  # Better numerical stability
    "dataloader_pin_memory": False,  # Memory optimization
    "remove_unused_columns": False,
}
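
To make the rank trade-off concrete, the sketch below counts LoRA's trainable parameters using the fact that each adapted d_out×d_in matrix contributes r·(d_in + d_out) parameters. The 4096-dimensional, 32-layer shapes are hypothetical Llama-7B-like values, not derived from the configuration above.

# Trainable-parameter count for LoRA: each adapted (d_out x d_in) matrix
# contributes r * (d_in + d_out) parameters (B is d_out x r, A is r x d_in).
def lora_trainable_params(shapes, r):
    return sum(r * (d_in + d_out) for (d_out, d_in) in shapes)

# Hypothetical shapes: four 4096 x 4096 attention projections across 32 layers
attention_shapes = [(4096, 4096)] * 4 * 32

for r in (8, 16, 32, 64):
    millions = lora_trainable_params(attention_shapes, r) / 1e6
    print(f"r={r:>2}: {millions:6.1f}M trainable parameters")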

Merging and Deployment

After training, you can **merge** LoRA adapters back into the base model for simplified deployment, or keep them separate for flexibility. Merged models have no inference overhead, while separate adapters allow easy swapping between different fine-tuned versions.

- **Merging benefits**: A single model file, no additional inference code, optimal for production.
- **Separate adapters**: Multiple task-specific adapters, easy A/B testing, smaller storage requirements.
Code Example
# Method 1: Merge LoRA adapters into base model
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model, tokenizer, and trained adapter
base_model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = PeftModel.from_pretrained(base_model, "./lora_results")

# Merge adapters into the base weights
merged_model = model.merge_and_unload()

# Save merged model and tokenizer together
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

# Method 2: Deploy with separate adapters
from langtrain import LoRAInference

# Initialize inference engine
inference = LoRAInference(
    base_model="microsoft/DialoGPT-large",
    adapter_path="./lora_results",
    device="cuda",
    torch_dtype=torch.float16
)

# Switch between different adapters dynamically  
inference.load_adapter("task_1", "./task1_lora")
inference.load_adapter("task_2", "./task2_lora")

# Generate with specific adapter
response = inference.generate(
    "Hello, how are you?",
    adapter_name="task_1",
    max_length=100,
    temperature=0.7
)

# Method 3: Batch inference with multiple adapters
responses = inference.batch_generate([
    {"text": "Explain quantum computing", "adapter": "task_1"},
    {"text": "Write a poem about AI", "adapter": "task_2"}
])

print(responses)
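
Under the hood, merging simply folds the adapter product into the frozen weight: W_merged = W₀ + (alpha/r)·BA. The following is a minimal sketch with plain tensors and illustrative shapes, showing why the merged model matches the base-plus-adapter output with no extra inference cost.

# What merge_and_unload() does conceptually for one adapted weight matrix
import torch

d_out, d_in, r, alpha = 256, 256, 8, 16   # illustrative shapes and hyperparameters

W0 = torch.randn(d_out, d_in)             # frozen pre-trained weight
A = torch.randn(r, d_in)                  # trained LoRA A
B = torch.randn(d_out, r)                 # trained LoRA B

W_merged = W0 + (alpha / r) * (B @ A)

# The merged weight reproduces base + adapter outputs at inference time
x = torch.randn(4, d_in)
assert torch.allclose(
    x @ W_merged.T,
    x @ W0.T + (alpha / r) * (x @ A.T @ B.T),
    atol=1e-3,
)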

Performance Optimization

Optimize your LoRA/QLoRA training for maximum performance and efficiency. Key strategies include **gradient checkpointing**, **mixed precision training**, and **optimal batch sizing**.

- **Memory optimization**: Use gradient checkpointing to trade compute for memory. Enable `gradient_checkpointing=True` for a 40-50% memory reduction.
- **Speed optimization**: Use `bf16` instead of `fp16` for numerical stability, and increase the effective batch size with gradient accumulation for better GPU utilization.
Code Example
# Production-optimized training configuration
from langtrain import OptimizedQLoRATrainer
import torch

# Memory-efficient configuration
optimizer_config = {
    # Optimizer settings
    "optimizer": "adamw_torch_fused",  # Faster fused optimizer
    "learning_rate": 2e-4,
    "weight_decay": 0.01,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    
    # Memory optimizations
    "gradient_checkpointing": True,
    "dataloader_pin_memory": False,
    "dataloader_num_workers": 4,
    "remove_unused_columns": False,
    
    # Performance optimizations
    "bf16": True,  # Better than fp16 for stability
    "tf32": True,  # Enable TensorFloat-32 on A100
    "ddp_find_unused_parameters": False,
    
    # Batch size optimization
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,  # Effective batch size = 16
    "max_grad_norm": 1.0,
}

# Initialize optimized trainer
trainer = OptimizedQLoRATrainer(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    **optimizer_config
)

# Monitor training metrics: derive perplexity from the causal LM loss
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    logits = torch.tensor(logits)
    labels = torch.tensor(labels)
    # Shift so each position predicts the next token; ignore padded positions (-100)
    loss = torch.nn.functional.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return {"perplexity": torch.exp(loss).item()}

trainer.compute_metrics = compute_metrics

# Start training; bf16 mixed precision is applied by the trainer configuration above,
# so no explicit autocast context is needed here
trainer.train()

# Profile memory usage
print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
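
If you wire these optimizations up manually on a Hugging Face model rather than through a trainer configuration, a minimal sketch (assuming a standard transformers causal LM) looks like this:

# Manually enabling the key memory/speed optimizations on a transformers model
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")

# Trade compute for memory: recompute activations during the backward pass
model.gradient_checkpointing_enable()

# Allow TensorFloat-32 matmuls on Ampere+ GPUs (A100 and newer) for extra speed
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Effective batch size = per-device batch x gradient accumulation steps (x num GPUs)
per_device_batch, accumulation_steps = 1, 16
print(f"Effective batch size: {per_device_batch * accumulation_steps}")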