Fine-tuning

Master LLM fine-tuning techniques including supervised fine-tuning (SFT), parameter-efficient methods, and instruction tuning for specialized tasks.

LLM Fine-tuning Overview

Fine-tuning adapts pre-trained Large Language Models (LLMs) to downstream tasks using task-specific datasets:
Core Fine-tuning Types:
  • Supervised Fine-Tuning (SFT): Train on labeled task data with next-token prediction
  • Instruction Tuning: Fine-tune on instruction-response pairs for better instruction following
  • Reinforcement Learning from Human Feedback (RLHF): Align model outputs with human preferences
  • Constitutional AI (CAI): Train models to follow constitutional principles
Transfer Learning Benefits:
  • Knowledge Transfer: Leverage pre-trained representations and world knowledge
  • Sample Efficiency: Achieve strong performance with limited task-specific data
  • Computational Efficiency: Faster convergence compared to training from scratch
  • Catastrophic Forgetting Mitigation: Careful fine-tuning (low learning rates, parameter-efficient methods) preserves general capabilities while learning new tasks

Parameter-Efficient Fine-Tuning (PEFT)

Modern PEFT methods enable efficient adaptation of large models while preserving pre-trained knowledge:
LoRA (Low-Rank Adaptation):
  • Decomposes the weight update for W ∈ R^(d×k) into low-rank matrices: ΔW = BA, where B ∈ R^(d×r) and A ∈ R^(r×k) (see the parameter-count sketch after this list)
  • Typical rank r = 8-64 for effective adaptation with <1% trainable parameters
  • Target modules: query, key, value projections in attention layers
  • Variants: AdaLoRA (adaptive rank), QLoRA (quantized LoRA)
Advanced PEFT Techniques:
  • Prefix Tuning: Learn continuous task-specific prefixes prepended to input sequences
  • P-Tuning v2: Deep prompt tuning with trainable prompts in every layer
  • Adapters: Insert small feed-forward networks between transformer layers
  • BitFit: Fine-tune only bias parameters while freezing weights
  • LN-Tuning: Update only layer normalization parameters
Multi-LoRA Fusion:
  • Task Arithmetic: Combine multiple LoRA adapters through weighted averaging
  • Mixture of LoRAs: Route inputs to different LoRA experts based on task classification
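
To make the LoRA savings concrete, the sketch below counts trainable parameters for a single weight matrix under the ΔW = BA decomposition. It is plain Python, independent of any training library; the dimensions (a 4096×4096 attention projection, rank r = 16) are illustrative only.

# Trainable-parameter count for LoRA on one weight matrix W ∈ R^(d×k).
# Illustrative dimensions: a 4096x4096 attention projection, rank r = 16.
d, k, r = 4096, 4096, 16

full_params = d * k             # fine-tuning W directly
lora_params = d * r + r * k     # B ∈ R^(d×r) plus A ∈ R^(r×k)

print(f"Full fine-tuning: {full_params:,} trainable parameters")
print(f"LoRA (r={r}):     {lora_params:,} trainable parameters")
print(f"Fraction trained: {lora_params / full_params:.2%}")  # ~0.78% for this matrix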

Fine-tuning Optimization Strategies

Critical hyperparameters and techniques for effective LLM fine-tuning:
Learning Rate Scheduling:
  • Warmup: Linear/cosine warmup for the first 5-10% of training steps (a minimal scheduler sketch follows this list)
  • Decay: Cosine annealing or linear decay to prevent overtraining
  • Differential Learning Rates: Lower rates for embeddings/early layers, higher for task heads
  • Typical Ranges: 1e-5 to 5e-4 for full fine-tuning, 1e-4 to 1e-3 for LoRA
Training Dynamics:
  • Gradient Clipping: Prevent exploding gradients with a max norm of 1.0
  • Mixed Precision: Use FP16/BF16 for memory efficiency and speed
  • Gradient Accumulation: Simulate larger batch sizes when memory-constrained
  • Checkpointing: Save model states to resume training and prevent data loss
Regularization Techniques:
  • Weight Decay: L2 regularization, typically 0.01-0.1
  • Dropout: Layer-wise dropout rates of 0.1-0.3
  • Label Smoothing: Prevent overconfident predictions with ε = 0.1
  • Early Stopping: Monitor validation loss with a patience of 2-5 epochs
Data Efficiency:
  • Few-Shot ICL: In-context learning with exemplars before fine-tuning
  • Data Augmentation: Paraphrasing, backtranslation, syntactic transformations
  • Active Learning: Select the most informative samples for annotation
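
The warmup-then-cosine-decay schedule above is straightforward to wire up by hand. The sketch below uses Hugging Face transformers' get_cosine_schedule_with_warmup with a plain AdamW optimizer; it assumes a causal LM `model` and a `train_dataloader` yielding batches with labels already exist, and the 10% warmup fraction and 1000-step budget are illustrative. In the full examples below, langtrain.TrainingArguments handles the same thing via warmup_steps and lr_scheduler_type.

import torch
from transformers import get_cosine_schedule_with_warmup

max_steps = 1000
warmup_steps = int(0.1 * max_steps)  # 10% linear warmup, then cosine decay

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=max_steps,
)

for step, batch in enumerate(train_dataloader):
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()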

Instruction Tuning & Alignment

Advanced techniques for creating helpful, harmless, and honest AI assistants:
Supervised Fine-Tuning (SFT):
  • Train on high-quality instruction-response pairs
  • Format: {"instruction": "...", "input": "...", "output": "..."}
  • Technique: Next-token prediction loss computed on response tokens only
  • Dataset examples: Alpaca, Vicuna, Dolly, OpenAssistant
Reinforcement Learning from Human Feedback (RLHF):
  • Phase 1: SFT on demonstration data
  • Phase 2: Train a reward model on human preference rankings
  • Phase 3: Optimize the policy with PPO/TRPO against the reward model
  • KL Divergence Penalty: Prevent the policy from drifting too far from the SFT model
Constitutional AI (CAI):
  • Self-critique and revision following constitutional principles
  • Reduce harmful outputs without human feedback
  • Iterative refinement: critique → revise → evaluate
Direct Preference Optimization (DPO):
  • Train directly on preference data without an explicit reward model
  • More stable than RLHF with a simpler implementation
  • Loss function: L_DPO = -E[log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) - β log(π_θ(y_l|x)/π_ref(y_l|x)))] (see the PyTorch sketch after this list)
Evaluation Metrics:
  • Helpfulness: Task completion, instruction-following accuracy
  • Harmlessness: Toxicity scores, bias detection, safety evaluations
  • Honesty: Factual accuracy, calibration, uncertainty quantification
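
To make the DPO objective concrete, the sketch below computes the loss from per-sequence log-probabilities of the chosen (y_w) and rejected (y_l) responses under the policy and a frozen reference model. It is a minimal PyTorch illustration of the formula above, not Langtrain's DPO implementation; dpo_loss is a hypothetical helper, and the log-prob values in the toy batch are illustrative.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probs (summed over response tokens only)."""
    # β * log(π_θ / π_ref) for the chosen and rejected responses
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # L = -E[log σ(chosen_rewards - rejected_rewards)]
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of three preference pairs
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -8.5, -20.1]),
    policy_rejected_logps=torch.tensor([-14.2, -9.0, -19.8]),
    ref_chosen_logps=torch.tensor([-13.1, -8.7, -20.5]),
    ref_rejected_logps=torch.tensor([-13.9, -8.8, -20.0]),
)
print(loss)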

Full Examples

Supervised Fine-Tuning (SFT)

import langtrain
from langtrain.trainers import SFTTrainer
from langtrain.data import InstructionDataset

# Load pre-trained LLM
model = langtrain.AutoModelForCausalLM.from_pretrained(
    "llama-2-7b-hf",
    torch_dtype="auto",
    device_map="auto"
)

tokenizer = langtrain.AutoTokenizer.from_pretrained("llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Prepare instruction dataset
dataset = InstructionDataset.from_json(
    "alpaca_data.json",
    instruction_template="### Instruction:\n{instruction}\n\n### Response:\n{output}",
    max_seq_length=512
)

# Configure SFT training
training_args = langtrain.TrainingArguments(
    output_dir="./sft-llama-2-7b",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    max_steps=1000,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    fp16=True,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    max_grad_norm=1.0
)

# Start supervised fine-tuning
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
    packing=True,  # Pack multiple samples per sequence
    dataset_text_field="text"
)

trainer.train()
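
After training, a quick sanity check is to prompt the model with the same template used during fine-tuning. The snippet below is a minimal sketch reusing the model and tokenizer from the example above through the standard generate/decode interface; the instruction text is only illustrative.

# Sanity check: prompt with the same template used during fine-tuning
prompt = "### Instruction:\nSummarize the benefits of parameter-efficient fine-tuning.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Strip the prompt tokens and decode only the generated response
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)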

LoRA Parameter-Efficient Fine-Tuning

import torch
import langtrain
from langtrain.trainers import SFTTrainer
from peft import LoraConfig, get_peft_model, TaskType

# Load base model
model = langtrain.AutoModelForCausalLM.from_pretrained(
    "mistral-7b-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Use FlashAttention for efficiency
)

# Configure LoRA with optimal settings
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,             # Rank - higher for complex tasks
    lora_alpha=128,   # Scaling factor (typically 2*r)
    lora_dropout=0.05,  # Low dropout for stability
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention projections
        "gate_proj", "up_proj", "down_proj"      # MLP projections
    ],
    bias="none",
    use_rslora=True,  # Rank-stabilized LoRA
    init_lora_weights="gaussian"
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Reports trainable vs. total parameter counts

# Training with LoRA-specific settings
# (tokenizer and dataset are prepared as in the SFT example above)
training_args = langtrain.TrainingArguments(
    output_dir="./lora-mistral-7b",
    learning_rate=3e-4,  # Higher LR for LoRA
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    max_steps=2000,
    warmup_steps=200,
    weight_decay=0.01,
    logging_steps=25,
    save_steps=500,
    bf16=True,
    dataloader_pin_memory=False
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args
)

trainer.train()
model.save_pretrained("./lora-adapters")
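
For inference, the saved adapters can be re-attached to the base model and optionally merged into the base weights (W + BA), so no PEFT wrapper is needed at serving time. This is a minimal sketch using PEFT's standard PeftModel.from_pretrained and merge_and_unload APIs; it reuses the imports from the block above, and the paths follow that example.

from peft import PeftModel

# Reload the base model, then attach the trained LoRA adapters
base_model = langtrain.AutoModelForCausalLM.from_pretrained(
    "mistral-7b-v0.1", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./lora-adapters")

# Optionally fold the low-rank updates into the base weights for adapter-free serving
merged = model.merge_and_unload()
merged.save_pretrained("./mistral-7b-lora-merged")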

QLoRA - Quantized LoRA Fine-tuning

import torch
import langtrain
from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from langtrain.trainers import SFTTrainer

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normal Float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # Nested quantization
)

# Load quantized model
model = langtrain.AutoModelForCausalLM.from_pretrained(
    "CodeLlama-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = langtrain.AutoTokenizer.from_pretrained("CodeLlama-13b-hf")

# Prepare model for k-bit training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# QLoRA configuration
qlora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, qlora_config)

# Training arguments optimized for QLoRA
training_args = langtrain.TrainingArguments(
    output_dir="./qlora-codellama-13b",
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    max_steps=1500,
    warmup_steps=150,
    bf16=True,
    logging_steps=10,
    optim="paged_adamw_32bit",      # Memory-efficient paged optimizer
    lr_scheduler_type="constant",
    max_grad_norm=0.3,
    group_by_length=True            # Batch sequences of similar length to reduce padding
)

# code_dataset: an InstructionDataset of code instruction-response pairs,
# prepared as in the SFT example above
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=code_dataset,
    args=training_args,
    max_seq_length=2048
)

trainer.train()

RLHF - Reinforcement Learning from Human Feedback

import langtrain
from langtrain.rlhf import RewardModel
from langtrain.data import PreferenceDataset
from trl import PPOTrainer, PPOConfig

# Step 1: Train reward model on preference data
preference_data = PreferenceDataset.from_json("human_preferences.json")
splits = preference_data.train_test_split(0.1)

tokenizer = langtrain.AutoTokenizer.from_pretrained("sft-model-checkpoint")

reward_model = RewardModel.from_pretrained(
    "sft-model-checkpoint",  # Start from the SFT model
    num_labels=1
)

reward_trainer = langtrain.RewardTrainer(
    model=reward_model,
    tokenizer=tokenizer,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    compute_metrics=lambda p: {"accuracy": (p.predictions > 0).sum() / len(p.predictions)}
)

reward_trainer.train()

# Step 2: PPO training with reward model
ppo_config = PPOConfig(
    model_name="sft-model-checkpoint",
    learning_rate=1.41e-5,
    log_with="wandb",
    mini_batch_size=64,
    batch_size=256,
    gradient_accumulation_steps=4,
    optimize_cuda_cache=True,
    early_stopping=True,
    target_kl=0.1,   # KL divergence constraint
    ppo_epochs=4,    # Inner PPO optimization epochs per batch
    max_grad_norm=1.0,
    use_score_scaling=True,
    use_score_norm=True
)

# Policy to optimize and a frozen reference copy of the SFT model
model = langtrain.AutoModelForCausalLM.from_pretrained("sft-model-checkpoint")
ref_model = langtrain.AutoModelForCausalLM.from_pretrained("sft-model-checkpoint")

generation_kwargs = {"max_new_tokens": 256, "do_sample": True, "top_p": 0.9}

# Initialize PPO trainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,  # Reference model (frozen SFT model)
    tokenizer=tokenizer,
    reward_model=reward_model
)

# Training loop: outer passes over the prompt dataset
# (ppo_epochs in PPOConfig controls the inner PPO updates per batch)
num_train_epochs = 3
for epoch in range(num_train_epochs):
    for batch in ppo_trainer.dataloader:
        query_tensors = batch["input_ids"]

        # Generate responses from the current policy
        response_tensors = ppo_trainer.generate(
            query_tensors,
            return_prompt=False,
            **generation_kwargs
        )

        # Score responses with the trained reward model
        rewards = reward_model.get_rewards(query_tensors, response_tensors)

        # PPO step: update the policy against the rewards under the KL constraint
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)
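
Once the PPO loop converges (typically tracked through the mean reward and KL statistics logged above), the aligned policy can be persisted like any other checkpoint. A minimal sketch, assuming the standard save_pretrained interface:

# Persist the aligned policy and tokenizer for evaluation or deployment
model.save_pretrained("./rlhf-policy")
tokenizer.save_pretrained("./rlhf-policy")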