Fine-tuning

Master LLM fine-tuning techniques including supervised fine-tuning (SFT), parameter-efficient methods, and instruction tuning for specialized tasks.

LLM Fine-tuning Overview

Fine-tuning adapts pre-trained Large Language Models (LLMs) to downstream tasks using task-specific datasets:
Core Fine-tuning Types:
  • Supervised Fine-Tuning (SFT): Train on labeled task data with next-token prediction
  • Instruction Tuning: Fine-tune on instruction-response pairs for better instruction following
  • Reinforcement Learning from Human Feedback (RLHF): Align model outputs with human preferences
  • Constitutional AI (CAI): Train models to follow constitutional principles
Transfer Learning Benefits:
  • Knowledge Transfer: Leverage pre-trained representations and world knowledge
  • Sample Efficiency: Achieve strong performance with limited task-specific data
  • Computational Efficiency: Faster convergence compared to training from scratch
  • Catastrophic Forgetting Mitigation: Careful fine-tuning (low learning rates, parameter-efficient methods) preserves general capabilities while learning new tasks

Parameter-Efficient Fine-Tuning (PEFT)

Modern PEFT methods enable efficient adaptation of large models while preserving pre-trained knowledge:
LoRA (Low-Rank Adaptation):
  • Decomposes the weight update for W ∈ R^(d×k) into low-rank matrices: ΔW = BA, where B ∈ R^(d×r) and A ∈ R^(r×k) (see the parameter-count sketch after this list)
  • Typical rank r = 8-64 for effective adaptation with <1% trainable parameters
  • Target modules: query, key, value projections in attention layers
  • Variants: AdaLoRA (adaptive rank), QLoRA (quantized LoRA)
Advanced PEFT Techniques:
  • Prefix Tuning: Learn continuous task-specific prefixes prepended to input sequences
  • P-Tuning v2: Deep prompt tuning with trainable prompts in every layer
  • Adapters: Insert small feed-forward networks between transformer layers
  • BitFit: Fine-tune only bias parameters while freezing weights
  • LN-Tuning: Update only layer normalization parameters
Multi-LoRA Fusion:
  • Task Arithmetic: Combine multiple LoRA adapters through weighted averaging
  • Mixture of LoRAs: Route inputs to different LoRA experts based on task classification
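
To make the LoRA savings concrete, the sketch below counts trainable parameters for a single weight matrix under the ΔW = BA decomposition. It is plain Python, independent of any training library; the dimensions (a 4096×4096 attention projection, rank r = 16) are illustrative only.

# Trainable-parameter count for LoRA on one weight matrix W ∈ R^(d×k).
# Illustrative dimensions: a 4096x4096 attention projection, rank r = 16.
d, k, r = 4096, 4096, 16

full_params = d * k             # fine-tuning W directly
lora_params = d * r + r * k     # B ∈ R^(d×r) plus A ∈ R^(r×k)

print(f"Full fine-tuning: {full_params:,} trainable parameters")
print(f"LoRA (r={r}):     {lora_params:,} trainable parameters")
print(f"Fraction trained: {lora_params / full_params:.2%}")  # ~0.78% for this matrix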

Fine-tuning Optimization Strategies

Critical hyperparameters and techniques for effective LLM fine-tuning:
Learning Rate Scheduling:
  • Warmup: Linear/cosine warmup for the first 5-10% of training steps (a minimal scheduler sketch follows this list)
  • Decay: Cosine annealing or linear decay to prevent overtraining
  • Differential Learning Rates: Lower rates for embeddings/early layers, higher for task heads
  • Typical Ranges: 1e-5 to 5e-4 for full fine-tuning, 1e-4 to 1e-3 for LoRA
Training Dynamics:
  • Gradient Clipping: Prevent exploding gradients with a max norm of 1.0
  • Mixed Precision: Use FP16/BF16 for memory efficiency and speed
  • Gradient Accumulation: Simulate larger batch sizes when memory-constrained
  • Checkpointing: Save model states to resume training and prevent data loss
Regularization Techniques:
  • Weight Decay: L2 regularization, typically 0.01-0.1
  • Dropout: Layer-wise dropout rates of 0.1-0.3
  • Label Smoothing: Prevent overconfident predictions with ε = 0.1
  • Early Stopping: Monitor validation loss with a patience of 2-5 epochs
Data Efficiency:
  • Few-Shot ICL: In-context learning with exemplars before fine-tuning
  • Data Augmentation: Paraphrasing, backtranslation, syntactic transformations
  • Active Learning: Select the most informative samples for annotation
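
The warmup-then-cosine-decay schedule above is straightforward to wire up by hand. The sketch below uses Hugging Face transformers' get_cosine_schedule_with_warmup with a plain AdamW optimizer; it assumes a causal LM `model` and a `train_dataloader` yielding batches with labels already exist, and the 10% warmup fraction and 1000-step budget are illustrative. In the full examples below, langtrain.TrainingArguments handles the same thing via warmup_steps and lr_scheduler_type.

import torch
from transformers import get_cosine_schedule_with_warmup

max_steps = 1000
warmup_steps = int(0.1 * max_steps)  # 10% linear warmup, then cosine decay

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=max_steps,
)

for step, batch in enumerate(train_dataloader):
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()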

Instruction Tuning & Alignment

Advanced techniques for creating helpful, harmless, and honest AI assistants:
Supervised Fine-Tuning (SFT):
  • Train on high-quality instruction-response pairs
  • Format: {"instruction": "...", "input": "...", "output": "..."}
  • Technique: Next-token prediction loss computed on response tokens only
  • Dataset examples: Alpaca, Vicuna, Dolly, OpenAssistant
Reinforcement Learning from Human Feedback (RLHF):
  • Phase 1: SFT on demonstration data
  • Phase 2: Train a reward model on human preference rankings
  • Phase 3: Optimize the policy with PPO/TRPO against the reward model
  • KL Divergence Penalty: Prevent the policy from drifting too far from the SFT model
Constitutional AI (CAI):
  • Self-critique and revision following constitutional principles
  • Reduce harmful outputs without human feedback
  • Iterative refinement: critique → revise → evaluate
Direct Preference Optimization (DPO):
  • Train directly on preference data without an explicit reward model
  • More stable than RLHF with a simpler implementation
  • Loss function: L_DPO = -E[log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) - β log(π_θ(y_l|x)/π_ref(y_l|x)))] (see the PyTorch sketch after this list)
Evaluation Metrics:
  • Helpfulness: Task completion, instruction-following accuracy
  • Harmlessness: Toxicity scores, bias detection, safety evaluations
  • Honesty: Factual accuracy, calibration, uncertainty quantification
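
To make the DPO objective concrete, the sketch below computes the loss from per-sequence log-probabilities of the chosen (y_w) and rejected (y_l) responses under the policy and a frozen reference model. It is a minimal PyTorch illustration of the formula above, not Langtrain's DPO implementation; dpo_loss is a hypothetical helper, and the log-prob values in the toy batch are illustrative.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-sequence log-probs (summed over response tokens only)."""
    # β * log(π_θ / π_ref) for the chosen and rejected responses
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # L = -E[log σ(chosen_rewards - rejected_rewards)]
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of three preference pairs
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -8.5, -20.1]),
    policy_rejected_logps=torch.tensor([-14.2, -9.0, -19.8]),
    ref_chosen_logps=torch.tensor([-13.1, -8.7, -20.5]),
    ref_rejected_logps=torch.tensor([-13.9, -8.8, -20.0]),
)
print(loss)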

Full Examples

Supervised Fine-Tuning (SFT)

import langtrain
from langtrain.trainers import SFTTrainer
from langtrain.data import InstructionDataset

# Load pre-trained LLM
model = langtrain.AutoModelForCausalLM.from_pretrained(
    "llama-2-7b-hf",
    torch_dtype="auto",
    device_map="auto"
)

tokenizer = langtrain.AutoTokenizer.from_pretrained("llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Prepare instruction dataset
dataset = InstructionDataset.from_json(
    "alpaca_data.json",
    instruction_template="### Instruction:\n{instruction}\n\n### Response:\n{output}",
    max_seq_length=512
)

# Configure SFT training
training_args = langtrain.TrainingArguments(
    output_dir="./sft-llama-2-7b",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    max_steps=1000,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    fp16=True,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    max_grad_norm=1.0
)

# Start supervised fine-tuning
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
    packing=True,  # Pack multiple samples per sequence
    dataset_text_field="text"
)

trainer.train()
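
After training, a quick sanity check is to prompt the model with the same template used during fine-tuning. The snippet below is a minimal sketch reusing the model and tokenizer from the example above through the standard generate/decode interface; the instruction text is only illustrative.

# Sanity check: prompt with the same template used during fine-tuning
prompt = "### Instruction:\nSummarize the benefits of parameter-efficient fine-tuning.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Strip the prompt tokens and decode only the generated response
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)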

LoRA Parameter-Efficient Fine-Tuning

import torch
import langtrain
from langtrain.trainers import SFTTrainer
from peft import LoraConfig, get_peft_model, TaskType

# Load base model
model = langtrain.AutoModelForCausalLM.from_pretrained(
    "mistral-7b-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Use FlashAttention for efficiency
)

# Configure LoRA with optimal settings
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,             # Rank - higher for complex tasks
    lora_alpha=128,   # Scaling factor (typically 2*r)
    lora_dropout=0.05,  # Low dropout for stability
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention projections
        "gate_proj", "up_proj", "down_proj"      # MLP projections
    ],
    bias="none",
    use_rslora=True,  # Rank-stabilized LoRA
    init_lora_weights="gaussian"
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Reports trainable vs. total parameter counts

# Training with LoRA-specific settings
# (tokenizer and dataset are prepared as in the SFT example above)
training_args = langtrain.TrainingArguments(
    output_dir="./lora-mistral-7b",
    learning_rate=3e-4,  # Higher LR for LoRA
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    max_steps=2000,
    warmup_steps=200,
    weight_decay=0.01,
    logging_steps=25,
    save_steps=500,
    bf16=True,
    dataloader_pin_memory=False
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args
)

trainer.train()
model.save_pretrained("./lora-adapters")
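
For inference, the saved adapters can be re-attached to the base model and optionally merged into the base weights (W + BA), so no PEFT wrapper is needed at serving time. This is a minimal sketch using PEFT's standard PeftModel.from_pretrained and merge_and_unload APIs; it reuses the imports from the block above, and the paths follow that example.

from peft import PeftModel

# Reload the base model, then attach the trained LoRA adapters
base_model = langtrain.AutoModelForCausalLM.from_pretrained(
    "mistral-7b-v0.1", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./lora-adapters")

# Optionally fold the low-rank updates into the base weights for adapter-free serving
merged = model.merge_and_unload()
merged.save_pretrained("./mistral-7b-lora-merged")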

QLoRA - Quantized LoRA Fine-tuning

import torch
import langtrain
from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from langtrain.trainers import SFTTrainer

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normal Float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # Nested quantization
)

# Load quantized model
model = langtrain.AutoModelForCausalLM.from_pretrained(
    "CodeLlama-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = langtrain.AutoTokenizer.from_pretrained("CodeLlama-13b-hf")

# Prepare model for k-bit training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# QLoRA configuration
qlora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, qlora_config)

# Training arguments optimized for QLoRA
training_args = langtrain.TrainingArguments(
    output_dir="./qlora-codellama-13b",
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    max_steps=1500,
    warmup_steps=150,
    bf16=True,
    logging_steps=10,
    optim="paged_adamw_32bit",      # Memory-efficient paged optimizer
    lr_scheduler_type="constant",
    max_grad_norm=0.3,
    group_by_length=True            # Batch sequences of similar length to reduce padding
)

# code_dataset: an InstructionDataset of code instruction-response pairs,
# prepared as in the SFT example above
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=code_dataset,
    args=training_args,
    max_seq_length=2048
)

trainer.train()

RLHF - Reinforcement Learning from Human Feedback

import langtrain
from langtrain.rlhf import RewardModel
from langtrain.data import PreferenceDataset
from trl import PPOTrainer, PPOConfig

# Step 1: Train reward model on preference data
preference_data = PreferenceDataset.from_json("human_preferences.json")
splits = preference_data.train_test_split(0.1)

tokenizer = langtrain.AutoTokenizer.from_pretrained("sft-model-checkpoint")

reward_model = RewardModel.from_pretrained(
    "sft-model-checkpoint",  # Start from the SFT model
    num_labels=1
)

reward_trainer = langtrain.RewardTrainer(
    model=reward_model,
    tokenizer=tokenizer,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    compute_metrics=lambda p: {"accuracy": (p.predictions > 0).sum() / len(p.predictions)}
)

reward_trainer.train()

# Step 2: PPO training with reward model
ppo_config = PPOConfig(
    model_name="sft-model-checkpoint",
    learning_rate=1.41e-5,
    log_with="wandb",
    mini_batch_size=64,
    batch_size=256,
    gradient_accumulation_steps=4,
    optimize_cuda_cache=True,
    early_stopping=True,
    target_kl=0.1,   # KL divergence constraint
    ppo_epochs=4,    # Inner PPO optimization epochs per batch
    max_grad_norm=1.0,
    use_score_scaling=True,
    use_score_norm=True
)

# Policy to optimize and a frozen reference copy of the SFT model
model = langtrain.AutoModelForCausalLM.from_pretrained("sft-model-checkpoint")
ref_model = langtrain.AutoModelForCausalLM.from_pretrained("sft-model-checkpoint")

generation_kwargs = {"max_new_tokens": 256, "do_sample": True, "top_p": 0.9}

# Initialize PPO trainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,  # Reference model (frozen SFT model)
    tokenizer=tokenizer,
    reward_model=reward_model
)

# Training loop: outer passes over the prompt dataset
# (ppo_epochs in PPOConfig controls the inner PPO updates per batch)
num_train_epochs = 3
for epoch in range(num_train_epochs):
    for batch in ppo_trainer.dataloader:
        query_tensors = batch["input_ids"]

        # Generate responses from the current policy
        response_tensors = ppo_trainer.generate(
            query_tensors,
            return_prompt=False,
            **generation_kwargs
        )

        # Score responses with the trained reward model
        rewards = reward_model.get_rewards(query_tensors, response_tensors)

        # PPO step: update the policy against the rewards under the KL constraint
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)
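
Once the PPO loop converges (typically tracked through the mean reward and KL statistics logged above), the aligned policy can be persisted like any other checkpoint. A minimal sketch, assuming the standard save_pretrained interface:

# Persist the aligned policy and tokenizer for evaluation or deployment
model.save_pretrained("./rlhf-policy")
tokenizer.save_pretrained("./rlhf-policy")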