Quick Start Guide
Get up and running with LangTrain quickly. This guide covers installation, setup, and your first model fine-tuning in under 10 minutes.
Key Features
⚡ **5-Minute Setup**: Complete installation and environment setup in under 5 minutes with our streamlined process.
📦 **One-Command Install**: A single pip install command gets you everything you need to start fine-tuning models.
🚀 **Pre-built Examples**: Ready-to-run examples for chat models, text classification, and code generation.
🎯 **Zero Configuration**: Sensible defaults mean you can start training immediately without complex setup.
Prerequisites
Before installing LangTrain, ensure your system meets the minimum requirements:
**Python**: Version 3.8 or higher (3.10+ recommended)
**GPU**: NVIDIA GPU with CUDA support (optional but recommended)
**Memory**: At least 8GB RAM (16GB+ for larger models)
**Storage**: 10GB free space for models and datasets
Code Example
# Check your Python version
python --version # Should be 3.8+
# Check CUDA availability (optional)
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# Check available memory
python -c "import psutil; print(f'Available RAM: {psutil.virtual_memory().available / 1e9:.1f} GB')"Installation
Install LangTrain using pip. We recommend creating a **virtual environment** to avoid package conflicts.
LangTrain automatically installs all required dependencies including PyTorch, Transformers, and PEFT libraries.
Code Example
# Create virtual environment (recommended)
python -m venv langtrain-env
source langtrain-env/bin/activate # On Windows: langtrain-env\Scripts\activate
# Install LangTrain
pip install langtrain-ai
# Or install with GPU support
pip install "langtrain-ai[gpu]"
# Install development version (latest features)
pip install git+https://github.com/langtrain/langtrain.git
# Verify installation
python -c "import langtrain; print(f'LangTrain version: {langtrain.__version__}')"
# Check GPU support
python -c "import langtrain; print(f'GPU support: {langtrain.cuda.is_available()}')"Your First Fine-tuning
Let's fine-tune a conversational AI model using LoRA. This example uses a small dataset and efficient parameters suitable for most hardware configurations.
We'll use **microsoft/DialoGPT-medium** as our base model and fine-tune it on custom conversation data.
Code Example
from langtrain import LoRATrainer
from langtrain.datasets import create_conversation_dataset
from transformers import AutoTokenizer
import torch
# Step 1: Prepare your data
conversation_data = [
{"user": "Hello!", "assistant": "Hi there! How can I help you today?"},
{"user": "What's the weather like?", "assistant": "I don't have access to real-time weather data, but I'd be happy to help you find weather information!"},
{"user": "Tell me a joke", "assistant": "Why don't scientists trust atoms? Because they make up everything!"}
]
# Step 2: Create dataset
dataset = create_conversation_dataset(
    data=conversation_data,
    tokenizer_name="microsoft/DialoGPT-medium",
    max_length=512
)
# Step 3: Configure training
trainer = LoRATrainer(
model_name="microsoft/DialoGPT-medium",
dataset=dataset,
output_dir="./my_chatbot",
# LoRA parameters
lora_r=16,
lora_alpha=32,
lora_dropout=0.1,
# Training parameters
num_epochs=3,
batch_size=4,
learning_rate=2e-4,
warmup_steps=100,
)
# Step 4: Start training
trainer.train()
# Step 5: Test your model
response = trainer.generate("Hello!", max_length=50)
print(f"Model response: {response}")Loading Custom Data
LangTrain supports multiple data formats including **JSON**, **JSONL**, **CSV**, and **Hugging Face datasets**. Here's how to load and prepare your own data for fine-tuning.
The data preparation step is crucial for training quality. LangTrain provides utilities to handle common data formats and preprocessing tasks.
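For reference, the from_jsonl loader in the example below reads one JSON object per line. Here is a minimal, hypothetical sketch of how such a file could be produced with Python's standard json module; the sample rows and file path are placeholders, and the "text" and "label" keys simply mirror the text_column and label_column arguments used below.
import json
# Hypothetical sample records; the "text" and "label" keys match the
# text_column/label_column arguments passed to from_jsonl below.
sample_rows = [
    {"text": "I loved this movie!", "label": "positive"},
    {"text": "The plot was hard to follow.", "label": "negative"},
]
# JSONL convention: write one JSON object per line.
with open("path/to/your/data.jsonl", "w", encoding="utf-8") as f:
    for row in sample_rows:
        f.write(json.dumps(row) + "\n")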
Code Example
from langtrain.data import DataLoader, TextProcessor
import pandas as pd
# Method 1: Load from JSONL file
data_loader = DataLoader()
dataset = data_loader.from_jsonl(
"path/to/your/data.jsonl",
text_column="text",
label_column="label" # Optional for supervised tasks
)
# Method 2: Load from CSV
df = pd.read_csv("your_data.csv")
dataset = data_loader.from_pandas(
    df,
    text_column="conversation",
    preprocessing={
        "max_length": 512,
        "remove_duplicates": True,
        "filter_short": 10  # Remove texts shorter than 10 tokens
    }
)
# Method 3: Load from Hugging Face Hub
dataset = data_loader.from_hub(
"squad", # Dataset name
split="train[:1000]", # Use first 1000 examples
formatting_func=lambda x: f"Question: {x['question']} Answer: {x['answers']['text'][0]}"
)
# Method 4: Custom data processing
processor = TextProcessor()
processed_data = processor.process_conversations([
{"messages": [
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hi there!"}
]},
# More conversations...
])
# Apply advanced preprocessing
dataset = processor.apply_templates(
    processed_data,
    template="chatml",  # or "alpaca", "vicuna"
    system_message="You are a helpful assistant."
)
Configuration & Best Practices
Optimize your training with proper configuration. Key parameters include **learning rate scheduling**, **gradient accumulation**, and **evaluation strategies**.
**Learning Rate**: Start with 2e-4 for LoRA, 5e-5 for full fine-tuning. Use cosine scheduling for better convergence.
**Batch Size**: Use gradient accumulation to simulate larger batches on limited hardware.
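To make the gradient accumulation idea concrete, here is a generic PyTorch sketch (not LangTrain's internal implementation; the toy model and data are placeholders): the optimizer steps only after several backward passes, so a per-device batch of 2 accumulated over 8 steps behaves like a single batch of 16, matching the configuration below.
import torch

batch_size = 2
gradient_accumulation_steps = 8
print(f"Effective batch size: {batch_size * gradient_accumulation_steps}")  # 16

# Toy model and data, just to show the accumulation pattern.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batches = [torch.randn(batch_size, 10) for _ in range(gradient_accumulation_steps)]

for step, x in enumerate(batches):
    loss = model(x).pow(2).mean() / gradient_accumulation_steps  # Scale so accumulated gradients average out
    loss.backward()  # Gradients add up across micro-batches
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()  # One update for the whole effective batch
        optimizer.zero_grad()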
Code Example
# Advanced configuration
config = {
    # Model settings
    "model_name": "microsoft/DialoGPT-medium",
    "torch_dtype": "float16",  # Memory optimization
    "device_map": "auto",  # Automatic device placement
    # LoRA configuration
    "lora_config": {
        "r": 32,  # Rank
        "alpha": 64,  # Scaling
        "dropout": 0.05,
        "target_modules": ["q_proj", "v_proj"],
    },
    # Training hyperparameters
    "training": {
        "num_epochs": 5,
        "batch_size": 2,
        "gradient_accumulation_steps": 8,  # Effective batch size = 16
        "learning_rate": 1e-4,
        "weight_decay": 0.01,
        "warmup_ratio": 0.1,
        "lr_scheduler": "cosine",
        "save_steps": 500,
        "eval_steps": 250,
        "logging_steps": 10,
    },
    # Optimization
    "optimization": {
        "fp16": True,
        "gradient_checkpointing": True,
        "dataloader_num_workers": 4,
        "remove_unused_columns": False,
    },
    # Evaluation
    "evaluation": {
        "strategy": "steps",
        "metric_for_best_model": "eval_loss",
        "greater_is_better": False,
        "early_stopping_patience": 3,
    }
}
# Initialize with configuration
trainer = LoRATrainer(**config)
# Monitor training progress
def on_epoch_end(trainer, logs):
print(f"Epoch {logs['epoch']}: Loss = {logs['train_loss']:.4f}")
trainer.add_callback("on_epoch_end", on_epoch_end)
# Train with monitoring
trainer.train()
Next Steps
Congratulations! You've successfully fine-tuned your first model with LangTrain. Here are recommended next steps to expand your knowledge:
**Explore Advanced Techniques**: Learn about QLoRA for memory-efficient training of larger models.
**Try Different Tasks**: Experiment with text classification, code generation, or instruction following.
**Production Deployment**: Learn how to deploy your models with our deployment guides.
Code Example
# Next steps - explore more features
# 1. Try QLoRA for larger models
from langtrain import QLoRATrainer
qlora_trainer = QLoRATrainer(
model_name="huggyllama/llama-7b",
load_in_4bit=True,
# ... other configs
)
# 2. Experiment with different model architectures
models_to_try = [
"microsoft/DialoGPT-large",
"facebook/blenderbot-400M-distill",
"microsoft/CodeBERT-base",
"sentence-transformers/all-MiniLM-L6-v2"
]
# 3. Advanced evaluation
from langtrain.evaluation import ModelEvaluator
evaluator = ModelEvaluator()
metrics = evaluator.evaluate(
    model=trainer.model,
    dataset=test_dataset,
    metrics=["bleu", "rouge", "perplexity"]
)
# 4. Deploy your model
from langtrain.deployment import ModelServer
server = ModelServer(
model_path="./my_chatbot",
port=8000,
max_workers=4
)
server.start()
# 5. Continue learning
print("📚 Recommended reading:")
print("- Fine-tuning Guide: /docs/fine-tuning/lora-qlora")
print("- API Reference: /docs/api-reference")
print("- Best Practices: /docs/best-practices")
print("- Examples: /docs/examples")