Efficient LoRA fine-tuning for large language models — with custom Triton and CUDA kernels for maximum GPU throughput.
Unsloth-compatible API. Trains locally on your GPU or dispatches to langtrain-server cloud workers. 12 training methods. Zero code changes to switch modes.
TRITON + CUDA KERNELS
Every training run automatically patches your model with the fastest available kernels — compiled CUDA extension first, Triton JIT fallback, PyTorch native last.
RELATIVE SPEEDUP OVER PYTORCH BASELINE
[Chart: speedup vs HF eager, vs nn.LayerNorm, vs HF apply_rotary, vs F.cross_entropy, vs FP16 KV cache]
3-TIER ACCELERATION HIERARCHY
1. CUDA extension: pre-compiled from cuda_kernels/csrc/. Zero JIT overhead. Max GPU occupancy.
2. Triton: JIT-compiled per SM arch on first call. RMSNorm, RoPE, FusedCE, KV Quant.
3. PyTorch native: always available. Used when no GPU is present or the kernels have not been built.
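The fallback order described above (compiled extension, then Triton, then native PyTorch) can be sketched as a first-match lookup. This is only an illustration of the idea, not langtune's actual patching code: `select_backend`, `TIERS`, and the module name `langtune_cuda_ext` are hypothetical, and a real check would also need to confirm that a GPU is present.

```python
import importlib.util


def _module_installed(module_name):
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


def select_backend(tiers, fallback="pytorch", is_available=_module_installed):
    """Return the first tier whose module is available, else the fallback."""
    for name, module_name in tiers:
        if is_available(module_name):
            return name
    # The last tier needs no probe: it is the unconditional fallback.
    return fallback


# Priority order from the hierarchy above. The extension's module name is an
# assumption made for this sketch; the real package layout may differ.
TIERS = [
    ("cuda_ext", "langtune_cuda_ext"),
    ("triton", "triton"),
]

if __name__ == "__main__":
    print(select_backend(TIERS))
```

Passing `is_available` as a parameter keeps the probe replaceable, so the selection logic can be exercised on a machine without a GPU or without Triton installed.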
FASTLANGUAGEMODEL API
12 TRAINING METHODS
SFT: supervised fine-tuning on instruction data
LoRA: low-rank adaptation; train <1% of parameters
QLoRA: 4-bit NF4 quantization + LoRA; run on 8 GB VRAM
DoRA: weight-decomposed LoRA; improved convergence
GaLore: gradient low-rank projection; full-parameter expressivity
IA³: infused adapter; 100× fewer parameters than LoRA
DPO: Direct Preference Optimization; no reward model
ORPO: Odds Ratio Preference Optimization
SimPO: Simple Preference Optimization
KTO: Kahneman-Tversky Optimization
RLHF: PPO with a custom reward model
Prefix tuning: task steering via learned prefixes
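To make the "<1% of parameters" figure for low-rank adaptation concrete, here is a small arithmetic sketch. It assumes the standard LoRA parameterization, in which a frozen weight of shape d_out × d_in gains two trainable factors of shapes d_out × r and r × d_in. The layer size and rank are illustrative values, not langtune defaults, and the helper names are hypothetical.

```python
def lora_param_count(d_in, d_out, rank):
    # Two low-rank factors: B is (d_out x rank) and A is (rank x d_in).
    return rank * (d_in + d_out)


def trainable_fraction(d_in, d_out, rank):
    # Trainable adapter parameters relative to the frozen weight they adapt.
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    # Illustrative numbers: one 4096 x 4096 projection adapted at rank 16.
    added = lora_param_count(4096, 4096, 16)
    frac = trainable_fraction(4096, 4096, 16)
    print(added)           # 131072
    print(f"{frac:.2%}")   # 0.78%
```

The fraction here is measured against a single adapted layer. Measured against the whole model it is smaller still when only some projections receive adapters.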
Install langtune and start training in under 5 minutes.