Datasets

Learn how to load, preprocess, and manage datasets in LangTrain to prepare your data for model training.

Dataset Management

LangTrain provides comprehensive tools for loading, preprocessing, and managing your training datasets. Support for multiple formats ensures you can work with your existing data seamlessly.

Supported Formats:
- CSV - Comma-separated values
- JSON/JSONL - JavaScript Object Notation, including newline-delimited JSON Lines
- Parquet - Columnar storage format
- HuggingFace Datasets - Direct integration
- Custom formats - Via preprocessing pipelines
Code Example
# Load dataset from various sources
from langtrain import Dataset

# From CSV
dataset = Dataset.from_csv('data.csv',
                           text_column='text',
                           label_column='label')

# From JSON Lines
dataset = Dataset.from_json('data.jsonl')

# From HuggingFace
dataset = Dataset.from_huggingface('imdb')

# Custom preprocessing
dataset = Dataset.from_custom(
    path='custom_data/',
    preprocessor=custom_preprocessor
)
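
The example above leaves two things implicit: loading Parquet files and the custom_preprocessor callable passed to Dataset.from_custom. As a sketch only, assuming a from_parquet constructor symmetric with from_csv and that the preprocessor is any callable that turns a raw file into records, these might look like:

# Hypothetical: assumes Dataset.from_parquet mirrors Dataset.from_csv
dataset = Dataset.from_parquet('data.parquet',
                               text_column='text',
                               label_column='label')

# Hypothetical preprocessor for Dataset.from_custom: any callable that
# converts one raw file into a list of {'text': ..., 'label': ...} records
# (this sketch expects tab-separated "text<TAB>label" lines)
def custom_preprocessor(path):
    records = []
    with open(path) as f:
        for line in f:
            text, label = line.rsplit('\t', 1)
            records.append({'text': text.strip(), 'label': label.strip()})
    return records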

Data Preprocessing

Apply cleaning, tokenization, and augmentation to prepare your data for training. Each transformation returns a new Dataset, so steps can be chained.
Code Example
# Data preprocessing pipeline: each step returns a new Dataset,
# so the transformations chain naturally
dataset = (
    dataset
    # Text cleaning
    .clean_text(remove_urls=True, remove_special=True)
    # Tokenization
    .tokenize(tokenizer='bert-base-uncased', max_length=512)
)

# Train/validation split; splitting before augmentation keeps
# augmented near-duplicates out of the validation set
train_set, val_set = dataset.split(train_size=0.8, stratify=True)

# Data augmentation on the training split only
train_set = train_set.augment(
    techniques=['synonym_replacement', 'back_translation']
)

# Custom preprocessing function
def custom_preprocess(batch):
    batch['text'] = [text.lower().strip() for text in batch['text']]
    return batch

dataset = dataset.map(custom_preprocess, batched=True)
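
With batched=True, custom_preprocess receives a dict of column lists rather than a single example (hence the list comprehension over batch['text']), which amortizes per-call overhead across the whole batch.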

Data Quality & Validation

Assess dataset quality with built-in analysis, validation, and cleaning utilities.
Code Example
# Data quality analysis
quality_report = dataset.analyze_quality()
print(quality_report.summary())

# Validation checks
dataset.validate([
    'check_missing_values',
    'check_label_distribution',
    'check_text_length',
    'check_duplicates'
])

# Automatic data cleaning
dataset = dataset.clean(
    remove_duplicates=True,
    handle_missing='drop',
    min_text_length=10,
    max_text_length=1000
)
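
The snippets above don't show what validate returns. As a hypothetical pattern only (the report object, its passed flag, and its failures list are assumptions, not documented API), a guard before training might look like:

# Hypothetical: assumes validate() returns a report with a boolean
# `passed` and a `failures` list naming the failed checks
report = dataset.validate([
    'check_missing_values',
    'check_duplicates'
])
if not report.passed:
    raise ValueError(f'Dataset failed validation: {report.failures}')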

Dataset Versioning

Track dataset versions and maintain reproducible experiments.
Code Example
# Version your datasets
dataset.save_version('v1.0', description='Initial dataset')

# Load specific version
dataset = Dataset.load_version('my_dataset', version='v1.0')

# Compare versions
comparison = Dataset.compare_versions('my_dataset', 'v1.0', 'v1.1')
print(comparison.statistics())
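
One way to keep experiments reproducible using only the calls shown above: save a version before training, record its identifier in the run configuration, and reload that exact snapshot later.

# Pin the dataset snapshot used for this run
dataset.save_version('v1.1', description='Cleaned and deduplicated')
run_config = {'dataset': 'my_dataset', 'dataset_version': 'v1.1'}

# Later: reproduce the run against the same data
dataset = Dataset.load_version(run_config['dataset'],
                               version=run_config['dataset_version'])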
