Experiment Tracking
Definition
Experiment tracking is the practice of systematically logging all relevant information about each ML training experiment: hyperparameters (learning rate, batch size, model architecture), dataset version and splits, environment (library versions, hardware), training metrics over time (loss curves, validation accuracy), evaluation metrics, and output artifacts (trained model weights). Tools like MLflow Tracking, Weights & Biases, Comet ML, and Neptune provide experiment tracking as a centralized database with comparison UIs. Without tracking, experiments are lost in notebooks, results are not reproducible, and teams repeatedly redo work that was already explored.
Why It Matters
Experiment tracking is the foundation of systematic, reproducible ML development. Without it, teams accumulate technical debt: 'What hyperparameters produced that great result from last month?' becomes an unanswerable question. With tracking, every experiment is a row in a database: you can sort by validation accuracy, filter by hyperparameter range, compare two runs side-by-side, and reproduce any result by re-running the logged parameters on the logged code commit with the logged dataset version. This reproducibility is both a productivity multiplier (no rework) and a compliance requirement (regulators increasingly demand model reproducibility documentation).
How It Works
An experiment tracking workflow: (1) initialize a run at the start of training (mlflow.start_run()); (2) log parameters (mlflow.log_param('lr', 0.001)); (3) log metrics each epoch (mlflow.log_metric('val_accuracy', 0.92, step=epoch)); (4) log artifacts at training completion (mlflow.log_artifact('model.pkl')); (5) end the run (mlflow.end_run()). The tracking server stores all runs in a queryable database. Team members access the comparison UI to find the best-performing experiment, reproduce it, or build on it. Integration with the model registry promotes experiment runs to versioned, deployable model artifacts.
Experiment Tracking — Run Comparison
| Run     | Learning Rate | Batch Size | F1 Score |
|---------|---------------|------------|----------|
| run-003 | 1e-4          | 32         | 0.91     |
| run-001 | 3e-4          | 16         | 0.87     |
| run-002 | 1e-3          | 32         | 0.79     |
Real-World Example
A data science team spent 3 weeks investigating why their production model was underperforming—the deployed model couldn't be reproduced because no one had tracked which hyperparameters, data version, or preprocessing steps produced it. After adopting Weights & Biases for experiment tracking, every training run became a permanent, queryable record. During the next incident investigation, the team identified the underperforming run in 10 minutes by comparing its metrics against the previously successful baseline, discovered that a data preprocessing change had introduced label noise, and reproduced and re-deployed the correct model within 4 hours.
Common Mistakes
- ✕ Logging only final metrics without logging per-epoch training curves—loss curves are essential for diagnosing training problems
- ✕ Not versioning input datasets alongside experiment parameters—the same hyperparameters on different data produce different models; data version is as important as code version
- ✕ Creating experiment tracking as an afterthought—it should be built into the training pipeline from the start, not added manually after interesting runs
Related Terms
MLOps
MLOps (Machine Learning Operations) applies DevOps principles to ML systems—combining engineering practices for model development, deployment, monitoring, and retraining into a disciplined operational lifecycle.
Model Registry
A model registry is a centralized repository that stores versioned model artifacts with their metadata—training parameters, evaluation metrics, data lineage, and deployment status—serving as the single source of truth for production models.
Model Versioning
Model versioning is the practice of systematically tracking and managing distinct versions of trained machine learning models — including their weights, configurations, training data references, and evaluation metrics — to enable reproducibility, rollback, and safe deployment.
Continuous Training
Continuous training automatically retrains ML models on fresh data when triggered by drift detection, schedule, or performance degradation—keeping models current with evolving real-world patterns without manual intervention.
Hyperparameter Tuning
Hyperparameter tuning is the process of searching for the optimal configuration settings that control how a machine learning model trains — such as learning rate, batch size, and architecture depth — to maximize performance on a target task.