AI Infrastructure, Safety & Ethics

Data Drift

Definition

Data drift (also called feature drift or covariate shift) occurs when the distribution of input features to a deployed model shifts from the distribution observed during training. A fraud detection model trained in 2023 may encounter very different transaction patterns in 2026 as payment methods, merchant categories, and fraud tactics evolve. A sentiment classifier trained on product reviews may drift when the company launches in a new market with different customer vocabulary. Data drift is distinct from concept drift (the relationship between inputs and labels changes) and label drift (the distribution of output labels changes). All three are monitored separately.

Why It Matters

Data drift is the primary cause of gradual model degradation in production—the silent decay that erodes accuracy without any code change or system error. Because drift accumulates incrementally, it often slips past normal monitoring for weeks or months. A model that was 91% accurate at deployment may quietly decay to 78% as input distributions shift, producing increasingly unreliable predictions whose business impact compounds before anyone investigates. Proactive drift monitoring turns this silent decay into a detectable, actionable signal that triggers retraining before performance drops significantly.

How It Works

Data drift detection compares the current distribution of each input feature against its training distribution using statistical tests. The Population Stability Index (PSI) summarizes the overall shift: PSI < 0.1 indicates stability, 0.1-0.2 minor drift, and > 0.2 significant drift. The Kolmogorov-Smirnov test measures the maximum difference between the two empirical CDFs. Jensen-Shannon divergence measures the information-theoretic distance between the distributions. For categorical features, chi-squared tests or direct frequency comparisons work well. Monitoring systems compute these statistics over rolling windows (daily, weekly) and alert when drift exceeds a threshold.
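As a rough sketch of how these tests look in code—using NumPy and SciPy on a synthetic feature rather than a real monitoring pipeline:

```python
# Continuous-feature drift checks on a synthetic feature. In practice,
# `train` would be the training sample and `prod` a rolling production window.
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5000)  # training distribution
prod = rng.normal(loc=0.4, scale=1.2, size=5000)   # shifted production window

# Kolmogorov-Smirnov: maximum distance between the two empirical CDFs
ks_stat, p_value = ks_2samp(train, prod)

# Jensen-Shannon distance on shared histogram bins
bins = np.histogram_bin_edges(np.concatenate([train, prod]), bins=30)
p, _ = np.histogram(train, bins=bins, density=True)
q, _ = np.histogram(prod, bins=bins, density=True)
js = jensenshannon(p, q)

print(f"KS statistic: {ks_stat:.3f} (p={p_value:.2e})")
print(f"Jensen-Shannon distance: {js:.3f}")
```

A tiny p-value from the KS test means the production window is very unlikely to come from the training distribution, which is the signal a monitoring system would alert on.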

Data Drift — Training vs Production Distribution

Query type       Training   Production
Formal queries   60%        20%
Short queries    30%        25%
Code queries     10%        55%

Distribution shift detected → model performance degrades on code queries
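The shift illustrated above can be quantified directly. Here is a minimal PSI calculation on those query-type shares (the categorical form of the index, using the percentages from the illustration):

```python
# PSI over categories: sum of (prod - train) * ln(prod / train).
import math

train_share = {"formal": 0.60, "short": 0.30, "code": 0.10}
prod_share = {"formal": 0.20, "short": 0.25, "code": 0.55}

psi = sum(
    (prod_share[c] - train_share[c]) * math.log(prod_share[c] / train_share[c])
    for c in train_share
)
print(f"PSI = {psi:.2f}")  # far above the 0.2 threshold for significant drift
```

A PSI this large (above 1.0) indicates a drastic shift—most of the signal comes from code queries jumping from 10% to 55% of traffic.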

Real-World Example

A loan approval model was trained on pre-pandemic data where employment type was distributed 80% full-time, 15% part-time, 5% self-employed. Post-pandemic, the distribution shifted to 65% full-time, 20% part-time, 15% self-employed. The employment feature's PSI came to roughly 0.16—well above the 0.1 stability threshold—and the monitoring system flagged it. Investigation revealed the model was systematically underscoring self-employed applicants because their income documentation patterns differed from the training distribution. Retraining on updated data corrected the bias and improved approval accuracy for self-employed applicants by 22 percentage points.

Common Mistakes

  • Treating all drift as equally important—drift in high-importance features degrades model performance more than drift in low-importance features
  • Setting identical drift thresholds for all features—features with naturally high variance need looser thresholds than stable features
  • Monitoring only for data drift without monitoring model output drift—data drift doesn't always cause output drift if the model is robust to the observed shift
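The first two pitfalls can be addressed together by weighting each feature's drift score by its importance and giving naturally noisy features looser thresholds. A minimal sketch—the feature names, importances, and thresholds here are invented for illustration:

```python
# Importance-aware drift alerting: alert only when a feature's PSI exceeds
# its own threshold, then rank alerts by PSI weighted by feature importance.
# All numbers are illustrative, not from a real monitoring system.
features = {
    # name: (psi, importance, threshold)
    "transaction_amount": (0.25, 0.40, 0.20),
    "merchant_category": (0.12, 0.30, 0.20),
    "hour_of_day": (0.35, 0.05, 0.40),  # noisy feature: looser threshold
}

alerts = sorted(
    (
        (name, psi * importance)
        for name, (psi, importance, threshold) in features.items()
        if psi > threshold
    ),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in alerts:
    print(f"DRIFT ALERT: {name} (weighted score {score:.3f})")
```

Note that hour_of_day has the highest raw PSI but triggers no alert, because its threshold accounts for its natural variance and its low importance means drift there barely affects predictions.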
