AI Infrastructure, Safety & Ethics

Model Monitoring

Definition

Model monitoring is the practice of continuously observing a deployed model's behavior in production across multiple dimensions: (1) data quality—are inputs valid, complete, and within expected ranges? (2) data drift—has the distribution of input features shifted since training? (3) prediction drift—has the distribution of model outputs changed? (4) model performance—are accuracy, F1, or business metrics meeting targets? (5) system health—latency, error rates, throughput. Monitoring tools include Evidently AI, Fiddler, Arize, WhyLabs, and MLflow. LLM monitoring adds language-specific metrics: response quality, hallucination rate, and topic distribution.
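The first dimension, data quality, can be sketched as a per-request validation step. This is a minimal illustration, not the API of any tool listed above; the field names and ranges are hypothetical:

```python
# Hypothetical expected ranges for two input features of a churn model.
EXPECTED_RANGES = {
    "tenure_months": (0, 600),
    "monthly_charge": (0.0, 10_000.0),
}

def validate_input(record: dict) -> list[str]:
    """Return a list of data-quality issues found in one input record."""
    issues = []
    for field, (lo, hi) in EXPECTED_RANGES.items():
        value = record.get(field)
        if value is None:
            issues.append(f"{field}: missing")
        elif not (lo <= value <= hi):
            issues.append(f"{field}: {value} outside [{lo}, {hi}]")
    return issues
```

A monitoring pipeline would run a check like this on every request and track the rate of flagged records over time.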

Why It Matters

Models degrade silently. Unlike application software that crashes visibly when something breaks, a model's accuracy can erode gradually over weeks as the real world diverges from training data—no errors logged, no alerts fired, business metrics declining invisibly. Without monitoring, teams discover model degradation through customer complaints or quarterly business reviews. With monitoring, alerts fire within hours of significant drift, enabling proactive retraining before users notice. For regulated industries, monitoring provides the ongoing model performance documentation required for compliance.

How It Works

A model monitoring system ingests production predictions and (when available) ground truth labels, then computes metrics continuously. Statistical drift detection uses tests like Population Stability Index (PSI), Kolmogorov-Smirnov test, or Jensen-Shannon divergence to compare current input/output distributions against training baselines. Performance monitoring tracks business metrics aligned to model predictions. Alert thresholds trigger notifications when metrics cross defined bounds. Dashboards provide time-series views of all metrics. For LLMs without ground truth labels, proxy metrics—user feedback rates, escalation rates, response length distributions—serve as performance proxies.
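As a sketch of the statistical piece, here is a minimal pure-Python Population Stability Index. The binning scheme (baseline-derived equal-width bins) and the small floor that keeps the logarithm finite are implementation choices, not part of the PSI definition:

```python
import math

def psi(baseline: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index: compares the binned distribution of a
    production sample against the training-time baseline."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) or 1.0  # guard against a constant baseline

    def bin_fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # clamp values outside the baseline range into the edge bins
            idx = min(max(int((v - lo) / width * bins), 0), bins - 1)
            counts[idx] += 1
        # small floor keeps log() finite when a bin is empty
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = bin_fractions(baseline), bin_fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

By common convention, PSI below 0.1 indicates a stable distribution, 0.1–0.25 a moderate shift, and above 0.25 a significant shift worth investigating.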

Model Monitoring Dashboard

  Metric                    Threshold   Current   Status
  Avg Latency (p95)         300ms       220ms     OK
  Error Rate                1%          0.8%      OK
  Prediction Drift Score    0.15        0.18      Alert
  Token Refusal Rate        3%          4.2%      Alert

Real-World Example

A customer churn prediction model at a telecoms company was deployed without monitoring. Six months after deployment, churn prediction accuracy had dropped from 84% to 67% as the company launched a new pricing tier that changed customer behavior patterns—a distribution shift the model had never seen. Finance calculated that over those six months the degraded model had missed at-risk customers representing $2.3M in revenue. After deploying Evidently AI monitoring with weekly PSI checks and performance alerts, the next distribution shift (from a competitor promotion) was detected within 3 days, triggering a retraining run that restored accuracy within a week.

Common Mistakes

  • Monitoring only system health metrics (latency, error rates) without model quality metrics—a model can be technically healthy while producing increasingly wrong predictions
  • Setting monitoring thresholds without calibrating them on historical data—thresholds set too tight produce alert fatigue; too loose miss real degradation
  • Neglecting LLM-specific monitoring—traditional ML monitoring metrics don't capture language quality, hallucination rates, or topic drift
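The second mistake above has a simple remedy: derive each threshold from the metric's own history rather than picking a round number. A minimal sketch, assuming you keep a window of historical metric values and alert only when the current value exceeds what normal variation produced (here, the 99th percentile):

```python
def calibrate_threshold(history: list[float], quantile: float = 0.99) -> float:
    """Pick a threshold that historical values exceeded roughly
    (1 - quantile) of the time, so normal variation rarely alerts."""
    ordered = sorted(history)
    idx = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[idx]
```

Recalibrating periodically (e.g. after each retraining) keeps the threshold aligned with the model's current normal behavior.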
