Model Monitoring
Definition
Model monitoring is the practice of continuously observing a deployed model's behavior in production across multiple dimensions: (1) data quality—are inputs valid, complete, and within expected ranges? (2) data drift—has the distribution of input features shifted since training? (3) prediction drift—has the distribution of model outputs changed? (4) model performance—is accuracy, F1, or business metric performance meeting targets? (5) system health—latency, error rates, throughput. Monitoring tools include Evidently AI, Fiddler, Arize, WhyLabs, and MLflow. LLM monitoring adds language-specific metrics: response quality, hallucination rate, and topic distribution.
Why It Matters
Models degrade silently. Unlike application software that crashes visibly when something breaks, a model's accuracy can erode gradually over weeks as the real world diverges from training data—no errors logged, no alerts fired, business metrics declining invisibly. Without monitoring, teams discover model degradation through customer complaints or quarterly business reviews. With monitoring, alerts fire within hours of significant drift, enabling proactive retraining before users notice. For regulated industries, monitoring provides the ongoing model performance documentation required for compliance.
How It Works
A model monitoring system ingests production predictions and (when available) ground truth labels, then computes metrics continuously. Statistical drift detection uses tests like Population Stability Index (PSI), Kolmogorov-Smirnov test, or Jensen-Shannon divergence to compare current input/output distributions against training baselines. Performance monitoring tracks business metrics aligned to model predictions. Alert thresholds trigger notifications when metrics cross defined bounds. Dashboards provide time-series views of all metrics. For LLMs without ground truth labels, proxy metrics—user feedback rates, escalation rates, response length distributions—serve as performance proxies.
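The drift tests named above can be computed without any monitoring platform. Below is a minimal, self-contained sketch of the Population Stability Index, with quantile bins taken from the training-time reference sample; bin count and the 0.1/0.25 interpretation bands follow the common rule of thumb, not any specific tool's defaults.

```python
import math
import random

def psi(reference, current, n_bins=10):
    """Population Stability Index between a reference sample and a
    current sample. Bins are quantiles of the reference sample; a
    small epsilon guards against empty bins. Rule of thumb:
    PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant."""
    ref = sorted(reference)
    # Quantile-based bin edges computed from the reference distribution
    edges = [ref[int(len(ref) * i / n_bins)] for i in range(1, n_bins)]

    def fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            # Count edges at or below x to find the bin index
            counts[sum(1 for e in edges if x >= e)] += 1
        eps = 1e-6
        return [max(c / len(sample), eps) for c in counts]

    expected = fractions(reference)
    actual = fractions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Example: compare the training baseline to fresh production inputs
random.seed(0)
reference = [random.gauss(0, 1) for _ in range(5000)]
production = [random.gauss(0.8, 1) for _ in range(5000)]
drift = psi(reference, production)  # well above the 0.25 "significant" band
```

In practice the same function runs per feature and per prediction column, so a single shifted input feature surfaces even when aggregate accuracy has not yet moved.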
Model Monitoring Dashboard (example readings)
- Avg Latency (p95): 220ms (threshold: 300ms)
- Error Rate: 0.8% (threshold: 1%)
- Prediction Drift Score: 0.18 (threshold: 0.15, exceeded)
- Token Refusal Rate: 4.2% (threshold: 3%, exceeded)
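The alerting step behind a dashboard like this reduces to comparing each metric against its bound. A minimal sketch, with metric names and thresholds mirroring the example readings above (all hypothetical, all treated as upper bounds):

```python
# Illustrative thresholds matching the example dashboard; a real
# system would load these from monitoring configuration.
THRESHOLDS = {
    "latency_p95_ms": 300.0,
    "error_rate": 0.01,
    "prediction_drift_score": 0.15,
    "token_refusal_rate": 0.03,
}

def check_alerts(readings: dict) -> list[str]:
    """Return the names of metrics that crossed their upper bound."""
    return [name for name, value in readings.items()
            if value > THRESHOLDS[name]]

readings = {
    "latency_p95_ms": 220.0,
    "error_rate": 0.008,
    "prediction_drift_score": 0.18,
    "token_refusal_rate": 0.042,
}
alerts = check_alerts(readings)  # drift score and refusal rate fire
```

Note that the two system-health metrics pass while both model-quality metrics alert, which is exactly the failure mode that health-only monitoring misses.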
Real-World Example
A customer churn prediction model at a telecoms company was deployed without monitoring. Six months after deployment, churn prediction accuracy had dropped from 84% to 67%: the company had launched a new pricing tier that changed customer behavior patterns, a distribution shift the model had never seen. Finance calculated that over those six months the degraded model had failed to flag at-risk customers representing $2.3M in revenue. After the team deployed Evidently AI monitoring with weekly PSI checks and performance alerts, the next distribution shift (from a competitor promotion) was detected in 3 days, triggering a retraining run that restored accuracy within a week.
Common Mistakes
- ✕ Monitoring only system health metrics (latency, error rates) without model quality metrics—a model can be technically healthy while producing increasingly wrong predictions
- ✕ Setting monitoring thresholds without calibrating them on historical data—thresholds set too tight produce alert fatigue; too loose miss real degradation
- ✕ Neglecting LLM-specific monitoring—traditional ML monitoring metrics don't capture language quality, hallucination rates, or topic drift
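One common way to calibrate thresholds on historical data, as the second mistake above recommends, is to set the alert bound at a high quantile of the metric's past values so only genuinely unusual readings fire. A minimal sketch (the latency series is made up for illustration):

```python
def calibrated_threshold(history, quantile=0.99):
    """Set an alert threshold at a high quantile of a metric's
    historical values; readings above it are rarer than
    (1 - quantile) of past observations."""
    vals = sorted(history)
    idx = min(int(quantile * len(vals)), len(vals) - 1)
    return vals[idx]

# Hypothetical week of hourly p95 latency readings (ms), with a
# repeating daily pattern between 200ms and 269ms
history = [200 + (i % 24) * 3 for i in range(168)]
threshold = calibrated_threshold(history, quantile=0.95)
```

Recalibrating periodically (e.g. monthly) keeps the bound tracking seasonal load patterns instead of a stale snapshot.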
Related Terms
MLOps
MLOps (Machine Learning Operations) applies DevOps principles to ML systems—combining engineering practices for model development, deployment, monitoring, and retraining into a disciplined operational lifecycle.
Data Drift
Data drift is the gradual change in the statistical properties of model inputs over time, causing a mismatch between the data distribution the model was trained on and what it encounters in production—leading to silent accuracy degradation.
Concept Drift
Concept drift occurs when the underlying statistical relationship between model inputs and the correct outputs changes over time—meaning the world itself has changed, making the model's learned patterns obsolete even if input distributions stay the same.
Model Deployment
Model deployment is the process of moving a trained ML model from development into a production environment where it can serve real users—encompassing packaging, testing, infrastructure provisioning, and release management.
Observability
Observability in AI systems is the ability to understand the internal state and behavior of deployed models from their external outputs — encompassing metrics, logs, and traces that enable teams to monitor performance, detect anomalies, and diagnose failures.