Canary Deployment
Definition
Canary deployment is a progressive release strategy where a new model version is initially exposed to a small fraction of production traffic (typically 1-10%) while the majority of users continue to receive the current version. Metrics are monitored on both the canary and baseline populations; if the canary performs acceptably, the traffic percentage is gradually increased (1% → 5% → 25% → 100%). If the canary exhibits problems—higher error rates, degraded latency, worse business metrics—it is quickly rolled back to zero traffic with minimal user impact. The 'canary' metaphor references the historical use of canaries in coal mines to detect toxic gases.
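The traffic split described above can be sketched as a weighted routing decision. In practice the split happens at the API gateway or load balancer; `route_request` here is a hypothetical stand-in, not a real gateway API.

```python
import random

def route_request(canary_weight: float) -> str:
    """Route one request to 'canary' or 'baseline' given the configured
    traffic weight (e.g. 0.05 for a 5% canary). Illustrative only;
    real systems split traffic at the gateway/load balancer layer."""
    return "canary" if random.random() < canary_weight else "baseline"

# With a 5% weight, roughly 1 request in 20 reaches the new model.
counts = {"canary": 0, "baseline": 0}
for _ in range(10_000):
    counts[route_request(0.05)] += 1
```

Raising the canary percentage is then just a config change to `canary_weight`, and rollback is setting it to zero.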
Why It Matters
Canary deployment is the responsible way to release model changes that cannot be fully validated offline. Offline evaluation metrics never perfectly predict production performance—user behavior, data edge cases, and system interactions can only be observed with real traffic. Canary releases give teams the ability to validate model changes on real users at controlled risk. A bug that affects 1% of users is dramatically less damaging than one affecting 100%. For ML systems where a model change might degrade a key business metric (conversion rate, CSAT, revenue), canary deployment provides the safety net to catch such problems before they cause significant business impact.
How It Works
Canary deployment for ML models uses traffic-splitting infrastructure at the API gateway or load balancer layer. Configuration specifies what percentage of requests route to each model version. For statistical validity, the canary percentage must be high enough to accumulate sufficient samples for metric comparisons within the intended observation window—a 1% canary on low-traffic services may take days to accumulate statistical significance. Automated promotion criteria define what 'good enough' looks like: for example, 'promote to 100% if canary RMSE is within 5% of baseline after 1,000 requests and no error rate increase.' Automatic rollback triggers define failure conditions.
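The example promotion criterion quoted above can be encoded as a small decision function. The names and thresholds (`min_requests`, `rmse_tolerance`) are illustrative assumptions, not a standard API.

```python
def canary_decision(canary_rmse: float, baseline_rmse: float,
                    canary_error_rate: float, baseline_error_rate: float,
                    n_canary_requests: int,
                    min_requests: int = 1_000,
                    rmse_tolerance: float = 0.05) -> str:
    """Encode the example criteria: promote if canary RMSE is within 5%
    of baseline after 1,000 requests and the error rate has not increased.
    Returns 'promote', 'hold', or 'rollback'."""
    if canary_error_rate > baseline_error_rate:
        return "rollback"   # failure condition: error rate increased
    if n_canary_requests < min_requests:
        return "hold"       # not enough samples for a valid comparison yet
    if canary_rmse <= baseline_rmse * (1 + rmse_tolerance):
        return "promote"
    return "rollback"       # quality regression beyond tolerance
```

Defining this logic before the canary starts, rather than deciding under pressure, is what makes promotion and rollback automatic.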
Canary Deployment Rollout
- Phase 1 — 5% canary
- Phase 2 — 20% if healthy
- Phase 3 — 50% if healthy
- Phase 4 — 100% (full rollout)

Metrics are checked at each phase, with automatic rollback on an error spike.
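The phased rollout above can be sketched as a loop that raises the canary weight and rolls back on the first unhealthy check. Here `set_canary_traffic` and `is_healthy` are hypothetical stand-ins for a gateway configuration call and a metric check, respectively.

```python
PHASES = [5, 20, 50, 100]  # traffic percentages for each rollout phase

traffic_log = []

def set_canary_traffic(pct: int) -> None:
    """Stand-in for a real gateway/load-balancer API call; records changes."""
    traffic_log.append(pct)

def run_rollout(is_healthy) -> str:
    """Walk through the canary phases, checking health after each step.
    `is_healthy(pct)` is a callback that inspects canary metrics at the
    given traffic percentage; any failure triggers automatic rollback."""
    for pct in PHASES:
        set_canary_traffic(pct)
        if not is_healthy(pct):
            set_canary_traffic(0)   # automatic rollback on error spike
            return "rolled_back"
    return "promoted"
```

For example, `run_rollout(lambda pct: check_metrics())` would stop at whichever phase first shows degraded metrics and drop the canary back to zero traffic.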
Real-World Example
An e-commerce ranking team deployed a new recommendation model that showed 8% NDCG improvement in offline evaluation. They set up a canary at 5% of traffic with automated monitoring on add-to-cart rate and revenue-per-session. After 2 hours, the canary showed add-to-cart rates 11% lower than the baseline—offline NDCG improvement had not translated to online engagement improvement. Rollback was automatic, triggered by the monitoring threshold. The team diagnosed the issue (training-serving feature skew in a recency signal) and fixed it within a week, then re-deployed the corrected version. Without canary deployment, the degradation would have affected all users for days before detection.
Common Mistakes
- ✕ Setting canary percentages too low on low-traffic services—a 1% canary may take weeks to reach statistical significance
- ✕ Only monitoring technical metrics during the canary—business metrics (revenue, CSAT, conversion) are the ultimate validation and must be monitored alongside system metrics
- ✕ Not defining automated rollback conditions before starting the canary—manual rollback decisions made under pressure lead to delayed responses
Related Terms
Model Deployment
Model deployment is the process of moving a trained ML model from development into a production environment where it can serve real users—encompassing packaging, testing, infrastructure provisioning, and release management.
Shadow Deployment
Shadow deployment runs a new model on a copy of live traffic in parallel with the current production model—without affecting users—enabling risk-free validation of the new model's behavior against real production inputs.
Blue-Green Deployment
Blue-green deployment maintains two identical production environments—one active (blue), one idle (green)—enabling instant, zero-downtime model upgrades and immediate rollback by switching traffic between environments.
Model Monitoring
Model monitoring continuously tracks the health of deployed ML models—measuring prediction quality, input distributions, and system performance in production to detect degradation before it impacts users or business outcomes.
MLOps
MLOps (Machine Learning Operations) applies DevOps principles to ML systems—combining engineering practices for model development, deployment, monitoring, and retraining into a disciplined operational lifecycle.