Canary Deployment
Definition
Canary deployment is a progressive release strategy where a new model version is initially exposed to a small fraction of production traffic (typically 1-10%) while the majority of users continue to receive the current version. Metrics are monitored on both the canary and baseline populations; if the canary performs acceptably, the traffic percentage is gradually increased (1% → 5% → 25% → 100%). If the canary exhibits problems—higher error rates, degraded latency, worse business metrics—it is quickly rolled back to zero traffic with minimal user impact. The 'canary' metaphor references the historical use of canaries in coal mines to detect toxic gases.
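The traffic split described above can be sketched as a weighted routing decision. In practice the split happens at the API gateway or load balancer; `route_request` here is a hypothetical stand-in, not a real gateway API.

```python
import random

def route_request(canary_weight: float) -> str:
    """Route one request to 'canary' or 'baseline' given the configured
    traffic weight (e.g. 0.05 for a 5% canary). Illustrative only;
    real systems split traffic at the gateway/load balancer layer."""
    return "canary" if random.random() < canary_weight else "baseline"

# With a 5% weight, roughly 1 request in 20 reaches the new model.
counts = {"canary": 0, "baseline": 0}
for _ in range(10_000):
    counts[route_request(0.05)] += 1
```

Raising the canary percentage is then just a config change to `canary_weight`, and rollback is setting it to zero.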
Why It Matters
Canary deployment is the responsible way to release model changes that cannot be fully validated offline. Offline evaluation metrics never perfectly predict production performance—user behavior, data edge cases, and system interactions can only be observed with real traffic. Canary releases give teams the ability to validate model changes on real users at controlled risk. A bug that affects 1% of users is dramatically less damaging than one affecting 100%. For ML systems where a model change might degrade a key business metric (conversion rate, CSAT, revenue), canary deployment provides the safety net to catch such problems before they cause significant business impact.
How It Works
Canary deployment for ML models uses traffic-splitting infrastructure at the API gateway or load balancer layer. Configuration specifies what percentage of requests route to each model version. For statistical validity, the canary percentage must be high enough to accumulate sufficient samples for metric comparisons within the intended observation window—a 1% canary on low-traffic services may take days to accumulate statistical significance. Automated promotion criteria define what 'good enough' looks like: for example, 'promote to 100% if canary RMSE is within 5% of baseline after 1,000 requests and no error rate increase.' Automatic rollback triggers define failure conditions.
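The example promotion criterion quoted above can be encoded as a small decision function. The names and thresholds (`min_requests`, `rmse_tolerance`) are illustrative assumptions, not a standard API.

```python
def canary_decision(canary_rmse: float, baseline_rmse: float,
                    canary_error_rate: float, baseline_error_rate: float,
                    n_canary_requests: int,
                    min_requests: int = 1_000,
                    rmse_tolerance: float = 0.05) -> str:
    """Encode the example criteria: promote if canary RMSE is within 5%
    of baseline after 1,000 requests and the error rate has not increased.
    Returns 'promote', 'hold', or 'rollback'."""
    if canary_error_rate > baseline_error_rate:
        return "rollback"   # failure condition: error rate increased
    if n_canary_requests < min_requests:
        return "hold"       # not enough samples for a valid comparison yet
    if canary_rmse <= baseline_rmse * (1 + rmse_tolerance):
        return "promote"
    return "rollback"       # quality regression beyond tolerance
```

Defining this logic before the canary starts, rather than deciding under pressure, is what makes promotion and rollback automatic.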
Canary Deployment Rollout
- Phase 1 — 5% canary
- Phase 2 — 20% if healthy
- Phase 3 — 50% if healthy
- Phase 4 — 100% (full rollout)

Metrics are checked at each phase, with automatic rollback on an error spike.
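The phased rollout above can be sketched as a loop that raises the canary weight and rolls back on the first unhealthy check. Here `set_canary_traffic` and `is_healthy` are hypothetical stand-ins for a gateway configuration call and a metric check, respectively.

```python
PHASES = [5, 20, 50, 100]  # traffic percentages for each rollout phase

traffic_log = []

def set_canary_traffic(pct: int) -> None:
    """Stand-in for a real gateway/load-balancer API call; records changes."""
    traffic_log.append(pct)

def run_rollout(is_healthy) -> str:
    """Walk through the canary phases, checking health after each step.
    `is_healthy(pct)` is a callback that inspects canary metrics at the
    given traffic percentage; any failure triggers automatic rollback."""
    for pct in PHASES:
        set_canary_traffic(pct)
        if not is_healthy(pct):
            set_canary_traffic(0)   # automatic rollback on error spike
            return "rolled_back"
    return "promoted"
```

For example, `run_rollout(lambda pct: check_metrics())` would stop at whichever phase first shows degraded metrics and drop the canary back to zero traffic.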
Real-World Example
An e-commerce ranking team deployed a new recommendation model that showed 8% NDCG improvement in offline evaluation. They set up a canary at 5% of traffic with automated monitoring on add-to-cart rate and revenue-per-session. After 2 hours, the canary showed add-to-cart rates 11% lower than the baseline—offline NDCG improvement had not translated to online engagement improvement. Rollback was automatic, triggered by the monitoring threshold. The team diagnosed the issue (training-serving feature skew in a recency signal) and fixed it within a week, then re-deployed the corrected version. Without canary deployment, the degradation would have affected all users for days before detection.
Common Mistakes
- ✕ Setting canary percentages too low on low-traffic services—a 1% canary may take weeks to reach statistical significance
- ✕ Only monitoring technical metrics during the canary—business metrics (revenue, CSAT, conversion) are the ultimate validation and must be monitored alongside system metrics
- ✕ Not defining automated rollback conditions before starting the canary—manual rollback decisions made under pressure lead to delayed responses
Related Terms
Model Deployment
Model deployment is the process of moving a trained ML model from development into a production environment where it can serve real users—encompassing packaging, testing, infrastructure provisioning, and release management.
Shadow Deployment
Shadow deployment runs a new model on a copy of live traffic in parallel with the current production model—without affecting users—enabling risk-free validation of the new model's behavior against real production inputs.
Blue-Green Deployment
Blue-green deployment maintains two identical production environments—one active (blue), one idle (green)—enabling instant, zero-downtime model upgrades and immediate rollback by switching traffic between environments.
Model Monitoring
Model monitoring continuously tracks the health of deployed ML models—measuring prediction quality, input distributions, and system performance in production to detect degradation before it impacts users or business outcomes.
MLOps
MLOps (Machine Learning Operations) applies DevOps principles to ML systems—combining engineering practices for model development, deployment, monitoring, and retraining into a disciplined operational lifecycle.