Model Deployment
Definition
Model deployment is the transition from a trained model artifact to a live, user-facing service. It involves: packaging the model and its dependencies into a deployable container or artifact; configuring the serving infrastructure (hardware, scaling policies, networking); running integration tests against the production environment; executing the release (full rollout, canary, or blue-green); and verifying post-deployment behavior. For LLMs, deployment includes model serialization, quantization for production hardware, batching configuration, and integration with the application's API layer. Deployment is the highest-risk phase of the ML lifecycle because failures directly impact users.
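The packaging step described above can be sketched as a small helper that pins exactly what ships with the model: a version, a checksum of the weights, and exact dependency versions. This is a minimal sketch with hypothetical field names, not a standard artifact format.

```python
import hashlib
import json

def build_manifest(weights: bytes, model_version: str, deps: dict) -> str:
    """Produce a JSON manifest pinning the model version, a checksum of
    the weights, and exact dependency versions. The field names here are
    an illustrative example, not a standard format."""
    manifest = {
        "model_version": model_version,
        "weights_sha256": hashlib.sha256(weights).hexdigest(),
        "dependencies": deps,  # pin exact versions, e.g. {"torch": "2.3.1"}
    }
    return json.dumps(manifest, indent=2)

manifest_json = build_manifest(b"fake-weights", "v1.4.0", {"torch": "2.3.1"})
```

Pinning a checksum alongside the dependency versions lets the serving environment verify at startup that it loaded exactly the artifact that passed evaluation.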
Why It Matters
Deployment failures are among the most costly and visible AI system failures. A model with excellent evaluation metrics can fail in production due to software dependency mismatches, hardware differences between development and production, input distribution shifts not captured in evaluation, latency requirements that weren't tested offline, or integration bugs in the serving API. Systematic deployment practices—automated testing, staged rollouts, deployment checklists, and rollback procedures—are what distinguish teams that ship AI reliably from those that experience frequent production incidents.
How It Works
A mature deployment pipeline has five stages: (1) model artifact packaging (Docker container with fixed dependencies, model weights, and serving code); (2) automated integration tests (smoke tests with representative inputs, latency benchmarks, schema validation); (3) staging environment validation (full traffic replay or shadow testing); (4) deployment execution (canary release to 5% of traffic, monitor for 30 minutes, roll out to 100% or roll back); (5) post-deployment verification (compare prediction distribution against baseline, check error rates). Infrastructure-as-code (Terraform, Kubernetes manifests) ensures deployment environments are reproducible and auditable.
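Stage (4)—canary, then promote or roll back—reduces to a gating rule on live metrics. The sketch below compares canary and baseline error rates; the 10% relative-increase threshold is illustrative, and real systems also gate on latency and business metrics.

```python
def canary_decision(baseline_errors: int, baseline_total: int,
                    canary_errors: int, canary_total: int,
                    max_relative_increase: float = 0.10) -> str:
    """Promote the canary only if its error rate stays within an allowed
    relative increase over the baseline (threshold is illustrative)."""
    if canary_total == 0:
        return "hold"  # not enough canary traffic yet to decide
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / canary_total
    if canary_rate <= baseline_rate * (1 + max_relative_increase):
        return "promote"
    return "rollback"

# A 5% canary with a clearly elevated error rate triggers a rollback
decision = canary_decision(baseline_errors=50, baseline_total=10_000,
                           canary_errors=30, canary_total=500)
```

In practice this check runs repeatedly during the monitoring window (the 30 minutes mentioned above), so a regression is caught while it still affects only the canary slice.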
Model Deployment Pipeline
Develop (local / notebook) → Staging (pre-prod cluster) → Canary (5% production traffic) → Production (100% traffic, high availability)
Real-World Example
An e-commerce recommendation team deployed a new ranking model that passed all offline evaluation benchmarks (NDCG improved 12%). The naive deployment to 100% of production traffic immediately caused a 34% drop in add-to-cart rates—the model had been evaluated on a dataset that didn't include mobile users, who represent 60% of production traffic and behave very differently. After implementing canary deployment: the first 5% rollout revealed the mobile performance problem within 2 hours via real-time A/B metrics, and the rollback was executed before any significant business impact.
Common Mistakes
- ✕ Deploying directly to 100% of traffic without a staged rollout—a single defect affects all users simultaneously
- ✕ Not maintaining rollback capability—if a new deployment fails, you must be able to revert to the previous version within minutes
- ✕ Evaluating models only on offline metrics before deployment—production performance requires online evaluation against real traffic
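Rollback capability, per the second mistake above, usually means keeping the previously serving version addressable so a revert is a pointer swap rather than a fresh deploy. This in-memory sketch stands in for a real traffic router or registry; the class and method names are hypothetical.

```python
class ServingPointer:
    """Tracks the live model version and the one before it, so a
    rollback is an O(1) pointer swap (illustrative sketch only)."""

    def __init__(self, initial_version: str):
        self.live = initial_version
        self.previous = None

    def deploy(self, new_version: str) -> None:
        # Keep the outgoing version around as the rollback target
        self.previous, self.live = self.live, new_version

    def rollback(self) -> str:
        if self.previous is None:
            raise RuntimeError("no previous version to roll back to")
        self.live, self.previous = self.previous, None
        return self.live

ptr = ServingPointer("v1")
ptr.deploy("v2")                    # v2 goes live; v1 is retained
live_after_rollback = ptr.rollback()  # revert to v1 in one step
```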
Related Terms
MLOps
MLOps (Machine Learning Operations) applies DevOps principles to ML systems—combining engineering practices for model development, deployment, monitoring, and retraining into a disciplined operational lifecycle.
Model Serving
Model serving is the infrastructure that hosts trained ML models and exposes them as APIs, handling prediction requests in production with the latency, throughput, and reliability requirements of real applications.
Canary Deployment
Canary deployment gradually routes a small percentage of production traffic to a new model version, monitoring its behavior before full rollout—allowing real-world validation with limited blast radius if something goes wrong.
Model Registry
A model registry is a centralized repository that stores versioned model artifacts with their metadata—training parameters, evaluation metrics, data lineage, and deployment status—serving as the single source of truth for production models.
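The registry idea above can be reduced to a version-keyed store with stage transitions. This in-memory sketch is a stand-in for a real registry product (e.g. MLflow's); the method names are illustrative, not any specific product's API.

```python
class ModelRegistry:
    """Minimal in-memory model registry sketch: versioned records with
    metadata and a deployment stage (names are illustrative)."""

    def __init__(self):
        self._records = {}  # version -> {"metrics": ..., "stage": ...}

    def register(self, version, metrics):
        self._records[version] = {"metrics": metrics, "stage": "registered"}

    def promote(self, version, stage):
        # e.g. "registered" -> "staging" -> "production"
        self._records[version]["stage"] = stage

    def production_version(self):
        for version, rec in self._records.items():
            if rec["stage"] == "production":
                return version
        return None

reg = ModelRegistry()
reg.register("v1", {"ndcg": 0.41})
reg.promote("v1", "production")
prod = reg.production_version()
```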
Model Monitoring
Model monitoring continuously tracks the health of deployed ML models—measuring prediction quality, input distributions, and system performance in production to detect degradation before it impacts users or business outcomes.
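One simple form of the input-distribution tracking described above is a mean-shift check against a baseline window. The z-score threshold below is a deliberately simple illustration; production monitors typically use tests such as PSI or Kolmogorov–Smirnov.

```python
import statistics

def drift_alert(baseline: list, live: list, z_threshold: float = 3.0) -> bool:
    """Flag drift when the live mean deviates from the baseline mean by
    more than z_threshold baseline standard deviations (simplified check)."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        return statistics.mean(live) != mu
    z = abs(statistics.mean(live) - mu) / sigma
    return z > z_threshold

baseline_scores = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50]  # e.g. offline window
shifted_scores = [0.80, 0.82, 0.79, 0.81]               # live scores drifted up
alert = drift_alert(baseline_scores, shifted_scores)
```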