Data Pipeline
Definition
A data pipeline is an automated workflow that moves data from one or more sources through a series of transformations to one or more destinations. In ML contexts, pipelines serve multiple purposes: training data pipelines ingest raw data, clean it, transform features, and write training datasets; inference pipelines preprocess incoming requests and transform raw inputs to model-ready feature vectors; monitoring pipelines collect predictions and labels to compute drift and performance metrics. Tools include Apache Airflow, Prefect, dbt (for SQL transformations), Apache Spark (for large-scale batch processing), and Apache Kafka (for streaming pipelines).
Why It Matters
Data pipeline reliability is the unglamorous foundation of reliable ML systems. Practitioners commonly estimate that the majority of production ML engineering time (a figure of roughly 80% is often cited) is spent on data: pipelines that break silently, produce incorrect transformations, or deliver stale data cause model degradation that looks like a model problem but is actually a data problem. Teams without disciplined data pipeline engineering discover that models which were impressive in development become unreliable garbage-in/garbage-out systems in production. Data pipeline quality directly determines a model's quality ceiling; no model can outperform the data it receives.
How It Works
A production data pipeline includes: (1) data ingestion—extracting data from sources (databases, APIs, message queues, files); (2) data validation—schema checks, null validation, range checks, and anomaly detection on raw data; (3) feature transformation—computing derived features, aggregations, and encodings; (4) quality checkpointing—validating transformed data before writing downstream; (5) data loading—writing to feature stores, data warehouses, or training datasets; (6) orchestration—scheduling, dependency management, retry logic, and alerting for failures. Great Expectations and dbt tests provide data quality assertions that catch pipeline problems automatically.
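Stages 2 through 4 above can be sketched in a few lines. This is a minimal, hand-rolled illustration, not the Great Expectations or dbt API; every function and field name here (`validate`, `transform`, `checkpoint`, `amount_bucket`, and so on) is an assumption chosen for the example.

```python
import math

# Stage 2 inputs: rows arrive as plain dicts with an assumed raw schema.
RAW_SCHEMA = {"user_id": int, "amount": float}

def validate(rows, schema):
    """Stage 2: schema, null, and range checks on raw rows."""
    good, bad = [], []
    for row in rows:
        ok = all(
            key in row and isinstance(row[key], typ)
            for key, typ in schema.items()
        )
        if ok and row["amount"] >= 0:      # simple range check
            good.append(row)
        else:
            bad.append(row)                # quarantine for inspection
    return good, bad

def transform(rows):
    """Stage 3: derive a feature (a log-scaled amount bucket)."""
    return [
        {**row, "amount_bucket": int(math.log10(row["amount"] + 1))}
        for row in rows
    ]

def checkpoint(rows):
    """Stage 4: assert invariants on transformed data before loading."""
    assert all(r["amount_bucket"] >= 0 for r in rows), "negative bucket"
    return rows

raw = [
    {"user_id": 1, "amount": 42.0},
    {"user_id": 2, "amount": -5.0},        # fails the range check
    {"user_id": 3, "amount": None},        # fails the type/null check
]
good, quarantined = validate(raw, RAW_SCHEMA)
features = checkpoint(transform(good))
print(len(features), len(quarantined))     # 1 row passes, 2 quarantined
```

The key design choice is that bad rows are quarantined rather than dropped silently, which preserves the error signal the "Common Mistakes" section below warns about losing.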
AI Data Pipeline (ELT/ETL)
1. Ingest: APIs, files, streams
2. Validate: schema & quality checks
3. Transform: clean, join, enrich
4. Load: warehouse / vector DB
5. Monitor: freshness & drift alerts
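Step 5's freshness alerting can be as simple as comparing the newest loaded event time against a staleness SLA. A minimal sketch, assuming each loaded partition records the event time of its newest row; the function name and the two-hour threshold are illustrative, not from any real tool.

```python
from datetime import datetime, timedelta, timezone

# Assumed SLA: output older than this should page someone.
MAX_STALENESS = timedelta(hours=2)

def freshness_alert(newest_event_time, now=None):
    """Return True when pipeline output is staler than the SLA allows."""
    now = now or datetime.now(timezone.utc)
    return now - newest_event_time > MAX_STALENESS

now = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
assert not freshness_alert(now - timedelta(minutes=30), now)  # fresh: no alert
assert freshness_alert(now - timedelta(hours=3), now)         # stale: alert
```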
Real-World Example
A recommendation engine at a media company suffered from mysterious periodic accuracy drops every Monday morning. Investigation revealed a data pipeline bug: weekend engagement events were processed with different timezone handling than weekday events, causing the 'recency' feature to be computed incorrectly for all users on Monday until mid-morning, when enough weekday events had accumulated to dilute the bad data. The bug had existed for four months but was never connected to the Monday accuracy drops because there was no data quality monitoring. Adding Great Expectations assertions to the pipeline surfaced the timezone discrepancy on the very next Monday.
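The general defense against this class of bug is to refuse naive timestamps and normalize everything to UTC before computing time-based features. A hedged sketch, not the company's actual code; the helper name and timestamps are invented for illustration.

```python
from datetime import datetime, timezone, timedelta

def recency_hours(event_ts: datetime, now: datetime) -> float:
    """Hours since an event, computed entirely in UTC."""
    # Refuse naive timestamps instead of guessing a timezone; a naive
    # weekend timestamp is exactly the kind of bug described above.
    if event_ts.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return (now - event_ts.astimezone(timezone.utc)).total_seconds() / 3600

now = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)   # Monday 09:00 UTC
sat = datetime(2024, 1, 6, 21, 0, tzinfo=timezone(timedelta(hours=-5)))
print(round(recency_hours(sat, now), 1))  # Sat 21:00 UTC-5 is Sun 02:00 UTC -> 31.0
```

Raising on naive timestamps converts a silent feature-corruption bug into a loud pipeline failure, which the validation stage can then catch.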
Common Mistakes
- ✕Building data pipelines without data quality validation—silent data corruption is more dangerous than pipeline failures because it has no error signal
- ✕Not making pipelines idempotent: a pipeline that cannot be safely re-run after a failure double-counts or duplicates data on every retry, turning routine recovery into a risky manual operation
- ✕Ignoring schema evolution: production data sources change over time, and pipelines must detect and handle schema changes explicitly rather than silently producing corrupt output
Related Terms
MLOps
MLOps (Machine Learning Operations) applies DevOps principles to ML systems—combining engineering practices for model development, deployment, monitoring, and retraining into a disciplined operational lifecycle.
Feature Store
A feature store is a centralized data platform that computes, stores, and serves machine learning features consistently across both model training and production inference—eliminating training-serving skew and making feature reuse across models efficient.
Continuous Training
Continuous training automatically retrains ML models on fresh data when triggered by drift detection, schedule, or performance degradation—keeping models current with evolving real-world patterns without manual intervention.
Data Drift
Data drift is the gradual change in the statistical properties of model inputs over time, causing a mismatch between the data distribution the model was trained on and what it encounters in production—leading to silent accuracy degradation.
Model Monitoring
Model monitoring continuously tracks the health of deployed ML models—measuring prediction quality, input distributions, and system performance in production to detect degradation before it impacts users or business outcomes.