Data Pipeline
Definition
A data pipeline is an automated workflow that moves data from one or more sources through a series of transformations to one or more destinations. In ML contexts, pipelines serve multiple purposes: training data pipelines ingest raw data, clean it, transform features, and write training datasets; inference pipelines preprocess incoming requests and transform raw inputs to model-ready feature vectors; monitoring pipelines collect predictions and labels to compute drift and performance metrics. Tools include Apache Airflow, Prefect, dbt (for SQL transformations), Apache Spark (for large-scale batch processing), and Apache Kafka (for streaming pipelines).
Why It Matters
Data pipeline reliability is the unglamorous foundation of reliable ML systems. Practitioners commonly estimate that the majority of production ML engineering time (a figure of roughly 80% is often cited) is spent on data: pipelines that break silently, produce incorrect transformations, or deliver stale data cause model degradation that looks like a model problem but is actually a data problem. Teams without disciplined data pipeline engineering discover that models which were impressive in development become unreliable garbage-in/garbage-out systems in production. Data pipeline quality directly determines a model's quality ceiling; no model can outperform the data it receives.
How It Works
A production data pipeline includes: (1) data ingestion—extracting data from sources (databases, APIs, message queues, files); (2) data validation—schema checks, null validation, range checks, and anomaly detection on raw data; (3) feature transformation—computing derived features, aggregations, and encodings; (4) quality checkpointing—validating transformed data before writing downstream; (5) data loading—writing to feature stores, data warehouses, or training datasets; (6) orchestration—scheduling, dependency management, retry logic, and alerting for failures. Great Expectations and dbt tests provide data quality assertions that catch pipeline problems automatically.
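Stages 2 through 4 above can be sketched in a few lines. This is a minimal, hand-rolled illustration, not the Great Expectations or dbt API; every function and field name here (`validate`, `transform`, `checkpoint`, `amount_bucket`, and so on) is an assumption chosen for the example.

```python
import math

# Stage 2 inputs: rows arrive as plain dicts with an assumed raw schema.
RAW_SCHEMA = {"user_id": int, "amount": float}

def validate(rows, schema):
    """Stage 2: schema, null, and range checks on raw rows."""
    good, bad = [], []
    for row in rows:
        ok = all(
            key in row and isinstance(row[key], typ)
            for key, typ in schema.items()
        )
        if ok and row["amount"] >= 0:      # simple range check
            good.append(row)
        else:
            bad.append(row)                # quarantine for inspection
    return good, bad

def transform(rows):
    """Stage 3: derive a feature (a log-scaled amount bucket)."""
    return [
        {**row, "amount_bucket": int(math.log10(row["amount"] + 1))}
        for row in rows
    ]

def checkpoint(rows):
    """Stage 4: assert invariants on transformed data before loading."""
    assert all(r["amount_bucket"] >= 0 for r in rows), "negative bucket"
    return rows

raw = [
    {"user_id": 1, "amount": 42.0},
    {"user_id": 2, "amount": -5.0},        # fails the range check
    {"user_id": 3, "amount": None},        # fails the type/null check
]
good, quarantined = validate(raw, RAW_SCHEMA)
features = checkpoint(transform(good))
print(len(features), len(quarantined))     # 1 row passes, 2 quarantined
```

The key design choice is that bad rows are quarantined rather than dropped silently, which preserves the error signal the "Common Mistakes" section below warns about losing.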
AI Data Pipeline (ELT/ETL)
1. Ingest: APIs, files, streams
2. Validate: schema & quality checks
3. Transform: clean, join, enrich
4. Load: warehouse / vector DB
5. Monitor: freshness & drift alerts
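Step 5's freshness alerting can be as simple as comparing the newest loaded event time against a staleness SLA. A minimal sketch, assuming each loaded partition records the event time of its newest row; the function name and the two-hour threshold are illustrative, not from any real tool.

```python
from datetime import datetime, timedelta, timezone

# Assumed SLA: output older than this should page someone.
MAX_STALENESS = timedelta(hours=2)

def freshness_alert(newest_event_time, now=None):
    """Return True when pipeline output is staler than the SLA allows."""
    now = now or datetime.now(timezone.utc)
    return now - newest_event_time > MAX_STALENESS

now = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
assert not freshness_alert(now - timedelta(minutes=30), now)  # fresh: no alert
assert freshness_alert(now - timedelta(hours=3), now)         # stale: alert
```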
Real-World Example
A recommendation engine at a media company suffered from mysterious periodic accuracy drops every Monday morning. Investigation revealed a data pipeline bug: weekend engagement events were processed with different timezone handling than weekday events, causing the 'recency' feature to be computed incorrectly for all users on Monday until mid-morning, when enough weekday events had accumulated to dilute the bad data. The bug had existed for four months but was never connected to the Monday accuracy drops because there was no data quality monitoring. Adding Great Expectations assertions to the pipeline surfaced the timezone discrepancy on the very next Monday.
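The general defense against this class of bug is to refuse naive timestamps and normalize everything to UTC before computing time-based features. A hedged sketch, not the company's actual code; the helper name and timestamps are invented for illustration.

```python
from datetime import datetime, timezone, timedelta

def recency_hours(event_ts: datetime, now: datetime) -> float:
    """Hours since an event, computed entirely in UTC."""
    # Refuse naive timestamps instead of guessing a timezone; a naive
    # weekend timestamp is exactly the kind of bug described above.
    if event_ts.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return (now - event_ts.astimezone(timezone.utc)).total_seconds() / 3600

now = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)   # Monday 09:00 UTC
sat = datetime(2024, 1, 6, 21, 0, tzinfo=timezone(timedelta(hours=-5)))
print(round(recency_hours(sat, now), 1))  # Sat 21:00 UTC-5 is Sun 02:00 UTC -> 31.0
```

Raising on naive timestamps converts a silent feature-corruption bug into a loud pipeline failure, which the validation stage can then catch.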
Common Mistakes
- ✕Building data pipelines without data quality validation—silent data corruption is more dangerous than pipeline failures because it has no error signal
- ✕Not making pipelines idempotent: a pipeline that cannot be safely re-run after a failure double-counts or duplicates data on every retry, turning routine recovery into a risky manual operation
- ✕Ignoring schema evolution: production data sources change over time, and pipelines must detect and handle schema changes explicitly rather than silently producing corrupt output
Related Terms
MLOps
MLOps (Machine Learning Operations) applies DevOps principles to ML systems—combining engineering practices for model development, deployment, monitoring, and retraining into a disciplined operational lifecycle.
Feature Store
A feature store is a centralized data platform that computes, stores, and serves machine learning features consistently across both model training and production inference—eliminating training-serving skew and making feature reuse across models efficient.
Continuous Training
Continuous training automatically retrains ML models on fresh data when triggered by drift detection, schedule, or performance degradation—keeping models current with evolving real-world patterns without manual intervention.
Data Drift
Data drift is the gradual change in the statistical properties of model inputs over time, causing a mismatch between the data distribution the model was trained on and what it encounters in production—leading to silent accuracy degradation.
Model Monitoring
Model monitoring continuously tracks the health of deployed ML models—measuring prediction quality, input distributions, and system performance in production to detect degradation before it impacts users or business outcomes.