Universal Config-Driven MLOps Framework
A Generalized Pipeline for Machine Learning Systems
1. Abstract
Machine learning models deployed in production environments suffer from performance degradation over time due to data drift, concept drift, seasonal patterns, and evolving business conditions. Existing MLOps solutions are often tightly coupled to specific frameworks, model types, or infrastructure choices, making them difficult to generalize across diverse real-world use cases.
This document proposes a Universal Config-Driven MLOps Framework — a generalized, configuration-first pipeline architecture designed to support multiple machine learning paradigms including Classification, Regression, Time-Series Forecasting, and Deep Learning models — without requiring code-level changes when switching between tasks.
The framework is built around a central config.yaml file that governs every stage of the ML lifecycle, from data ingestion and versioning through feature engineering, model training, evaluation, explainability, deployment, and continuous monitoring. Orchestration is handled by Apache Airflow, experiment tracking and model registry by MLflow, data versioning by DVC, model explainability by SHAP, and CI/CD automation by GitHub Actions.
The pipeline incorporates automated drift detection, threshold-based retraining triggers, canary and blue-green deployment strategies, and a feature store layer to prevent training-serving skew. All components are observable through a unified monitoring layer tracking prediction drift, feature distribution shifts, and business KPIs in real time.
Key contributions:
- A single YAML configuration that drives the entire ML pipeline without code changes
- A multi-paradigm training strategy supporting sklearn, XGBoost, LightGBM, and PyTorch
- An integrated data versioning layer (DVC) for full reproducibility
- A feature store between feature engineering and training to eliminate training-serving skew
- Automated drift detection and retraining with configurable thresholds
- End-to-end CI/CD pipeline via GitHub Actions for continuous delivery
Keywords: MLOps, Config-Driven Pipelines, Apache Airflow, MLflow, DVC, SHAP, Feature Store, Model Drift Detection, CI/CD, Reproducibility, AutoML, Production ML
2. Introduction
2.1 Motivation
Machine learning models degrade over time due to:
- Data drift — input feature distributions change over time, causing model assumptions to break.
- Concept drift — the relationship between features and the target variable evolves, reducing prediction accuracy.
- Seasonal patterns — cyclical patterns in data make static models unreliable across different periods.
- Business change — evolving business conditions alter the relevance and distribution of key features.
Without monitoring and retraining mechanisms, model performance declines.
2.2 MLOps Concept
MLOps applies DevOps principles to machine learning systems, enabling:
| Capability | Description |
|---|---|
| Automation | Automated training and deployment pipelines |
| Reproducibility | Full experiment tracking and artifact versioning |
| Monitoring | Continuous production performance evaluation |
| Governance | Controlled model promotion with approval gates |
3. System Architecture Overview
Pipeline Architecture Flow
Data Ingestion → Data Versioning (DVC) → Data Validation → Feature Engineering → Feature Store → Data Splitting → Model Training → Evaluation → Explainability (SHAP) → Approval Gate → Model Registry (MLflow) → Deployment → Monitoring → (drift detected → retraining)
Architecture Components
| Component | Function |
|---|---|
| Airflow | Workflow orchestration |
| MLflow | Experiment tracking and model registry |
| Python ML Modules | Training and evaluation logic |
| Monitoring Module | Production performance tracking |
| Explainability Module | Model interpretation via SHAP |
| Feature Store | Versioned offline & online features |
| Data Versioning | Dataset version control (DVC) |
| Approval Gate | Automated quality enforcement |
| Human Review Layer | Manual validation and final approval |
4. Configuration-Driven Pipeline Design
The pipeline is configuration-driven to support multiple ML scenarios without code changes.
Example Configuration
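The original example configuration is not reproduced here; the sketch below is an illustrative config.yaml consistent with the parameters described in later sections. All field names (e.g., promotion_threshold, psi_threshold) are assumptions of this sketch, not a fixed schema.

```yaml
# Illustrative config.yaml — field names are an example schema, not a fixed spec
project: demand_forecasting
problem_type: time_series        # classification | regression | time_series | deep_learning

data:
  data_source: "postgresql://warehouse/sales"
  refresh_frequency: daily
  format: parquet

split:
  strategy: temporal             # random | stratified | temporal | group
  train: 0.70
  validation: 0.15
  test: 0.15

model:
  algorithm: lightgbm
  hyperparameters:
    learning_rate: 0.05
    n_estimators: 500

evaluation:
  primary_metric: wape
  promotion_threshold: 0.12      # gate: block registry push above this WAPE

monitoring:
  psi_threshold: 0.25            # PSI > 0.25 -> trigger retraining DAG
```

Environment overrides (dev/staging/production) would layer on top of this base file.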
Benefits of the config-driven design:
- Flexibility — Switch models, metrics, and hyperparameters without touching pipeline code.
- Reproducibility — Every experiment config is versioned and logged for exact reproduction.
- Standardized Experimentation — Teams follow the same config schema, reducing onboarding friction.
- Multi-Scenario Support — Classification, regression, time-series, and deep learning share the same pipeline.
- Environment Portability — Same config works in dev, staging, and production with environment overrides.
5. Data Ingestion Layer
The ingestion layer collects raw data from external systems into the pipeline.
Supported Data Sources
| Source Type | Examples |
|---|---|
| Databases | PostgreSQL, MySQL, MongoDB |
| Data Warehouses | Snowflake, BigQuery, Redshift |
| Streaming Systems | Apache Kafka, Kinesis |
| Files | CSV, Parquet, JSON, Avro |
Ingestion Configuration Parameters
| Parameter | Description |
|---|---|
| data_source | Input location / connection string |
| refresh_frequency | Ingestion interval (hourly, daily, weekly) |
| schema | Expected data structure for validation |
| format | File format or query template |
| partitioning | Date-based or key-based partitioning strategy |
Data Versioning Layer
Between Data Ingestion and Validation, production pipelines require data versioning to ensure reproducibility and rollback capabilities.
| Tool | Purpose |
|---|---|
| DVC | File-level dataset versioning, pipeline caching, remote storage |
| Delta Lake | Versioned tables with ACID transactions for large-scale data |
6. Data Versioning — DVC
Data Version Control (DVC) sits between Data Ingestion and Data Validation. It tracks dataset versions, pipeline stages, and experiment artifacts, enabling full reproducibility and rollback.
Why Data Versioning?
- Reproducibility — every model version is tied to an exact dataset version; retrain any experiment identically.
- Rollback — if a new dataset degrades model performance, roll back to a previous data version instantly.
- Auditability — track who changed what data and when; critical for regulated environments.
DVC Core Workflow
1. Setup and initialization — run dvc init inside the Git repository and configure remote storage with dvc remote add.
2. Versioning a dataset — dvc add creates a lightweight .dvc pointer file; commit the pointer to Git and dvc push the data to remote storage.
3. Reproducing any data version — git checkout the desired commit to restore the pointer files, then dvc checkout (or dvc pull) to restore the matching data.
DVC Pipeline Stages
DVC can also version the full pipeline — not just data but every transformation step:
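A minimal dvc.yaml sketch of such a pipeline; the stage names, scripts, and paths are illustrative, not prescribed by the framework:

```yaml
# Illustrative dvc.yaml — stages, scripts, and paths are examples
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw.csv
      - src/prepare.py
    outs:
      - data/prepared.csv
  train:
    cmd: python src/train.py
    deps:
      - data/prepared.csv
      - src/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
```

With this in place, dvc repro re-runs only the stages whose dependencies changed, and dvc dag visualizes the dependency graph.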
DVC Key Commands Reference
| Command | Purpose |
|---|---|
| dvc init | Initialize DVC in a Git repository |
| dvc add <file> | Track a data file or directory |
| dvc push | Upload cached data to remote storage |
| dvc pull | Download data for the current Git commit |
| dvc checkout | Restore data files to match current .dvc pointers |
| dvc repro | Re-run pipeline stages that have changed |
| dvc dag | Visualize the pipeline dependency graph |
| dvc metrics show | Display tracked evaluation metrics |
| dvc metrics diff | Compare metrics between commits |
Integration with MLflow
DVC handles data versioning; MLflow handles experiment tracking. Together they form a complete reproducibility stack:
| Concern | Tool | What It Tracks |
|---|---|---|
| Data Version | DVC | Dataset hash, remote path, Git commit |
| Code Version | Git | Source code, configs, .dvc pointer files |
| Experiment | MLflow | Parameters, metrics, model artifacts |
7. Data Validation
Data validation ensures input data quality before training.
| Rule | Description |
|---|---|
| Schema Validation | Required columns exist with correct types |
| Missing Value Detection | Flags incomplete records above threshold |
| Range Checks | Detects values outside expected bounds |
| Drift Detection | Distribution change vs. training baseline |
Drift Detection Methods
| Method | Description |
|---|---|
| Population Stability Index (PSI) | Distribution comparison between datasets |
| KL Divergence | Probability distribution divergence measure |
PSI Interpretation
| PSI Value | Interpretation |
|---|---|
| < 0.1 | Stable |
| 0.1 – 0.25 | Moderate Drift |
| > 0.25 | Significant Drift |
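The PSI computation can be sketched in a few lines of numpy. The choice of 10 quantile-based bins and the epsilon guard against empty bins are common implementation conventions, not part of the PSI definition:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline (training) sample and a current (production) sample."""
    # Quantile-based bin edges derived from the baseline distribution
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    eps = 1e-6  # avoids log(0) / division by zero for empty bins
    p = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    q = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(42)
baseline = rng.normal(0, 1, 10_000)
stable = rng.normal(0, 1, 10_000)     # same distribution -> PSI near 0
shifted = rng.normal(0.8, 1, 10_000)  # mean shift -> large PSI

print(population_stability_index(baseline, stable))   # well below 0.1 (stable)
print(population_stability_index(baseline, shifted))  # above 0.25 (significant drift)
```

In the pipeline, `expected` would be the frozen training baseline and `actual` the latest production window.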
8. Feature Engineering
Feature engineering transforms raw data into model-ready features.
| Technique | Example |
|---|---|
| Normalization | Min-max or Z-score scaling of numeric features |
| Encoding | One-hot, label, or target encoding of categoricals |
| Aggregation | Grouped statistics (mean, std, count) |
| Lag Features | Time-series historical values |
Feature Store
A Feature Store sits between Feature Engineering and Training to prevent training-serving skew and manage features as independent assets.
| Tool | Purpose |
|---|---|
| Feast | Feature store |
| Tecton | Managed feature platform |
| Redis | Online feature serving |
Feature store capabilities:
- Versioned features: Ensures identical features are used across experiments.
- Offline training features: High-throughput batch serving for training models.
- Online inference features: Low-latency serving for real-time predictions.
8. Data Splitting Strategy
Different ML tasks require fundamentally different splitting strategies. Incorrect splitting can lead to data leakage, overfitting, and misleading evaluation results. Choosing the right strategy is critical for building trustworthy models.
Why Data Splitting Matters
- Unbiased Evaluation — test data must never influence training; this ensures honest performance estimates.
- Prevent Data Leakage — information from the future or test set must not leak into training features.
- Generalization — models must perform well on unseen data, not just memorize the training set.
Splitting Strategies by Problem Type
| Strategy | Problem Type | Key Principle | When to Use |
|---|---|---|---|
| Random Split | Classification, Regression | Randomly shuffle and divide data | i.i.d. data with no temporal dependency |
| Stratified Split | Classification (imbalanced) | Preserve class distribution in each split | When target classes are unevenly distributed |
| Temporal Split | Time Series | Split by time — no future data in training | Forecasting, any time-dependent data |
| Group Split | Any (grouped data) | Keep all samples from same group in one split | Patient data, user sessions, store-level data |
| K-Fold CV | Any (small datasets) | Rotate train/test across K partitions | Limited data where every sample matters |
Classification / Regression Split
Standard Train-Val-Test Split
| Dataset | % | Purpose |
|---|---|---|
| Train | 70% | Model learns patterns from this data |
| Validation | 15% | Hyperparameter tuning and early stopping |
| Test | 15% | Final unbiased performance evaluation |
Stratified Splitting
When classes are imbalanced (e.g., 95% negative, 5% positive), random splitting may produce splits with zero minority class samples. Stratified splitting ensures each subset has the same class ratio as the full dataset.
⚠ Warning Never use stratified split for time-series data — it violates temporal ordering.
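A minimal scikit-learn sketch of stratified splitting on a synthetic imbalanced target; the dataset and split ratio are illustrative:

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: roughly 95% negative, 5% positive
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.05).astype(int)

# stratify=y preserves the class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Both splits keep roughly the same positive rate as the full dataset
print(Counter(y_train), Counter(y_test))
```

Without `stratify`, a purely random split of a rare class can easily leave the test set with zero positives.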
Time Series Splitting
Time-series data has temporal dependency — future data cannot be used to predict the past. Random shuffling would cause catastrophic data leakage.
Forward-Chaining (Expanding Window) Validation
Training window expands over time. Each fold adds more historical data while testing on the next unseen period.
Sliding Window Validation
Fixed-size training window slides forward. Useful when older data becomes less relevant.
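Both schemes are available through scikit-learn's TimeSeriesSplit; a small sketch on synthetic indices (the series length and fold counts are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

t = np.arange(24)  # e.g. 24 monthly observations

# Expanding window: the training set grows fold by fold
expanding = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in expanding.split(t):
    # Every training index precedes every test index -> no future leakage
    assert train_idx.max() < test_idx.min()
    print(f"train size={len(train_idx)}  test={test_idx}")

# Sliding window: cap the training size so old data ages out
sliding = TimeSeriesSplit(n_splits=4, max_train_size=8)
for train_idx, _ in sliding.split(t):
    assert len(train_idx) <= 8
```

Note that, unlike KFold, there is no shuffling anywhere: order is the whole point.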
Standard K-Fold
Divides data into K equal folds. Each fold serves as the test set once while the remaining K-1 folds form the training set. Final metric = average across all K folds.
| Variant | Description | Best For |
|---|---|---|
| K-Fold (K=5) | 5 equal partitions, rotate test set | General purpose, medium datasets |
| K-Fold (K=10) | 10 partitions, lower bias, higher variance | Small datasets needing maximum utilization |
| Stratified K-Fold | K-Fold preserving class ratios | Imbalanced classification problems |
| Group K-Fold | K-Fold ensuring groups stay together | Patient/user grouped data |
| Repeated K-Fold | K-Fold repeated N times with different shuffles | Most robust estimate, highest compute cost |
| Leave-One-Out (LOO) | K = number of samples | Very small datasets (<50 samples) |
Data Leakage Sources
| Pitfall | Description | Impact |
|---|---|---|
| Target Leakage | Using features derived from the target variable | Model appears perfect in training, fails in production |
| Train-Test Contamination | Fitting scaler/encoder on entire dataset before splitting | Test metrics are overly optimistic |
| Temporal Leakage | Using future data as features for past predictions | Model learns impossible patterns |
| Group Leakage | Same entity (patient/user) in both train and test | Model memorizes entities instead of learning patterns |
How to Prevent Leakage
- Split first, preprocess after — always split data before any transformation (scaling, encoding, imputation)
- Use pipelines — Scikit-learn Pipelines enforce correct fit/transform ordering
- Validate temporally — for any time-dependent data, always use time-based splits
- Check feature origins — verify no feature is computed using information from the target
- Group-aware splitting — if data has natural groups, use GroupKFold or GroupShuffleSplit
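The "split first, use pipelines" rules above can be sketched with scikit-learn on synthetic data; the scaler is fitted inside the pipeline on training data only, so test statistics can never contaminate it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Split FIRST — the scaler must never see test-set statistics
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The Pipeline fits the scaler on training data only, then applies the
# already-fitted transform at predict time
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

The leaky anti-pattern would be `StandardScaler().fit(X)` on the full dataset before splitting.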
Choosing the Right Split Ratio
| Dataset Size | Recommended Split | Reasoning |
|---|---|---|
| < 1,000 samples | K-Fold CV (K=5 or 10) | Every sample is valuable — maximize training data |
| 1,000 – 100,000 | 70 / 15 / 15 | Standard split — enough data for reliable test metrics |
| 100,000 – 1M | 80 / 10 / 10 | Large test set not needed — more data for training helps |
| > 1M samples | 98 / 1 / 1 | Even 1% is ~10K+ samples — sufficient for evaluation |
9. Model Training
The training layer supports multiple algorithm types.
Supported Algorithms
| Problem Type | Algorithms |
|---|---|
| Classification | Logistic Regression, Random Forest, SVM, XGBoost |
| Regression | Linear Regression, XGBoost, LightGBM |
| Time Series | ARIMA, SARIMAX, Prophet |
| Deep Learning | CNN, LSTM, Transformer |
How Does the Model Know Its Error? — Loss Functions
During training, the model uses a loss function (also called cost function or objective function) to measure how far its predictions are from the actual values. The optimizer then adjusts model weights to minimize this loss.
The Training Loop:
1. Forward pass — compute predictions from the current weights
2. Compute loss — measure prediction error against the actual values
3. Backward pass — compute gradients of the loss with respect to the weights
4. Update weights — the optimizer applies a gradient step
↻ Repeat for every batch × every epoch until loss converges
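The training loop can be sketched as plain-numpy gradient descent on a synthetic linear-regression problem; the learning rate, epoch count, and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(256, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=256)

w = np.zeros(3)          # initial weights
learning_rate = 0.1
losses = []

for epoch in range(50):
    y_hat = X @ w                              # 1. forward pass
    loss = np.mean((y - y_hat) ** 2)           # 2. compute loss (MSE)
    grad = -2 * X.T @ (y - y_hat) / len(y)     # 3. backward pass (gradient)
    w -= learning_rate * grad                  # 4. optimizer step
    losses.append(loss)

print(losses[0], losses[-1])  # loss shrinks toward the noise floor
```

Mini-batching, momentum, and adaptive optimizers (Adam, RMSProp) are refinements of exactly this loop.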
Loss Functions by Problem Type
| Problem Type | Loss Function | Formula / Description | When to Use |
|---|---|---|---|
| Classification | Binary Cross-Entropy | $\mathcal{L} = -\bigl[y \cdot \log(\hat{y}) + (1-y) \cdot \log(1-\hat{y})\bigr]$ | Binary classification (2 classes) |
| | Categorical Cross-Entropy | $\mathcal{L} = -\sum_{i=1}^{C} y_i \cdot \log(\hat{y}_i)$ | Multi-class classification |
| | Focal Loss | $\mathcal{L} = -\alpha_t (1 - p_t)^\gamma \cdot \log(p_t)$ | Severely imbalanced classes |
| Regression | Mean Squared Error (MSE) | $\text{MSE} = \dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | Default regression — penalizes large errors heavily |
| | Mean Absolute Error (MAE) | $\text{MAE} = \dfrac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$ | Robust to outliers |
| | Huber Loss | $L_\delta = \begin{cases} \frac{1}{2}(y-\hat{y})^2 & \lvert y-\hat{y}\rvert \le \delta \\ \delta\lvert y-\hat{y}\rvert - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$ | Best of both — smooth + robust |
| Time Series | MSE / RMSE | $\text{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | Standard time-series forecasting |
| | Log-Cosh Loss | $\mathcal{L} = \sum_{i=1}^{n} \log\bigl(\cosh(\hat{y}_i - y_i)\bigr)$ | Smooth approximation to MAE, handles outliers |
| Deep Learning | Cross-Entropy + Softmax | Softmax output → CE loss | Neural network classifiers |
| | Contrastive / Triplet Loss | Learns similarity between embeddings | Siamese networks, embedding learning |
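A few of these losses implemented directly in numpy, with hand-checkable values; the epsilon clip in the cross-entropy is a standard numerical guard against log(0):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard against log(0)
    return float(np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))))

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

def huber(y, y_hat, delta=1.0):
    err = np.abs(y - y_hat)
    quadratic = 0.5 * err ** 2               # small errors: behaves like MSE
    linear = delta * err - 0.5 * delta ** 2  # large errors: behaves like MAE
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

y_cls = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
print(binary_cross_entropy(y_cls, p))   # ≈ 0.105: confident, correct predictions

y_reg = np.array([0.0, 0.0])
y_hat = np.array([0.5, 3.0])            # second prediction is an outlier
print(mse(y_reg, y_hat))                # (0.25 + 9) / 2 = 4.625
print(huber(y_reg, y_hat))              # (0.125 + 2.5) / 2 = 1.3125 — outlier damped
```

The MSE/Huber comparison shows why Huber is the robust middle ground: the outlier's penalty grows linearly instead of quadratically.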
Loss Convergence During Training
The optimizer's goal is to reach the minimum loss. Training loss should decrease over epochs. If validation loss starts rising while training loss keeps dropping, the model is overfitting.
Regularization — Preventing Overfitting
| Technique | How It Works | Effect on Loss |
|---|---|---|
| L1 (Lasso) | $\mathcal{L}_{\text{total}} = \mathcal{L} + \lambda \sum\lvert w_i\rvert$ | Drives some weights to zero → feature selection |
| L2 (Ridge) | $\mathcal{L}_{\text{total}} = \mathcal{L} + \lambda \sum w_i^2$ | Shrinks all weights → prevents any single feature from dominating |
| Dropout | Randomly disables neurons during training | Forces network to learn redundant representations |
| Batch Normalization | Normalizes layer inputs | Stabilizes training, allows higher learning rates |
Learning Rate Schedules
The learning rate controls how big each weight update is. Too high → unstable training. Too low → slow convergence.
| Schedule | Behavior | Use Case |
|---|---|---|
| Constant | Fixed LR throughout | Simple models, quick experiments |
| Step Decay | Reduce LR by factor every N epochs | CNNs, standard deep learning |
| Cosine Annealing | LR follows cosine curve from high to low | Transformers, modern architectures |
| ReduceOnPlateau | Reduce LR when validation loss stops improving | Adaptive — most practical choice |
| Warmup + Decay | Slowly increase LR, then decay | Large models, Transformers |
Early Stopping
Monitors validation loss and stops training when it stops improving — prevents overfitting without manual epoch tuning.
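Early stopping reduces to a patience counter over the validation-loss history; a minimal sketch (the patience and min_delta defaults are illustrative):

```python
def early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return the epoch at which to stop: validation loss has not improved
    by more than min_delta for `patience` consecutive epochs."""
    best, bad_epochs = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, bad_epochs = loss, 0  # new best -> reset the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch            # stop here; restore best weights
    return len(val_losses) - 1          # patience never exhausted

# Validation loss improves, then rises — the classic overfitting curve
history = [1.0, 0.7, 0.5, 0.45, 0.46, 0.48, 0.55]
print(early_stopping(history, patience=3))  # stops at epoch 6
```

In practice the weights from the best epoch (here, epoch 3) are restored, not the final ones.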
Training Parameters
| Parameter | Description |
|---|---|
| learning_rate | Optimizer step size — controls convergence speed |
| batch_size | Samples per gradient update iteration |
| epochs | Full passes through the training dataset |
| optimizer | Weight update algorithm (SGD, Adam, RMSProp) |
Common Optimizers
| Optimizer | Use Case | Pros |
|---|---|---|
| SGD | Simple gradient descent | Simple, well-understood convergence |
| Adam | Adaptive learning rate | Fast convergence, handles sparse gradients |
| RMSProp | Recurrent neural networks | Handles non-stationary objectives well |
10. Model Evaluation
Evaluation metrics depend on problem type.
Classification Metrics
| Metric | Formula / Description |
|---|---|
| Accuracy | $\text{Accuracy} = \dfrac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$ |
| Precision | $\text{Precision} = \dfrac{\text{TP}}{\text{TP} + \text{FP}}$ |
| Recall | $\text{Recall} = \dfrac{\text{TP}}{\text{TP} + \text{FN}}$ |
| F1 Score | $F_1 = 2 \cdot \dfrac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ |
| ROC-AUC | $\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}^{-1}(x))\, dx$ |
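These formulas can be verified against scikit-learn on a small hand-countable example (TP=3, FP=2, TN=4, FN=1):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]   # TP=3, FN=1, TN=4, FP=2

print(accuracy_score(y_true, y_pred))   # (3 + 4) / 10 = 0.7
print(precision_score(y_true, y_pred))  # 3 / (3 + 2) = 0.6
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))         # 2 * 0.6 * 0.75 / 1.35 ≈ 0.667
```

Note how precision and recall pull in different directions: the two false positives hurt precision, the one false negative hurts recall, and F1 balances both.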
Regression Metrics
| Metric | Meaning |
|---|---|
| MAE | $\text{MAE} = \dfrac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$ |
| RMSE | $\text{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ |
| R² | $R^2 = 1 - \dfrac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$ |
| MAPE | $\text{MAPE} = \dfrac{100\%}{n}\sum_{i=1}^{n}\left\lvert\dfrac{y_i - \hat{y}_i}{y_i}\right\rvert$ |
Time-Series Metrics
| Metric | Purpose |
|---|---|
| WAPE | Business-level forecast accuracy (weighted) |
| MAPE | Percentage forecasting error |
| MASE | Benchmark comparison against naive forecast |
| Rolling WAPE | Recent/sliding window performance monitoring |
WAPE — Weighted Absolute Percentage Error
WAPE calculates aggregate absolute error as a proportion of total actual demand — robust to zero values and the standard business KPI in retail/supply-chain forecasting:
$$\text{WAPE} = \dfrac{\sum_{i=1}^{n}|y_i - \hat{y}_i|}{\sum_{i=1}^{n}|y_i|}$$
Bias / Mean Error (ME)
ME captures the direction of error — whether the model is systematically over- or under-forecasting:
$$\text{ME} = \dfrac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)$$
A model with low RMSE but high bias is dangerous in inventory planning.
| ME Value | Meaning | Action |
|---|---|---|
| ME > 0 | Over-forecasting | Check feature scaling / target leakage |
| ME < 0 | Under-forecasting | Review trend component or differencing |
| ME ≈ 0 | Unbiased | Healthy — proceed to deployment gate |
MASE — Mean Absolute Scaled Error
Compares model MAE against a naïve seasonal baseline. MASE < 1 means the model outperforms a simple persistence forecast:
$$\text{MASE} = \dfrac{\text{MAE}_{\text{model}}}{\text{MAE}_{\text{naive}}}$$
Rolling WAPE
WAPE computed over a sliding window (e.g., 4-week rolling). Reveals whether forecast quality is stable or degrading in specific periods.
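The three business metrics above can be sketched in numpy; the season=1 persistence baseline in MASE is one common choice, and the series is a toy example:

```python
import numpy as np

def wape(y, y_hat):
    # Aggregate absolute error over total actual demand — robust to zeros
    return float(np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y)))

def mean_error(y, y_hat):
    # Positive -> over-forecasting, negative -> under-forecasting
    return float(np.mean(y_hat - y))

def mase(y, y_hat, season=1):
    # Scale model MAE by the MAE of a naive (seasonal persistence) forecast
    naive_mae = np.mean(np.abs(y[season:] - y[:-season]))
    return float(np.mean(np.abs(y - y_hat)) / naive_mae)

y = np.array([100.0, 120.0, 80.0, 100.0])
y_hat = np.array([110.0, 115.0, 90.0, 105.0])

print(wape(y, y_hat))        # 30 / 400 = 0.075
print(mean_error(y, y_hat))  # +5.0 -> mild over-forecasting
print(mase(y, y_hat))        # < 1 -> beats the naive forecast
```

A rolling WAPE is just `wape` applied to a sliding window of recent actuals and predictions.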
Complete Evaluation Stack — Which Metric at Which Stage
| Stage | Metric | Category | Why Needed |
|---|---|---|---|
| Training | MAE | Absolute Error | Pure error magnitude in original units |
| | RMSE | Squared Error | Penalises large errors / spikes |
| | WAPE | Percentage Error | Business KPI — robust to zero values |
| Validation | Rolling WAPE | Stability | Detects time-period degradation |
| | RMSE (val) | Generalisation | Overfitting check |
| | Mean Error | Bias | Detects over/under-prediction direction |
| Supplementary | MAPE | Percentage Error | Per-observation % — business-friendly |
| | MASE | Baseline Comparison | Validates improvement over naïve forecast |
11. Explainability Layer
Explainability improves model transparency and builds stakeholder trust.
SHAP (SHapley Additive exPlanations)
SHAP values explain the contribution of each feature to a prediction:
$$f(x) = \varphi_0 + \sum_{i=1}^{M} \varphi_i$$
Where: $\varphi_0$ = base value (average model prediction), $\varphi_i$ = SHAP contribution of feature $i$
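A toy check of the additivity property, using the closed form that is exact for linear models with independent features (the weights, background data, and instance are synthetic; no SHAP library is needed to verify this identity):

```python
import numpy as np

rng = np.random.default_rng(3)
X_background = rng.normal(loc=[1.0, -2.0, 0.5], scale=1.0, size=(5000, 3))
w, b = np.array([3.0, -1.0, 2.0]), 10.0

def f(X):
    # A linear model: f(x) = b + w · x
    return X @ w + b

x = np.array([2.0, 0.0, 1.0])  # instance to explain

# Closed form for linear models with independent features:
phi_0 = f(X_background.mean(axis=0))        # base value = average prediction
phi = w * (x - X_background.mean(axis=0))   # per-feature contributions

# Additivity: base value + contributions reconstruct the prediction exactly
assert np.isclose(phi_0 + phi.sum(), f(x))
print(phi)  # how far each feature pushes the prediction from the base value
```

For non-linear models (trees, neural networks) the contributions require the full Shapley computation, but the additivity identity above always holds by construction.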
SHAP Feature Importance
| Plot Type | Purpose |
|---|---|
| Summary Plot | Global feature importance across all predictions |
| Waterfall Plot | Step-by-step single prediction explanation |
| Force Plot | Feature push/pull visualization for one prediction |
Explainability Types
| Type | Visualization | Use Case |
|---|---|---|
| Global | Summary plot, feature ranking | Model validation, governance |
| Local | Waterfall plot, force plot | Single prediction explanation |
Why SHAP?
- Explains individual predictions with mathematical guarantees (Shapley axioms)
- Detects feature leakage during validation — unexpected top features signal data issues
- Builds stakeholder and regulatory trust
- Enables drift detection via importance shifts over time
⚠ SHAP is NOT Causal
SHAP explains how the model uses features — not what happens if you change a real-world feature. Causal inference requires A/B testing, DAGs, propensity score matching, or do-calculus.
SHAP Drift Detection Process
- Compute SHAP on training baseline and store importance scores
- Periodically compute SHAP on current production data
- Compare distributions using KL divergence or PSI — if importance ranking shifts significantly, trigger investigation
12. Approval Gate & Human-in-the-Loop
The Approval Gate is a critical control checkpoint between model evaluation and model registry. It enforces quality standards and optionally routes the decision to a human reviewer.
12.1 Purpose
- Quality Guard — prevent underperforming models from reaching production.
- Compliance — enforce business and regulatory compliance checks.
- Human Oversight — enable expert review for high-stakes decisions.
12.2 Automated Gate Checks
| Check | Condition |
|---|---|
| Primary Metric Threshold | New model metric > configured minimum |
| Regression Guard | No significant drop in secondary metrics |
| SHAP Drift Check | Feature importance stable vs. previous model |
| Data Quality Flag | No critical validation warnings in pipeline |
| Fairness Check | Bias metrics within acceptable bounds |
12.3 Human-in-the-Loop (HITL) Workflow
- Approve → model promoted to Staging/Production via MLflow
- Reject → model archived, alert triggered, team notified
- Defer → retraining triggered with reviewer comments
Reviewer Inputs Available
| Input | Description |
|---|---|
| Model Evaluation Report | All metrics vs. baseline comparison |
| SHAP Summary Plot | Global feature importance visualization |
| Drift Analysis Report | PSI and KL divergence results |
| Champion Comparison | Current production model vs. challenger |
| Audit Log | Previous approval decisions and context |
Escalation Rules (Mandatory Human Review)
- Model performance drop exceeds 5% vs. current production model
- SHAP feature importance shift detected above threshold
- First deployment of a new model type
- Regulatory or compliance-flagged domain (e.g., finance, healthcare)
- PSI > 0.25 detected in recent data validation
HITL Tools and Interfaces
| Tool | Role |
|---|---|
| MLflow UI | Model comparison and metric visualization |
| Airflow UI | Pipeline status and manual trigger controls |
| Custom Review Dashboard | Approval/reject/defer with comments |
| Email / Slack Alerts | Review request notifications |
| Audit Database | Logging all human decisions with timestamps |
Audit & Governance Log Fields
| Field | Description |
|---|---|
| model_version | MLflow run ID of the candidate model |
| reviewer_id | Username of the human reviewer |
| decision | approved / rejected / deferred |
| decision_timestamp | UTC timestamp |
| notes | Reviewer comments or auto summary |
| gate_checks_passed | List of automated checks and results |
13. Experiment Tracking with MLflow
MLflow records all experiment details for reproducibility.
| Category | Examples |
|---|---|
| Parameters | learning rate, batch size, epochs |
| Metrics | accuracy, RMSE, WAPE |
| Artifacts | SHAP plots, confusion matrices |
| Models | Serialized model files (pickle, ONNX) |
14. Model Registry
MLflow Registry manages the model lifecycle through stages.
| Stage | Description |
|---|---|
| None | Newly logged model — awaiting evaluation |
| Staging | Candidate model under review |
| Production | Active deployed model serving predictions |
| Archived | Deprecated / replaced model |
Promotion Criteria
- New model metric > current production metric
- No regression in secondary metrics
- No SHAP drift detected
- Approval Gate passed (automated + human review)
⚠ Critical: MLflow Tags are Always Strings
mlflow.set_tag() stores values as strings internally. Always pass "True", not True — otherwise the string comparison in Airflow's approval gate will fail silently ("True" != True).
A simple WAPE_new < WAPE_old check is insufficient. Unstable models can pass on one metric while regressing on others. The safe promotion gate requires:
- Meaningful delta margin (e.g., 0.5%) to prevent noise-driven over-promotion
- No RMSE increase
- Bias within tolerance
- No SHAP anomaly
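One way to sketch that gate as a pure function; the function name, metric keys, and threshold defaults are all illustrative, not part of MLflow or Airflow:

```python
def promote(new, old, wape_margin=0.005, bias_tol=2.0, shap_shift_max=0.25):
    """Champion/challenger promotion decision.

    `new` and `old` are metric dicts; all thresholds are illustrative
    defaults that would come from config.yaml in the framework.
    """
    checks = {
        # WAPE must improve by a meaningful margin, not just by noise
        "wape_improved": new["wape"] <= old["wape"] - wape_margin,
        "no_rmse_regression": new["rmse"] <= old["rmse"],
        "bias_in_tolerance": abs(new["bias"]) <= bias_tol,
        "shap_stable": new["shap_shift"] <= shap_shift_max,
    }
    return all(checks.values()), checks

champion = {"wape": 0.120, "rmse": 14.0, "bias": 0.4, "shap_shift": 0.0}
challenger = {"wape": 0.112, "rmse": 13.5, "bias": 0.8, "shap_shift": 0.10}

ok, detail = promote(challenger, champion)
print(ok, detail)  # True — WAPE improved beyond the margin, no regressions
```

Returning the per-check detail dict (not just the boolean) is what makes the audit log entry useful to a human reviewer.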
| Tool | Role with MAPE/WAPE | Role with SHAP |
|---|---|---|
| Airflow | Schedules metric computation, gates pipeline on WAPE threshold, triggers retraining | Sequences SHAP task, passes importance via XCom for drift comparison |
| MLflow | Stores MAPE & WAPE per run, enables cross-experiment comparison, tags approval status | Stores SHAP plots as artifacts; baseline importance vector for drift detection |
15. Airflow Pipeline Orchestration
Airflow manages pipeline execution through Directed Acyclic Graphs (DAGs).
Example DAG
A typical training DAG wires the tasks in the breakdown below into a linear dependency chain, from data_ingestion through deployment.
Scheduling
| Pipeline | Frequency |
|---|---|
| Training Pipeline | Weekly |
| Monitoring Pipeline | Daily |
DAG Task Breakdown
| # | Task ID | Responsibility |
|---|---|---|
| 1 | data_ingestion | Pull from source systems, validate schema |
| 2 | data_validation | Nulls, distributions, outlier checks, PSI drift |
| 3 | feature_engineering | Lag features, rolling stats, encoding |
| 4 | data_splitting | Split into Train (70%), Validation (15%), Test (15%), handling temporal ordering |
| 5 | model_training | Run training with configured hyperparameters |
| 6 | model_evaluation | Compute MAE · RMSE · WAPE · Bias · MASE, log to MLflow |
| 7 | shap_analysis | Generate SHAP plots, log artifacts to MLflow |
| 8 | human_review | Data Scientist reviews MLflow run, sets human_approved="True" tag |
| 9 | approval_gate | Reads human_approved + metric thresholds — halts DAG if any check fails |
| 10 | model_registration | Push to MLflow Model Registry → Staging → Production |
| 11 | deployment | Serve model to production endpoint via FastAPI |
✔ Airflow Guarantees: Cron scheduling · Dependency enforcement · Automatic retry logic · Retraining automation via TriggerDagRunOperator · Email alerts on failure
Airflow does not compute metrics itself — it schedules and sequences the tasks that produce them:
| Airflow Task | Metric Produced | What Airflow Does With It |
|---|---|---|
| evaluate_model_task | MAPE, WAPE, RMSE, Bias | Runs evaluation script, pushes results via XCom to the next task |
| approval_gate_task | WAPE threshold check | Reads WAPE from XCom — if WAPE > threshold, marks the task failed and halts the DAG; pipeline stops before registration |
| shap_analysis_task | SHAP importance scores | Runs SHAP explainer, saves plots as artifacts, passes importance vector to monitoring sensor |
| monitoring_sensor_task | Rolling WAPE, MAPE trend, SHAP shift | Runs on schedule (e.g., daily at 02:00); compares live metrics against baseline — triggers retraining DAG if any threshold breached |
| retrain_trigger_task | MAPE rising / SHAP drift | Uses TriggerDagRunOperator to fire the full training DAG — no manual step needed |
📊 Airflow Role Summary: Airflow is the scheduler and gatekeeper. It decides when metrics are computed, whether the pipeline proceeds, and if retraining fires. The metrics are the decision signals; Airflow acts on them.
16. Monitoring System
Production monitoring continuously tracks model performance.
| Metric | Purpose |
|---|---|
| Rolling WAPE | Forecast degradation tracking |
| Accuracy Trend | Classification drift detection |
| RMSE Trend | Regression model degradation |
| Bias | Systematic prediction error detection |
| Prediction Drift | Production prediction distribution vs. training-time prediction distribution |
Drift Types
- Data drift — changes in the input statistical distributions.
- Feature importance drift — changes in which features the model relies on (SHAP shift).
- Prediction drift — the model's output distribution changing relative to the training baseline.
- Concept drift — the core relationship between inputs and targets shifts.
Monitoring Signals — Thresholds & Actions
| Signal | Threshold Breach | Action |
|---|---|---|
| WAPE (rolling) | ↑ Rising trend | Trigger retraining DAG |
| RMSE | > trained baseline | Investigate data quality / anomalies |
| Bias (ME) | Persistently ≠ 0 | Retrain with updated exogenous features |
| Data Drift (PSI) | PSI / KL divergence ↑ | Re-validate features, rebuild the data pipeline |
| SHAP Shift | Importance rank change | Pre-drift warning — act proactively |
17. Drift Detection
Drift detection monitors changes in input data distributions over time.
| Method | Purpose |
|---|---|
| PSI | Distribution shift between training and production data |
| KL Divergence | Statistical divergence between probability distributions |
| SHAP Shift | Feature importance change over time |
Example Trigger: PSI > 0.25 → automatic model retraining initiated via Airflow
SHAP + MAPE Drift Matrix
Use this matrix to diagnose what type of degradation is happening based on two independent signals:
| Scenario | MAPE/WAPE | SHAP Status | Meaning | Action |
|---|---|---|---|---|
| Model Aging | ↑ Rising | Stable | Data drifted, model hasn't seen new patterns | Retrain on recent data |
| Concept Drift | ↑ Rising | Changed | Underlying patterns shifted fundamentally | Investigate features, retrain with new architecture |
| Early Warning | Stable | Changed | Feature behaviour shifting before metrics degrade | Act proactively — investigate now, retrain soon |
| Healthy | Stable | Stable | Model operating within acceptable bounds | No action needed — continue monitoring |
🔍 Airflow fires retraining DAG if ANY condition is true:
- Rolling WAPE trend rising beyond threshold
- RMSE > trained baseline threshold
- Bias (Mean Error) persistently ≠ 0
- Data drift detected (PSI / KL divergence)
- SHAP feature importance distribution shifted
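The any-condition trigger above can be sketched as a pure function; the signal and threshold names are illustrative stand-ins for config.yaml entries and the monitoring task's XCom payload:

```python
def should_retrain(signals: dict, config: dict) -> bool:
    """Fire the retraining DAG if ANY monitored signal breaches its threshold."""
    triggers = [
        signals["rolling_wape_trend"] > config["wape_trend_max"],
        signals["rmse"] > config["rmse_baseline"],
        abs(signals["bias"]) > config["bias_tol"],
        signals["psi"] > config["psi_max"],          # data drift
        signals["shap_shift"] > config["shap_max"],  # importance drift
    ]
    return any(triggers)

config = {"wape_trend_max": 0.02, "rmse_baseline": 15.0,
          "bias_tol": 1.0, "psi_max": 0.25, "shap_max": 0.3}

healthy = {"rolling_wape_trend": 0.0, "rmse": 12.0, "bias": 0.1,
           "psi": 0.08, "shap_shift": 0.05}
drifted = dict(healthy, psi=0.31)  # only PSI breached — still retrains

print(should_retrain(healthy, config))  # False
print(should_retrain(drifted, config))  # True
```

The `any()` semantics matter: a single breached signal is enough, which is why the SHAP shift acts as an early-warning trigger even while accuracy metrics look healthy.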
18. Automated Retraining
Retraining occurs automatically when monitoring detects degradation.
Retraining Triggers
- Forecast error (rolling WAPE) trending upward beyond threshold
- Classification accuracy below configured threshold
- Regression error (RMSE) rising above baseline
- Data distribution shift detected (PSI > 0.25)
- SHAP feature importance changed significantly — model assumptions may be invalid
Airflow automatically triggers the full training DAG when any trigger fires.
19. Deployment Strategies
| Method | Example | Best For |
|---|---|---|
| REST API | FastAPI, Flask | Real-time predictions |
| Batch Inference | Scheduled predictions | Large-scale offline scoring |
| Streaming | Kafka pipelines | Low-latency event-driven predictions |
Infrastructure
| Component | Function |
|---|---|
| Docker | Containerized model serving |
| Kubernetes | Scalable orchestration |
| Cloud Services | AWS, GCP, Azure managed ML |
20. Final Architecture View
The complete end-to-end framework.
Feature Store → Airflow pipeline (Split, Train, Eval, Explain, Approve, Register) → Registry → Serving
Managed centrally by MLflow Tracking.
Project Directory Structure
Compatible with: Regression · Time-series · Classification · Tree-based · Deep learning
Framework Benefits
- Standardized Policy: Consistent evaluation across all projects
- Explainable AI: SHAP for every model in registry
- Automated Lifecycle: Zero manual steps end-to-end (except optional HITL)
- Reduced Risk: Dual-metric threshold gating limits bad deployments
- Reproducibility: Every run fully tracked in MLflow
21. Enterprise Foundations: CI & Infrastructure
A machine learning model running on a laptop is a script. A model running in an automated, scalable, and secure environment is a product. This section defines the underlying enterprise architecture required to support the Universal MLOps Framework.
21.1 Continuous Integration (CI)
While Airflow handles Continuous Training (CT) of the models, CI tools handle the testing and validation of the Python code and DAG configurations.
| Component | Technology | Purpose |
|---|---|---|
| Source Control | Git / GitHub | Version control for train.py, config files, and DAGs. |
| CI Pipeline | GitHub Actions | Code linting (Flake8), type checking (mypy), and Unit Tests (PyTest) trigger on every pull request. |
| CD Pipeline | Pending | Continuous Delivery and automated deployment remain to be explored. |
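As an example of the kind of unit test the CI pipeline runs on every pull request, here is a PyTest-style check that the central config declares the fields the DAGs depend on. The field names and allowed values below are illustrative assumptions, not the framework's actual schema:

```python
# test_config.py -- a PyTest-style unit test; run with `pytest test_config.py`.
REQUIRED_KEYS = {"task_type", "model", "metrics", "retrain_thresholds"}
VALID_TASKS = {"classification", "regression", "forecasting", "deep_learning"}

def validate_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - cfg.keys())]
    if cfg.get("task_type") not in VALID_TASKS:
        problems.append(f"unknown task_type: {cfg.get('task_type')!r}")
    return problems

def test_valid_config_passes():
    cfg = {"task_type": "forecasting", "model": "sarimax",
           "metrics": ["wape", "rmse"], "retrain_thresholds": {"psi": 0.25}}
    assert validate_config(cfg) == []

def test_missing_key_is_reported():
    assert "missing key: metrics" in validate_config({"task_type": "regression"})
```

Failing such a test blocks the pull request before a broken config can ever reach an Airflow DAG.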
21.2 Infrastructure Manageability
Consistency between development, staging, and production environments is non-negotiable.
Containerization (Docker)
Every training job and inference endpoint runs in an isolated, identical Docker container. "It works on my machine" is eliminated.
Future Expansions
Technologies like Kubernetes / Kubeflow and Terraform (IaC) for large-scale cluster management and automated provisioning remain to be explored.
📊 Business Value ROI:
By implementing these foundations alongside the CT framework, organizations can: (1) Reduce model time-to-market from months to hours. (2) Prevent million-dollar forecasting disasters via automated safety gates.
22. End-to-End Example: Ad Sales Demand Forecasting
Let’s walk through the entire MLOps pipeline with a real-world example — forecasting weekly ad sales demand. We’ll see exactly what happens at each stage, with Airflow orchestrating the pipeline and MLflow tracking everything.
📊 Business Problem: A retail company wants to predict ad-driven product demand for the next 7 days to optimize inventory and ad spend allocation.
📊 Data: 2 years of daily sales data with features: price, ad_spend, competitor_price, promotions, holidays, day_of_week.
📊 Models: ARIMA, SARIMAX, and LSTM are trained and compared.
Stage 1: Airflow DAG Definition
Airflow orchestrates the entire pipeline as a DAG. Every Monday at 2 AM, it triggers the full training pipeline.
Stage 2: Data Ingestion
📊 What Happens:
- Pull latest sales data from PostgreSQL
- Fetch ad spend data from marketing API
- Load competitor pricing from CSV export
- Merge all sources into a unified dataset
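A sketch of the merge step with pandas. The inline DataFrames stand in for the three real sources (PostgreSQL, the marketing API, and the CSV export), and the join key is an assumption for illustration:

```python
import pandas as pd

# Stand-ins for the three sources; in production these would be read from
# PostgreSQL, the marketing API, and a CSV export respectively.
sales = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"], "units_sold": [120, 95]})
ad_spend = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"], "ad_spend": [500.0, 320.0]})
competitor = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"],
                           "competitor_price": [9.99, 10.49]})

# Left-join everything onto the sales calendar so no sales row is dropped.
dataset = (sales
           .merge(ad_spend, on="date", how="left")
           .merge(competitor, on="date", how="left"))
```

Left joins keep the sales table authoritative: a missing day of ad spend or competitor pricing surfaces as NaN for the validation stage to flag, rather than silently shrinking the dataset.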
Stage 3: Data Validation
✅ Result: PSI = 0.08 (stable), 0 missing values, schema valid.
Stage 4: Feature Engineering
Stage 5: Model Training + MLflow Tracking
All three models are trained and every parameter, metric, and artifact is logged to MLflow.
Stage 6: Model Evaluation — MLflow Comparison
| Model | RMSE | WAPE | MAPE | Winner? |
|---|---|---|---|---|
| ARIMA | 142.3 | 12.4% | 14.1% | |
| SARIMAX | 98.7 | 7.2% | 8.5% | 📊 Best |
| LSTM | 115.6 | 9.8% | 11.2% | |
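The three comparison metrics are computed from held-out actuals and forecasts; a stdlib-only sketch of their definitions (the toy series below is illustrative):

```python
import math

def rmse(actual, forecast):
    """Root mean squared error: penalizes large misses quadratically."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def wape(actual, forecast):
    """Weighted APE: total absolute error over total actual volume."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(abs(a) for a in actual)

def mape(actual, forecast):
    """Mean APE: average per-point percentage error (sensitive to small actuals)."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

actual   = [100.0, 200.0, 150.0]
forecast = [110.0, 190.0, 150.0]
assert abs(wape(actual, forecast) - 20 / 450) < 1e-12
```

WAPE weighting by volume is why the framework gates on it for demand forecasting: a 10-unit miss on a 200-unit day counts for less than the same miss on a 100-unit day under MAPE, but identically under WAPE.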
Stage 7: SHAP Explainability
Top Features: ad_spend, price, is_holiday, lag_7
Stage 8: Approval Gate + Human Review
✅ Automated Checks:
- WAPE 7.2% < threshold 10% → PASS
- RMSE 98.7 < baseline 120 → PASS
- PSI 0.08 < 0.25 → PASS
- SHAP drift: none detected → PASS
- Fairness: N/A (forecasting) → PASS
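The automated half of the gate is a conjunction: every check must pass before human review is even requested. A minimal sketch of that logic, with thresholds mirroring the values above (function and key names are illustrative):

```python
def approval_gate(metrics: dict, limits: dict) -> dict:
    """Evaluate every automated check; promotion requires all of them to pass."""
    results = {
        "wape":       metrics["wape"] < limits["wape_max"],
        "rmse":       metrics["rmse"] < limits["rmse_baseline"],
        "psi":        metrics["psi"] < limits["psi_max"],
        "shap_drift": not metrics["shap_drift_detected"],
    }
    results["all_passed"] = all(results.values())
    return results

gate = approval_gate(
    {"wape": 0.072, "rmse": 98.7, "psi": 0.08, "shap_drift_detected": False},
    {"wape_max": 0.10, "rmse_baseline": 120.0, "psi_max": 0.25},
)
assert gate["all_passed"]  # candidate proceeds to human review
```

Returning per-check booleans rather than a single flag lets the review UI show the human approver exactly which gates passed and by how much.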
📊 Human Review:
- Reviewer: data-science-lead
- Reviewed SHAP plots → features look reasonable
- Compared with current production model → 15% improvement
- Decision: APPROVED
- Notes: "SARIMAX captures weekly seasonality well"
Stage 9: Model Registry — MLflow
Stage 10: Deployment
Stage 11: Monitoring + Retraining Trigger
⚠ Week 6: Rolling WAPE hits 11.2% → Airflow automatically triggers retraining pipeline → New SARIMAX trained on fresh data → Approval Gate → Deployed.
Complete Airflow DAG Wiring
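The stage ordering of the full pipeline can be sketched with the standard library's graphlib; in a real deployment each node would be an Airflow task (the dependency map below mirrors the stages of this example, with operator details omitted):

```python
from graphlib import TopologicalSorter

# Each key lists the stages it depends on (the Stage 1-10 wiring above).
dag = {
    "ingest":   set(),
    "validate": {"ingest"},
    "features": {"validate"},
    "train":    {"features"},
    "evaluate": {"train"},
    "explain":  {"evaluate"},
    "approve":  {"evaluate", "explain"},
    "register": {"approve"},
    "deploy":   {"register"},
    "monitor":  {"deploy"},
}

order = list(TopologicalSorter(dag).static_order())
assert order[0] == "ingest" and order[-1] == "monitor"
```

The same dependency map translates one-to-one into Airflow's `upstream >> downstream` wiring, with "approve" fanning in from both evaluation and explainability before registration.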
23. Conclusion
This document presents a generalized MLOps framework capable of supporting diverse machine learning systems through a modular pipeline architecture. By integrating Airflow orchestration, MLflow experiment tracking, explainability tools, automated monitoring, and a structured Approval Gate with Human-in-the-Loop review, the framework enables continuous model improvement, reliable deployment, and scalable ML system management.
The Human Review Layer ensures that critical model promotions are subject to expert oversight, maintaining trust, explainability, and governance across all environments.