🧠
✦ ML Evaluation & Pipeline Governance ✦

MAPE · WAPE · SHAP
MLflow & Airflow

Production-Grade Model Evaluation, Explainability & Pipeline Governance
Human-in-the-Loop · Metric-Gated Orchestration · Drift-Aware Monitoring

MAPE · WAPE · SHAP · MLflow · Airflow · Governance · Monitoring · Drift Detection
5 Core Components · 10 Sections · Continuous Loop
Section 01
Introduction

Modern Machine Learning systems must satisfy five core principles to be considered production-grade intelligent systems:

  • Accuracy — Measurable performance aligned to business KPIs
  • Interpretability — Stakeholders must understand model decisions
  • Reproducibility — Experiments must be auditable and repeatable
  • Automation — Lifecycle managed without manual intervention
  • Continuous Monitoring — Drift detected proactively in production

Framework Components

Component | Purpose                            | Role in Lifecycle
MAPE      | Accuracy measurement               | Evaluation & Monitoring
WAPE      | Business-stable performance        | Evaluation & Monitoring
SHAP      | Interpretability & drift detection | Explainability & Monitoring
MLflow    | Tracking & model registry          | Governance & Reproducibility
Airflow   | Orchestration & automation         | Scheduling & Retraining
— 01 —
Section 02
End-to-End MLOps Lifecycle

Airflow orchestrates every stage. MLflow tracks all experiments and models throughout.

🏢 Business Understanding
📥 Data Ingestion
✅ Data Validation
🧠 Model Training
📊 Model Evaluation (MAE · RMSE · MAPE · WAPE)
🔍 Explainability (SHAP)
📝 MLflow Logging
👤 Human Review & Sign-off
🔐 Automated Metric Gate
📦 Model Registry
🚀 Deployment
📡 Monitoring (Performance + SHAP Drift)
🔄 Retraining Trigger
♻️ Continuous Loop
— 02 —
Section 03
Evaluation Metrics Stack

MAPE and WAPE alone are not sufficient for production-grade time-series forecasting. A complete evaluation stack covers percentage error, absolute error, squared error, bias, baseline comparison, and stability — each applied at the right pipeline stage.

Stage A — During Training

3.1 — Mean Absolute Error (MAE)

The average absolute difference between predicted and actual values. Scale-dependent but easy to interpret as pure error magnitude in original units.

Formula — MAE
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|A_i - P_i\right|$$
✔ Why MAE During Training?
Directly optimisable · Robust to outliers · Same units as the target variable — makes training loss interpretable

3.2 — Root Mean Squared Error (RMSE)

Squares each error before averaging, then takes the root. Large errors are penalised more heavily than in MAE — critical when outlier forecasts carry operational risk.

Formula — RMSE
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(A_i - P_i\right)^2}$$
✔ Why RMSE During Training?
Penalises large spikes · Sensitive to outlier errors · Common benchmark for comparing model checkpoints
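As a minimal sketch (assuming NumPy arrays of actuals and predictions), the two training metrics can be computed side by side; the toy example shows how a single large miss inflates RMSE relative to MAE:

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error in the target's original units."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.abs(a - p).mean())

def rmse(actual, predicted):
    """Root mean squared error — squares each error before averaging."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.sqrt(((a - p) ** 2).mean()))

# One large miss (10 -> 2) inflates RMSE twice as much as MAE here
actual    = [10, 10, 10, 10]
predicted = [10, 10, 10, 2]
print(mae(actual, predicted), rmse(actual, predicted))  # 2.0 4.0
```

The gap between the two values is itself a diagnostic: when RMSE pulls far ahead of MAE, a few large outlier errors dominate the loss.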

3.3 — Weighted Absolute Percentage Error (WAPE)

WAPE calculates the aggregate absolute error as a proportion of total actual demand — robust to zero values and the standard business KPI in retail/supply-chain forecasting.

Formula — WAPE
$$\text{WAPE} = \frac{\sum_{i=1}^{n}\left|A_i - P_i\right|}{\sum_{i=1}^{n}\left|A_i\right|} \times 100$$
✔ When to Prefer WAPE
Retail forecasting · Time-series demand · SKU-level forecasting · Intermittent demand data
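A direct translation of the WAPE formula, sketched with NumPy; the example includes a zero actual to illustrate the robustness property MAPE lacks:

```python
import numpy as np

def wape(actual, predicted):
    """Aggregate absolute error as a percentage of total actual demand."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.abs(a - p).sum() / np.abs(a).sum() * 100)

# Unlike per-row MAPE, WAPE stays defined when individual actuals are zero
print(round(wape([0, 10, 20, 30], [2, 12, 18, 30]), 2))  # 10.0
```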

Stage B — During Validation

3.4 — Rolling WAPE

WAPE computed over a sliding window (e.g., 4-week rolling). Reveals whether forecast quality is stable over time or degrading in specific periods.

Formula — Rolling WAPE (window w)
$$\text{WAPE}_t = \frac{\sum_{k=t-w}^{t}\left|A_k - P_k\right|}{\sum_{k=t-w}^{t}\left|A_k\right|} \times 100$$
✔ Why Rolling WAPE During Validation?
Detects seasonality-specific degradation · Avoids masking errors with overall aggregate · Mirrors production monitoring
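A minimal trailing-window implementation of the rolling formula above (window length `w` is the caller's choice, e.g. 4 weekly points for a 4-week window):

```python
import numpy as np

def rolling_wape(actual, predicted, window):
    """Trailing-window WAPE (%) for each step from index `window - 1` onward."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    out = []
    for t in range(window - 1, len(a)):
        a_w = a[t - window + 1 : t + 1]
        p_w = p[t - window + 1 : t + 1]
        out.append(float(np.abs(a_w - p_w).sum() / np.abs(a_w).sum() * 100))
    return out
```

A flat series of rolling values signals stable forecast quality; a rising tail is exactly the degradation signal the monitoring DAG in Section 07 watches for.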

3.5 — RMSE (Validation Split)

Running RMSE on the held-out validation set confirms whether the training-set RMSE is generalisable or the model is overfitting to training noise.

⚠ Watch For
A large gap between train-RMSE and val-RMSE signals overfitting — trigger regularisation or reduce model complexity

3.6 — Bias / Mean Error

Mean Error (ME) captures the direction of error — whether the model is systematically over- or under-forecasting. A model with low RMSE but high bias is dangerous in inventory planning.

Formula — Mean Error (Bias)
$$\text{ME} = \frac{1}{n}\sum_{i=1}^{n}\left(P_i - A_i\right)$$
ME Value | Meaning           | Action
ME > 0   | Over-forecasting  | Check feature scaling / target leakage
ME < 0   | Under-forecasting | Review trend component or differencing
ME ≈ 0   | Unbiased          | Healthy — proceed to deployment gate
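The signed error is a one-liner, but worth isolating because averaging hides nothing here — a constant offset survives the mean, which is the point:

```python
import numpy as np

def mean_error(actual, predicted):
    """Signed bias: positive = over-forecasting, negative = under-forecasting."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float((p - a).mean())

# Low spread but persistent positive bias — dangerous for inventory planning
actual    = [100, 110, 120]
predicted = [105, 115, 125]   # always 5 units high
print(mean_error(actual, predicted))  # 5.0
```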

Stage C — Supplementary Metrics

3.7 — Mean Absolute Percentage Error (MAPE)

MAPE measures the per-observation percentage deviation of predictions from actual values. Best used alongside WAPE as a secondary business-facing metric.

Formula — MAPE
$$\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{A_i - P_i}{A_i}\right| \times 100$$
⚠ Limitations
Undefined when Actual = 0 · Sensitive to near-zero values · Biased toward under-forecasting

3.8 — MASE (Baseline Comparison)

Mean Absolute Scaled Error compares model MAE against a naïve seasonal baseline. MASE < 1 means the model outperforms a simple persistence forecast.

Formula — MASE
$$\text{MASE} = \frac{\text{MAE}_{\text{model}}}{\text{MAE}_{\text{naïve}}}$$
✔ Why MASE?
Scale-independent · Penalises failure to beat naive baseline · Essential for justifying model complexity to stakeholders
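A sketch of MASE using the simplest baseline: a lag-`season` persistence forecast on the actuals (with `season=1` this is "tomorrow equals today"; set `season` to the seasonal period for a seasonal naïve baseline):

```python
import numpy as np

def mase(actual, predicted, season=1):
    """Model MAE scaled by the MAE of a naive lag-`season` forecast."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    mae_model = np.abs(a - p).mean()
    mae_naive = np.abs(a[season:] - a[:-season]).mean()  # persistence baseline
    return float(mae_model / mae_naive)

# Model is off by 1 each step; naive baseline is off by 2 -> MASE = 0.5
print(mase([10, 12, 14, 16], [11, 13, 15, 17]))  # 0.5
```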

Complete Evaluation Stack

Stage         | Metric       | Category            | Why Needed
Training      | MAE          | Absolute Error      | Pure error magnitude in original units
Training      | RMSE         | Squared Error       | Penalises large errors / spikes
Training      | WAPE         | Percentage Error    | Business KPI — robust to zero values
Validation    | Rolling WAPE | Stability           | Detects time-period degradation
Validation    | RMSE (val)   | Generalisation      | Overfitting check
Validation    | Mean Error   | Bias                | Detects over/under-prediction direction
Supplementary | MAPE         | Percentage Error    | Per-observation % — business-friendly
Supplementary | MASE         | Baseline Comparison | Validates improvement over naïve forecast
— 03 —
Section 04
SHAP for Explainability

4.1 — Mathematical Foundation

SHAP (SHapley Additive exPlanations) is grounded in cooperative game theory. Every prediction is decomposed into additive feature contributions:

SHAP Additive Decomposition
$$f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i$$
  • φ₀ — Base value (expected model output over training data)
  • φᵢ — Shapley value: contribution of feature i
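The additive identity can be verified exactly without the shap library: for a linear model with independent features, the Shapley values have the closed form φᵢ = wᵢ(xᵢ − E[xᵢ]) and φ₀ = f(E[x]). The weights and data below are arbitrary illustrative values:

```python
import numpy as np

# Linear model f(x) = w·x + b; exact Shapley values for independent features
w, b = np.array([2.0, -1.0, 0.5]), 3.0
X_train = np.array([[1.0, 2.0, 4.0],
                    [3.0, 0.0, 2.0],
                    [2.0, 4.0, 0.0]])
x = np.array([4.0, 1.0, 2.0])

mu = X_train.mean(axis=0)       # E[x] over the background data
phi0 = float(w @ mu + b)        # base value: expected model output
phi = w * (x - mu)              # per-feature contributions phi_i

f_x = float(w @ x + b)
# Additivity: prediction = base value + sum of contributions
assert abs(f_x - (phi0 + phi.sum())) < 1e-9
print(phi0, phi.tolist())  # 6.0 [4.0, 1.0, 0.0]
```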

4.2 — Explainability Types

Type   | Visualization                 | Use Case
Global | Summary plot, feature ranking | Model validation, governance
Local  | Waterfall plot, force plot    | Single prediction explanation

4.3 — Why SHAP?

  • Explains individual model predictions with mathematical guarantees
  • Detects feature leakage during validation
  • Builds stakeholder and regulatory trust
  • Enables drift detection via importance shifts
⚠ SHAP is NOT Causal
SHAP explains how the model uses features — not what happens if you change a real-world feature. Causal inference requires A/B testing · DAGs · Propensity score matching · Do-calculus
— 04 —
Section 05
Integration with MLflow

MLflow acts as the tracking and governance layer for all experiments, artifacts, and registered models.

5.1 — Logging Metrics

Python
# Log full metric stack inside an MLflow run
import mlflow

with mlflow.start_run():
    # Training metrics
    mlflow.log_metric("mae", mae_value)
    mlflow.log_metric("rmse", rmse_value)
    mlflow.log_metric("wape", wape_value)
    # Validation metrics
    mlflow.log_metric("rolling_wape", rolling_wape_value)
    mlflow.log_metric("val_rmse", val_rmse_value)
    mlflow.log_metric("bias", mean_error_value)
    # Supplementary
    mlflow.log_metric("mape", mape_value)
    mlflow.log_metric("mase", mase_value)

5.2 — Logging SHAP Artifacts

Python
mlflow.log_artifact("shap_summary.png")
mlflow.log_artifact("waterfall_plot.png")

5.3 — Approval Tagging

Python
# ⚠ Tag values are always STRINGS in MLflow — use "True" not True
mlflow.set_tag("evaluator_approved", "True")
mlflow.set_tag("primary_metric", "wape")
⚠ Critical: Tags are Always Strings
mlflow.set_tag() stores values as strings internally. Always pass "True" not True — otherwise the string comparison in Airflow's approval gate will fail silently ("True" != True).

5.4 — Model Registration Gate

📦 All Conditions Must Pass (Automated + Human)
✔ WAPE ≤ threshold  ·  ✔ RMSE within range  ·  ✔ Bias (ME) ≈ 0
✔ MASE < 1  ·  ✔ SHAP validated  ·  ✔ evaluator_approved = True
✔ Human reviewer sign-off tag set in MLflow
Only then is the model registered in the MLflow Model Registry.
👤 Human-in-the-Loop (HITL) Review
Before automated registration, a Data Scientist or ML Engineer reviews the MLflow run — checks SHAP plots for unexpected feature behaviour, validates business alignment of WAPE, and sets the human_approved tag to "True". This prevents regressions that pass metrics but fail business logic.
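The full gate can be sketched as a single predicate. The threshold values below are illustrative placeholders, not the document's operational numbers; the second call demonstrates the string-tag pitfall called out in Section 5.3:

```python
def registration_gate(metrics: dict, tags: dict, wape_threshold: float = 15.0) -> bool:
    """All automated and human conditions must pass before registry push.
    Thresholds here are illustrative, not production values."""
    return all([
        metrics["wape"] <= wape_threshold,
        metrics["mase"] < 1.0,
        abs(metrics["bias"]) <= 0.5,                # ME ~ 0 tolerance
        tags.get("evaluator_approved") == "True",   # MLflow tags are strings
        tags.get("human_approved") == "True",
    ])

metrics = {"wape": 12.4, "mase": 0.8, "bias": 0.1}
print(registration_gate(metrics, {"evaluator_approved": "True",
                                  "human_approved": "True"}))  # True
# The classic pitfall: a boolean tag never equals the stored string "True"
print(registration_gate(metrics, {"evaluator_approved": True,
                                  "human_approved": "True"}))  # False
```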
— 05 —
Section 06
Airflow Orchestration

Apache Airflow automates the entire ML lifecycle via DAGs, providing dependency control, scheduling, retries, and monitoring.

6.1 — DAG Stage Breakdown

# | DAG Task              | Responsibility
1 | Data Ingestion        | Pull from source systems, validate schema
2 | Data Validation       | Nulls, distributions, outlier checks
3 | Model Training        | Run training with configured hyperparameters
4 | Evaluation Task       | Compute MAE · RMSE · WAPE · Bias · MASE, log to MLflow
5 | SHAP Analysis         | Generate SHAP plots, log artifacts to MLflow
6 | 👤 Human Review       | Data Scientist reviews MLflow run — inspects SHAP plots, validates business alignment, sets human_approved tag
7 | Automated Metric Gate | Airflow reads human_approved + metric thresholds — halts pipeline if any check fails
8 | Model Registration    | Push to MLflow Model Registry
9 | Deployment            | Serve model to production endpoint
✔ Airflow Guarantees
Cron scheduling · Dependency enforcement · Automatic retry logic · Retraining automation via sensors
— 06 —
Section 07
Monitoring Architecture

7.1 — Performance Monitoring

Periodically recompute the full metric stack on live production data. The primary production signals are WAPE trend over time, RMSE, and Bias — a threshold breach on any triggers the automated retraining DAG.

Signal         | Threshold Breach       | Action
WAPE (rolling) | ↑ Rising trend         | Trigger retraining DAG
RMSE           | > trained baseline     | Investigate data quality / anomalies
Bias (ME)      | Persistently ≠ 0       | Retrain with updated exogenous features
Data Drift     | PSI / KL divergence ↑  | Re-validate features, re-run the data pipeline
SHAP Shift     | Importance rank change | Pre-drift warning — act proactively

7.2 — SHAP Drift Monitoring

  1. Compute SHAP on training baseline and store importance scores
  2. Compute SHAP on current production data
  3. Compare distributions using KL divergence or PSI
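Step 3 can be sketched with PSI over two distributions of SHAP values. The shift magnitude and thresholds below are illustrative; in the real pipeline the baseline would be loaded from the MLflow artifact store rather than simulated:

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two distributions (e.g. SHAP values)."""
    edges = np.histogram_bin_edges(np.asarray(baseline, dtype=float), bins=bins)
    b, _ = np.histogram(baseline, bins=edges)
    c, _ = np.histogram(current, bins=edges)
    b = np.clip(b / b.sum(), 1e-6, None)  # smooth empty bins to avoid log(0)
    c = np.clip(c / c.sum(), 1e-6, None)
    return float(np.sum((c - b) * np.log(c / b)))

rng = np.random.default_rng(42)
base_shap = rng.normal(0.0, 1.0, 5000)   # stored training-baseline SHAP values
live_shap = base_shap + 0.8              # simulated production shift
# Common rule of thumb: PSI < 0.1 stable, 0.1–0.25 moderate, > 0.25 drifted
```

An unchanged distribution yields PSI ≈ 0; the 0.8σ shift above lands well into the "drifted" band, which is the condition that would fire the retraining DAG.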

7.3 — Drift Matrix

Scenario      | MAPE     | SHAP    | Meaning
Model Aging   | ↑ Rising | Stable  | Data drifted, retrain on recent data
Concept Drift | ↑ Rising | Changed | Patterns shifted, investigate features
Early Warning | Stable   | Changed | Pre-degradation — act proactively
Healthy       | Stable   | Stable  | Operating within acceptable bounds

7.4 — Retraining Trigger

🔄 Airflow Fires Retraining DAG If ANY Condition Is True
 • Rolling WAPE trend rising beyond threshold
 • RMSE > trained baseline threshold
 • Bias (Mean Error) persistently ≠ 0
 • Data drift detected (PSI / KL divergence)
 • SHAP feature importance distribution shifted
— 07 —
Section 08
Generalized Reusable Pipeline

8.1 — Project Structure

/mlops_framework/
  ├── data/
  ├── training/
  ├── evaluation/
  │   ├── metrics.py
  │   ├── shap_analysis.py
  │   └── approval_config.yaml
  ├── monitoring/
  ├── deployment/
  └── mlruns/

Compatible with: Regression · Time-series · Classification · Tree-based · Deep learning

8.2 — Framework Benefits

📐 Standardized Policy — Consistent evaluation across all projects
🔍 Explainable AI — SHAP for every model in registry
⚙️ Automated Lifecycle — End-to-end automation with a single human sign-off gate
📉 Reduced Risk — Multi-metric threshold gating
♻️ Reproducibility — Every run fully tracked in MLflow
📡 Drift-Aware — SHAP + metric monitoring loop

— 08 —
Section 09
Airflow & MLflow in Action

This section is tools-first, not theory-first. It shows exactly what Airflow does and what MLflow does at every metric checkpoint — MAPE, WAPE, and SHAP — in the real pipeline workflow.

9.1 — How Airflow Uses MAPE, WAPE & SHAP

Airflow does not compute metrics itself — it schedules and sequences the tasks that produce them. Here is how each metric fits into the DAG as a concrete Airflow operator:

DAG Task (Airflow Operator) | Metric Produced | What Airflow Does With It
evaluate_model_task    | MAPE, WAPE, RMSE, Bias               | Runs evaluation script, pushes results via XCom to the next task
approval_gate_task     | WAPE threshold check                 | Reads WAPE from XCom — if WAPE > threshold, marks task failed and halts DAG; pipeline stops before registration
shap_analysis_task     | SHAP importance scores               | Runs SHAP explainer, saves plots as artifacts, passes importance vector to monitoring sensor
monitoring_sensor_task | Rolling WAPE, MAPE trend, SHAP shift | Runs on schedule (e.g., daily at 02:00); compares live metrics against stored baseline — triggers retraining DAG if any threshold is breached
retrain_trigger_task   | MAPE rising / SHAP drift signal      | Uses Airflow TriggerDagRunOperator to fire the full training DAG automatically — no manual step needed
🛠 Airflow Role Summary
Airflow is the scheduler and gatekeeper. It decides when metrics are computed, whether the pipeline proceeds based on WAPE/MAPE thresholds, and if retraining fires based on SHAP drift or metric degradation. The metrics are the decision signals; Airflow acts on them.

9.2 — Airflow DAG: Metric-Gated Pipeline

Python — Airflow DAG
# Airflow DAG — every task is metric-aware
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

def evaluate(**ctx):
    wape = compute_wape(y_true, y_pred)
    mape = compute_mape(y_true, y_pred)
    ctx['ti'].xcom_push('wape', wape)
    ctx['ti'].xcom_push('mape', mape)

def approval_gate(**ctx):
    wape = ctx['ti'].xcom_pull(task_ids='evaluate', key='wape')
    if wape > WAPE_THRESHOLD:
        raise ValueError(f"WAPE {wape:.1f}% exceeds threshold — pipeline halted")

def shap_analysis(**ctx):
    shap_vals = generate_shap(model, X_val)
    save_shap_plots(shap_vals)  # saved as MLflow artifact
    ctx['ti'].xcom_push('shap_importance', shap_vals.mean(axis=0).tolist())

# DAG wiring
t_train >> t_evaluate >> t_shap >> t_gate >> t_register >> t_deploy

9.3 — How MLflow Uses MAPE, WAPE & SHAP

MLflow is the memory of the pipeline — it stores every metric, every artifact, and every model version so the team can compare, audit, and reproduce any run. Here is what MLflow concretely does with each metric:

MLflow Feature Used | Metric / Artifact | What MLflow Stores / Enables
mlflow.log_metric()         | MAPE, WAPE, RMSE, Bias                                 | Logs numeric values against the run ID — visible in the MLflow UI as time-series charts for every experiment
mlflow.log_artifact()       | SHAP summary plot, waterfall plot                      | Stores PNG/HTML SHAP visuals attached to the run — reviewable by any stakeholder without re-running code
mlflow.set_tag()            | WAPE approval flag, primary_metric                     | Tags the run with human-readable metadata — used by the approval gate to filter only evaluator_approved = "True" runs for promotion
Model Registry — Staging    | All thresholds passed (WAPE ✔ MAPE ✔ SHAP ✔ Human ✔)   | Promotion happens via Airflow, not the training script. Flow: script logs metrics → Airflow reads → Airflow decides → Airflow calls Registry API. Keeps governance centralised.
Model Registry — Production | Robust champion gate (WAPE delta + RMSE + Bias + SHAP) | Simple WAPE_new < WAPE_old is not enough — requires meaningful margin, no RMSE increase, bias within tolerance, no SHAP anomaly. See Section 9.6.

9.4 — MLflow Run: Metric + SHAP Logging

Python — MLflow Logging
with mlflow.start_run(run_name="sarimax_challenger_v3") as run:
    # ── Metrics that Airflow approval_gate_task reads via MLflow API ──
    mlflow.log_metric("wape", wape_value)            # primary gate signal
    mlflow.log_metric("mape", mape_value)            # secondary business metric
    mlflow.log_metric("rmse", rmse_value)
    mlflow.log_metric("rolling_wape", rolling_wape)  # monitoring baseline
    mlflow.log_metric("bias", mean_error)

    # ── SHAP artifacts stored for audit and drift comparison ──
    shap_vals = shap.Explainer(model)(X_val)
    shap.summary_plot(shap_vals, show=False)
    plt.savefig("shap_summary.png")
    mlflow.log_artifact("shap_summary.png")          # attached to this run

    # ── Tags: values MUST be strings in MLflow ──
    mlflow.set_tag("evaluator_approved", "True")     # "True" not True
    mlflow.set_tag("primary_metric", "wape")
    mlflow.set_tag("shap_validated", "True")

    # ── Promotion via AIRFLOW (not here) — training script only logs ──
    # Airflow reads this run_id from XCom and calls the Registry API.
    # Do NOT call register_model() inside the training script.

9.5 — End-to-End: Airflow Triggers, MLflow Stores

🔁 Combined Tool Workflow
Airflow triggers the evaluate task on schedule → training script logs WAPE, MAPE, SHAP to MLflow → human reviewer sets tag → Airflow reads MLflow run metrics via Client API → Airflow (not the script) calls MLflow Registry API to register → daily monitoring DAG reads stored SHAP baseline from MLflow and compares → if drift or metric degradation, Airflow fires retrain DAG.

9.6 — Robust Production Promotion Logic

A simple WAPE_new < WAPE_old check is insufficient. Unstable models can pass on one metric while regressing on others. The safe promotion gate requires a meaningful delta margin, no RMSE increase, bias within tolerance, and no SHAP anomaly.

Python — Airflow Promotion Task
def promote_to_production(**ctx):
    # Fetch challenger and champion metrics from MLflow
    challenger = get_run_metrics(challenger_run_id)
    champion = get_run_metrics(current_production_run_id)

    # Robust multi-condition promotion guard
    wape_improved = challenger["wape"] < champion["wape"] - WAPE_DELTA
    rmse_stable   = challenger["rmse"] <= champion["rmse"]
    bias_ok       = abs(challenger["bias"]) <= BIAS_LIMIT
    no_shap_drift = challenger["shap_drift_score"] < SHAP_DRIFT_THRESHOLD

    if wape_improved and rmse_stable and bias_ok and no_shap_drift:
        client.transition_model_version_stage(
            name="forecasting_model",
            version=challenger_version,
            stage="Production"
        )
    else:
        raise ValueError("Challenger failed promotion gate — champion retained")
⚠ Why Multi-Condition Matters
A model with WAPE_new < WAPE_old but rising RMSE or persistent bias will cause silent production degradation. The delta margin (e.g., 0.5%) prevents noise-driven over-promotion. The SHAP drift check catches feature behaviour anomalies before they affect customers.
Tool    | Its Role with MAPE / WAPE | Its Role with SHAP
Airflow | Schedules metric computation, gates pipeline on WAPE threshold, triggers retraining if MAPE/WAPE rising | Sequences SHAP analysis task, passes importance vector as XCom for drift comparison
MLflow  | Stores MAPE & WAPE per run, enables cross-experiment comparison in UI, tags approval status             | Stores SHAP plots as artifacts per run; baseline SHAP importance vector retrieved for drift detection
— 09 —
Section 10
Final Conclusion

By integrating all pillars of this framework:

  • MAE / RMSE → Pure error magnitude and outlier sensitivity during training
  • WAPE / Rolling WAPE → Business-level aggregate error and stability monitoring
  • Bias (ME) / MASE → Direction of error and baseline comparison for validation
  • SHAP → Explainability, drift detection, and pre-degradation early warning
  • MLflow → Experiment tracking, full metric registry, and governance gate
  • Airflow → Full lifecycle automation and multi-signal retraining orchestration
✅ Outcome
We transform a machine learning model into a continuous, explainable, monitored, and production-grade intelligent system — aligned to both technical rigor and real-world business requirements.
— 10 —
Created by Wormbyte