MAPE · WAPE · SHAP
MLflow & Airflow
Production-Grade Model Evaluation, Explainability & Pipeline
Governance
Human-in-the-Loop · Metric-Gated Orchestration · Drift-Aware Monitoring
Modern Machine Learning systems must satisfy five core principles to be considered production-grade intelligent systems:
- Accuracy — Measurable performance aligned to business KPIs
- Interpretability — Stakeholders must understand model decisions
- Reproducibility — Experiments must be auditable and repeatable
- Automation — Lifecycle managed without manual intervention
- Continuous Monitoring — Drift detected proactively in production
Framework Components
| Component | Purpose | Role in Lifecycle |
|---|---|---|
| MAPE | Accuracy measurement | Evaluation & Monitoring |
| WAPE | Business-stable performance | Evaluation & Monitoring |
| SHAP | Interpretability & drift detection | Explainability & Monitoring |
| MLflow | Tracking & model registry | Governance & Reproducibility |
| Airflow | Orchestration & automation | Scheduling & Retraining |
Airflow orchestrates every stage. MLflow tracks all experiments and models throughout.
MAPE and WAPE alone are not sufficient for production-grade time-series forecasting. A complete evaluation stack covers percentage error, absolute error, squared error, bias, baseline comparison, and stability — each applied at the right pipeline stage.
Stage A — During Training
3.1 — Mean Absolute Error (MAE)
The average absolute difference between predicted and actual values. Scale-dependent but easy to interpret as pure error magnitude in original units.
3.2 — Root Mean Squared Error (RMSE)
Squares each error before averaging, then takes the root. Large errors are penalised more heavily than in MAE — critical when outlier forecasts carry operational risk.
3.3 — Weighted Absolute Percentage Error (WAPE)
WAPE calculates the aggregate absolute error as a proportion of total actual demand — robust to zero values and the standard business KPI in retail/supply-chain forecasting.
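As a minimal pure-Python sketch (the helper names are illustrative, not from any library), the three training-stage metrics can be computed as:

```python
import math

def mae(actual, predicted):
    # Mean Absolute Error: average error magnitude in original units
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root Mean Squared Error: squares errors first, so spikes weigh more
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def wape(actual, predicted):
    # Weighted Absolute Percentage Error: total error over total actual demand
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(abs(a) for a in actual)

actual    = [100, 0, 50, 80]     # note the zero: per-row MAPE would blow up here
predicted = [ 90, 5, 55, 100]

print(f"MAE  = {mae(actual, predicted):.2f}")    # error in original units
print(f"RMSE = {rmse(actual, predicted):.2f}")   # dominated by the 20-unit miss
print(f"WAPE = {wape(actual, predicted):.2%}")   # still defined despite actual == 0
```

Note how WAPE stays defined even though one actual value is zero, which is exactly why it is preferred over MAPE as the aggregate business KPI.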
Stage B — During Validation
3.4 — Rolling WAPE
WAPE computed over a sliding window (e.g., 4-week rolling). Reveals whether forecast quality is stable over time or degrading in specific periods.
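A sliding-window sketch of Rolling WAPE (pure Python; in practice a pandas rolling aggregation would produce the same numbers):

```python
def wape(actual, predicted):
    # Aggregate absolute error over aggregate actual demand
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(abs(a) for a in actual)

def rolling_wape(actual, predicted, window=4):
    # WAPE over each trailing window, revealing period-specific degradation
    return [
        wape(actual[i - window:i], predicted[i - window:i])
        for i in range(window, len(actual) + 1)
    ]

# Stable early weeks, degrading later weeks (illustrative data)
actual    = [100, 110, 105, 95, 100, 110, 105, 95]
predicted = [ 98, 112, 103, 97,  80, 130,  90, 70]

for i, w in enumerate(rolling_wape(actual, predicted)):
    print(f"window ending week {i + 4}: WAPE = {w:.1%}")
```

The overall WAPE for this series looks moderate, but the rolling view exposes that all of the error is concentrated in the later windows, which is the signal a single aggregate number hides.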
3.5 — RMSE (Validation Split)
Running RMSE on the held-out validation set confirms whether the training-set RMSE is generalisable or the model is overfitting to training noise.
3.6 — Bias / Mean Error
Mean Error (ME) captures the direction of error — whether the model is systematically over- or under-forecasting. A model with low RMSE but high bias is dangerous in inventory planning.
| ME Value | Meaning | Action |
|---|---|---|
| ME > 0 | Over-forecasting | Check feature scaling / target leakage |
| ME < 0 | Under-forecasting | Review trend component or differencing |
| ME ≈ 0 | Unbiased | Healthy — proceed to deployment gate |
Stage C — Supplementary Metrics
3.7 — Mean Absolute Percentage Error (MAPE)
MAPE measures the per-observation percentage deviation of predictions from actual values. Best used alongside WAPE as a secondary business-facing metric.
3.8 — MASE (Baseline Comparison)
Mean Absolute Scaled Error compares model MAE against a naïve seasonal baseline. MASE < 1 means the model outperforms a simple persistence forecast.
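The supplementary signals can be sketched the same way (pure Python, illustrative names; the bias sign convention follows the table in 3.6, where positive means over-forecasting):

```python
def mape(actual, predicted):
    # Per-observation percentage error; undefined if any actual value is 0
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def mean_error(actual, predicted):
    # Bias (ME): positive = over-forecasting, negative = under-forecasting
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)

def mase(actual, predicted, season=1):
    # Model MAE scaled by the MAE of a naive seasonal persistence forecast
    model_mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    naive_mae = sum(
        abs(actual[i] - actual[i - season]) for i in range(season, len(actual))
    ) / (len(actual) - season)
    return model_mae / naive_mae

actual    = [100, 120, 110, 130]
predicted = [105, 115, 112, 128]

print(f"MAPE = {mape(actual, predicted):.1%}")        # per-observation %
print(f"ME   = {mean_error(actual, predicted):.2f}")  # ~0 means unbiased
print(f"MASE = {mase(actual, predicted):.2f}")        # < 1 beats the naive forecast
```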
Complete Evaluation Stack
| Stage | Metric | Category | Why Needed |
|---|---|---|---|
| Training | MAE | Absolute Error | Pure error magnitude in original units |
| Training | RMSE | Squared Error | Penalises large errors / spikes |
| Training | WAPE | Percentage Error | Business KPI — robust to zero values |
| Validation | Rolling WAPE | Stability | Detects time-period degradation |
| Validation | RMSE (val) | Generalisation | Overfitting check |
| Validation | Mean Error | Bias | Detects over/under-prediction direction |
| Supplementary | MAPE | Percentage Error | Per-observation % — business-friendly |
| Supplementary | MASE | Baseline Comparison | Validates improvement over naïve forecast |
4.1 — Mathematical Foundation
SHAP (SHapley Additive exPlanations) is grounded in cooperative game theory. Every prediction is decomposed into additive feature contributions:

f(x) = φ₀ + Σᵢ φᵢ

- φ₀ — Base value (expected model output over the training data)
- φᵢ — Shapley value: the contribution of feature i to this prediction
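To make the additive decomposition concrete, here is a brute-force exact Shapley computation for a toy model (pure Python, not the shap library; it enumerates every feature ordering, so it is feasible only for a handful of features):

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values for model f over len(x) features.

    Features absent from a coalition are held at their baseline value;
    each feature's marginal contribution is averaged over all join orders.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]        # feature i joins the coalition
            new = f(current)
            phi[i] += new - prev     # its marginal contribution in this order
            prev = new
    return [p / len(orders) for p in phi]

# Toy model with an interaction term between features 0 and 2
f = lambda v: 3 * v[0] + 2 * v[1] + v[0] * v[2]
x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]

phi0 = f(baseline)                    # base value φ₀
phi = shapley_values(f, x, baseline)  # per-feature φᵢ

# Additivity guarantee: φ₀ + Σφᵢ reconstructs the prediction exactly
assert abs(phi0 + sum(phi) - f(x)) < 1e-9
print(phi)
```

The interaction term is split evenly between the two participating features, which is the fairness property that makes SHAP attributions defensible to stakeholders.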
4.2 — Explainability Types
| Type | Visualization | Use Case |
|---|---|---|
| Global | Summary plot, feature ranking | Model validation, governance |
| Local | Waterfall plot, force plot | Single prediction explanation |
4.3 — Why SHAP?
- Explains individual model predictions with mathematical guarantees
- Detects feature leakage during validation
- Builds stakeholder and regulatory trust
- Enables drift detection via importance shifts
MLflow acts as the tracking and governance layer for all experiments, artifacts, and registered models.
5.1 — Logging Metrics
# Log the full metric stack inside an MLflow run
import mlflow

with mlflow.start_run():
    # Training metrics
    mlflow.log_metric("mae", mae_value)
    mlflow.log_metric("rmse", rmse_value)
    mlflow.log_metric("wape", wape_value)

    # Validation metrics
    mlflow.log_metric("rolling_wape", rolling_wape_value)
    mlflow.log_metric("val_rmse", val_rmse_value)
    mlflow.log_metric("bias", mean_error_value)

    # Supplementary metrics
    mlflow.log_metric("mape", mape_value)
    mlflow.log_metric("mase", mase_value)
5.2 — Logging SHAP Artifacts
# Inside the same MLflow run — attach SHAP visuals as run artifacts
mlflow.log_artifact("shap_summary.png")
mlflow.log_artifact("waterfall_plot.png")
5.3 — Approval Tagging
# ⚠ Tag values are always STRINGS in MLflow — use "True" not True
mlflow.set_tag("evaluator_approved", "True")
mlflow.set_tag("primary_metric", "wape")
mlflow.set_tag() stores values as strings internally. Always pass "True", not True — otherwise the string comparison in Airflow's approval gate will fail silently ("True" != True).
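The pitfall in two lines of plain Python (no MLflow needed):

```python
tag_value = "True"          # what MLflow hands back for the tag

print(tag_value == True)    # False: a string never equals a bool in Python
print(tag_value == "True")  # True: compare against the string form
```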
5.4 — Model Registration Gate
A model is registered only when every gate passes:

✔ MASE < 1 · ✔ SHAP validated · ✔ evaluator_approved = "True"
✔ Human reviewer sign-off tag set in MLflow

Only then is the model registered in the MLflow Model Registry. Even when all metric gates pass, a human reviewer must explicitly set the human_approved = "True" tag. This prevents regressions that pass metrics but fail business logic.
Apache Airflow automates the entire ML lifecycle via DAGs, providing dependency control, scheduling, retries, and monitoring.
6.1 — DAG Stage Breakdown
| # | DAG Task | Responsibility |
|---|---|---|
| 1 | Data Ingestion | Pull from source systems, validate schema |
| 2 | Data Validation | Nulls, distributions, outlier checks |
| 3 | Model Training | Run training with configured hyperparameters |
| 4 | Evaluation Task | Compute MAE · RMSE · WAPE · Bias · MASE, log to MLflow |
| 5 | SHAP Analysis | Generate SHAP plots, log artifacts to MLflow |
| 6 | 👤 Human Review | Data Scientist reviews the MLflow run — inspects SHAP plots, validates business alignment, sets the human_approved = "True" tag |
| 7 | Automated Metric Gate | Airflow reads human_approved + metric thresholds — halts the pipeline if any check fails |
| 8 | Model Registration | Push to MLflow Model Registry |
| 9 | Deployment | Serve model to production endpoint |
7.1 — Performance Monitoring
Periodically recompute the full metric stack on live production data. The primary production signals are WAPE trend over time, RMSE, and Bias — a threshold breach on any triggers the automated retraining DAG.
| Signal | Threshold Breach | Action |
|---|---|---|
| WAPE (rolling) | ↑ Rising trend | Trigger retraining DAG |
| RMSE | > trained baseline | Investigate data quality / anomalies |
| Bias (ME) | Persistently ≠ 0 | Retrain with updated exogenous features |
| Data Drift | PSI / KL divergence ↑ | Re-validate features, repipeline data |
| SHAP Shift | Importance rank change | Pre-drift warning — act proactively |
7.2 — SHAP Drift Monitoring
- Compute SHAP on training baseline and store importance scores
- Compute SHAP on current production data
- Compare distributions using KL divergence or PSI
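The comparison step can be sketched with a minimal PSI implementation (pure Python; the bin proportions here are illustrative SHAP importance shares, not real data):

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two per-bin proportion vectors.

    expected/actual each sum to 1; eps guards against log(0) on empty bins.
    """
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

baseline_importance = [0.40, 0.30, 0.20, 0.10]   # training-time SHAP shares
current_importance  = [0.25, 0.30, 0.25, 0.20]   # production SHAP shares

score = psi(baseline_importance, current_importance)
print(f"PSI = {score:.3f}")   # common rule of thumb: > 0.2 signals significant shift
```

Identical distributions score 0; the larger the reallocation of importance mass between bins, the larger the PSI.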
7.3 — Drift Matrix
| Scenario | MAPE | SHAP | Meaning |
|---|---|---|---|
| Model Aging | ↑ Rising | Stable | Data drifted, retrain on recent data |
| Concept Drift | ↑ Rising | Changed | Patterns shifted, investigate features |
| Early Warning | Stable | Changed | Pre-degradation — act proactively |
| Healthy | Stable | Stable | Operating within acceptable bounds |
7.4 — Retraining Trigger
- RMSE > trained baseline threshold
- Bias (Mean Error) persistently ≠ 0
- Data drift detected (PSI / KL divergence)
- SHAP feature-importance distribution shifted
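A sketch of how the four signals above might be OR-combined into a single retraining decision (function name and threshold values are illustrative, not from Airflow or MLflow):

```python
def should_retrain(metrics, baseline, *, rmse_tol=1.10, bias_limit=5.0,
                   psi_threshold=0.2, shap_drift_threshold=0.2):
    """OR-combine the retraining signals; thresholds are project-specific."""
    triggers = {
        "rmse_breach": metrics["rmse"] > baseline["rmse"] * rmse_tol,
        "persistent_bias": abs(metrics["bias"]) > bias_limit,
        "data_drift": metrics["psi"] > psi_threshold,
        "shap_drift": metrics["shap_drift"] > shap_drift_threshold,
    }
    return any(triggers.values()), triggers

# Example: all accuracy signals healthy, but SHAP importances have shifted
fire, why = should_retrain(
    {"rmse": 14.0, "bias": 1.2, "psi": 0.05, "shap_drift": 0.31},
    {"rmse": 13.5},
)
print(fire, [k for k, v in why.items() if v])   # True ['shap_drift']
```

Returning the per-signal breakdown alongside the boolean makes the trigger auditable: the retraining DAG run can log exactly which signal fired.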
8.1 — Project Structure
├── data/
├── training/
├── evaluation/
│ ├── metrics.py
│ ├── shap_analysis.py
│ └── approval_config.yaml
├── monitoring/
├── deployment/
└── mlruns/
Compatible with: Regression · Time-series · Classification · Tree-based · Deep learning
8.2 — Framework Benefits
- Standardized Policy — Consistent evaluation across all projects
- Explainable AI — SHAP for every model in the registry
- Automated Lifecycle — Zero manual steps end-to-end
- Reduced Risk — Dual-metric threshold gating
- Reproducibility — Every run fully tracked in MLflow
- Drift-Aware — SHAP + metric monitoring loop
This section is tools-first rather than theory-first. It shows exactly what Airflow does and what MLflow does at every metric checkpoint — MAPE, WAPE, and SHAP — in the real pipeline workflow.
9.1 — How Airflow Uses MAPE, WAPE & SHAP
Airflow does not compute metrics itself — it schedules and sequences the tasks that produce them. Here is how each metric fits into the DAG as a concrete Airflow operator:
| DAG Task (Airflow Operator) | Metric Produced | What Airflow Does With It |
|---|---|---|
| evaluate_model_task | MAPE, WAPE, RMSE, Bias | Runs evaluation script, pushes results via XCom to the next task |
| approval_gate_task | WAPE threshold check | Reads WAPE from XCom — if WAPE > threshold, marks task failed and halts DAG; pipeline stops before registration |
| shap_analysis_task | SHAP importance scores | Runs SHAP explainer, saves plots as artifacts, passes importance vector to monitoring sensor |
| monitoring_sensor_task | Rolling WAPE, MAPE trend, SHAP shift | Runs on schedule (e.g., daily at 02:00); compares live metrics against stored baseline — triggers retraining DAG if any threshold is breached |
| retrain_trigger_task | MAPE rising / SHAP drift signal | Uses Airflow's TriggerDagRunOperator to fire the full training DAG automatically — no manual step needed |
9.2 — Airflow DAG: Metric-Gated Pipeline
# Airflow DAG — every task is metric-aware
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator  # used by the retrain trigger

WAPE_THRESHOLD = 12.0  # example gate threshold (%)

# compute_wape, compute_mape, generate_shap, save_shap_plots, y_true, y_pred,
# model, X_val come from the project's evaluation module.

def evaluate(**ctx):
    wape = compute_wape(y_true, y_pred)
    mape = compute_mape(y_true, y_pred)
    ctx['ti'].xcom_push('wape', wape)
    ctx['ti'].xcom_push('mape', mape)

def approval_gate(**ctx):
    wape = ctx['ti'].xcom_pull(task_ids='evaluate', key='wape')
    if wape > WAPE_THRESHOLD:
        raise ValueError(f"WAPE {wape:.1f}% exceeds threshold — pipeline halted")

def shap_analysis(**ctx):
    shap_vals = generate_shap(model, X_val)
    save_shap_plots(shap_vals)  # saved as MLflow artifacts
    ctx['ti'].xcom_push('shap_importance', shap_vals.mean(axis=0).tolist())

with DAG("metric_gated_pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    t_evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate)
    t_shap = PythonOperator(task_id="shap_analysis", python_callable=shap_analysis)
    t_gate = PythonOperator(task_id="approval_gate", python_callable=approval_gate)
    # t_train, t_register, t_deploy defined analogously

    # DAG wiring
    t_train >> t_evaluate >> t_shap >> t_gate >> t_register >> t_deploy
9.3 — How MLflow Uses MAPE, WAPE & SHAP
MLflow is the memory of the pipeline — it stores every metric, every artifact, and every model version so the team can compare, audit, and reproduce any run. Here is what MLflow concretely does with each metric:
| MLflow Feature Used | Metric / Artifact | What MLflow Stores / Enables |
|---|---|---|
| mlflow.log_metric() | MAPE, WAPE, RMSE, Bias | Logs numeric values against the run ID — visible in the MLflow UI as time-series charts for every experiment |
| mlflow.log_artifact() | SHAP summary plot, waterfall plot | Stores PNG/HTML SHAP visuals attached to the run — reviewable by any stakeholder without re-running code |
| mlflow.set_tag() | WAPE approval flag, primary_metric | Tags the run with human-readable metadata — used by the approval gate to filter only evaluator_approved = True runs for promotion |
| Model Registry — Staging | All thresholds passed (WAPE ✔ MAPE ✔ SHAP ✔ Human ✔) | Promotion happens via Airflow, not the training script. Flow: script logs metrics → Airflow reads → Airflow decides → Airflow calls the Registry API. This keeps governance centralised. |
| Model Registry — Production | Robust champion gate (WAPE delta + RMSE + Bias + SHAP) | A simple WAPE_new < WAPE_old check is not enough — promotion requires a meaningful margin, no RMSE increase, bias within tolerance, and no SHAP anomaly. See Section 9.6. |
9.4 — MLflow Run: Metric + SHAP Logging
import matplotlib.pyplot as plt
import mlflow
import shap

with mlflow.start_run(run_name="sarimax_challenger_v3") as run:
    # ── Metrics that Airflow's approval_gate_task reads via the MLflow API ──
    mlflow.log_metric("wape", wape_value)            # primary gate signal
    mlflow.log_metric("mape", mape_value)            # secondary business metric
    mlflow.log_metric("rmse", rmse_value)
    mlflow.log_metric("rolling_wape", rolling_wape)  # monitoring baseline
    mlflow.log_metric("bias", mean_error)

    # ── SHAP artifacts stored for audit and drift comparison ──
    shap_vals = shap.Explainer(model)(X_val)
    shap.summary_plot(shap_vals, show=False)
    plt.savefig("shap_summary.png")
    mlflow.log_artifact("shap_summary.png")          # attached to this run

    # ── Tags: values MUST be strings in MLflow ──
    mlflow.set_tag("evaluator_approved", "True")     # "True", not True
    mlflow.set_tag("primary_metric", "wape")
    mlflow.set_tag("shap_validated", "True")

    # ── Promotion happens in AIRFLOW (not here) — the training script only logs ──
    # Airflow reads this run_id from XCom and calls the Registry API.
    # Do NOT call register_model() inside the training script.
9.5 — End-to-End: Airflow Triggers, MLflow Stores
Train → Evaluate (metrics → MLflow) → SHAP (artifacts → MLflow) → Gate (Airflow reads MLflow) → Register (Airflow calls the Registry API) → Deploy → Monitor (sensor compares live metrics against the MLflow baseline) → Retrain trigger.
9.6 — Robust Production Promotion Logic
A simple WAPE_new < WAPE_old check is insufficient: unstable models can pass on one metric while regressing on others. The safe promotion gate requires a meaningful delta margin, no RMSE increase, bias within tolerance, and no SHAP anomaly.
from mlflow.tracking import MlflowClient

client = MlflowClient()

def get_run_metrics(run_id):
    # Fetch the logged metric dict for a run from the MLflow tracking server
    return client.get_run(run_id).data.metrics

def promote_to_production(**ctx):
    # Fetch challenger and champion metrics from MLflow
    challenger = get_run_metrics(challenger_run_id)
    champion = get_run_metrics(current_production_run_id)

    # Robust multi-condition promotion guard
    wape_improved = challenger["wape"] < champion["wape"] - WAPE_DELTA
    rmse_stable = challenger["rmse"] <= champion["rmse"]
    bias_ok = abs(challenger["bias"]) <= BIAS_LIMIT
    no_shap_drift = challenger["shap_drift_score"] < SHAP_DRIFT_THRESHOLD

    if wape_improved and rmse_stable and bias_ok and no_shap_drift:
        client.transition_model_version_stage(
            name="forecasting_model",
            version=challenger_version,
            stage="Production",
        )
    else:
        raise ValueError("Challenger failed promotion gate — champion retained")
Promoting on WAPE_new < WAPE_old alone while RMSE rises or bias persists will cause silent production degradation. The delta margin (e.g., 0.5%) prevents noise-driven over-promotion, and the SHAP drift check catches feature-behaviour anomalies before they affect customers.
| Tool | Its Role with MAPE / WAPE | Its Role with SHAP |
|---|---|---|
| Airflow | Schedules metric computation, gates pipeline on WAPE threshold, triggers retraining if MAPE/WAPE rising | Sequences SHAP analysis task, passes importance vector as XCom for drift comparison |
| MLflow | Stores MAPE & WAPE per run, enables cross-experiment comparison in UI, tags approval status | Stores SHAP plots as artifacts per run; baseline SHAP importance vector retrieved for drift detection |
By integrating all pillars of this framework:
- MAE / RMSE → Pure error magnitude and outlier sensitivity during training
- WAPE / Rolling WAPE → Business-level aggregate error and stability monitoring
- Bias (ME) / MASE → Direction of error and baseline comparison for validation
- SHAP → Explainability, drift detection, and pre-degradation early warning
- MLflow → Experiment tracking, full metric registry, and governance gate
- Airflow → Full lifecycle automation and multi-signal retraining orchestration