🧠
✦ ML Evaluation & Pipeline Governance ✦

MAPE · WAPE · SHAP
MLflow & Airflow

Production-Grade Model Evaluation, Explainability & Pipeline Governance
Human-in-the-Loop · Metric-Gated Orchestration · Drift-Aware Monitoring

MAPE · WAPE · SHAP · MLflow · Airflow · Governance · Monitoring · Drift Detection
5 Core Components · 10 Sections · Continuous Loop
Section 01
Introduction

Modern Machine Learning systems must satisfy five core principles to be considered production-grade intelligent systems:

  • Accuracy — Measurable performance aligned to business KPIs
  • Interpretability — Stakeholders must understand model decisions
  • Reproducibility — Experiments must be auditable and repeatable
  • Automation — Lifecycle managed without manual intervention
  • Continuous Monitoring — Drift detected proactively in production

Framework Components

Component | Purpose                            | Role in Lifecycle
MAPE      | Accuracy measurement               | Evaluation & Monitoring
WAPE      | Business-stable performance        | Evaluation & Monitoring
SHAP      | Interpretability & drift detection | Explainability & Monitoring
MLflow    | Tracking & model registry          | Governance & Reproducibility
Airflow   | Orchestration & automation         | Scheduling & Retraining
— 01 —
Section 02
End-to-End MLOps Lifecycle

Airflow orchestrates every stage. MLflow tracks all experiments and models throughout.

🏢 Business Understanding
📥 Data Ingestion
✅ Data Validation
🧠 Model Training
📊 Model Evaluation (MAE · RMSE · MAPE · WAPE)
🔍 Explainability (SHAP)
📝 MLflow Logging
👤 Human Review & Sign-off
🔐 Automated Metric Gate
📦 Model Registry
🚀 Deployment
📡 Monitoring (Performance + SHAP Drift)
🔄 Retraining Trigger
♻️ Continuous Loop
— 02 —
Section 03
Evaluation Metrics Stack

MAPE and WAPE alone are not sufficient for production-grade time-series forecasting. A complete evaluation stack covers percentage error, absolute error, squared error, bias, baseline comparison, and stability — each applied at the right pipeline stage.

Stage A — During Training

3.1 — Mean Absolute Error (MAE)

The average absolute difference between predicted and actual values. Scale-dependent but easy to interpret as pure error magnitude in original units.

Formula — MAE
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|A_i - P_i\right|$$
✔ Why MAE During Training?
Directly optimisable · Robust to outliers · Same units as the target variable — makes training loss interpretable

3.2 — Root Mean Squared Error (RMSE)

Squares each error before averaging, then takes the root. Large errors are penalised more heavily than in MAE — critical when outlier forecasts carry operational risk.

Formula — RMSE
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(A_i - P_i\right)^2}$$
✔ Why RMSE During Training?
Penalises large spikes · Sensitive to outlier errors · Common benchmark for comparing model checkpoints
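As a minimal sketch (assuming NumPy arrays of actuals and predictions), the two training metrics can be computed side by side; the toy example shows how a single large miss inflates RMSE relative to MAE:

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error in the target's original units."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.abs(a - p).mean())

def rmse(actual, predicted):
    """Root mean squared error — squares each error before averaging."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.sqrt(((a - p) ** 2).mean()))

# One large miss (10 -> 2) inflates RMSE twice as much as MAE here
actual    = [10, 10, 10, 10]
predicted = [10, 10, 10, 2]
print(mae(actual, predicted), rmse(actual, predicted))  # 2.0 4.0
```

The gap between the two values is itself a diagnostic: when RMSE pulls far ahead of MAE, a few large outlier errors dominate the loss.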

3.3 — Weighted Absolute Percentage Error (WAPE)

WAPE calculates the aggregate absolute error as a proportion of total actual demand — robust to zero values and the standard business KPI in retail/supply-chain forecasting.

Formula — WAPE
$$\text{WAPE} = \frac{\sum_{i=1}^{n}\left|A_i - P_i\right|}{\sum_{i=1}^{n}\left|A_i\right|} \times 100$$
✔ When to Prefer WAPE
Retail forecasting · Time-series demand · SKU-level forecasting · Intermittent demand data
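A direct translation of the WAPE formula, sketched with NumPy; the example includes a zero actual to illustrate the robustness property MAPE lacks:

```python
import numpy as np

def wape(actual, predicted):
    """Aggregate absolute error as a percentage of total actual demand."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.abs(a - p).sum() / np.abs(a).sum() * 100)

# Unlike per-row MAPE, WAPE stays defined when individual actuals are zero
print(round(wape([0, 10, 20, 30], [2, 12, 18, 30]), 2))  # 10.0
```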

Stage B — During Validation

3.4 — Rolling WAPE

WAPE computed over a sliding window (e.g., 4-week rolling). Reveals whether forecast quality is stable over time or degrading in specific periods.

Formula — Rolling WAPE (window w)
$$\text{WAPE}_t = \frac{\sum_{k=t-w}^{t}\left|A_k - P_k\right|}{\sum_{k=t-w}^{t}\left|A_k\right|} \times 100$$
✔ Why Rolling WAPE During Validation?
Detects seasonality-specific degradation · Avoids masking errors with overall aggregate · Mirrors production monitoring
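A minimal trailing-window implementation of the rolling formula above (window length `w` is the caller's choice, e.g. 4 weekly points for a 4-week window):

```python
import numpy as np

def rolling_wape(actual, predicted, window):
    """Trailing-window WAPE (%) for each step from index `window - 1` onward."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    out = []
    for t in range(window - 1, len(a)):
        a_w = a[t - window + 1 : t + 1]
        p_w = p[t - window + 1 : t + 1]
        out.append(float(np.abs(a_w - p_w).sum() / np.abs(a_w).sum() * 100))
    return out
```

A flat series of rolling values signals stable forecast quality; a rising tail is exactly the degradation signal the monitoring DAG in Section 07 watches for.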

3.5 — RMSE (Validation Split)

Running RMSE on the held-out validation set confirms whether the training-set RMSE is generalisable or the model is overfitting to training noise.

⚠ Watch For
A large gap between train-RMSE and val-RMSE signals overfitting — trigger regularisation or reduce model complexity

3.6 — Bias / Mean Error

Mean Error (ME) captures the direction of error — whether the model is systematically over- or under-forecasting. A model with low RMSE but high bias is dangerous in inventory planning.

Formula — Mean Error (Bias)
$$\text{ME} = \frac{1}{n}\sum_{i=1}^{n}\left(P_i - A_i\right)$$
ME Value | Meaning           | Action
ME > 0   | Over-forecasting  | Check feature scaling / target leakage
ME < 0   | Under-forecasting | Review trend component or differencing
ME ≈ 0   | Unbiased          | Healthy — proceed to deployment gate
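The signed error is a one-liner, but worth isolating because averaging hides nothing here — a constant offset survives the mean, which is the point:

```python
import numpy as np

def mean_error(actual, predicted):
    """Signed bias: positive = over-forecasting, negative = under-forecasting."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float((p - a).mean())

# Low spread but persistent positive bias — dangerous for inventory planning
actual    = [100, 110, 120]
predicted = [105, 115, 125]   # always 5 units high
print(mean_error(actual, predicted))  # 5.0
```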

Stage C — Supplementary Metrics

3.7 — Mean Absolute Percentage Error (MAPE)

MAPE measures the per-observation percentage deviation of predictions from actual values. Best used alongside WAPE as a secondary business-facing metric.

Formula — MAPE
$$\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{A_i - P_i}{A_i}\right| \times 100$$
⚠ Limitations
Undefined when Actual = 0 · Sensitive to near-zero values · Biased toward under-forecasting

3.8 — MASE (Baseline Comparison)

Mean Absolute Scaled Error compares model MAE against a naïve seasonal baseline. MASE < 1 means the model outperforms a simple persistence forecast.

Formula — MASE
$$\text{MASE} = \frac{\text{MAE}_{\text{model}}}{\text{MAE}_{\text{naïve}}}$$
✔ Why MASE?
Scale-independent · Penalises failure to beat naive baseline · Essential for justifying model complexity to stakeholders
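A sketch of MASE using the simplest baseline: a lag-`season` persistence forecast on the actuals (with `season=1` this is "tomorrow equals today"; set `season` to the seasonal period for a seasonal naïve baseline):

```python
import numpy as np

def mase(actual, predicted, season=1):
    """Model MAE scaled by the MAE of a naive lag-`season` forecast."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    mae_model = np.abs(a - p).mean()
    mae_naive = np.abs(a[season:] - a[:-season]).mean()  # persistence baseline
    return float(mae_model / mae_naive)

# Model is off by 1 each step; naive baseline is off by 2 -> MASE = 0.5
print(mase([10, 12, 14, 16], [11, 13, 15, 17]))  # 0.5
```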

Complete Evaluation Stack

Stage         | Metric       | Category            | Why Needed
Training      | MAE          | Absolute Error      | Pure error magnitude in original units
Training      | RMSE         | Squared Error       | Penalises large errors / spikes
Training      | WAPE         | Percentage Error    | Business KPI — robust to zero values
Validation    | Rolling WAPE | Stability           | Detects time-period degradation
Validation    | RMSE (val)   | Generalisation      | Overfitting check
Validation    | Mean Error   | Bias                | Detects over/under-prediction direction
Supplementary | MAPE         | Percentage Error    | Per-observation % — business-friendly
Supplementary | MASE         | Baseline Comparison | Validates improvement over naïve forecast
— 03 —
Section 04
SHAP for Explainability

4.1 — Mathematical Foundation

SHAP (SHapley Additive exPlanations) is grounded in cooperative game theory. Every prediction is decomposed into additive feature contributions:

SHAP Additive Decomposition
$$f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i$$
  • φ₀ — Base value (expected model output over training data)
  • φᵢ — Shapley value: contribution of feature i
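The additive identity can be verified exactly without the shap library: for a linear model with independent features, the Shapley values have the closed form φᵢ = wᵢ(xᵢ − E[xᵢ]) and φ₀ = f(E[x]). The weights and data below are arbitrary illustrative values:

```python
import numpy as np

# Linear model f(x) = w·x + b; exact Shapley values for independent features
w, b = np.array([2.0, -1.0, 0.5]), 3.0
X_train = np.array([[1.0, 2.0, 4.0],
                    [3.0, 0.0, 2.0],
                    [2.0, 4.0, 0.0]])
x = np.array([4.0, 1.0, 2.0])

mu = X_train.mean(axis=0)       # E[x] over the background data
phi0 = float(w @ mu + b)        # base value: expected model output
phi = w * (x - mu)              # per-feature contributions phi_i

f_x = float(w @ x + b)
# Additivity: prediction = base value + sum of contributions
assert abs(f_x - (phi0 + phi.sum())) < 1e-9
print(phi0, phi.tolist())  # 6.0 [4.0, 1.0, 0.0]
```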

4.2 — Explainability Types

Type   | Visualization                 | Use Case
Global | Summary plot, feature ranking | Model validation, governance
Local  | Waterfall plot, force plot    | Single prediction explanation

4.3 — Why SHAP?

  • Explains individual model predictions with mathematical guarantees
  • Detects feature leakage during validation
  • Builds stakeholder and regulatory trust
  • Enables drift detection via importance shifts
⚠ SHAP is NOT Causal
SHAP explains how the model uses features — not what happens if you change a real-world feature. Causal inference requires A/B testing · DAGs · Propensity score matching · Do-calculus
— 04 —
Section 05
Integration with MLflow

MLflow acts as the tracking and governance layer for all experiments, artifacts, and registered models.

5.1 — Logging Metrics

Python
# Log full metric stack inside an MLflow run
import mlflow

with mlflow.start_run():
    # Training metrics
    mlflow.log_metric("mae", mae_value)
    mlflow.log_metric("rmse", rmse_value)
    mlflow.log_metric("wape", wape_value)
    # Validation metrics
    mlflow.log_metric("rolling_wape", rolling_wape_value)
    mlflow.log_metric("val_rmse", val_rmse_value)
    mlflow.log_metric("bias", mean_error_value)
    # Supplementary
    mlflow.log_metric("mape", mape_value)
    mlflow.log_metric("mase", mase_value)

5.2 — Logging SHAP Artifacts

Python
mlflow.log_artifact("shap_summary.png")
mlflow.log_artifact("waterfall_plot.png")

5.3 — Approval Tagging

Python
# ⚠ Tag values are always STRINGS in MLflow — use "True" not True
mlflow.set_tag("evaluator_approved", "True")
mlflow.set_tag("primary_metric", "wape")
⚠ Critical: Tags are Always Strings
mlflow.set_tag() stores values as strings internally. Always pass "True" not True — otherwise the string comparison in Airflow's approval gate will fail silently ("True" != True).

5.4 — Model Registration Gate

📦 All Conditions Must Pass (Automated + Human)
✔ WAPE ≤ threshold  ·  ✔ RMSE within range  ·  ✔ Bias (ME) ≈ 0
✔ MASE < 1  ·  ✔ SHAP validated  ·  ✔ evaluator_approved = True
✔ Human reviewer sign-off tag set in MLflow
Only then is the model registered in the MLflow Model Registry.
👤 Human-in-the-Loop (HITL) Review
Before automated registration, a Data Scientist or ML Engineer reviews the MLflow run — checks SHAP plots for unexpected feature behaviour, validates business alignment of WAPE, and sets the human_approved tag to "True". This prevents regressions that pass metrics but fail business logic.
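The full gate can be sketched as a single predicate. The threshold values below are illustrative placeholders, not the document's operational numbers; the second call demonstrates the string-tag pitfall called out in Section 5.3:

```python
def registration_gate(metrics: dict, tags: dict, wape_threshold: float = 15.0) -> bool:
    """All automated and human conditions must pass before registry push.
    Thresholds here are illustrative, not production values."""
    return all([
        metrics["wape"] <= wape_threshold,
        metrics["mase"] < 1.0,
        abs(metrics["bias"]) <= 0.5,                # ME ~ 0 tolerance
        tags.get("evaluator_approved") == "True",   # MLflow tags are strings
        tags.get("human_approved") == "True",
    ])

metrics = {"wape": 12.4, "mase": 0.8, "bias": 0.1}
print(registration_gate(metrics, {"evaluator_approved": "True",
                                  "human_approved": "True"}))  # True
# The classic pitfall: a boolean tag never equals the stored string "True"
print(registration_gate(metrics, {"evaluator_approved": True,
                                  "human_approved": "True"}))  # False
```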
— 05 —
Section 06
Airflow Orchestration

Apache Airflow automates the entire ML lifecycle via DAGs, providing dependency control, scheduling, retries, and monitoring.

6.1 — DAG Stage Breakdown

# | DAG Task              | Responsibility
1 | Data Ingestion        | Pull from source systems, validate schema
2 | Data Validation       | Nulls, distributions, outlier checks
3 | Model Training        | Run training with configured hyperparameters
4 | Evaluation Task       | Compute MAE · RMSE · WAPE · Bias · MASE, log to MLflow
5 | SHAP Analysis         | Generate SHAP plots, log artifacts to MLflow
6 | 👤 Human Review       | Data Scientist reviews MLflow run — inspects SHAP plots, validates business alignment, sets human_approved tag
7 | Automated Metric Gate | Airflow reads human_approved + metric thresholds — halts pipeline if any check fails
8 | Model Registration    | Push to MLflow Model Registry
9 | Deployment            | Serve model to production endpoint
✔ Airflow Guarantees
Cron scheduling · Dependency enforcement · Automatic retry logic · Retraining automation via sensors
— 06 —
Section 07
Monitoring Architecture

7.1 — Performance Monitoring

Periodically recompute the full metric stack on live production data. The primary production signals are WAPE trend over time, RMSE, and Bias — a threshold breach on any triggers the automated retraining DAG.

Signal         | Threshold Breach       | Action
WAPE (rolling) | ↑ Rising trend         | Trigger retraining DAG
RMSE           | > trained baseline     | Investigate data quality / anomalies
Bias (ME)      | Persistently ≠ 0       | Retrain with updated exogenous features
Data Drift     | PSI / KL divergence ↑  | Re-validate features, re-run the data pipeline
SHAP Shift     | Importance rank change | Pre-drift warning — act proactively

7.2 — SHAP Drift Monitoring

  1. Compute SHAP on training baseline and store importance scores
  2. Compute SHAP on current production data
  3. Compare distributions using KL divergence or PSI
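Step 3 can be sketched with PSI over two distributions of SHAP values. The shift magnitude and thresholds below are illustrative; in the real pipeline the baseline would be loaded from the MLflow artifact store rather than simulated:

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two distributions (e.g. SHAP values)."""
    edges = np.histogram_bin_edges(np.asarray(baseline, dtype=float), bins=bins)
    b, _ = np.histogram(baseline, bins=edges)
    c, _ = np.histogram(current, bins=edges)
    b = np.clip(b / b.sum(), 1e-6, None)  # smooth empty bins to avoid log(0)
    c = np.clip(c / c.sum(), 1e-6, None)
    return float(np.sum((c - b) * np.log(c / b)))

rng = np.random.default_rng(42)
base_shap = rng.normal(0.0, 1.0, 5000)   # stored training-baseline SHAP values
live_shap = base_shap + 0.8              # simulated production shift
# Common rule of thumb: PSI < 0.1 stable, 0.1–0.25 moderate, > 0.25 drifted
```

An unchanged distribution yields PSI ≈ 0; the 0.8σ shift above lands well into the "drifted" band, which is the condition that would fire the retraining DAG.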

7.3 — Drift Matrix

Scenario      | MAPE     | SHAP    | Meaning
Model Aging   | ↑ Rising | Stable  | Data drifted, retrain on recent data
Concept Drift | ↑ Rising | Changed | Patterns shifted, investigate features
Early Warning | Stable   | Changed | Pre-degradation — act proactively
Healthy       | Stable   | Stable  | Operating within acceptable bounds

7.4 — Retraining Trigger

🔄 Airflow Fires Retraining DAG If ANY Condition Is True
 • Rolling WAPE trend rising beyond threshold
 • RMSE > trained baseline threshold
 • Bias (Mean Error) persistently ≠ 0
 • Data drift detected (PSI / KL divergence)
 • SHAP feature importance distribution shifted
— 07 —
Section 08
Generalized Reusable Pipeline

8.1 — Project Structure

/mlops_framework/
  ├── data/
  ├── training/
  ├── evaluation/
  │   ├── metrics.py
  │   ├── shap_analysis.py
  │   └── approval_config.yaml
  ├── monitoring/
  ├── deployment/
  └── mlruns/

Compatible with: Regression · Time-series · Classification · Tree-based · Deep learning

8.2 — Framework Benefits

📐 Standardized Policy — Consistent evaluation across all projects
🔍 Explainable AI — SHAP for every model in registry
⚙️ Automated Lifecycle — End-to-end automation with a single human sign-off gate
📉 Reduced Risk — Multi-metric threshold gating
♻️ Reproducibility — Every run fully tracked in MLflow
📡 Drift-Aware — SHAP + metric monitoring loop

— 08 —
Section 09
Airflow & MLflow in Action

This section is tools-first, not theory-first. It shows exactly what Airflow does and what MLflow does at every metric checkpoint — MAPE, WAPE, and SHAP — in the real pipeline workflow.

9.1 — How Airflow Uses MAPE, WAPE & SHAP

Airflow does not compute metrics itself — it schedules and sequences the tasks that produce them. Here is how each metric fits into the DAG as a concrete Airflow operator:

DAG Task (Airflow Operator) | Metric Produced | What Airflow Does With It
evaluate_model_task    | MAPE, WAPE, RMSE, Bias               | Runs evaluation script, pushes results via XCom to the next task
approval_gate_task     | WAPE threshold check                 | Reads WAPE from XCom — if WAPE > threshold, marks task failed and halts DAG; pipeline stops before registration
shap_analysis_task     | SHAP importance scores               | Runs SHAP explainer, saves plots as artifacts, passes importance vector to monitoring sensor
monitoring_sensor_task | Rolling WAPE, MAPE trend, SHAP shift | Runs on schedule (e.g., daily at 02:00); compares live metrics against stored baseline — triggers retraining DAG if any threshold is breached
retrain_trigger_task   | MAPE rising / SHAP drift signal      | Uses Airflow TriggerDagRunOperator to fire the full training DAG automatically — no manual step needed
🛠 Airflow Role Summary
Airflow is the scheduler and gatekeeper. It decides when metrics are computed, whether the pipeline proceeds based on WAPE/MAPE thresholds, and if retraining fires based on SHAP drift or metric degradation. The metrics are the decision signals; Airflow acts on them.

9.2 — Airflow DAG: Metric-Gated Pipeline

Python — Airflow DAG
# Airflow DAG — every task is metric-aware
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

def evaluate(**ctx):
    wape = compute_wape(y_true, y_pred)
    mape = compute_mape(y_true, y_pred)
    ctx['ti'].xcom_push('wape', wape)
    ctx['ti'].xcom_push('mape', mape)

def approval_gate(**ctx):
    wape = ctx['ti'].xcom_pull(task_ids='evaluate', key='wape')
    if wape > WAPE_THRESHOLD:
        raise ValueError(f"WAPE {wape:.1f}% exceeds threshold — pipeline halted")

def shap_analysis(**ctx):
    shap_vals = generate_shap(model, X_val)
    save_shap_plots(shap_vals)  # saved as MLflow artifact
    ctx['ti'].xcom_push('shap_importance', shap_vals.mean(axis=0).tolist())

# DAG wiring
t_train >> t_evaluate >> t_shap >> t_gate >> t_register >> t_deploy

9.3 — How MLflow Uses MAPE, WAPE & SHAP

MLflow is the memory of the pipeline — it stores every metric, every artifact, and every model version so the team can compare, audit, and reproduce any run. Here is what MLflow concretely does with each metric:

MLflow Feature Used | Metric / Artifact | What MLflow Stores / Enables
mlflow.log_metric()         | MAPE, WAPE, RMSE, Bias                                 | Logs numeric values against the run ID — visible in the MLflow UI as time-series charts for every experiment
mlflow.log_artifact()       | SHAP summary plot, waterfall plot                      | Stores PNG/HTML SHAP visuals attached to the run — reviewable by any stakeholder without re-running code
mlflow.set_tag()            | WAPE approval flag, primary_metric                     | Tags the run with human-readable metadata — used by the approval gate to filter only evaluator_approved = "True" runs for promotion
Model Registry — Staging    | All thresholds passed (WAPE ✔ MAPE ✔ SHAP ✔ Human ✔)   | Promotion happens via Airflow, not the training script. Flow: script logs metrics → Airflow reads → Airflow decides → Airflow calls Registry API. Keeps governance centralised.
Model Registry — Production | Robust champion gate (WAPE delta + RMSE + Bias + SHAP) | Simple WAPE_new < WAPE_old is not enough — requires meaningful margin, no RMSE increase, bias within tolerance, no SHAP anomaly. See Section 9.6.

9.4 — MLflow Run: Metric + SHAP Logging

Python — MLflow Logging
with mlflow.start_run(run_name="sarimax_challenger_v3") as run:
    # ── Metrics that Airflow approval_gate_task reads via MLflow API ──
    mlflow.log_metric("wape", wape_value)            # primary gate signal
    mlflow.log_metric("mape", mape_value)            # secondary business metric
    mlflow.log_metric("rmse", rmse_value)
    mlflow.log_metric("rolling_wape", rolling_wape)  # monitoring baseline
    mlflow.log_metric("bias", mean_error)

    # ── SHAP artifacts stored for audit and drift comparison ──
    shap_vals = shap.Explainer(model)(X_val)
    shap.summary_plot(shap_vals, show=False)
    plt.savefig("shap_summary.png")
    mlflow.log_artifact("shap_summary.png")          # attached to this run

    # ── Tags: values MUST be strings in MLflow ──
    mlflow.set_tag("evaluator_approved", "True")     # "True" not True
    mlflow.set_tag("primary_metric", "wape")
    mlflow.set_tag("shap_validated", "True")

    # ── Promotion via AIRFLOW (not here) — training script only logs ──
    # Airflow reads this run_id from XCom and calls the Registry API.
    # Do NOT call register_model() inside the training script.

9.5 — End-to-End: Airflow Triggers, MLflow Stores

🔁 Combined Tool Workflow
Airflow triggers the evaluate task on schedule → training script logs WAPE, MAPE, SHAP to MLflow → human reviewer sets tag → Airflow reads MLflow run metrics via Client API → Airflow (not the script) calls MLflow Registry API to register → daily monitoring DAG reads stored SHAP baseline from MLflow and compares → if drift or metric degradation, Airflow fires retrain DAG.

9.6 — Robust Production Promotion Logic

A simple WAPE_new < WAPE_old check is insufficient. Unstable models can pass on one metric while regressing on others. The safe promotion gate requires a meaningful delta margin, no RMSE increase, bias within tolerance, and no SHAP anomaly.

Python — Airflow Promotion Task
def promote_to_production(**ctx):
    # Fetch challenger and champion metrics from MLflow
    challenger = get_run_metrics(challenger_run_id)
    champion = get_run_metrics(current_production_run_id)

    # Robust multi-condition promotion guard
    wape_improved = challenger["wape"] < champion["wape"] - WAPE_DELTA
    rmse_stable   = challenger["rmse"] <= champion["rmse"]
    bias_ok       = abs(challenger["bias"]) <= BIAS_LIMIT
    no_shap_drift = challenger["shap_drift_score"] < SHAP_DRIFT_THRESHOLD

    if wape_improved and rmse_stable and bias_ok and no_shap_drift:
        client.transition_model_version_stage(
            name="forecasting_model",
            version=challenger_version,
            stage="Production"
        )
    else:
        raise ValueError("Challenger failed promotion gate — champion retained")
⚠ Why Multi-Condition Matters
A model with WAPE_new < WAPE_old but rising RMSE or persistent bias will cause silent production degradation. The delta margin (e.g., 0.5%) prevents noise-driven over-promotion. The SHAP drift check catches feature behaviour anomalies before they affect customers.
Tool    | Its Role with MAPE / WAPE | Its Role with SHAP
Airflow | Schedules metric computation, gates pipeline on WAPE threshold, triggers retraining if MAPE/WAPE rising | Sequences SHAP analysis task, passes importance vector as XCom for drift comparison
MLflow  | Stores MAPE & WAPE per run, enables cross-experiment comparison in UI, tags approval status             | Stores SHAP plots as artifacts per run; baseline SHAP importance vector retrieved for drift detection
— 09 —
Section 10
Final Conclusion

By integrating all pillars of this framework:

  • MAE / RMSE → Pure error magnitude and outlier sensitivity during training
  • WAPE / Rolling WAPE → Business-level aggregate error and stability monitoring
  • Bias (ME) / MASE → Direction of error and baseline comparison for validation
  • SHAP → Explainability, drift detection, and pre-degradation early warning
  • MLflow → Experiment tracking, full metric registry, and governance gate
  • Airflow → Full lifecycle automation and multi-signal retraining orchestration
✅ Outcome
We transform a machine learning model into a continuous, explainable, monitored, and production-grade intelligent system — aligned to both technical rigor and real-world business requirements.
— 10 —
Created by Wormbyte