MLOps Strategy Overview

Welcome to the MLOps Strategy presentation for the Smart Pump Monitoring & Optimization System.

🟦 Apache Airflow
🔵 MLflow
🐋 Docker
🐙 GitHub Actions

Why These Four Tools?

| Tool | Why I Chose It | What It Solves |
|---|---|---|
| Apache Airflow | Industry-standard for ML workflow orchestration. Native Python. DAG-based dependencies. | Eliminates manual retraining. Handles retries, failure alerting, scheduling. |
| MLflow | Lightweight, self-hosted, no cloud dependency. Integrates with any Python ML library. | Tracks every experiment, versions every model, enforces promotion gates. |
| Docker | Ensures every environment — dev, staging, production — is identical. | Prevents "works on my machine" failures. Easy to redeploy anywhere. |
| GitHub Actions | Already in the code hosting platform. Free for standard workflows. | Automates testing before any code lands in production. |

Deliberately Not Using

| Tool | Reason |
|---|---|
| DVC | MLflow already handles data + model versioning at this scale. |
| Kubeflow | Kubernetes is overkill; Docker Compose is sufficient. |
| SageMaker / Vertex AI | Deployment is on an on-premises Linux server; no cloud budget. |
| Weights & Biases | Self-hosted MLflow covers all our tracking needs. |

🎯 Mission Statement 🎯


Every model trained by this team should be automatically tracked, fairly evaluated against what's already in production, and deployed only if it genuinely improves. If it doesn't improve, we stay safe. If the system breaks, we recover in under 5 minutes.

1. My Role & Boundaries

As the MLOps Engineer, my responsibility is to build and maintain the infrastructure that makes the ML system reliable, reproducible, and continuously improving over time.

I do not own the data ingestion, sensor logic, PLC integration, dashboards, or alert rules.

My job is to make sure any model trained by this team can be tracked, versioned, deployed, and automatically retrained — without manual effort.
| Responsibility | My Deliverable |
|---|---|
| Orchestration | Airflow DAGs for all ML workflows |
| Experiment Tracking | MLflow setup, logging standards, model registry |
| Containerization | Docker Compose stack for all services |
| CI/CD | GitHub Actions pipelines for testing and deployment |
| Model Governance | Promotion rules, versioning policy, rollback procedures |
| ML System Health | Monitoring drift, retraining triggers, performance degradation |

2. MLOps Architecture Overview

A visual representation of the boundaries and data logic within the MLOps scope.

graph TD
    classDef scope fill:#1e293b,stroke:#3b82f6,stroke-width:2px,color:#f8fafc;
    classDef tool fill:#f1f5f9,stroke:#64748b,stroke-width:1px,color:#1e293b;
    classDef decision fill:#fee2e2,stroke:#ef4444,stroke-width:1px,color:#991b1b;

    MLOps["MLOps Layer (My Scope)"]:::scope
    Orchestration["Orchestration (Apache Airflow)<br/>- DAG design<br/>- Schedules<br/>- Retries<br/>- Alerts"]:::tool
    Registry["Experiment & Model Registry (MLflow)<br/>- Run logging<br/>- Versioning<br/>- Registry<br/>- Promotion"]:::tool
    Container["Containerization & CI/CD<br/>- Docker Services<br/>- Lint & Test<br/>- GitHub Actions deploy"]:::tool
    Opt["Optimization Execution<br/>Decided by another team member.<br/>I will implement the infrastructure for whichever approach is chosen."]:::decision

    MLOps --> Orchestration
    MLOps --> Registry
    MLOps --> Container
    Orchestration --> Opt
    Registry --> Opt
    Container --> Opt

3. Airflow — Orchestration Strategy

My DAG Design Plan

I will build and maintain 4 core DAGs:

| DAG | Schedule | What It Does | Why This Schedule |
|---|---|---|---|
| weekly_retrain_dag | Every Sunday at 02:00 | Full model retrain + evaluation + conditional promotion | Sunday is the least operationally busy day; 02:00 is a dead zone; the model is ready before Monday. |
| data_quality_dag | Daily at 00:30 | Validates data freshness and completeness before training | Daily runs catch bad data early, before the 02:00 retrain. |
| alert_sweep_dag | Every 5 minutes | Scans readings_wide flag columns, writes to alerts_log | 5 min is the minimum sustained-condition window that covers all alert rules without spamming logs. |
| model_validation_dag | Triggered by retrain DAG | Compares challenger vs. production model; decides promotion | Event-triggered, since it must only run after a fresh retrain. |

weekly_retrain_dag Flow

graph TD
    classDef task fill:#e0f2fe,stroke:#0284c7,stroke-width:1px,color:#0c4a6e;
    classDef eval fill:#ffedd5,stroke:#c2410c,stroke-width:1px,color:#7c2d12;

    Start(["Trigger: Sunday 02:00"]) --> ext["task_1: data_extraction"]:::task
    ext --> val["task_2: data_validation"]:::task
    val --> feat["task_3: feature_engineering"]:::task
    feat --> train["task_4: train_model<br/>Algorithm decided by ML Engineer<br/>Log params & metrics to MLflow"]:::task
    train --> comp["task_5: compare_vs_production"]:::eval
    comp -->|BETTER RMSE| prom["task_6a: promote_model<br/>to Production"]:::task
    comp -->|NOT BETTER| arch["task_6b: archive_run<br/>to Staging only"]:::task
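The decision in task_5 can be sketched as a plain Python function (hypothetical task names; in Airflow this would typically be the `python_callable` of a `BranchPythonOperator`, which returns the task_id of the downstream task to run):

```python
def choose_promotion_branch(challenger_rmse: float, production_rmse: float,
                            margin: float = 0.02) -> str:
    """Return the task_id of the downstream task to execute.

    The challenger must beat production RMSE by more than the margin
    (2%, matching the promotion gate) to be promoted.
    """
    if challenger_rmse < production_rmse * (1 - margin):
        return "task_6a_promote_model"   # challenger is genuinely better
    return "task_6b_archive_run"         # keep the current production model
```

Returning a task_id string is the standard Airflow branching pattern, so the same rule that gates promotion also steers the DAG.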

Airflow Engineering Decisions

| Decision | Choice | Reasoning |
|---|---|---|
| Executor | LocalExecutor | Sufficient for our workload; no need for Celery or Kubernetes. |
| Retry policy | 2 retries, 5 min delay | Handles transient DB/network issues without spamming alerts. |
| Failure alerting | Email on DAG failure | Immediate notification to the team when retraining breaks. |
| Backfill | Disabled for retrain DAG | Retraining on stale triggers is not meaningful. |
| Data skip guard | AirflowSkipException | Prevents silently training on bad data. |
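The retry policy and the data skip guard could look roughly like this (a sketch with hypothetical names; in the real DAG the guard would raise `airflow.exceptions.AirflowSkipException` so the run is skipped rather than marked failed):

```python
from datetime import timedelta

# Passed to every DAG via DAG(default_args=...)
default_args = {
    "retries": 2,                          # 2 retries per task
    "retry_delay": timedelta(minutes=5),   # 5 min between retries
    "email_on_failure": True,              # failure alerting by email
}

class BadDataError(Exception):
    """Stand-in for airflow.exceptions.AirflowSkipException in this sketch."""

def guard_training_data(row_count: int, minimum: int = 36_000) -> None:
    """Abort the run early instead of silently training on too little data."""
    if row_count < minimum:
        raise BadDataError(f"Only {row_count} rows available; need {minimum}.")
```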

4. MLflow — Experiment & Model Strategy

Experiment Structure

I will organize MLflow into 2 experiments — one per model purpose.

⚠️ Note: The specific algorithm for each model is the ML Engineer's decision, not mine. My job is to build tracking infrastructure that works regardless of the chosen algorithm.
MLflow Experiments
│
├── 📁 tank_forecasting_model
│     Every Sunday retrain = 1 run
│     Params: [decided by ML Engineer — logged as-is]
│     Metrics: rmse_val, mae_val, rmse_train
│     Artifacts: model.pkl, forecast_plot.png
│
└── 📁 pump_efficiency_model
      Every Sunday retrain = 1 run
      Params: [decided by ML Engineer — logged as-is]
      Metrics: rmse_val, mae_val, r2_score
      Artifacts: model.pkl, feature_importance.png

Recommended Professional Setup

| Component | Cadence | Reasoning |
|---|---|---|
| 🔮 Forecast Model | Weekly retrain | Demand patterns at a pump station are stable and seasonal. Weekly is sufficient. |
| Efficiency Model | Weekly retrain | Pump degradation is gradual. Weekly retraining captures the trend. |
| 📡 Drift Monitoring | Daily check | Abnormal sensor behaviour can appear any day. Daily catches this early. |
| 📊 Predictions / Output | Hourly / per-minute | Operational usage — depends on the optimization approach decided by the team. |

Logging Standards Enforced via mlops_utils.py

mlflow.log_param("model_type", ...)  # e.g. "forecasting" or "efficiency"
mlflow.log_param("training_days", 30)
mlflow.log_param("training_date", ...)

mlflow.log_metric("rmse_val", ...)
mlflow.log_metric("mae_val", ...)

mlflow.log_artifact("model.pkl")
mlflow.set_tag("promoted", "false")
mlflow.set_tag("data_start", ...)
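A thin helper in mlops_utils.py could build the standard params and tags in one place so every training script logs the same keys (a sketch; the function name is hypothetical, and the resulting dicts would be passed to `mlflow.log_params` / `mlflow.set_tags`):

```python
from datetime import date

def standard_run_metadata(model_type, training_days, data_start):
    """Return (params, tags) dicts matching the team logging standard."""
    params = {
        "model_type": model_type,                    # "forecasting" or "efficiency"
        "training_days": training_days,
        "training_date": date.today().isoformat(),
    }
    tags = {
        "promoted": "false",    # flipped to "true" only by the validation DAG
        "data_start": data_start,
    }
    return params, tags
```

Centralizing the keys prevents the classic failure mode where two training scripts log slightly different param names and runs become incomparable.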

Promotion Gate Rules

| Rule | Threshold | Action if Failed |
|---|---|---|
| Validation RMSE must improve | New RMSE < Prod RMSE − 2% | Reject promotion, stay on current prod |
| No NaN in predictions (holdout) | 0 NaN allowed | Hard reject, flag for investigation |
| Training data volume | Min 36,000 rows | Skip training, alert team |
| Run time limit | < 30 min wall clock | Timeout, fail DAG |
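The first three rules can be sketched as one gate function (a minimal sketch; the 30-minute wall-clock limit would be enforced separately via the task's `execution_timeout` in Airflow, not in this check):

```python
import math

def passes_promotion_gate(new_rmse, prod_rmse, holdout_predictions, training_rows):
    """Apply the promotion gate rules in order; any failure rejects promotion."""
    if training_rows < 36_000:                            # data volume rule
        return False
    if any(math.isnan(p) for p in holdout_predictions):   # NaN rule: hard reject
        return False
    return new_rmse < prod_rmse * 0.98                    # must improve by > 2%
```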

5. Docker — Deployment Strategy

Containerized Services

The docker-compose.yaml manages these 4 core services:

services:
  postgres:            # Shared DB — not my data design, but I host it
    image: postgres:15
    ports:
      - "5432:5432"

  airflow-webserver:   # DAG UI and API
    build: ./airflow
    ports:
      - "8080:8080"

  airflow-scheduler:   # DAG execution engine
    build: ./airflow

  mlflow:              # Experiment tracking UI + model registry
    build: ./mlflow
    ports:
      - "5000:5000"
    volumes:
      - ./mlruns:/mlruns    # Persisted locally

The Streamlit interface and optimization layers are owned by other team members; I provide the postgres and mlflow services as dependencies they consume.

Volume & Persistence Strategy

| Data | Persistence Method |
|---|---|
| PostgreSQL data | Named Docker volume — survives container restarts |
| MLflow run data | Bind mount ./mlruns to the host filesystem |
| Airflow logs | Bind mount ./logs to the host filesystem |
| Airflow DAGs | Bind mount ./dags — live sync without restart |

6. GitHub Actions — CI/CD Strategy

⚠️ Not Yet Decided

The CI/CD tooling, pipeline structure, and deployment approach will be defined once the team finalizes the development workflow.

My responsibility will be to set up and maintain whatever CI/CD pipeline is agreed upon — covering code quality checks, testing, and automated deployment to the Linux server.

7. Model Governance Strategy

The Core Question Answered Every Week

"Is the model we're using in production still the best one available?"

Governance Policy

| Policy | Rule |
|---|---|
| One production model at a time | Only one version of each model type lives in the Production stage. |
| Never delete runs | All MLflow runs are archived, never deleted — full audit trail. |
| Promotion requires validation | No model goes to Production without passing the validation DAG. |
| A/B testing not required | System is advisory; champion/challenger swaps are instantaneous. |
| Rollback is one call | MlflowClient.transition_model_version_stage() |

Performance Tracking Table (model_performance)

Every week, I populate this Postgres table which the Streamlit team consumes for visualizations:

| run_id | model_type | rmse_val | mae_val | promoted (0/1) | training_date |
|---|---|---|---|---|---|

8. Monitoring the ML System Itself

The rest of the team monitors the pump. I monitor the ML system.

What I Watch & Alert On

| Signal | How I Detect It | Action / Alert |
|---|---|---|
| Model performance drift | Weekly RMSE trend in DB | Alert if RMSE rises >15% week-over-week |
| Data volume drop | Row count check in DAG | Skip training + email alert |
| Retraining failure | Airflow DAG failure notification | Investigate + manual trigger if needed |
| Serving/output outage | Missing rows in predictions table | Alert responsible team member |
| Stale production model | Age check (>14 days) | Auto-trigger unscheduled retrain |
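The week-over-week drift check is simple enough to sketch directly (a sketch; it assumes the rmse_val history comes from the model_performance table, oldest entry first):

```python
def rmse_drift_alert(weekly_rmse, threshold=0.15):
    """True if validation RMSE rose more than 15% between the last two weeks."""
    if len(weekly_rmse) < 2:
        return False                     # not enough history to compare
    previous, latest = weekly_rmse[-2], weekly_rmse[-1]
    return latest > previous * (1 + threshold)
```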

9. Rollback & Recovery Plan

Scenario: Bad Model Promoted

Total recovery time: < 5 minutes.

  1. Detection: RMSE spike in the DB, or the dashboard team reports bad predictions.
  2. Identify the previous Production run_id in MLflow.
  3. Transition that known-good version back to Production via MlflowClient.transition_model_version_stage.
  4. The serving layer picks up the restored Production model on its next scheduled run.
  5. Demote the bad model to the Archived stage.
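The rollback maps onto the MLflow client API roughly like this (a sketch; the helper names are hypothetical, and it assumes the MLflow tracking server from the Docker stack is reachable):

```python
def pick_rollback_version(versions):
    """Pick the newest registered version that is not the current Production model.

    `versions` is a list of dicts like {"version": "2", "stage": "Archived"}.
    """
    candidates = [v for v in versions if v["stage"] != "Production"]
    return max(candidates, key=lambda v: int(v["version"]))["version"]

def rollback_to(model_name, good_version):
    """Push a known-good version back to Production."""
    from mlflow.tracking import MlflowClient   # imported lazily in this sketch
    client = MlflowClient()
    client.transition_model_version_stage(
        name=model_name,
        version=good_version,
        stage="Production",
        archive_existing_versions=True,        # demotes the bad model in one call
    )
```

`archive_existing_versions=True` is what makes rollback a single call: restoring the old version and archiving the bad one happen together.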

Scenario: Airflow Scheduler Down

docker-compose restart airflow-scheduler

# If data corruption:
docker-compose down
docker-compose up -d
# DAG state is stored in Postgres — recovers automatically

Scenario: MLflow Server Down

docker-compose restart mlflow

# MLflow data is bind-mounted → no data loss on restart.
# Impact on the serving layer depends on the execution approach 
# decided by the responsible team member.

10. What I Own vs What I Consume

Clear boundaries prevent duplicate work and system conflicts:

| System Component | Status | Notes |
|---|---|---|
| Airflow DAGs | ✅ 100% Mine | Full creation and monitoring |
| MLflow Server & Registry | ✅ 100% Mine | Infrastructure + promotion rules |
| Docker Compose Stack | ✅ 100% Mine | Base platform setup |
| Infrastructure for Optimization | ✅ Mine (once confirmed) | I set it up once the approach is decided |
| model_performance table | ✅ I populate it | Written from the MLflow pipeline |
| Optimization Execution Approach | ❌ Not Mine | Decided by responsible team member |
| Training Algorithms | ❌ Not Mine | Written by the ML Engineer |
| readings_wide table | ❌ Not Mine | I consume it from Data Engineering |
| Streamlit Dashboard | ❌ Not Mine | Dashboard Team |