MLOps Strategy Overview

Welcome to the MLOps Strategy presentation for the Smart Pump Monitoring & Optimization System.

🟦 Apache Airflow
🔵 MLflow
🐋 Docker
🐙 GitHub Actions

Why These Four Tools?

| Tool | Why I Chose It | What It Solves |
|---|---|---|
| Apache Airflow | Industry-standard for ML workflow orchestration. Native Python. DAG-based dependencies. | Eliminates manual retraining. Handles retries, failure alerting, scheduling. |
| MLflow | Lightweight, self-hosted, no cloud dependency. Integrates with any Python ML library. | Tracks every experiment, versions every model, enforces promotion gates. |
| Docker | Ensures every environment — dev, staging, production — is identical. | Prevents "works on my machine" failures. Easy to redeploy anywhere. |
| GitHub Actions | Already in the code hosting platform. Free for standard workflows. | Automates testing before any code lands in production. |

Deliberately Not Using

| Tool | Reason |
|---|---|
| DVC | MLflow already handles data + model versioning at this scale. |
| Kubeflow | Kubernetes is overkill; Docker Compose is sufficient. |
| SageMaker / Vertex AI | Deployment is on an on-premises Linux server; no cloud budget. |
| Weights & Biases | Self-hosted MLflow covers all our tracking needs. |

🎯 Mission Statement 🎯


Every model trained by this team should be automatically tracked, fairly evaluated against what's already in production, and deployed only if it genuinely improves. If it doesn't improve, we stay safe. If the system breaks, we recover in under 5 minutes.

1. My Role & Boundaries

As the MLOps Engineer, my responsibility is to build and maintain the infrastructure that makes the ML system reliable, reproducible, and continuously improving over time.

I do not own the data ingestion, sensor logic, PLC integration, dashboards, or alert rules.

My job is to make sure any model trained by this team can be tracked, versioned, deployed, and automatically retrained — without manual effort.
| Responsibility | My Deliverable |
|---|---|
| Orchestration | Airflow DAGs for all ML workflows |
| Experiment Tracking | MLflow setup, logging standards, model registry |
| Containerization | Docker Compose stack for all services |
| CI/CD | GitHub Actions pipelines for testing and deployment |
| Model Governance | Promotion rules, versioning policy, rollback procedures |
| ML System Health | Monitoring drift, retraining triggers, performance degradation |

2. MLOps Architecture Overview

A visual representation of the boundaries and data logic within the MLOps scope.

graph TD
    classDef scope fill:#1e293b,stroke:#3b82f6,stroke-width:2px,color:#f8fafc;
    classDef tool fill:#f1f5f9,stroke:#64748b,stroke-width:1px,color:#1e293b;
    classDef decision fill:#fee2e2,stroke:#ef4444,stroke-width:1px,color:#991b1b;

    MLOps["MLOps Layer (My Scope)"]:::scope
    Orchestration["Orchestration (Apache Airflow)<br/>- DAG design<br/>- Schedules<br/>- Retries<br/>- Alerts"]:::tool
    Registry["Experiment & Model Registry (MLflow)<br/>- Run logging<br/>- Versioning<br/>- Registry<br/>- Promotion"]:::tool
    Container["Containerization & CI/CD<br/>- Docker Services<br/>- Lint & Test<br/>- GitHub Actions deploy"]:::tool
    Opt["Optimization Execution<br/>Decided by another team member.<br/>I will implement the infrastructure for whichever approach is chosen."]:::decision

    MLOps --> Orchestration
    MLOps --> Registry
    MLOps --> Container
    Orchestration --> Opt
    Registry --> Opt
    Container --> Opt

3. Airflow — Orchestration Strategy

My DAG Design Plan

I will build and maintain 4 core DAGs:

| DAG | Schedule | What It Does | Why This Schedule |
|---|---|---|---|
| weekly_retrain_dag | Every Sunday at 02:00 | Full model retrain + evaluation + conditional promotion | Sunday is the least operationally busy day; 02:00 is a dead zone; the model is ready before Monday. |
| data_quality_dag | Daily at 00:30 | Validates data freshness and completeness before training | Daily runs catch bad data early, before the 02:00 retrain. |
| alert_sweep_dag | Every 5 minutes | Scans readings_wide flag columns, writes to alerts_log | 5 min is the minimum sustained-condition window that covers all alert rules without spamming logs. |
| model_validation_dag | Triggered by retrain DAG | Compares challenger vs. production model; decides promotion | Event-triggered, since it must only run after a fresh retrain. |

weekly_retrain_dag Flow

graph TD
    classDef task fill:#e0f2fe,stroke:#0284c7,stroke-width:1px,color:#0c4a6e;
    classDef eval fill:#ffedd5,stroke:#c2410c,stroke-width:1px,color:#7c2d12;

    Start(["Trigger: Sunday 02:00"]) --> ext["task_1: data_extraction"]:::task
    ext --> val["task_2: data_validation"]:::task
    val --> feat["task_3: feature_engineering"]:::task
    feat --> train["task_4: train_model<br/>Algorithm decided by ML Engineer<br/>Log params & metrics to MLflow"]:::task
    train --> comp["task_5: compare_vs_production"]:::eval
    comp -->|BETTER RMSE| prom["task_6a: promote_model<br/>to Production"]:::task
    comp -->|NOT BETTER| arch["task_6b: archive_run<br/>to Staging only"]:::task
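The decision in task_5 can be sketched as a plain Python function (hypothetical task names; in Airflow this would typically be the `python_callable` of a `BranchPythonOperator`, which returns the task_id of the downstream task to run):

```python
def choose_promotion_branch(challenger_rmse: float, production_rmse: float,
                            margin: float = 0.02) -> str:
    """Return the task_id of the downstream task to execute.

    The challenger must beat production RMSE by more than the margin
    (2%, matching the promotion gate) to be promoted.
    """
    if challenger_rmse < production_rmse * (1 - margin):
        return "task_6a_promote_model"   # challenger is genuinely better
    return "task_6b_archive_run"         # keep the current production model
```

Returning a task_id string is the standard Airflow branching pattern, so the same rule that gates promotion also steers the DAG.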

Airflow Engineering Decisions

| Decision | Choice | Reasoning |
|---|---|---|
| Executor | LocalExecutor | Sufficient for our workload; no need for Celery or Kubernetes. |
| Retry policy | 2 retries, 5 min delay | Handles transient DB/network issues without spamming alerts. |
| Failure alerting | Email on DAG failure | Immediate notification to the team when retraining breaks. |
| Backfill | Disabled for retrain DAG | Retraining on stale triggers is not meaningful. |
| Data skip guard | AirflowSkipException | Prevents silently training on bad data. |
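The retry policy and the data skip guard could look roughly like this (a sketch with hypothetical names; in the real DAG the guard would raise `airflow.exceptions.AirflowSkipException` so the run is skipped rather than marked failed):

```python
from datetime import timedelta

# Passed to every DAG via DAG(default_args=...)
default_args = {
    "retries": 2,                          # 2 retries per task
    "retry_delay": timedelta(minutes=5),   # 5 min between retries
    "email_on_failure": True,              # failure alerting by email
}

class BadDataError(Exception):
    """Stand-in for airflow.exceptions.AirflowSkipException in this sketch."""

def guard_training_data(row_count: int, minimum: int = 36_000) -> None:
    """Abort the run early instead of silently training on too little data."""
    if row_count < minimum:
        raise BadDataError(f"Only {row_count} rows available; need {minimum}.")
```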

4. MLflow — Experiment & Model Strategy

Experiment Structure

I will organize MLflow into 2 experiments — one per model purpose.

⚠️ Note: The specific algorithm for each model is the ML Engineer's decision, not mine. My job is to build tracking infrastructure that works regardless of the chosen algorithm.
MLflow Experiments
│
├── 📁 tank_forecasting_model
│     Every Sunday retrain = 1 run
│     Params: [decided by ML Engineer — logged as-is]
│     Metrics: rmse_val, mae_val, rmse_train
│     Artifacts: model.pkl, forecast_plot.png
│
└── 📁 pump_efficiency_model
      Every Sunday retrain = 1 run
      Params: [decided by ML Engineer — logged as-is]
      Metrics: rmse_val, mae_val, r2_score
      Artifacts: model.pkl, feature_importance.png

Recommended Professional Setup

| Component | Cadence | Reasoning |
|---|---|---|
| 🔮 Forecast Model | Weekly retrain | Demand patterns at a pump station are stable and seasonal. Weekly is sufficient. |
| Efficiency Model | Weekly retrain | Pump degradation is gradual. Weekly retraining captures the trend. |
| 📡 Drift Monitoring | Daily check | Abnormal sensor behaviour can appear any day. Daily catches this early. |
| 📊 Predictions / Output | Hourly / per-minute | Operational usage — depends on the optimization approach decided by the team. |

Logging Standards Enforced via mlops_utils.py

mlflow.log_param("model_type", ...)  # e.g. "forecasting" or "efficiency"
mlflow.log_param("training_days", 30)
mlflow.log_param("training_date", ...)

mlflow.log_metric("rmse_val", ...)
mlflow.log_metric("mae_val", ...)

mlflow.log_artifact("model.pkl")
mlflow.set_tag("promoted", "false")
mlflow.set_tag("data_start", ...)
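A thin helper in mlops_utils.py could build the standard params and tags in one place so every training script logs the same keys (a sketch; the function name is hypothetical, and the resulting dicts would be passed to `mlflow.log_params` / `mlflow.set_tags`):

```python
from datetime import date

def standard_run_metadata(model_type, training_days, data_start):
    """Return (params, tags) dicts matching the team logging standard."""
    params = {
        "model_type": model_type,                    # "forecasting" or "efficiency"
        "training_days": training_days,
        "training_date": date.today().isoformat(),
    }
    tags = {
        "promoted": "false",    # flipped to "true" only by the validation DAG
        "data_start": data_start,
    }
    return params, tags
```

Centralizing the keys prevents the classic failure mode where two training scripts log slightly different param names and runs become incomparable.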

Promotion Gate Rules

| Rule | Threshold | Action if Failed |
|---|---|---|
| Validation RMSE must improve | New RMSE < Prod RMSE − 2% | Reject promotion, stay on current prod |
| No NaN in predictions (holdout) | 0 NaN allowed | Hard reject, flag for investigation |
| Training data volume | Min 36,000 rows | Skip training, alert team |
| Run time limit | < 30 min wall clock | Timeout, fail DAG |
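The first three rules can be sketched as one gate function (a minimal sketch; the 30-minute wall-clock limit would be enforced separately via the task's `execution_timeout` in Airflow, not in this check):

```python
import math

def passes_promotion_gate(new_rmse, prod_rmse, holdout_predictions, training_rows):
    """Apply the promotion gate rules in order; any failure rejects promotion."""
    if training_rows < 36_000:                            # data volume rule
        return False
    if any(math.isnan(p) for p in holdout_predictions):   # NaN rule: hard reject
        return False
    return new_rmse < prod_rmse * 0.98                    # must improve by > 2%
```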

5. Docker — Deployment Strategy

Containerized Services

The docker-compose.yaml manages these 4 core services:

services:
  postgres:            # Shared DB — not my data design, but I host it
    image: postgres:15
    ports:
      - "5432:5432"

  airflow-webserver:   # DAG UI and API
    build: ./airflow
    ports:
      - "8080:8080"

  airflow-scheduler:   # DAG execution engine
    build: ./airflow

  mlflow:              # Experiment tracking UI + model registry
    build: ./mlflow
    ports:
      - "5000:5000"
    volumes:
      - ./mlruns:/mlruns    # Persisted locally

The Streamlit interface and optimization layers are owned by other team members; I provide the postgres and mlflow services as dependencies they consume.

Volume & Persistence Strategy

| Data | Persistence Method |
|---|---|
| PostgreSQL data | Named Docker volume — survives container restarts |
| MLflow run data | Bind mount ./mlruns to the host filesystem |
| Airflow logs | Bind mount ./logs to the host filesystem |
| Airflow DAGs | Bind mount ./dags — live sync without restart |

6. GitHub Actions — CI/CD Strategy

⚠️ Not Yet Decided

The CI/CD tooling, pipeline structure, and deployment approach will be defined once the team finalizes the development workflow.

My responsibility will be to set up and maintain whatever CI/CD pipeline is agreed upon — covering code quality checks, testing, and automated deployment to the Linux server.

7. Model Governance Strategy

The Core Question Answered Every Week

"Is the model we're using in production still the best one available?"

Governance Policy

| Policy | Rule |
|---|---|
| One production model at a time | Only one version of each model type lives in the Production stage. |
| Never delete runs | All MLflow runs are archived, never deleted — full audit trail. |
| Promotion requires validation | No model goes to Production without passing the validation DAG. |
| A/B testing not required | System is advisory; champion/challenger swaps are instantaneous. |
| Rollback is one call | MlflowClient.transition_model_version_stage() |

Performance Tracking Table (model_performance)

Every week, I populate this Postgres table which the Streamlit team consumes for visualizations:

| run_id | model_type | rmse_val | mae_val | promoted (0/1) | training_date |
|---|---|---|---|---|---|

8. Monitoring the ML System Itself

The rest of the team monitors the pump. I monitor the ML system.

What I Watch & Alert On

| Signal | How I Detect It | Action / Alert |
|---|---|---|
| Model performance drift | Weekly RMSE trend in DB | Alert if RMSE rises >15% week-over-week |
| Data volume drop | Row count check in DAG | Skip training + email alert |
| Retraining failure | Airflow DAG failure notification | Investigate + manual trigger if needed |
| Serving/output outage | Missing rows in predictions table | Alert responsible team member |
| Stale production model | Age check (>14 days) | Auto-trigger unscheduled retrain |
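The week-over-week drift check is simple enough to sketch directly (a sketch; it assumes the rmse_val history comes from the model_performance table, oldest entry first):

```python
def rmse_drift_alert(weekly_rmse, threshold=0.15):
    """True if validation RMSE rose more than 15% between the last two weeks."""
    if len(weekly_rmse) < 2:
        return False                     # not enough history to compare
    previous, latest = weekly_rmse[-2], weekly_rmse[-1]
    return latest > previous * (1 + threshold)
```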

9. Rollback & Recovery Plan

Scenario: Bad Model Promoted

Total recovery time: < 5 minutes.

  1. Detection: RMSE spike in the DB, or the dashboard team reports bad predictions.
  2. Identify the previous Production run_id in MLflow.
  3. Transition that known-good version back to Production via MlflowClient.transition_model_version_stage.
  4. The serving layer picks up the restored Production model on its next scheduled run.
  5. Demote the bad model to the Archived stage.
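The rollback maps onto the MLflow client API roughly like this (a sketch; the helper names are hypothetical, and it assumes the MLflow tracking server from the Docker stack is reachable):

```python
def pick_rollback_version(versions):
    """Pick the newest registered version that is not the current Production model.

    `versions` is a list of dicts like {"version": "2", "stage": "Archived"}.
    """
    candidates = [v for v in versions if v["stage"] != "Production"]
    return max(candidates, key=lambda v: int(v["version"]))["version"]

def rollback_to(model_name, good_version):
    """Push a known-good version back to Production."""
    from mlflow.tracking import MlflowClient   # imported lazily in this sketch
    client = MlflowClient()
    client.transition_model_version_stage(
        name=model_name,
        version=good_version,
        stage="Production",
        archive_existing_versions=True,        # demotes the bad model in one call
    )
```

`archive_existing_versions=True` is what makes rollback a single call: restoring the old version and archiving the bad one happen together.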

Scenario: Airflow Scheduler Down

docker-compose restart airflow-scheduler

# If data corruption:
docker-compose down
docker-compose up -d
# DAG state is stored in Postgres — recovers automatically

Scenario: MLflow Server Down

docker-compose restart mlflow

# MLflow data is bind-mounted → no data loss on restart.
# Impact on the serving layer depends on the execution approach 
# decided by the responsible team member.

10. What I Own vs What I Consume

Clear boundaries prevent duplicate work and system conflicts:

| System Component | Status | Notes |
|---|---|---|
| Airflow DAGs | ✅ 100% Mine | Full creation and monitoring |
| MLflow Server & Registry | ✅ 100% Mine | Infrastructure + promotion rules |
| Docker Compose Stack | ✅ 100% Mine | Base platform setup |
| Infrastructure for Optimization | ✅ Mine (once confirmed) | I set it up once the approach is decided |
| model_performance table | ✅ I populate it | Written from the MLflow pipeline |
| Optimization Execution Approach | ❌ Not Mine | Decided by responsible team member |
| Training Algorithms | ❌ Not Mine | Written by the ML Engineer |
| readings_wide table | ❌ Not Mine | I consume it from Data Engineering |
| Streamlit Dashboard | ❌ Not Mine | Dashboard Team |