MLOps Strategy Overview
Welcome to the MLOps Strategy presentation for the Smart Pump Monitoring & Optimization System.
Why These Four Tools?
| Tool | Why I Chose It | What It Solves |
|---|---|---|
| Apache Airflow | Industry-standard for ML workflow orchestration. Native Python. DAG-based dependencies. | Eliminates manual retraining. Handles retries, failure alerting, scheduling. |
| MLflow | Lightweight, self-hosted, no cloud dependency. Integrates with any Python ML library. | Tracks every experiment, versions every model, enforces promotion gates. |
| Docker | Ensures every environment — dev, staging, production — is identical. | Prevents "works on my machine" failures. Easy to redeploy anywhere. |
| GitHub Actions | Already in the code hosting platform. Free for standard workflows. | Automates testing before any code lands in production. |
Deliberately Not Using
| Tool | Why Not |
|---|---|
| DVC | MLflow already handles data + model versioning for this scale. |
| Kubeflow | Kubernetes is overkill; Docker Compose is sufficient. |
| SageMaker / Vertex AI | On-premises Linux server; no cloud budget. |
| Weights & Biases | MLflow self-hosted covers all our tracking needs. |
🎯 Mission Statement 🎯
Every model trained by this team should be automatically tracked, fairly evaluated against what's already in production, and deployed only if it genuinely improves. If it doesn't improve, we stay safe. If the system breaks, we recover in under 5 minutes.
1. My Role & Boundaries
As the MLOps Engineer, my responsibility is to build and maintain the infrastructure that makes the ML system reliable, reproducible, and continuously improving over time.
My job is to make sure any model trained by this team can be tracked, versioned, deployed, and automatically retrained — without manual effort.
| Responsibility | My Deliverable |
|---|---|
| Orchestration | Airflow DAGs for all ML workflows |
| Experiment Tracking | MLflow setup, logging standards, model registry |
| Containerization | Docker Compose stack for all services |
| CI/CD | GitHub Actions pipelines for testing and deployment |
| Model Governance | Promotion rules, versioning policy, rollback procedures |
| ML System Health | Monitoring drift, retraining triggers, performance degradation |
2. MLOps Architecture Overview
A visual representation of the boundaries and data flow within the MLOps scope.
```mermaid
graph TD
    MLOps["MLOps Scope"]
    Orchestration["Workflow Orchestration (Airflow)
    - DAG design
    - Schedules
    - Retries
    - Alerts"]:::tool
    Registry["Experiment & Model Registry (MLflow)
    - Run logging
    - Versioning
    - Registry
    - Promotion"]:::tool
    Container["Containerization & CI/CD
    - Docker Services
    - Lint & Test
    - GitHub Actions deploy"]:::tool
    Opt["Optimization Execution
    Decided by another team member.
    I will implement the infrastructure for whichever approach is chosen."]:::decision
    MLOps --> Orchestration
    MLOps --> Registry
    MLOps --> Container
    Orchestration --> Opt
    Registry --> Opt
    Container --> Opt
```
3. Airflow — Orchestration Strategy
My DAG Design Plan
I will build and maintain 4 core DAGs:
| DAG | Schedule | What It Does | Why This Schedule |
|---|---|---|---|
| weekly_retrain_dag | Every Sunday at 02:00 | Full model retrain + evaluation + conditional promotion | Sunday = least operationally busy day. 02:00 AM = dead zone. Model ready before Monday. |
| data_quality_dag | Daily at 00:30 | Validates data freshness and completeness before training | Daily = catches bad data early. Runs before 02:00 retrain. |
| alert_sweep_dag | Every 5 minutes | Scans readings_wide flag columns, writes to alerts_log | 5 min is the minimum sustained-condition window that covers all alert rules without spamming logs. |
| model_validation_dag | Triggered by retrain DAG | Compares challenger vs. production model; decides promotion | Event-triggered = must only run after a fresh retrain. |
weekly_retrain_dag Flow
```mermaid
graph TD
    train["Train model
    Algorithm decided by ML Engineer
    Log params & metrics to MLflow"]:::task
    comp["task_5: compare_vs_production"]:::eval
    prom["task_6a: promote_model
    to Production"]:::task
    arch["task_6b: archive_run
    to Staging only"]:::task
    train --> comp
    comp -->|BETTER RMSE| prom
    comp -->|NOT BETTER| arch
```
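The challenger-vs-production decision at the end of this flow can be sketched as a plain function (names here are assumptions chosen to match the task labels; in the real DAG this would back a `BranchPythonOperator`):

```python
# Sketch of the task_5 branch decision. Returns the task_id the DAG
# would follow next: promote only on a clear RMSE improvement.
def choose_next_task(challenger_rmse: float, prod_rmse: float,
                     min_improvement: float = 0.02) -> str:
    """Promote only if the challenger beats production RMSE by more than 2%."""
    if challenger_rmse < prod_rmse * (1 - min_improvement):
        return "promote_model"   # task_6a
    return "archive_run"         # task_6b

print(choose_next_task(0.90, 1.00))  # clears the 2% bar -> "promote_model"
print(choose_next_task(0.99, 1.00))  # within 2% -> "archive_run"
```

The 2% margin mirrors the promotion gate table below; it exists so that noise-level "improvements" never trigger a swap.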
Airflow Engineering Decisions
| Decision | Choice | Reasoning |
|---|---|---|
| Executor | LocalExecutor | Sufficient for our workload; no need for Celery/Kubernetes. |
| Retry policy | 2 retries, 5 min delay | Handles transient DB/network issues without spamming alerts. |
| Failure alerting | Email on DAG failure | Immediate notification to team on broken retraining. |
| Backfill | Disabled for retrain DAG | Retraining with stale triggers is not meaningful. |
| Data skip guard | AirflowSkipException | Prevents training on bad data silently. |
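The data skip guard reduces to a single check (the 36,000-row floor comes from the promotion gate table; the function name is an assumption). In the real DAG task the failing branch would raise `airflow.exceptions.AirflowSkipException` rather than return `False`:

```python
# Minimal sketch of the data skip guard: refuse to train on thin data.
def training_data_ok(row_count: int, min_rows: int = 36_000) -> bool:
    """True if there is enough fresh data to train on this week."""
    return row_count >= min_rows

print(training_data_ok(40_000))  # True
print(training_data_ok(12_000))  # False -> DAG task would skip + alert
```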
4. MLflow — Experiment & Model Strategy
Experiment Structure
I will organize MLflow into 2 experiments — one per model purpose.
MLflow Experiments
│
├── 📁 tank_forecasting_model
│ Every Sunday retrain = 1 run
│ Params: [decided by ML Engineer — logged as-is]
│ Metrics: rmse_val, mae_val, rmse_train
│ Artifacts: model.pkl, forecast_plot.png
│
└── 📁 pump_efficiency_model
Every Sunday retrain = 1 run
Params: [decided by ML Engineer — logged as-is]
Metrics: rmse_val, mae_val, r2_score
Artifacts: model.pkl, feature_importance.png
Recommended Professional Setup
| Component | Cadence | Reasoning |
|---|---|---|
| 🔮 Forecast Model | Weekly retrain | Demand patterns at a pump station are stable and seasonal. Weekly is sufficient. |
| ⚡ Efficiency Model | Weekly retrain | Pump degradation is gradual. Weekly retraining captures the trend. |
| 📡 Drift Monitoring | Daily check | Abnormal sensor behaviour can appear any day. Daily catches this early. |
| 📊 Predictions / Output | Hourly / per-minute | Operational usage — depends on optimization approach decided by team. |
Logging Standards Enforced via `mlops_utils.py`
```python
mlflow.log_param("model_type", ...)    # e.g. "forecasting" or "efficiency"
mlflow.log_param("training_days", 30)
mlflow.log_param("training_date", ...)
mlflow.log_metric("rmse_val", ...)
mlflow.log_metric("mae_val", ...)
mlflow.log_artifact("model.pkl")
mlflow.set_tag("promoted", "false")
mlflow.set_tag("data_start", ...)
```
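One way `mlops_utils.py` could enforce these standards is a small helper that assembles the param/tag dicts in one place (the helper name and return shape are assumptions; the real pipeline would feed the results to `mlflow.log_params` / `mlflow.set_tags`):

```python
from datetime import date

# Hypothetical mlops_utils.py helper: every run logs the same keys,
# so no experiment can silently drift from the standard.
def standard_run_metadata(model_type: str, training_days: int,
                          data_start: str) -> tuple[dict, dict]:
    params = {
        "model_type": model_type,          # "forecasting" or "efficiency"
        "training_days": training_days,
        "training_date": date.today().isoformat(),
    }
    tags = {"promoted": "false", "data_start": data_start}
    return params, tags

params, tags = standard_run_metadata("forecasting", 30, "2024-01-01")
print(sorted(params))  # ['model_type', 'training_date', 'training_days']
```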
Promotion Gate Rules
| Rule | Threshold | Action if Failed |
|---|---|---|
| Validation RMSE must improve | New RMSE < Prod RMSE − 2% | Reject promotion, stay on current prod |
| No NaN in predictions (holdout) | 0 NaN allowed | Hard reject, flag for investigation |
| Training data volume | min 36,000 rows | Skip training, alert team |
| Run time limit | < 30 min wall clock | Timeout, fail DAG |
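The four gate rules above compose into one pure check — a sketch, with the function name and return shape assumed (the real gate runs inside `model_validation_dag`):

```python
import math

# Sketch of the promotion gate: evaluates all four rules from the table,
# returning (passed, reason) so the DAG can log why promotion was refused.
def promotion_gate(new_rmse, prod_rmse, holdout_preds, n_rows, runtime_min):
    if n_rows < 36_000:
        return False, "insufficient training data"
    if runtime_min >= 30:
        return False, "run time limit exceeded"
    if any(math.isnan(p) for p in holdout_preds):
        return False, "NaN in holdout predictions"
    if not new_rmse < prod_rmse * 0.98:   # must improve by more than 2%
        return False, "RMSE did not improve enough"
    return True, "promote"

print(promotion_gate(0.90, 1.00, [0.1, 0.2], 40_000, 12))  # (True, 'promote')
```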
5. Docker — Deployment Strategy
Containerized Services
The docker-compose.yaml manages these 4 core services:
```yaml
services:
  postgres:            # Shared DB — not my data design, but I host it
    image: postgres:15
    ports:
      - "5432:5432"
  airflow-webserver:   # DAG UI and API
    build: ./airflow
    ports:
      - "8080:8080"
  airflow-scheduler:   # DAG execution engine
    build: ./airflow
  mlflow:              # Experiment tracking UI + model registry
    build: ./mlflow
    ports:
      - "5000:5000"
    volumes:
      - ./mlruns:/mlruns   # Persisted locally
```
Volume & Persistence Strategy
| Data | Persistence Method |
|---|---|
| PostgreSQL data | Named Docker volume — survives container restarts |
| MLflow run data | Bind mount ./mlruns to host filesystem |
| Airflow logs | Bind mount ./logs to host filesystem |
| Airflow DAGs | Bind mount ./dags — live sync without restart |
6. GitHub Actions — CI/CD Strategy
⚠️ Not Yet Decided
The CI/CD tooling, pipeline structure, and deployment approach will be defined once the team finalizes the development workflow.
My responsibility will be to set up and maintain whatever CI/CD pipeline is agreed upon — covering code quality checks, testing, and automated deployment to the Linux server.
7. Model Governance Strategy
The Core Question Answered Every Week
"Is the model we're using in production still the best one available?"
Governance Policy
| Policy | Rule |
|---|---|
| One production model at a time | Only one version of each model type lives in Production stage. |
| Never delete runs | All MLflow runs are archived, never deleted — full audit trail. |
| Promotion requires validation | No model goes to Production without passing the validation DAG. |
| A/B testing not required | System is advisory; champion/challenger swaps are instantaneous. |
| Rollback is one command | A single MlflowClient.transition_model_version_stage call restores the previous version. |
Performance Tracking Table (model_performance)
Every week, I populate this Postgres table which the Streamlit team consumes for visualizations:
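The exact columns of `model_performance` are not pinned down in this document; the DDL below is an illustrative guess at a shape the Streamlit team could chart from, kept as a string the pipeline would run once at startup:

```python
# Hypothetical schema for the model_performance table (column names are
# assumptions aligned with the metrics logged to MLflow above).
MODEL_PERFORMANCE_DDL = """
CREATE TABLE IF NOT EXISTS model_performance (
    evaluated_at   TIMESTAMPTZ NOT NULL,
    model_type     TEXT        NOT NULL,  -- 'forecasting' or 'efficiency'
    mlflow_run_id  TEXT        NOT NULL,
    rmse_val       DOUBLE PRECISION,
    mae_val        DOUBLE PRECISION,
    promoted       BOOLEAN     NOT NULL DEFAULT FALSE
);
"""
print("model_performance" in MODEL_PERFORMANCE_DDL)  # True
```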
8. Monitoring the ML System Itself
The rest of the team monitors the pump. I monitor the ML system.
What I Watch & Alert On
| Signal | How I Detect It | Action / Alert |
|---|---|---|
| Model performance drift | Weekly RMSE trend in DB | Alert if RMSE rises >15% week-over-week |
| Data volume drop | Row count check in DAG | Skip training + Email alert |
| Retraining failure | Airflow DAG failure notification | Investigate + manual trigger if needed |
| Serving/output outage | Missing rows in predictions table | Alert responsible team member |
| Stale production model | Age check (> 14 days) | Auto-trigger unscheduled retrain |
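The week-over-week drift rule from the table is trivially codified (function name and parameterized threshold are assumptions; the weekly DAG would run this over the two latest rows in `model_performance`):

```python
# Drift check matching the ">15% RMSE rise week-over-week" alert rule.
def rmse_drift_alert(last_week: float, this_week: float,
                     max_rise: float = 0.15) -> bool:
    """True if validation RMSE rose more than 15% since last week."""
    return this_week > last_week * (1 + max_rise)

print(rmse_drift_alert(1.00, 1.20))  # True  (20% rise -> alert)
print(rmse_drift_alert(1.00, 1.10))  # False (10% rise -> no alert)
```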
9. Rollback & Recovery Plan
Scenario: Bad Model Promoted
Total recovery time: < 5 minutes.
- Detection: RMSE spike in DB OR dashboard team reports bad predictions.
- Step 1: Identify the previous Production run_id in MLflow.
- Step 2: Call MlflowClient.transition_model_version_stage to push the old version back to Production.
- Step 3: Serving layer picks up new Prod model on next scheduled run.
- Step 4: Demote the bad model to Archived stage.
Scenario: Airflow Scheduler Down
```shell
docker-compose restart airflow-scheduler

# If data corruption:
docker-compose down
docker-compose up -d
# DAG state is stored in Postgres — recovers automatically
```
Scenario: MLflow Server Down
```shell
docker-compose restart mlflow
# MLflow data is bind-mounted → no data loss on restart.
# Impact on the serving layer depends on the execution approach
# decided by the responsible team member.
```
10. What I Own vs What I Consume
Clear boundaries prevent duplicate work and system conflicts:
| System Component | Status | Notes |
|---|---|---|
| Airflow DAGs | ✅ 100% Mine | Full creation and monitoring |
| MLflow Server & Registry | ✅ 100% Mine | Infrastructure + promotion rules |
| Docker Compose Stack | ✅ 100% Mine | Base platform setup |
| Infrastructure for Optimization | ✅ Mine (Once Confirmed) | I set it up once the approach is decided |
| model_performance table | ✅ I Populate It | Written from MLflow pipeline |
| Optimization Execution Approach | ❌ Not Mine | Decided by responsible team member |
| Training Algorithms | ❌ Not Mine | Written by ML Engineer |
| readings_wide table | ❌ Not Mine | I consume from Data Eng |
| Streamlit Dashboard | ❌ Not Mine | Dashboard Team |