
What is MLOps?

MLOps (machine learning operations) is the set of practices that takes a model from a notebook to real users—and keeps it healthy afterward. Think of it as DevOps for ML, with a few extras: data, labeling, experiments, and constant retraining.

In plain terms, MLOps connects people (data scientists, data/ML engineers, product teams) and process (versioning, testing, reviews) with tooling (pipelines, registries, monitors) so you can ship models reliably, repeatably, and safely.

What a mature MLOps loop usually includes:

  1. Data & labeling: sourcing data, cleaning it, defining a label schema/ontology, and tracking dataset versions.
  2. Reproducible training: code, data, and environment all versioned; experiments tracked; results comparable.
  3. Automated validation: unit tests for features, checks for label leakage, bias/fairness tests, and offline metric gates (e.g., F1, AUC); a minimal gate is sketched after this list.
  4. Model registry & CI/CT/CD: store approved models, promote them through stages, and use continuous integration/continuous training/continuous delivery to push safe updates.
  5. Deployment patterns: batch scoring, real-time APIs, canary/shadow/A/B releases, rollback on failure.
  6. Monitoring in production: data drift, concept drift, training/serving skew, latency, throughput, and business KPIs, plus alerting and dashboards.
  7. Feedback & retraining: capture user corrections or human review, add them back to the dataset, and trigger scheduled or event-driven retrains.
  8. Security & governance: access control, PII handling, lineage, audit logs, and compliance.

Common building blocks you’ll hear about: orchestration (Airflow, Kubeflow), experiment tracking (MLflow, Weights & Biases), model registries, feature stores (e.g., Feast), containers (Docker), and Kubernetes for scaling. You don’t need all of them on day one—the goal is a simple, repeatable path from data to deployment to monitoring.
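
For experiment tracking specifically, a typical pattern with MLflow looks like the sketch below. The experiment name, hyperparameters, and registered model name are placeholders, and the synthetic dataset stands in for real features; the MLflow calls themselves (`set_experiment`, `start_run`, `log_params`, `log_metric`, `mlflow.sklearn.log_model`) are part of MLflow's standard API, though the exact setup (tracking server, registry backend) varies by deployment.

```python
# Sketch: log one training run to MLflow and register the resulting model.
# Assumes MLflow is installed; model registration requires a registry-capable
# tracking backend (e.g., a database-backed store). Names are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real training data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=0)

mlflow.set_experiment("product-categorization")

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 12}
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    mlflow.log_params(params)
    mlflow.log_metric("f1_holdout", f1_score(y_holdout, model.predict(X_holdout)))

    # Registering the model makes it visible in the model registry,
    # where it can be reviewed and promoted through stages.
    mlflow.sklearn.log_model(model, "model",
                             registered_model_name="product-categorizer")
```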

Example

Scenario: An online marketplace wants better product categorization.

  1. Data & labels: Pull product titles, images, and seller descriptions. Define the category taxonomy. Label a seed set; add quality checks.
  2. Training pipeline: A scheduled job builds features, trains models, logs metrics, and registers the best candidate.
  3. Automated tests: The pipeline runs unit tests on feature code, checks for data drift against the last snapshot, and enforces accuracy/F1 thresholds before promotion.
  4. Deployment: Roll out the new model behind a canary, starting with 10% of traffic. Watch latency and precision/recall. If healthy, ramp to 100%; if not, roll back automatically (a simple canary gate is sketched after this list).
  5. Monitoring & feedback: Track misroutes from customer support and catalog editors as labeled feedback. When drift increases (e.g., new seasonal products), the system kicks off retraining using the fresh labels. Results are reviewed, then promoted through the registry.

Outcome: Categorization improves steadily without heroics. Releases are boring (in a good way), and the team can trace every prediction back to the exact data, code, and model that produced it.