Multivariate Datastrophe: Methods to Detect Obscure Drift in Your Production Data

Magdalena Kowalczuk

Wednesday 16:10 in Titanium3

Objectives: The goal of this talk is to explain what multivariate data drift is and to present methods for detecting it. You will learn about the challenges associated with multivariate drift and explore practical solutions for identifying it.

  1. Models Aren’t Forever: 91% of models in production degrade over time. I will explain the importance of continuous model monitoring and discuss common approaches and pitfalls.

  2. All Ways Your Perfectly Fine ML Model Can Fail: Machine learning models can fail for numerous reasons. I will describe the root causes of ML model failure, including data quality, data drift and the final boss of every model - concept drift. I will explore simple methods to detect univariate data drift and delve into the more complex challenge of concept drift, which can significantly impact your model’s performance in production.

  3. Drift Happens - Multivariate Data Drift: Multivariate data drift occurs when the relationships between multiple variables in your data change over time. This type of drift can be challenging to detect with standard methods. I will provide an explanation of what multivariate data drift is and why traditional techniques may fall short in identifying it.

  4. Two Clever Ways to Detect Multivariate Data Drift: Domain Classifier: This method trains a classifier to distinguish data from a reference period from data from a later period. If the classifier separates the two better than chance, the distribution has changed, which signals potential drift. PCA Reconstruction Error: Principal Component Analysis (PCA), fitted on reference data, compresses the data to fewer dimensions, and the data can then be reconstructed from that compressed form. If the reconstruction error on newer data departs from its reference level, the underlying structure of the data has likely changed, which may indicate drift.
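
The two methods in item 4 can be sketched roughly as follows. This is a minimal illustration, not the implementation presented in the talk. It assumes NumPy and scikit-learn, and everything in it is made up for the example: the synthetic two-feature data, the choice of a random forest as the domain classifier, and the helper names `sample`, `domain_classifier_auc` and `reconstruction_error`. The drifted sample flips the correlation between the two features while leaving each feature's own distribution unchanged, which is the kind of change a per-feature check would likely miss.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)


def sample(rho, n=2000):
    """Draw two standard-normal features with correlation rho (synthetic data)."""
    cov = [[1.0, rho], [rho, 1.0]]
    return rng.multivariate_normal([0.0, 0.0], cov, size=n)


reference = sample(0.8)   # data the model was built on
no_drift = sample(0.8)    # later data, same distribution
drifted = sample(-0.8)    # later data, same marginals, flipped correlation


def domain_classifier_auc(ref, new):
    """Cross-validated ROC AUC of a classifier telling ref (0) from new (1)."""
    X = np.vstack([ref, new])
    y = np.r_[np.zeros(len(ref)), np.ones(len(new))]
    clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()


# Fit the scaler and PCA on reference data only, keeping one component.
scaler = StandardScaler().fit(reference)
pca = PCA(n_components=1).fit(scaler.transform(reference))


def reconstruction_error(data):
    """Mean Euclidean distance between scaled data and its PCA reconstruction."""
    Z = scaler.transform(data)
    Z_hat = pca.inverse_transform(pca.transform(Z))
    return np.linalg.norm(Z - Z_hat, axis=1).mean()


auc_same = domain_classifier_auc(reference, no_drift)
auc_drift = domain_classifier_auc(reference, drifted)
err_ref = reconstruction_error(reference)
err_same = reconstruction_error(no_drift)
err_drift = reconstruction_error(drifted)

print(f"domain classifier AUC, no drift: {auc_same:.2f}")
print(f"domain classifier AUC, drifted:  {auc_drift:.2f}")
print(f"reconstruction error, reference: {err_ref:.2f}")
print(f"reconstruction error, no drift:  {err_same:.2f}")
print(f"reconstruction error, drifted:   {err_drift:.2f}")
```

An AUC close to 0.5 means the classifier cannot tell the two periods apart, while a clearly higher value suggests the joint distribution has moved. Likewise, a reconstruction error that stays near its reference level suggests a stable structure, and a jump suggests the relationships between features have changed. In practice the alert thresholds for both signals would need to be calibrated on your own reference data.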

Magdalena Kowalczuk

data & ml fan with a soft spot for OSS - driven by curiosity, eager self-learner, hackathon enjoyer, PyData and PyLadiesCon speaker and volunteer, exploring and creating content about post-deployment data science at NannyML, in my free time contributing to Narwhals and hosting open source sprints