How to manage data drift

Frankline Ononiwu
3 min read · Jan 29, 2022

How to keep your model performing well after you deploy

Nothing lasts forever, especially not machine learning models. Most models in production operate in a dynamic environment with many variables, including some that we can’t control. As that environment changes, the performance of the model changes too. It’s easy for a model to drift away in a sea of constantly flowing data.

When a machine learning model is deployed in production, the main concern of data scientists is whether the model stays relevant over time. Is the model still capturing the pattern of new incoming data, and is it still performing as well as it did during its design phase? Answering these questions requires understanding the effects of data drift. Data drift is a change in the distribution of data over time, for example between one daily batch of collected data and the next.

Data drift is the result of unexpected and undocumented changes to input data. It is one of the most common causes of problems in production ML systems. Changes in data cause unwanted misclassification, where, for instance, some spam is classified as non-spam. Drift breaks processes and corrupts data, but it can also reveal new opportunities for data use. Our goal is to understand how to identify data drift, and how to come up with a monitoring plan that catches drift and performance degradation early on.

The main types of data drift

1) Covariate Shift (Shift in the independent variables)

Covariate shift refers to a change in the distribution of the input variables between the training data and the data the model sees later (test or production data). It is the most common type of data drift, and it describes a change in the properties of the independent variables. In the spam example, it is not the definition of a spammer that changes, but the values of the features we use to identify one.
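One way to put a number on covariate shift for a single numeric feature is the Population Stability Index (PSI). The article does not prescribe this metric, so treat the sketch below as one possible choice: the function name, the bin count, and the synthetic Gaussian data are all assumptions made for illustration. A commonly cited rule of thumb reads a PSI below 0.1 as little change and above 0.25 as a significant shift, but those cut-offs are conventions, not guarantees.

```python
import math
import random


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; larger values mean a bigger shift."""
    lo, hi = min(reference), max(reference)
    # Bin edges come from the reference (training) sample only.
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[max(0, min(n_bins - 1, idx))] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


# Synthetic data (an assumption for illustration): one feature at training time,
# a fresh sample from the same distribution, and a sample whose mean has moved.
random.seed(0)
train_feature = [random.gauss(0.0, 1.0) for _ in range(5000)]
same_feature = [random.gauss(0.0, 1.0) for _ in range(5000)]
shifted_feature = [random.gauss(1.0, 1.0) for _ in range(5000)]

psi_same = population_stability_index(train_feature, same_feature)
psi_shifted = population_stability_index(train_feature, shifted_feature)
print(f"PSI, same distribution:    {psi_same:.3f}")
print(f"PSI, shifted distribution: {psi_shifted:.3f}")
```

In practice you would run a check like this per feature on each new batch of production data and investigate the features whose score stands out.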

2) Prior Probability Shift (Shift in the target variable)

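Prior probability shift is a change in the distribution of the target variable, while the features that characterize each class stay the same. The proportions of the classes become meaningfully different from those seen in training. As a result, the trained model is tuned for a class balance that no longer holds. This can also be referred to as label drift.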

The model would still perform well on data that is similar to the “old” data. The model is fine, as much as a model “in a vacuum” can be. But in practical terms, it becomes dramatically less useful because we are dealing with a new data distribution. For instance, a user may no longer consider sales notifications in their inbox to be spam.
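A shift in the target variable can be checked by comparing class proportions in the training labels against recently observed labels. The sketch below uses total variation distance as the comparison; that metric, the helper names, and the example counts are assumptions made for illustration, and it presumes that true labels for recent data eventually become available.

```python
from collections import Counter


def label_distribution(labels):
    """Return the fraction of examples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences: 0 means identical, 1 means disjoint."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


# Hypothetical label counts: spam was 20% of the training data
# but makes up 45% of recently labelled messages.
train_labels = ["spam"] * 200 + ["ham"] * 800
recent_labels = ["spam"] * 450 + ["ham"] * 550

train_dist = label_distribution(train_labels)
recent_dist = label_distribution(recent_labels)
shift = total_variation_distance(train_dist, recent_dist)
print(f"label shift (total variation distance): {shift:.2f}")
```

With these counts the distance comes out at about 0.25, meaning a quarter of the probability mass has moved between classes. What counts as "too much" depends on the application and has to be chosen per model.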

How to Detect Data Drift

The best approach to detecting data drift is to monitor the statistical properties of the input data, the model’s predictions, and their correlation with other factors. The ultimate measure is a model quality metric, which can be accuracy, mean error rate, or some downstream business KPI such as click-through rate. It also helps to think ahead about how the data is likely to behave, and to use models that support some form of adaptive learning.
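Monitoring a quality metric can be as simple as tracking accuracy over a sliding window of recent predictions and flagging when it falls too far below the accuracy measured at design time. The class below is a minimal sketch of that idea, not a production monitor: the class name, window size, and tolerance are hypothetical, and it assumes the true label for each prediction arrives at some point.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent predictions and flag degradation."""

    def __init__(self, baseline_accuracy, window_size=100, tolerance=0.05):
        self.baseline_accuracy = baseline_accuracy  # accuracy measured at design time
        self.tolerance = tolerance                  # drop we are willing to accept
        self.outcomes = deque(maxlen=window_size)   # True/False per recent prediction

    def update(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        # Wait for a full window so a handful of early mistakes does not raise a flag.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.baseline_accuracy - self.tolerance


monitor = RollingAccuracyMonitor(baseline_accuracy=0.90, window_size=50)

# First 50 predictions are all correct, so the window looks healthy.
for _ in range(50):
    monitor.update("spam", "spam")
healthy_flag = monitor.degraded()

# Next 50 predictions are right only half of the time.
for i in range(50):
    actual = "spam" if i % 2 == 0 else "ham"
    monitor.update("spam", actual)
drifted_flag = monitor.degraded()

print(healthy_flag, drifted_flag, monitor.accuracy())
```

The same structure works for a business KPI such as click-through rate: replace the correctness check with the KPI value and compare the window average against its historical baseline.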

Another way to deal with data drift is through retraining. As part of your pipeline, you could implement a system that retrains your models periodically, or whenever it detects drift using one of the methods mentioned above. An alternative is to use streaming models that update their weights as new data arrives. One model in this category is Spark’s streaming linear regression algorithm, an implementation of linear regression that continually updates its trained parameters.
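To make the idea of a model that updates its weights as data arrives concrete, here is a small online linear regression trained with stochastic gradient descent, one example at a time. This is a plain-Python sketch of the general technique and is not Spark's implementation; the class name, learning rate, and synthetic data stream are assumptions made for illustration.

```python
import random


class OnlineLinearRegression:
    """Linear regression updated one example at a time with SGD on squared error."""

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y):
        # Step against the gradient of 0.5 * (prediction - y) ** 2.
        error = self.predict(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error
        return error


random.seed(1)
model = OnlineLinearRegression(n_features=1)

# Phase 1: the stream follows y = 2x + 1.
for _ in range(2000):
    x = random.uniform(-1.0, 1.0)
    model.update([x], 2.0 * x + 1.0)
weights_before = (model.weights[0], model.bias)

# Phase 2: the stream changes to y = -x + 3, and the model follows it.
for _ in range(2000):
    x = random.uniform(-1.0, 1.0)
    model.update([x], -1.0 * x + 3.0)
weights_after = (model.weights[0], model.bias)

print("before the change:", weights_before)
print("after the change: ", weights_after)
```

The trade-off is that a model like this adapts to every change in the stream, including noise and bad data, so it is usually paired with the kind of monitoring described earlier.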

A non-technical way to catch drift early is to improve communication among the teams that interact with the prediction model. Some data drift can be attributed to changes the organization itself introduces to the product, for example when a product team changes the options on a form. With a good line of communication across teams, the system can be prepared to handle the upcoming changes in the data.

This discussion is the first part of a series on data drift.

References

ML Drift-How to Identify Issues Before They Become Problems. Amy Hodler, MLOps Meetup #89

https://arxiv.org/pdf/1704.00362.pdf

