5-SECOND SUMMARY:
- This blog post is part of a series dedicated to the various phases of implementing a data science solution and is a continuation of ‘Data Science Assessment: how to analyse a project’s viability’ and ‘Data Science Assessment: how to create machine learning models’.
After deciding on the best type of machine learning model and building it, we need to be aware that a model in production will probably have to deal with patterns of data different from those used to train it. This will cause performance to decay from previous results, so the goal is to detect these changes quickly, identify the reasons for the discrepancies and, if needed, make changes to maintain performance as well as possible. Let’s see more about machine learning model monitoring.
Machine learning model monitoring: types of drift
A machine learning model needs to be monitored, not just in the ways that traditional software requires (ensuring the process is healthy and running as expected), but also with special attention to ML-specific problems, such as drift.
There are three main types of drift that affect ML problems: covariate, prior probability and concept. Over the next few paragraphs we’ll define these three kinds of drift, using an email spam detection model as an example.
Covariate drift
Covariate drift happens when the input distribution p(x) varies but the functional relation between the features and target p(y|x) remains unchanged.
In the case of email spam detection, this can happen when new terms appear because new software is being used, or because spammers update their strategies. These are changes to the content of emails, but they do not affect whether an email is spam or not.
Prior probability drift
Prior probability drift refers to a change in the distribution of the target, that is, the class prior probability p(y) varies from training to test, but p(x|y) remains unaltered.
In our example, you might encounter this when there is an uptick in spam emails that follow the same patterns as those in the training dataset.
Concept drift
Concept drift refers to cases where the definition of the target changes, that is, where p(x) is unaltered but p(y|x) varies from training to test.
This can arise from a change in company policy: emails from a certain domain may now be considered spam.
Combination of effects
Often more than one of the above will be happening at the same time. When a new kind of phishing email appears, we get covariate drift due to the new email format that wasn’t in the training set, prior probability drift due to the increase in the number of spam emails, and concept drift, as the patterns that mark these emails as spam weren’t in the training data.
Detecting drift
Identifying these drifts is important to understand the quality of your model and to mitigate the downsides of performance degradation. To keep models stable, it is essential to continuously monitor and adjust them as needed. One approach is to set up tools to detect drift and retrain the models regularly, which helps ensure that they maintain their effectiveness and accuracy over time.
There are different strategies to detect different types of drift.
Covariate drift
Detecting covariate drift is probably the most straightforward, since it does not require access to the ground truth of your test set. However, it can also be the noisiest option, often flagging drift that doesn’t affect model performance.
The most common strategy is to use a statistical test comparing the distribution of each feature on the reference dataset and on the new data. This doesn’t identify a change in the dependence between features when the individual distributions remain the same, but that is by design: the number of pairs of features grows quadratically with the number of features, so looking at all of them would be very noisy. If the interaction between two specific features is particularly important to a model, it might be worth monitoring it with a statistical test on the joint distribution. A per-feature version of this idea is sketched below.
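As an illustration, here is a minimal sketch of per-feature covariate drift detection using SciPy’s two-sample Kolmogorov–Smirnov test; the feature names, dataset shapes and significance threshold are purely illustrative, not a prescription.

import numpy as np
from scipy.stats import ks_2samp

def detect_covariate_drift(reference, new, feature_names, alpha=0.01):
    # Flag features whose distribution differs between the reference
    # (training) data and the new data. Both inputs are 2D arrays of
    # shape (n_samples, n_features); alpha is the significance level,
    # and lower values reduce false alarms.
    drifted = []
    for i, name in enumerate(feature_names):
        statistic, p_value = ks_2samp(reference[:, i], new[:, i])
        if p_value < alpha:
            drifted.append((name, statistic, p_value))
    return drifted

# Synthetic example: the second (illustrative) feature is shifted.
rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=(1000, 2))
new = np.column_stack([rng.normal(0.0, 1.0, 1000),
                       rng.normal(0.5, 1.0, 1000)])
print(detect_covariate_drift(reference, new, ["num_links", "caps_ratio"]))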
Prior probability drift
Identifying prior probability drift requires access to the ground truth on the new data. You won’t usually have access to this right away, which makes it a less reactive method, but prior probability drift is often more important than covariate drift. You identify it with the same methods as covariate drift, applied to the target label instead of the features.
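For a classification target, one sketch of this idea, assuming labelled new data is already available, is to compare class counts on the new data against the training proportions with a chi-squared goodness-of-fit test; the labels, sample sizes and threshold below are illustrative.

import numpy as np
from scipy.stats import chisquare

def detect_prior_drift(train_labels, new_labels, alpha=0.01):
    # Compare the class distribution on new labelled data against the
    # class proportions observed at training time.
    classes = np.unique(train_labels)
    train_counts = np.array([(train_labels == c).sum() for c in classes])
    new_counts = np.array([(new_labels == c).sum() for c in classes])
    # Expected counts under the training proportions, scaled to the
    # size of the new sample so observed and expected totals match.
    expected = train_counts / train_counts.sum() * new_counts.sum()
    statistic, p_value = chisquare(f_obs=new_counts, f_exp=expected)
    return p_value < alpha, p_value

# Synthetic example: the spam rate rises from 20% to 35%.
rng = np.random.default_rng(0)
train_labels = rng.choice(["ham", "spam"], size=5000, p=[0.8, 0.2])
new_labels = rng.choice(["ham", "spam"], size=1000, p=[0.65, 0.35])
print(detect_prior_drift(train_labels, new_labels))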
Concept drift
Identifying concept drift is very important, as even a model that extrapolates well will be affected by it. Much like prior probability drift, it is easiest to identify when you have access to the ground truth on the new data: when the model starts performing worse on new data than it did on the training data, and there is no indication of covariate or prior probability drift, the cause is probably concept drift.
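A minimal sketch of that performance-based check, assuming ground-truth labels eventually arrive, is to track a rolling accuracy over recent predictions and compare it against the accuracy measured at training time; the baseline, window size and tolerance here are illustrative defaults.

from collections import deque

class PerformanceMonitor:
    def __init__(self, baseline_accuracy, window=500, tolerance=0.05):
        # baseline_accuracy: accuracy measured on the held-out test set.
        # window: number of recent labelled predictions to keep.
        # tolerance: acceptable drop before raising a degradation flag.
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.window = deque(maxlen=window)

    def update(self, prediction, ground_truth):
        # Record whether the prediction matched the (late-arriving) label.
        self.window.append(prediction == ground_truth)

    def degraded(self):
        if len(self.window) < self.window.maxlen:
            return False  # not enough labelled data yet to judge
        rolling_accuracy = sum(self.window) / len(self.window)
        return rolling_accuracy < self.baseline - self.tolerance

If degraded() returns True while the covariate and prior probability checks stay quiet, concept drift is the likely culprit.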
How to reduce the effects of data drift?
If data drift frequently affects the model, it should trigger an investigation into the root cause. It’s possible that a select few features are suffering from covariate drift, and that feature engineering could mitigate this effect, for example by removing seasonal effects or working with relative instead of absolute values.
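As a sketch of that kind of feature engineering, the example below replaces a hypothetical absolute volume feature with a value relative to a rolling baseline, so a gradual global increase no longer shifts the feature’s distribution; the column names and window are illustrative.

import pandas as pd

def relative_email_volume(df):
    # Turn an absolute daily email count into a ratio over its own
    # 30-day rolling mean, removing slow trends and seasonal level shifts.
    baseline = df["emails_per_day"].rolling(window=30, min_periods=1).mean()
    return df.assign(relative_volume=df["emails_per_day"] / baseline)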
In summary, when a model’s performance decreases due to drift, you must retrain it. But when should you do this? One approach is to automate the detection process using a set of metrics or events, ensuring the model is updated as soon as it fails to achieve the desired performance. Semi-automated detection is another option: conditions are set and, when they are met, an alert tells the developers that action is necessary. This approach is useful when drift is not easily caught by a fully automated system, such as with complex models or highly variable data.
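A minimal sketch of the semi-automated variant, assuming the drift checks above are wrapped as boolean callables and that notify is a hypothetical stand-in for whatever alerting channel you use, could look like this:

def check_and_alert(checks, notify):
    # checks: mapping of check name -> zero-argument callable returning
    # True when that drift condition fires.
    fired = [name for name, check in checks.items() if check()]
    if fired:
        notify("Drift checks fired: " + ", ".join(fired)
               + ". Review before deciding whether to retrain.")
    return fired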
Final Thoughts
The performance of a machine learning model is, as a rule, not static and may decrease due to data-related issues, for instance the model encountering new data, discrepancies between the data used for training and testing, or changes in how the data is interpreted.
To mitigate issues associated with data drift, it’s important to be aware of and understand the three kinds of drift we might encounter in machine learning. We can then identify which type of drift we’re facing, get clues as to what the root cause (or causes) might be, and from there decide on and apply a course of action.
It’s important to understand the needs of the organisations we work with and determine the best monitoring approach, since a fully automated solution is not always possible in contexts with highly complex data.
Our team fully understands how crucial machine learning model monitoring is, and how important it is to understand and address drift in models. It’s part of Xpand IT’s commitment to developing the most effective and accurate machine learning models in the field.