How to avoid common pitfalls in machine learning projects

11 December 2018 / Nick Priestley
In October, my colleagues and I took part in a Hackathon at Rolls Royce (aerospace). While it was a fun challenge, it got me thinking about some of the common pitfalls and traps that you face when taking on a real-world Machine Learning project. Here are my key takeaways to help you avoid common machine learning mistakes.


Rolls Royce was looking for new strategic partners to supplement their internal projects, so they set a challenge to 'test and expand your skills in AI, advanced analytics, and deep tech'. We liked the sound of that, so we entered a team! Most of the details of the challenge were unknown until the day, but our team had most angles covered: experience in Computer Vision, Natural Language Processing and building predictive models using structured data, as well as considerable experience in clearly presenting our work, developing compelling apps and doing all of that within a business environment.


What is a Hackathon?

A Hackathon is defined as an event, typically lasting several days, in which a large number of people meet to engage in collaborative computer programming. It’s often in the form of a competition hosted by a company to find new talent, both internally and externally.
We were tasked with building a model for early detection of turbine blades that would fail the inspection and measurement process. Specifically, the moment weight (a physics term!) of each blade from an engine is taken at the end of manufacture, and those over a given threshold are non-compliant and can't be used. The rejected blades can often be reworked, but there is a significant cost benefit to doing this early in the manufacturing lifecycle, making early identification of likely non-compliance highly desirable.
During the event, we worked with various structured datasets to build a model and a user interface that could interact with it. The results of our work would be judged by a panel of executives at Rolls Royce, who would scrutinise our model and software while paying equal attention to the cost-benefit analysis and commercial business plan that we presented to them.
Whilst I'm sad to report we didn't win on this occasion, we had some fantastic feedback from the Rolls Royce team, and it was great fun taking on a challenge in a sector completely alien to us. There were also some key takeaways that I think businesses could take advantage of. Below I lay out the key stages of the project and how we tackled each of them. Hopefully, some of this may resonate with you and the projects you are working on…


Stage one: Data

When taking on a machine learning project, data scientists always harp on about the importance of data preparation. Data Wrangling, Data Munging, or whatever you want to call it, takes up the majority of your time and is arguably the most important piece of the puzzle. People often think that this just amounts to data cleansing. While there's a certain amount of this, usually the majority of the effort falls under the umbrella of 'sample design': decisions that you make along the way that can have a huge impact on the quality of your model.
When designing a sample for a supervised machine learning project, the obvious place to start is with the 'labels', i.e. the thing you want to predict. In our case, we were guided towards predicting the moment weight as this is the measure of success or failure, but this wasn't completely straightforward...
The manufacturing dataset was broken down into different processes, and in what was known to us only as 'Process D' the team already intercept and rework blades that they predict will fail inspection. The reworked blades carry on through the process. Some of these will still fail, as will other blades that weren't intercepted. Herein lies the problem of Selection Bias, a problem that we're all too familiar with at Jaywing.
We have done a lot of work in credit risk modelling and, in a way, intercepted components are a bit like loan applications that have been rejected. The problem is that you don't know how things would have played out had the blades been left unaltered (or the loans accepted). For the intercepted blades, there were a few ways to deal with this:
A. Leave them out of your sample
B. Include them, but infer the moment weight
C. Create a binary model (predicting pass/fail) including them as fails
Selection bias becomes a problem when you train your model on data that is not representative of the population it will be applied to (i.e. the data available when making real predictions). If we removed the very worst blades from our sample, the model may not be very good at detecting them in a real-world scenario, so most people agree that option A would be a bad idea. Option C was the most tempting given the short timeframes, but option B is ultimately what we went with. It's important to keep one eye on the business question that you're trying to answer. We were predicting inspection failure so that components can be reworked, and the actual moment weight is presumably significant in how blades are reworked, so a predicted value is more useful than a simple pass/fail flag.
As a way of keeping the 'rejects' in our sample, we would usually use sophisticated inference methods to deduce a label. Luckily in our dataset, the intercepted blades had a corresponding predicted moment weight based on Rolls Royce's own model. We used this as our proxy label in place of our own reject inference.
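As a rough illustration of that labelling decision, here is a minimal sketch in plain Python. The field names (`intercepted`, `measured_mw`, `predicted_mw`) and the numbers are invented for the example and are not taken from the Rolls Royce data; the only idea carried over from the text is that intercepted blades get the existing predicted moment weight as a proxy label.

```python
from typing import Optional


def build_label(intercepted: bool,
                measured_mw: Optional[float],
                predicted_mw: Optional[float]) -> Optional[float]:
    """Choose the training label for one blade (hypothetical field names)."""
    if intercepted:
        # The blade was reworked, so its final measurement no longer tells us
        # what would have happened unaltered. Use the existing predicted
        # moment weight as a proxy label instead.
        return predicted_mw
    # Blades that were never intercepted keep their measured moment weight.
    return measured_mw


# Illustrative records only.
blades = [
    {"blade_id": "A", "intercepted": False, "measured_mw": 101.2, "predicted_mw": None},
    {"blade_id": "B", "intercepted": True, "measured_mw": 99.8, "predicted_mw": 104.5},
]

for blade in blades:
    blade["label"] = build_label(
        blade["intercepted"], blade["measured_mw"], blade["predicted_mw"]
    )

print([(b["blade_id"], b["label"]) for b in blades])
```

The point of the sketch is that the intercepted blade stays in the sample rather than being dropped (option A) or collapsed into a fail flag (option C).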
The remaining fields are what you call the 'independent variables', and in machine learning you apply Feature Engineering to these variables to maximise the performance of your model.
What exactly is this, you may ask?
Basically, it means transforming the data to make it more useful to your model. Taking an example from the credit risk world again, you might have a mortgage where you know the loan amount and the house value. An obvious thing to do is to create a new variable for the Loan to Value ratio – the intuition being that the more you have to lose, the less likely you are to default. Similarly, in the Rolls Royce data there were a series of measurements taken before Process D, and again afterwards.
We figured out that these values alone were not very predictive; it was the difference between the measurements that was key. The raw measurements could be fed into a deep learning model, which in theory can figure out the importance of the differences by itself, but this is only true if you have a very large training dataset. The Rolls Royce dataset was actually very small (relative to the datasets that we normally work on), so adding derived, engineered features makes sense.
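A derived feature of this kind can be sketched in a few lines. This is only an illustration: the column names (`m1_before_d`, `m1_after_d`, `delta_m1`) and the values are made up, and the real dataset will have had its own naming and many more measurements.

```python
def add_difference_features(row, pairs):
    """Return a copy of `row` with one 'after minus before' feature per pair.

    `pairs` is a list of (new_name, before_column, after_column) tuples.
    All names here are hypothetical.
    """
    out = dict(row)
    for new_name, before, after in pairs:
        out[new_name] = row[after] - row[before]
    return out


# One illustrative blade with a measurement taken before and after Process D.
row = {"m1_before_d": 12.50, "m1_after_d": 12.25}
features = add_difference_features(row, [("delta_m1", "m1_before_d", "m1_after_d")])

print(features["delta_m1"])
```

The raw measurements are kept alongside the new column, so the model can still use them if they turn out to carry signal of their own.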

Stage two: Making a robust model

As part of the sample design you need to think about making a robust model. A common problem when building predictive models is that a model may work well on your development and test samples but, without care, performance becomes mediocre when it moves into production or, worse, drops off a cliff completely. What might be the cause?
Data leakage is a common culprit, and in computer vision it's particularly easy for this to happen. Take, for example, scan images used in cancer diagnosis. I heard of a case recently where some of the images used to train a model had markings (drawings) on them, made by the consultants. These markings were highly indicative of a suspected problem. The machine learning model had learned the significance of these markings, which inflated the model performance. When applied to unaltered images it did not perform so well. Another, subtler example I have heard of was a model trained on images from multiple scanners. The model had learned to detect which scanner the image had come from due to minor differences in the calibration. In the training data, some of the scanners were used at centres where the patients were at later stages of diagnosis.
So, the model was effectively encoding the scanner location, making it look better than it really was and leaving it unable to generalise well to scans from different machines. These kinds of biases exist everywhere, and you need to know how to spot them.
To make a robust model for Rolls Royce we had to both remove and transform variables. Any variables not available at the point of observation were removed. For instance, there were measurements taken at the end of manufacture, which were no good for a model used in early detection of non-compliant blades.
We also discovered several issues relating to time confounding. The data had been collected over several years, and during this period Rolls Royce had implemented their own solution which drastically improved the overall success rate. So, a sequential variable such as the blade serial number effectively encodes the date it was made. If included, a model could say, for example, that anything before serial number x had a higher probability of failure than anything after it - information that is of no use for new blades.
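One way to make these exclusions explicit is to record, for every candidate variable, the stage at which its value becomes known, and then filter on that. The sketch below is hypothetical throughout: the stage names, column names and the `TIME_PROXIES` set are invented for illustration, with only 'Process D', the end-of-manufacture measurements and the serial number taken from the text.

```python
# Ordered manufacturing stages (hypothetical simplification).
STAGES = ["before_process_d", "after_process_d", "end_of_manufacture"]

# Hypothetical catalogue: the stage at which each column's value is known.
AVAILABLE_AT = {
    "pre_d_measurement": "before_process_d",
    "post_d_measurement": "after_process_d",
    "final_measurement": "end_of_manufacture",
    "serial_number": "before_process_d",
}

# Columns that are known in time but mostly act as a stand-in for the
# manufacture date, so they are excluded as well.
TIME_PROXIES = {"serial_number"}


def usable_features(observation_stage):
    """List columns known by `observation_stage` that are not time proxies."""
    cutoff = STAGES.index(observation_stage)
    return sorted(
        name
        for name, stage in AVAILABLE_AT.items()
        if STAGES.index(stage) <= cutoff and name not in TIME_PROXIES
    )


selected = usable_features("after_process_d")
print(selected)
```

Keeping the catalogue as data rather than scattering `drop` calls through the code makes it easier to review the choices with someone who knows the manufacturing process.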
Another confounding variable was the cast number. It was a highly predictive variable, indicating that a problem with a specific cast is likely to affect other blades formed from the same cast. Casts only have a limited lifespan, though, so including it in our model would have been cheating. At inference time (in contrast to the training data, which covered a long historic period), the model would not have access to the success rate of all blades moulded from a given cast. Imagine a brand new cast: the model would have no information about how good it is historically. This is known as the Cold Start Problem.
An easy fix that keeps this variable rather than throwing it away altogether is to derive the known failure rate so far for a given cast at the time of manufacture. This little nugget is an example of where domain knowledge is important: you can't just treat a dataset as a bunch of numbers.
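The 'failure rate so far' idea can be sketched as a running count per cast. This is a simplified illustration rather than what we built on the day: the field names (`cast`, `failed`) are hypothetical, the records are assumed to be sorted in manufacture order, and it assumes the outcome of every earlier blade is already known when the next blade is scored, which may not hold if inspection lags behind manufacture.

```python
from collections import defaultdict


def add_cast_failure_rate_so_far(blades):
    """Add 'cast_fail_rate_so_far' using only earlier blades from the same cast.

    `blades` must be in manufacture order. A brand new cast gets None,
    which is the cold-start case.
    """
    seen = defaultdict(int)    # blades made so far, per cast
    failed = defaultdict(int)  # failures so far, per cast
    out = []
    for blade in blades:
        cast = blade["cast"]
        rate = failed[cast] / seen[cast] if seen[cast] else None
        out.append({**blade, "cast_fail_rate_so_far": rate})
        # Update the counts only after computing the feature, so a blade's
        # own outcome never leaks into its own feature value.
        seen[cast] += 1
        failed[cast] += int(blade["failed"])
    return out


# Illustrative records only.
history = [
    {"cast": "X", "failed": False},
    {"cast": "X", "failed": True},
    {"cast": "Y", "failed": False},
    {"cast": "X", "failed": False},
]
enriched = add_cast_failure_rate_so_far(history)
print([b["cast_fail_rate_so_far"] for b in enriched])
```

The important detail is the order of operations inside the loop: the feature is computed before the counters are updated.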


Stage three: Model selection

When working with structured (tabular) data there are a few obvious choices of approach. It's hard to predict which will work best so there’s inevitably a bit of experimentation involved. Random Forests, Gradient Boosted Decision Trees (GBTs) and Feed-Forward Neural Networks are usually pretty effective (at the time of writing).
Until recently Neural Networks had not been effective with tabular data, but we've had a lot of success with them in the credit risk world at Jaywing recently, even on small datasets (<10k samples). Despite their internal complexity, the lack of feature engineering required for Neural Networks makes them relatively easy to work with these days. By simply passing your continuous features through as raw values, and using Embeddings or one-hot encoding for anything categorical, you can go very far, very quickly. Random Forests and GBTs, on the other hand, still need a fair bit of data work before you can get started – correct handling of missing values, for example. During the challenge we had varying degrees of success with our models. Random Forests didn’t perform so well, but both GBTs and Neural Networks using our very own Archetype product were very effective.
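To show what 'raw continuous values plus one-hot encoding' means for a single record, here is a small dependency-free sketch. It is not how Archetype prepares data; the column names (`width`, `supplier`) and the category list are hypothetical, and in this sketch a category the encoder has never seen simply becomes all zeros.

```python
def one_hot(value, categories):
    """Encode `value` as a 0/1 vector over a fixed, ordered category list."""
    return [1.0 if value == category else 0.0 for category in categories]


def encode_row(row, continuous, categorical):
    """Build a flat numeric input vector for one record.

    `continuous` is a list of column names passed through as raw floats.
    `categorical` maps a column name to its ordered list of known categories.
    """
    vector = [float(row[name]) for name in continuous]
    for name, categories in categorical.items():
        vector.extend(one_hot(row[name], categories))
    return vector


# Illustrative record and schema.
row = {"width": 2.5, "supplier": "b"}
vector = encode_row(row, ["width"], {"supplier": ["a", "b", "c"]})
print(vector)
```

An embedding layer replaces the one-hot block with a small learned vector per category, which tends to matter more as the number of categories grows.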

Stage four: Metrics

It's worth talking about metrics. They matter more than most people realise, particularly in the evaluation of your model performance and in how you explain the power of a model to non-technical decision makers. The first metric to consider is the one used to select variables for your model. Before building a model you are often faced with more variables than is practical to use. So, a common technique to discard ineffective variables is to look at their individual (univariate) importance. The metric used for univariate importance depends on your label. When modelling a binary outcome (i.e. “Yes or no”), Information Value is a good choice. For continuous outcomes - like predicting moment weight - something like R² works well.
It's important not to blindly trust univariate importance though, as it doesn’t capture interactions between variables. For instance, had we discarded the measurements from Process D without first realising that it was the difference between two of the measurements that was of greater significance, we may well have thrown away some of our most useful variables.
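For readers who haven't met Information Value, it is computed from a binned variable by comparing the share of 'good' and 'bad' outcomes that falls in each bin: the sum over bins of (share of goods minus share of bads) multiplied by the natural log of their ratio. The sketch below uses that standard credit-scoring definition with invented counts, and it assumes every bin contains at least one good and one bad case (in practice, empty bins are usually merged or smoothed first).

```python
import math


def information_value(bins):
    """Information Value of one binned variable.

    `bins` is a list of (n_good, n_bad) counts, one tuple per bin.
    Assumes no count is zero, otherwise the log term is undefined.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        pct_good = good / total_good  # share of all goods in this bin
        pct_bad = bad / total_bad     # share of all bads in this bin
        iv += (pct_good - pct_bad) * math.log(pct_good / pct_bad)
    return iv


# Illustrative counts: a variable whose bins look identical carries no
# information, while one that separates good from bad scores much higher.
iv_flat = information_value([(50, 5), (50, 5)])
iv_strong = information_value([(80, 2), (20, 8)])
print(iv_flat, iv_strong)
```

Because the score is computed one variable at a time, it has exactly the blind spot described above: two measurements can each score poorly while their difference scores well.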
A more important choice of metric is the one used to evaluate your model. For binary prediction tasks, you often want a balance between Precision and Recall, not just one or the other. Precision is the proportion of predicted positive cases that are actually positive, and Recall is the proportion of truly positive cases that the model successfully identifies as positive. There is generally a trade-off between the two. For instance, in cancer diagnosis you probably care less about wrongly flagging someone as having cancer (a false positive) than about missing a case altogether (a false negative), so you would favour recall.
Accuracy is a very common metric used when evaluating model performance. It’s fine sometimes, but it’s important to grasp how incredibly misleading it can be. For example, if we are predicting non-compliance of blades and the average success rate is 97%, then a model that simply predicts every blade will be compliant is right 97 times out of 100, i.e. 97% accurate, despite never catching a single non-compliant blade. Our go-to choice of metric for continuous outcomes is R², which tells you what proportion of the variation in the outcome your predictions explain. But that’s just a statistic, and when presenting our results back to the executives at Rolls Royce we instead translated the impact into a breakdown of the cost saving from the reduction in non-compliant blades, also bearing in mind the cost of false positives.
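The 97% example is easy to reproduce. The sketch below scores a 'model' that predicts every blade is compliant against 100 illustrative blades, 3 of which are non-compliant. The data is synthetic, and returning 0.0 for precision when nothing is predicted positive is a convention chosen for this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for 0/1 labels (1 = non-compliant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Defined as 0.0 here when there are no positive predictions.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# 3 non-compliant blades out of 100, and a model that always says "compliant".
y_true = [1] * 3 + [0] * 97
y_pred = [0] * 100

scores = binary_metrics(y_true, y_pred)
print(scores)
```

Accuracy comes out at 0.97 while recall is 0.0, which is the whole problem in two numbers.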

Stage five: It’s all for nothing if you can’t get buy in

The final point worth a mention is buy-in. Many models never see the light of day due to a lack of it. You need to engage with the business and the key stakeholders and, in particular, any people whose jobs will be directly affected by the implementation of your model. There’s often a sense of both fear and scepticism. You should engage with the end-users and executive team directly, so you can understand and address any fears; sometimes they will be valid. To tackle scepticism, it helps to be empirical. It’s hard to deny hard facts and data!
To conclude, it was great fun taking on a challenge and trying our hand at modelling a process in a completely different sector. The power of AI and machine learning techniques is that you can apply them to almost any industry. It’s been described as like having a super power. While this is no doubt over-hyping the benefits, I largely agree with the sentiment.
But as they say, with great power comes great responsibility, and I hope I’ve highlighted the importance of being responsible. It’s becoming easier and easier to make a predictive model. But take a moment to think about the common traps, pitfalls and inherent biases that come with predictive modelling.
If you would like help with machine learning projects, or if you're starting your own data science team and want help avoiding the common traps then please get in touch: [email protected].