I’ve been a Machine Learning Engineer for the past seven months. Across the handful of projects I’ve worked on, a few things have come up every time.
Of course, it’s never the same because the data is different. But the principles of one problem can often be used for another.
Here are a few things I have to remind myself of every time I start working with a new dataset.
A) Look at the data. No really, look at the data.
When you first get a dataset, the first thing you should do is go through it and formulate a series of questions.
Don’t look for answers straight away; doing this too early could put a roadblock in your exploratory process.
‘What does this column relate to?’
‘Would this affect that?’
‘Should I find out more about this variable?’
‘Why are these numbers like this?’
‘Do these samples always appear that way?’
You can start to answer them on your second time going over the data. And if they’re questions you can’t quite answer yourself, turn to the experts.
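A first pass over the data doesn't need anything fancy. Here's a minimal sketch using only Python's standard library which counts missing and distinct values per column, the kind of summary which tends to raise questions like the ones above. The sample rows and the `summarise_columns` helper are made up for illustration; in practice you'd read your own file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; in practice you would open your own CSV file.
SAMPLE_CSV = """suburb,bathrooms,price
Park Ridge,2,450000
Greenville,,380000
Ascot,3,720000
Park Ridge,1,
"""


def summarise_columns(rows):
    """Return, per column, the count of missing values and a tally of the values seen."""
    summary = {}
    for row in rows:
        for column, value in row.items():
            stats = summary.setdefault(column, {"missing": 0, "values": Counter()})
            if value is None or value.strip() == "":
                stats["missing"] += 1
            else:
                stats["values"][value.strip()] += 1
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = summarise_columns(rows)
for column, stats in summary.items():
    print(column, "| missing:", stats["missing"], "| distinct:", len(stats["values"]))
```

A missing bathroom count or a missing price is exactly the sort of thing worth writing down as a question before trying to fix it.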
B) Talk to the experts
Data is data. It doesn’t lie. It is what it is. But that doesn’t mean some of the conclusions you draw won’t be biased by your own intuition.
Say you’re trying to predict prices of houses. There may be some things you think you already know. House prices took a dip in 2008, houses with white fences sell for more, etc.
But it’s important not to treat these as hard assumptions. Before you start to build the world’s best housing model, you may want to ask some questions of people with experience.
Because it will save you time. After you’ve formulated your question list in Part A, asking a subject matter expert may save you hours of preprocessing.
‘Oh, we don’t use that metric anymore, you can disregard it.’
‘That number you’re looking at actually relates to something else.’
C) Make sure you’re answering the right question
When you start building a model, make sure you have the problem you’re trying to solve mapped out in your head.
This should be discussed with the client, with the project manager and any other major contributors.
What is the ideal outcome of the model?
Of course, the goal may change as you iterate through different options but having something to aim towards is always a good start.
There's nothing worse than spending two weeks building a 99% accurate model and then showing the client your work only to realise you were modelling the wrong thing.
Measure twice. Cut once. Actually, this saying doesn't really work for machine learning because you'll be making plenty of models. But you get the point.
D) Feature engineering, feature encoding and data preprocessing
What kind of data is there?
Is it only numerical?
Are there categorical features which could be incorporated into the model?
Heads up, categorical features can be considered any type of data which isn't immediately available in numerical form.
In the problem of trying to predict housing prices, you might have number of bathrooms as a numerical feature and the suburb of the house as a categorical (a non-number category) feature of the data.
There are different ways to deal with both of these.
For numerical features, the main thing is to make sure everything is in the same format. For example, imagine the year a car was manufactured.
Is '99 (the year 1999) more than five times greater than '18 (the year 2018)?
You might want to change these to 1999 and 2018 to make sure the model captures how close these two numbers actually are.
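As a rough sketch, the conversion could look like the function below. The `expand_year` name and the `pivot` cut-off of 30 are assumptions for illustration; the right cut-off depends on what your data actually covers.

```python
def expand_year(two_digit_year, pivot=30):
    """Expand a two-digit year into a four-digit year.

    Assumption: values below `pivot` belong to the 2000s, the rest to the 1900s.
    """
    if not 0 <= two_digit_year <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    century = 2000 if two_digit_year < pivot else 1900
    return century + two_digit_year


years = [expand_year(y) for y in (99, 18)]
print(years)  # [1999, 2018]
```

Now the two cars are 19 years apart instead of one value being several times the other.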
The goal of categorical features is to turn them into numbers. How could you turn house suburbs into numbers?
Say you had Park Ridge, Greenville and Ascot.
Could you say Park Ridge = 1, Greenville = 2 and Ascot = 3?
But doesn't that mean Park Ridge + Greenville = Ascot?
That doesn't make sense.
A better option would be to one-hot-encode them. This means giving a value a 1 for what it is and 0's for what it isn't.
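Here's a minimal one-hot encoding sketch in plain Python using the three suburbs from above. The `one_hot_encode` helper is hypothetical and written out by hand to show the idea; a library encoder would normally do this for you.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # 1 for what it is, 0's for what it isn't
        encoded.append(row)
    return categories, encoded


suburbs = ["Park Ridge", "Greenville", "Ascot", "Greenville"]
categories, encoded = one_hot_encode(suburbs)

print(categories)  # ['Ascot', 'Greenville', 'Park Ridge']
for suburb, row in zip(suburbs, encoded):
    print(suburb, row)
```

Each suburb gets its own column, so no suburb ends up looking bigger or smaller than another.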
There are many other ways to turn categorical variables into numbers, and figuring out the best one is the fun part.
E) Test fast, iterate, update
Can you create a simpler metric to measure in the beginning?
There might be an ideal scenario you're working towards but is there a simpler model you can put together to test your thinking?
Start with the simplest model possible and gradually add complexity.
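One possible starting point is a baseline which always predicts the average of the training data. The sketch below scores that baseline with mean absolute error; the prices are made-up numbers and the choice of metric is an assumption, so swap in whatever suits your problem.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical house prices, split into training and test sets.
train_prices = [400_000, 500_000, 600_000]
test_prices = [450_000, 550_000]

# Simplest possible "model": always predict the training average.
baseline = mean(train_prices)
baseline_predictions = [baseline] * len(test_prices)
baseline_mae = mean_absolute_error(test_prices, baseline_predictions)

print("baseline prediction:", baseline)
print("baseline MAE:", baseline_mae)
```

Any more complex model you build afterwards should beat this number; if it doesn't, that's worth knowing early.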
Don't be afraid to be wrong, again, again, again and again. Better to be wrong in testing than in production.
If in doubt, run the code. Just like data, code doesn't lie. It'll do exactly as you tell it.
The quicker you figure out what doesn't work, the quicker you can find what does.
F) Keep revisiting the main objective
The main problems which arise in machine learning projects are often not the data or the model but the communication between the parties involved.
Communication is always the key.
Working through a problem can leave you stuck down a rabbit hole. You wanted to try one thing, which led to another and another and now you're not even sure what problem you're working on.
This isn't necessarily a bad thing; some of the best solutions are found this way.
But remember not everyone will be able to understand your train of thought. If in doubt, over communicate.
Ask yourself, 'Am I still on the right track?' Because if you can't answer it, how do you think others will go?
G) A couple of shout-outs
My quality of work got an upgrade after I stumbled across these two amazing resources.
A notebook by Daniel Formoso (awesome name) which goes through a data science classification task from start to finish using scikit-learn, TensorFlow and a bunch of other techniques.
CatBoost, a rocket-powered open-source library for gradient boosting on decision trees. In other words, an epic algorithm which improved all my results on a recent project by 10%.
You could combine these two and build a pretty robust foundation for your next machine learning project.