Now our two Toyotas are similar to each other because they both have 1s for Toyota, but they differ on their model.
One-hot encoding works well to encode category values into numbers but has a drawback. Notice how the number of values used to describe a car increased from 2 to 5.
This is where the term high dimensionality gets used. There are now more parameters describing each car than there are cars in the dataset.
For a computer to learn meaningful results, you want that ratio flipped: many more examples than parameters.
In other words, you’d prefer to have 6,000 examples of cars and only 6 ways of describing them rather than the other way round.
But of course, it doesn’t always work out this way. You may end up with 6,000 cars and 1,000 different ways of describing them because Klipklop has seen 500 different types of makes and models.
This is the issue of high cardinality – when you have many different ways of describing something but not many examples of each.
For an ideal price prediction system, you’d want something like 1,000 Toyota Corollas, 1,000 BMW X5s and 1,000 Toyota Camrys.
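To make the dimensionality jump concrete, here's a minimal sketch of one-hot encoding with pandas. The car data is made up for illustration:

```python
import pandas as pd

# A tiny, hypothetical dataset of cars
cars = pd.DataFrame({
    "model": ["Corolla", "Camry", "X5", "Corolla"],
    "odometer_km": [45000, 80000, 30000, 120000],
})

# One-hot encode the "model" column: each category becomes its own 0/1 column
encoded = pd.get_dummies(cars, columns=["model"])

# One "model" column has become three, one per category
print(encoded.columns.tolist())
```

With only three models the table stays small, but a site that has seen 500 makes and models would gain 500 columns this way, which is exactly the dimensionality problem described above.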
Ok, enough about cars.
What about our stock price problem? How could you incorporate a news headline into a model?
Again, you could do this a number of ways. But we’ll start with a binary representation.
You were born before the year 2000, true or false?
Let’s say you answered true. You get a 1. Everyone born after the year 2000 gets a 0. This is binary encoding in a nutshell.
For our stock price prediction, let’s break our news headlines into two categories – good and bad. Good headlines get a 1 and bad headlines get a 0.
With this information, we could scan the web, collecting headlines as they come in and feeding them into our model. Eventually, with enough examples, it would start to get a feel for how the stock price changes based on the value it received for the headline.
And with the model, you start to notice a trend – every time a bad headline comes out, the stock price goes down. No surprises.
We’ve used a simple example here, and binary encodings don’t exactly capture the intensity of a good or bad headline. What about neutral, very good or very bad? This is where the ordinal encoding discussed earlier could come in.
-2 for very bad headlines, -1 for bad, 0 for neutral, 1 for good and 2 for very good. Now it makes sense that very bad + very good = neutral.
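That five-point scale can be sketched as a simple mapping. Because ordinal values carry order and distance, arithmetic on them behaves sensibly:

```python
# Ordinal encoding for headline sentiment, using the scale from the text
sentiment_scale = {
    "very bad": -2,
    "bad": -1,
    "neutral": 0,
    "good": 1,
    "very good": 2,
}

# A "very bad" and a "very good" headline cancel out to neutral
combined = sentiment_scale["very bad"] + sentiment_scale["very good"]
print(combined)  # 0, i.e. neutral
```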
There are more complex ways to bring words into a machine learning model but we’ll leave those for a future article.
The important thing to note is that there are many different ways seemingly non-numerical information can be converted into something a computer can understand.
What can you do?
Machine learning engineers and data scientists spend much of their time trying to think like Sandy the fish.
Sandy knows she’ll be safe staying with the other school of fish but she also knows there’s plenty to learn from exploring the unknown.
It’s easy to lean only on numerical information to draw insights from, but there’s so much more information hidden in other forms of data.
By using a combination of numerical and categorical information, more realistic and helpful models of the world can be built.
It’s one thing to model the stock market using price information, but it’s a whole other game when you add news headlines to the mix.
If you’re looking to start harnessing the power of your data with techniques like machine learning and data science, there are a few things you can do to get the most out of it.
Normalising your data
If you’re collecting data, what format is it stored in?
The format itself isn’t necessarily as important as the uniformity. Collect it but make sure it’s all stored in the same way.
This applies for numerical and categorical data, but especially for categorical data.
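As a small sketch of what uniformity means in practice for categorical data, a normalising step can make inconsistently recorded categories match (the raw values here are made up):

```python
# Hypothetical raw category values, collected inconsistently
raw_makes = ["Toyota", "toyota", " TOYOTA ", "BMW", "bmw"]

def normalise(value: str) -> str:
    """Strip surrounding whitespace and lowercase so identical categories match."""
    return value.strip().lower()

cleaned = [normalise(v) for v in raw_makes]

# Five raw strings collapse into two genuine categories
print(sorted(set(cleaned)))  # ['bmw', 'toyota']
```

Without this step, a one-hot encoder would treat "Toyota" and "toyota" as two different makes and inflate the dimensionality for no reason.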
More is better
The ideal dataset has a good balance between cardinality and dimensionality.
In other words, plenty of examples of each particular sample.
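One quick way to check that balance is to compare the number of rows against the number of unique categories. A rough sketch with pandas, on a made-up dataset:

```python
import pandas as pd

# Hypothetical dataset: how many examples do we have per category?
cars = pd.DataFrame({
    "model": ["Corolla"] * 3 + ["Camry"] * 3 + ["X5"] * 1,
})

counts = cars["model"].value_counts()
print(counts.to_dict())

# Examples per category: higher is generally better for learning
ratio = len(cars) / cars["model"].nunique()
print(ratio)
```

A category with only one example, like the X5 here, is exactly the high-cardinality problem described earlier: there isn’t enough data for a model to learn anything reliable about it.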
Machines aren’t quite as good as humans when it comes to learning (yet). We can see Harold the pig once and remember what a pig looks like, whereas a computer needs thousands of pictures of pigs to learn what a pig looks like.
A general rule of thumb for machine learning is that more (quality) data leads to better models.
Document what each piece of information relates to
As more and more data is collected, it’s important to be able to understand what each piece of information relates to.
At Max Kelsen, before any kind of machine learning model is run, the engineers spend plenty of time liaising with subject matter experts who are familiar with the data set.
Why is this important?
Because a machine learning engineer may be able to build a model which is 99% accurate but it’s useless if it’s predicting the wrong thing. Or worse, 99% accurate on the wrong data.
Documenting your data well can help prevent these kinds of misfires.
It doesn’t matter whether you’ve got numerical data, categorical data or a combination of both – if you’re looking to get more out of it, Max Kelsen can help.