Being your own biggest sceptic, the value in trying things which might not work and why communication is harder than technical problems.
There's a famous saying in the world of data science.
All models are wrong but some are useful. — George E. P. Box
It may not only apply to data science either. But that doesn't matter.
When you're exploring a dataset, you're looking for clues. Sometimes they'll be right there, other times they won't.
You'll build models and they won't work. But you keep going.
You keep exploring, you keep looking for answers and you keep asking questions.
You prove yourself wrong over and over. And that's what you practice. You practice being wrong. You develop the most important skill a data scientist can have. A willingness to be wrong.
Not because the goal is to be wrong. But because being wrong gives you an opportunity to figure out what not to do.
Being wrong means you tried something which might not work.
Being wrong is the badge of the explorer.
Because being wrong and learning from it enables you to get closer to being useful.
“Are you making progress or completing activities?” he said, “That’s what I ask myself at the end of each day.”
“I’m writing that down.”
We kept talking. Not much more worth writing down though.
“Let me know what you get up to.”
“Okay, I will.”
“Have a good day mate. Goodbye.”
Too many activities can feel like progress. That’s what he was talking about. You could be working yourself to the bone but the list never gets any smaller.
Maybe it’s time to get a new list.
One which leads to progress instead of a whole bunch of activities being checked off at the end of the day.
I catch myself when I’m writing a list each morning. On the days where there are only two or three things, write, workout, read, I go to add more out of habit. But would more activities lead to progress?
If your goal is to progress, you must decide which activities lead to it and which don’t. It’s hard and you’ll never know for sure, but you can make a decision to try. A decision to step back, a decision to think about what adds to progress, and to cut what doesn’t.
In my latest video, I share how I got Google Cloud Professional Data Engineer Certified. I passed the exam without meeting any of the prerequisites. How? A few activities which led to progress. But the certification isn’t the real progress. The real progress comes from doing something with the skills the certificate requires. More on that in the future.
We were hosting a Meetup on robotics in Australia and it was question time.
Someone asked a question.
“How do I get into artificial intelligence and machine learning from a different background?”
Nick turned and called my name.
“Where’s Dan Bourke?”
I was backstage and talking to Alex. I walked over.
“Here he is,” Nick continued, “Dan comes from a health science background, he studied nutrition, then drove Uber, learned machine learning online and has now been with Max Kelsen as a machine learning engineer for going on a year.”
Nick is the CEO and Co-founder of Max Kelsen.
I stood and kept listening.
“He has documented his journey online and if you have any questions, I’m sure he’d be happy to help.”
The questions finished and I went back to the food.
Ankit came over. He told me about the project he was working on to use machine learning to try and understand student learning better. He was combining lecture attendance rates, time spent on the online learning portal, quiz results, plus a few other things. He’d even built a front-end web portal to interact with the results.
Ankit’s work inspired me. It made me want to do better.
Then a few more people started coming over and asking questions about how to get into machine learning. All from different fields.
This is the hard part. I still see myself as a beginner. I am a beginner.
Am I the right mentor?
The best mentor is someone who’s 1-2 years in front of you. Someone who has just been through what you’re about to go through. Any longer and the advice gets fuzzy. You want it when it’s fresh.
My brother is getting into machine learning. Here’s what I’ve been saying to him.
A) Get some Python foundations (3-4 months)
The language doesn’t really matter. It could be R, Java, Python, whatever. What matters is picking one and sticking with it.
If you’re starting out, you’ll find it hard to go wrong with Python.
And if you want to get into applied machine learning, code is compulsory.
Pick a foundations course online and follow it through for a couple of months. Bonus points if it’s geared towards teaching data science at the same time. DataCamp is great for this.
It’ll get hard at times but that’s the point. Learning a programming language is like learning another language and another way of thinking at the same time.
But you’ve done it before. Remember when you were 3? Probably not. But people all around you were using words and sounds you’d never heard before. Then after a while, you started using them too.
B) Start making things when you’re not ready
Apply what you’ve learned as soon as you can.
No matter how many courses you’ve completed, you’ll never be 100% ready.
Don’t get lured into completing more courses as a sign of competence.
This is one thing I’d change if I went back and started again.
Find a project of your own to work on and learn through being wrong.
Back to your 3-year-old self. Every 3rd word you said would’ve been wrong. No sentence structure, no grammar either. Everything just came out.
C) There’s a lot out there so reduce the clutter
There are plenty of courses out there. All of them great.
It’s hard to find a bad one.
But here’s the thing. Since there are so many, it can be hard to choose. Another trap which can hold you back.
To get around this, I made my own AI Masters Degree. My own custom track to follow.
You can copy it if you want. But I encourage you to spend a few days doing research of your own and seeing what’s best for you.
As a heads up, three resources I’ve found most aligned to what I do day-to-day are the Hands-On Machine Learning book, the fast.ai Machine Learning course and the Applied Data Science with Python course on Coursera.
Bookmark these for after you’ve had a few months’ Python experience.
D) Research is pointless if you can’t apply it
You’ll see articles and papers coming out every day about new machine learning methods.
There’s no way to keep up with them all and it’ll only hold you back from getting your foundations set.
Most of the best machine learning techniques have been around for decades. What’s changed has been an increase in computing power and the availability of data.
Don’t be distracted by the new.
If you’re starting out, stick to getting your foundations first. Then expand your knowledge as your project requires.
E) A little every day
3-year-old you was a learning machine (a machine learner?).
In a couple of years, you went from no words to talking with people who had been speaking for decades.
Because you practised a little per day.
Then the compound interest kicked in.
1% better every day = 3700% better at the end of the year.
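The arithmetic behind that number is worth checking for yourself. A quick sketch (plain compounding arithmetic, nothing more):

```python
# Compounding a 1% improvement per day over a year.
daily_rate = 1.01
days = 365

growth = daily_rate ** days          # roughly 37.8x where you started
percent_better = (growth - 1) * 100  # roughly 3,678% better
```

So 3700% is a rounded figure; the exact compounding works out to about 3,678% better than day one.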
If you miss a day, no matter, life happens. Resume when you can.
Soon enough you’ll start to speak the language of data.
F) Don’t beat yourself up for not knowing something
“Have you ever built a recommendation engine?”
“We’ve got a project that requires one as a proof of concept, think you can figure it out?”
Most people think learning stops after high-school or college. It doesn’t.
The scenario above happened the other week. I’d never built a recommendation engine. Then I did.
Failure isn’t bad if you’re failing at something you’ve done before. You’ve been walking your whole life but you don’t beat yourself up when you trip on your own feet. It happens. You keep walking.
But failing at something new is tough. You’ve never done it before.
Learning machine learning kind of goes like this.
1st year: You suck.
2nd year: You're better than the year before but you think you suck even more because you realise how much you don’t know.
3rd year: ???? (I’m not there yet)
Embrace the suck.
How much will beating yourself up for not knowing something help you for learning more?
Learning something new takes time. Every day is day one.
How would your 3-year-old self react to not knowing a word?
You’d laugh. Throw your hands in the air and then crawl around for a bit.
It’s the same now. Except you can walk.
It was happening. Date night. While everyone was out spending big, I was going to be the outlier and cook at home. I’ve optimised plenty of cost functions before. Tonight would be no different.
I asked a friend for help. I had no idea what to cook.
You’d think after working so closely together for the past 2–3 years I’d know a thing or two about my date.
Thai green chicken curry it was.
Food was like data I thought. You get a bunch of things together, combine them in some way and get something out at the other end. Something like that anyway.
I ordered 1kg worth of chicken and split it into training, validation and test sets. I’d preprocessed all the other ingredients a day before.
700g worth of chicken seemed like a lot for two people. But my friend sent me some good parameters to work with. After applying dropout, things seemed to normalise.
The recipe had the perfect learning rate. I even added in some of my own parallelisation by getting two pans on the go. Meat in one, vegetables in the other. I’d ensemble them together when they were ready.
“How’s this taste?” I asked my brother.
Perfect accuracy. I was sceptical. Maybe I was overfitting on the training set. Accuracy wasn’t a good measure for cooking anyway. There was still 150g of chicken for validation.
“What about this?” I asked the same brother.
“8/10, the first one was better.”
“What do you think?” I asked my other brother.
“9/10, I’m a fan.”
I’d set up cross-validation. Ideally, I’d like to do more than 2-folds but it was too late. Dinner was in a few hours. Reinitialising the kitchen would take far too long.
The real test was about to begin. No instructions this time. If I hadn’t learned anything with the previous 850g of chicken, I was screwed.
What if the recipe didn’t hold together when it came time for the demonstration?
It was like going to show a client your work. You’d spent the last 6 weeks stitching the most upvoted Stack Overflow Pandas functions into an elegant-looking pipeline only to have it fall over.
“I’m not sure what happened, it was working earlier on my end.”
Or worse. Someone forgot the dongle. The dongle. Always the dongle.
But you see, that was one of the reasons I asked her to dinner. The uniqueness. She needed a dongle but didn’t need one at the same time.
She was shy, but there was this confidence about her. Those curves, the sparkle of the screen.
The table was set. Candles, flowers. We sat down. The conversation was one way as usual. I didn’t mind. It gave me time to air out my thoughts. No judgements. Love is the absence of judgement.
I started thinking. Maybe I was getting too deep too soon. What if I was adding all these layers to our relationship but all she wanted was something straight up and down. Something linear.
“Are you hungry?”
Maybe I said something wrong.
I went over and started cooking. It was show time. Cooking at home was the right choice. It meant I could run everything locally. In the same environment I’d practised on the training and validation set.
It was the same. Chicken in one pan. Vegetables in another. If it was anything like the validation set, it would take about 3-minutes to converge.
The sauce smelled amazing. Even better than before. Maybe the perfect accuracy on the training set wasn’t too far off.
I served up two bowls and sat down.
“This is Thai green chicken curry.”
I started eating. It tasted good. Really good.
What had I done wrong, I wondered. I’d been through all the ingredients, got the inputs of others, checked my metrics.
Then it hit me. I’d been working with the wrong data the whole time. I’d spent all this time trying to provide an answer when I should’ve been asking more questions.
She didn’t want food. She didn’t need food. She was a laptop. Mine was up there but laptops don’t eat even the best Thai green curry.
The one thing she needed, I couldn’t provide.
I had a hunch but something blinded me. I was already in deep. It’s hard to step back when you’re already 15 layers in.
I plugged in the charger and she lit up like a Christmas tree. Mr Charger could provide.
“We can still be friends.”
I waited to hear it but it never came.
I went upstairs and cleared the table. Blew the candles out, put the extra bowl in the fridge.
When I came back downstairs they were still together.
I wasn’t silly. I knew there was a high probability they’d stay together. It hurt. It was a kind of loss to which I had no prior.
I told my friend the next day.
He took his hand off his chin and started talking.
“Look at it this way, at least next time you can take more of a Bayesian approach.”
Blog posts are great.
Many of the resources you find online for data science and machine learning are great.
From everything I've looked up over the past few months, I've struggled to find something bad. Only things which were slightly outside the scope I was after.
I get asked a lot what the best place to learn is.
I can't answer it.
Because I haven't tried everywhere.
I can only speak of what I've tried.
But you could choose anything. Follow it. Put the effort in. Build off it. Then repeat.
And you'll always win.
Because investing in companies and businesses may make you money.
But investing in yourself always pays off.
Our class went on an excursion. We played with different kinds of food compounds which could shape themselves around the outside of a balloon. Then we were taught about tools which could output very small drops.
‘What are these called?’ I asked.
We got back to school. The teacher turned and asked what I thought of the trip.
‘I liked the tour but it was very focused on science.’
‘That’s what it was all about.’
She was right. We went to a science institute.
The same teacher asked me to be captain of debating. It was tradition to get up and talk in front of the school. I got up and gave a talk. Everyone clapped but my speech wasn’t as good as I wanted it to be.
I was set out to do law. I’d see lawyers on the TV. All it looked like was a form of debating where everyone wears suits and says ‘objection!’ Followed by something smart.
I thought, ‘I could do that.’
A few episodes of Law & Order and everyone becomes a lawyer.
We got our grades, I got 7/25, lower was better. Not as good as I hoped but I expected it. Most of my senior year was devoted to running our Call of Duty team. We were number one in Australia.
The letters came, it was time to choose what to study at university. I read the headings in bold and left the rest to read later. I was set out to do law.
We were on the waterfront riding scooters. There was a girl there I knew from primary school. I had a crush on her in grade four. For Easter, my mum gave me two chocolates to take in, a big one and a small one. The big one was for my teacher, Mrs Thompson. When I got to school I gave the big one to the girl. But she still liked Tony Black.
She was smart. That’s why I liked her.
‘What are you studying?’ I asked.
‘Biomedical science, it’s what you study before getting into medicine.’
‘Oh, that’s what I’m doing.’
I wasn’t. I hadn’t filled out the form. I was set out to do law.
I got home and checked the study guide. Biomedical science required a score of 11/25. I was eligible. I put it down as my number one preference. Same as the girl.
The email came a few weeks later. I got into my number one preference. A Bachelor of Science majoring in Biomedical Science.
We went to orientation day together. I spent $450 on textbooks. I used my mum's card. There was a biology one with 1200 pages. It had a red spine and a black cover. The latest edition.
Our timetables were the same. 30-something contact hours per week. I lived 45-minutes from university by car. 90-minutes by train and bus. The first lecture of the week was at 8 am on Monday. BIOL1020. Why someone chose this time for a lecture still confuses me.
The lecturer started.
‘30% of you will fail this course.’
‘That won’t be me.’
It was me.
My report card in high school went something like this.
Maths - B
Extension Maths - C
Physics - B
Religion - A+ (most of religion was storytelling, debating helped with this)
English - B
Geography - B
Sports - A
Not a single biology course. I was set out for law.
I took the same course the next year. I passed. It took me a year to get some foundations in biology. By then the girl was already through to second year. She was smart. That’s why I liked her.
Being a doctor sounded cool.
‘I’m going to be a doctor,’ I told people at parties.
But by the end of my second year, my grades were still poor.
The Dean of Science emailed me. Not him. One of his secretaries. But it said I had to go and see him. My grades were bad. The email was the warning. Improve or we’ll kick you out.
I met with the Dean. He told me I could change courses if I wanted to. I changed to food science and nutrition. Still within the health world but less biology. I wasn’t set out for law.
My grades improved and I graduated three years later. Five years to do a three-year degree.
People asked when I finished.
‘What are you going to do with your nutrition degree?’
I thought it was a good plan.
I was working at Apple. They paid for language courses. I signed up for Japanese and Chinese. Japanese twice a week. Chinese once a week.
My study routine was solid. The main skill I learned at university was learning how to learn.
I was getting pretty good. When Chinese customers came in, I’d ask them if they had a backup of their iPhone in Chinese.
‘Nĭ yŏu méiyŏu beifan?’
They loved it.
I passed the level 2 Japanese exam the night before flying to Japan. Being solo for a month meant plenty of walking. Plenty of listening to podcasts. Most of them were about technology or health. Two things I’m interested in. And all the ones about technology kept mentioning machine learning.
On the trains between cities, I’d read articles online.
I went to Google.
‘What is machine learning?’
‘How to learn machine learning?’
I quit Apple two months after getting back from Japan. Travelling gave me a new perspective. Cliche but true.
My friend quit too. We worked on an internet startup for a couple of months. AnyGym, the Airbnb of fitness facilities. It failed. Partly due to lack of meaning, partly due to the business model of gyms depending on people not showing up. We wanted to do the opposite.
Whilst building the website, the internet was exploding with machine learning.
I did more research. The same Google searches.
‘What is machine learning?’
‘How to learn machine learning?’
Udacity’s Deep Learning Nanodegree came up. The trailer videos looked epic and the colours of the website were good on the eye. I read everything on the page and didn’t understand most of it. I got to the bottom and saw the sign-up price, thought about it, scrolled back to the top and then back to the bottom. I closed my laptop.
The prerequisites contained some words I’d never heard of.
Python programming, statistics and probability, linear algebra.
More research. Google again.
‘How to learn Python?’
‘What is linear algebra?’
I had some savings from Apple but they were supposed to last a while. Signing up for the Nanodegree would take a big chunk out.
I signed up. Class started in 3-weeks.
Back to the internet. It was time to learn Python.
‘How hard could it be?’ I thought.
Treehouse’s Python course looked good. I enrolled. I went through it fast. 3-4 hours every day.
Emails came through for the Deep Learning Nanodegree. There was a Slack channel for introductions. I joined it and started reading.
‘Hey everyone, I’m Sanjay, I’m a software engineer at Google.’
‘Hello, I’m Yvette, I live in San Francisco and am a data scientist at Intuit.’
I kept reading. More of the same.
Mine went something like this.
‘Nice to meet you all! I’m Daniel, I started learning programming 3-weeks ago.’
After seeing the experience level of others, I emailed Udacity support asking what the refund policy was. ‘Two weeks,’ they said. I didn’t reply.
Four months later, I graduated from the Deep Learning Foundations Nanodegree. It was hard. All my assignments were either a couple of days late or right on time. I was learning Python and math I needed as I needed it.
I wanted to keep building upon the knowledge I’d gained. So I explored the internet for more courses like the Deep Learning Nanodegree. I found a few, Andrew Ng’s deeplearning.ai, the Udacity AI Nanodegree, fast.ai and put them together.
My self-created AI Masters Degree was born. I named it that because it’s easier than saying, ‘I’m stringing together a bunch of courses.’ Plus, people kind of understand what a Masters Degree is.
8-months into it I got a message from Ashlee on LinkedIn.
‘Hey Dan, what you’re posting is great, would you like to meet Mike?’
I met Mike.
‘If you’re into technology and health, you should meet Cam.’
I met Cam. I told him I was into technology and health and what I had been studying.
‘Would you like to come in on Thursday to see what it’s like?’
I went in on Thursday.
It was a good day. The team were exploring some data with Pandas.
‘Should I come back next Thursday?’ I asked.
A couple of Thursdays later I sat down with the CEO and the lead Machine Learning Engineer. They offered me a role. I accepted.
One of our biggest projects is in healthcare. Immunotherapy Outcome Prediction (IOP). The goal is to use genome data to better predict who is most likely to respond to immunotherapy. Right now it’s effective in about 42% of people. But the hard part is figuring out which 42%.
To help with the project we hired a biologist and a neuroscientist and a few others.
Before joining, they hadn’t done much machine learning at all. But thanks to the resources available online and a genuine curiosity to learn more, they’ve produced some world-class work.
We had a phone call with the head of Google’s Genomics team the other day.
‘I’m really impressed by your work.’
They’ve done an amazing job. But compliments should always be accepted with a grain of salt and a smile. Results on paper and results in the real world are two different things.
The team know that.
Can a biology student get into AI and machine learning?
I’m not a good example because I failed biology. Almost twice.
But I sit across from two who have done it.
You’ve already got it. The same one which led you to learn more about biology. Be curious and have the courage to be wrong.
Biology textbooks get rewritten every 5-years or so right?
Back to day one BIOL1020. The lecturer had another saying.
‘What you learn this year will probably be wrong in 5-years.’
It’s the same in machine learning. Except the math. Math sticks around.
Pink singlet, dyed red hair, plaited grey beard, no shoes, John Lennon glasses. What a character. Imagine the stories he’d have. He parked his moped and walked into the cafe.
This cafe is a local favourite. But the chairs aren’t very comfortable. So I’ll keep this short (spoiler: by short, I mean short compared to the amount of time you’ll actually spend doing EDA).
When I first started as a Machine Learning Engineer at Max Kelsen, I’d never heard of EDA. It was one of a bunch of acronyms I’d never come across.
I later learned EDA stands for exploratory data analysis.
It’s what you do when you first encounter a dataset. But it’s not a one-off process. It’s a continual one.
The past few weeks I’ve been working on a machine learning project. Everything was going well. I had a model trained on a small amount of the data. The results were pretty good.
It was time to step it up and add more data. So I did. Then it broke.
I filled up the memory on the cloud computer I was working on. I tried again. Same issue.
There was a memory leak somewhere. I missed something. What changed?
Maybe the next sample of data I pulled in had something different to the first. It did. There was an outlier. One sample had 68 times the number of purchases of the mean (100).
Back to my code. It wasn’t robust to outliers. It took the outlier’s length, applied it to the rest of the samples and padded them with zeros.
Instead of having 10 million samples with a length of 100, they all had a length of 6800. And most of that data was zeros.
I changed the code. Reran the model and training began. The memory leak was patched.
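To make the bug concrete, here’s a minimal sketch of the padding problem and one possible fix. The data and function names are hypothetical, not the project’s actual code:

```python
def pad_to_max(samples, pad_value=0):
    # Buggy approach: one outlier sets the padded length for everyone.
    max_len = max(len(s) for s in samples)
    return [s + [pad_value] * (max_len - len(s)) for s in samples]


def pad_to_cap(samples, cap, pad_value=0):
    # More robust approach: truncate at a sensible cap, then pad up to it.
    return [s[:cap] + [pad_value] * max(0, cap - len(s)) for s in samples]


# Two typical samples and one outlier, 68x the mean length of 100.
samples = [[1] * 100, [1] * 100, [1] * 6800]

padded = pad_to_max(samples)          # every sample blows out to 6800
capped = pad_to_cap(samples, cap=100) # every sample stays at 100
```

The cap itself is a judgment call; a percentile of the observed lengths is one common choice.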
The guy with the pink singlet came over. He tells me his name is Johnny.
‘The girls got up me for not saying hello.’
‘You can’t win,’ I said.
‘Too right,’ he said.
We laughed. The girls here are really nice. The regulars get teased. Johnny is a regular. He told me he has his own farm at home. And his toenails were painted pink and yellow, alternating, pink, yellow, pink, yellow.
Back to it.
What happened? Why the break in the EDA story?
Apart from introducing you to the legend of Johnny, I wanted to give an example of how you can think the road ahead is clear but really, there’s a detour.
EDA is one big detour. There’s no real structured way to do it. It’s an iterative process.
Why do EDA?
When I started learning machine learning and data science, much of it (all of it) was through online courses. I used them to create my own AI Masters Degree. All of them provided excellent curriculum along with excellent datasets.
The datasets were excellent because they were ready to be used with machine learning algorithms right out of the box.
You’d download the data, choose your algorithm, call the .fit() function, pass it the data and all of a sudden the loss value would start going down and you’d be left with an accuracy metric. Magic.
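That course-style workflow really is only a few lines when the data is clean. A toy sketch with scikit-learn (the dataset here is made up, not from any particular course):

```python
from sklearn.linear_model import LogisticRegression

# A tiny, clean, course-style dataset: numeric features, ready to go.
X = [[0.0], [0.5], [1.0], [4.0], [4.5], [5.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)                  # pass it the data...
accuracy = model.score(X, y)     # ...and out comes an accuracy metric
```

Real-world data rarely arrives in a shape where those three lines are all you need, which is the whole point of EDA.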
This was how the majority of my learning went. Then I got a job as a machine learning engineer. I thought, finally, I can apply what I’ve been learning to real-world problems.
The client sent us the data. I looked at it. WTF was this?
Words, time stamps, more words, rows with missing data, columns, lots of columns. Where were the numbers?
‘How do I deal with this data?’ I asked Athon.
‘You’ll have to do some feature engineering and encode the categorical variables,’ he said, ‘I’ll Slack you a link.’
I went to my digital mentor. Google. ‘What is feature engineering?’
Google again. ‘What are categorical variables?’
Athon sent the link. I opened it.
There it was. The next bridge I had to cross. EDA.
You do exploratory data analysis to learn more about the data before you ever run a machine learning model.
You create your own mental model of the data so when you run a machine learning model to make predictions, you’ll be able to recognise whether they’re BS or not.
Rather than answer all your questions about EDA, I designed this post to spark your curiosity. To get you to think about questions you can ask of a dataset.
Where do you start?
How do you explore a mountain range?
Do you walk straight to the top?
How about along the base and try and find the best path?
It depends on what you’re trying to achieve. If you want to get to the top, it’s probably good to start climbing sometime soon. But it’s also probably good to spend some time looking for the best route.
Exploring data is the same. What questions are you trying to solve? Or better, what assumptions are you trying to prove wrong?
You could spend all day debating these. But best to start with something simple, prove it wrong and add complexity as required.
Making your first Kaggle submission
You’ve been learning data science and machine learning online. You’ve heard of Kaggle. You’ve read the articles saying how valuable it is to practice your skills on their problems.
Despite all the good things you’ve heard about Kaggle. You haven’t made a submission yet.
That was me. Until I put my newly acquired EDA skills to work.
You decide it’s time to enter a competition of your own.
You’re on the Kaggle website. You go to the ‘Start Here’ section. There’s a dataset containing information about passengers on the Titanic. You download it and load up a Jupyter Notebook.
What do you do?
What question are you trying to solve?
‘Can I predict survival rates of passengers on the Titanic, based on data from other passengers?’
This seems like a good guiding light.
An EDA checklist
Every morning, I consult with my personal assistant on what I have to do for the day. My personal assistant doesn’t talk much. Because my personal assistant is a notepad. I write down a checklist.
If a checklist is good enough for pilots to use every flight, it’s good enough for data scientists to use with every dataset.
My morning lists are non-exhaustive; other things come up during the day which have to be done. But having it creates a little order in the chaos. It’s the same with the EDA checklist below.
1. What question(s) are you trying to solve (or prove wrong)?
2. What kind of data do you have and how do you treat different types?
3. What’s missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?
We’ll go through each of these.
What would you add to the list?
What question(s) are you trying to solve?
I put an (s) in the subtitle. Ignore it. Start with one. Don’t worry, more will come along as you go.
For our Titanic dataset example it’s:
Can we predict survivors on the Titanic based on data from other passengers?
Too many questions will clutter your thought space. Humans aren’t good at computing multiple things at once. We’ll leave that to the machines.
Sometimes a model isn’t required to make a prediction.
Before we go further, if you’re reading this on a computer, I encourage you to open this Jupyter Notebook and try to connect the dots with topics in this post. If you’re reading on a phone, don’t fear, the notebook isn’t going away. I’ve written this article in a way you shouldn’t need the notebook, but if you’re like me, you learn best seeing things in practice.
What kind of data do you have and how do you treat different types?
You’ve imported the Titanic training dataset.
Let’s check it out.
Column by column, there’s: numbers, numbers, numbers, words, words, numbers, numbers, numbers, letters and numbers, numbers, letters and numbers and NaNs, letters. Similar to Johnny’s toenails.
Let’s separate the features out into three boxes, numerical, categorical and not sure.
In the numerical bucket we have PassengerId, Survived, Pclass, Age, SibSp, Parch and Fare.
The categorical bucket contains Sex and Embarked.
And in not sure we have Name, Ticket and Cabin.
Now we’ve broken the columns down into separate buckets, let’s examine each one.
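You can get a first pass at this split programmatically with pandas’ select_dtypes. A minimal sketch on a toy frame mimicking a few of the Titanic columns (the values are made up):

```python
import pandas as pd

# A toy frame with a couple of numerical and categorical columns.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(include="object").columns.tolist()
```

Dtypes only get you so far, though; as we’ll see below, some numeric columns are really categories in disguise, so the code is a starting point, not the final word.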
The Numerical Bucket
Remember our question?
‘Can we predict survivors on the Titanic based on data from other passengers?’
From this, can you figure out which column we’re trying to predict?
The Survived column. And because it’s the column we’re trying to predict, we’ll take it out of the numerical bucket and leave it for the time being.
Think for a second. If you were trying to predict whether someone survived on the Titanic, do you think their unique PassengerId would really help with your cause?
Probably not. So we’ll leave this column to the side for now too. EDA doesn’t always have to be done with code, you can use your model of the world to begin with and use code to see if it’s right later.
Next in the numerical bucket are Pclass, SibSp and Parch. These are numbers but there’s something different about them. Can you pick it up? And what do SibSp and Parch even mean? Maybe we should’ve read the docs more before trying to build a model so quickly.
Google. ‘Kaggle Titanic Dataset’.
Pclass is the ticket class, 1 = 1st class, 2 = 2nd class and 3 = 3rd class.
SibSp is the number of siblings or spouses a passenger had on board. And Parch is the number of parents or children a passenger had on board.
This information was pretty easy to find. But what if you had a dataset you’d never seen before. What if a real estate agent wanted help predicting house prices in their city. You check out their data and find a bunch of columns which you don’t understand.
You email the client.
They respond. ‘Tnum is the number of toilets in a property.’
Good to know.
When you’re dealing with a new dataset, you won’t always have information available about it like Kaggle provides. This is where you’ll want to seek the knowledge of an SME.
Another acronym. Great.
SME stands for subject matter expert. If you’re working on a project dealing with real estate data, part of your EDA might involve talking with and asking questions of a real estate agent. Not only could this save you time, but it could also influence future questions you ask of the data.
Since no one from the Titanic is alive anymore (RIP (rest in peace) Millvina Dean, the last survivor), we’ll have to become our own SMEs.
There’s something else unique about Pclass, SibSp and Parch. Even though they’re all numbers, they’re also categories.
Think about it like this. If you can group data together in your head fairly easily, there’s a chance it’s part of a category.
The Pclass column could be labelled First, Second and Third and it would maintain the same meaning as 1, 2 and 3.
Remember how machine learning algorithms love numbers? Since Pclass, SibSp and Parch are already all in numerical form, we’ll leave them how they are. The same goes for the other numerical columns, Age and Fare.
Phew. That wasn’t too hard.
The Categorical Bucket
In our categorical bucket, we have Sex and Embarked.
These are categorical variables because you can separate passengers who were female from those who were male. Or those who embarked at C from those who embarked at S.
To train a machine learning model, we’ll need a way of converting these to numbers.
How would you do it?
Remember Pclass? 1st = 1, 2nd = 2, 3rd = 3.
How would you do this for Sex and Embarked?
Perhaps you could do something similar for Sex: female = 1 and male = 2. And for Embarked: S = 1 and C = 2.
We can change these using the LabelEncoder() class from the sklearn.preprocessing module.
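As a sketch of what that conversion might look like (the rows here are made up to stand in for the real dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A tiny stand-in for the Titanic training data (hypothetical values)
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

# LabelEncoder assigns an integer to each unique value (alphabetical order)
encoder = LabelEncoder()
train["Sex"] = encoder.fit_transform(train["Sex"])  # female -> 0, male -> 1
```

You’d repeat the same step for Embarked. Note the codes come out in alphabetical order of the labels, not in order of appearance.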
We’ve made some good progress towards turning our categorical data into all numbers but what about the rest of the columns?
Challenge: Now you know Pclass could easily be a categorical variable, how would you turn Age into a categorical variable?
The Not Sure Bucket
Name, Ticket and Cabin are left.
If you were on Titanic, do you think your name would’ve influenced your chance of survival?
It’s unlikely. But what other information could you extract from someone's name?
What if you gave each person a number depending on whether their title was Mr., Mrs. or Miss.?
You could create another column called Title. In this column, those with Mr. = 1, Mrs. = 2 and Miss. = 3.
What you’ve done is created a new feature out of an existing feature. This is called feature engineering.
Converting titles to numbers is a relatively simple feature to create. And depending on the data you have, feature engineering can get as extravagant as you like.
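A sketch of how the title feature might be created (hypothetical names in the Titanic style; the regex and the Title mapping are illustrative choices, not the only way to do it):

```python
import pandas as pd

# Hypothetical passenger names
train = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
]})

# Pull out the title (the word ending in '.') and map it to a number
train["Title"] = train["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
train["Title"] = train["Title"].map({"Mr": 1, "Mrs": 2, "Miss": 3})
```

On the real dataset you’d find rarer titles too (Dr., Rev., Master.), which this simple mapping would leave as missing, so you’d need to decide how to group them.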
How does this new feature affect the model down the line? This will be something you’ll have to investigate.
For now, we won’t worry about using the Name column to make a prediction.
What about Ticket? The first few examples don’t look very consistent at all. What else is there?
These aren’t very consistent either. But think again. Do you think the ticket number would provide much insight as to whether someone survived?
Maybe if the ticket number related to what class the person was riding in, it would have an effect, but we already have that information in Pclass. To save time, we’ll leave the Ticket column out for now.
Your first pass of EDA on a dataset should have the goal of not only raising more questions about the data, but also of getting a model built using the least amount of information possible, so you’ve got a baseline to work from.
Now, what do we do with Cabin?
You know, since I’ve already seen the data, my spidey-senses are telling me it’s a perfect example for the next section.
Challenge: I’ve only listed a couple examples of numerical and categorical data here. Are there any other types of data? How do they differ to these?
What’s missing from the data and how do you deal with it?
import missingno
missingno.matrix(train, figsize=(30, 10))
The Cabin column looks like Johnny’s shoes. Not there. There are a fair few missing values in the Age column too.
How do you predict something when there’s no data?
I don’t know either.
So what are our options when dealing with missing data?
The quickest and easiest way would be to remove every row with missing values. Or remove the Age column entirely.
But there’s a problem here. Machine learning models like more data. Removing large amounts of data will likely decrease the ability of our model to predict whether a passenger survived or not.
Another option is imputing values. In other words, filling up the missing data with values calculated from other data.
How would you do this for the Age column? Could you fill the missing values with the average age?
There are drawbacks to this kind of value filling. Imagine you had 1000 total rows, 500 of which are missing values. You decide to fill the 500 missing rows with the average age of 36.
Your data becomes heavily stacked with the age of 36. How would that influence predictions on people 36-years-old? Or any other age?
Maybe for every person with a missing age value, you could find other similar people in the dataset and use their age. But this is time-consuming and also has drawbacks.
There are far more advanced methods for filling missing data out of scope for this post. It should be noted, there is no perfect way to fill missing values.
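A simple median fill might look like this (a sketch on made-up ages; the median is a common alternative to the mean because it’s less affected by outliers):

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the Age column (made-up values)
train = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 26.0]})

# Fill the missing ages with the median of the known ages
median_age = train["Age"].median()
train["Age"] = train["Age"].fillna(median_age)
```

The same stacking problem from above still applies: every filled row now shares the same age.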
If the missing values in the Age column are a leaky drain pipe, the Cabin column is a cracked dam. Beyond saving. For your first model, Cabin is a feature you’d leave out.
Challenge: The Embarked column has a couple of missing values. How would you deal with these? Is the amount low enough to remove them?
Where are the outliers and why you should be paying attention to them?
‘Did you check the distribution?’ Athon asked.
‘I did with the first set of data but not the second set…’ It hit me.
There it was. The rest of the data was being shaped to match the outlier.
If you look at the number of occurrences of unique values within a dataset, one of the most common patterns you’ll find is Zipf’s law. It looks like this.
Remembering Zipf’s law can help when thinking about outliers (values towards the end of the tail don’t occur often and are potential outliers).
The definition of an outlier will be different for every dataset. As a general rule of thumb, anything more than 3 standard deviations away from the mean might be considered an outlier.
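The rule of thumb above can be sketched in a few lines of NumPy (made-up ages; in practice you’d plot the distribution first):

```python
import numpy as np

# Mostly mid-twenties passengers, with one suspicious 95 (made-up numbers)
ages = np.array([25] * 10 + [24] * 10 + [26] * 9 + [95])

# Flag anything more than 3 standard deviations from the mean
mean, std = ages.mean(), ages.std()
outliers = ages[np.abs(ages - mean) > 3 * std]
```

One caveat worth knowing: the outlier itself inflates the standard deviation, so with very small samples a single extreme value can slip under the 3-standard-deviation threshold.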
How do you find outliers?
Distribution. Distribution. Distribution. Distribution. Four times is enough (I’m trying to remind myself here).
During your first pass of EDA, you should be checking what the distribution of each of your features is.
A distribution plot helps represent the spread of the different values across your data. And more importantly, helps to identify potential outliers.
Why should you care about outliers?
Keeping outliers in your dataset may result in your model overfitting (fitting the training data too closely). Removing all the outliers may result in your model being too generalised (it doesn’t do well on anything out of the ordinary). As always, it’s best to experiment iteratively to find the best way to deal with outliers.
Challenge: Other than figuring out outliers with the general rule of thumb above, are there any other ways you could identify outliers? If you’re confused about a certain data point, is there someone you could talk to? Hint: the acronym contains the letters M E S.
Getting more out of your data with feature engineering
The Titanic dataset only has 10 features. But what if your dataset has hundreds? Or thousands? Or more? This isn’t uncommon.
During your exploratory data analysis process, once you’ve started to form an understanding AND you’ve got an idea of the distributions AND you’ve found some outliers AND you’ve dealt with them, the next biggest chunk of your time will be spent on feature engineering.
Feature engineering can be broken down into three categories: adding, removing and changing.
The Titanic dataset started out in pretty good shape. So far, we’ve only had to change a few features to be numerical in nature.
However, data in the wild is different.
Say you’re working on a problem trying to predict the changes in banana stock requirements of a large supermarket chain across the year.
Your dataset contains a historical record of stock levels and previous purchase orders. You're able to model these well but you find there are a few times throughout the year where stock levels change irrationally. Through your research, you find during a yearly country-wide celebration, banana week, the stock levels of bananas plummet. This makes sense. To keep up with the festivities, people buy more bananas.
To compensate for banana week and help the model learn when it occurs, you might add a column to your data set with banana week or not banana week.
import numpy as np

# We know Week 2 is a banana week, so we can flag it using np.where()
df["Banana Week"] = np.where(df["Week Number"] == 2, 1, 0)
Adding a feature like this might not be so simple. You could find adding the feature does nothing at all since the information you’ve added is already hidden within the data. As in, the purchase orders for the past few years during banana week are already higher than other weeks.
What about removing features?
We’ve done this as well with the Titanic dataset. We dropped the Cabin column because it was missing so many values, before we even ran a model.
But what about if you’ve already run a model using the features left over?
This is where feature contribution comes in. Feature contribution is a way of figuring out how much each feature influences the model.
Why is this information helpful?
Knowing how much a feature contributes to a model can give you direction as to where to go next with your feature engineering.
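One common way to estimate feature contribution is a tree-based model’s built-in importances (a sketch on made-up data, not necessarily how the author’s notebook does it):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up data: three features, but only the first one drives the label
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(random_state=42).fit(X, y)
importances = model.feature_importances_  # one score per feature, summing to 1
```

Because only the first feature determines the label here, it should dominate the importance scores, which is the kind of signal you’d use to decide what to cut or improve.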
In our Titanic example, we can see the contribution of features like Pclass was among the highest. Why do you think this is?
What if you had more than 10 features? How about 100? You could do the same thing. Make a graph showing the feature contributions of 100 different features. ‘Oh, I’ve seen this before!’
Zipf’s law back at it again. The top features have far more to contribute than the bottom features.
Seeing this, you might decide to cut the lesser contributing features and improve the ones contributing more.
Why would you do this?
Removing features reduces the dimensionality of your data. It means your model has fewer connections to make to figure out the best way of fitting the data.
You might find removing features means your model can get the same (or better) results with less data and in less time.
Just as Johnny is a regular at the cafe I’m at, feature engineering is a regular part of every data science project.
Challenge: What are other methods of feature engineering? Can you combine two features? What are the benefits of this?
Building your first model(s)
Finally. We’ve been through a bunch of steps to get our data ready to run some models.
If you’re like me, when you started learning data science, this is the part you learned first. All the stuff above had already been done by someone else. All you had to do was fit a model on it.
Our Titanic dataset is small. So we can afford to run a multitude of models on it to figure out which is the best to use.
Notice how I put an (s) in the subtitle? You’ll want to pay attention to this one.
But once you’ve had some practice with different datasets, you’ll start to figure out what kind of model usually works best. For example, most recent Kaggle competitions have been won with ensembles (combinations) of different gradient boosted tree algorithms.
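A first pass at running a multitude of models might look like this (a sketch on made-up stand-in data; in practice X and y would be your prepared Titanic features and the Survived column):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Made-up stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

models = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# 5-fold cross-validated accuracy for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
```

Cross-validation gives each model several chances on different splits of the data, which is a fairer comparison than a single train/test split.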
Once you’ve built a few models and figured out which is best, you can start to optimise the best one through hyperparameter tuning. Think of hyperparameter tuning as adjusting the dials on your oven when cooking your favourite dish. Out of the box, the preset setting on the oven works pretty well but out of experience you’ve found lowering the temperature and increasing the fan speed brings tastier results.
It’s the same with machine learning algorithms. Many of them work great out of the box. But with a little tweaking of their parameters, they work even better.
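Turning the dials can be sketched with a grid search (again on made-up data; the parameter grid here is illustrative, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Made-up stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = (X[:, 0] > 0).astype(int)

# The 'dials' to adjust: number of trees and how deep they grow
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}

# Try every combination with 3-fold cross-validation, keep the best
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
best = search.best_params_
```

Like the oven analogy, the defaults often work well; the search just checks whether any nearby setting works better.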
But no matter what, even the best machine learning algorithm won’t result in a great model without adequate data preparation.
Exploratory data analysis and model building is a repeating circle.
A final challenge (and some extra-curriculum)
I left the cafe. My ass was sore.
At the start of this article, I said I’d keep it short. You know how that turned out. It will be the same as your EDA iterations. When you think you’re done. There’s more.
We covered a non-exhaustive EDA checklist with the Titanic Kaggle dataset as an example.
1. What question are you trying to solve (or prove wrong)?
Start with the simplest hypothesis possible. Add complexity as needed.
2. What kind of data do you have?
Is your data numerical, categorical or something else? How do you deal with each kind?
3. What’s missing from the data and how do you deal with it?
Why is the data missing? Missing data can be a sign in itself. You’ll never be able to replace it with anything as good as the original but you can try.
4. Where are the outliers and why should you pay attention to them?
Distribution. Distribution. Distribution. Three times is enough for the summary. Where are the outliers in your data? Do you need them or are they damaging your model?
5. How can you add, change or remove features to get more out of your data?
The default rule of thumb is more data = good. And following this works well quite often. But is there anything you can remove and get the same results? Less but better? Start simple.
Data science isn’t always about getting answers out of data. It’s about using data to figure out what assumptions of yours were wrong. The most valuable skill a data scientist can cultivate is a willingness to be wrong.
There are examples of everything we’ve discussed here (and more) in the notebook on GitHub and a video of me going through the notebook step by step on YouTube (the coding starts at 5:05).
FINAL BOSS CHALLENGE: If you’ve never entered a Kaggle competition before, and want to practice EDA, now’s your chance. Take the notebook I’ve created, rewrite it from top to bottom and improve on my result. If you do, let me know and I’ll share your work on my LinkedIn. Get after it.
Extra-curriculum bonus: Daniel Formoso's notebook is one of the best resources you’ll find for an extensive look at EDA on a Census Income Dataset. After you’ve completed the Titanic EDA, this is a great next step to check out.
If you’ve got something on your mind you think this article is missing, leave a response below or send me a note and I’ll be happy to get back to you.
I’m working on a longer form article. An introduction to exploratory data analysis to go along with the Code with Me video I did exploring the Kaggle Titanic dataset and the notebook code to go with it.
I’ve spent the past two days writing and refining it.
I wanted to get it published today but it’s getting late and you know my thoughts on sleep. I work better when I sleep well.
In the past I’d have trouble walking away from something unless it’s done. But I’ve learned, especially with writing (and code) it pays to walk away, think about nothing for a while and then come back at it with a different pair of eyes.
The next time you look at it, you’ll see things you missed before. That’s what I’ll be doing tomorrow morning.
If you want to read it in the meantime, it’s in draft form on Medium. It needs some graphics and a little tidying but if you do read it, what would you change?
This post originally appeared on Quora as my answer to 'Udacity or Coursera for AI machine learning and data science courses?'
Tea or coffee?
Burger or sandwich?
Rain or sunshine?
Pushups or pull-ups?
Can you see the pattern?
Similar but different. It’s the same with Udacity and Coursera.
I used both of them for my self-created AI Masters Degree. And they both offer incredibly high-quality content.
The short answer: both.
Keep scrolling for a longer version.
Let’s go through the five C’s of online learning.
If you’ve seen my work, you know I’m a big fan of digging your own path and online platforms like Udacity and Coursera are the perfect shovel. But doing this right requires thought around five pillars.
When you imagine the best version of yourself 3–5 years in the future, what are they doing?
Does it align with what’s being offered by Udacity or Coursera?
Is the future you a machine learning engineer at a technology company?
Or have you decided to take the leap on your latest idea and go full startup mode?
It doesn’t matter what the goal is. All of them are valid. Mine is different to yours and yours will be different to the other students in your cohort.
The important part is an insatiable curiosity. In Japanese, this curiosity is referred to as ikigai or your reason for getting up in the morning.
Day to day, you won’t always be bounding out of bed, running to the laptop to get into the latest class or complete the assignment you’re stuck on.
There will be days where everything else except studying seems like a better option.
Don’t beat yourself up over it. It happens. Take a break. Rest.
Even with all the drive in the world, you still need gas.
Sam was telling me about a book he read over the holidays.
‘There were some things I agreed with but some things I didn’t.’
My insatiable curiosity kicked in.
‘What did you disagree with?’
I was more interested in that. He said it was a good book. What were the things he didn’t like?
Why didn’t he like those things?
The contrast is where you learn the most.
When someone agrees with you, you don’t have to back up your argument. You don’t have to explain why.
But have you ever heard two smart people argue?
I want to hear more of those conversations.
When two smart people argue, you’ve got an opportunity to learn the most.
If they're both smart, why do they disagree?
What are their reasons for disagreeing?
Take this philosophy and apply it to learning online through Udacity or Coursera.
If they’re like tea and coffee, where's the difference?
When I did the Deep Learning Nanodegree on Udacity, I felt like I had a wide (but shallow) introduction to deep learning.
Then when I did Andrew Ng’s deeplearning.ai after, I could feel the knowledge compounding.
Andrew Ng’s teachings didn’t disagree with Udacity’s, they offered a different point of view.
The value is in the contrast.
Both partner with world-leading organisations.
Both have world class quality teachers.
Both have state of the art learning platforms.
When it comes to content, you won’t be disappointed by either.
I’ve done multiple courses on both platforms and I rate them among the best courses I’ve ever done. And I went to university for 5-years.
Udacity Nanodegrees tend to go for longer than Coursera.
For example, the Artificial Intelligence Nanodegree is two terms both about 3–4 months long.
Whereas Coursera Specializations (although at times a similar length), you can dip in and out of.
For example, complete part 1 of a Specialization, take a break and return to the next part when you’re ready. I’m doing this for the Applied Data Science with Python Specialization.
If content is at the top of your decision-making criteria, make a plan of what it is you hope to learn. Then experiment with each of the platforms to see which better suits your learning style.
Udacity has a pay upfront pricing model.
Coursera has a month-to-month pricing model.
There have been times I completed an entire Specialization on Coursera within the first month of signing up, hence only paying for one month.
Whereas for all the Udacity Nanodegrees I’ve done, I’ve paid the total up front and finished on (or after) the deadline.
This could be Parkinson’s Law at play: things take up as much time as you allow them.
Both platforms offer scholarships as well as financial support services, however, I haven’t had any experience with these.
I drove Uber on weekends for a year to pay for my studies.
I’m a big believer in paying for things.
When I pay for something, I take it more seriously.
Paying for something is a way of saying to yourself: I’m investing my money (and the time spent earning it), so I’d better invest my time into it too.
All the courses I’ve completed on both platforms have been worth more than the money I spent on them.
You’ve decided on a learning platform.
You’ve decided on a course.
You work through it.
You enjoy it.
Now what do you do?
Do you start the next course?
Do you start applying for jobs?
Does the platform offer any help with getting into the industry?
Udacity has a service which partners students who have completed a Nanodegree with a careers counsellor to help you get a role.
I’ve never got a chance to use this because I was hired through LinkedIn.
What can you do?
Don’t be focused on completing all the courses.
Completing courses is the same as completing tasks. Rewarding. But more tasks don’t necessarily move the needle.
Focus on learning skills.
Once you’ve learned some skills. Practice communicating those skills.
Share your work.
Have a nice GitHub repository with things you’ve built. Stack out your LinkedIn profile. Build a website where people can find you. Talk to people in your industry and ask for their advice.
Because a few digital certificates isn’t a reason to hire someone.
Done all that?
Good. Now remember, the learning never stops. There is no finish line.
This isn’t scary. It’s exciting.
You stop learning when your heart stops beating.
Let’s wrap it up
Both platforms offer some of the highest quality education available.
And I plan on continuing to use them both to learn machine learning, data science and many other things.
But if you can only choose one, remember the five C’s.
Curiosity — Stay curious. Remember it when learning gets tough.
Contrast — Remix different learning resources. All the value in life is at the combination of great things.
Content — What content matches your curiosity? Follow that.
Cost — Cost restrictions are real. But when used right, your education is worth it.
Continuation — Learn skills, apply them, share them, repeat.
I’ve written and made videos about these topics in the past. You might find some of the resources below valuable.
Zac emailed me asking a question. His options:
Keep on working and keep looking for new opportunities in the field…
Or go back to uni and finish the last 18 months of his degree.
He just finished an internship and has about 18-months left at university before he finishes his computer science degree.
It’s a tough choice.
I sat and thought about it for a while. Then replied to the email with some unedited thoughts.
And I’m sharing them here, also unedited. Bear in mind, I’ve never been to university to study computer science.
Here’s how I see it, I’m gonna write a few thoughts out loud.
- Where do you want to be/see yourself in 3-5 years?
It sounds like you’re pretty switched on to where your skillset lies (aka, teaching yourself, working on things which interest you).
Might be worth having a think about which one better suits the ideal version of you in 3-5 years.
Does that ideal version of you require a university degree? Or could that version of you get by without one?
- Which one is the most uncomfortable in the short term?
I’m very long term focused (I have to remind myself of this every day). So whenever I come up to a hard decision, I ask myself, ‘Which one is hardest in the short term?’
I treat short term as anything under 2-3 years (the starting era of the ideal version of yourself).
- 18-months isn’t really the longest time
How much of a rush are you in?
Could you stick out the 18-months, share your work through an online portfolio, upskill yourself through various other courses (and jump ahead of others) and come out with a degree AND some extra skills?
- Get after it
This is countering the above point.
If you think you have the balls to chase after it (sounds like you already do), why do you need university to be a gatekeeper?
Sure, not having an official degree may shut you off from some companies, but to me, a piece of paper never really meant much. Especially when the best quality materials in the world are available online.
I have a colleague doing a data science masters at UQ and he said he has learned way more since working with Max Kelsen than at university.
Put it this way, I was driving Uber this time last year. But I followed through with my curriculum, shared my work online and got found by an awesome company.
- Share your work
Whichever path you choose, I can’t emphasise this enough. Make sure people can find you online.
If you’re not going to get a degree, be the person whose name comes up on others’ LinkedIn feeds for data science posts. Have some good Medium articles, share what you’ve been doing.
It’ll feel weird at the start. Trust me. But then you’ll realise the potential of it.
All of a sudden, you can become an expert in your field by being the one to communicate the skills you’re learning.
How did I do?
What would you do in Zac’s situation? Learn online and look for more work experience? Or stick out the 18-months of computer science?
Jupyter Lab was open. The notebooks and data I was working on were sitting on the left.
It was close to home time.
After digging through docs all day to figure out how to get some old code running, my brain was looking like this: FQRUQ#$%(#$QJTQITHqRjlrjkaw
Every time I push to GitHub I have to look up a guide.
3-minutes later, I sent off a push command and noticed all the files were being pushed. Not what we wanted. Only the new stuff was to go up.
> git reset
> git add [notebooks]
> git commit -m "adding latest datascience notebooks"
> git push origin master
"You're already one commit ahead of master."
*clicks on first stackoverflow link*
"To rollback a commit you can use git revert..."
Back to the shell.
> git log
*copies previous commit hash key*
> git revert [insert above hash key]
All of a sudden two files disappeared from the Jupyter Lab directory. The exact two files I wanted to commit.
And the data folder was now empty. "Huh?"
30-minutes of trying to revert a revert later, 1 of the 2 notebooks was nowhere to be seen. We saved one because I still had it open. The other wasn’t as lucky. ~500 lines of Python code gone.
Moral of the story?
When it comes to Git. Move slow and save things.
PS. I found a really cool (and colourful) guide on how to use git. I’ve bookmarked it for future reference. If you want to step up your git/GitHub game, you might be interested.
The first time doing something is always the hardest.
People had asked me in the past, 'Have you entered any Kaggle competitions?' The answer was always no.
Until the other day. I made my first official submission.
I'd dabbled before. Looked around at the website. Read some posts. But never properly downloaded the data and went through it.
Why not? Fear. Fear of looking at the data and having no idea what to do. And then feeling bad for not knowing anything.
But after a while, I realised that's not a helpful way to think.
I downloaded the Titanic dataset. The one that says 'Start here!' when you visit the competitions page.
A few months into learning machine learning, I wouldn't have been able to explore the dataset.
I learned by starting at the top of the mountain instead of climbing up from the bottom. I started with deep learning instead of practising how to explore a dataset from scratch.
But that's okay. The same principle would apply if you start exploring a dataset from scratch. Once the datasets got bigger, and you wanted your models to be better, you'd have to learn deep learning eventually.
Working through the Titanic data took me a few hours. Then another few hours to tidy up the code. The first run-through of any data exploration should always be a little messy. After all, you're trying to build an intuition of the data as quickly as possible.
Then came submission time. My best model got a score of just under 76%. Yours will too if you follow through the steps in the notebook on my GitHub.
I made the notebook accessible so you can follow it through and make your very own first Kaggle submission.
There are a few challenges and extensions too if you want to improve on my score. I encourage you to see how you go with these. They might improve the model, they might not.
If you do beat my score, let me know. I'd love to hear about what you did.
Want a coding buddy? When I finished my first submission, I livestreamed myself going step by step through the code. I did my best to explain each step without going into every little detail (otherwise the video would've been 6-hours long instead of 2).
I'll be writing a more in-depth post on the what and why behind the things I did in the notebook. Stay tuned for that.
In the meantime, go and beat my score!
You can find the full code and data on my GitHub.
So you’ve got some data and you’re wondering what can be learned from it. Is it numerical or categorical? Does it have high dimensionality or cardinality?
It’s no secret that data is everywhere. But it’s important to recognise not all data is the same. You might have heard the term data cleaning before. And if you haven’t, it’s not too different to regular cleaning.
When you decide it’s time to tidy your house, you put the clothes on the floor away, and move the stuff from the table back to where it should go. You’re bringing order back to a chaotic environment.
The same thing happens with data. When a machine learning engineer starts looking at a dataset, they ask themselves, ‘where should this go?’, ‘what was this supposed to be?’ Just like putting clothes back in the closet, they start moving things around, changing the values of one column and normalising the values of another.
But wait. How do you know what to do to each piece of data?
Back to the house cleaning analogy. If you have a messy kitchen table, how do you know where each of the items goes?
The spices go in the pantry because they need to stay dry. The milk goes back in the fridge because it has to stay cold. And the pile of envelopes you haven’t opened yet can probably go into the study.
Now say you have a messy table of data. One column has numbers in it, the other column has words in it. What could you do with each of these?
A convenient way to break this down is into numerical and categorical data.
Before we go further, let’s meet some friends to help unpack these two types of values.
Harold the pig loves numbers. He counts his grains of food every day.
Klipklop the horse watches all the cars go past the field and knows every type there is.
And Sandy the fish loves both. She knows there’s safety in numbers and loves all the different types of marine life under the sea.
Like Harold, computers love numbers.
With any dataset, the goal is often to transform it in a way so all the values are in some kind of numerical state. This way, computers can work out patterns in the numbers by performing large-scale calculations.
In Harold’s case, his data is already in a numerical state. He remembers how many grains of food he’s had every day for the past three years.
He knows on Saturdays he gets a little extra. So he saves some for Mondays when the supply is less.
You don’t necessarily need a computer to figure out this kind of pattern. But what if you were dealing with something more complex?
Like predicting what Company X’s stock price would be tomorrow, based on the value of other similar companies and recent news headlines about Company X?
Ok – so you know the stock prices of Company X and four other similar companies. These values are all numbers. Now you can use a computer to model these pretty easily.
But what if you wanted to incorporate the headline ‘Company X breaks new records, an all-time high!’ into the mix?
Harold is great at counting. But he doesn’t know anything about the different types of grains he has been eating. What if the type of grain influenced how many pieces of grain he received? Just like how a news headline may influence the price of a stock.
The kind of data that doesn’t come in a straightforward numerical form is called categorical data.
Categorical data is any kind of data which isn’t immediately available in numerical form. And it’s typically where you will hear the terms dimensionality and cardinality thrown around.
This is where Klipklop the horse comes in. He watches the cars go past every day and knows the make and model of each one.
But say you wanted to use this information to predict the price of a car.
You know the make and model contribute something to the value. But what exactly?
How do you get a computer to understand that a BMW is different from a Toyota?
This is where the concept of feature encoding comes in. Or in other words, turning a category into a number so that a computer learns how each of the numbers relates.
Let’s say it’s been a quiet day and Klipklop has only seen 3 cars.
A BMW X5, a Toyota Camry and a Toyota Corolla. How could you turn these cars into numbers a machine could understand whilst still keeping their inherent differences?
There are many techniques, but we’ll look at two of the most popular – one-hot-encoding and ordinal encoding.
Ordinal encoding is where each car is assigned a number in the order it appeared.
Say the BMW went by first, followed by the Camry, then the Corolla.
But does this make sense?
By this logic, a BMW X5 plus a Camry should equal a Corolla (1 + 2 = 3). Not really.
Ordinal encodings can be used for some situations like time intervals but it’s probably not the best choice for this case.
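Here's the ordinal idea as a sketch, with Klipklop's three cars numbered in the order they drove past:

```python
# Ordinal encoding sketch: each car gets a number in the order it appeared.
cars = ["BMW X5", "Toyota Camry", "Toyota Corolla"]
ordinal = {car: i + 1 for i, car in enumerate(cars)}

print(ordinal)  # {'BMW X5': 1, 'Toyota Camry': 2, 'Toyota Corolla': 3}

# The flaw: the numbers imply an arithmetic the categories don't have.
# 1 + 2 == 3, as if a BMW X5 plus a Camry "equals" a Corolla.
assert ordinal["BMW X5"] + ordinal["Toyota Camry"] == ordinal["Toyota Corolla"]
```

The encoding is easy to build, but the ordering it imposes is an accident of arrival time, not a property of the cars.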
One-hot encoding assigns a 1 to every value that applies to each individual car, and 0 to every value that does not apply.
Now our two Toyotas are similar to each other because they both have 1’s for Toyota but differ on their model.
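A sketch of one-hot encoding for the same three cars, again in plain Python (a data library would offer a one-liner for this, but spelling it out shows where the columns come from):

```python
# One-hot encoding sketch: one column per distinct make/model value,
# with a 1 if the value applies to the car and a 0 if it doesn't.
cars = [
    {"make": "BMW", "model": "X5"},
    {"make": "Toyota", "model": "Camry"},
    {"make": "Toyota", "model": "Corolla"},
]

# Every distinct value becomes its own column.
values = sorted({v for car in cars for v in car.values()})
encoded = [[1 if v in car.values() else 0 for v in values] for car in cars]

print(values)      # ['BMW', 'Camry', 'Corolla', 'Toyota', 'X5']
print(encoded[1])  # the Camry: [0, 1, 0, 1, 0]
print(encoded[2])  # the Corolla: [0, 0, 1, 1, 0]
```

Note the two Toyotas share a 1 in the Toyota column but differ in their model columns, and that the two original columns (make, model) have become five.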
One-hot encoding works well to turn category values into numbers but has a downside. Notice how the number of values used to describe a car increased from 2 to 5.
This is where the term high dimensionality gets used. There are now more parameters describing each car than there are cars.
For a computer to learn meaningful results, you want the ratio skewed the other way: far more examples than parameters.
In other words, you’d prefer to have 6,000 examples of cars and only 6 ways of describing them rather than the other way round.
But of course, it doesn’t always work out this way. You may end up with 6,000 cars and 1,000 different ways of describing them because Klipklop has seen 500 different types of makes and models.
This is the issue of high cardinality – when you have many different ways of describing something but not many examples of each.
For an ideal price prediction system, you’d want something like 1,000 Toyota Corollas, 1,000 BMW X5s and 1,000 Toyota Camrys.
Ok, enough about cars.
What about our stock price problem? How could you incorporate a news headline into a model?
Again, you could do this a number of ways. But we’ll start with a binary representation.
You were born before the year 2000, true or false?
Let’s say you answered true. You get a 1. Everyone born after the year 2000 gets a 0. This is binary encoding in a nutshell.
For our stock price prediction, let’s break our news headlines into two categories – good and bad. Good headlines get a 1 and bad headlines get a 0.
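As a sketch, with made-up headlines and made-up good/bad labels (in a real system the labelling itself would be the hard part):

```python
# Binary encoding sketch: each headline is labelled good or bad,
# and the label becomes a 1 or a 0. Headlines and labels are invented.
headlines = [
    ("Company X breaks new records, an all-time high!", "good"),
    ("Company X misses earnings forecast", "bad"),
]

encoded = [1 if label == "good" else 0 for _, label in headlines]

print(encoded)  # [1, 0]
```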
With this information, we could scan the web, collecting headlines as they come in and feeding these into our model. Eventually, with enough examples, it would start to get a feel of the stock price changes based on the value it received for the headline.
And with the model, you start to notice a trend – every time a bad headline comes out, the stock price goes down. No surprises.
We’ve used a simple example here and binary encodings don’t exactly capture the intensity of a good or bad headline. What about neutral, very good or very bad? This is where the previously discussed ordinal encoding could come in.
-2 for very bad headlines, -1 for bad, 0 for neutral, 1 for good and 2 for very good. Now it makes sense that very bad + very good = neutral.
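The five-step scale as a sketch, where ordinal encoding fits because headline sentiment genuinely has an order:

```python
# Ordinal sentiment sketch: unlike car makes, these categories have a
# natural order, so the arithmetic between them is meaningful.
scale = {"very bad": -2, "bad": -1, "neutral": 0, "good": 1, "very good": 2}

# Opposites cancel out: a very bad headline plus a very good one is neutral.
assert scale["very bad"] + scale["very good"] == scale["neutral"]
assert scale["bad"] < scale["neutral"] < scale["good"]
```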
There are more complex ways to bring words into a machine learning model but we’ll leave those for a future article.
The important thing to note is that there are many different ways seemingly non-numerical information can be converted into something a computer can understand.
What can you do?
Machine learning engineers and data scientists spend much of their time trying to think like Sandy the fish.
Sandy knows she’ll be safe staying with the other school of fish but she also knows there’s plenty to learn from exploring the unknown.
It’s easy to lean on only numerical information to draw insights from. But there’s so much more information hidden in diverse ways.
By using a combination of numerical and categorical information, more realistic and helpful models of the world can be built.
It’s one thing to model the stock market using price information, but it’s a whole other game when you add news headlines to the mix.
If you’re looking to start harnessing the power of your data with techniques like machine learning and data science, there are a few things you can do to get the most out of it.
Normalising your data
If you’re collecting data, what format is it stored in?
The format itself isn’t necessarily as important as the uniformity. Collect it but make sure it’s all stored in the same way.
This applies for numerical and categorical data, but especially for categorical data.
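Why uniformity matters for categorical data, in a sketch: the same category spelled three ways is three different categories to a computer, so a simple normalisation step (this one just trims whitespace and lowercases) can collapse them before encoding.

```python
# Uniformity sketch: three spellings of the same make look like
# three distinct categories until the values are normalised.
raw = ["Toyota", "toyota ", "TOYOTA"]
clean = [s.strip().lower() for s in raw]

print(len(set(raw)))    # 3 distinct values before normalising
print(len(set(clean)))  # 1 distinct value after
```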
More is better
The ideal dataset has a good balance between cardinality and dimensionality.
In other words, plenty of examples of each particular kind of sample.
Machines aren’t quite as good as humans when it comes to learning (yet). We can see Harold the pig once and remember what a pig looks like, whereas a computer needs thousands of pictures of pigs to learn what a pig looks like.
A general rule of thumb for machine learning is that more (quality) data equals better models.
Document what each piece of information relates to
As more and more data is collected, it’s important to be able to understand what each piece of information relates to.
At Max Kelsen, before any kind of machine learning model is run, the engineers spend plenty of time liaising with subject matter experts who are familiar with the data set.
Why is this important?
Because a machine learning engineer may be able to build a model which is 99% accurate but it’s useless if it’s predicting the wrong thing. Or worse, 99% accurate on the wrong data.
Documenting your data well can help prevent these kinds of misfires.
It doesn’t matter whether you’ve got numerical data, categorical data or a combination of both – if you’re looking to get more out of it, Max Kelsen can help.