Advice which made me a better machine learning engineer

Athon was doing a talk. Something about Variational Autoencoders. He got deep. Much of it I didn’t understand. All I know is one half tries to condense data from a larger distribution into a smaller one and the other half tries to turn the smaller one back into something as close as possible to the original.
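(For the curious, the idea in code looks something like this. A minimal sketch in PyTorch, assuming a simple MLP encoder and decoder. Definitely not what Athon had on his slides.)

import torch
from torch import nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        # one half condenses the bigger distribution into a smaller one
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        # the other half tries to rebuild the original from the smaller one
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim), nn.Sigmoid())
    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # sample from the smaller distribution
        return self.decoder(z), mu, logvar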

We went for a break.

There were fish sticks, wings, more 10-minute foods. The kind which taste good for the 10 minutes you’re eating them but make you feel terrible after.

John was there. We’d met before. He was telling me about a hackathon his team won by using the Julia programming language to denoise images and then use them for image recognition.

He told me how his company got acquired by a larger company. He’d been at the new company a few weeks but preferred smaller companies.

In between bites of fish sticks, I asked John questions.

John had been programming since he was young. I had 18 months under my belt. There were things he said I didn’t understand but I kept up a constant stream of nods.

John asked me a question.

What do you think your strength is?

I spoke.

Well, I know I’ll never be the best engineer. But…

John interrupted me.

You won’t be with that attitude.

I was going to continue with my regular story. I know I’ll never be the best engineer but I can be the communication bridge between engineers and customers.

But I didn’t. I digested what John said along with chewed fish sticks. I spoke.

You’re right.

John kept talking.

You won’t improve if you think like that. Even if you know you won’t be the best, be careful what words you use, they’ll hold you back.

Every time I’ve run into a problem since and wanted to bang my head against a wall, wanted to give up, wanted to try something easier instead of doing the hard thing, I remember back to what John said.

I say to myself.

I’m the best engineer in the world.

Source: https://qr.ae/TWNE3s

How do you learn machine learning when you're rusty at math?

Mum, how can I get out of the exam?

What?

I’m going to fail.

Tears started filling my eyes. I was sitting at my desk with the lamp on, 11 pm the night before the final exam.

Maths C. That’s what they called it. There was Maths A and B but C was the hardest and I was doing it.

There was something about matrices and imaginary numbers and proofs. I couldn’t do any of it. Only a few matrix multiplications, the easier ones.

I did the exam. Somehow I passed. I shouldn’t have. My teacher let me off the hook. That was 2010.

University came and I majored in biomedicine. I failed my first statistics course, twice. Then I changed out of biomedicine.

I graduated in 2015 with a dual major in food science and nutrition. Now food is one of my religions.

2017 happened and I decided to get into machine learning. I’d seen the headlines, read the articles.

Andrew Ng’s machine learning course kept getting recommended so I started there.

The content was tough. All the equations, all the Greek symbols. I hadn’t touched any of it since high school.

But the code was exciting, and thanks to Ng’s teaching skills, the concepts just made sense.

So I followed them. Kept going at it. This time I didn’t have an Xbox to distract me like in high school.

My math is still rusty. I’ve done some Khan Academy courses on matrix manipulation and calculus, and bookmarked some linear algebra courses to get into. There’s one from Rachel Thomas and another from Stanford.

Math is a language, it takes time to learn, time to be able to speak it. Programming is the same. Machine learning combines them both and a bit more.

I started with programming first. Built some machine learning models using Python, TensorFlow, PyTorch and others. Saw how the concepts linked with the code. It got me hooked.

You can start learning machine learning without an in-depth knowledge of the math behind it. If your math is rusty, you can learn machine learning with concepts and code first. Many of the tools available to you abstract away the math and let you build.

But when you gain a little momentum, learn a little more, hit a roadblock, you can dive into the math.

Source: https://qr.ae/TWNzgr

The future of education is online (+ 5 resources I've been loving)

Not everyone has access to the best colleges in the world. But the internet provides a way for everyone to access the best knowledge in the world.

There is no shortage of learning materials. Only a shortage of willingness to learn.

Even with such great learning resources available, it still takes a dedicated effort to work through them. To build upon and to create with them.

And one of the best ways for knowledge to spread and be useful is if it’s shared.

Here are 5 things which have caught my attention this week:

1. Open-source state-of-the-art conversational AI

Thomas Wolf wrote a great blog post summarising how the HuggingFace team built a competition-winning conversational AI.

All done in 250 lines of refactored PyTorch code on GitHub! 🔥

2. Open-source Data Science Degree

The Open Source Society University repository contains pathways you can use to take advantage of the internet to educate yourself.

3. GitHub Learning Lab

I need to get better at GitHub.

It’s a required skill for all developers and coders.

So I've been using the GitHub learning lab, a free training resource from The GitHub Training Team.

4. 30+ deep learning best practices

This forum post from the fast.ai forums collates some of the best tidbits for improving your models.

My favourite is the cyclic learning rate.

5. A neural network recipe from Tesla's AI Lead

Training neural networks can be hard. 

But there are a few things you can do to help.

And Andrej Karpathy has distilled them for you.

My favourite?

Become one with the data.

PS this post is an excerpt from the newsletter I sent out this morning. If you’d like more like these delivered to your inbox, sign up.

University vs. Studying Online and How to Get Around Smart People

Lukas emailed me asking a few questions. I replied with some answers and then he dug deeper. He thought about what I said and then wanted to know more.

I replied with some of my thoughts, which I’ve tidied up a bit and put below. The headings are the topics Lukas was curious about. This post doesn’t have all the context but I think you’ll find some value in it.

Hey Lukas,

I’ll answer these how I did the last ones and break them apart a bit.

1. “University/school teaches some stuff that you don’t really need or want”

This is true. But also true of all learning. Whatever resource you choose, you’ll never use all of it. Some knowledge will come from elsewhere, some will vanish into nothing.

The reason learning online is valuable is it gives you the chance to narrow down on what it is you want immediately. University and school take a ‘boil the ocean’ approach because that’s the only valid one for what they offer. Individualised learning hasn’t made its way into traditional education services. I found I learn best when I follow what I’m interested in, so I take the approach of learning the most important thing when it’s required. What's most important? It will depend on the project you’re working on.

Whilst this is an ideal approach for me, it’s important to always reflect on practicality. If I’m building a business and all I want to do is follow what I’m interested in, will that always line up with what customers/the market want? Maybe. Maybe not.

Lately, I’ve been taking the concept of time splitting and applying it to most of what I do. A 70/20/10 split I stole from Google.

In essence, 70% on core product/techniques (improving and innovating on existing knowledge), 20% on new ventures (still tied to core product) and 10% on moonshots (things that might not work).

In the case of my core product, it’s learning health and machine learning skills that can be applied immediately. I distil these into a work project or online creation I share with others.

For new ventures, it’s taking the core product skills and expanding them into things I haven’t yet done: learning a new technique, working on a new project. But still tied to the core pillars of health and technology.

For moonshots, it’s asking, ‘where will the world be in 5-10 years and how can I start working on those things now?’ These don’t necessarily have to relate to the core product but mine kind of still do (since the crossover of health, technology and art interests me most). For this, I’ve been playing around with the idea of an augmented reality (AR) coach/doctor. If AR glasses are going to be a thing, how could I build a health coach service which lives in the AR realm and is summoned/ever-present to give insights into different aspects of your health? All of this would of course be personalised to the individual.

If you're still on the fence between university and learning on your own, one thing you may want to look into is the ‘2-year self-apprenticeship’. I wrote an article about this which will shed some more light. Especially at 20, this is something I’d highly recommend (I already have to my brothers, who are your age).

Remember, there's no rush. You've got plenty of time. Work hard and enjoy it.

2. “Why math at university versus on your own?”

I mentioned I was thinking of going to university to study mathematics rather than online. Here's why.

I learned Chinese and Japanese throughout 2016. The most helpful thing was being able to practice speaking with other people face to face.

I stopped after a year and have lost most of what I learned.

Why?

Because I don’t use it and don’t need to use it every day. English is 99.999% enough for conversations in Australia and the work I do.

Math is also a language. The language of nature. Being able to speak it and work on it with other people is a great way to accelerate your knowledge.

That isn’t to say you couldn’t do the same online. But put it this way, I would never try to learn another language without practising conversing from day 1.

If you want to learn French, move to France. If you want to learn math, take math classes with other people who speak math.

3. “How do you get physically around smart people?”

Aside from working with a great team or going to university and having a great cohort, meetups are the number 1 thing for this.

They are weird and awkward and beautiful.

I always feel like a fish out of water there because everyone seems like a genius.

Events related to your field are priceless. They don’t have to be too often either. I’m finding once a month or so as a sound check to be enough.

4. “Which platform was best for opportunities?”

For content partnerships and online business opportunities: YouTube & LinkedIn (I've been approached by or partnered with Coursera, A Cloud Guru, DataCamp, educative.io and more).

For career progression: LinkedIn. If I was looking for a job or more business opportunities, I’d be posting and interacting here daily.

For reaching an audience: Medium. Words are powerful. Writing every day is the best habit I have (aside from daily movement and staying healthy).

A tip for creating.

People are interested in two things when they look at content. Being educated and/or being entertained. Bonus points if you can do both but you don’t need to do both. One will suffice.

Especially if you’re doing a 2-year self-apprenticeship or some kind of solo learning journey, share your work from day 1. Share what you’re learning and teach others if you can.

Do not expect it to go viral. Do not expect everyone to love it. These aren’t required.

What’s required is for you to continue improving your skills and to continue improving how to communicate said skills.

Over the long term, those two things are what matter.

Let me know if there are any follow-ups.

Great questions.

Best,

Daniel Bourke

Activity vs. Progress

“Are you making progress or completing activities?” he said, “That’s what I ask myself at the end of each day.”

“I’m writing that down.”

We kept talking. Not much more worth writing down though.

“Let me know what you get up to.”

“Okay, I will.”

“Talk soon.”

“Have a good day mate. Goodbye.”

Too many activities can feel like progress. That’s what he was talking about. You could be working yourself to the bone but the list never gets any smaller.

Maybe it’s time to get a new list.

One which leads to progress instead of a whole bunch of activities being checked off at the end of the day.

I catch myself when I’m writing a list each morning. On the days where there are only two or three things, write, workout, read, I go to add more as a habit. But would more activities lead to progress?

If your goal is to progress, you must decide which activities lead to it and which don’t. It’s hard and you’ll never be able to do it for sure, but you can make a decision to. A decision to step back, a decision to think about what adds to progress, and to cut what doesn’t.

In my latest video, I share how I got Google Cloud Professional Data Engineer Certified. I passed the exam without meeting any of the prerequisites. How? A few activities which led to progress. But the certification isn’t the real progress. The real progress comes from doing something with the skills the certificate requires. More on that in the future.







What does a machine learning engineer's day look like?

Someone asked me on LinkedIn what they should learn for the rest of the year in order to become a machine learning engineer.

The specific skills are hard to narrow down as every role will be different. I can only share what I’ve learned the past year being a machine learning engineer at Max Kelsen.

I’ve copied the message I replied with here.

Hey [name removed]!

I'm great thank you! I trust you're well too.

Well, machine learning engineers may have different roles at different companies but let me talk you through what my day usually looks like.

  • 9 am - reading articles/papers online about machine learning (arXiv and Medium are the two usual places).

  • 10 am - working on the current project and (sometimes) applying what I've just been reading online.

  • 4 pm - pushing my code to GitHub and writing down experiments for the next day.

  • 5 pm - sending a small report to the team about what I've been working on during the day.

(these are all ideal scenarios)

Now, what happens between 10 am and 4 pm (this is where most of the code gets done)? Usually, it will all be Python code within a Jupyter Notebook, playing around with different datasets.

At the moment I'm working on a text classification problem using the Flair library.

As for what skills I'd suggest are most valuable (in my current role).

1. Exploring datasets using exploratory data analysis, this notebook by Daniel Formosso is a great example.

I also wrote an article with a bit more of a gentle introduction to exploratory data analysis which may help.

2. Being able to research different data science and machine learning techniques and apply them to current problems.

This one is a little more tricky because it will be different from problem to problem.

How you could practice this would be to enter a Kaggle competition (previous or current) and start figuring out different practices for different kinds of data, tabular, text, images.

Why Kaggle?

Because it's free, there are others who show their work (so you know what a good job is) and the datasets are relatively close (all real datasets differ a little) to what you'd be working on as a machine learning engineer.

Once you've spent a couple of months doing 1. and 2. you may want to look into what it takes to deploy a machine learning model in production. However, don't rush towards this. This is still a bit of a dark art (it's doable but not well documented yet). I think over the next year, this step will become more and more accessible to everyone.

I hope this helps.

Let me know if you'd like me to tidy anything/clarify some things.

[If you’re reading this, you can reach out and ask questions too, I’ll do my best to answer.]

So many people are learning machine learning. What should you do to stand out?

There it was. Podcasts, YouTube, blog posts, machine learning here, there, changing this, changing that, changing it all.

I had to learn. I started. Andrew Ng’s Machine Learning course on Coursera. A bunch of blog posts. It was hard but I was hooked. I kept going. But I needed some structure. I put a few courses together into my own AI Masters Degree. I’m still working through it. It won’t finish. The learning never stops.

Never.

You know this. You’ve seen it happening. You’ve seen the blog posts, you’ve seen the Quora answers, you’ve seen the endless papers, the papers which are hard to read, the good ones which come explained well with code.

Everyone is learning machine learning.

Machine learning is learning everyone.

How do you stand out?

How how how.

A) Start with skills

The ones you know about: math, code, probability, statistics. All of these could take decades to learn well on their own. But decades is too long. Everyone is learning machine learning. You have to stand out from everyone.

There are courses for these things and courses are great. Courses are great for building a foundation.

Read the books, do the courses, structure what you’re learning.

This week I’m practising code for 30 minutes per day. 30 minutes. That’s what I have to do. When I don’t feel like practising, I’ll remind myself. These are the skills I have to learn. It’ll be yes or no. It’s my responsibility. I’ll do it. Yes.

Why skills?

Because skills are non-negotiable. Every field requires skills. Machine learning is no different.

If you’re coming from zero, spend a few months getting into the practical work of one thing: math, code, statistics, something. My favourite is code, because it’s what the rest come back to.

If you’re already in the field, a few months or a few years in, reassess your skills. What needs improving? What are you good at? How could you become the best in the world at it? If you can’t become the best in the world, combine it with something else you’re good at and become the best in the world at the crossover.

B) Got skills? Good. Show them.

Ignore this if you want.

Ignore it and only pay attention to the above. Only pay attention to getting really good at what you’re doing. If you’re the best in the world at what you do, it’s inevitable the world will find out.

What if you aren’t the best in the world yet?

Share your work.

You make a website.

machinelearner.com

I made this up. It might exist.

On your website you share what you’ve been up to. You write an article on an intuitive interpretation of gradient descent. There’s code there and there’s math there. You’ve been working on your skills, so to give back, you share what you’ve learned in a way others can understand.

The code tab links to your GitHub. On your GitHub you’ve got examples of different algorithms with comments around them, and a tutorial on exploratory data analysis of a public health dataset, since you’re interested in health. You’ve ingested a few recent papers and tried to apply them to something.

LinkedIn is your resume. You’ve listed your education, your contributions to different projects, the projects you’ve built, the ones you’ve worked on. Every so often you share an update of your latest progress. This week I worked on adding some new functions to my health project.

You’re getting a bit of traction but it’s time to step it up. You’re after the machine learning role at Airbnb. Their website is so well designed, you’ve stayed at their listings, you’re a fan of the work they do and you know you could bring them value with your machine learning skills.

You make another website.

whyairbnbshouldhiremeasamachinelearningengineer.com

I made this one up too. Kudos if you’re already on it.

You send it to a few people on the Airbnb recruitment team you found on LinkedIn with a message.

Hi, my name is Charlie, I hope this finds you well.

I’ve seen the Machine Learning Engineer role on your careers page and I’d like to apply.

I made this website which shows my solutions to some of your current challenges.

If you check it out, I’d love your advice on what best to do next.

5 of the 6 people you message click on it. This is where they see what you’ve done. You built a recommendation engine. It runs live in the browser. It uses your machine learning skills. Airbnb needs a machine learning engineer who has experience with recommendation engines. After all, recommending things is their business.

3 reply with next steps of what to do. The other 2 refer you to other people.

How many other people sent through a website showcasing their skills?

0.

Maybe you don’t want a job. Maybe you want to research. Maybe you want to get into a university. The same principles apply.

Get good at what you do. Really good.

Share your work.

How much?

80% skills.

20% sharing.

Source: https://qr.ae/TWpAS2

How do non-technical people learn machine learning?

I drove forward.

The parking inspector started speaking.

Do you have a valid Queensland driver’s licence?

I answered.

Yes.

He kept going.

Well, you shouldn’t because you should know you can’t park in bus stops.

The Uber app guided me to pick up riders. I followed the app without paying attention to the signs. I was more focused on picking them up and getting them out of there. It was 2 am.

The fine came through. $250. I worked for free that night.

I paid it.

Then thought to myself.

I’m not driving Uber anymore.

Two weeks later I got offered an internship as a machine learning engineer.

9 months before that I started my own AI Masters Degree.

Before that, I graduated with a Food Science and Nutrition Degree. As non-technical as it gets.

Where do you start?

A) Delete non-technical from your vocabulary

Words have power. Real power.

They’re magic. It’s why when you list out the letters of a word it’s called spelling.

People isolate themselves with their words.

Some say play to your strengths, others say work on your weaknesses. Both good advice. Which one should you listen to?

As soon as you start saying you’re non-technical, you’re non-technical.

I was speaking to someone the other night.

I used to think my main strength was talking to people.

I told him.

I’ll never be the best engineer.

He snapped back.

Not with that attitude.

It changed me. I’m not trying to be the best engineer but referring to myself as never being the best was limiting my ability to grow.

I’m getting better. Much better. Why?

Because I told myself so.

You can too.

Belief is 50% of anything.

B) Use the placebo effect to your advantage

Here’s another.

Have you heard of the placebo effect?

It’s one of the most dominant forces in science. But it’s not limited to researchers in lab coats. You can use it too.

Example.

People who thought they were taking good medicine (but were actually only taking a placebo, or a sugar pill) got healthier.

What?

Why?

Because they thought they were taking the good medicine and the cosmic forces between the mind, body and universe set them on the track to better health.

I’ve simplified it and used cosmic forces on purpose. Because this effect is still largely unexplained, other than describing it as a belief which led to improvement.

What can you do?

The same thing. Take a placebo pill of learning machine learning.

Write it down.

This will be hard for me but I can learn it.

Again.

This will be hard for me but I can learn it.

All useful skills are hard to learn.

C) Get some coding foundations

The first two are most important. The rest snowballs as you go.

Someone commented on my LinkedIn the other night.

One of my favourite sayings from my professor was, "in theory, theory and practice are the same. In practice, they are completely different".

Good advice.

Could you learn to swim without ever touching the water?

If you want to get into machine learning, learn to code, it’s hard to begin with but you get better.

Practice a little every day. And if you miss a day, no problem, continue the next day.

It’s like how your 3-year-old self would’ve learned to talk.

In the beginning, you could only get a few sounds out. A few years later, you can have whole conversations.

Learning to code is the same. It starts out as a foreign language. But then as you learn more, you can start to string things together.

My brother is an accountant. He’s starting to learn machine learning. I recommended he start with Python on DataCamp. Python code reads similar to how you would read words. Plus, DataCamp teaches code from 0 to full-blown machine learning. He's been loving it.

D) Build a framework

Once you’ve been through a few DataCamp courses or learned some Python in general, start to piece together where you want to head next.

This is hard.

Because in the beginning it’s hard to know where you want to go and there’s a bunch of stuff out there.

So you’ve got two problems. Not knowing where to go and having too many things to choose from.

If you know you want to learn more machine learning, why not put together your own path?

What could this look like?

  1. 3–4 months of DataCamp

  2. 3–4 months of Coursera courses

  3. 3–4 months going through the fast.ai curriculum

Do you have to use these?

No.

I only recommend them because I’ve been through them as a part of my AI Masters Degree. The best advice comes from mentors who are 1–3 years ahead of you. Short enough to still remember the specifics and long enough to have made some mistakes.

Will it be easy?

No.

All useful skills are hard to learn.

Day by day you may not feel like you’re learning much. But by the end of the year (3 blocks of 4 months) you’ll be a machine learning practitioner.

E) You don’t need math*

*to get started.

When you look at machine learning resources, many of them have a bunch of math requirements.

Math isn’t taught well in schools so it scares people.

Like code, mathematics is another language. Mathematics is the language of nature.

If the math prerequisites of some of the courses you’ve been looking at are holding you back, you can get started without it.

The Python coding frameworks, such as TensorFlow, PyTorch, NumPy and sklearn, abstract away the need to fully understand the math (don’t worry if you don’t know what these are, you’ll find them later).

As you go forward and get better at the code, your project may demand knowledge of the math involved. Learn it then.

F) It’s always day one

Am I the best machine learning engineer?

No.

But two years ago I was asking myself the question, how do I learn machine learning with no technical skills?

The answer was simple, start learning the technical skills and don’t stop, but there were details.

Details like above.

Driving Uber on the weekends allowed me to pay for the courses I was doing to learn machine learning.

Getting a fine for picking up people in the wrong spot helped me make the decision to back myself.

A year into being a machine learning engineer and I’m more technical than when I started but there’s plenty more to learn.

How does machine learning get used by regular people?

You wake up. You check your phone. It’s been on charge all night. The battery usage from the previous day was recorded and will be used along with previous use history to maximise charge. Machine learning.

Your new emails load, there’s an email from Steve. He wants the thing done earlier. It’s always earlier. Faster.

There are a few more emails. One from Amazon. Your package is going to be shipped today. And a new Uber promotion. 10% off rides for this week. You click the link and apply the code.

The rest of the emails are garbage. Stuff you’ve subscribed to but didn’t need. At least you subscribed to them.

There are another 15 emails you don’t see. They’re in the spam folder. Your email client scanned through the text as they came in and put them there. Machine learning.

It isn’t perfect though. You could’ve been a millionaire. A billionaire! If only you’d seen the email and sent that Nigerian Prince your bank details.

A charge comes through on your card. The amount pops up on your screen. $37.85 from Amazon. Another email. Your order of Machine Learning is Everywhere is on its way.

Another charge. Gym membership for the week.

All this money, flying across the internet. You’d think someone could tap into this stream and take some. They try but they can’t. Even when they get your card and try to buy something, it might go through the first time, but by the second time, the card is dead.

What happened?

Your bank detected the fraud and froze your account. The transaction in another country wasn’t like the other ones you’ve made in the past.

Another email.

We’ve detected fraud and frozen your account. Don’t worry, your funds are safe. To sort this out, you can contact us here.

There’s no way any person could monitor all the transactions happening. Machine learning.

You call the bank. Your card has been unfrozen and the funds will be back in your account in 24 hours.

It’s 8:34 am. The Uber app pops up at the bottom of your phone. Your calendar says work starts at 9 and based on previous trips, your phone knows it takes about 16 minutes to get there. Machine learning.

6 drivers are close by. Less than usual. Your work address is already preloaded. You get matched with the Black Prius, licence plate, 889LYJK. Machine learning.

The driver takes a route you’ve never been. And then the map corrects itself to adjust for traffic. Machine learning.

Josh is in Bali. Sarah had a birthday party on the weekend. Their photos are on Instagram. An ad appears for those new shoes you’ve been looking at. The ones with the orange. The ad was delivered in your feed amongst photos from the people you’ve been following — there’s a balance. Too many ads and you’d stop using Instagram. Machine learning.

And the naked picture from HotPix2928133Q in your discover tab? You didn’t see it because it was filtered out. Machine learning.

The traffic is bad. Car, car, car, car, truck, car, bus, car. Nose to tail. The Black Prius crawls ahead. The cameras in the front grill stop it from getting too close to the car in front. Machine learning.

You’re not even at work yet.

Tomorrow is your day off. What’s the weather like? Open the weather app. Sunny with a chance of rain, 18% chance. Where’d the chance come from? Machine learning.

Breakfast: a bagel with bacon and avocado. You’ve been trying to cut the carbs but Bagel Boys do it so well. So well. The bagels are a dream. They’re $7 but you’d pay $10.

You can't see them but they're there. The drones monitoring the wheat crops. They look for crop stress. Too much stress is never good. Too much stress leads to death. The drones help the farmer. They help the farmer so you can taste the Bagel Boys glory. Machine learning.

The address reader knows what characters look like. It reads your address on the package and sends it down the chute. Machine learning.

All the information is encoded in the barcode but not everywhere has the same barcode system. The driver picks up a collection of parcels. Your book is one of them. Your house is close to the depot so it's delivered before 10.

It's a good book. A best seller. You want to tell Ankit it arrived and tell him he should read it too. Your phone’s screen brightness changes as you take a photo. The photo looks so crisp. You’re a real photographer now. The book is in focus and the background is blurred. All with the tap of a button. Magic? No. Machine learning.

Machine Learning is Everywhere.

Future book cover? haha

Source: https://qr.ae/TUf6DM

"Sometimes you have to give up exercise" — Ask a Machine Learning Engineer Anything | February 2019

Sometimes you need to give up exercise.

And everything else.

And focus on the work.

I was hosting a live question and answer session on my YouTube channel.

Ask a machine learning engineer anything.

Someone asked how to focus. Before I could answer, people started offering some great advice in the chat.

Except for the one about sacrificing exercise to work more. It was from a good place, but I disagreed.

There’s nothing I could work on that is worth sacrificing my health or relationships.

Plus, the work won’t matter if you don’t have your health.

Health is the force multiplier of life.

As much as I love data science, health has my heart.

I answered more ML related questions on the stream.

“How do I get a job in ML?”

“How much math is required?”

“What courses are the best?”

You can watch the full video on YouTube.

Or listen to the audio version. If the player below doesn’t work, it’s available on Anchor too.

And if your question didn’t get answered, feel free to ask anytime.

Source: https://www.linkedin.com/feed/update/urn:l...

"How do you stay motivated whilst studying?" — Ask a Machine Learning Engineer Anything

Every month, I host a livestream on my channel where I answer some of the most common questions I get, plus as many of the live questions as I can.

"How can I get a job in machine learning?”

“Where’s the best place to learn machine learning?”

“How do you manage your time?”

“How do you stay fit whilst studying?”

“What do you think of Coursera, EdX, Udacity and Udemy?”

“Should I go to university to study data science?”


Can a biology student get into machine learning?

Our class went on an excursion. We played with different kinds of food compounds which could shape themselves around the outside of a balloon. And then we were taught about these tools which could output very small drops.

‘What are these called?’ I asked.

‘Pipettes.’

We got back to school. The teacher turned and asked what I thought of the trip.

‘I liked the tour but it was very focused on science.’

‘That’s what it was all about.’

She was right. We went to a science institute.

The same teacher asked me to be captain of debating. It was tradition to get up and talk in front of the school. I got up and gave a talk. Everyone clapped but my speech wasn’t as good as I wanted it to be.

I was set out to do law. I’d see lawyers on the TV. All it looked like was a form of debating where everyone wears suits and says ‘objection!’ followed by something smart.

I thought, ‘I could do that.’

A few episodes of Law & Order and everyone becomes a lawyer.

We got our grades, I got 7/25, lower was better. Not as good as I hoped but I expected it. Most of my senior year was devoted to running our Call of Duty team. We were number one in Australia.

The letters came, it was time to choose what to study at university. I read the headings in bold and left the rest to read later. I was set out to do law.

We were on the waterfront riding scooters. There was a girl there I knew from primary school. I had a crush on her in grade four. For Easter, my mum gave me two chocolates to take in, a big one and a small one. The big one was for my teacher, Mrs Thompson. When I got to school I gave the big one to the girl. But she still liked Tony Black.

She was smart. That’s why I liked her.

‘What are you studying?’ I asked.

‘Biomed.’

‘What’s that?’

‘Biomedical science, it’s what you study before getting into medicine.’

‘Oh, that’s what I’m doing.’

I wasn’t. I hadn’t filled out the form. I was set out to do law.

I got home and checked the study guide. Biomedical science required a score of 11/25. I was eligible. I put it down as my number one preference. Same as the girl.

The email came a few weeks later. I got into my number one preference. A Bachelor of Science majoring in Biomedical Science.

We went to orientation day together. I spent $450 on textbooks. I used my mum's card. There was a biology one with 1200 pages. It had a red spine and a black cover. The latest edition.

Our timetables were the same. 30-something contact hours per week. I lived 45 minutes from university by car. 90 minutes by train and bus. The first lecture of the week was at 8 am on Monday. BIOL1020. Why someone chose this time for a lecture still confuses me.

The lecturer started.

‘30% of you will fail this course.’

‘That won’t be me.’

It was me.

My report card in high school went something like this.

  • Maths - B

  • Extension Maths - C

  • Physics - B

  • Religion - A+ (most of religion was storytelling, debating helped with this)

  • English - B

  • Geography - B

  • Sports - A

Not a single biology course. I was set out for law.

I took the same course the next year. I passed. It took me a year to get some foundations in biology. By then the girl was already through to second year. She was smart. That’s why I liked her.

Being a doctor sounded cool.

‘I’m going to be a doctor,’ I told people at parties.

But by the end of my second year, my grades were still poor.

The Dean of Science emailed me. Not him. One of his secretaries. But it said I had to go and see him. My grades were bad. The email was the warning. Improve or we’ll kick you out.

I met with the Dean. He told me I could change courses if I wanted to. I changed to food science and nutrition. Still within the health world but less biology. I wasn’t set out for law.

My grades improved and I graduated three years later. Five years to do a three-year degree.

People asked when I finished.

‘What are you going to do with your nutrition degree?’

‘Stay healthy.’

I thought it was a good plan.

I was working at Apple. They paid for language courses. I signed up for Japanese and Chinese. Japanese twice a week. Chinese once a week.

My study routine was solid. The main skill I learned at university was learning how to learn.

I was getting pretty good. When Chinese customers came in, I’d ask them if they had a backup of their iPhone in Chinese.

‘Nĭ yŏu méiyŏu beifan?’

They loved it.

I passed the level 2 Japanese exam the night before flying to Japan. Being solo for a month meant plenty of walking. Plenty of listening to podcasts. Most of them were about technology or health. Two things I’m interested in. And all the ones about technology kept mentioning machine learning.

On the trains between cities, I’d read articles online.

I went to Google.

‘What is machine learning?’

‘How to learn machine learning?’

I quit Apple two months after getting back from Japan. Travelling gave me a new perspective. Cliche but true.

My friend quit too. We worked on an internet startup for a couple of months. AnyGym, the Airbnb of fitness facilities. It failed. Partly due to lack of meaning, partly due to the business model of gyms depending on people not showing up. We wanted to do the opposite.

Whilst we were building the website, the internet was exploding with machine learning.

I did more research. The same Google searches.

‘What is machine learning?’

‘How to learn machine learning?’

Udacity’s Deep Learning Nanodegree came up. The trailer videos looked epic and the colours of the website were easy on the eye. I read everything on the page and didn’t understand most of it. I got to the bottom and saw the sign-up price, thought about it, scrolled back to the top and then back to the bottom. I closed my laptop.

The prerequisites contained some words I’d never heard of.

Python programming, statistics and probability, linear algebra.

More research. Google again.

‘How to learn Python?’

‘What is linear algebra?’

I had some savings from Apple but they were supposed to last a while. Signing up for the Nanodegree would take a big chunk out.

I signed up. Class started in 3 weeks.

Back to the internet. It was time to learn Python.

‘How hard could it be?’ I thought.

Treehouse’s Python course looked good. I enrolled. I went through it fast. 3-4 hours every day.

Emails came through for the Deep Learning Nanodegree. There was a Slack channel for introductions. I joined it and started reading.

‘Hey everyone, I’m Sanjay, I’m a software engineer at Google.’

‘Hello, I’m Yvette, I live in San Francisco and am a data scientist at Intuit.’

I kept reading. More of the same.

Mine went something like this.

‘Nice to meet you all! I’m Daniel, I started learning programming 3 weeks ago.’

After seeing the experience level of others, I emailed Udacity support asking what the refund policy was. ‘Two weeks,’ they said. I didn’t reply.

Four months later, I graduated from the Deep Learning Foundations Nanodegree. It was hard. All my assignments were either a couple of days late or right on time. I was learning Python and math I needed as I needed it.

I wanted to keep building upon the knowledge I’d gained. So I explored the internet for more courses like the Deep Learning Nanodegree. I found a few, Andrew Ng’s deeplearning.ai, the Udacity AI Nanodegree, fast.ai and put them together.

My self-created AI Masters Degree was born. I named it that because it’s easier than saying, ‘I’m stringing together a bunch of courses.’ Plus, people kind of understand what a Masters Degree is.

8 months into it I got a message from Ashlee on LinkedIn.

‘Hey Dan, what you’re posting is great, would you like to meet Mike?’

I met Mike.

‘If you’re into technology and health, you should meet Cam.’

I met Cam. I told him I was into technology and health and what I had been studying.

‘Would you like to come in on Thursday to see what it’s like?’

I went in on Thursday.

It was a good day. The team were exploring some data with Pandas.

‘Should I come back next Thursday?’ I asked.

‘Definitely.’

A couple of Thursdays later I sat down with the CEO and lead Machine Learning Engineer. They offered me a role. I accepted.

One of our biggest projects is in healthcare. Immunotherapy Outcome Prediction (IOP). The goal is to use genome data to better predict who is most likely to respond to immunotherapy. Right now it’s effective in about 42% of people. But the hard part is figuring out which 42%.

To help with the project we hired a biologist and a neuroscientist and a few others.

Before joining, they hadn’t done much machine learning at all. But thanks to the resources available online and a genuine curiosity to learn more, they’ve produced some world-class work.

We had a phone call with the head of Google’s Genomics team the other day.

‘I’m really impressed by your work.’

They’ve done an amazing job. But compliments should always be accepted with a grain of salt and a smile. Results on paper and results in the real world are two different things.

The team know that.

Can a biology student get into AI and machine learning?

I’m not a good example because I failed biology. Almost twice.

But I sit across from two who have done it.

The formula?

You’ve already got it. The same one which led you to learn more about biology. Be curious and have the courage to be wrong.

Biology textbooks get rewritten every 5 years or so, right?

Back to day one BIOL1020. The lecturer had another saying.

‘What you learn this year will probably be wrong in 5 years.’

It’s the same in machine learning. Except the math. Math sticks around.

Photo from Learning Intelligence 37 — Learning Data Science with my Brother. You can see my biology textbook gathering dust in the background.

Source: https://qr.ae/TUvTBk

A Gentle (and visual) Introduction to Exploratory Data Analysis

Pink singlet, dyed red hair, plaited grey beard, no shoes, John Lennon glasses. What a character. Imagine the stories he’d have. He parked his moped and walked into the cafe.

This cafe is a local favourite. But the chairs aren’t very comfortable. So I’ll keep this short (spoiler: by short, I mean short compared to the amount of time you’ll actually spend doing EDA).

When I first started as a Machine Learning Engineer at Max Kelsen, I’d never heard of EDA. There were a bunch of acronyms I’d never heard of.

I later learned EDA stands for exploratory data analysis.

It’s what you do when you first encounter a dataset. But it’s not a one-off process. It’s a continual process.

The past few weeks I’ve been working on a machine learning project. Everything was going well. I had a model trained on a small amount of the data. The results were pretty good.

It was time to step it up and add more data. So I did. Then it broke.

I filled up the memory on the cloud computer I was working on. I tried again. Same issue.

There was a memory leak somewhere. I missed something. What changed?

More data.

Maybe the next sample of data I pulled in had something different to the first. It did. There was an outlier. One sample had 68 times the mean number of purchases (the mean was 100).

Back to my code. It wasn’t robust to outliers. It took the outlier’s value, applied it to the rest of the samples and padded them with zeros.

Instead of having 10 million samples with a length of 100, they all had a length of 6800. And most of that data was zeroes.

I changed the code. Reran the model and training began. The memory leak was patched.
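A fix might look something like this (a sketch with hypothetical names, not the project’s actual code): cap the padded length at a high percentile of the sequence lengths instead of the longest sample.

import numpy as np

def pad_purchases(sequences, cap_percentile=99):
    # pad to the 99th percentile of lengths, not the outlier's length
    lengths = [len(s) for s in sequences]
    max_len = int(np.percentile(lengths, cap_percentile))
    padded = np.zeros((len(sequences), max_len))
    for i, seq in enumerate(sequences):
        padded[i, :min(len(seq), max_len)] = seq[:max_len]  # truncate the outliers
    return padded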

Pause.

The guy with the pink singlet came over. He tells me his name is Johnny.

He continues.

‘The girls got up me for not saying hello.’

‘You can’t win,’ I said.

‘Too right,’ he said.

We laughed. The girls here are really nice. The regulars get teased. Johnny is a regular. He told me he has his own farm at home. And his toenails were painted pink and yellow, alternating, pink, yellow, pink, yellow.

Johnny left.

Back to it.

What happened? Why the break in the EDA story?

Apart from introducing you to the legend of Johnny, I wanted to give an example of how you can think the road ahead is clear but really, there’s a detour.

EDA is one big detour. There’s no real structured way to do it. It’s an iterative process.


Why do EDA?

When I started learning machine learning and data science, much of it (all of it) was through online courses. I used them to create my own AI Masters Degree. All of them provided an excellent curriculum along with excellent datasets.

The datasets were excellent because they were ready to be used with machine learning algorithms right out of the box.

You’d download the data, choose your algorithm, call the .fit() function, pass it the data and all of a sudden the loss value would start going down and you’d be left with an accuracy metric. Magic.
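If you haven’t seen that pattern before, it goes something like this (a sketch using sklearn’s built-in iris dataset, not one of the course datasets):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # data ready to go, right out of the box
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier()
model.fit(X_train, y_train)  # the magic line
print(model.score(X_test, y_test))  # and you're left with an accuracy metric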

This was how the majority of my learning went. Then I got a job as a machine learning engineer. I thought, finally, I can apply what I’ve been learning to real-world problems.

Roadblock.

The client sent us the data. I looked at it. WTF was this?

Words, time stamps, more words, rows with missing data, columns, lots of columns. Where were the numbers?

‘How do I deal with this data?’ I asked Athon.

‘You’ll have to do some feature engineering and encode the categorical variables,’ he said, ‘I’ll Slack you a link.’

I went to my digital mentor. Google. ‘What is feature engineering?’

Google again. ‘What are categorical variables?’

Athon sent the link. I opened it.

There it was. The next bridge I had to cross. EDA.

You do exploratory data analysis to learn more about the data before you ever run a machine learning model.

You create your own mental model of the data so when you run a machine learning model to make predictions, you’ll be able to recognise whether they’re BS or not.

Rather than answer all your questions about EDA, I designed this post to spark your curiosity. To get you to think about questions you can ask of a dataset.


Where do you start?

How do you explore a mountain range?

Do you walk straight to the top?

How about along the base and try and find the best path?

It depends on what you’re trying to achieve. If you want to get to the top, it’s probably good to start climbing sometime soon. But it’s also probably good to spend some time looking for the best route.

Exploring data is the same. What questions are you trying to solve? Or better, what assumptions are you trying to prove wrong?

You could spend all day debating these. But best to start with something simple, prove it wrong and add complexity as required.

Example time.


Making your first Kaggle submission

You’ve been learning data science and machine learning online. You’ve heard of Kaggle. You’ve read the articles saying how valuable it is to practice your skills on their problems.

Roadblock.

Despite all the good things you’ve heard about Kaggle, you haven’t made a submission yet.

That was me. Until I put my newly acquired EDA skills to work.

You decide it’s time to enter a competition of your own.

You’re on the Kaggle website. You go to the ‘Start Here’ section. There’s a dataset containing information about passengers on the Titanic. You download it and load up a Jupyter Notebook.

What do you do?

What question are you trying to solve?

‘Can I predict survival rates of passengers on the Titanic, based on data from other passengers?’

This seems like a good guiding light.


An EDA checklist

Every morning, I consult with my personal assistant on what I have to do for the day. My personal assistant doesn’t talk much. Because my personal assistant is a notepad. I write down a checklist.

If a checklist is good enough for pilots to use every flight, it’s good enough for data scientists to use with every dataset.

My morning lists are non-exhaustive; other things come up during the day which have to be done. But having it creates a little order in the chaos. It’s the same with the EDA checklist below.

1. What question(s) are you trying to solve (or prove wrong)?
2. What kind of data do you have and how do you treat different types?
3. What’s missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?

We’ll go through each of these.

What would you add to the list?


What question(s) are you trying to solve?

I put an (s) in the subtitle. Ignore it. Start with one. Don’t worry, more will come along as you go.

For our Titanic dataset example it’s:

Can we predict survivors on the Titanic based on data from other passengers?

Too many questions will clutter your thought space. Humans aren’t good at computing multiple things at once. We’ll leave that to the machines.

Sometimes a model isn’t required to make a prediction.

Before we go further, if you’re reading this on a computer, I encourage you to open this Jupyter Notebook and try to connect the dots with topics in this post. If you’re reading on a phone, don’t fear, the notebook isn’t going away. I’ve written this article in a way you shouldn’t need the notebook, but if you’re like me, you learn best seeing things in practice.



What kind of data do you have and how do you treat different types?

You’ve imported the Titanic training dataset.

Let’s check it out.

training.head()
.head() shows the top five rows of a dataframe. The rows you’re seeing are from the Kaggle Titanic Training Dataset.

Column by column, there’s: numbers, numbers, numbers, words, words, numbers, numbers, numbers, letters and numbers, numbers, letters and numbers and NaNs, letters. Similar to Johnny’s toenails.

Let’s separate the features out into three boxes, numerical, categorical and not sure.

Columns of different information are often referred to as features. When you hear a data scientist talk about different features, they’re probably talking about different columns in a dataframe.

In the numerical bucket we have, PassengerId, Survived, Pclass, Age, SibSp, Parch and Fare.

The categorical bucket contains Sex and Embarked.

And in not sure we have Name, Ticket and Cabin.
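If you want a quick first cut of these buckets in code, pandas can do it from the column types alone (a sketch; note the dtypes lump the not sure columns in with the categorical ones, which is why you still need judgement):

numerical = training.select_dtypes(include='number').columns.tolist()
categorical = training.select_dtypes(include='object').columns.tolist()
print(numerical)    # ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
print(categorical)  # ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']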

Now we’ve broken the columns down into separate buckets, let’s examine each one.

The Numerical Bucket

Remember our question?

‘Can we predict survivors on the Titanic based on data from other passengers?’

From this, can you figure out which column we’re trying to predict?


We’re trying to predict the green column using data from the other columns.

The Survived column. And because it’s the column we’re trying to predict, we’ll take it out of the numerical bucket and leave it for the time being.

What’s left?

PassengerId, Pclass, Age, SibSp, Parch and Fare.

Think for a second. If you were trying to predict whether someone survived on the Titanic, do you think their unique PassengerId would really help with your cause?

Probably not. So we’ll leave this column to the side for now too. EDA doesn’t always have to be done with code, you can use your model of the world to begin with and use code to see if it’s right later.

How about Pclass, SibSp and Parch?

These are numbers but there’s something different about them. Can you pick it up?

What do Pclass, SibSp and Parch even mean? Maybe we should’ve read the docs more before trying to build a model so quickly.

Google. ‘Kaggle Titanic Dataset’.

Found it.

Pclass is the ticket class, 1 = 1st class, 2 = 2nd class and 3 = 3rd class. SibSp is the number of siblings or spouses a passenger had on board. And Parch is the number of parents or children a passenger had on board.

This information was pretty easy to find. But what if you had a dataset you’d never seen before? What if a real estate agent wanted help predicting house prices in their city? You check out their data and find a bunch of columns which you don’t understand.

You email the client.

‘What does Tnum mean?’

They respond. ‘Tnum is the number of toilets in a property.’

Good to know.

When you’re dealing with a new dataset, you won’t always have information available about it like Kaggle provides. This is where you’ll want to seek the knowledge of an SME.

Another acronym. Great.

SME stands for subject matter expert. If you’re working on a project dealing with real estate data, part of your EDA might involve talking with and asking questions of a real estate agent. Not only could this save you time, but it could also influence future questions you ask of the data.

Since no one from the Titanic is alive anymore (RIP (rest in peace) Millvina Dean, the last survivor), we’ll have to become our own SMEs.

There’s something else unique about Pclass, SibSp and Parch. Even though they’re all numbers, they’re also categories.

How so?

Think about it like this. If you can group data together in your head fairly easily, there’s a chance it’s part of a category.

The Pclass column could be labelled, First, Second and Third and it would maintain the same meaning as 1, 2 and 3.

Remember how machine learning algorithms love numbers? Since Pclass, SibSp and Parch are already all in numerical form, we’ll leave them how they are. The same goes for Age.

Phew. That wasn’t too hard.


The Categorical Bucket

In our categorical bucket, we have Sex and Embarked.

These are categorical variables because you can separate passengers who were female from those who were male. Or those who embarked at C from those who embarked at S.

To train a machine learning model, we’ll need a way of converting these to numbers.

How would you do it?

Remember Pclass? 1st = 1, 2nd = 2, 3rd = 3.

How would you do this for Sex and Embarked?

Perhaps you could do something similar for Sex. Female = 1 and male = 2.

As for Embarked, S = 1 and C = 2.

We can change these using the LabelEncoder() class from the sklearn library.

from sklearn.preprocessing import LabelEncoder
LabelEncoder().fit_transform(training.Embarked.astype(str))  # astype(str) keeps missing values from breaking the encoder

Wait? Why does C = 0 and S = 2 now? Where’s 1? Hint: There’s an extra category, Q, this takes the number 1. See the data description page on Kaggle for more.

We’ve made some good progress towards turning our categorical data into all numbers but what about the rest of the columns?

Challenge: Now you know Pclass could easily be a categorical variable, how would you turn Age into a categorical variable?


The Not Sure Bucket

Name, Ticket and Cabin are left.

If you were on the Titanic, do you think your name would’ve influenced your chance of survival?

It’s unlikely. But what other information could you extract from someone's name?

What if you gave each person a number depending on whether their title was Mr., Mrs. or Miss.?

You could create another column called Title. In this column, those with Mr. = 1, Mrs. = 2 and Miss. = 3.

What you've done is create a new feature out of an existing feature. This is called feature engineering.
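As a sketch of how you might build it (the regular expression and mapping are my own assumptions about how the names are formatted):

# Pull the title out of names like 'Braund, Mr. Owen Harris'
training["Title"] = training["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Map the three titles to numbers; anything else (Dr., Rev., ...) becomes 0
training["Title"] = training["Title"].map({"Mr": 1, "Mrs": 2, "Miss": 3}).fillna(0)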

Converting titles to numbers is a relatively simple feature to create. And depending on the data you have, feature engineering can get as extravagant as you like.

How does this new feature affect the model down the line? This will be something you’ll have to investigate.

For now, we won’t worry about the Name column to make a prediction.

What about Ticket?

The first entries of the Ticket column.

The first few examples don’t look very consistent at all. What else is there?

training.Ticket.head(15)

The first 15 entries of the Ticket column.

These aren’t very consistent either. But think again. Do you think the ticket number would provide much insight as to whether someone survived?

Maybe if the ticket number related to what class the person was riding in, it would have an effect but we already have that information in Pclass.

To save time, we’ll forget the Ticket column for now.

Your first pass of EDA on a dataset should have the goal of not only raising more questions about the data, but also of getting a model built using the least amount of information possible, so you've got a baseline to work from.

Now, what do we do with Cabin?

You know, since I’ve already seen the data, my spidey-senses are telling me it’s a perfect example for the next section.

Challenge: I've only listed a couple of examples of numerical and categorical data here. Are there any other types of data? How do they differ from these?


What’s missing from the data and how do you deal with it?

import missingno

missingno.matrix(train, figsize=(30, 10))

The missingno library is a quick way to visually check for holes in your data: it detects where NaN values (missing values) appear and highlights them. White lines indicate missing values.

The Cabin column looks like Johnny’s shoes. Not there. There are a fair few missing values in Age too.

How do you predict something when there’s no data?

I don’t know either.

So what are our options when dealing with missing data?

The quickest and easiest way would be to remove every row with missing values. Or remove the Cabin and Age columns entirely.
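Both of those drastic options are pandas one-liners (a sketch, assuming the same train DataFrame):

# Option 1: drop every row containing at least one missing value
train_without_missing_rows = train.dropna()

# Option 2: drop the holey columns entirely
train_without_holey_columns = train.drop(columns=["Cabin", "Age"])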

But there’s a problem here. Machine learning models like more data. Removing large amounts of data will likely decrease the ability of our model to predict whether a passenger survived or not.

What’s next?

Imputing values. In other words, filling up the missing data with values calculated from other data.

How would you do this for the Age column?

When we called .head() the Age column had no missing values. But when we look at the whole column, there are plenty of holes.

Could you fill the missing values with the average age?
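A sketch of what that might look like, filling with the mean of the ages we do have:

# Fill every missing age with the average of the non-missing ages
train["Age"] = train["Age"].fillna(train["Age"].mean())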

There are drawbacks to this kind of value filling. Imagine you had 1000 total rows, 500 of which are missing values. You decide to fill the 500 missing rows with the average age of 36.

What happens?

Your data becomes heavily stacked with the age of 36. How would that influence predictions on 36-year-olds? Or any other age?

Maybe for every person with a missing age value, you could find other similar people in the dataset and use their age. But this is time-consuming and also has drawbacks.

There are far more advanced methods for filling missing data, which are out of scope for this post. It should be noted: there is no perfect way to fill missing values.

If the missing values in the Age column are a leaky drain pipe, the Cabin column is a cracked dam. Beyond saving. For your first model, Cabin is a feature you'd leave out.

Challenge: The Embarked column has a couple of missing values. How would you deal with these? Is the amount low enough to remove them?
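A hint in code (one sketch among many): if you'd rather fill the gaps than remove them, the most common value is a reasonable default.

# Fill the couple of missing Embarked values with the most common port
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])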


Where are the outliers and why should you pay attention to them?

‘Did you check the distribution?’ Athon asked.

‘I did with the first set of data but not the second set…’ It hit me.

There it was. The rest of the data was being shaped to match the outlier.

If you look at the number of occurrences of unique values within a dataset, one of the most common patterns you’ll find is Zipf’s law. It looks like this.

Zipf's law: The highest occurring variable will have double the number of occurrences of the second highest occurring variable, triple the amount of the third and so on.

Remembering Zipf's law can help when thinking about outliers (values towards the end of the tail don't occur often and are potential outliers).

The definition of an outlier will be different for every dataset. As a general rule of thumb, you might consider anything more than three standard deviations away from the mean an outlier.
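Here's what that rule of thumb might look like in code (a sketch applied to the Age column):

# Flag ages sitting more than three standard deviations from the mean
mean, std = train["Age"].mean(), train["Age"].std()
age_outliers = train[(train["Age"] - mean).abs() > 3 * std]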

Or from another perspective.

Outliers from the perspective of an (x, y) plot.

How do you find outliers?

Distribution. Distribution. Distribution. Distribution. Four times is enough (I’m trying to remind myself here).

During your first pass of EDA, you should be checking what the distribution of each of your features is.

A distribution plot helps represent the spread of values across your data. And more importantly, it helps identify potential outliers.

train.Age.plot.hist()

Histogram plot of the Age column in the training dataset. Are there any outliers here? Would you remove any age values or keep them all?
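To eyeball every numerical column at once, pandas can draw all the histograms in a single call (a quick sketch):

# Histograms for every numerical column in the DataFrame
train.hist(figsize=(12, 8))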

Why should you care about outliers?

Keeping outliers in your dataset may result in your model overfitting (fitting the training data too closely). Removing all the outliers may result in your model being too generalised (it doesn't do well on anything out of the ordinary). As always, it's best to experiment iteratively to find the best way to deal with outliers.

Challenge: Other than figuring out outliers with the general rule of thumb above, are there any other ways you could identify outliers? If you’re confused about a certain data point, is there someone you could talk to? Hint: the acronym contains the letters M E S.


Getting more out of your data with feature engineering

The Titanic dataset only has 10 features. But what if your dataset has hundreds? Or thousands? Or more? This isn’t uncommon.

During your exploratory data analysis process, once you’ve started to form an understanding AND you’ve got an idea of the distributions AND you’ve found some outliers AND you’ve dealt with them, the next biggest chunk of your time will be spent on feature engineering.

Feature engineering can be broken down into three categories: adding, removing and changing.

The Titanic dataset started out in pretty good shape. So far, we’ve only had to change a few features to be numerical in nature.

However, data in the wild is different.

Say you’re working on a problem trying to predict the changes in banana stock requirements of a large supermarket chain across the year.

Your dataset contains a historical record of stock levels and previous purchase orders. You're able to model these well but you find there are a few times throughout the year where stock levels change irrationally. Through your research, you find during a yearly country-wide celebration, banana week, the stock levels of bananas plummet. This makes sense. To keep up with the festivities, people buy more bananas.

To compensate for banana week and help the model learn when it occurs, you might add a column to your dataset indicating banana week or not banana week.

import numpy as np

# We know Week 2 is a banana week, so we can flag it using np.where()
df["Banana Week"] = np.where(df["Week Number"] == 2, 1, 0)

A simple example of adding a binary feature to dictate whether a week was banana week or not.

Adding a feature like this might not be so simple. You could find adding the feature does nothing at all since the information you've added is already hidden within the data. As in, the purchase orders for the past few years during banana week are already higher than in other weeks.

What about removing features?

We’ve done this as well with the Titanic dataset. We dropped the Cabin column because it was missing so many values before we even ran a model.

But what about if you’ve already run a model using the features left over?

This is where feature contribution comes in. Feature contribution is a way of figuring out how much each feature influences the model.
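One common way to measure it (a sketch, not necessarily how the graph below was made) is a tree-based model's feature_importances_, assuming X_train and y_train are your encoded features and labels:

from sklearn.ensemble import RandomForestClassifier

features = ["Sex", "Pclass", "Parch", "Fare", "Embarked", "SibSp"]
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train[features], y_train)

# Line up each feature with its importance score, biggest contributor first
for name, score in sorted(zip(features, model.feature_importances_), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")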

An example of a feature contribution graph using Sex, Pclass, Parch, Fare, Embarked and SibSp features to predict who would survive on the Titanic. If you've seen the movie, why does this graph make sense? If you haven't, think about it anyway. Hint: 'Save the women and children!'

Why is this information helpful?

Knowing how much a feature contributes to a model can give you direction as to where to go next with your feature engineering.

In our Titanic example, we can see the contributions of Sex and Pclass were the highest. Why do you think this is?

What if you had more than 10 features? How about 100? You could do the same thing. Make a graph showing the feature contributions of 100 different features. ‘Oh, I’ve seen this before!’

Zipf’s law back at it again. The top features have far more to contribute than the bottom features.

Zipf's law at play with different features and their contribution to a model.

Seeing this, you might decide to cut the lesser contributing features and improve the ones contributing more.

Why would you do this?

Removing features reduces the dimensionality of your data. It means your model has fewer connections to make to figure out the best way of fitting the data.

You might find removing features means your model can get the same (or better) results with less data and in less time.

Just as Johnny is a regular at the cafe I'm at, feature engineering is a regular part of every data science project.

Challenge: What are other methods of feature engineering? Can you combine two features? What are the benefits of this?
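A hint for the challenge (a sketch using two columns we already know):

# A classic combined feature: how big was the family each passenger travelled with?
training["FamilySize"] = training["SibSp"] + training["Parch"] + 1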


Building your first model(s)

Finally. We’ve been through a bunch of steps to get our data ready to run some models.

If you're like me, when you started learning data science, this is the part you learned first. All the stuff above had already been done by someone else. All you had to do was fit a model to it.

Our Titanic dataset is small. So we can afford to run a multitude of models on it to figure out which is the best to use.

Notice how I put an (s) in the subtitle? That's deliberate. Don't settle on a single model before trying a few.
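As a sketch of what trying a multitude of models might look like (the model choices are illustrative, and X and y are assumed to be your prepared features and labels):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}

# Compare each model's 5-fold cross-validation accuracy
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")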

Cross-validation accuracy scores from a number of different models I tried using to predict whether a passenger would survive or not.

But once you’ve had some practice with different datasets, you’ll start to figure out what kind of model usually works best. For example, most recent Kaggle competitions have been won with ensembles (combinations) of different gradient boosted tree algorithms.

Once you've built a few models and figured out which is best, you can start to optimise the best one through hyperparameter tuning. Think of hyperparameter tuning as adjusting the dials on your oven when cooking your favourite dish. Out of the box, the preset setting on the oven works pretty well, but from experience you've found lowering the temperature and increasing the fan speed brings tastier results.

It’s the same with machine learning algorithms. Many of them work great out of the box. But with a little tweaking of their parameters, they work even better.
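A sketch of that tweaking with scikit-learn's GridSearchCV (the parameter grid here is illustrative, and X and y are assumed to be prepared as before):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Try every combination in the grid and keep the best one
param_grid = {"n_estimators": [100, 300], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)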

But no matter what, even the best machine learning algorithm won’t result in a great model without adequate data preparation.

Exploratory data analysis and model building form a repeating circle.

The EDA circle of life.

A final challenge (and some extra-curriculum)

I left the cafe. My ass was sore.

At the start of this article, I said I’d keep it short. You know how that turned out. It will be the same as your EDA iterations. When you think you’re done. There’s more.

We covered a non-exhaustive EDA checklist with the Titanic Kaggle dataset as an example.

1. What question are you trying to solve (or prove wrong)?

Start with the simplest hypothesis possible. Add complexity as needed.

2. What kind of data do you have?

Is your data numerical, categorical or something else? How do you deal with each kind?

3. What's missing from the data and how do you deal with it?

Why is the data missing? Missing data can be a sign in itself. You’ll never be able to replace it with anything as good as the original but you can try.

4. Where are the outliers and why should you pay attention to them?

Distribution. Distribution. Distribution. Three times is enough for the summary. Where are the outliers in your data? Do you need them or are they damaging your model?

5. How can you add, change or remove features to get more out of your data?

The default rule of thumb is more data = good. And following this works well quite often. But is there anything you can remove and still get the same results? Less but better? Start simple.

Data science isn’t always about getting answers out of data. It’s about using data to figure out what assumptions of yours were wrong. The most valuable skill a data scientist can cultivate is a willingness to be wrong.

There are examples of everything we’ve discussed here (and more) in the notebook on GitHub and a video of me going through the notebook step by step on YouTube (the coding starts at 5:05).

FINAL BOSS CHALLENGE: If you’ve never entered a Kaggle competition before, and want to practice EDA, now’s your chance. Take the notebook I’ve created, rewrite it from top to bottom and improve on my result. If you do, let me know and I’ll share your work on my LinkedIn. Get after it.

Extra-curriculum bonus: Daniel Formoso's notebook is one of the best resources you’ll find for an extensive look at EDA on a Census Income Dataset. After you’ve completed the Titanic EDA, this is a great next step to check out.

If you’ve got something on your mind you think this article is missing, leave a response below or send me a note and I’ll be happy to get back to you.

Source: https://towardsdatascience.com/a-gentle-in...

Work in progress

I’m working on a longer form article. An introduction to exploratory data analysis to go along with the Code with Me video I did exploring the Kaggle Titanic dataset and the notebook code to go with it.

I've spent the past two days writing and refining it.

I wanted to get it published today but it’s getting late and you know my thoughts on sleep. I work better when I sleep well.

In the past, I'd have trouble walking away from something unless it was done. But I've learned, especially with writing (and code), it pays to walk away, think about nothing for a while and then come back at it with a different pair of eyes.

The next time you look at it, you’ll see things you missed before. That’s what I’ll be doing tomorrow morning.

If you want to read it in the meantime, it’s in draft form on Medium. It needs some graphics and a little tidying but if you do read it, what would you change?