Quality: the only universal criteria

The teacher would hand out the sheet with the boxes on it. Each box had words in it which were supposed to specify how you got a certain mark. The words were all the same with one or two changed in each box.

‘The student shows sound understanding of the topic.’ That was worth a C. Sound was the mid-tier. Not good. Sound.

‘The student shows great understanding of the topic.’ B.

‘The student shows exceptional understanding of the topic.’ This was the money. Enough of these boxes and you got an A.

I never got why exceptional was the word for an A. I thought it meant something like accepted. ‘Your work is accepted, here’s an A.’

When doing assignments I never paid attention to the criteria sheet. It was always overflowing with words. So many it lost its meaning.

All I wanted to know was what I had to do. What I had to hand in to not get in trouble so I could get back to gaming.

All my assignments looked great. I made sure of that. I had a thing for good looking documents. I’d finish a physics assignment and hand it in. A+ for aesthetics, B for content.

University was the same. More criteria sheets. More lack of reading. More reading the task sheet 6 times and asking myself, ‘What do I actually need to do?’

Then came creating online. No criteria sheets. Anything goes.

My first blog post was crap [TK — link]. Crap but honest. I tried to get my girlfriend to read it. She was good with words. Since then, I’ve probably had 6 great, 277 sound and 3 exceptional posts.

There are no criteria sheets on the internet. So it can be hard to start making anything. ‘What do I actually need/want to do?’ Notice the addition of want.

There may be plenty of things you want to do. Too many. So it’s unlikely you’re stuck with a lack of ideas. Instead, a lack of direction.

The cure?

The universal criteria.

You already know this one.

People like things which are of high quality.

Things that teach them something. Things that entertain them. Things which suit the story they repeatedly tell themselves every day. Things that work.

If you’re a maker and looking for a guide or some criteria to adhere to, make it quality.

Everything else is up for debate.

Elephant on the moon

Walking on the moon

moonwalking like Mike

trying to find the light

Have you seen it?

It’s usually so bright

The elephant stopped

Then kept going

Looking for the light

it was there before

Where’d it go?

Dust came up

Step, dust, step, dust

The elephant looked, the rabbit was there

The rabbit didn’t say anything

Step, dust, step, dust

Where was the fisherman?

The one from dreamworks

maybe he’d know,

how this dream works

The elephant stopped

No step, no dust

Remembered the fence

The old gate, gathering rust

All it took was a heave and a thrust

Now out in the wild

Now time to go

Now what?

Now on the moon

Step, dust, step, dust

On the moon,

the elephant grinned

Looked around, there it was

Back at the ground

There the whole time

The elephant remembered

On a floating rock,

through empty space

The place that can never be filled

The elephant remembered

In empty space but feeling full

The elephant remembered

There it was

There the whole time

All so bright

An elephant never forgets,

even on the moon.

A Gentle (and visual) Introduction to Exploratory Data Analysis

Pink singlet, dyed red hair, plaited grey beard, no shoes, John Lennon glasses. What a character. Imagine the stories he’d have. He parked his moped and walked into the cafe.

This cafe is a local favourite. But the chairs aren’t very comfortable. So I’ll keep this short (spoiler: by short, I mean short compared to the amount of time you’ll actually spend doing EDA).

When I first started as a Machine Learning Engineer at Max Kelsen, I’d never heard of EDA. There are a bunch of acronyms I’ve never heard of.

I later learned EDA stands for exploratory data analysis.

It’s what you do when you first encounter a data set. But it’s not a once off process. It’s a continual process.

The past few weeks I’ve been working on a machine learning project. Everything was going well. I had a model trained on a small amount of the data. The results were pretty good.

It was time to step it up and add more data. So I did. Then it broke.

I filled up the memory on the cloud computer I was working on. I tried again. Same issue.

There was a memory leak somewhere. I missed something. What changed?

More data.

Maybe the next sample of data I pulled in had something different to the first. It did. There was an outlier. One sample which had 68 times the amount of purchases as the mean (100).

Back to my code. It wasn’t robust to outliers. It took the outlier’s value and applied it to the rest of the samples, padding them with zeros.

Instead of having 10 million samples with a length of 100, they all had a length of 6800. And most of that data was zeroes.
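
To make the padding problem concrete, here’s a minimal sketch. It is not the project code: the helper `pad_to_length`, the made-up `histories` list and the cap of 100 are all assumptions for illustration. It compares padding every sample to the longest one with padding to a fixed cap.

```python
import numpy as np

def pad_to_length(sequences, length):
    """Right-pad with zeros (or truncate) so every sequence has the same length."""
    padded = np.zeros((len(sequences), length), dtype=np.int8)
    for row, seq in enumerate(sequences):
        trimmed = seq[:length]
        padded[row, :len(trimmed)] = trimmed
    return padded

# Hypothetical data: 999 samples with 100 purchases each, plus one outlier with 6800.
histories = [np.ones(100, dtype=np.int8) for _ in range(999)]
histories.append(np.ones(6800, dtype=np.int8))

longest = max(len(seq) for seq in histories)
padded_to_longest = pad_to_length(histories, longest)  # every row stretched to the outlier
padded_to_cap = pad_to_length(histories, 100)          # every row capped at 100

# Share of each array that is nothing but zero padding.
zero_share_longest = float((padded_to_longest == 0).mean())
zero_share_cap = float((padded_to_cap == 0).mean())

print(padded_to_longest.shape, round(zero_share_longest, 3))
print(padded_to_cap.shape, round(zero_share_cap, 3))
```

Capping throws away most of the outlier’s history, so it’s a trade-off rather than a free fix. Whether that’s acceptable depends on your problem.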

I changed the code. Reran the model and training began. The memory leak was patched.

Pause.

The guy with the pink singlet came over. He tells me his name is Johnny.

He continues.

‘The girls got up me for not saying hello.’

‘You can’t win,’ I said.

‘Too right,’ he said.

We laughed. The girls here are really nice. The regulars get teased. Johnny is a regular. He told me he has his own farm at home. And his toenails were painted pink and yellow, alternating, pink, yellow, pink, yellow.

Johnny left.

Back to it.

What happened? Why the break in the EDA story?

Apart from introducing you to the legend of Johnny, I wanted to give an example of how you can think the road ahead is clear but really, there’s a detour.

EDA is one big detour. There’s no real structured way to do it. It’s an iterative process.


Why do EDA?

When I started learning machine learning and data science, much of it (all of it) was through online courses. I used them to create my own AI Masters Degree. All of them provided excellent curriculum along with excellent datasets.

The datasets were excellent because they were ready to be used with machine learning algorithms right out of the box.

You’d download the data, choose your algorithm, call the .fit() function, pass it the data and all of a sudden the loss value would start going down and you’d be left with an accuracy metric. Magic.

This was how the majority of my learning went. Then I got a job as a machine learning engineer. I thought, finally, I can apply what I’ve been learning to real-world problems.

Roadblock.

The client sent us the data. I looked at it. WTF was this?

Words, time stamps, more words, rows with missing data, columns, lots of columns. Where were the numbers?

‘How do I deal with this data?’ I asked Athon.

‘You’ll have to do some feature engineering and encode the categorical variables,’ he said, ‘I’ll Slack you a link.’

I went to my digital mentor. Google. ‘What is feature engineering?’

Google again. ‘What are categorical variables?’

Athon sent the link. I opened it.

There it was. The next bridge I had to cross. EDA.

You do exploratory data analysis to learn more about the data before you ever run a machine learning model.

You create your own mental model of the data so when you run a machine learning model to make predictions, you’ll be able to recognise whether they’re BS or not.

Rather than answer all your questions about EDA, I designed this post to spark your curiosity. To get you to think about questions you can ask of a dataset.


Where do you start?

How do you explore a mountain range?

Do you walk straight to the top?

How about along the base and try and find the best path?

It depends on what you’re trying to achieve. If you want to get to the top, it’s probably good to start climbing sometime soon. But it’s also probably good to spend some time looking for the best route.

Exploring data is the same. What questions are you trying to solve? Or better, what assumptions are you trying to prove wrong?

You could spend all day debating these. But best to start with something simple, prove it wrong and add complexity as required.

Example time.


Making your first Kaggle submission

You’ve been learning data science and machine learning online. You’ve heard of Kaggle. You’ve read the articles saying how valuable it is to practice your skills on their problems.

Roadblock.

Despite all the good things you’ve heard about Kaggle. You haven’t made a submission yet.

That was me. Until I put my newly acquired EDA skills to work.

You decide it’s time to enter a competition of your own.

You’re on the Kaggle website. You go to the ‘Start Here’ section. There’s a dataset containing information about passengers on the Titanic. You download it and load up a Jupyter Notebook.

What do you do?

What question are you trying to solve?

‘Can I predict survival rates of passengers on the Titanic, based on data from other passengers?’

This seems like a good guiding light.


An EDA checklist

Every morning, I consult with my personal assistant on what I have to do for the day. My personal assistant doesn’t talk much. Because my personal assistant is a notepad. I write down a checklist.

If a checklist is good enough for pilots to use every flight, it’s good enough for data scientists to use with every dataset.

My morning lists are non-exhaustive. Other things come up during the day which have to be done. But having a list creates a little order in the chaos. It’s the same with the EDA checklist below.

An EDA checklist

1. What question(s) are you trying to solve (or prove wrong)?
2. What kind of data do you have and how do you treat different types?
3. What’s missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?

We’ll go through each of these.

What would you add to the list?


What question(s) are you trying to solve?

I put an (s) in the subtitle. Ignore it. Start with one. Don’t worry, more will come along as you go.

For our Titanic dataset example it’s:

Can we predict survivors on the Titanic based on data from other passengers?

Too many questions will clutter your thought space. Humans aren’t good at computing multiple things at once. We’ll leave that to the machines.

Sometimes a model isn’t required to make a prediction.

Before we go further, if you’re reading this on a computer, I encourage you to open this Jupyter Notebook and try to connect the dots with topics in this post. If you’re reading on a phone, don’t fear, the notebook isn’t going away. I’ve written this article in a way you shouldn’t need the notebook but if you’re like me, you learn best seeing things in practice.



What kind of data do you have and how do you treat different types?

You’ve imported the Titanic training dataset.

Let’s check it out.

training.head()
.head() shows the top five rows of a dataframe. The rows you’re seeing are from the Kaggle Titanic Training Dataset.

Column by column, there’s: numbers, numbers, numbers, words, words, numbers, numbers, numbers, letters and numbers, numbers, letters and numbers and NaNs, letters. Similar to Johnny’s toenails.

Let’s separate the features out into three boxes, numerical, categorical and not sure.

Columns of different information are often referred to as features. When you hear a data scientist talk about different features, they’re probably talking about different columns in a dataframe.

In the numerical bucket we have, PassengerId, Survived, Pclass, Age, SibSp, Parch and Fare.

The categorical bucket contains Sex and Embarked.

And in not sure we have Name, Ticket and Cabin.

Now we’ve broken the columns down into separate buckets, let’s examine each one.
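
If you’d rather let pandas do a first pass at the sorting, here’s a rough sketch. The four rows are made up by me in the shape of the Titanic columns (they are not the real data), and `select_dtypes` only knows about storage types, so Name, Ticket and Cabin land with the words rather than in a not sure bucket of their own.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the Kaggle Titanic training data.
training = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["Smith, Mr. John", "Brown, Mrs. Jane", "Jones, Miss. Alice", "Lee, Mr. Sam"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, np.nan],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Ticket": ["A1 100", "B2 200", "C3 300", "400"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "Q"],
})

# Columns stored as numbers versus everything else.
numerical = training.select_dtypes(include="number").columns.tolist()
not_numerical = training.select_dtypes(exclude="number").columns.tolist()

print(numerical)
print(not_numerical)
```

Deciding which of the non-numerical columns are truly categorical still takes a human look.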

The Numerical Bucket

numerical_bucket.png

Remember our question?

‘Can we predict survivors on the Titanic based on data from other passengers?’

From this, can you figure out which column we’re trying to predict?


We’re trying to predict the green column using data from the other columns.

The Survived column. And because it’s the column we’re trying to predict, we’ll take it out of the numerical bucket and leave it for the time being.

What’s left?

PassengerId, Pclass, Age, SibSp, Parch and Fare.

Think for a second. If you were trying to predict whether someone survived on the Titanic, do you think their unique PassengerId would really help with your cause?

Probably not. So we’ll leave this column to the side for now too. EDA doesn’t always have to be done with code, you can use your model of the world to begin with and use code to see if it’s right later.

How about Pclass, SibSp and Parch?

These are numbers but there’s something different about them. Can you pick it up?

What does Pclass, SibSp and Parch even mean? Maybe we should’ve read the docs more before trying to build a model so quickly.

Google. ‘Kaggle Titanic Dataset’.

Found it.

Pclass is the ticket class, 1 = 1st class, 2 = 2nd class and 3 = 3rd class. SibSp is the number of siblings and spouses a passenger had on board. And Parch is the number of parents and children someone had on board.

This information was pretty easy to find. But what if you had a dataset you’d never seen before. What if a real estate agent wanted help predicting house prices in their city. You check out their data and find a bunch of columns which you don’t understand.

You email the client.

‘What does Tnum mean?’

They respond. ‘Tnum is the number of toilets in a property.’

Good to know.

When you’re dealing with a new dataset, you won’t always have information available about it like Kaggle provides. This is where you’ll want to seek the knowledge of an SME.

Another acronym. Great.

SME stands for subject matter expert. If you’re working on a project dealing with real estate data, part of your EDA might involve talking with and asking questions of a real estate agent. Not only could this save you time, but it could also influence future questions you ask of the data.

Since no one from the Titanic is alive anymore (RIP (rest in peace) Millvina Dean, the last survivor), we’ll have to become our own SMEs.

There’s something else unique about Pclass, SibSp and Parch. Even though they’re all numbers, they’re also categories.

How so?

Think about it like this. If you can group data together in your head fairly easily, there’s a chance it’s part of a category.

The Pclass column could be labelled, First, Second and Third and it would maintain the same meaning as 1, 2 and 3.

Remember how machine learning algorithms love numbers? Since Pclass, SibSp and Parch are already all in numerical form, we’ll leave them how they are. The same goes for Age.

Phew. That wasn’t too hard.


The Categorical Bucket

categorical_bucket.png

In our categorical bucket, we have Sex and Embarked.

These are categorical variables because you can separate passengers who were female from those who were male. Or those who embarked from C from those who embarked from S.

To train a machine learning model, we’ll need a way of converting these to numbers.

How would you do it?

Remember Pclass? 1st = 1, 2nd = 2, 3rd = 3.

How would you do this for Sex and Embarked?

Perhaps you could do something similar for Sex. Female = 1 and male = 2.

As for Embarked, S = 1 and C = 2.

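We can change these using LabelEncoder from the sklearn library. The .astype(str) is there so the couple of missing values in Embarked are treated as their own label instead of being mixed in as floats.

from sklearn.preprocessing import LabelEncoder
LabelEncoder().fit_transform(training["Embarked"].astype(str))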

Wait? Why does C = 0 and S = 2 now? Where’s 1? Hint: There’s an extra category, Q, this takes the number 1. See the data description page on Kaggle for more.
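
To see where those numbers come from, here’s a small sketch on a made-up Embarked column (the six values are my own stand-in, not the real data). LabelEncoder sorts the labels it finds and numbers them in that order, which you can read back from its `classes_` attribute.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample of ports: S, C and Q.
embarked = pd.Series(["S", "C", "S", "Q", "S", "C"], name="Embarked")

encoder = LabelEncoder()
encoded = encoder.fit_transform(embarked)

# classes_ is sorted; a label's position in it is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(encoded.tolist())
```

One thing to keep in mind: these integers imply an order (C < Q < S) that the ports don’t really have. Some models care about that, some don’t.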

We’ve made some good progress towards turning our categorical data into all numbers but what about the rest of the columns?

Challenge: Now you know Pclass could easily be a categorical variable, how would you turn Age into a categorical variable?


The Not Sure Bucket

not_sure.png

Name, Ticket and Cabin are left.

If you were on the Titanic, do you think your name would’ve influenced your chance of survival?

It’s unlikely. But what other information could you extract from someone's name?

What if you gave each person a number depending on whether their title was Mr., Mrs. or Miss.?

You could create another column called Title. In this column, those with Mr. = 1, Mrs. = 2 and Miss. = 3.

What you’ve done is created a new feature out of an existing feature. This is called feature engineering.
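
Here’s one way the Title idea could look in pandas. It’s a sketch on made-up names, and it assumes names follow the ‘Surname, Title. Given names’ pattern. The regular expression, the `title_codes` dictionary and the choice of 0 for any other title are my assumptions.

```python
import pandas as pd

# Made-up names in the "Surname, Title. Given names" pattern.
names = pd.Series(
    ["Smith, Mr. John", "Smith, Mrs. Jane (Mary Brown)", "Jones, Miss. Alice", "Taylor, Dr. Sam"],
    name="Name",
)

# Grab whatever sits between the comma and the first full stop.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Mr. = 1, Mrs. = 2, Miss. = 3; anything else gets 0 (an arbitrary choice).
title_codes = {"Mr": 1, "Mrs": 2, "Miss": 3}
title_feature = titles.map(title_codes).fillna(0).astype(int)

print(titles.tolist())
print(title_feature.tolist())
```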

Converting titles to numbers is a relatively simple feature to create. And depending on the data you have, feature engineering can get as extravagant as you like.

How does this new feature affect the model down the line? This will be something you’ll have to investigate.

For now, we won’t worry about the Name column to make a prediction.

What about Ticket?

ticket_head.png

The first few examples don’t look very consistent at all. What else is there?

training.Ticket.head(15)

The first 15 entries of the Ticket column.

These aren’t very consistent either. But think again. Do you think the ticket number would provide much insight as to whether someone survived?

Maybe if the ticket number related to what class the person was riding in, it would have an effect but we already have that information in Pclass.

To save time, we’ll forget the Ticket column for now.

Your first pass of EDA on a dataset should have the goal of not only raising more questions about the data but also getting a model built using the least amount of information possible so you’ve got a baseline to work from.

Now, what do we do with Cabin?

You know, since I’ve already seen the data, my spidey-senses are telling me it’s a perfect example for the next section.

Challenge: I’ve only listed a couple examples of numerical and categorical data here. Are there any other types of data? How do they differ to these?


What’s missing from the data and how do you deal with it?

missingno.matrix(training, figsize=(30, 10))
The missingno library is a quick way to visually check for holes in your data. It detects where NaN values (or no values) appear and highlights them. White lines indicate missing values.

The Cabin column looks like Johnny’s shoes. Not there. There are a fair few missing values in Age too.
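
If you want numbers to go with the picture, plain pandas can count the holes. A minimal sketch on a made-up five-row frame (not the real Titanic rows):

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in Age, Cabin and Embarked.
training = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [None, "C85", None, None, None],
    "Embarked": ["S", "C", "S", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Number of missing values per column.
missing_counts = training.isna().sum()

# Fraction missing per column, worst first.
missing_share = training.isna().mean().sort_values(ascending=False)

print(missing_counts)
print(missing_share)
```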

How do you predict something when there’s no data?

I don’t know either.

So what are our options when dealing with missing data?

The quickest and easiest way would be to remove every row with missing values. Or remove the Cabin and Age columns entirely.

But there’s a problem here. Machine learning models like more data. Removing large amounts of data will likely decrease the ability of our model to predict whether a passenger survived or not.

What’s next?

Imputing values. In other words, filling up the missing data with values calculated from other data.

How would you do this for the Age column?

When we called .head() the Age column had no missing values. But when we look at the whole column, there are plenty of holes.

Could you fill missing values with average age?

There are drawbacks to this kind of value filling. Imagine you had 1000 total rows, 500 of which are missing values. You decide to fill the 500 missing rows with the average age of 36.

What happens?

Your data becomes heavily stacked with the age of 36. How would that influence predictions on people 36-years-old? Or any other age?
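
A scaled-down sketch of that scenario: eight made-up ages, half of them missing, chosen so the mean of the known ones works out to 36. It shows both the pile-up at the mean and the shrinking spread.

```python
import numpy as np
import pandas as pd

# Made-up ages: four known, four missing.
ages = pd.Series([20.0, 30.0, 40.0, 54.0, np.nan, np.nan, np.nan, np.nan], name="Age")

# mean() skips NaN by default: (20 + 30 + 40 + 54) / 4
mean_age = ages.mean()
filled = ages.fillna(mean_age)

# How much of the column now sits exactly on the mean?
share_at_mean = float((filled == mean_age).mean())

# Spread before and after filling.
std_before = ages.std()
std_after = filled.std()

print(mean_age, share_at_mean)
print(round(std_before, 2), round(std_after, 2))
```

Half the column is now a single value and the standard deviation has dropped, even though you haven’t learned anything new about anyone’s age.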

Maybe for every person with a missing age value, you could find other similar people in the dataset and use their age. But this is time-consuming and also has drawbacks.

There are far more advanced methods for filling missing data out of scope for this post. It should be noted, there is no perfect way to fill missing values.

If the missing values in the Age column are a leaky drain pipe, the Cabin column is a cracked dam. Beyond saving. For your first model, Cabin is a feature you’d leave out.

Challenge: The Embarked column has a couple of missing values. How would you deal with these? Is the amount low enough to remove them?


Where are the outliers and why should you be paying attention to them?

‘Did you check the distribution?’ Athon asked.

‘I did with the first set of data but not the second set…’ It hit me.

There it was. The rest of the data was being shaped to match the outlier.

If you look at the number of occurrences of unique values within a dataset, one of the most common patterns you’ll find is Zipf’s law. It looks like this.

Zipf’s law: The highest occurring variable will have double the number of occurrences of the second highest occurring variable, triple the amount of the third and so on.

Remembering Zipf’s law can help to think about outliers (values towards the end of the tail don’t occur often and are potential outliers).
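
In its simplest form, Zipf’s law says the count at rank r is roughly the top count divided by r. Here’s a tiny sketch comparing that idealised curve with counts from a made-up list. The helper `zipf_expected` and the data are mine, rigged to fit perfectly; real data will only follow the curve roughly.

```python
from collections import Counter

def zipf_expected(top_count, n_ranks):
    """Idealised Zipf counts: the value at rank r appears top_count / r times."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]

# Made-up values, rigged to follow the law exactly.
values = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3

# Observed counts, most common first.
observed = [count for _, count in Counter(values).most_common()]
expected = zipf_expected(observed[0], len(observed))

print(observed)
print(expected)
```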

The definition of an outlier will be different for every dataset. As a general rule of thumb, you may consider anything more than 3 standard deviations away from the mean an outlier.

You could use a general rule to consider anything more than three standard deviations away from the mean as an outlier.
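
Here’s the rule of thumb as a short NumPy sketch. The purchase counts are made up to echo the story from earlier: 200 ordinary samples and one with 6800.

```python
import numpy as np

# Made-up purchase counts: 200 ordinary samples between 90 and 109, plus one extreme one.
purchases = np.array(list(range(90, 110)) * 10 + [6800], dtype=float)

mean = purchases.mean()
std = purchases.std()

# How many standard deviations each sample sits from the mean.
z_scores = (purchases - mean) / std

# Flag anything more than 3 standard deviations out.
outliers = purchases[np.abs(z_scores) > 3]

print(outliers)
```

A caveat: the mean and standard deviation are themselves dragged around by the outlier, so on small or very skewed datasets this rule can miss things. It’s a starting point, not a verdict.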

Or from another perspective.

Outliers from the perspective of an (x, y) plot.

How do you find outliers?

Distribution. Distribution. Distribution. Distribution. Four times is enough (I’m trying to remind myself here).

During your first pass of EDA, you should be checking what the distribution of each of your features is.

A distribution plot will help represent the spread of values across your data. And more importantly, help to identify potential outliers.

training.Age.plot.hist()

Histogram plot of the Age column in the training dataset. Are there any outliers here? Would you remove any age values or keep them all?
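
If you can’t plot (or want the numbers behind the bars), you can get the same information as counts per bin. A sketch on made-up ages, with 10-year bins as my own choice:

```python
import numpy as np
import pandas as pd

# Made-up ages, including one very young and one very old passenger.
ages = pd.Series([2, 19, 22, 24, 26, 28, 30, 35, 38, 54, 80], name="Age")

# Count how many ages fall into each 10-year bin from 0 to 90.
counts, bin_edges = np.histogram(ages, bins=np.arange(0, 100, 10))

# Summary statistics: count, mean, std, min, quartiles, max.
summary = ages.describe()

for left, count in zip(bin_edges[:-1], counts):
    print(f"{left:>2}-{left + 10:<3} {'#' * count}")
print(summary)
```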

Why should you care about outliers?

Keeping outliers in your dataset may result in your model overfitting (learning the quirks of the training data so well it does worse on new data). Removing all the outliers may result in your model being too generalised (it doesn’t do well on anything out of the ordinary). As always, best to experiment iteratively to find the best way to deal with outliers.

Challenge: Other than figuring out outliers with the general rule of thumb above, are there any other ways you could identify outliers? If you’re confused about a certain data point, is there someone you could talk to? Hint: the acronym contains the letters M E S.


Getting more out of your data with feature engineering

The Titanic dataset only has 10 features. But what if your dataset has hundreds? Or thousands? Or more? This isn’t uncommon.

During your exploratory data analysis process, once you’ve started to form an understanding AND you’ve got an idea of the distributions AND you’ve found some outliers AND you’ve dealt with them, the next biggest chunk of your time will be spent on feature engineering.

Feature engineering can be broken down into three categories: adding, removing and changing.

The Titanic dataset started out in pretty good shape. So far, we’ve only had to change a few features to be numerical in nature.

However, data in the wild is different.

Say you’re working on a problem trying to predict the changes in banana stock requirements of a large supermarket chain across the year.

Your dataset contains a historical record of stock levels and previous purchase orders. You're able to model these well but you find there are a few times throughout the year where stock levels change irrationally. Through your research, you find during a yearly country-wide celebration, banana week, the stock levels of bananas plummet. This makes sense. To keep up with the festivities, people buy more bananas.

To compensate for banana week and help the model learn when it occurs, you might add a column to your data set with banana week or not banana week.

# We know Week 2 is a banana week so we can set it using np.where()
df["Banana Week"] = np.where(df["Week Number"] == 2, 1, 0)
A simple example of adding a binary feature to indicate whether a week was banana week or not.

Adding a feature like this might not be so simple. You could find adding the feature does nothing at all since the information you’ve added is already hidden within the data. As in, the purchase orders for the past few years during banana week are already higher than other weeks.

What about removing features?

We’ve done this as well with the Titanic dataset. We dropped the Cabin column because it was missing so many values before we even ran a model.

But what about if you’ve already run a model using the features left over?

This is where feature contribution comes in. Feature contribution is a way of figuring out how much each feature influences the model.
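
There are several ways to measure feature contribution. As one possible approach (an assumption on my part, not necessarily how the graph in this post was made), here’s a sketch using the impurity-based `feature_importances_` of a scikit-learn random forest. The data is synthetic: survival mostly follows Sex, while Pclass and Noise are random.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic features: only Sex is related to the target below.
features = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),
    "Pclass": rng.integers(1, 4, n),
    "Noise": rng.normal(size=n),
})

# Synthetic target: equal to Sex, with roughly 10% of labels flipped.
flip = rng.random(n) < 0.1
survived = np.where(flip, 1 - features["Sex"], features["Sex"])

model = RandomForestClassifier(
    n_estimators=50, max_depth=3, max_features=None, random_state=0
)
model.fit(features, survived)

# One importance per feature, summing to 1; largest first.
importances = pd.Series(
    model.feature_importances_, index=features.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances can flatter features with many unique values, so treat the ranking as a hint about where to look next, not as proof.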

An example of a feature contribution graph using Sex, Pclass, Parch, Fare, Embarked and SibSp features to predict who would survive on the Titanic. If you’ve seen the movie, why does this graph make sense? If you haven’t, think about it anyway. Hint: ‘Save the women and children!’

Why is this information helpful?

Knowing how much a feature contributes to a model can give you direction as to where to go next with your feature engineering.

In our Titanic example, we can see the contribution of Sex and Pclass were the highest. Why do you think this is?

What if you had more than 10 features? How about 100? You could do the same thing. Make a graph showing the feature contributions of 100 different features. ‘Oh, I’ve seen this before!’

Zipf’s law back at it again. The top features have far more to contribute than the bottom features.

Zipf’s law at play with different features and their contribution to a model.

Seeing this, you might decide to cut the lesser contributing features and improve the ones contributing more.

Why would you do this?

Removing features reduces the dimensionality of your data. It means your model has fewer connections to make to figure out the best way of fitting the data.

You might find removing features means your model can get the same (or better) results with less data and in less time.

Like Johnny is a regular at the cafe I’m at, feature engineering is a regular part of every data science project.

Challenge: What are other methods of feature engineering? Can you combine two features? What are the benefits of this?


Building your first model(s)

Finally. We’ve been through a bunch of steps to get our data ready to run some models.

If you’re like me, when you started learning data science, this is the part you learned first. All the stuff above had already been done by someone else. All you had to do was fit a model on it.

Our Titanic dataset is small. So we can afford to run a multitude of models on it to figure out which is the best to use.

Notice how I put an (s) in the subtitle, you can pay attention to this one.

Cross-validation accuracy scores from a number of different models I tried using to predict whether a passenger would survive or not.
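
Here’s roughly what trying a few models with cross-validation can look like in scikit-learn. This is a sketch under assumptions: the data is synthetic (from `make_classification`) and the three models are my picks, not the exact list from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a prepared, all-numerical dataset.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "K-nearest neighbours": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy across 5 folds for each model.
scores = {
    name: float(cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```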

But once you’ve had some practice with different datasets, you’ll start to figure out what kind of model usually works best. For example, most recent Kaggle competitions have been won with ensembles (combinations) of different gradient boosted tree algorithms.

Once you’ve built a few models and figured out which is best, you can start to optimise the best one through hyperparameter tuning. Think of hyperparameter tuning as adjusting the dials on your oven when cooking your favourite dish. Out of the box, the preset setting on the oven works pretty well but out of experience you’ve found lowering the temperature and increasing the fan speed brings tastier results.

It’s the same with machine learning algorithms. Many of them work great out of the box. But with a little tweaking of their parameters, they work even better.
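
Turning the dials can be automated. A minimal sketch with scikit-learn’s `GridSearchCV`, again on synthetic data. The model and the small grid of settings are assumptions for illustration, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a prepared dataset.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

# The dials to try: every combination gets cross-validated.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The grid grows quickly as you add dials, so start small and widen it only where the results suggest it’s worth it.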

But no matter what, even the best machine learning algorithm won’t result in a great model without adequate data preparation.

Exploratory data analysis and model building is a repeating circle.

The EDA circle of life.

A final challenge (and some extra-curriculum)

I left the cafe. My ass was sore.

At the start of this article, I said I’d keep it short. You know how that turned out. It will be the same as your EDA iterations. When you think you’re done. There’s more.

We covered a non-exhaustive EDA checklist with the Titanic Kaggle dataset as an example.

1. What question are you trying to solve (or prove wrong)?

Start with the simplest hypothesis possible. Add complexity as needed.

2. What kind of data do you have?

Is your data numerical, categorical or something else? How do you deal with each kind?

3. What’s missing from the data and how do you deal with it?

Why is the data missing? Missing data can be a sign in itself. You’ll never be able to replace it with anything as good as the original but you can try.

4. Where are the outliers and why should you pay attention to them?

Distribution. Distribution. Distribution. Three times is enough for the summary. Where are the outliers in your data? Do you need them or are they damaging your model?

5. How can you add, change or remove features to get more out of your data?

The default rule of thumb is more data = good. And following this works well quite often. But is there anything you can remove and get the same results? Less but better? Start simple.

Data science isn’t always about getting answers out of data. It’s about using data to figure out what assumptions of yours were wrong. The most valuable skill a data scientist can cultivate is a willingness to be wrong.

There are examples of everything we’ve discussed here (and more) in the notebook on GitHub and a video of me going through the notebook step by step on YouTube (the coding starts at 5:05).

FINAL BOSS CHALLENGE: If you’ve never entered a Kaggle competition before, and want to practice EDA, now’s your chance. Take the notebook I’ve created, rewrite it from top to bottom and improve on my result. If you do, let me know and I’ll share your work on my LinkedIn. Get after it.

Extra-curriculum bonus: Daniel Formoso's notebook is one of the best resources you’ll find for an extensive look at EDA on a Census Income Dataset. After you’ve completed the Titanic EDA, this is a great next step to check out.

If you’ve got something on your mind you think this article is missing, leave a response below or send me a note and I’ll be happy to get back to you.

Source: https://towardsdatascience.com/a-gentle-in...

4000.

Over the holiday period I hit 4000 subscribers on YouTube.

Thank you all.

As you know, the number doesn’t mean much to me. I’d rather have 4000 people who are interested in what’s happening than 4,000,000 who are there because the crowd is.

To celebrate the milestone, I held a livestream answering your questions and sharing my curriculum for 2019.

As you’ll see in the video I focused on the questions more than the curriculum. The specific things I’m interested in learning might not be the same as yours but we do have one thing in common. The hunger for knowledge.

My curriculum is heavily focused on the intersection of health and technology. How can we use technology more or use it less to help us live healthier lives? I think about this question ferociously.

Some of the questions I answered in the video include:

  • How to study something new

  • Where to learn the math required for machine learning

  • How to get an internship in data science (or any field)

  • The two main things you should be focused on when deciding what to study (hint: curiosity + practicality)

If you have anything on your mind I didn’t get to, send me an email.

Happy New Year.



Work in progress

I’m working on a longer form article. An introduction to exploratory data analysis to go along with the Code with Me video I did exploring the Kaggle Titanic dataset and the notebook code to go with it.

I’ve spent the past two days writing and refining it.

I wanted to get it published today but it’s getting late and you know my thoughts on sleep. I work better when I sleep well.

In the past I’d have trouble walking away from something until it was done. But I’ve learned, especially with writing (and code), it pays to walk away, think about nothing for a while and then come back at it with a different pair of eyes.

The next time you look at it, you’ll see things you missed before. That’s what I’ll be doing tomorrow morning.

If you want to read it in the meantime, it’s in draft form on Medium. It needs some graphics and a little tidying but if you do read it, what would you change?

How to destroy your enemies according to Abraham Lincoln

In a speech Abraham Lincoln delivered at the height of the Civil War, he referred to the Southerners as fellow human beings who were in error. An elderly lady chastised him for not calling them irreconcilable enemies who must be destroyed. “Why, madam,” Lincoln replied, “do I not destroy my enemies when I make them my friends?” — page 23 of the 48 Laws of Power by Robert Greene.

Most of our enemies are in our head.

You think someone is conspiring against you when really, it’s you creating the image of them conspiring against you in your head.

Impatience is making time an enemy. 

Making the person you could be an enemy turns into a lack of motivation.

If you want to defeat your enemies, put down the sword and reach out with a hand.

When a customer came into Apple in a demonic state, there was a saying we had.

Kill them with kindness.

Or as Lincoln would say, make them your friend.

Looking forward to it

A family holiday.

A project.

Something you're training for.

Going on a date.

Finally graduating from that course you’ve been studying.

All fun when you do them.

But half the fun is looking forward to it.

That's why I like to set myself weekly, monthly and yearly things to look forward to.

The catch-up session with friends at the end of the week.

The monthly check-in with myself and the projects I'm working on.

And the family holiday at the end of the year. A half-yearly adventure is good too.

The actual event is often short-lived but if you plan it right, the excitement of anticipation can be everlasting.

The Five C's of Online Learning

This post originally appeared on Quora as my answer to 'Udacity or Coursera for AI machine learning and data science courses?'

Tea or coffee?

Burger or sandwich?

Rain or sunshine?

Pushups or pull-ups?

Can you see the pattern?

Similar but different. It’s the same with Udacity and Coursera.

I used both of them for my self-created AI Masters Degree. And they both offer incredibly high-quality content.

The short answer: both.

Keep scrolling for a longer version.

Let’s go through the five C’s of online learning.

If you’ve seen my work, you know I’m a big fan of digging your own path and online platforms like Udacity and Coursera are the perfect shovel. But doing this right requires thought around five pillars.


Curiosity

When you imagine the best version of yourself 3–5 years in the future, what are they doing?

Does it align with what’s being offered by Udacity or Coursera?

Is the future you a machine learning engineer at a technology company?

Or have you decided to take the leap on your latest idea and go full startup mode?

It doesn’t matter what the goal is. All of them are valid. Mine is different to yours and yours will be different to the other students in your cohort.

The important part is an insatiable curiosity. In Japanese, this curiosity is referred to as ikigai or your reason for getting up in the morning.

Day to day, you won’t be bounding out of bed running to the laptop to get into the latest class or complete the assignment you’re stuck on.

There will be days where everything else except studying seems like a better option.

Don’t beat yourself up over it. It happens. Take a break. Rest.

Even with all the drive in the world, you still need gas.


Contrast

Sam was telling me about a book he read over the holidays.

‘There were some things I agreed with but some things I didn’t.’

My insatiable curiosity kicked in.

‘What did you disagree with?’

I was more interested in that. He said it was a good book. What were the things he didn’t like?

Why didn’t he like those things?

The contrast is where you learn the most.

When someone agrees with you, you don’t have to back up your argument. You don’t have to explain why.

But have you ever heard two smart people argue?

I want to hear more of those conversations.

When two smart people argue, you’ve got an opportunity to learn the most.

If they're both smart, why do they disagree?

What are their reasons for disagreeing?

Take this philosophy and apply it to learning online through Udacity or Coursera.

If they’re like tea and coffee, where's the difference?

When I did the Deep Learning Nanodegree on Udacity, I felt like I had a wide (but shallow) introduction to deep learning.

Then when I did Andrew Ng’s deeplearning.ai after, I could feel the knowledge compounding.

Andrew Ng’s teachings didn’t disagree with Udacity’s, they offered a different point of view.

The value is in the contrast.


Content

Both partner with world-leading organisations.

Both have world class quality teachers.

Both have state of the art learning platforms.

When it comes to content, you won’t be disappointed by either.

I’ve done multiple courses on both platforms and I rate them among the best courses I’ve ever done. And I went to university for 5 years.

Udacity Nanodegrees tend to go for longer than Coursera.

For example, the Artificial Intelligence Nanodegree is two terms, each about 3–4 months long.

Whereas you can dip in and out of Coursera Specializations (although at times they run a similar length).

For example, complete part 1 of a Specialization, take a break and return to the next part when you’re ready. I’m doing this for the Applied Data Science with Python Specialization.

If content is at the top of your decision-making criteria, make a plan of what it is you hope to learn. Then experiment with each of the platforms to see which better suits your learning style.


Cost

Udacity has a pay upfront pricing model.

Coursera has a month-to-month pricing model.

There have been times I completed an entire Specialization on Coursera within the first month of signing up, hence only paying for one month.

Whereas, all the Udacity Nanodegrees I’ve done, I’ve paid the total up front and finished on (or after) the deadline.

This could be Parkinson’s Law at play: things take up as much time as you allow them.

Both platforms offer scholarships as well as financial support services, however, I haven’t had any experience with these.

I drove Uber on weekends for a year to pay for my studies.

I’m a big believer in paying for things.

Especially education.

When I pay for something, I take it more seriously.

Paying for something is a way of saying to yourself, I’m investing my money (and time spent earning it), I’d better invest my time into it too.

All the courses I’ve completed on both platforms have been worth more than the money I spent on them.


Continuation

You’ve decided on a learning platform.

You’ve decided on a course.

You work through it.

You enjoy it.

Now what do you do?

Do you start the next course?

Do you start applying for jobs?

Does the platform offer any help with getting into the industry?

Udacity has a service which partners students who have completed a Nanodegree with a careers counsellor to help you get a role.

I’ve never got a chance to use this because I was hired through LinkedIn.

What can you do?

Don’t be focused on completing all the courses.

Completing courses is the same as completing tasks. Rewarding. But more tasks don’t necessarily move the needle.

Focus on learning skills.

Once you’ve learned some skills. Practice communicating those skills.

How?

Share your work.

Have a nice GitHub repository with things you’ve built. Stack out your LinkedIn profile. Build a website where people can find you. Talk to people in your industry and ask for their advice.

Why?

Because a few digital certificates aren’t a reason to hire someone.

Done all that?

Good. Now remember, the learning never stops. There is no finish line.

This isn’t scary. It’s exciting.

You stop learning when your heart stops beating.


Let’s wrap it up

Both platforms offer some of the highest quality education available.

And I plan on continuing to use them both to learn machine learning, data science and many other things.

But if you can only choose one, remember the five C’s.

  1. Curiosity — Stay curious. Remember it when learning gets tough.

  2. Contrast — Remix different learning resources. All the value in life is at the combination of great things.

  3. Content — What content matches your curiosity? Follow that.

  4. Cost — Cost restrictions are real. But when used right, your education is worth it.

  5. Continuation — Learn skills, apply them, share them, repeat.

More

I’ve written and made videos about these topics in the past. You might find some of the resources below valuable.

Source: https://qr.ae/TUnFZB

You have to work up to be a level 35 boss

Sam and Josh are on the bed trying out my new bed base. I'm at the desk writing.

Sam speaks.

This is like a psychologist session for me.

I look up.

What are you thinking?

The goals I have and finding the fastest way possible to do them. Having fun as well. But I'm scared.

I speak.

What are you scared of?

Scared of doing them. Scared of them being impossible to reach. I don't have time or experience or the money to do it myself.

Josh speaks.

You are a level 1 crook at the moment. You have to work up to be a level 35 boss. That's how mafia works.

We laugh. He was talking about a meme going around. And he’s right.

I speak.

Write down what you're thinking. You may get lost in thoughts but found in the words. It helps me. And it's free to try.

Sam listens closely.

Then get up in the morning and write 1 thing down on a piece of paper. What's 1 thing you want to get done that day? Then do it the next day. Learn what it's like to get something small done.

He nods.

I continue.

Hey Siri 20-minute timer. I set a timer.

When it's time to do the thing you wrote down, set a timer for yourself. For the next 20-minutes do nothing except work on that thing. For the first few minutes it will be hard but then as you go on, you'll get into a flow.

He smiles and picks up his phone. Facebook is more entertaining than me.

I go back to writing.

I think of something else to tell him. Patience. But it can wait. I don't want to over lecture him. The best way to teach is to set an example anyway. I know I can do that.

He's holding himself back. I'm holding myself back. We're all holding ourselves back. It's the old version of ourselves. Our old way of thinking.

That's why we're scared of the goals we have. Because the old version of ourselves doesn't have the brain capacity to handle them.

Can you imagine trying to run the latest apps on an old operating system? The latest version of the Quora app wouldn’t run very well on iOS 6.

To achieve what you want to achieve, your old ways of thinking have to be upgraded.

This is where writing has helped me. Write down the best version of your future self. 3-5 years from now is a good timeline. Not too long. Not too short.

How would they think?

How would they make decisions?

What actions would they take?

Those are the features you want in the most important software there is: your way of thinking.

Birds of a feather flock together

Opposites attract is a common saying. But how much time would you really want to spend with someone who's the complete opposite of you?

If you're into health and fitness and working on challenging projects, how much time do you want to be spending with someone who likes spending their days on the couch watching reruns of reality TV shows?

Give yourself permission to change tribes if you need to.

And if the tribe doesn't exist, you can always create it.

My theme for 2019

Resolutions used to be my thing. One year I did without takeaway food for the whole year. The habit still stands.

This year, instead of a resolution, I'm going to follow a theme and make a promise. Are they the same? Who knows.

My theme for 2019 is Subtract. Any time I'm faced with a hard decision, I'll reflect back to this.

For example, should I start working more hours? No. Subtract. Quantity doesn't equal quality.

And my 2019 promise is to continue publishing a consistent stream of quality work.

How do these come into play?

Subtract comes from one of my favourite sayings, less but better.

Less but better was my resolution/theme/who knows for last year. And I love it so much, it's following through.

Here's the meat of what I'm going to be doing.

None

  • Instagram

  • Facebook services in general (I never go on Facebook and I'm deleting messenger)

Why?

Aside from the consistent bad media stories about the company, I kept catching myself scrolling without knowing what I was doing.

If I think about what I want to achieve this year, I don't want my time and effort being dedicated to those actions.

To take action, I deleted the apps on my phone and got my best friend to change my passwords on my accounts.

Less

  • Trying to convince others about what I'm working on (just do it instead)

  • Screen time (under 2-hours per day average)

Why?

Have you ever been held back on something you know you would enjoy because you were waiting for permission from someone else?

Here's what I did.

Dear Daniel,

I give you permission to work on the things you want to work on.

From,

Daniel

The new iPhone iOS has a setting which tells you how much time you spent on your phone in the past week. All of the work which I do that brings the most value is not done on a phone. Less screen time on my iPhone means more time to use elsewhere.

Continuing to cultivate

1. Staying healthy (sleep, food, movement, relationships)

2. Quality of work (educating and entertaining)

3. Long-term thinking (what action can I take now, to help me and others in 20-years time?)

4. Income streams (money isn't happiness but it does give you the freedom to spend your time how you want)

Why?

1. Health = life. Health is always first in any of my equations.

2. Keeping in line with my promise for 2019. You'll be able to find my work here, Medium, Quora, LinkedIn and YouTube. These are the platforms I'll be sticking to.

3. I played RuneScape religiously as a kid. My favourite skill was Farming. It wasn't like other skills where you gained points immediately. Instead, you planted a tree, came back 24-hours later and got a massive bonus. I want to apply this thinking to all of my life.

4. $100k online income. That's my number. That's what I'm going for. $275 a day. Making more money means I'll be able to reinvest it into building more long-term (and expensive) projects (see 3).

What are your goals/resolutions/themes/who knows for 2019? If you don't have any, which of mine do you disagree with? Send me an email, I'll reply.

PS I'm also going to be finishing off this list: 25 Things I’m Going To Do Before I Turn 26

Three things you’ll need

Whatever role you’re going for. A job, starting something of your own or even a new relationship, there are three things you’ll need:

1. Skills. Whatever it is you choose to do, be good at it. Really good. If you’re new, there’s nothing wrong with being a beginner, learn what it takes to be good.

2. Communication. A product with no shelf space won’t sell very much. The same goes with someone with skills but can’t demonstrate them very well. Being able to communicate to others what you’re good at is half the game. Share your work. Share your talents. Share your story.

3. People. We’re social creatures. There’s no such thing as self-made. Everyone has two parents and has learned a thing or two from someone else. Whatever you choose to do, it pays to know the right people. Sometimes the right person appears at exactly the right time. But best not to leave it to chance, start looking (they’re probably looking for you too).

External forces may push you around a bit

I used to work at Apple. I use an iPhone every day and I’m writing this on an iPad.

I’m a fan of the company. I appreciate their values and the products and services they offer.

When I start a business of my own, it’ll take cues from Apple.

This coming quarter will be the first quarter since 2016 Apple hasn’t passed expectations of their earnings.

As a stock holder, this means the price of my shares in the company will likely drop a bit (they’ve been dropping for the past couple of months). But I don’t care.

I’m in it for the long game. I’m not interested in hearing if a company makes new record profits every 3 months. I’m more interested in whether the company will be around in 20-years time.

Say you went on a date with a girl and then went on a second date and a third. Then you started to like her a bit. You decide to make things official and ask her to be your girlfriend, are you hoping the relationship lasts a couple of months? Or are you more interested in something more long term?

The same goes with health. Are you on a health kick because it’s the start of the New Year and you’re sticking to your resolution for January? Or are you looking for long term health gains because you’re a fan of yourself when you’re healthy?

Tim Cook sent out a memo to Apple employees talking about the recent headlines. This was my favourite excerpt.

‘External forces may push us around a bit, but we are not going to use them as an excuse. Nor will we just wait around until they get better. This moment gives us an opportunity to learn and to take action, to focus on our strengths and on Apple’s mission — delivering the best products on earth for our customers and providing them with an unmatched level of service. We manage Apple for the long term, and in challenging times we have always come out stronger.’

It’s good as it is but let me reword it for you.

External forces may push you around a bit, but you’re not going to use them as an excuse. Nor will you just wait around until things get better. Being pushed around gives you an opportunity to learn and to take action, to focus on your strengths and on your mission — delivering the highest quality of work forged in the fires of your soul. The short term is attractive but it isn’t what you’re playing for. Every challenge you’ve faced, you’ve conquered, this one will be no different.
