How to explore your first Kaggle competition dataset and make a submission

The first time doing something is always the hardest.

People had asked me in the past, 'Have you entered Kaggle competitions?'

'Not yet.'

Until the other day. I made my first official submission.

I'd dabbled before. Looked around at the website. Read some posts. But I'd never properly downloaded the data and gone through it.

Why?

Fear. Fear of looking at the data and having no idea what to do. And then feeling bad for not knowing anything.

But after a while, I realised that's not a helpful way to think.

I downloaded the Titanic dataset. The one that says 'Start here!' when you visit the competitions page.

A few months into learning machine learning, I wouldn't have been able to explore the dataset.

I learned by starting at the top of the mountain instead of climbing up from the bottom. I started with deep learning instead of practising how to explore a dataset from scratch.

But that's okay. The same principle would apply if you started by exploring datasets from scratch. Once the datasets get bigger and you want your models to be better, you'll have to learn deep learning eventually.

Working through the Titanic data took me a few hours. Then another few hours to tidy up the code. The first run-through of any data exploration should always be a little messy. After all, you're trying to build an intuition of the data as quickly as possible.
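If you've never done that first messy pass, it might look something like this sketch (assuming pandas is installed; a tiny stand-in DataFrame replaces the competition's train.csv so the snippet runs on its own):

```python
import pandas as pd

# Stand-in for pd.read_csv("train.csv") -- a few rows with Titanic-style columns
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived":    [0, 1, 1, 0],
    "Pclass":      [3, 1, 3, 2],
    "Sex":         ["male", "female", "female", "male"],
    "Age":         [22.0, 38.0, None, 35.0],
    "Fare":        [7.25, 71.28, 7.92, 8.05],
})

# First-pass questions: what's here, what's missing, what's the target balance?
print(train.shape)                               # rows and columns
print(train.dtypes)                              # which features are numeric?
print(train.isna().sum())                        # Age has missing values
print(train["Survived"].mean())                  # overall survival rate
print(train.groupby("Sex")["Survived"].mean())   # quick check for signal
```

None of this is fancy, and that's the point: a handful of one-liners gets you oriented before any modelling starts.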

Then came submission time. My best model got a score of just under 76%. Yours will too if you work through the steps in the notebook on my GitHub.
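To give you a feel for what a submission involves (this is a minimal baseline sketch, not the notebook's exact model — assuming scikit-learn, with toy DataFrames standing in for the real train.csv and test.csv):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the competition's train.csv / test.csv
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 2, 1, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 53.1, 8.46],
})
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass":      [3, 3],
    "Sex":         ["male", "female"],
    "Fare":        [7.83, 7.0],
})

features = ["Pclass", "Sex", "Fare"]
X_train = pd.get_dummies(train[features])            # one-hot encode Sex
X_test = pd.get_dummies(test[features]).reindex(
    columns=X_train.columns, fill_value=0)           # align columns with train

model = LogisticRegression()
model.fit(X_train, train["Survived"])

# Kaggle's Titanic submission format: one row per test passenger,
# with PassengerId and the predicted Survived value
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)
```

The resulting submission.csv is what you upload on the competition page to get a score.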

I made the notebook accessible so you can follow it through and make your very own first Kaggle submission.

There are a few challenges and extensions too if you want to improve on my score. I encourage you to see how you go with these. They might improve the model, they might not.

If you do beat my score, let me know. I'd love to hear about what you did.

Want a coding buddy? When I finished my first submission, I livestreamed myself going step by step through the code. I did my best to explain each step without going into every little detail (otherwise the video would've been 6 hours long instead of 2).

I'll be writing a more in-depth post on the what and why behind the things I did in the notebook. Stay tuned for that.

In the meantime, go and beat my score!

You can find the full code and data on my GitHub.