This kind of language is a constant throughout the book. Many of the techno-jargon terms are described in one or two lines without the fluff.
What’s a classification problem?
Classification is a problem of automatically assigning a label to an unlabelled example. Spam detection is a famous example of classification.
What’s a regression problem?
Regression is a problem of predicting a real-valued label (often called target) given an unlabelled example. Estimating house price valuation based on house features, such as area, the number of bedrooms, location and so on is a famous example of regression.
I took these from the book.
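To make the difference concrete, here's a minimal sketch of my own (not from the book). Both functions use hand-written rules and made-up numbers rather than anything learned from data. The point is only the kind of output each problem produces: a label for classification, a real number for regression.

```python
# Hypothetical toy examples, not taken from the book.

def classify_email(text):
    """Classification: assign a discrete label ('spam' or 'not spam')."""
    spam_words = {"winner", "free", "prize"}  # made-up keyword list
    words = set(text.lower().split())
    return "spam" if words & spam_words else "not spam"


def estimate_price(area_m2, bedrooms):
    """Regression: predict a real-valued target (a price)."""
    # Made-up coefficients; a real model would learn these from data.
    return 50_000 + 2_000 * area_m2 + 10_000 * bedrooms


print(classify_email("You are a WINNER claim your free prize"))
print(estimate_price(area_m2=120, bedrooms=3))
```

A learning algorithm's job is to find rules and coefficients like these from labelled examples instead of having you write them by hand.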
Chapter 3 and 4 — What are the best machine learning algorithms? Why?
Chapters 3 and 4 demonstrate some of the most powerful machine learning algorithms and what makes them learning algorithms.
You’ll find working examples of Linear Regression, Logistic Regression, Decision Tree Learning, Support Vector Machines and k-Nearest Neighbours.
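As a taste of how small some of these can be, here's a rough sketch of k-Nearest Neighbours in plain Python. It's my own illustration, not the book's code, and the 2D points and labels are invented. It assumes your examples have already been turned into numbers.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # `train` is a list of (features, label) pairs.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented data: two well-separated groups of points.
train = [
    ((1.0, 1.0), "sport"),
    ((1.2, 0.8), "sport"),
    ((0.9, 1.1), "sport"),
    ((5.0, 5.0), "science"),
    ((5.2, 4.9), "science"),
    ((4.8, 5.1), "science"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

There's no training step here at all. The "model" is just the stored examples plus a distance function.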
There’s plenty of mathematical notation but nothing you’re not equipped to handle after Chapter 2.
Burkov does an amazing job of setting up the theory, explaining a problem and then pitching a solution for each of the algorithms.
With this, you’ll start to see why inventing a new algorithm is a rare practice. It’s because the existing ones are good at what they do. And as a budding machine learning engineer, your role is to figure out how they can be applied to your problem.
Chapter 5 — Basic practice (Level 1 Machine Learning)
Now you’ve seen examples of the most useful machine learning algorithms, how do you apply them? How do you measure their effectiveness? What should you do if they’re working too well (overfitting)? Or not working well enough (underfitting)?
You’ll see how much of a data scientist’s or machine learning engineer’s time goes into making sure the data is ready to be used with a learning algorithm.
What does this mean?
It means turning data into numbers (computers don’t do well with anything else), dealing with missing data (you can’t learn on nothing), making sure it’s all in the same format, combining different pieces of data or removing them to get more out of what you have (feature engineering) and more.
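Here's a small sketch of two of those steps, filling in missing values and turning a category into numbers. The rows and column names are made up, and mean imputation plus one-hot encoding are just one common pair of choices, not necessarily what the book recommends for your data.

```python
# Invented rows: one numeric column with a gap, one categorical column.
rows = [
    {"area": 120.0, "location": "city"},
    {"area": None, "location": "suburb"},
    {"area": 80.0, "location": "city"},
]

# Missing data: fill gaps with the mean of the known values.
known = [r["area"] for r in rows if r["area"] is not None]
mean_area = sum(known) / len(known)

# Data into numbers: one-hot encode the categorical column.
categories = sorted({r["location"] for r in rows})


def to_features(row):
    area = row["area"] if row["area"] is not None else mean_area
    one_hot = [1.0 if row["location"] == c else 0.0 for c in categories]
    return [area] + one_hot


features = [to_features(r) for r in rows]
print(categories)
print(features)
```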
Once your data is ready, you have to choose the right learning algorithm. Different algorithms work better on different problems.
The book covers this.
You assess what your learning algorithm learned. This is the most important thing you’ll have to communicate to others.
It often means boiling weeks of work down into one metric. So you want to make sure you’ve got it right.
99.99% accuracy looks good. But what are the precision and recall? Or the area under the ROC curve (AUC)? Sometimes these are more important. The back end of Chapter 5 explains why.
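A quick sketch shows how accuracy can mislead. The data is invented: 10,000 examples with a single positive, and a "model" that always predicts negative. The function below is my own plain-Python version of the standard definitions, not code from the book.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when the denominator is zero (a convention, chosen here).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Invented, heavily imbalanced data: 1 positive in 10,000 examples.
y_true = [1] + [0] * 9999
y_pred = [0] * 10000  # a model that never predicts the positive class

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

The accuracy comes out at 99.99%, yet the model never finds the one example you probably care about, so recall is zero.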
Chapter 6 — The machine learning paradigm taking the world by storm, neural networks and deep learning
You’ve seen the pictures. Images of the brain with deep learning neural networks next to them. Some say they try to mimic the brain; others argue there’s no relation.
What matters is how you can use them and what they’re actually made of, not what they loosely resemble.
A neural network is a combination of linear and non-linear functions. Straight lines and non-straight lines. With enough of these combined, you can draw (model) almost any relationship in your data.
The Hundred-Page Machine Learning Book goes through the most useful examples of neural networks and deep learning, such as Feed-Forward Neural Networks, Convolutional Neural Networks (usually used for images) and Recurrent Neural Networks (usually used for sequences, like words in an article or notes in a song).
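Here's a tiny illustration of that idea, with weights I picked by hand rather than learned (so it's a sketch of the structure, not of training). Two straight lines, each passed through a ReLU non-linearity and then added together, produce a V shape, the absolute value function, which no single straight line can draw.

```python
def relu(z):
    """A common non-linear function: negative values become zero."""
    return max(0.0, z)


def tiny_network(x):
    # Hidden layer: two linear functions, each followed by a non-linearity.
    h1 = relu(1.0 * x + 0.0)
    h2 = relu(-1.0 * x + 0.0)
    # Output layer: a linear combination of the hidden values.
    return 1.0 * h1 + 1.0 * h2 + 0.0


for x in (-3.0, 0.0, 2.0):
    print(x, tiny_network(x))
```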
Deep learning is what you’ll commonly hear referred to as AI. But after reading this book, you’ll realise that as much as it’s AI, it’s also a combination of the different mathematical functions you’ve been learning about in previous chapters.
Chapter 7 & 8 — Using what you’ve learned
Now you’ve got all these tools, how and when should you use them?
If you’ve got articles you need an algorithm to label for you, which one should you use?
If you’ve only got two categories of articles, sport and news, you’ve got a binary problem. If you’ve got more, sport, news, politics, science, you’ve got a multi-class classification problem.
What if an article could have more than one label? One about science and economics. That’s a multi-label problem.
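One way to see the difference between these three is in what the target looks like once it's a number. This is my own sketch with an invented class list. The encodings (a single 0/1, a one-hot vector, a multi-hot vector) are common conventions, not the only ones.

```python
# Invented label set for the article example.
classes = ["sport", "news", "politics", "science", "economics"]


def one_hot(label):
    """Multi-class: exactly one position is 1."""
    return [1 if c == label else 0 for c in classes]


def multi_hot(labels):
    """Multi-label: any number of positions can be 1."""
    return [1 if c in labels else 0 for c in classes]


binary_target = 1  # e.g. 1 = sport, 0 = news
multi_class_target = one_hot("science")
multi_label_target = multi_hot({"science", "economics"})

print(binary_target)
print(multi_class_target)
print(multi_label_target)
```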
How about translating your articles from English to Spanish? That’s a sequence-to-sequence problem, a sequence of English words to a sequence of Spanish words.
Chapter 7 covers these along with ensemble learning (using more than one model to predict the same thing), regression problems, one-shot learning, semi-supervised learning and more.
So you’ve got a bit of an understanding of which algorithm you can use when. What happens next?
Chapter 8 dives into some of the challenges and techniques you’ll come across with experience.
Imbalanced classes is the challenge of having more data for one label and not enough for another. Think of our article problem but this time we have 1,000 sports articles and only 10 science articles. What should you do here?
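One common option, and I'm not claiming it's the one the book lands on, is to weight each class inversely to how often it appears, so mistakes on the rare class cost more during training. The sketch below only computes the weights, using the heuristic n_samples / (n_classes * class_count). How you feed them to a learning algorithm depends on the library you use.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


# The article example: 1,000 sports articles and only 10 science articles.
labels = ["sport"] * 1000 + ["science"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these numbers, each science article counts for 100 times as much as each sports article.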
Are many hands better than one? Combining models trying to predict the same thing can lead to better results. What are the best ways to do this?
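The simplest version of this is a majority vote. Here's a sketch with made-up predictions from three hypothetical models. It's only one of several ways to combine models and not necessarily the book's preferred one.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of three different models for the same article.
model_outputs = ["science", "sport", "science"]
print(majority_vote(model_outputs))
```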
And if one of your models already knows something, how can you use this in another one? This practice is called Transfer Learning. You likely do it all the time. Taking what you know in one domain and using it in another. Transfer Learning does the same but with neural networks. If your neural network knows what order words in Wikipedia articles appear in, can it be used to help classify your articles?
What if you have multiple inputs to a model, like text and images? Or multiple outputs, like whether or not your target appears in an image (binary classification) and if it does, where (the coordinates)?
The book covers these.
Chapter 9 & 10 — Learning without labels and other forms of learning
Unsupervised learning is when your data doesn’t have labels. It’s a hard problem because you don’t have a ground truth to judge your model against.
The book looks at two ways of dealing with unlabelled data, density estimation and clustering.
Density estimation tries to specify the probability of a sample falling in a range of values as opposed to taking on a single value.
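As a rough sketch of what that means, you can assume your data follows a Gaussian (a big assumption, and the book covers estimators that don't need it), estimate its mean and standard deviation from samples, then ask for the probability of landing in a range. The article lengths below are invented.

```python
import math
import statistics


def gaussian_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution with mean mu and std sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_in_range(low, high, mu, sigma):
    """P(low < X <= high) under the fitted Gaussian."""
    return gaussian_cdf(high, mu, sigma) - gaussian_cdf(low, mu, sigma)


# Made-up article lengths (in words), purely for illustration.
samples = [480, 520, 500, 510, 490, 530, 470]
mu = statistics.mean(samples)
sigma = statistics.pstdev(samples)

# Probability of a new sample falling within one standard deviation.
prob = prob_in_range(mu - sigma, mu + sigma, mu, sigma)
print(mu, sigma, prob)
```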
Clustering aims to group samples which are similar together. For example, if you had unlabelled articles you’d expect ones on sports to be clustered closer together than articles on science (once they’ve been converted to numbers).
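Here's a bare-bones sketch of one clustering algorithm, k-means, in plain Python. The points and the starting centroids are invented, and I've fixed the starting centroids by hand to keep the example deterministic. Real implementations usually pick them more carefully.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Invented 2D points standing in for articles converted to numbers.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
print(centroids)
print(clusters)
```

Notice that no labels were used anywhere. The algorithm only ever looks at distances between points.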
Even if you did have labels, another problem you’ll face is having too many variables for the model to learn and not enough samples of each. The practice of fixing this is called dimensionality reduction. In other words, reducing the number of things your model has to learn but still maintaining the quality of the data.
To do this, you’ll be looking at using principal component analysis (PCA), uniform manifold approximation and projection (UMAP) or autoencoders.
These sound intimidating but you’ve built the groundwork to understand them in the previous chapters.
The penultimate chapter goes through other forms of learning such as learning to rank. As in, what Google uses to return search results.
Learning to recommend, as in what YouTube uses to recommend you videos to watch.
And self-supervised learning, as in the case of word embeddings, which are created by an algorithm reading text and remembering which words appear in the presence of others. It’s self-supervised because the words that appear next to each other serve as the labels. As in, dog is more likely to appear in a sentence with the word pet than the word car.
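To show where those "free" labels come from, here's a sketch that turns a sentence into (word, neighbouring word) training pairs, in the spirit of skip-gram style embeddings. The sentence, the function name and the window size are my own choices for illustration. Training the embeddings themselves is a separate step not shown here.

```python
def context_pairs(tokens, window=2):
    """Pair each word with the words within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = "my dog is a good pet".split()
pairs = context_pairs(tokens, window=1)
print(pairs)
```

Nobody labelled anything by hand. The text itself supplies both the inputs and the targets.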
The book that keeps giving: The accompanying wiki
The Hundred-Page Machine Learning Book is covered with QR codes. For those after extracurricular material, the QR codes link to accompanying documentation for each chapter. The extra material includes code examples, papers and references where you can dive deeper.
The best thing?
Burkov updates the wiki himself with new material. This further reinforces the book’s “start here and continue here” label for machine learning.