My first contribution to an open source deep learning library

GitHub still confuses me. But it's needed. You can create your own tools but the best come from collaboration.

The philosophy of open source is simple. Take the best information and knowledge from others, make it available to everyone in an accessible manner and let them create.

It says: here's the thing we've built and you can use it for free. If you find a way to improve it, let us know, though we'd appreciate it even more if you made the change yourself.

Most open source libraries have far more users than contributors. And that's a good thing. It shows the scalability of software. It means many can benefit from the work of a few.

Since starting to learn machine learning, I've used plenty of open source software but I'd never contributed back. Until now.

We've been working on a text classification problem at Max Kelsen. The model we built was good, really good. But it wasn't perfect. No model is. So we wanted to know what it didn't know.

Our search led to Bayesian methods. I don't have the language to describe them properly but they offer a solution to the problem of figuring out what your model doesn't know.

How?

In our case, we used Monte Carlo dropout to estimate model uncertainty. Monte Carlo dropout randomly switches off part of your model every time it makes a prediction. The Monte Carlo part means you end up with 100 (this number can change) different predictions on each sample, all made with slightly different versions of your original model. How much your 100 predictions vary indicates how certain or uncertain your model is about a prediction. In the ideal scenario, all 100 would be the same, whereas 100 different predictions would be considered very uncertain.
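To make the idea concrete, here's a minimal sketch in plain Python. It is not the code from our pull request and it doesn't use the fast.ai library. The "model" is a toy weighted sum, and the names predict_with_dropout and mc_dropout, along with the example weights and features, are made up for illustration. It only shows the loop: run the same input through the model many times with dropout left on, then look at how much the predictions spread.

```python
import random
import statistics


def predict_with_dropout(weights, features, drop_prob, rng):
    """One stochastic forward pass through a toy linear model.

    Each weight is switched off with probability drop_prob. The weights
    that survive are scaled by 1 / keep_prob (inverted dropout) so the
    expected output matches the model with no dropout.
    """
    keep_prob = 1.0 - drop_prob
    total = 0.0
    for w, x in zip(weights, features):
        if rng.random() < keep_prob:
            total += (w / keep_prob) * x
    return total


def mc_dropout(weights, features, n_passes=100, drop_prob=0.5, seed=0):
    """Repeat the stochastic pass n_passes times on the same input.

    Returns (mean, spread), where spread is the population standard
    deviation of the predictions. A larger spread means less certainty.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    preds = [
        predict_with_dropout(weights, features, drop_prob, rng)
        for _ in range(n_passes)
    ]
    return statistics.mean(preds), statistics.pstdev(preds)


if __name__ == "__main__":
    # Hypothetical weights and one hypothetical input sample.
    weights = [0.5, -1.0, 2.0]
    features = [1.0, 2.0, 3.0]
    mean, spread = mc_dropout(weights, features, n_passes=100, drop_prob=0.5)
    print(f"mean prediction: {mean:.3f}, spread: {spread:.3f}")
```

With drop_prob set to 0.0, every pass gives the same answer and the spread is zero, which is the "all 100 the same" case. In a real classifier you'd look at the spread of the predicted class probabilities instead of a single number.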

Our text classifier was based on the ULMFiT architecture using the fast.ai deep learning library. This worked well but the fast.ai library didn't have Monte Carlo dropout built in. We built it for our problem and it worked well.* Maybe others could find value in it too, so we made a pull request to the fast.ai GitHub repository.

With a few changes from the authors, the code was accepted. Now others can use the code we created.

A contribution to open source doesn't have to be adding new functionality. It could be fixing an error, adding documentation or making existing code run better.

Still stuck?

It's best to start by scratching your own itch. You might not have one to begin with. I didn't for two years. But now that I've done it once, I know what's required for next time.

If you want to learn more, I made a video about the what, why and how of a pull request. And I used the one we made to the fast.ai library as the example.

*After a few more experiments, we've started to question the usefulness of the Monte Carlo dropout method. In short, our thinking is that if you simulate different versions of your model enough times, eventually you end up back at the same model you started with. So the pull request may not be as useful as we originally thought. You have to be skeptical of your own work. Doing so is what makes it better. Stay tuned.