Practical Methodology in Deep Learning

April 21, 2018

Transcript of the video

It's certainly useful to know the fundamental ideas and the math behind deep learning, but in practice you rarely need a lot of math to build a machine learning model. In this video, we are going to focus on practical information and go through a basic recipe for machine learning, which can be used to tackle many types of machine learning problems.

Earlier, we summarized machine learning as using data to optimize a model towards improving its future performance on a given task. Let's go through each one of these components to build a practical scheme: the task, data, model, and performance.

The first step is to understand what the task is and to formulate the problem. Do we need to classify data, generate data, predict some variables, or transform from one type of data to another? Let's say we want to build a mobile app that classifies the types of dog breeds using a smartphone camera.

Once we know what we want to do, we need to find data. We have several options here. In some cases, the data is already publicly available, while in other cases, we might need to collect our own data. You can check out my earlier video on data collection to learn more.

Let's say we have a set of labeled images for dog breeds. Next, we need to build a model that takes pictures of dogs as input and outputs the labels for the breeds. As you may remember from my earlier video on designing deep models, a good strategy here would be to pick something that worked for a similar problem. Even if there isn't a very similar problem that has been studied earlier, you can pick a model that has proven successful in the same category of problems. For our dog classifier, for example, a generic image classification model would be a sensible choice. It's crucial to come up with a baseline model as soon as possible. Once you have a baseline, you can always improve on it iteratively.
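
As a rough sketch of what such a baseline could look like, here is one way to reuse a pretrained image classifier and replace its final layer for our breeds. This assumes PyTorch and torchvision; the number of breeds and the weight identifier are placeholders, not something from the video.

```python
# A minimal baseline sketch: reuse a pretrained ImageNet classifier and
# swap its head for our classes. The specifics here are assumptions.
import torch.nn as nn
import torchvision.models as models

num_breeds = 120  # hypothetical number of dog breeds in our dataset
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, num_breeds)
# Fine-tuning this on the labeled dog images gives a first baseline
# that we can then improve on iteratively.
```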

We also need to define what improvement means and how we measure the performance. For classification tasks, it is common to use accuracy as the performance metric. If the model allows for adjusting the trade-off between true-positive and false-positive rates, the area under an ROC curve is also commonly used.
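
As a small example, both metrics are available in scikit-learn; the labels and scores below are made up for illustration.

```python
# Computing accuracy and ROC AUC with scikit-learn (toy labels and scores).
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]               # ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0]               # hard predictions from the model
y_score = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]  # predicted probability of class 1

print("accuracy:", accuracy_score(y_true, y_pred))
# ROC AUC works on the raw scores, since it sweeps over decision thresholds.
print("ROC AUC:", roc_auc_score(y_true, y_score))
```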

This is fine for some tasks, but not for others. For example, if the classes in a dataset are not balanced, using accuracy alone can be misleading. If we have two classes, one of which is very rare compared to the other, then the accuracy might be very high even if the model completely ignores the rare class. In such a case, it would make more sense to use precision and recall to quantify the performance. These metrics are also commonly used to evaluate detection and retrieval models.

Precision tells us how accurate the detections are, and recall tells us what proportion of the true samples are detected. Since it's convenient to have a single number that summarizes the performance, it's common to take the harmonic mean of the precision and recall, which is called the F-score, as an overall performance metric. It's also common to use the Mean Average Precision as an overall performance metric for a retrieval system.
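
As a quick sketch, all three of these are one-liners in scikit-learn; the labels below are placeholders.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)  # how accurate the detections are
r = recall_score(y_true, y_pred)     # what fraction of true samples are detected
f = f1_score(y_true, y_pred)         # harmonic mean: 2 * p * r / (p + r)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```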

Many other metrics are possible. The Jaccard index, for example, is the ratio of the intersection to the union of two sample sets. This is a common performance metric for object detection and localization tasks.
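
For bounding boxes, the same idea is usually written directly as intersection over union; a minimal sketch with made-up boxes could look like this:

```python
def iou(box_a, box_b):
    """Jaccard index (intersection over union) of two boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```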

Let's go back to our dog classifier example. Let's say our dataset is fairly balanced, and we pick accuracy as the performance metric. We trained the model and measured an accuracy of 96% on a test set. We will need a baseline to be able to tell whether this is a good number or not.

The easiest baseline to beat is a random model. If we have ten balanced classes, for example, the accuracy of a random model would be 10%. If our model is not doing any better than that, then it's no better than random guessing. If our model aims to improve on an existing model, then that model would be our baseline.
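
One way to sanity-check this, sketched below with scikit-learn's DummyClassifier on made-up data, is to score a model that predicts classes uniformly at random:

```python
# Random-guessing baseline on made-up data (10 roughly balanced classes).
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))      # placeholder features
y = rng.integers(0, 10, size=1000)   # placeholder labels, 10 classes

baseline = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
print("random baseline accuracy:", baseline.score(X, y))  # around 0.10
```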

Another concern we might have is the computational cost. One model might be marginally more accurate than another, but does it run as fast? How about the model size and memory consumption? Especially if you plan to run the model on a low-power device, a lightweight model might be a better choice even if it's a little less accurate than its heavier counterpart.
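
A rough way to compare candidates on these axes, using torchvision models as stand-ins, is to time inference and count parameters; the numbers depend heavily on the hardware, so treat this only as a sketch.

```python
import time
import torch
import torchvision.models as models

def profile(model, name, n_runs=10):
    """Rough CPU latency and parameter count for a single 224x224 image."""
    model.eval()
    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        latency_ms = (time.perf_counter() - start) / n_runs * 1000
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {latency_ms:.1f} ms/image, {n_params / 1e6:.1f}M parameters")

profile(models.mobilenet_v2(), "mobilenet_v2")  # lightweight candidate
profile(models.resnet50(), "resnet50")          # heavier candidate
```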

Let's talk about how to identify what might be wrong with a model. We have already talked about overfitting, underfitting, and regularization. It's a sign of overfitting when the training error is low but the validation error is high. In such cases, we might want to increase the regularization strength or reduce the number of trainable parameters. If the training and validation errors are both high, then we might want to do the opposite and increase the model capacity. If that doesn't work, the problem might also be an implementation error.
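
The diagnosis boils down to comparing the two error rates; a crude rule-of-thumb sketch (the threshold below is arbitrary) might look like this:

```python
def diagnose(train_err, val_err, tol=0.05):
    """Crude rule of thumb for reading training vs. validation error."""
    if train_err < tol and val_err > train_err + tol:
        return "likely overfitting: add regularization or reduce capacity"
    if train_err > tol and val_err > tol:
        return "likely underfitting: increase capacity, or suspect a bug"
    return "errors look reasonable"

print(diagnose(train_err=0.01, val_err=0.20))  # overfitting pattern
print(diagnose(train_err=0.30, val_err=0.35))  # underfitting pattern
```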

What if the model has very high accuracy on a test set, but our app doesn't seem to work well in practice? Let's say we decided to deploy our dog classifier on a smartphone. It had 96% accuracy on our test set, but it doesn't seem to work well on pictures of dogs captured with our phone.

One thing we can do is to make sure that the inputs are preprocessed the same way during training and deployment. For example, the images that the model was trained on and the ones captured by the phone might be in different color spaces.
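
As one concrete, hypothetical example of such a mismatch: OpenCV loads images in BGR order, while many training pipelines expect RGB with a particular normalization. The file name, input size, and mean/std values below are assumptions for illustration.

```python
import cv2
import numpy as np

img_bgr = cv2.imread("phone_capture.jpg")           # hypothetical file name
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)  # match the training color space
img = cv2.resize(img_rgb, (224, 224)).astype(np.float32) / 255.0
# Use the same mean/std as at training time (the values below are assumptions).
img = (img - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
```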

To test model consistency, we can capture a picture with the device, and run inference on it on the platform where we ran the tests earlier. If the results are different, then there might be a bug in the implementation.

We should also make sure that the test set is representative of what we expect our model to see in the future. If the test set is vastly different from what the model actually gets, then the test set accuracy might not tell a lot about the real-life performance of the model.

The problem might also be the dataset. Unless you are using a well-known benchmark dataset, it's not uncommon to have bugs in datasets. For example, some labels in the dataset might be wrong. Besides noisy labels, entire classes might be mislabeled. Class IDs and names might not be in the correct order. There might also be duplicate or near-duplicate classes with different names. Inspecting a confusion matrix would help identify some of these issues.
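
A confusion matrix is a one-liner with scikit-learn; the labels below are placeholders for the dog-breed test set.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2, 1]  # placeholder true labels
y_pred = [0, 2, 1, 1, 2, 0, 2, 1]  # placeholder predictions
print(confusion_matrix(y_true, y_pred))
# Rows are true classes, columns are predictions; large off-diagonal entries
# point at swapped, duplicate, or systematically mislabeled classes.
```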

Another technique that helps identify problems in a model is visualization: visualizing learned filters, activations, and the samples themselves. Hidden units in a neural network don't have to stay completely hidden. Visualizing the parameters and activations of hidden units might give us an idea of what the model is learning. Visualizing misclassified samples can also help us understand what types of inputs cause problems. Sorting these results by the model's confidence might reveal some patterns in what the model considers a difficult example.
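
As a small sketch of that last idea, here is one way to rank misclassified samples by the model's confidence, using made-up predicted probabilities in place of real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=100)  # fake predicted probabilities
y_true = rng.integers(0, 10, size=100)        # fake true labels

y_pred = probs.argmax(axis=1)
confidence = probs.max(axis=1)
wrong = np.flatnonzero(y_pred != y_true)

# Most confident mistakes first: these often expose label noise
# or systematic confusions between similar classes.
for i in wrong[np.argsort(-confidence[wrong])][:5]:
    print(f"sample {i}: predicted {y_pred[i]} ({confidence[i]:.2f}), true {y_true[i]}")
```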

That brings us to the end of the Deep Learning Crash Course series. I hope you enjoyed it. Please subscribe if you would like to see more videos like these. As always, thanks for watching, stay tuned, and see you next time.
