Overfitting, Underfitting, and Model Capacity

January 14, 2018

Transcript of the video

Can a machine learning model predict a lottery? Given that the lottery is fair and truly random, the answer must be no, right? What if I told you that it is indeed possible to fit a model to historical lottery data? Sounds awesome! Then why don't we go ahead and train such a model to predict the lottery?

Can't we put together a dataset that consists of historical lottery data, train a model that fits that dataset, use that model to predict the next lucky numbers, and get rich?

The problem is that a model being able to fit seen data doesn't mean that it will also perform well on unseen data. If a neural network has a large capacity, in other words, a large number of trainable parameters, and is trained long enough, then it can memorize the input-output pairs in the training set.

For example, in the lottery case, the input variable might be a timestamp, such as the date of the lottery, and the output variable might be the lucky number for that day's lottery. If we train a high-capacity model on the historical data where the inputs are the dates and the outputs are the winning numbers, the model can basically learn a lookup table that maps the inputs to the outputs, without having any predictive power. The model would give the correct outputs for the inputs that it has seen before, but it would not be able to make accurate predictions once it gets an input that it hasn't seen before, such as a future date.
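To make the memorization point concrete, here is a minimal sketch, not from the video, that fits an unconstrained decision tree to made-up lottery data; the day-index encoding, the number range, and the choice of model are assumptions for illustration only.

```python
# A sketch of memorization: a high-capacity model fit to random lottery
# outcomes behaves like a lookup table. All data here is made up.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
dates = np.arange(1000).reshape(-1, 1)        # each past draw encoded as a day index
numbers = rng.integers(0, 100, size=1000)     # truly random winning numbers

# An unconstrained tree has enough capacity to memorize every input-output pair.
model = DecisionTreeRegressor(max_depth=None).fit(dates, numbers)

print(model.score(dates, numbers))            # ~1.0: perfect fit on dates it has seen
print(model.predict([[1000]]))                # a future date: no real predictive power
```

The perfect training score says nothing about the prediction for day 1000; that output is just whichever memorized leaf the lookup happens to land in.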

This disparity between the performance on the seen and previously unseen data is called the generalization gap. It's common in machine learning problems to observe a gap between training and test performance.

There's usually a link between capacity and error. Increasing model capacity helps reduce training error, but it also increases the risk of overfitting, leading to a larger generalization gap.

Let's take a look at how model capacity affects the way a model fits the same set of data. First, let's fit a linear function to the data points. The problem here is that the error on training samples is quite high. The model has low variance but high bias. This is called underfitting.

To get a closer fit, we can increase the model capacity and try to fit a polynomial of a higher order. Now, the function fits the training samples almost perfectly. It has very low bias, but it has high variance and is unlikely to generalize well. For example, if we get a previously unseen input x0, the output of this function would be y0, which is not likely to be a good prediction. This is called overfitting: the model performs well on the training data but poorly on unseen data.

A better fit for this data would look something like this function, whose capacity sits somewhere between the two. It has low bias and low variance; therefore, it fits the training data closely while appearing to generalize well to unseen points.
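As a rough illustration of these three regimes, here is a small sketch using numpy.polyfit; the underlying quadratic curve, the noise level, and the polynomial degrees are all assumptions chosen for demonstration.

```python
# Underfitting, overfitting, and a balanced fit on the same noisy samples.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.1, size=x.shape)

underfit = np.poly1d(np.polyfit(x, y, deg=1))    # straight line: high bias, low variance
overfit  = np.poly1d(np.polyfit(x, y, deg=14))   # passes through every point: low bias, high variance
balanced = np.poly1d(np.polyfit(x, y, deg=2))    # capacity in between: low bias and low variance

x0 = 0.33                                        # a previously unseen input
for name, model in [("underfit", underfit), ("overfit", overfit), ("balanced", balanced)]:
    print(name, model(x0))                       # predictions at a point not in the training samples
```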

You might wonder: how can we know whether a model will perform well on unseen data if it's, well, unseen? We can't test a model on data points that are truly unseen, but we can hide a portion of our data during training and evaluate the model on that hidden part.

Typically, we divide our data into three subsets: training, validation, and test sets. We train our model on the training set and evaluate it on the validation set. We use the validation set to configure our model, which involves choosing hyperparameters such as the width and the depth of the network, how neurons are connected to each other, how long the model is trained, and so on.
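As a small, hypothetical illustration of this workflow, the sketch below sweeps one hyperparameter, the hidden-layer width of a scikit-learn MLP, and keeps the setting with the lowest validation error; the synthetic data and the candidate widths are assumptions, not the video's setup.

```python
# Using a validation set to choose a hyperparameter (here, the network width).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, size=1000)
X_train, y_train, X_val, y_val = X[:800], y[:800], X[800:], y[800:]

best_width, best_val_error = None, float("inf")
for width in (4, 16, 64, 256):                            # candidate hyperparameter values
    model = MLPRegressor(hidden_layer_sizes=(width,), max_iter=500,
                         random_state=0).fit(X_train, y_train)
    val_error = np.mean((model.predict(X_val) - y_val) ** 2)
    if val_error < best_val_error:                        # keep the configuration that
        best_width, best_val_error = width, val_error     # minimizes validation error

print(best_width, best_val_error)
```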

Let's take a look at an example where we use the error on the validation set to decide on the model capacity. There are several factors that affect model capacity, such as the number of trainable parameters and the number of training iterations. In this example, let's focus on the latter and use the validation set to choose when to stop training. We can treat the number of iterations we train a model for as just another hyperparameter that affects the model capacity. We will talk about why training a model longer increases its effective capacity later, but in short, the longer we train a model, the larger the parameter space it gets to explore.

As we train our model, we take snapshots of the model and run a validation routine that measures the error on the validation set periodically. We use the validation set to measure how well a model generalizes to unseen data. Finally, we choose the snapshot that minimizes the error on the validation set.
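A minimal early-stopping sketch of this idea is shown below, reusing the same kind of synthetic data as before; the incremental training via partial_fit and the deepcopy snapshots are implementation assumptions, not the video's exact procedure.

```python
# Early stopping: train incrementally, validate after each pass, keep the best snapshot.
import copy
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(1000, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, size=1000)
X_train, y_train, X_val, y_val = X[:800], y[:800], X[800:], y[800:]

model = MLPRegressor(hidden_layer_sizes=(64,), learning_rate_init=0.01)
best_val_error, best_snapshot, best_epoch = float("inf"), None, 0

for epoch in range(200):
    model.partial_fit(X_train, y_train)                   # one more pass over the training data
    val_error = np.mean((model.predict(X_val) - y_val) ** 2)
    if val_error < best_val_error:                        # snapshot whenever validation error improves
        best_val_error, best_snapshot, best_epoch = val_error, copy.deepcopy(model), epoch

print(best_epoch, best_val_error)                         # the chosen stopping point
```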

So far, we haven't used the test set in this example. You might wonder why a validation set alone is not enough, and we need a separate test set. Well, we don't strictly need a test set. It might be ok in some cases to have training and validation sets only. The purpose of the test set is to help us get an unbiased estimate of the generalization performance.

Especially when we have a lot of hyperparameters to tune, there is a risk of overfitting to the validation set. Although the model never sees the validation set, we do, and we tune the knobs to reconfigure our model accordingly. When we have a lot of knobs to tune, the model might end up being overly tuned to perform well on the validation set, yet fail to generalize well to truly unseen data. That's why it might be beneficial to have a separate test set to use once we are done with configuring and training our model.

Let's talk about how we choose our training, validation, and test sets. Unless we are already provided with them, we obtain these sets by partitioning all the data we have.

How we partition the data depends on how much data we have. For small and mid-sized datasets, it is common practice to use roughly 80% of the data for training and the rest for validation and testing. For example, if we have 10,000 samples, we can reserve 1,000 samples each for the validation and test sets and use the remaining 8,000 for training.

For large datasets, such as the ones that have millions of samples, it might be a waste to use 20% for the validation and test sets. If you have millions of samples, even 1% of the data might be enough for the validation and test sets, as long as the samples are partitioned in an unbiased way. For an unbiased split, randomly shuffling the data before partitioning is usually good enough. If the distribution of the labels in the dataset is heavily imbalanced, you might want to do stratified sampling to make sure that you have representative samples of all labels in all sets. Another caveat is that some datasets might have duplicate samples, so we need to make sure that the partitions are disjoint and we don't use the same samples for both training and evaluation.
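As a sketch of this partitioning recipe, the snippet below removes exact duplicates, shuffles, and makes a stratified 80/10/10 split with scikit-learn; the synthetic, imbalanced labels stand in for real data.

```python
# Deduplicate, shuffle, and split into training / validation / test sets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.uniform(size=10_000) < 0.05).astype(int)       # heavily imbalanced labels

# Keep the partitions disjoint: drop exact duplicate samples first.
X, unique_idx = np.unique(X, axis=0, return_index=True)
y = y[unique_idx]

# 80% for training, then split the remaining 20% evenly into validation and test.
# Stratifying keeps the label proportions representative in every set.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, shuffle=True, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))            # roughly 8000 / 1000 / 1000
```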

Although the validation and test sets usually consist of samples that come from the same distribution as the training set, this is not a strict requirement. It's not uncommon to train a model on samples from one source and test it on another source, which usually consists of more challenging samples. It might also be practical in some cases to train on loosely labeled, large-volume data, such as images crawled from the web and test on a smaller dataset with better quality labels, such as images manually annotated by humans.

One thing to keep in mind is to choose validation and test sets that reflect the type of data you plan to run your model on, or expect your model to receive, in the future.

Now we know what training and validation sets are and how they are used to recognize overfitting and underfitting. To recap, here's a cheat sheet for interpreting the error, or loss in general, on the training and validation sets.

If the loss is high, both on the training and validation sets, it's a sign of underfitting. Since the model does not even fit the training set well, we might want to increase model capacity by using a larger model or training the model for longer.

If the training loss is low but the validation loss is high, it's a sign of overfitting. Our model might be memorizing the training samples, yet learning nothing meaningful. To curb overfitting, we can try shrinking our model or using regularization techniques, which we will cover in the next video.

If the loss is low on both sets, then we might have achieved our goal. We can use the test set to get an unbiased estimate of the generalization performance. There are many different performance metrics that we can use, such as accuracy, precision, and recall. We will cover some of these metrics in another video.

It's unusual to see a lower loss on the validation set than on the training set. If your training loss is significantly higher than your validation loss, check your training and validation sets, and make sure you are training your model on the right dataset.
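The cheat sheet can be condensed into a tiny helper like the one below; the loss thresholds are arbitrary placeholders and depend entirely on the task and the scale of your loss.

```python
# A toy diagnostic based on the cheat sheet; thresholds are illustrative only.
def diagnose(train_loss, val_loss, high=1.0, gap=0.2):
    if train_loss > high and val_loss > high:
        return "underfitting: increase capacity or train longer"
    if val_loss - train_loss > gap:
        return "overfitting: shrink the model or add regularization"
    if train_loss - val_loss > gap:
        return "suspicious: check how the training and validation sets are set up"
    return "looks good: estimate generalization on the held-out test set"

print(diagnose(train_loss=1.5, val_loss=1.6))   # underfitting
print(diagnose(train_loss=0.1, val_loss=0.9))   # overfitting
```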

In the earlier example, we talked about choosing an appropriate model capacity by using the validation set to decide when to stop training. There are many other ways to keep model complexity under control to prevent overfitting. In the next video, we will focus on strategies that encourage simpler models while reducing the generalization error, known as regularization techniques. That's all for today. As always, thanks for watching, stay tuned, and see you next time.

Further reading: