Transcript of the video

Occam's razor states that when two competing hypotheses make the same predictions, the simpler one is better. This is not an unquestionable statement, but it is a useful principle in many contexts. In the context of machine learning, we can rephrase it as follows: given two models with similar performance, it's better to choose the simpler one, because simpler models are less likely to overfit and, in some cases, have extra computational benefits. For example, a model with fewer parameters demands less computational power.

Reducing the number of parameters is not the only way to make a model simpler, though. In fact, in deep learning, the best-performing models tend to be large models that are trained in a way that restricts the use of their full capacity. This strategy is called regularization.

Regularization techniques encourage a model to prefer simpler solutions without making any part of the hypothesis space completely unavailable. These techniques aim to reduce the risk of overfitting without significantly increasing the bias.

There are several ways to regularize a model. A very common regularization strategy is to add a weight decay term to the loss function. This weight decay term puts a constraint on the model complexity by penalizing large weights. For example, let's say our original loss function is the mean squared error, and we add the sum of squares of the weights, multiplied by a coefficient alpha, to this loss function as a weight decay term. Here, alpha is a hyperparameter that determines the strength of the regularization. Setting alpha to zero disables the regularizer. On the other hand, a large alpha may lead to underfitting.
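Written out, the regularized loss described here might look as follows. This is only a sketch under assumed notation: N is the number of samples, y_i and \hat{y}_i are the targets and predictions, and w_j are the weights. Only alpha is named in the video; the other symbols are introduced for illustration.

```latex
J(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 + \alpha \sum_{j} w_j^2
```

With alpha set to zero, the second term vanishes and we are left with the plain mean squared error.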

When we minimize this regularized loss function, our optimization algorithm tries to decrease both the original loss function and the squared magnitude of the weights, expressing a preference towards smaller weights. This type of regularization is called L2-regularization, and regression with L2-regularization is also called Ridge regression.
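To make this concrete, here is a minimal NumPy sketch of an L2-regularized mean squared error for a linear model, together with its gradient. The function names and the linear model `X @ w` are assumptions made for illustration; they are not from the video.

```python
import numpy as np

def l2_regularized_mse(w, X, y, alpha):
    """Mean squared error plus alpha times the sum of squared weights."""
    residuals = X @ w - y                 # prediction errors of a linear model
    mse = np.mean(residuals ** 2)         # original loss
    penalty = alpha * np.sum(w ** 2)      # weight decay term
    return mse + penalty

def l2_regularized_mse_grad(w, X, y, alpha):
    """Gradient of the regularized loss with respect to the weights."""
    residuals = X @ w - y
    # The second term pulls every weight toward zero in proportion to its size.
    return 2.0 * X.T @ residuals / len(y) + 2.0 * alpha * w
```

A gradient descent step would then be `w = w - learning_rate * l2_regularized_mse_grad(w, X, y, alpha)`, where `learning_rate` is another hypothetical hyperparameter.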

L2-regularization is not the only type of weight decay. Another option is to use the sum of absolute values of the weights instead of the sum of their squares. This type of weight decay is called L1-regularization, and regression with L1-regularization is also known as LASSO. A key property of L1-regularization is that it leads to sparser weights. In other words, it drives less important weights to zero, therefore acting like a natural feature selector.

The reason is that L2-regularization penalizes small weights less than large ones, since it minimizes the squared magnitude of the weights. So, there isn't really a big incentive for the model to drive the small weights all the way to zero. For example, reducing a large weight from a to b decreases the loss greatly, whereas reducing a smaller weight by the same amount decreases the loss by a much smaller amount.
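A quick bit of arithmetic shows the effect. The numbers below are made up for illustration and are not the ones used in the video: we reduce a large weight and a small weight by the same amount and compare how much the squared penalty drops.

```python
def l2_penalty_drop(w_before, w_after):
    """How much the squared-weight penalty decreases when a weight shrinks."""
    return w_before ** 2 - w_after ** 2

# Both weights are reduced by exactly 1.0 (illustrative values).
drop_large = l2_penalty_drop(5.0, 4.0)   # 25 - 16 = 9
drop_small = l2_penalty_drop(1.0, 0.0)   # 1 - 0 = 1
```

The same reduction buys nine times more penalty relief on the large weight, so the optimizer has little reason to spend effort on the small one.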

L1-regularization, on the other hand, pushes all the weights toward zero at the same rate, regardless of their magnitude. As a result, some weights get smashed down to zero, and only a subset of the weights survives.
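The difference can be sketched by applying only the penalty part of an update step to a few weights. Note the assumptions: the L1 step below uses soft-thresholding (a proximal update), which is one common way to handle the absolute value and is not necessarily what the video has in mind. The function names, `lr`, and the example values are hypothetical.

```python
import numpy as np

def l2_shrink_step(w, lr, alpha):
    """Penalty-only gradient step for alpha * w^2: shrinks in proportion to w."""
    return w - lr * 2.0 * alpha * w

def l1_shrink_step(w, lr, alpha):
    """Soft-thresholding for alpha * |w|: moves every weight toward zero
    by the same amount lr * alpha, and clips at zero instead of overshooting."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * alpha, 0.0)

w = np.array([2.0, 0.5, -0.05])
after_l2 = l2_shrink_step(w, lr=0.1, alpha=1.0)  # roughly [1.6, 0.4, -0.04]
after_l1 = l1_shrink_step(w, lr=0.1, alpha=1.0)  # roughly [1.9, 0.4, 0.0]
```

Under L2, the smallest weight gets smaller but stays nonzero; under L1, it lands exactly on zero.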

Although L1-regularization has this nice sparsity property, personally, I have rarely seen it lead to a significantly better performance than L2-regularization. So, I would say L2-regularization is pretty much the go-to option unless you have a reason to use L1.

In the previous video, I mentioned that training a model for more iterations increases its effective capacity, and we talked about how we can make use of this information to avoid overfitting. We take snapshots of the model periodically, run the snapshots over a validation set, and pick the snapshot that has the best validation set performance. Once the validation loss stops improving for a while, it's usually practical and efficient to stop training. This practice is called early stopping.
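The stopping rule can be sketched in a few lines. This version assumes we already have the validation loss of each snapshot in a list; the function name and the `patience` parameter (how many evaluations without improvement we tolerate) are illustrative choices, not something defined in the video.

```python
def early_stopping(val_losses, patience):
    """Return (index of the best snapshot, index at which training stopped)."""
    best_loss = float("inf")
    best_index = -1
    stopped_at = -1
    since_improvement = 0
    for i, loss in enumerate(val_losses):
        stopped_at = i
        if loss < best_loss:
            # New best snapshot: remember it and reset the counter.
            best_loss = loss
            best_index = i
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                break  # no improvement for `patience` evaluations in a row
    return best_index, stopped_at

# Illustrative losses: training stops at index 5 and keeps snapshot 2.
result = early_stopping([1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.6], patience=3)
```

In a real training loop, the losses would be computed one at a time as training progresses, and the best snapshot's weights would be saved to disk.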

Early stopping has a regularization effect that is similar to weight decay. Given that we initialize our model with small weights, every training iteration has the potential to update a weight towards larger values. Therefore, the earlier we stop the training, the smaller the weights are likely to be.

Another simple yet effective regularization method is the dropout technique. As its name suggests, dropout randomly drops out neurons with some probability during training. Basically, the training algorithm uses a random subset of the network at every iteration. This approach encourages neurons to learn useful features on their own without relying on other neurons.

Once the model is trained, the entire network is used for inference. The outputs of the neurons are scaled so that their overall magnitude stays the same even though the number of active neurons differs between training and test time.
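Here is a minimal NumPy sketch of the training and inference behavior described here. The function names and the `drop_prob` parameter are assumptions for illustration. Many libraries instead use the "inverted" variant, which divides by the keep probability during training and leaves inference untouched; the expected output magnitude is the same either way.

```python
import numpy as np

def dropout_train(activations, drop_prob, rng):
    """Zero out each activation independently with probability drop_prob."""
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask

def dropout_inference(activations, drop_prob):
    """Use every unit, scaled by the keep probability so the expected
    magnitude matches what the next layer saw during training."""
    return activations * (1.0 - drop_prob)

rng = np.random.default_rng(0)
x = np.ones(100_000)
train_out = dropout_train(x, drop_prob=0.5, rng=rng)   # about half are zeros
test_out = dropout_inference(x, drop_prob=0.5)         # every entry is 0.5
```

On average, the training outputs and the scaled inference outputs have the same magnitude.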

This approach is somewhat similar to ensemble methods in machine learning. Ensemble methods train multiple models separately for the same task, then combine them to achieve better predictive performance. The difference in dropout is that the training algorithm doesn't train disjoint models. Instead, a random sub-network is selected at every step. These sub-networks share parameters as they all come from the same network, but with a different set of units masked. In a way, dropout can be considered a type of ensemble method that trains nearly as many models as the number of steps, where the models share parameters.

Regularization methods, in general, introduce additional prior knowledge in optimization processes. For example, weight decay and early stopping encode the additional information that smaller weights are preferable to larger ones. There are also other types of prior knowledge that we can inject into our optimization process to improve our models. One example would be sharing parameters between units, assuming that if one feature is useful in one part of the input signal, it's likely that it's useful in other parts too. This is a useful assumption, particularly if the inputs are images or audio signals. We will talk about parameter sharing in more detail when we cover convolutional neural networks in another video.

To prevent overfitting, we usually have two types of options: limiting the effective capacity of a model or getting more training data. The assumptions we made so far focused on the first one: limiting the model capacity. We can also make some other assumptions that can help us increase the amount of data we have. For example, we might know that distorting the input in some ways shouldn't change the results. Based on this prior knowledge, we can increase the amount of training data we have by generating modified versions of the samples that we already have. This technique is called data augmentation.

How we augment the data depends on the type of the data and the nature of the task. If the task is to classify flower pictures, we can assume that flipping and rotating an image shouldn't change the labels. So we can train a model on randomly rotated and flipped versions of an input image.
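As a rough sketch, random flips and rotations can be done with plain NumPy. The function name is hypothetical, and to keep things simple the rotation is restricted to multiples of 90 degrees; arbitrary angles would need an image library.

```python
import numpy as np

def random_flip_rotate(image, rng):
    """Randomly mirror an H x W (or H x W x C) image and rotate it by
    a random multiple of 90 degrees. The label stays unchanged."""
    if rng.random() < 0.5:
        image = np.fliplr(image)            # horizontal mirror
    k = int(rng.integers(0, 4))             # 0, 1, 2, or 3 quarter turns
    return np.rot90(image, k=k, axes=(0, 1))

rng = np.random.default_rng(0)
image = np.arange(16).reshape(4, 4)         # stand-in for a real picture
augmented = random_flip_rotate(image, rng)
```

Calling this function on every sample each epoch gives the model a slightly different view of the same flower every time. Note that for non-square images, an odd number of quarter turns swaps height and width.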

However, if the task is, for example, to recognize handwritten digits, an excessive amount of rotation might not be acceptable. For example, a six might become a nine, and an eight might become infinity after rotation. We could, however, add some random noise to the input, and the labels would still be the same.

Data augmentation is particularly useful for image recognition related tasks, where we can simulate a wide variety of imaging conditions, such as scaling, translating, rotating, mirroring, brightness and contrast shift, additive noise, and many others. Other types of tasks, such as speech recognition or time series prediction, can also benefit from it to some extent. For example, random noise can be added to an audio signal for data augmentation.
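Additive noise is probably the easiest augmentation to sketch, and it works the same way for images and audio. The example below assumes zero-mean Gaussian noise and a made-up sine wave standing in for an audio clip; the function name and `noise_std` are illustrative choices.

```python
import numpy as np

def add_gaussian_noise(signal, noise_std, rng):
    """Return a copy of the signal with zero-mean Gaussian noise added."""
    noise = rng.normal(loc=0.0, scale=noise_std, size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000)
clean = np.sin(2.0 * np.pi * 5.0 * t)       # stand-in for an audio clip
noisy = add_gaussian_noise(clean, noise_std=0.05, rng=rng)
```

The noise level has to be small enough that a human would still assign the same label to the noisy sample.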

We now know how to augment data, but you might ask where to find data in the first place. This is going to be the topic of the next video, where we will talk about where to find data, how to collect it to create our own datasets, and how to preprocess it before feeding it to a model.

That's all for today. Thanks for watching, stay tuned, and see you next time.

