# Artificial Neural Networks: going deeper

December 23, 2017

# Transcript of the video

Welcome back! In the previous video, we talked about what artificial neural networks are and how to train a single neuron. If you haven't watched the previous video yet, find the link in the description below. In this video, we will pick up where we left off and talk about how to train deeper and more complex networks.

Quick recap: how do we train a model that has a single weight to learn? We define a loss function, which tells us how well the model is doing, take its derivative with respect to the weight, then incrementally update the weight in the direction opposite to the sign of the derivative.

The derivative simply tells us the slope of the loss function at a given point. For example, if the weight is at this point where the slope is positive, then we need to decrement the weight. If the weight is here on the opposite side, then the slope becomes negative, so we need to move right to move towards the value that minimizes the loss.
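Here is a minimal sketch of that update rule in Python. The data point, the starting weight, and the learning rate are made-up values chosen for illustration; they are not from the video.

```python
# Hypothetical toy problem: fit y = w * x to a single data point.
x, y_true = 2.0, 6.0    # assumed sample; the loss is minimized at w = 3
w = 0.0                 # arbitrary starting weight
learning_rate = 0.05    # assumed step size

def loss(w):
    # Squared error for the single sample
    return (w * x - y_true) ** 2

def dloss_dw(w):
    # Derivative of the squared error with respect to w
    return 2 * (w * x - y_true) * x

for step in range(100):
    # Positive slope: decrease w. Negative slope: increase w.
    w -= learning_rate * dloss_dw(w)

print(w, loss(w))
```

With these values, the weight moves right from 0 towards 3 because the slope on that side is negative.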

It's pretty straightforward to learn a single weight, but what do we do when we have many parameters to learn? We do the same thing: compute the derivatives and update the weights.

Take this simplified multi-layer network, for example. We can rewrite the derivative of the error with respect to the weight as the derivative of the error with respect to y, times the derivative of y with respect to the hidden node z, times the derivative of z with respect to w.

This is called the chain rule of calculus and computing the derivatives this way is called backpropagation, or backpropagating the error since the error is calculated at the output layer and propagated back through the earlier hidden layers.
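Written out, the chain-rule decomposition described above looks like this, where E is the error, y is the output, z is the hidden node, and w is the weight (the symbol E is my choice of notation for the error):

$$
\frac{\partial E}{\partial w} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}
$$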

We do this for all the weights in the network. We first randomly initialize them, then compute the outputs given the data samples, which is called forward propagation. Then compute the error and the partial derivatives with respect to all these weights. The partial derivatives are collectively called the gradient. Each one of these partial derivatives measures how the loss function would change if we were to change a single variable. Once we have the derivatives, we update the weights, just like we did in the previous video.
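To make these steps concrete, here is a rough sketch with a tiny network that has just two weights. The architecture (one sigmoid hidden node feeding a linear output), the training sample, and the learning rate are all assumptions made for this example, not the network shown in the video.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=2)   # random initialization
x, target = 1.5, 0.8          # hypothetical training sample
learning_rate = 0.1           # assumed step size

def forward(w1, w2):
    # Forward propagation: input -> hidden node -> output -> error
    z = sigmoid(w1 * x)
    y = w2 * z
    error = (y - target) ** 2
    return z, y, error

def gradients(w1, w2):
    # Backpropagation: apply the chain rule from the output backwards
    z, y, _ = forward(w1, w2)
    dE_dy = 2 * (y - target)
    dE_dw2 = dE_dy * z                  # dy/dw2 = z
    dE_dz = dE_dy * w2                  # dy/dz = w2
    dE_dw1 = dE_dz * z * (1 - z) * x    # dz/dw1 = sigmoid'(w1 * x) * x
    return dE_dw1, dE_dw2

initial_error = forward(w1, w2)[2]
for _ in range(200):
    g1, g2 = gradients(w1, w2)
    w1 -= learning_rate * g1
    w2 -= learning_rate * g2
final_error = forward(w1, w2)[2]

print(initial_error, final_error)
```

Notice that `dE_dy` is computed once and reused for both weights. In a real framework you wouldn't write `gradients` by hand; it is spelled out here only to show where each factor comes from.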

There are several tricks to improve both the stability and efficiency of this process. For example, the derivatives can be computed more efficiently by storing previously computed derivatives in a table and reusing them to compute others. In dynamic programming, this technique is called memoization, not to be confused with memorization, which usually refers to overfitting, a topic we will cover later.

Good news! Unless you want to develop a deep learning framework yourself, you don't have to worry too much about how to compute these derivatives because many modern deep learning frameworks take care of it for us.

In the previous video, I briefly talked about activation functions and mentioned that sigmoid activation is not ideal for deep learning. Now, let's see why.

First, let's take a look at what the sigmoid function looks like. It looks like an S curve that saturates when its input is large in magnitude, either positive or negative.

Now let's see what happens to the gradient when we use sigmoid activations.

This is how we computed the derivative of the error with respect to the weight using the chain rule. Since these hidden units use sigmoid activations, their derivatives involve the derivative of the sigmoid function, which looks like this. As you can see, it quickly shrinks to very small values as its input moves away from zero.

As we backpropagate the error by chaining these expressions, the magnitude of the gradient rapidly diminishes. The more sigmoids we encounter along the backpropagation path, the smaller the gradient gets. In deeper models, the derivatives get so small towards the earlier layers that it becomes virtually impossible to update the early layer parameters.
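A quick numerical sketch shows how fast this happens. It assumes the best case for the sigmoid, where every unit sits at the peak of its derivative, and it ignores the weights (treating them as 1) to isolate the effect of the activation alone.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_derivative(a):
    s = sigmoid(a)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at input 0) and shrinks quickly away from zero.
peak = sigmoid_derivative(0.0)   # 0.25
far = sigmoid_derivative(5.0)    # roughly 0.0066

# Best case: one factor of 0.25 per sigmoid on the backpropagation path.
# Weights are assumed to be 1 here, which a real network would not guarantee.
for depth in (1, 5, 10, 20):
    print(depth, peak ** depth)
```

Even in this best case, ten sigmoids in a row scale the gradient by less than one millionth.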

Moreover, since the later layers depend on the earlier layers, they don't learn anything useful either, although they get larger updates. This problem is called the 'vanishing gradient problem.'

Other saturating activation functions, such as the hyperbolic tangent function, which is simply a scaled and shifted version of the sigmoid function, also suffer from this problem. The hyperbolic tangent usually performs better than the sigmoid, though, since its output is zero-centered.

So how do we fix this vanishing gradient problem, or at least alleviate it?

A simple solution is to use a rectified linear unit activation. A rectified linear unit (or ReLU for short) is a clipped version of the linear activation, meaning that the output of the function is the same as input if the input is positive and zero otherwise.

What does the derivative of this function look like? It's a simple step function: zero for negative inputs and one for positive inputs. So unlike the derivatives of saturating activation functions, it doesn't vanish as the input grows.
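As a small sketch, here is ReLU and its derivative in NumPy. The derivative at exactly zero is undefined; returning 0 there is a common convention that I'm assuming here.

```python
import numpy as np

def relu(a):
    # Same as the input when positive, zero otherwise
    return np.maximum(0.0, a)

def relu_derivative(a):
    # Step function: 1 for positive inputs, 0 otherwise
    # (0 at exactly zero is an assumed convention)
    return (np.asarray(a) > 0).astype(float)

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # arbitrary sample inputs
print(relu(a))
print(relu_derivative(a))
```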

ReLUs are easy to optimize, but they have a problem too: they sometimes die.

For example, if a neuron learns a large negative bias, the input of the ReLU function might always be negative. Therefore the neuron might never fire again.

Since the gradient flowing through the neuron will always be zero, its parameters will not get any updates either.

There are some variants of the rectified linear function that aim to solve this problem. For example, Leaky ReLUs use a small negative slope instead of zero on the left-hand side so that a neuron always gets an update and can eventually recover from death. Parametric ReLUs take this idea one step further and learn the value of this slope during training. Another alternative is an exponential linear unit, which is linear on the right-hand side and exponential on the left-hand side.
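Here is a rough sketch of two of these variants. The default values for `negative_slope` and `alpha` are commonly used ones that I'm assuming for illustration; the video doesn't specify them. A parametric ReLU would have the same form as `leaky_relu`, except that the slope would be a learned parameter rather than a fixed argument.

```python
import numpy as np

def leaky_relu(a, negative_slope=0.01):
    # Small negative slope instead of zero on the left-hand side
    return np.where(a > 0, a, negative_slope * a)

def elu(a, alpha=1.0):
    # Linear on the right-hand side, exponential on the left-hand side.
    # np.minimum keeps exp from overflowing on the branch np.where discards.
    return np.where(a > 0, a, alpha * (np.exp(np.minimum(a, 0.0)) - 1.0))

a = np.array([-2.0, 3.0])  # arbitrary sample inputs
print(leaky_relu(a))
print(elu(a))
```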

Dying neurons are usually not a huge problem, and many modern neural network architectures still use the plain ReLU function. So, if you are not sure what activation function to choose, I would say go for ReLU first.

Last week we talked a little bit about loss functions and used the mean squared error as our loss function in the examples.

Clearly, the mean squared error is not our only option when it comes to loss functions, perhaps not even the most popular one.

We can design our custom loss function depending on what we really want to minimize. So we need to express our goal in terms of a function that we can minimize through tuning the model parameters.

Let's take a look at what type of loss function we generally use for classification problems.

Let's say we want to recognize handwritten digits, and our input is an 8x8 binary matrix. We feed the values in this matrix to a neural network that consists of several layers of neurons.

What do you think the output of this neural network should be? Maybe a single neuron that gives us the value of the predicted digit?

We could train this network using the mean squared error between the predicted and the actual values of the digits, right? Well, not quite. I mean, we could, but that wouldn't be ideal.

Although the labels are digits, they aren't really numerical variables. Take this handwritten digit, for example. It looks like a six but could be a zero as well. If we are not sure if it's zero or six, should we call it a three? It's certainly not three.

The better way to do this is to have a separate output neuron for each class and train the model using a cross-entropy function.

For simplicity, we can think of the cross-entropy as a measure of how far the predicted probability distribution of the classes is from the actual one: the lower the cross-entropy, the closer the two distributions are. It's not truly a distance metric, though, because it's not symmetric. The cross-entropy between p and q is not the same as the cross-entropy between q and p.

So how do we get these probability distributions? For the ground truth labels, it's quite simple. We assign probability one to the true class label and zero to all others. This representation is also called one-hot encoding. These labels don't have to be binary; in some cases, we might have a softer probability distribution for each sample too.

To get the predicted class probabilities, we pass the output variables (which are called logits) through a softmax function, which squashes each output into the range [0, 1] and makes the outputs sum to 1.
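A minimal softmax sketch might look like this. The logits are made-up numbers, and subtracting the maximum is a standard numerical-stability trick that doesn't change the result.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max avoids overflow in exp without changing the output
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])  # hypothetical network outputs
probs = softmax(logits)
print(probs, probs.sum())
```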

Once we have the actual and the predicted probability distributions, we can easily compute the cross-entropy function and train the model to minimize it.
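As a sketch, the cross-entropy between a one-hot label and two hypothetical predictions can be computed like this. The probability values are invented for the example, and the small `eps` clip is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def cross_entropy(actual, predicted, eps=1e-12):
    # Clip to avoid log(0); eps is an assumed safety margin
    predicted = np.clip(predicted, eps, 1.0)
    return -np.sum(actual * np.log(predicted))

actual = np.array([0.0, 0.0, 1.0])      # one-hot label: the true class is the third one
good = np.array([0.05, 0.05, 0.90])     # hypothetical confident, correct prediction
bad = np.array([0.70, 0.20, 0.10])      # hypothetical confident, wrong prediction

loss_good = cross_entropy(actual, good)
loss_bad = cross_entropy(actual, bad)
print(loss_good, loss_bad)
```

With a one-hot label, only the predicted probability of the true class contributes, so the loss reduces to the negative logarithm of that probability.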

Since softmax outputs always sum up to 1, it creates a competition between its inputs. When a neuron for one class gets a large value, it pushes all other class probabilities down. This is a useful property when our class labels are mutually exclusive. But in some cases, each sample can have more than one label. For example, if we are detecting objects in pictures, one picture can contain both a laptop and a smartphone.

In such cases, we can pass the logits through separate sigmoid functions instead of a single softmax so that the probabilities do not necessarily sum up to 1.
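Here is a rough sketch of the multi-label case. The three classes, the logits, and the labels are hypothetical, and averaging the per-class losses is one common choice that I'm assuming here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def binary_cross_entropy(labels, probs, eps=1e-12):
    # Each class is treated as its own yes/no problem; eps avoids log(0)
    probs = np.clip(probs, eps, 1.0 - eps)
    per_class = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return np.mean(per_class)

# Hypothetical classes: laptop, smartphone, bicycle
logits = np.array([3.0, 2.5, -4.0])
labels = np.array([1.0, 1.0, 0.0])   # the picture contains a laptop and a smartphone

probs = sigmoid(logits)              # independent probabilities, one per class
loss = binary_cross_entropy(labels, probs)
print(probs, probs.sum(), loss)
```

Here the probabilities add up to more than 1, which is fine because the classes are no longer competing with each other.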

Then, given the actual and predicted probability distributions, we can compute the cross-entropy just like we did before. Earlier I mentioned that cross-entropy is not symmetric. Then you might wonder how we decide which of p and q should be the predicted probabilities and which should be the actual probabilities. Think of it this way. From a theoretical perspective, we can think of taking the logarithm of the predicted probabilities as reverting the exponential introduced by the sigmoid or the softmax functions. From a practical perspective, since the target variables are usually binary, taking their logarithm wouldn't work: the logarithm of 1 is zero, and the logarithm of 0 is undefined.

One last thing. Theoretically, a feedforward network with a single hidden layer with a nonlinear activation function can approximate any continuous function to within any desired error, given enough hidden units. This is called the universal approximation theorem. Then why do we even need deeper models? First, the number of hidden units we need to represent a complex function might be infeasibly large. More importantly, being able to represent a function with a model doesn't mean that our model can easily learn to represent that function. Deep architectures allow for learning a hierarchy of features, resulting in fewer units and better generalization in many cases.

So, what do we mean by generalization? That's what we are going to talk about next. In the next video, we will cover some of the key concepts in machine learning, such as the model capacity, overfitting, and underfitting. We will also go through a basic recipe for machine learning that can be applied to many kinds of AI problems. That's all for today. As always, thanks for watching, stay tuned, and see you next time.