Transcript of the video

The first time I heard about neural networks, I was 11 or something. I saw an article in a popular tech magazine that said scanners now use neural networks to recognize characters. Naively, I thought they utilized actual biological neurons. When I told my mom about it, she said if they use anything biological, then you need to feed it. Does the scanner consume sugar or something? I mean, she was right. But we don't feed our scanners sugar, do we? Many years later, I figured out that what they used was nothing but a mathematical model.

So what's so neural about them? These models are called neural networks because they are loosely inspired by biological neural networks. Artificial neural networks consist of artificial neurons, each one of which resembles a biological neuron in the sense that it receives signals from other neurons, accumulates the signal where each input has a different weight, and fires if the signal is strong enough. Although artificial neural networks draw some inspiration from biological models, modern computer science research in neural networks focuses more on building useful models rather than understanding the brain and modeling it accurately. Understanding how the brain works is a very interesting field of research too, but we will not focus on that in this series.

To model a neural network, let's start with a single neuron. An artificial neuron takes the inputs x0 through xn, multiplies them with weights w0 through wn, and sums the products to produce the output y. We can express this operation as a simple matrix multiplication. Assume that the weights are stored in a row matrix w transpose, and the inputs are stored in a column matrix x. The output of the neuron is simply the multiplication of these two. Linear algebra recap: to multiply these matrices, we multiply x0 with w0, x1 with w1, and x2 with w2, and we sum these products.
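The weighted sum described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `neuron_output` and the example values are made up for this sketch and do not come from the video.

```python
def neuron_output(weights, inputs):
    """Weighted sum of the inputs: y = w^T x (no bias, linear activation)."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * x for w, x in zip(weights, inputs))

# Hypothetical values, chosen only for illustration.
w = [0.5, -1.0, 2.0]   # w0, w1, w2
x = [1.0, 2.0, 3.0]    # x0, x1, x2

y = neuron_output(w, x)  # 0.5*1 + (-1)*2 + 2*3 = 4.5
print(y)
```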

So why do we do this? What we are trying to accomplish here is to approximate a function: a linear function in this example. Given a set of (x, y) pairs, our goal is to find the weights w0, w1, and w2 that best fit the data we have. For example, we can represent the function y=2x using a single neuron with a single input with linear activation, meaning that the output is merely a product of the input and the weight. However, the functions that we want to approximate might not be as simple as y=2x. We can learn more complex functions by using a network of these neurons, where the outputs of a set of neurons are fed into another set of neurons as inputs.

Let's take a look at a common type of neural network: a multi-layer perceptron, which consists of layers of neurons. These types of neural networks are called feedforward networks because the data flows in one direction, from the input layer to the outputs. The first layer is the input layer, where each neuron is connected to an input variable. The last layer is the output layer, which has as many neurons as there are output variables. For example, if this is a regression problem where we try to predict the current value of a car, x0, x1, and x2 might be the year, mileage, and the price of the car when it was new, and y0 is the predicted current value. Or if this is a classification problem where we want to classify indoor and outdoor pictures, x0 through xn can be the pixel values, and y0 and y1 can be the indoor and outdoor neurons.
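To make the one-directional data flow concrete, here is a rough sketch of a forward pass through a tiny network with three inputs, two hidden neurons, and one output. The weights are arbitrary, untrained numbers picked for easy arithmetic, the function names are hypothetical, and the input layer is represented simply by the list of input values.

```python
def layer_forward(weight_rows, inputs):
    """One fully connected layer: each row holds the weights of one neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weight_rows]

def mlp_forward(layers, inputs):
    """Feed the signal through the layers in order, input side first."""
    signal = inputs
    for weight_rows in layers:
        signal = layer_forward(weight_rows, signal)
    return signal

# Hypothetical, untrained weights: 3 inputs -> 2 hidden neurons -> 1 output.
hidden = [[1.0, 0.0, 2.0],
          [0.0, 1.0, -1.0]]
output = [[3.0, 1.0]]

x = [1.0, 2.0, 3.0]                    # stand-ins for x0, x1, x2
y = mlp_forward([hidden, output], x)   # hidden gives [7.0, -1.0], output gives [20.0]
print(y)
```

Activation functions are left out here on purpose, so every neuron is still a plain weighted sum.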

The layers we have between the input and output layers are called hidden layers. These layers learn to produce outputs that are useful for the next layers. They are called hidden layers because we don't explicitly specify what happens at these layers. The learning algorithm decides how to use these layers to approximate a function. Each hidden layer tries to make the input more useful for the next layer. The number of these layers gives the depth of the model (that's where the term deep learning comes from), whereas the number of neurons per layer gives the width of the model. Increasing either the width or the depth increases model complexity, which allows for learning more complex patterns. Or does it? Each one of these neurons is basically taking a weighted sum of its inputs. Isn't a weighted sum of linear functions also a linear function?

Let's simplify this network by taking a slice from it and see what happens. The input gets multiplied by w0, w1, w2, and w3, and we get our output y0. We can rewrite this equation as y0 = x0 wc, where wc is the product of all these weights. Then we could represent this function using a single neuron. Basically, if we don't introduce some sort of nonlinearities between these neurons, the entire network collapses into a single linear function.
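The collapse is easy to check numerically. The sketch below assumes a chain of four single-weight linear neurons with made-up weight values and compares it against one neuron whose weight is the product of the four.

```python
import math

def chain_of_linear_neurons(x0, weights):
    """Pass x0 through single-weight linear neurons, one after another."""
    signal = x0
    for w in weights:
        signal = signal * w
    return signal

weights = [2.0, 0.5, 3.0, 4.0]   # hypothetical w0..w3
w_c = math.prod(weights)         # combined weight: 2 * 0.5 * 3 * 4 = 12

for x0 in [-1.0, 0.0, 2.5]:
    # The four-neuron chain and the single neuron always agree.
    print(x0, chain_of_linear_neurons(x0, weights), x0 * w_c)
```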

That's why we use non-linear activation functions at the outputs of neurons, meaning that we pass the output of a neuron through a non-linear function before we feed it to the next one. Doing so introduces nonlinearities in our network. This non-linear function can be the sigmoid function, which squashes its input into a range between 0 and 1. The sigmoid function is usually not ideal for deep models but we'll come back to that later. Let's rewrite the function that our model represents, now with the nonlinearities in between neurons... As you can see, it no longer reduces to a single-layer model. This enables our model to represent non-linear functions.
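Here is a small sketch of the same idea in code, assuming a chain of four single-weight neurons that each pass their weighted input through a sigmoid. The weights are made-up values. A linear function through the origin would give exactly twice the output for twice the input, and this chain does not.

```python
import math

def sigmoid(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def chain_with_sigmoid(x0, weights):
    """Each neuron multiplies by its weight, then applies the sigmoid."""
    signal = x0
    for w in weights:
        signal = sigmoid(signal * w)
    return signal

weights = [2.0, 0.5, 3.0, 4.0]   # hypothetical w0..w3

out_1 = chain_with_sigmoid(1.0, weights)
out_2 = chain_with_sigmoid(2.0, weights)

# Doubling the input does not double the output, so no single
# weight w_c can reproduce this chain.
print(out_1, out_2, 2 * out_1)
```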

Let's run a simulation on TensorFlow Playground to observe the impact of the non-linear activations. First, let's try to classify linearly separable data without using nonlinearities. The model learns a decision boundary without any trouble. How about classifying these data points that lie on a Swiss-roll-shaped manifold? It seems like the decision boundary is still linear despite having several hidden layers. Now let's try again, using a non-linear activation function this time. It's learning non-linear decision boundaries now, but what do we actually mean by learning?

Let's go back to the previous example. The weights, w0, w1, and w2, are the trainable parameters. These parameters are learned from training data. The values of these parameters are what we keep to deploy our model. Let's talk about how we train a model to learn these weights.

Training a neural network is essentially an optimization problem, where our goal is to minimize a loss function. The loss function tells our model how well it's doing on training data so that the weights can be updated towards decreasing the loss and increasing future performance as a consequence. What we mean by training is simply finding the weights that minimize our loss function. For example, if our goal is to predict a continuous-valued variable, we can use the mean squared error or the mean absolute error as our loss function. The mean squared error is simply the mean of the squared differences between the predicted and actual values of output variables. And the mean absolute error is the mean of absolute differences between these actual and predicted values. The loss function can be anything as long as it's differentiable, and we'll see why soon. We'll go back to loss functions later.
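The two losses mentioned above are short enough to write out. This is a plain-Python sketch with hypothetical function names and made-up values. Real frameworks ship their own implementations.

```python
def mean_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values, chosen only for illustration.
actual = [4.0, 2.0, 6.0, 0.0]
predicted = [1.0, 2.0, 8.0, 1.0]

# Differences are 3, 0, -2, -1.
mse = mean_squared_error(actual, predicted)    # (9 + 0 + 4 + 1) / 4 = 3.5
mae = mean_absolute_error(actual, predicted)   # (3 + 0 + 2 + 1) / 4 = 1.5
print(mse, mae)
```

Note how the squared error punishes the largest miss (3) much more heavily than the absolute error does.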

Let's have a very simple example to understand how we train neural networks first. Let's say we have these data points, and we want to learn a function that generates similar data points. Let's use a single neuron with a single weight and use mean squared error as our loss function. For simplicity, let's omit the bias term and the non-linear activation function.

First step: we initialize our weights to small random values. In this example, we have a single weight w, which is "randomly" initialized to 0.5. Then we evaluate the output given a data sample. The data is usually shuffled before training, so let's pick x = 2 as our first training sample. Plugging in x, we get a predicted y of 1, which is not so close to the actual value of y, which was 4. So, how do we fix this? How do we tell the model to update the weight in the right direction?

We picked the mean squared error as our loss function. Since we are evaluating the samples one by one, it's simply the squared difference between the actual and predicted values of y. We take the derivative of the error with respect to the weight. Then, we use the derivative to define an update rule, which tells us how to change the weight to make the predictions better.
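The equations shown on screen are not part of this transcript, so here is the standard derivation under the assumptions stated so far: the prediction is the weight times the input (written as y-hat below), and the per-sample loss is the squared difference. The video may use a different constant factor. Alpha is a positive step size.

```latex
\begin{aligned}
\hat{y} &= w x \\
L(w) &= (y - \hat{y})^2 = (y - w x)^2 \\
\frac{\partial L}{\partial w} &= -2x\,(y - w x) \\
w &\leftarrow w - \alpha \frac{\partial L}{\partial w} = w + 2\alpha x\,(y - w x)
\end{aligned}
```

For the first sample (w = 0.5, x = 2, y = 4), the derivative is -2 · 2 · (4 - 1) = -12. It is negative, so the update increases w, which moves the prediction toward 4.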

Here, alpha is the learning rate, which specifies the magnitude of the update at every iteration. It's a common practice to decay the learning rate gradually, which is actually analogous to how humans learn. Kids learn faster, but adults are less gullible since they are exposed to more training data. For simplicity, let's fix the learning rate at 0.1 in this example.

Now that we have our update rule, let's iterate over the data. We get a data sample, and we update the weight by evaluating the update rule. We do this until the loss converges to an acceptable point, which is the global minimum 0 in this example. This optimization algorithm is called Stochastic Gradient Descent. There are some tricks to improve this optimization process, but this is how it works in its plain form.
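The whole loop fits in a short sketch. It assumes the setup of this example: prediction w * x, per-sample squared error (whose derivative with respect to w is -2x(y - wx)), an initial weight of 0.5, and a learning rate of 0.1. The data points are hypothetical samples of y = 2x, and a fixed number of passes stands in for a proper convergence check.

```python
import random

def sgd_step(w, x, y, learning_rate):
    """One stochastic gradient descent update for the model y_hat = w * x."""
    y_hat = w * x
    grad = -2.0 * x * (y - y_hat)   # derivative of (y - w*x)**2 w.r.t. w
    return w - learning_rate * grad

def train(samples, w=0.5, learning_rate=0.1, epochs=20, seed=0):
    """Visit the samples one by one, in a freshly shuffled order each epoch."""
    rng = random.Random(seed)
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            w = sgd_step(w, x, y, learning_rate)
    return w

# Hypothetical data points lying on y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

first_update = sgd_step(0.5, 2.0, 4.0, 0.1)  # w moves from 0.5 to about 1.7
w_final = train(data)                         # approaches 2.0
print(first_update, w_final)
```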

In this example, we iterated over the samples one-by-one. This is called online learning. Training a model this way can sometimes lead to noisy weight updates and slow down the convergence. Alternatively, we could use all data points at once and average the loss over all data points at every iteration. That's called batch learning, and the iterative optimization algorithm that we used earlier is called gradient descent when we use the entire dataset for each update. However, in many modern applications, this is not a feasible approach since the dataset is usually too big to fit into memory. Even if the entire dataset fits into memory, it might still be preferable not to use the entire data at every step. In the previous example, our loss function was a nice and smooth convex function. Here's roughly what the overall loss looks like when plotted as a function of the weight. This would be an ideal case for the full-batch gradient descent. However, this is hardly the case for real-life applications. In practice, w can be much higher dimensional, and the loss surface is unlikely to be perfectly convex. Many applications adopt an approach between these two: pick a mini-batch that consists of a number of samples and average the loss over the samples in the mini-batch at every iteration. The number of samples in a mini-batch is called the batch size.
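The three variants differ only in how many samples go into each update, which the sketch below exposes as a `batch_size` parameter. It assumes the same single-weight model (prediction w * x, squared error) and made-up data on y = 2x. The function name and default values are illustrative, not taken from the video.

```python
import random

def minibatch_gradient_descent(samples, batch_size, w=0.5,
                               learning_rate=0.1, epochs=100, seed=0):
    """Average the gradient over each mini-batch before updating w."""
    rng = random.Random(seed)
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # Mean of the per-sample gradients -2x(y - w*x) over the batch.
            grad = sum(-2.0 * x * (y - w * x) for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w

# Hypothetical data points lying on y = 2x.
data = [(0.5, 1.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w_online = minibatch_gradient_descent(data, batch_size=1)          # online learning
w_mini = minibatch_gradient_descent(data, batch_size=2)            # mini-batches
w_full = minibatch_gradient_descent(data, batch_size=len(data))    # full batch
print(w_online, w_mini, w_full)
```

On a toy convex problem like this one, all three settings end up near w = 2. The differences show up in noise and memory use on larger, non-convex problems.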

Now we know what training a neural network means and how to train a single neuron with a single trainable parameter. But how do we train networks with many more layers and many more trainable parameters? It's not as complicated as one might think. In the next video, we will talk about how training deep neural networks differs from training shallow ones and intuitively explain how we train these models.

That's all for today. Thanks for watching, stay tuned and see you next time.

Artificial Neural Networks: demystified
