Transcript of the video

Hello, and welcome! Let's talk about Convolutional Neural Networks, which are specialized kinds of neural networks that have been very successful particularly at computer vision tasks, such as recognizing objects, scenes, and faces, among many other applications.

First, let's take a step back and talk about what convolution is. Convolution is a mathematical operation that combines two signals and is usually denoted with an asterisk. Let's say we have a time series signal a, and we want to convolve it with an array of 3 elements. What we do is simple: we multiply the arrays elementwise, sum the products, and shift the second array by one position. Then we repeat the same steps for the remaining elements, moving the second array over the first one like a sliding window.

Technically, what we do here is cross-correlation rather than convolution. Mathematically speaking, the second signal needs to be flipped in order for this operation to be considered a convolution. But in the context of neural networks, the terms convolution and cross-correlation are used pretty much interchangeably. That's a little off-topic, but you might ask, 'why would anyone want to flip one of the inputs?' One reason is that doing so makes the convolution operation commutative. When you flip the second signal, a * b becomes equal to b * a. This property isn't really useful in neural networks, so there is no need to flip any of the inputs.
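To make the difference concrete, here is a minimal NumPy sketch. The signal and kernel values are made up for illustration and are not the ones shown in the video. It checks that convolution is commutative, that cross-correlation is not, and that flipping the kernel turns one into the other.

```python
import numpy as np

# Made-up values for illustration; the kernel is deliberately asymmetric.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 0.0, -2.0])

# "full" mode keeps every overlap, so both argument orders give equal lengths.
conv_ab = np.convolve(a, b, mode="full")
conv_ba = np.convolve(b, a, mode="full")

corr_ab = np.correlate(a, b, mode="full")
corr_ba = np.correlate(b, a, mode="full")

# Cross-correlating with the flipped kernel reproduces true convolution.
corr_with_flipped_kernel = np.correlate(a, b[::-1], mode="full")

print("a * b == b * a (convolution):     ", np.array_equal(conv_ab, conv_ba))
print("a x b == b x a (cross-correlation):", np.array_equal(corr_ab, corr_ba))
print("flip + correlate == convolve:     ", np.array_equal(corr_with_flipped_kernel, conv_ab))
```

For a symmetric kernel, such as an averaging filter, the flip changes nothing, which is one reason the distinction rarely matters in practice.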

In digital signal processing, this operation is also called filtering a signal a with a kernel b, which is also called a filter. As you may have noticed, this particular kernel computes the local averages by averaging the values within a window. If we plot this signal and the result when it's convolved with this averaging filter, we can see that the result is basically a smoothed version of the input.
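The sliding-window procedure with an averaging kernel can be sketched in a few lines of NumPy. The function name `sliding_window_filter` and the signal values are assumptions made for this example, not taken from the video.

```python
import numpy as np

def sliding_window_filter(signal, kernel):
    """Multiply elementwise, sum the products, shift by one, and repeat."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    k = kernel.size
    out = np.empty(signal.size - k + 1)
    for i in range(out.size):
        out[i] = np.sum(signal[i:i + k] * kernel)
    return out

# Made-up signal; the kernel averages each window of three samples.
a = np.array([3.0, 6.0, 9.0, 6.0, 3.0, 0.0, 3.0])
b = np.ones(3) / 3.0

smoothed = sliding_window_filter(a, b)
print(smoothed)
```

The output is two samples shorter than the input because the window only visits positions where it fits entirely inside the signal.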

We can easily extend this operation to two dimensions. Let's convolve this 8x8 image with this 3x3 filter, for example. Just like the previous example, we overlay the kernel on the image, multiply the elements, sum the products, and move to the next tile. This specific kernel is actually an edge detector that detects the edges in one direction. It has a weak response over the smooth areas in an image, and a strong response to the edges. If we apply the same kernel to a larger grayscale image like this one, the output image will look like this where the vertical edges are highlighted. If we transpose the kernel, then it detects the horizontal edges.
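Here is a rough two-dimensional version of the same idea. The exact kernel and image from the video are not reproduced in the transcript, so this sketch assumes a simple Prewitt-style kernel and a synthetic 8x8 image with a single vertical edge; `filter2d` is a hypothetical helper name.

```python
import numpy as np

def filter2d(image, kernel):
    """Overlay the kernel, multiply, sum, and move to the next position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Synthetic 8x8 image: dark left half, bright right half (one vertical edge).
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# Assumed kernel: responds to left-to-right intensity changes.
vertical_edge_kernel = np.array([[-1.0, 0.0, 1.0],
                                 [-1.0, 0.0, 1.0],
                                 [-1.0, 0.0, 1.0]])

vertical_response = filter2d(image, vertical_edge_kernel)
horizontal_response = filter2d(image, vertical_edge_kernel.T)

print(vertical_response[0])       # strong only around the edge
print(horizontal_response.max())  # no horizontal edges in this image
```

The response is zero over the flat regions and non-zero only where the window straddles the edge, while the transposed kernel finds nothing because this image has no horizontal edges.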

The filter in the previous example smoothed its input, whereas in this example, the filter does the opposite and makes the local changes, such as the edges, more pronounced. The idea is that kernels can be used to extract certain features from input signals.

The input signal doesn't have to be a grayscale image. It can be an RGB color image, for example, and we can learn 3-dimensional filters to extract features from these inputs. The inputs don't even have to be images. They can be any type of data that has a grid-like structure, such as audio signals, video, and even electroencephalogram signals. Both the inputs and the filters can be n-dimensional.

There's a lot that can be said about convolutions and filter design. But since the focus of this video is not digital signal processing, I think this is enough background to understand what happens inside a convolutional neural network.

In the earlier examples, we convolved the input signals with kernels having hardcoded parameters. What if we could learn these parameters from data and let the model discover what kind of feature extractors would be useful to accomplish a task? Let's talk about that now.

Let's say we have an 8x8 input image. In a traditional neural network, each one of the hidden units would be connected to all pixels in the input. Now imagine if this was a 300x300 RGB image. Then we would have 270,000 weights for just a single neuron. Now, that's a lot of connections. If we built a model that had many fully connected units at every layer like this, the model would be big, slow, and prone to overfitting.

One thing we can do here is to connect each neuron to only a local region of the input volume. Next, we can make an assumption that if one feature is useful in one part of the input, it's likely that it would be useful in the other parts too. Therefore, we can share the same weights across the input.

Looks familiar? Yes, what this unit does here is basically convolution.

A layer that consists of convolutional units like these is called a convolutional layer. Convolutional networks, also called ConvNets and CNNs, are simply neural networks that use convolutional layers rather than using only fully connected layers.

The parameters learned by each unit in a convolutional layer can be thought of as a filter. The outputs of these units are simply the filtered versions of their inputs. Passing these outputs through an activation function, such as a ReLU, gives us the activations at these units, each one of which responds to one kind of feature.

Compared to traditional fully connected layers, convolutional layers have fewer parameters, because each unit is connected only to a local region and the same parameters are reused at every position. This makes the model more efficient, both statistically and in computational terms.
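A quick back-of-the-envelope calculation shows the size of the gap. The 300x300 RGB input comes from the earlier example; the 3x3 kernel size and the count of 64 units are assumptions chosen only for illustration, and biases are ignored.

```python
# Input size from the earlier example: a 300x300 RGB image.
height, width, channels = 300, 300, 3

# A fully connected unit needs one weight per input value.
weights_per_dense_unit = height * width * channels

# A convolutional unit with an assumed 3x3 kernel spanning all channels.
kernel_h, kernel_w = 3, 3
weights_per_conv_filter = kernel_h * kernel_w * channels

# Hypothetical layer width, used for both layer types.
num_units = 64
dense_layer_weights = weights_per_dense_unit * num_units
conv_layer_weights = weights_per_conv_filter * num_units

print(weights_per_dense_unit, weights_per_conv_filter)
print(dense_layer_weights, conv_layer_weights)
```

Under these assumptions the fully connected layer needs 10,000 times as many weights as the convolutional layer.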

Although convolutional layers are visualized as running sliding windows over the inputs and multiplying the elements, they aren't usually implemented that way. As compared to for loops, matrix multiplications are faster and scale better. So instead of sliding a window using for loops, many libraries implement convolution as a matrix multiplication.

Let's assume that we have an RGB image as input and have four 3x3x3 kernels. We can reshape these kernels into 1x27 arrays each. Together, they would make a 4x27 matrix, where each row represents a single kernel. Similarly, we can divide the input into image blocks that are the same size as the kernels and rearrange these blocks into columns. This would produce a 27xN matrix, where N is the number of blocks. By multiplying the matrices, we can compute all these convolutions at once. The result is a 4xN matrix, where each row gives us the output of one filter when reshaped back to the spatial dimensions of the output.
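The reshaping trick described above is often called im2col. Below is a minimal NumPy sketch of it, assuming an 8x8 RGB input with random values, no padding, and a stride of one. It is meant to illustrate the idea, not to reflect how any particular library implements it.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))       # height x width x channels (assumed size)
kernels = rng.standard_normal((4, 3, 3, 3))  # four 3x3x3 kernels

kh, kw = 3, 3
out_h = image.shape[0] - kh + 1
out_w = image.shape[1] - kw + 1

# Each kernel becomes one row: 4 x 27.
kernel_matrix = kernels.reshape(4, -1)

# Each 3x3x3 image block becomes one column: 27 x N.
columns = np.stack(
    [image[i:i + kh, j:j + kw, :].reshape(-1)
     for i in range(out_h) for j in range(out_w)],
    axis=1,
)

# One matrix multiplication computes every filter at every position: 4 x N.
outputs = kernel_matrix @ columns
feature_maps = outputs.reshape(4, out_h, out_w)

# Reference result computed with explicit sliding windows.
reference = np.empty((4, out_h, out_w))
for f in range(4):
    for i in range(out_h):
        for j in range(out_w):
            reference[f, i, j] = np.sum(image[i:i + kh, j:j + kw, :] * kernels[f])

print(kernel_matrix.shape, columns.shape, outputs.shape)
print(np.allclose(feature_maps, reference))
```

Here N is 36 because a 3x3 window fits into an 8x8 image at 6x6 positions. The trade-off is memory: overlapping blocks are copied into the column matrix, so it is larger than the original input.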

Another type of layer that is commonly used in convolutional neural nets is the pooling layer. A pooling layer downsamples its input by summarizing it locally. Max pooling, for example, subsamples its input by picking the maximum value within a neighborhood. Alternatively, average pooling takes the average.
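As a small illustration, here is a sketch of non-overlapping max and average pooling on a single 2D activation map. The helper name `pool2d`, the 2x2 window, and the input values are assumptions made for this example.

```python
import numpy as np

def pool2d(x, size=2, reduce=np.max):
    """Non-overlapping pooling: summarize each size x size neighborhood."""
    h_out, w_out = x.shape[0] // size, x.shape[1] // size
    # Drop trailing rows/columns that do not fill a complete window.
    trimmed = x[:h_out * size, :w_out * size]
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return reduce(blocks, axis=(1, 3))

# Made-up 4x4 activation map.
x = np.array([[1.0, 3.0, 2.0, 0.0],
              [4.0, 2.0, 1.0, 1.0],
              [0.0, 0.0, 5.0, 6.0],
              [1.0, 2.0, 7.0, 8.0]])

max_pooled = pool2d(x, size=2, reduce=np.max)
avg_pooled = pool2d(x, size=2, reduce=np.mean)

print(max_pooled)
print(avg_pooled)
```

Both variants turn the 4x4 input into a 2x2 output; they differ only in how each neighborhood is summarized.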

In many cases, we care about whether some features exist in the input regardless of their exact position. Pooling layers make this easier by making the outputs invariant to small translations in the input, because even if the input is off by a few pixels, the local maxima would still make it to the next layers. Another obvious advantage of pooling is that it reduces the size of the activations that are fed to the next layer, which reduces the memory footprint and improves the overall computational efficiency.

A typical convolutional neural network usually stacks convolutional and pooling layers on top of each other and sometimes uses traditional, fully connected layers at the end of the network.

An interesting property of convolutional neural networks is that they learn to extract features. Early convolutional layers, for example, learn primitive features such as oriented edges. After training a model, the filters in the first layer usually look like Gabor-like filters, edge detectors, and color-contrast sensitive filters.

As we move towards the output layer, the features become more complex, and neurons start to respond to more abstract, more specific concepts. We can observe neurons that respond to cat faces, human faces, printed text, and so on.

The dots you see in the activations of this convolutional layer can be a result of neurons that respond to cats, pets, or animals in general. One of them, for example, can be a neuron that activates only if there is a cat in the input picture. The following layers make use of this information to produce an output, such as a class label with some probability.

An interesting thing is, the concepts that are learned by the intermediate layers don't have to be a part of our target classes. For example, a scene classifier can learn a neuron that responds to printed text even if that's not one of the target scene types. The model can learn such units if they help detect books and classify a scene as a library.

This is somewhat similar to how visual information is processed in the primary visual cortex in the brain, which consists of many simple and complex cells. The simple cells respond primarily to oriented edges and bars of particular orientations, similar to early convolutional layers.

The complex cells receive inputs from simple cells and respond to similar features but have a higher degree of spatial invariance, somewhat like the convolutional layers after the pooling layers. As the signal moves deeper into the brain, it's postulated that it might reach specialized neurons that fire selectively to specific concepts such as faces and hands.

An advantage of using pooling layers in our network is that it increases the receptive field of the subsequent units, helping them see a bigger picture. The term receptive field comes from neuroscience and refers to a particular region that can affect the response of a neuron. Similarly, the receptive field of an artificial neuron refers to the spatial extent of its connectivity. For example, the convolutional unit in the earlier example had a receptive field of 3x3. Units in the deeper layers have a greater receptive field since they indirectly have access to a larger portion of the input. Let's have another example, and for simplicity, let's assume both the input and the filter are one-dimensional. This unit has access to three pixels at a time. If we add a pooling layer followed by another convolutional layer on top of that, a single unit at the end of the network gains access to all 8 pixels in the input.

Of course, pooling is not the only factor that increases the receptive field. The size of the kernel obviously has an impact. A larger kernel would mean that a neuron sees a larger portion of its input.

A larger receptive field can also be achieved by stacking convolutions. In fact, it is usually preferable to use smaller kernels stacked one on another as compared to using a larger kernel, since doing so usually reduces the number of parameters and increases non-linearity when a non-linear activation function is used at the output of each unit. For example, a stack of two 3x3 convolutions would have the same receptive field as a single 5x5 convolution, while having fewer mathematical operations and more non-linearities.
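The receptive field numbers mentioned above can be checked with a short calculation. This sketch uses the standard recurrence in which each layer grows the receptive field by (kernel size - 1) times the product of the strides of the layers before it. The pooling layer is assumed to have a window of 2 and a stride of 2, which the transcript does not state explicitly.

```python
def receptive_field(layers):
    """Receptive field of one output unit along a single spatial axis.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# 1D example: 3-wide convolution, pooling (assumed size 2, stride 2), 3-wide convolution.
conv_pool_conv = receptive_field([(3, 1), (2, 2), (3, 1)])

# Two stacked 3x3 convolutions versus a single 5x5 convolution.
two_3x3 = receptive_field([(3, 1), (3, 1)])
one_5x5 = receptive_field([(5, 1)])

# Weights per input/output channel pair, ignoring biases.
weights_two_3x3 = 2 * 3 * 3
weights_one_5x5 = 5 * 5

print(conv_pool_conv, two_3x3, one_5x5)
print(weights_two_3x3, weights_one_5x5)
```

Under these assumptions the conv, pool, conv stack sees 8 input pixels, and the two stacked 3x3 layers cover the same 5x5 region as the single larger kernel with 18 weights instead of 25.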

One thing to pay attention to when stacking convolutional layers is how the size of the input volume changes before and after a layer. Without any padding, each spatial dimension of the input shrinks by one pixel less than the kernel size along that dimension. For example, if we have an 8x8 input and a 3x3 kernel, the output of the convolution would be 6x6. Many frameworks call this type of convolution a 'valid' convolution or a convolution with valid padding. Valid convolution might cause some problems. Especially if we use larger kernels or stack many layers on top of each other, the amount of information that gets thrown out might be critical.

There is an easy hack that helps improve the performance by keeping information at the borders. It pads the input with zeros so that the spatial dimensions of the input are preserved after the convolutions. This type of zero padding is called 'SAME' padding by many frameworks. Zero padding is commonly used and works fine in practice, although it's not ideal from a digital signal processing perspective since it creates artificial discontinuities at the borders.
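The difference between the two padding modes can be sketched as follows. The helper `filter2d` and its `padding` argument are hypothetical names for this example, and the 'same' branch assumes odd kernel dimensions and a stride of one.

```python
import numpy as np

def filter2d(image, kernel, padding="valid"):
    """Sliding-window filtering with optional zero padding."""
    kh, kw = kernel.shape
    if padding == "same":
        # Assumes odd kernel sizes and stride 1: pad half the kernel on each side.
        ph, pw = kh // 2, kw // 2
        image = np.pad(image, ((ph, ph), (pw, pw)), mode="constant", constant_values=0)
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(64, dtype=float).reshape(8, 8)  # made-up 8x8 input
kernel = np.ones((3, 3)) / 9.0                    # averaging kernel

valid_out = filter2d(image, kernel, padding="valid")
same_out = filter2d(image, kernel, padding="same")

print(valid_out.shape)  # 8x8 input shrinks to 6x6
print(same_out.shape)   # zero padding keeps it at 8x8
```

Away from the borders the two outputs agree; they differ only in the outer ring, where the padded zeros enter the window and pull the averages down.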

Another hyperparameter that has an impact on the receptive field is the stride of the sliding window. So far, we used a stride of one in the examples. This is usually the default behavior of a convolutional layer. If we set it to two, for example, the sliding window moves by two pixels instead of one, which shrinks the output and gives the units in the following layers a larger receptive field. Using a stride larger than one has a downsampling effect that is similar to pooling layers, and some models use it as an alternative to pooling.
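The downsampling effect of the stride can be seen in a small one-dimensional sketch. The helper names `conv_output_size` and `strided_filter1d` are assumptions for this example, and the output-size formula is the commonly used floor((n + 2p - k) / s) + 1.

```python
import numpy as np

def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Number of window positions along one axis."""
    return (n + 2 * padding - kernel_size) // stride + 1

def strided_filter1d(signal, kernel, stride=1):
    """Sliding-window filtering that moves `stride` samples at a time."""
    k = len(kernel)
    starts = range(0, len(signal) - k + 1, stride)
    return np.array([np.dot(signal[s:s + k], kernel) for s in starts])

signal = np.arange(8, dtype=float)  # made-up 8-sample input
kernel = np.ones(3)

stride_1 = strided_filter1d(signal, kernel, stride=1)
stride_2 = strided_filter1d(signal, kernel, stride=2)

print(stride_1)  # 6 outputs
print(stride_2)  # 3 outputs: every other window is skipped
print(conv_output_size(8, 3, stride=1), conv_output_size(8, 3, stride=2))
```

With a stride of two, the result is simply every other element of the stride-one output, which is why strided convolutions can stand in for a separate downsampling step.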

One thing that is sometimes confused with stride is the dilation rate. A dilated convolution, also known as atrous convolution or à trous convolution, uses filters with holes. Just like pooling and strided convolutions, dilated convolutions also learn multi-scale features. But instead of downsampling the activations, dilated convolutions expand the filters without increasing the number of parameters. This type of convolution can be useful if a task requires the spatial resolution to be preserved. For example, if we are doing pixel-wise image segmentation, pooling layers may lead to a loss in detail. Using dilated convolutions preserves spatial resolution while increasing the receptive field. However, this approach demands more memory and comes at a computational cost since the activations need to be kept in memory at full resolution.
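To show what "filters with holes" means, here is a one-dimensional sketch in which the same three weights are spread over a wider window. The helper name `dilated_filter1d` and the input values are assumptions made for this example.

```python
import numpy as np

def dilated_filter1d(signal, kernel, dilation=1):
    """Sliding-window filtering where kernel taps are `dilation` samples apart."""
    k = len(kernel)
    span = dilation * (k - 1) + 1  # effective width of the dilated kernel
    n_out = len(signal) - span + 1
    return np.array([
        np.dot(signal[i:i + span:dilation], kernel) for i in range(n_out)
    ])

signal = np.arange(10, dtype=float)   # made-up ramp signal
kernel = np.array([1.0, 0.0, -1.0])   # three weights in both cases

dense = dilated_filter1d(signal, kernel, dilation=1)    # taps at i, i+1, i+2
dilated = dilated_filter1d(signal, kernel, dilation=2)  # taps at i, i+2, i+4

print(dense)
print(dilated)
```

With a dilation rate of two, the three weights cover five input samples, so the unit sees a wider region without any extra parameters and without downsampling the input.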

In this video, we talked about the building blocks of convolutional neural networks. We also covered what some of the hyperparameters in convolutional networks are and what they do.

In the next video, we will talk about how to choose these hyperparameters and how to design our own convolutional neural network. We will also cover some of the architectures that have been widely successful at a variety of tasks and went mainstream.

Ok, that's all for today. It's already been a little longer than usual. As always, thanks for watching, stay tuned, and see you next time.


Further reading: