How to Design a Convolutional Neural Network

February 20, 2018

Transcript of the video

One of the questions that I get frequently is 'how do you design a neural network' or more specifically 'how do you know how many layers you need to have' or 'how do you know what's the right value for a particular hyperparameter'.

First things first, if you are not familiar with convolutional neural networks, you can find the link to my introductory video in the description below.

When people say the best thing about deep learning is that it requires no hand-designed feature extractors, everything is learned from data, and there's almost no human intervention, that's not entirely true. Indeed, the features are learned from data, and that's great. A hierarchy of learned features can lead to great representational power. But there's still a lot of human intervention in the model design, although there have been some efforts to automate the model selection process. I think eventually we will get to a point where no human intervention is required, but it seems like humans will still be in the loop for a while.

You might ask why not just do a grid search on all hyperparameters and automatically pick the best configuration? Wouldn't that be more systematic? Well, a complete grid search is usually not a feasible option since there are too many hyperparameters. Furthermore, model selection in deep models is not just about choosing the number of layers and hidden units and a few other hyperparameters. Designing the architecture of a model also involves choosing the types of layers and the way they are arranged and connected to each other. So there are infinitely many ways one can design a network.

Designing a good model usually involves a lot of trial and error. It is still more of an art than science, and people have their own ways of designing models. So the tricks and design patterns that I will be presenting in this video will be mostly based on 'folk wisdom', my personal experience with designing models, and ideas that come from successful model architectures.

Back to our question "how do you design a neural network?" The short answer is: you don't. The easiest thing you can do is pick something that has been proven to work for a similar problem and train it for your task. You don't even have to train it from scratch. You can take a model that has already been trained on some large dataset and fine-tune the weights to adapt it to your problem. This is called transfer learning and we will come back to that later.

This approach works in many practical cases, but it is not applicable in all cases, especially if you are working on a novel problem or doing bleeding-edge research. Even if you are working on a novel problem or the existing models don't meet your needs, that doesn't mean that you need to reinvent the wheel. You can always borrow ideas from successful models to design your own model. We will discuss some of these ideas in this video. Let's go through frequently asked questions about designing a convolutional neural network.

First question: how do you choose the number of layers and number of units per layer? My experience is that beginning with a very small model and gradually increasing the model size usually works well. And by increasing the model size, I mean adding layers and increasing the number of units per layer.

You could also go the other way around and start with a big model and keep shrinking it. The problem with that is it's hard to decide how big you should start. If you want to start small, you always have a point zero, which is linear regression. That doesn't mean that you should always try linear regression first, especially when it's obvious that there is no linear mapping between the inputs and the outputs. But overall, it usually has more benefits to start smaller and increase the model capacity until the validation error stops improving. Earlier, I made a separate video about how to choose model capacity. You can find it in the Deep Learning Crash Course playlist to learn more about it.
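The start-small strategy can be written down as a simple loop. This is only a minimal sketch, not a definitive recipe: `train_and_validate` is a hypothetical callback that trains a model of a given size and returns its validation error, and the list of sizes, the `patience` parameter, and the error values are all made up for illustration.

```python
def grow_until_plateau(sizes, train_and_validate, patience=1):
    """Try model sizes from small to large; stop once validation error stops improving.

    sizes: model sizes ordered from smallest to largest (hypothetical).
    train_and_validate: hypothetical callback, size -> validation error.
    patience: how many non-improving sizes to tolerate before stopping.
    """
    best_size, best_error = None, float("inf")
    misses = 0
    for size in sizes:
        error = train_and_validate(size)
        if error < best_error:
            best_size, best_error = size, error
            misses = 0
        else:
            misses += 1
            if misses >= patience:
                break
    return best_size, best_error


# Made-up validation errors, only to show the control flow.
fake_errors = {1: 0.40, 2: 0.25, 4: 0.18, 8: 0.19, 16: 0.21}
best = grow_until_plateau([1, 2, 4, 8, 16], fake_errors.get)
print(best)
```

With these made-up numbers the loop stops at size 8, because the validation error no longer improves, and keeps size 4.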

You might wonder whether, given the same number of trainable parameters, it's better to have more layers or more units per layer. It's usually better to go deeper than wider, so I would opt for a deeper model. However, a very tall and skinny network can be hard to optimize.

One way to make training deep models easier is to add skip connections that connect non-consecutive layers. A well-known model architecture, called ResNet, uses blocks with this type of shortcut connections. Using such connections gives the following layers a reference point so that adding more layers won't worsen the performance. The skip connections also create an additional path for the gradient to flow back more easily. This makes it easier to optimize the earlier layers.
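The reference-point idea can be illustrated with a toy example. This is a minimal sketch in plain Python, assuming an identity shortcut that adds the block's input to its output; the names `residual_block` and `zero_layer` are made up for illustration and are not taken from any ResNet implementation.

```python
def residual_block(x, layer):
    """Return x + layer(x): the shortcut adds the input back to the layer's output."""
    fx = layer(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_layer(x):
    # Stand-in for a layer that has learned nothing useful and outputs zeros.
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.0]
y = residual_block(x, zero_layer)
print(y)  # the block falls back to the identity mapping
```

Since the output is x plus the layer's output, a layer that contributes nothing leaves the input untouched, and the derivative of the output with respect to x always contains an identity term, which is the extra path for the gradient.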

Using skip connections is a common pattern in neural network design. Different models may use skip connections for different purposes. For example, fully convolutional networks use skip connections to combine the information from deep and shallow layers to produce pixel-wise segmentation maps.

A paper that I published last year proposed using both types of skip connections to segment remotely sensed multispectral imagery. The skip connections on the left help recover fine spatial information discarded by the coarse layers while preserving coarse structures. The skip connections on the right provide access to previous layer activations at each layer, making it possible to reuse features from previous layers.

Let's move on to the second question: how do you decide on the size of the kernels in the convolutional layers? Short answer: 3x3 and 1x1 kernels usually work the best. They might sound too small, but you can stack 3x3 kernels on top of each other to achieve a larger receptive field as I mentioned in the previous video. How about 1x1 kernels? Isn't a 1x1 filter just a scalar? First, a 1x1 filter isn't really a 1x1 filter. The size of a kernel usually refers to its spatial dimensions. So a 1x1 filter is, in fact, a 1x1xN filter where N is the number of input channels. You can think of them as channel-wise dense layers that learn cross-channel features.
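To make the "channel-wise dense layer" view concrete, here is a minimal sketch in plain Python. It assumes a height x width x channels layout stored as nested lists and leaves out the bias term; the function name `conv1x1` and the toy numbers are made up for illustration.

```python
def conv1x1(volume, weights):
    """Apply a 1x1 convolution.

    volume: H x W x C_in nested lists.
    weights: C_out x C_in nested lists (one row per filter, no bias).
    Each output value is a dot product over the channels of a single pixel.
    """
    return [[[sum(w * v for w, v in zip(filt, pixel)) for filt in weights]
             for pixel in row]
            for row in volume]


volume = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
          [[2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]]   # 2 x 2 x 3
weights = [[1.0, 0.0, 0.0],                     # filter 0 copies channel 0
           [1.0, 1.0, 1.0]]                     # filter 1 sums all channels
out = conv1x1(volume, weights)                  # 2 x 2 x 2
print(out)
```

Every pixel is processed independently with the same weights, which is exactly what a dense layer applied across channels would do.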

Obviously, 1x1 filters don't learn spatial features and stacking 1x1 filters alone wouldn't increase the receptive field, but combined with 3x3 filters they can help build very efficient models. This pattern is at the heart of many convolutional neural network architectures, including Network in Network, Inception family models, and MobileNets.
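The receptive-field claims can be checked with a few lines of arithmetic. This is a minimal sketch using the standard recurrence for stacked convolutions without dilation; the function name `receptive_field` is made up for illustration.

```python
def receptive_field(layers):
    """Receptive field of a stack of convolutions along one spatial axis.

    layers: list of (kernel_size, stride) pairs, from input to output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= stride              # strides spread later layers further apart
    return rf


print(receptive_field([(3, 1), (3, 1)]))          # two stacked 3x3 layers
print(receptive_field([(3, 1)] * 3))              # three stacked 3x3 layers
print(receptive_field([(1, 1)] * 5))              # 1x1 layers alone
print(receptive_field([(3, 1), (1, 1), (3, 1)]))  # 3x3 and 1x1 mixed
```

Two stacked 3x3 layers see a 5x5 window and three see 7x7, while any number of 1x1 layers still sees a single pixel.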

One advantage of 1x1 convolutions is that they can be used for dimensionality reduction. For example, if the input volume is 32x32x256 and we use 64 1x1 filters, then the output volume would be 32x32x64. Doing so reduces the number of channels before it's fed into the next layer. Let's say the output is fed into a 3x3 convolutional layer with 128 filters, and let's count the number of operations that we need to compute these convolutions. To compute the output of the 1x1 layer we need to compute a value for each one of the 32x32x64 output positions, and each value takes 1x1x256 operations, which is the size of the filter. We do the same thing to compute the activations of the following 3x3 convolutional layer, and the two layers together sum up to roughly 92 million operations. Now, if we remove the 1x1 layer and compute the number of operations, we end up with over 300 million operations. It may sound a little counterintuitive at first, but adding 1x1 convolutions to a network can greatly improve the computational efficiency.
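The operation counts in this example can be reproduced with a short calculation. This is a minimal sketch that counts one operation per multiplication, ignores biases, and assumes a stride of 1 with same padding so that the spatial size stays 32x32; the function name `conv_multiplications` is made up for illustration.

```python
def conv_multiplications(height, width, c_in, c_out, kernel):
    """Multiplications for one convolutional layer with a height x width output."""
    return height * width * c_out * kernel * kernel * c_in


# With the 1x1 layer: 256 -> 64 channels, then 3x3 with 128 filters.
pointwise = conv_multiplications(32, 32, c_in=256, c_out=64, kernel=1)
spatial = conv_multiplications(32, 32, c_in=64, c_out=128, kernel=3)
with_1x1 = pointwise + spatial

# Without the 1x1 layer: 3x3 with 128 filters directly on 256 channels.
without_1x1 = conv_multiplications(32, 32, c_in=256, c_out=128, kernel=3)

print(with_1x1)     # 92274688
print(without_1x1)  # 301989888
```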

Another use of pointwise convolutions is to implement a depthwise separable convolution, which reduces the number of parameters. The idea is simple: perform a spatial convolution on each channel in the input volume separately, then use a pointwise convolution to learn cross-channel features.

Let's take the previous example with the traditional convolutional layer. We had 128 units, each had 3x3x256 parameters, where 256 is the number of channels in the input volume. So, in total, this layer had roughly 300,000 parameters.

Alternatively, we could use 256 filters each only applied to one channel, separately. So, the units in the first layer would have 3x3x1 parameters instead of 3x3x256, since each unit acts on only a single channel. Then, we can use a pointwise layer to learn cross-channel features and get the same output volume. This would lead to about 35,000 trainable parameters, spatial and pointwise layers combined.
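The two parameter counts can be worked out as follows. This is a minimal sketch that counts weights only and ignores biases; the variable names are made up for illustration.

```python
c_in, c_out, k = 256, 128, 3

# Traditional convolution: 128 filters, each 3x3x256.
standard = c_out * k * k * c_in

# Depthwise separable: 256 filters of size 3x3x1, then 128 filters of size 1x1x256.
depthwise = c_in * k * k
pointwise = c_out * c_in
separable = depthwise + pointwise

print(standard)   # 294912
print(separable)  # 35072
```

In this example the separable version needs roughly 8 times fewer parameters for the same output volume.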

This is the main idea behind the recently popularized MobileNet architecture. By stacking depthwise separable convolutional blocks MobileNet manages to be very small and efficient without sacrificing too much accuracy.

Separable convolution is not a new concept. For example, in image processing, it's a common practice to separate a 2-dimensional filter into 1-dimensional row and column filters and apply them separately to reduce the computational cost. So we can take the depthwise separable convolution idea one step further and stack 1x3, 3x1, and 1x1 filters on top of each other to learn row-wise, column-wise, and depthwise separable filters. Actually, I tried this several years ago, but it turns out that the savings from the spatially separable filters are not worth the accuracy that is sacrificed since the filters are already small spatially. So it seems like depthwise-only filter separation is a good compromise.
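A small calculation shows why spatial separation buys little for small kernels. This is a minimal sketch: it uses the Sobel kernel as a well-known example of a 2-dimensional filter that factors into a column filter and a row filter, and it compares the per-pixel cost of a k x k filter with that of a k x 1 followed by a 1 x k filter. The function names are made up for illustration.

```python
def outer(column, row):
    """Build the 2-D filter that a column filter followed by a row filter applies."""
    return [[c * r for r in row] for c in column]


def separation_saving(k):
    """Fraction of multiplications saved by replacing k x k with k x 1 plus 1 x k."""
    return 1 - (2 * k) / (k * k)


sobel_x = outer([1, 2, 1], [-1, 0, 1])
print(sobel_x)

print(separation_saving(3))   # small kernel: about one third saved
print(separation_saving(11))  # large kernel: most of the cost saved
```

For a 3x3 filter the separated version still needs 6 of the original 9 multiplications, so the saving is modest compared with what large filters gain.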

Next question: how to choose the sliding window step size, also known as the stride? Choose 1 if you want to preserve the spatial resolution of the activations, choose 2 if you want to downsample and don't want to use pooling. If you want to upsample the activations use a fractional stride such as 1/2, which is similar to a stride of 2 but has its input and output reversed.

A convolution with a fractional stride is sometimes called a transposed convolution or a deconvolution, although using the term 'deconvolution' is a little misleading from a mathematical perspective.
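The effect of the stride on the output size can be checked with the usual size formulas. This is a minimal sketch along one spatial axis, assuming no dilation and explicit symmetric padding; the function names are made up for illustration.

```python
def conv_output_size(n, kernel, stride, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (n + 2 * padding - kernel) // stride + 1


def transposed_conv_output_size(n, kernel, stride, padding=0):
    """Output length of a transposed convolution along one spatial axis."""
    return (n - 1) * stride - 2 * padding + kernel


print(conv_output_size(32, kernel=3, stride=1, padding=1))  # stride 1 keeps the resolution
print(conv_output_size(32, kernel=2, stride=2))             # stride 2 halves it
print(transposed_conv_output_size(16, kernel=2, stride=2))  # the reverse mapping doubles it
```

The transposed convolution maps 16 back to 32, which is the sense in which it is a stride-2 convolution with its input and output reversed.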

How about pooling parameters? Max pooling with same padding and a pooling size of 2x2 usually works fine. If you want your model to handle variable-sized inputs and your output is fixed-size, you might consider pooling to a fixed size or using global average pooling. For example, if your inputs are images having different dimensions and your output is a single class label, then you can take the mean of the activations over the spatial dimensions before the fully-connected layers.
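Global average pooling is easy to illustrate in plain Python. This is a minimal sketch, assuming a height x width x channels layout stored as nested lists; the function name `global_average_pool` and the toy inputs are made up for illustration.

```python
def global_average_pool(volume):
    """Average each channel over all spatial positions.

    volume: H x W x C nested lists. Returns a list of C channel means,
    so the output length does not depend on H or W.
    """
    pixels = [pixel for row in volume for pixel in row]
    channels = len(pixels[0])
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(channels)]


small = [[[1.0, 10.0], [3.0, 30.0]]]                  # 1 x 2 x 2
large = [[[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]],
         [[4.0, 0.0], [5.0, 0.0], [6.0, 0.0]]]        # 2 x 3 x 2

print(global_average_pool(small))
print(global_average_pool(large))
```

Both inputs have different spatial sizes, yet each one is reduced to a vector with one value per channel, which is what lets fixed-size fully-connected layers follow.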

How to choose the type of activation functions? Short answer: choose ReLU except for the output layer. Long answer: check out my earlier video on artificial neural networks. It's actually a short video. So, I should have said not so long answer.

What type of regularization techniques should I use? Short answer: use L2 weight decay and dropout between the fully connected layers if there are any. Not so long answer: check out my earlier video on regularization.

What should be the batch size? A batch size of 32 usually works fine for image recognition tasks. If the gradient is too noisy you might try a bigger batch size. If you feel like the optimization gets stuck in local minima or if you run out of memory, then a smaller batch size would work better.

These are the hyperparameters and design patterns that I can think of right now. The next video will be about transfer learning. Feel free to ask questions in the comments section and subscribe to my channel for more videos if you like. As always, thanks for watching, stay tuned, and see you next time.

Further reading: