Recurrent Neural Networks

March 28, 2018

Transcript of the video

So far, we have discussed only feedforward neural networks in this series. Feedforward neural networks have no feedback loops. The data flows in one direction, from the input layer to the outputs.

Recurrent neural networks, or RNNs for short, use inputs from previous stages to help a model remember its past. This is usually shown as a feedback loop on hidden units. These types of models are particularly useful for processing sequential data.

Sequential data is any data that can be represented as a series of data points: audio, video, text, biomedical data such as EEG signals or DNA sequences, financial data such as stock prices, and so on.

Recurrent models feed the outputs of units back as inputs at the next time step. That's where the name 'recurrent' comes from. So, how do we implement this feedback loop? Under the hood, recurrent neural networks are actually feedforward neural networks with repeated units. We can unfold this recurrent graph into a full network as a chain of repeated units.
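
To make the unfolding concrete, here is a minimal NumPy sketch (my own illustration, not code from the video) of a vanilla RNN forward pass. The key point is that the same weight matrices are reused at every time step:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h, h0):
    """Unroll a vanilla RNN over a sequence.

    The same weights (W_xh, W_hh, b_h) are reused at every step,
    which is the "repeated units" view of an unfolded RNN.
    """
    h = h0
    hidden_states = []
    for x_t in x_seq:                               # one step per element of the sequence
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # new state depends on input and previous state
        hidden_states.append(h)
    return hidden_states

# Toy dimensions: 4-dimensional inputs, 8 hidden units, a sequence of 10 steps.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(8, 4))
W_hh = rng.normal(scale=0.1, size=(8, 8))
b_h = np.zeros(8)
x_seq = [rng.normal(size=4) for _ in range(10)]

states = rnn_forward(x_seq, W_xh, W_hh, b_h, h0=np.zeros(8))
print(len(states), states[-1].shape)  # 10 hidden states, each of shape (8,)
```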

Earlier, we talked about how convolutional neural networks share parameters spatially. Similarly, RNNs can be thought of as neural networks that share parameters in time.

RNNs can handle inputs and outputs of different types and lengths. For example, to translate from one language to another, a model would input a piece of text and output another piece of text, where the length of the inputs and outputs are not necessarily the same. A speech recognition model would consume audio data to produce text. A speech synthesis model would do the opposite.

The input and the output don't both have to be sequences, either. A model can input a sequence like a blog post and output a categorical variable, such as a variable that indicates whether the text carries positive, negative, or neutral sentiment.

Similarly, the output can be a sequence, whereas the input is not. A random text generator can input a random seed and output random sentences that are similar to the sentences in a corpus of text.

It's possible to have many different types of input and output configurations. An RNN can be one-to-one, one-to-many, many-to-one, and many-to-many. The input and output don't have to be the same length. They can be time-delayed as well, like in this figure. It can even be none-to-many, where a model generates a sequence without an input. This type is essentially the same as one-to-many since the output would depend on some initial seed, even if it's not explicitly defined as an input.

As we discussed earlier, a model becomes more difficult to optimize as its chain of units gets longer. In RNNs, we can easily end up with very long chains of units when we unfold them in time.

One of the problems we might come across is the exploding gradients problem. Long sequences result in long chains of parameter multiplications, and when we multiply so many weights together, the loss becomes highly sensitive to those weights. This sensitivity can produce steep slopes in the loss function. The slope of the cost function at a point might be so large that, when we use it to update the weights, the weights jump outside a reasonable range and end up with an unrepresentable value, such as NaN. This doesn't have to happen in a single update either; it can happen over the course of several updates. A long chain of large weights leads to large activations, large activations lead to large gradients, and large gradients lead to large weight updates and even larger activations. A quick fix for this problem is to clip the gradient magnitude to prevent it from being larger than some maximum value. This is called gradient clipping.
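
As an illustration, here is a simple sketch of norm-based gradient clipping in NumPy; the function name and threshold are made up for the example:

```python
import numpy as np

def clip_gradients(grads, max_norm):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)   # small epsilon avoids division by zero
        grads = [g * scale for g in grads]
    return grads

# Example: two oversized gradient arrays clipped to a global norm of 5.
grads = [np.full(10, 3.0), np.full(10, -4.0)]
clipped = clip_gradients(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # ~5.0
```

Deep learning frameworks ship this as a built-in utility; in PyTorch, for example, `torch.nn.utils.clip_grad_norm_` does the same rescaling over a model's parameters.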

Another problem we might encounter is the vanishing gradient problem, which we discussed earlier. To recap, when we backpropagate the error in a deep network, the gradient can shrink so much that it is nearly zero by the time it reaches the early layers. In a feedforward network, this makes the early layers harder to optimize since they barely get any updates. In the context of recurrent neural networks, it makes the model quickly forget what it has seen earlier in a sequence. In many cases, this behavior is not acceptable because there might be long-term dependencies. For example, if our task is to predict a missing word in a paragraph, the contextual cues we need might not be very close to the word being predicted. We can tell the missing word in this example is probably "1970s" by looking at the beginning of the paragraph, but we would have no clue if we only had access to the words right next to the missing word.
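
A quick back-of-the-envelope calculation shows why long chains cause both problems. The gradient that reaches a distant time step is roughly a product of many per-step factors; if each factor is slightly below one, the product collapses toward zero, and if it is slightly above one, the product explodes. The factors 0.9 and 1.1 below are arbitrary, chosen only to illustrate the effect:

```python
# Product of k per-step factors, for a factor slightly below and slightly above 1.
factor_small, factor_large = 0.9, 1.1
for k in (10, 50, 100):
    print(k, factor_small ** k, factor_large ** k)
# 10   ~0.35      ~2.6
# 50   ~0.0052    ~117
# 100  ~2.7e-05   ~13781
```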

So we need a model architecture that is better than the vanilla version of the recurrent neural nets. Two popular RNN architectures called LSTMs and Gated Recurrent Units both aim to remember long-term dependencies while alleviating the vanishing and exploding gradient problems. These architectures use gated modules to keep what's important in a sequence of data points.

The main idea in gated architectures is to have a straight channel that flows through time and have modules connected to it. These modules are regulated by gates, which determine how much the module should contribute to the main channel. The gates are simply sigmoid units that produce a number between zero and one. Zero means nothing passes through the gate, and one means everything is let through as-is.

Let's build an extremely simplified version of a gated unit. We have a main channel where all the modules will connect to. We have modules that can add or remove information from this channel, where what needs to be kept or discarded is determined by sigmoid gates. Actual LSTMs and Gated Recurrent Units are more complicated than this simplified example.
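
Here is one way such a simplified gated unit could look in code. This is a toy sketch of the idea, not an actual LSTM or GRU; the weight names and shapes are made up for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(state, x_t, W_gate, W_cand, b_gate, b_cand):
    """One step of a toy gated unit (much simpler than a real LSTM or GRU).

    A sigmoid gate between 0 and 1 decides how much of the old state to keep
    on the main channel and how much of a new candidate value to write onto it.
    """
    inp = np.concatenate([state, x_t])
    gate = sigmoid(W_gate @ inp + b_gate)        # 0 = discard, 1 = let through as-is
    candidate = np.tanh(W_cand @ inp + b_cand)   # proposed new information
    return gate * state + (1.0 - gate) * candidate
```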

This figure from Chris Olah's blog, for example, summarizes how LSTMs work. The first gate in this module determines how much of the past we should remember. The second gate decides how much this unit should add to the current state. Finally, the third gate decides what parts of the current cell state make it to the output.
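
For reference, the standard LSTM update can be written out gate by gate. The NumPy sketch below follows the usual textbook formulation; the weight names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM step, with the three gates spelled out."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)          # forget gate: how much of the past to remember
    i = sigmoid(Wi @ z + bi)          # input gate: how much this unit adds to the state
    c_tilde = np.tanh(Wc @ z + bc)    # candidate values to add
    c = f * c_prev + i * c_tilde      # updated cell state (the "main channel")
    o = sigmoid(Wo @ z + bo)          # output gate: what part of the state is exposed
    h = o * np.tanh(c)
    return h, c
```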

It's possible to increase the representational capacity of recurrent neural networks by stacking recurrent units on top of each other. Deeper RNNs can learn more complex patterns in sequential data, but this extra depth makes the model harder to optimize.
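
In a framework like PyTorch, stacking recurrent units is a one-line change. A brief sketch, with arbitrary sizes:

```python
import torch
import torch.nn as nn

# A 2-layer ("stacked") LSTM: the hidden states of the first layer become
# the inputs of the second layer at every time step.
stacked_lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2, batch_first=True)

x = torch.randn(4, 50, 16)            # batch of 4 sequences, 50 steps, 16 features each
outputs, (h_n, c_n) = stacked_lstm(x)
print(outputs.shape)                  # torch.Size([4, 50, 32]) -- top layer's hidden states
print(h_n.shape)                      # torch.Size([2, 4, 32])  -- final state of each layer
```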

One last thing: recurrent models are not our only option for processing sequential data. Convolutional neural networks can also work very well on time-series data. For example, we can represent a series of measurements as a grid of values and build a convolutional neural network on top of it using one-dimensional convolutional layers. One thing to keep in mind is to make sure that the convolutions use only past data and don't leak information from the future. This type of convolution is called a causal convolution or a time-delayed convolution. Another trick is to use dilated convolutions to capture longer-term dependencies by exponentially increasing the receptive field.
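
Below is a sketch of a causal, dilated one-dimensional convolution in PyTorch, written as an illustration rather than as WaveNet's actual code. Left-padding the input by `(kernel_size - 1) * dilation` ensures each output depends only on current and past time steps:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at the past, via left-padding."""
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))                      # no right-padding => no future leakage
        return self.conv(x)

x = torch.randn(1, 8, 100)                               # one sequence, 8 channels, 100 steps
layer = CausalConv1d(channels=8, kernel_size=2, dilation=4)
y = layer(x)
print(y.shape)                                           # torch.Size([1, 8, 100])
```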

Google's WaveNet model makes use of these techniques to train convolutional neural networks on sequential data. You can find more information about it in the description below. You can also watch my earlier videos on Convolutional Neural Networks to learn more if you haven't watched them already.

That's all for today. The next video's topic will be unsupervised learning. We will talk about how we can train models on unlabeled data.

As always, thanks for watching, stay tuned, and see you next time.

Further reading: