Deep Unsupervised Learning

April 5, 2018

Transcript of the video

The examples we had so far in this series were examples of supervised learning. In supervised learning, the learning algorithm tries to learn a mapping between inputs and known outputs. Creating these outputs usually involves some sort of human supervision, such as labeling the inputs by hand or using data that are already annotated by humans in different ways.

Unsupervised learning, on the other hand, aims to find some structure in the data without having labels. One reason unsupervised learning might be useful is that unlabeled data is abundant and cheap, since finding a ground truth label for each sample is usually the step that takes a lot of time.

In traditional machine learning, perhaps the most common types of unsupervised learning are dimensionality reduction and clustering. Dimensionality reduction refers to reducing the number of variables needed to represent data.

One reason we might need dimensionality reduction is that the volume of the feature space increases exponentially as the number of variables increases. In higher dimensional spaces, it is harder to have a representative sample for every possible configuration of the parameters. This problem is also known as the curse of dimensionality.
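To get a feel for the numbers: if we discretize each variable into just 10 bins, the number of cells needed to cover the whole space grows exponentially with the number of variables. A quick back-of-the-envelope sketch (the choice of 10 bins is arbitrary, just for illustration):

```python
# Number of cells needed to cover the space with 10 bins per variable.
for num_variables in [1, 2, 3, 10, 100]:
    print(num_variables, "variables ->", 10 ** num_variables, "cells")
```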

In reality, natural data have regularities and don't really use the entire parameter space. If our data consist of images, for example, natural images make up only a small portion of every possible combination of pixel values. We can encode such inputs using fewer parameters by capturing a lower-dimensional manifold that explains the data.

In deep learning, dimensionality reduction is usually not done as a separate step; instead, the model learns compact representations internally. It's still common to use dimensionality reduction techniques for visualization purposes, though. For example, methods like PCA or t-SNE can be used to visualize data points with many dimensions in 3-dimensional space. Data visualization and traditional dimensionality reduction techniques are a little off-topic for this video, so I won't go into further detail, but you can find more information about them in the description below.
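As a quick illustration, here is a minimal sketch using scikit-learn's PCA and t-SNE implementations. The array `X` is just random placeholder data standing in for real high-dimensional features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(500, 64)  # placeholder data; substitute your own features

# Linear projection onto the 3 directions of highest variance.
X_pca = PCA(n_components=3).fit_transform(X)

# Nonlinear embedding that tries to preserve local neighborhoods.
X_tsne = TSNE(n_components=3, perplexity=30, init="pca").fit_transform(X)

print(X_pca.shape, X_tsne.shape)  # (500, 3) (500, 3)
```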

Let's talk about how we can train a neural network in an unsupervised fashion. We need to define some sort of target to train a neural network. One thing we can do is to train a model to learn a mapping from the input data to the input data. Since the inputs and the outputs are the same, the model would try to reconstruct its input. These types of models are called Autoencoders.

Of course, it's trivial to define a mapping from an input to itself. Why do we even need a neural network to represent an identity transform? Well, we don't actually want the model to learn an identity transform. We want it to learn to represent its input efficiently.

A typical autoencoder is a feedforward neural network that consists of encoder and decoder modules with a bottleneck in between. The encoder module distills the information into a lower-dimensional feature space, sometimes called the code. The decoder tries to reconstruct the input using the code.
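Here is a minimal sketch of such an autoencoder in PyTorch. The layer sizes, the 784-dimensional input (a flattened 28x28 image), and the 32-dimensional code are arbitrary choices for illustration, not a prescription:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder: distills the input into a low-dimensional "code".
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        # Decoder: reconstructs the input from the code.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code)

model = Autoencoder()
x = torch.rand(16, 784)                           # a dummy batch of flattened images
reconstruction = model(x)
loss = nn.functional.mse_loss(reconstruction, x)  # target is the input itself
loss.backward()
```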

The inputs and outputs don't have to be identical. For example, the input can be a noisier version of the output. This noise can be introduced artificially. In this setting, the network tries to learn to denoise its inputs. Such a model is called a denoising autoencoder.
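The only change for the denoising variant is at the input: we corrupt the batch before feeding it to the model, but still compare the reconstruction against the clean version. Continuing the sketch above (the noise level of 0.2 is an arbitrary choice):

```python
# Corrupt the input with Gaussian noise, keep the clean input as the target.
noisy_x = x + 0.2 * torch.randn_like(x)
noisy_x = noisy_x.clamp(0.0, 1.0)                 # keep pixel values in [0, 1]
reconstruction = model(noisy_x)
loss = nn.functional.mse_loss(reconstruction, x)  # compare against the clean input
```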

A typical choice of loss function for autoencoders is the L2 loss. In this case, the learning algorithm would basically try to minimize the mean squared error between the reconstructed and true inputs. However, my experience is that using the mean squared error on image sets usually results in blurry reconstructed images. Mean absolute error, on the other hand, usually results in crisper images. This might be one of the few domains where L1 loss actually does better than L2.
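In code, switching between the two comes down to swapping the criterion; which one looks better on your images is an empirical question. Continuing the sketch above:

```python
l2_loss = nn.functional.mse_loss(reconstruction, x)  # mean squared error; tends to blur
l1_loss = nn.functional.l1_loss(reconstruction, x)   # mean absolute error; often crisper
```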

One use of unsupervised learning is to learn representations. For example, if we have large amounts of unlabeled data but not so much labeled data, it's possible to pre-train a model on the unlabeled data first. We can use the encoder part of the trained model as a feature extractor. Then, we can train a separate model that uses this compact representation as its input.
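For example, continuing the autoencoder sketch above, we could freeze the trained encoder and fit a simple classifier on top of the codes. The `labeled_inputs` and `labels` variables below are hypothetical placeholders for the small labeled set, and logistic regression is just one possible choice of downstream model:

```python
from sklearn.linear_model import LogisticRegression

with torch.no_grad():
    codes = model.encoder(labeled_inputs)  # labeled_inputs: an (N, 784) float tensor

classifier = LogisticRegression(max_iter=1000)
classifier.fit(codes.numpy(), labels)      # labels: an array of N class ids
```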

For image data, this approach is not used very often anymore. When labeled data are scarce, it's more common to transfer features from models trained on large datasets in a supervised way. You can check out my earlier video on transfer learning to learn more about it. You can find it in the Deep Learning Crash Course playlist in the description below.

For text data, it's still common to train models on a large corpus of text in an unsupervised manner. One key example of this is the Word2Vec model, which learns to map words to vectors of real numbers, called word embeddings.

Before we talk about how word embeddings work, let's take a look at a very basic way to turn words into vectors. A naive approach would be to encode words as one-hot vectors, where each dimension represents one word, and all dimensions are mutually exclusive.
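In code, one-hot encoding over a toy vocabulary looks like this (the four words are just an example):

```python
vocabulary = ["cat", "tiger", "goldfish", "eagle"]

def one_hot(word):
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec

print(one_hot("cat"))    # [1, 0, 0, 0]
print(one_hot("eagle"))  # [0, 0, 0, 1]
# Every pair of distinct words is equally far apart, and the vector length
# grows with the vocabulary size.
```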

This representation is not very useful for many natural language processing tasks for several reasons. First, in one-hot representation, all words are equidistant to each other. Ideally, we would want the distances between words to correlate with their semantic similarities.

Second, the size of the vectors becomes unmanageable when we have a lot of words in our dictionary.

For example, if we have 1,000,000 words to represent, using vectors of size 1,000,000 wouldn't be a sensible thing to do. Are there even that many words in English? Including proper nouns, yes, probably even many more than that.

So, what's a more sensible thing to do then? Assuming that the data can be explained by multiple factors, we can adopt a distributed representation.

For example, let's say we want to represent the names of thousands of animals, such as cat, tiger, goldfish, eagle, etc. Instead of encoding them using disjoint symbols, we can use a distributed representation whose entries correspond to features that describe an animal. These features might include whether it's an aquatic or terrestrial animal, whether it has wings, feathers, or fur, and so on. Such a representation would allow attributes to be shared between different words. In this example, the animals that share more attributes would be closer to each other in the vector space than the ones that share fewer. This representation is also more powerful because a vector with a given number of features can distinguish exponentially many words.
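Here is a toy illustration of the idea with hand-picked features; in practice, models like Word2Vec learn the dimensions automatically rather than having them defined by hand. Cosine similarity is one common way to measure how close two such vectors are:

```python
import numpy as np

#                        aquatic, wings, fur, size
animals = {
    "cat":      np.array([0.0, 0.0, 1.0, 0.2]),
    "tiger":    np.array([0.0, 0.0, 1.0, 0.8]),
    "goldfish": np.array([1.0, 0.0, 0.0, 0.1]),
    "eagle":    np.array([0.0, 1.0, 0.0, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(animals["cat"], animals["tiger"]))     # high: many shared attributes
print(cosine(animals["cat"], animals["goldfish"]))  # low: few shared attributes
```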

These dense representations can be learned from data in an unsupervised way. Let's take a look at the Word2Vec model, which is pretty much the standard method for learning word embeddings at the time I'm making this video.

Word2Vec attempts to learn the meanings of words and phrases in context. It starts with randomly initialized vectors for each word in a dictionary. Then, given a large corpus of text, it updates the vectors by training the words against their neighbors.

For example, if we trained a model using a corpus that had sentences similar to "rivers are essential to the Earth's water cycle," then the model would be able to learn a semantic similarity between the words 'rivers' and 'water,' after seeing them used close to each other so many times.

Word2Vec comes in two flavors. The first type, called the continuous bag-of-words (CBOW) model, uses the neighbors of a word to predict it. The other type, called the skip-gram model, does the opposite and tries to predict the neighbors of a given word. Both types seem to capture the semantics of the words well.
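Here is a minimal sketch of training both flavors with the gensim library (I'm assuming gensim 4.x here, and the toy sentences stand in for a large corpus). The `sg` flag switches between CBOW and skip-gram:

```python
from gensim.models import Word2Vec

sentences = [
    ["rivers", "are", "essential", "to", "the", "earth's", "water", "cycle"],
    ["lakes", "and", "rivers", "store", "fresh", "water"],
    # ... many more tokenized sentences from a large corpus ...
]

cbow     = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)  # CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # skip-gram

# With enough data, nearby words in meaning end up with similar vectors.
print(skipgram.wv.most_similar("rivers", topn=5))
```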

Let's take a look at some examples. I ran some queries on a model trained on news articles to see what words the model thinks are similar. For example, the word 'phenomenal' was closest to the words incredible, amazing, fantastic, and unbelievable. Steve Jobs used to use this word a lot. How about Chicago? It seems that it had co-occurred in the news articles frequently with Baltimore, Denver, NYC, and Atlanta. Similarly, the word 'cats' had been frequently used with the words felines, cat, pets, and kittens.

It's interesting that simple arithmetic on these learned word embeddings can be used to capture analogies. For example, using the embeddings from the same model as in the previous example, the result of king - man + woman gives queen. Kitten - cat + dog results in puppy. Lexus - Toyota + Honda produces a word vector that is closest to Acura.
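These analogy queries translate directly into vector arithmetic. Assuming `wv` is the word-vector table of a model trained on a large news corpus (for example, `skipgram.wv` from the earlier sketch, given enough data), gensim's `most_similar` handles the addition and subtraction:

```python
# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# kitten - cat + dog ~= puppy
print(wv.most_similar(positive=["kitten", "dog"], negative=["cat"], topn=1))
# lexus - toyota + honda ~= acura
print(wv.most_similar(positive=["lexus", "honda"], negative=["toyota"], topn=1))
```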

The reason this works is that each direction in the embedding space can learn a different factor that explains the words.

For example, one dimension might correspond to whether the subject is male or female. Then, when we subtract 'man' from 'king', it might change the gender of the word. Another dimension might encode whether the subject is juvenile or adult, like in the cat and kitten example. There might also be a variable, or a combination of variables, that corresponds to the level of luxury, like in the case of Toyota and its luxury counterpart Lexus.

It's very interesting that these representations can be learned automatically, just by training a neural network on a large corpus of text in an unsupervised fashion.

One might argue that this is not unsupervised learning since the words we train the model to predict can be considered as labels. Well, the boundaries between supervised and unsupervised learning can get a little blurry sometimes. This kind of task can be considered supervised, unsupervised, or something in between, depending on your perspective.

Another family of unsupervised models worth mentioning is generative models. Generative models try to generate samples that are similar to the ones in a training set. Generative Adversarial Networks, in particular, are one of the recent breakthroughs in this field. I think this is a topic that deserves its own video. So, in the next video, we will talk about how Generative Adversarial Networks work and what makes them so powerful.

That's all for today. Thanks for watching, stay tuned, and see you next time.

Further reading: