Video Transcript

In this video, we'll see how we can turn pictures like this one into mind-blowing paintings like this one, using a method called contrastive unpaired translation.

This method was proposed in this paper, titled Contrastive Learning for Unpaired Image-to-Image Translation. It's a fairly recent paper, published less than a year ago. I'll first summarize how the predecessor of this method, CycleGAN, worked, what shortcomings it had, and how this new method addresses those problems. Finally, I'll show how you can try it out for yourself. If you make it to the end of the video, I also made an AI art video for you that I hope you'll enjoy.

Let's first look at the problem that this method tries to solve. Let's say we have two datasets: one has paintings and the other one has photos. We want to learn a mapping between paintings and photos so that we can turn photos into paintings or vice versa. But we don't have a one-to-one correspondence between the images in our datasets. This problem is called unpaired image translation and was previously addressed by CycleGAN.

CycleGAN trains two generators in opposite directions. For example, if we have a painting dataset and a photograph dataset and train a CycleGAN, then it learns a model that converts pictures into paintings, and paintings into pictures simultaneously. The idea is that when we go from the photograph domain to the painting domain, and from the painting domain back to the photograph domain, what we get should be close to the input. In CycleGAN, this behavior is enforced by a cycle consistency loss that minimizes the difference between the input and the reconstructed image. This loss function is used in addition to the loss that comes from the discriminators that tell whether their input is a real photo or painting from the dataset or a generated one.
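To make the cycle consistency idea concrete, here is a minimal sketch in plain Python. The two "generators" are hypothetical toy functions standing in for neural networks, and images are just flat lists of pixel values. The loss is the mean absolute (L1) difference between the input and its reconstruction, which is the kind of distance CycleGAN uses for this term.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized pixel lists."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(image, forward, backward):
    """Translate to the other domain and back, then compare with the input."""
    reconstructed = backward(forward(image))
    return l1_distance(image, reconstructed)

# Hypothetical toy "generators": one brightens pixels, the others darken them.
def photo_to_painting(pixels):
    return [p + 0.5 for p in pixels]

def painting_to_photo(pixels):
    return [p - 0.5 for p in pixels]          # exact inverse of the forward mapping

def imperfect_painting_to_photo(pixels):
    return [p - 0.25 for p in pixels]         # only partially undoes the forward mapping

photo = [0.0, 0.25, 0.5, 1.0]
perfect = cycle_consistency_loss(photo, photo_to_painting, painting_to_photo)
imperfect = cycle_consistency_loss(photo, photo_to_painting, imperfect_painting_to_photo)
print(perfect, imperfect)
```

The loss is zero only when the backward generator undoes the forward one exactly, which is why both generators have to be trained together.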

This method worked really well for many cases but it also had shortcomings. And this new paper addresses those problems.

One issue with CycleGAN is that we always need to train two generators even if we only care about one direction. Naturally, this slows down training. Besides, the training setup would assume that it is equally hard and important to turn a photo into a painting and vice versa. This may not always be the case. I'd argue that it's easier to turn a photo into a painting than the other way around. So, if our application does only a one-way mapping, then there should be no need to try hard to learn the inverse mapping as well.

Another related issue is that CycleGAN assumes that the relationship between two domains is invertible and has a 1-to-1 correspondence, which may not always hold true. For example, there may be multiple plausible painting equivalents for a given photo.

The cycle consistency loss is overall a good idea but it's not foolproof. A separate paper, titled CycleGAN, a Master of Steganography, found that CycleGAN is actually very good at hiding information to satisfy the cycle consistency requirement.

Consider the task of turning aerial photographs into maps. It's not really a 1-to-1 task like the cycle consistency loss assumes. Multiple photographs can correspond to the same map. So CycleGAN finds its way around this by hiding some information in the map in the form of subtle noise that is invisible to the human eye but can be used to convert it back to an aerial photo. This also explains how CycleGAN was able to translate between completely unrelated domains like faces and ramen. I mean, how else can one reconstruct a recognizable face once it becomes ramen? If you are interested in machine learning models finding ways around the constraints we give them, you can check out my earlier video on reward hacking. Anyway, I digress.

Ok, so how does this new method address all these problems? First, it drops the cycle consistency constraint altogether and trains a single-sided model. Since it has only one generator, it trains much faster.

Instead of cycle consistency, it uses a patchwise contrastive loss to preserve the mutual information between corresponding patches of the input and the output, and I'll explain what that means.

This is what the overall training setup looks like. We have an input sample. We have a generator that turns horses into zebras, which is just an example. It can be trained to turn photos into paintings as well.

We have a discriminator here, which the generator needs to convince that the zebras it outputs look indistinguishable from the real ones.
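If you want to see what this tug of war looks like in numbers, here is a small sketch of the textbook adversarial losses in plain Python. This is an assumption on my part for illustration: it uses the standard binary cross-entropy formulation, and the released code may well use a different variant of the adversarial loss. The probabilities are hypothetical discriminator outputs, where 1 means "looks real".

```python
import math

def binary_cross_entropy(prob, target):
    """Cross-entropy between a predicted probability and a 0/1 target."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def discriminator_loss(prob_real, prob_fake):
    """The discriminator wants real images scored as 1 and generated ones as 0."""
    return binary_cross_entropy(prob_real, 1.0) + binary_cross_entropy(prob_fake, 0.0)

def generator_loss(prob_fake):
    """The generator wants the discriminator to score its outputs as 1."""
    return binary_cross_entropy(prob_fake, 1.0)

# A generated zebra that fools the discriminator gives the generator a lower loss.
print(generator_loss(0.9), generator_loss(0.1))
```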

And the novelty of this paper is that they define a multilayer, patchwise contrastive loss. What happens during training is that they randomly sample patches in the inputs and outputs. Then, they enforce that the corresponding patches should be more similar to each other than the non-corresponding, negative pairs. The similarity is basically cosine similarity between the feature maps generated by the same encoder in the generator, across multiple layers. For example, this zebra's head in the output should have a higher similarity score with the horse's head in the input than some other random patches sampled internally from this same input image.
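Here is a rough sketch of the patch sampling and the similarity computation for a single layer, in plain Python. The feature vectors are made-up toy numbers, and the function names are mine, not the authors'. In the real model the features come from the generator's encoder and this is repeated across several layers.

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sample_patch_pairs(input_features, output_features, num_patches, seed=0):
    """Pick the same random spatial locations in the input and output features."""
    rng = random.Random(seed)
    locations = rng.sample(range(len(input_features)), num_patches)
    inputs = [input_features[i] for i in locations]
    outputs = [output_features[i] for i in locations]
    return inputs, outputs

def similarity_matrix(outputs, inputs):
    """Row i compares output patch i with every sampled input patch."""
    return [[cosine_similarity(q, k) for k in inputs] for q in outputs]

# Toy features: one vector per spatial location (hypothetical values).
input_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
output_features = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.8, 0.9, 0.1]]

inputs, outputs = sample_patch_pairs(input_features, output_features, num_patches=3)
scores = similarity_matrix(outputs, inputs)
```

The diagonal of `scores` holds the positive pairs (same location in input and output), and everything off the diagonal is a negative pair sampled from the same input image.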

They formulate this contrastive loss as a classification problem to classify patches as positive and negative patches given a query. The actual loss function is basically a softmax cross-entropy function.
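For a single query patch, that softmax cross-entropy can be sketched like this in plain Python. The similarity values are hypothetical, and the temperature of 0.07 is the value I recall from the paper, so treat it as an assumption and check it against the released code.

```python
import math

def patch_nce_loss(positive_similarity, negative_similarities, temperature=0.07):
    """Softmax cross-entropy where the positive patch is the correct class."""
    # Class 0 is the positive patch, the remaining classes are the negatives.
    logits = [positive_similarity / temperature]
    logits += [s / temperature for s in negative_similarities]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-probability assigned to the positive class.
    return log_sum_exp - logits[0]

# The query matches its positive well: low loss.
good = patch_nce_loss(0.9, [0.1, 0.0, -0.2])
# The query looks more like a negative than its positive: high loss.
bad = patch_nce_loss(0.1, [0.9, 0.0, -0.2])
print(good, bad)
```

The loss goes down when the corresponding patch stands out from the negatives, which is exactly the behavior we want to encourage.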

In one of the variants of the model they proposed in the paper, they also have an additional loss that minimizes the difference between the input and the output when the input is already in the output domain. For example, if we feed a painting to our model instead of a photo, it shouldn't change it any further if it's trained to output paintings. Naturally, this slows down training but it still trains faster than CycleGAN.
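As a rough illustration of this extra term, here is a plain Python sketch. How the difference is actually measured isn't covered in this video, so the distance is a pluggable parameter and the mean absolute difference below is only a stand-in. The toy generators are hypothetical and work on flat lists of pixel values.

```python
def mean_absolute_difference(a, b):
    """Stand-in distance between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def identity_loss(generator, target_domain_image, distance=mean_absolute_difference):
    """Penalize the generator for altering an image already in the output domain."""
    return distance(generator(target_domain_image), target_domain_image)

# Hypothetical toy generators.
def well_behaved_generator(pixels):
    return list(pixels)                            # leaves a painting untouched

def overeager_generator(pixels):
    return [min(1.0, p + 0.25) for p in pixels]    # keeps "stylizing" anyway

painting = [0.0, 0.25, 0.5, 0.75]
print(identity_loss(well_behaved_generator, painting))
print(identity_loss(overeager_generator, painting))
```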

One nice thing about this model is that it's not very data-hungry. In fact, since the patches are sampled internally, you can even train this model with one image per domain. For example, this model was trained with a single Monet painting and a high-resolution photo. This model was also trained with one photo and a painting. Quite impressive.

Alright, now let's see how we can train this model ourselves to turn photos into beautiful paintings.

The authors of this paper released their code and datasets in this GitHub repository. I'll put the links in the description.

If you don't have a GPU, you can always use Google Colab or an equivalent service to get free access to GPUs.

To turn photos into paintings, we need one set of photos and another set of paintings. Where to find those is up to you.

For example, I used some of my favorite wallpapers for the photos dataset and some landscape and abstract paintings for the painting dataset.

You can start with the datasets that are already provided in this repository. You can download them using this bash script. They have Monet, Van Gogh, Ukiyo-e, and Cezanne datasets. Since these datasets are intended for painting-to-photo translation, not the other way around, we should invert the training direction by setting the direction to BtoA.
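Here is a small Python helper that assembles such a training command. Apart from the direction setting mentioned above, the flag names are as I remember them from the repository's README, and the dataset path and experiment name are placeholders, so double-check everything against the repository before running it.

```python
def build_train_command(dataroot, experiment_name, direction="BtoA", cut_mode="CUT"):
    """Assemble the argument list for the repository's train.py script.

    Flag names are assumptions based on my reading of the README.
    """
    return [
        "python", "train.py",
        "--dataroot", dataroot,
        "--name", experiment_name,
        "--CUT_mode", cut_mode,
        "--direction", direction,
    ]

# Placeholder dataset path and experiment name.
command = build_train_command("./datasets/monet2photo", "photo2monet_CUT")
print(" ".join(command))
```

You could pass the resulting list to `subprocess.run(command, check=True)` from the repository's root folder, or simply copy the printed line into your terminal.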

You're not limited to these datasets. You can put together your own dataset. WikiArt has a large collection of paintings. You can pick your favorites from there.
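If you go that route, a short Python script like this one can arrange your images into folders. I'm assuming the trainA, trainB, testA, testB layout that the provided datasets use, with paintings as domain A and photos as domain B, so verify that against the repository. The function name and the test split fraction are my own choices.

```python
import random
import shutil
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def build_unpaired_dataset(paintings_dir, photos_dir, output_dir,
                           test_fraction=0.1, seed=0):
    """Copy images into trainA/testA (paintings) and trainB/testB (photos)."""
    rng = random.Random(seed)
    counts = {}
    for domain, source in (("A", Path(paintings_dir)), ("B", Path(photos_dir))):
        # Collect image files in a stable order, then shuffle reproducibly.
        images = sorted(p for p in source.iterdir()
                        if p.suffix.lower() in IMAGE_EXTENSIONS)
        rng.shuffle(images)
        num_test = int(len(images) * test_fraction)
        splits = {"test": images[:num_test], "train": images[num_test:]}
        for split, files in splits.items():
            target = Path(output_dir) / f"{split}{domain}"
            target.mkdir(parents=True, exist_ok=True)
            for file in files:
                shutil.copy2(file, target / file.name)
            counts[f"{split}{domain}"] = len(files)
    return counts
```

Since the images are unpaired, the two folders don't need to contain the same number of files, and the file names don't need to match.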

Let's take a look at some of the results I got after a couple of days of training.

I think they look great. Maybe not all of them but overall I think the results are quite impressive.

That was pretty much it. If you made it this far, wait for the AI art at the end of the video. I didn't make it using this model, but I think it still looks cool. Let me know what you think.

I hope you liked this video. Subscribe for more videos. And I'll see you next time.