Can AI Create Original Art?
September 26, 2020
We know that computers are good at crunching numbers, but can they also be creative? Can they paint a painting that has never been painted before?
I generated these paintings using a machine learning model that I trained myself. All of these paintings are products of a piece of software that crunches numbers into art.
Do you think it's obvious that they are AI-generated, or could they pass as real paintings? Before I explain how I built this AI art model, I want to show a short cutscene from the video game that inspired me to make this video: Detroit: Become Human.
[Cutscene from Detroit: Become Human]
If we were to build a model that generates such paintings, how would we do it? A generative adversarial network (GAN) would be the natural choice for such a generative model. I have a separate video on how generative adversarial networks work, but the basic idea is that we train two different models that compete against each other. Given a dataset, one model, the generator, tries to produce samples that look like the ones in the training set. The other, the discriminator, tries to tell whether its input is a real sample from the dataset or a fake generated by the first model. Both models improve over time, and once they reach an equilibrium, we discard the discriminator and use the generator to generate realistic new samples.
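This adversarial game can be sketched in a few dozen lines. The toy below is purely illustrative (it is not my actual model): the "paintings" are just numbers drawn from a Gaussian, the generator is a two-parameter affine map, and the discriminator is a logistic classifier, trained with hand-derived gradients so no ML framework is needed. It shows the alternation between the two players and how the generator's output drifts toward the real distribution.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    x = max(-500.0, min(500.0, x))  # guard against math.exp overflow
    return 1.0 / (1.0 + math.exp(-x))

# Toy "real data": samples from N(4, 0.5).
real_mean, real_std = 4.0, 0.5

# Generator: fake = mu + s * z, with noise z ~ N(0, 1).
mu, s = 0.0, 1.0
# Discriminator: D(x) = sigmoid(a * x + b), the probability that x is real.
a, b = 0.0, 0.0

lr_d, lr_g = 0.1, 0.01  # a faster discriminator damps the oscillations
mu_trace = []

for _ in range(8000):
    real = random.gauss(real_mean, real_std)
    z = random.gauss(0.0, 1.0)
    fake = mu + s * z

    # Discriminator step: push D(real) -> 1 and D(fake) -> 0
    # (manual gradients of the binary cross-entropy loss).
    d_real, d_fake = sigmoid(a * real + b), sigmoid(a * fake + b)
    a += lr_d * ((1.0 - d_real) * real - d_fake * fake)
    b += lr_d * ((1.0 - d_real) - d_fake)

    # Generator step (non-saturating loss): push D(fake) -> 1.
    d_fake = sigmoid(a * fake + b)
    mu += lr_g * (1.0 - d_fake) * a
    s += lr_g * (1.0 - d_fake) * a * z
    mu_trace.append(mu)

# Near equilibrium the generator's mean hovers around the real mean (4.0).
mu_avg = sum(mu_trace[-2000:]) / 2000.0
```

In a real GAN the two players are deep networks and the gradients come from backpropagation, but the alternating update structure is the same.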
There are many GAN architectures that are powerful in their own ways, such as CycleGAN and BigGAN, but my overall favorite is StyleGAN2, one of the best generative adversarial networks at the time of making this video.
Traditional GANs use a Gaussian random noise vector as input. StyleGAN, on the other hand, uses a mapping network that learns to generate latent representations that correspond to 'styles'. This model was originally trained on a face dataset, so those styles controlled the head pose, hair, face shape, facial features, eye color, and so on. For an art dataset, the style representations would correspond to the color scheme, art movement, and other aspects of artistic style.
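As a rough sketch of the idea (not StyleGAN's actual implementation, which maps a 512-dimensional z through an 8-layer MLP), the mapping network transforms the Gaussian noise vector z into an intermediate latent w, and that same w is then fed to every layer of the synthesis network as a 'style'. The dimensions and weights below are made up for illustration:

```python
import random

random.seed(1)

Z_DIM, W_DIM, N_SYNTH_LAYERS = 8, 8, 4  # tiny stand-ins for the real sizes

# Fixed random weights standing in for one learned layer of the mapping MLP.
weights = [[random.gauss(0.0, 0.3) for _ in range(Z_DIM)] for _ in range(W_DIM)]

def leaky_relu(x, slope=0.2):
    return x if x > 0.0 else slope * x

def mapping(z):
    """One linear layer + nonlinearity: w = f(W @ z)."""
    return [leaky_relu(sum(w_ij * z_j for w_ij, z_j in zip(row, z)))
            for row in weights]

z = [random.gauss(0.0, 1.0) for _ in range(Z_DIM)]
w = mapping(z)

# Every synthesis layer receives the same style vector w (in StyleGAN,
# via a per-layer learned affine transform that modulates the convolutions).
styles_per_layer = [list(w) for _ in range(N_SYNTH_LAYERS)]
```

The point of the intermediate w space is that it is not forced to follow a fixed Gaussian distribution, which makes the individual style dimensions easier to disentangle and to tweak.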
Speaking of datasets, how can we put together an art dataset? There are millions, maybe even billions, of public-domain paintings (and other forms of visual art) available on the web. To ensure quality, I limited my sources to artwork created by well-established artists, yet I was still able to collect over 300,000 paintings, dating from the 1400s to the 1900s.
StyleGAN2 has a maximum output resolution of 1024x1024 pixels, which is quite high compared to many other models but still not enough for high-quality art prints. For example, to print a 20x20 inch canvas at 300 dpi, we would need a resolution of at least 6000x6000 pixels.
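The print arithmetic is simply pixels = inches × dpi per side:

```python
def required_pixels(print_inches, dpi=300):
    """Minimum pixels per side to print at the given size and pixel density."""
    return print_inches * dpi

# A 20x20 inch canvas at 300 dpi needs at least 6000x6000 pixels,
# far beyond StyleGAN2's native 1024x1024 output.
side = required_pixels(20)  # 6000
```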
I could extend StyleGAN2 to generate 8K images, but that was not feasible for two main reasons. The first is computational limitations: StyleGAN2 is already a very resource-intensive model, and since I don't have a GPU cluster at home, training at 8K resolution was out of the question. Even if I had the resources, there was another problem: the images in the dataset were not in 8K. Their resolutions varied, but on average they were close to 1024x1024 pixels. So I needed a different solution, and this is what I came up with.
I have a generator network, similar to StyleGAN2, and a discriminator network that tries to tell whether its input was created by a human artist or not. I also have an upscaling network, which is similar to ESRGAN but more lightweight and with a larger upscaling factor. At the output of the upscaler, I have another discriminator that is trained to tell whether its input is a real high-resolution image or an image upscaled by the upscaling model. You can check out my earlier video on super-resolution to learn more about neural upscaling. I also added a consistency constraint that makes sure that when we downscale the output back using a standard downscaling algorithm, we get back what the upscaler had at its input. The upscaler is resolution-agnostic: it learns to upscale its input by a fixed factor, regardless of the input resolution. So, this setup allows for training a model that can generate high-resolution artwork without having any high-resolution art in the training set.
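The consistency constraint can be sketched as follows. This is a hypothetical 1-D toy, not the actual training code: the learned upscaler is replaced by nearest-neighbour repetition, the fixed standard downscaler by a box filter, and the constraint becomes a mean-squared-error loss that penalises any mismatch between the upscaler's input and its downscaled output.

```python
def upscale(xs, factor=2):
    # Stand-in for the learned upscaler: nearest-neighbour repetition.
    return [x for x in xs for _ in range(factor)]

def downscale(xs, factor=2):
    # Fixed, standard downscaler: box filter (mean over each block).
    return [sum(xs[i:i + factor]) / factor for i in range(0, len(xs), factor)]

def consistency_loss(xs, factor=2):
    # Downscaling the upscaled signal should recover the original input.
    reconstructed = downscale(upscale(xs, factor), factor)
    return sum((x - r) ** 2 for x, r in zip(xs, reconstructed)) / len(xs)

signal = [0.1, 0.5, 0.9, 0.3]
loss = consistency_loss(signal)  # 0.0 for this particular upscaler
```

During training, this loss term is added to the adversarial loss from the high-resolution discriminator, so the upscaler must invent plausible detail without drifting away from the low-resolution content it was given.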
Let's name this model Marcus. The way Marcus learns to paint is not exactly the same as how a human artist learns, but it is not entirely different either. Marcus is inspired by thousands of human artists. He (or it) has seen and processed hundreds of thousands of pieces of art, tens of millions of times. So, essentially, it internalized the concept of art and, as a result of this training process, was able to create art itself.
You may ask: is the machine really the artist here? One may argue that the creativity comes from the dataset and whoever builds the model, as the model is not really aware of what it's creating. The art it creates is a distillation of the work of thousands of artists who lived over the past six or seven centuries.
So, the art the model creates can be considered a massively collective work. On the other hand, human artists are also influenced by those who came before them, and pretty much no artist creates art in complete isolation. So it's also possible to consider the training process more of an 'accelerated' inspiration than a mere distillation. Personally, I think an AI art model is more of a tool than an artist. Artists use different tools to express their art: some use oil, some use acrylic, and some use Photoshop. We engineers can use machine learning as a tool to create art.
I agree that what I do here to create art is rather trivial compared to how 'real' artists create art. The process of creating the algorithms and datasets to train these models is not trivial, but now that the model is fully trained, I can generate new pieces in a fully automatic way, although I usually tune the style knobs to get results I like.
I have had a passion for art and technology for quite some time, and I have always been fascinated by the idea of computational creativity. Back in 2011, I even proposed this project as a master's project, but my advisor said maybe I should look into something more feasible so that I could finish it within the time frame of my program, and so I did. I mean, he was right. There was no way I could have built such a model in 2011, with very limited computing power, no deep learning frameworks, and very little prior work to look into. I would have needed to invent generative adversarial networks and implement everything from scratch. But today it's much easier to build deep models. Thanks to recent advances in the field, we can create art in a form that was unimaginable in the past.