Video Transcript

When I started experimenting with AI-generated images, it was still a small niche, but now they are everywhere! Generative AI has come a long way since my last video on the topic. So, it’s time for a new video!

In this video, we'll take a deep dive into the inner workings of diffusion models, the state-of-the-art approach for generating realistic and diverse images. We will cover their key concepts and techniques, and provide a concise and intuitive understanding for anyone who is interested in how they work.

So, how do diffusion models work?

The gist of it is that during training, we take an image and gradually add noise until there is nothing but noise. When we want to generate images, we reverse this process, start with noise, and gradually remove noise until we have a clean image. In the forward diffusion process, the entropy of the images increases, as adding noise makes them more and more homogeneous. This is somewhat similar to diffusion in thermodynamics, which is what they are inspired by and named after.

Adding noise to images is easy; the more interesting part is how we reverse this process. As you can guess, we use neural networks to do that. Let's take a closer look.

Denoising models usually use a fully convolutional network. These networks are image-in, image-out models that can process images of varying sizes and produce dense, pixel-wise predictions, rather than a single image label or bounding boxes.

One popular architecture for such image-in, image-out tasks is the U-Net architecture. As its name suggests, the U-Net architecture resembles the shape of the letter "U," with a series of downsampling layers followed by a corresponding series of upsampling layers. U-Net uses skip-connections to bridge the downsampling and upsampling layers at the same resolution. This structure allows the network to capture both high-level semantic information and low-level texture details, making it particularly effective for tasks like image denoising, segmentation, and restoration.
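
To make the role of the skip connections a bit more concrete, here is a toy sketch of the data flow only. It has no learned layers at all: average pooling and value repetition stand in for the real downsampling and upsampling blocks, and the function names are made up for this example.

```python
def downsample(signal):
    """Halve the resolution by averaging neighbouring pairs (expects an even length)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]

def upsample(signal):
    """Double the resolution by repeating every value."""
    return [value for value in signal for _ in range(2)]

def unet_data_flow(signal):
    """Trace one down/up level of a U-shaped network on a 1D signal."""
    skip = signal                  # kept aside at full resolution
    coarse = downsample(signal)    # coarse, "big picture" view
    restored = upsample(coarse)    # back to full resolution, fine detail lost
    # The skip connection pairs the coarse path with the untouched detail,
    # like concatenating two feature channels at the same resolution.
    return list(zip(restored, skip))

print(unet_data_flow([1.0, 3.0, 5.0, 7.0]))
# [(2.0, 1.0), (2.0, 3.0), (6.0, 5.0), (6.0, 7.0)]
```

The first value in each pair went through the bottleneck and lost its fine detail; the second came through the skip connection and kept it. A real U-Net would run convolutions over both.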

U-Net has many variants and improved versions, incorporating attention blocks, different types of activation functions, skip connections, and so on, but this is the basic idea.

It's fairly straightforward to train a U-Net-based model to denoise an image that is only a little bit noisy, but how do you go from complete noise to a fully realized, coherent image? Yes, diffusion models gradually remove the noise in a series of steps, instead of trying to remove it all at once, but how?

The naive approach would be to train a model to take a noisy image as input and output an image that is a little less noisy. But that’s not how it’s done! Instead of training the model to predict denoised images directly, diffusion models learn to predict the noise itself, all of it. So, if the predictions were perfectly accurate, all we would need to do is subtract them from the noisy images and be done in one step. But removing all the noise from an image in one step is hard. The noisier the image, the more unreliable the predicted noise will be.

So, what diffusion models do during inference is scale the predicted noise before subtracting it from the noisy input, so that each step removes only a fraction of the noise. This is, of course, the most naive way to implement it. There are more advanced samplers that use sophisticated solvers to get to clean images in fewer steps, but you get the idea.
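
Here is a minimal sketch of that naive loop. Big caveat: the "model" below is an oracle that cheats by looking at the clean image, which a real network obviously can't do. It only stands in for a trained noise predictor, and the names and the step scale are assumptions made for this example.

```python
def oracle_noise_prediction(x, clean):
    """Stand-in for a trained network: returns all the noise left in x."""
    return [xi - ci for xi, ci in zip(x, clean)]

def naive_denoise(x, clean, num_steps=50, step_scale=0.1):
    """Repeatedly subtract a scaled-down copy of the predicted noise."""
    for _ in range(num_steps):
        predicted = oracle_noise_prediction(x, clean)
        # Remove only a fraction of the predicted noise at each step.
        x = [xi - step_scale * pi for xi, pi in zip(x, predicted)]
    return x

clean = [0.2, 0.8, 0.5, 0.1]
noisy = [1.2, -0.3, 0.9, 2.0]
print(naive_denoise(noisy, clean))  # ends up very close to `clean`
```

With a step scale of 0.1, each step leaves 90% of the remaining noise in place, so the sample creeps toward the clean image instead of jumping there in one go.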

You may wonder why we train the models to predict the entirety of the noise all at once, even though we don’t remove the noise at once during inference. Why not train the model with noisy inputs and a little less noisy targets? The problem with noisy targets is that they have a lot more variance than clean ground truth, because there are so many more ways an image can be noisy than it can be clean.

During inference, a sampler denoises a sample one step at a time. But during training, we don’t really need to do it sequentially like that. In practice, the model takes a batch of noisy images with different amounts of noise added to them and tries to predict the noise that was added. A noise scheduler maps each time step directly to an amount of noise, so we don’t have to add noise iteratively for that many time steps. This way, we can get the noisy image at any given step in one shot, which makes training easier and more efficient.
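
As a rough sketch of what "in one shot" means, here is one common formulation, assuming a DDPM-style linear schedule. The video doesn't commit to a particular schedule, so the schedule values and the function names here are assumptions.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    xt = [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]
    return xt, eps  # eps is the target the model learns to predict

rng = random.Random(0)
alpha_bars = make_alpha_bars()
xt, eps = add_noise([0.5, -0.25, 1.0], t=500, alpha_bars=alpha_bars, rng=rng)
```

A training step would then feed `xt` and `t` to the model and compare its output against `eps`, for example with a mean squared error.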

The time step is given to the model as an input so that the model doesn’t need to guess how much noise there is in the inputs and has an idea of how much noise to remove. We don’t use the time step as-is, though. It goes through an embedding process that makes it continuous and more neural-network friendly before it is fed into the model. In its simplest form, we pass the raw time step index through a bunch of sine and cosine functions with different frequencies, and what we get are the embeddings.
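
A minimal sketch of such a sinusoidal embedding might look like this. The embedding size and the `max_period` constant are assumptions borrowed from the usual Transformer-style positional encoding, not something specified in the video.

```python
import math

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Map a time step to `dim` sine/cosine features (dim should be even)."""
    half = dim // 2
    # Frequencies fall off geometrically from 1 toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

print(timestep_embedding(0))
# [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```

Every value stays between -1 and 1 no matter how large the time step is, which is a big part of why this is friendlier to a neural network than a raw integer.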

Diffusion models are overall a lot more stable to train than GANs. GANs require a delicate balance between the generator and the discriminator and are highly sensitive to even minor changes, while a single diffusion step is much less likely to fail that catastrophically.

There are many popular diffusion-based image generators, including DALL-E 2 from OpenAI and Imagen from Google Research, but we'll focus on Stable Diffusion in this video because it's open source. The denoising process is more or less the same in all these models, but there are differences in how things are done.

Let's first take a look at the Latent Diffusion approach, which is what Stable Diffusion is based on. One of the shortcomings of the diffusion process I described earlier is that it's slow. Really, really slow compared to GANs and Variational Autoencoders. Latent Diffusion aims to speed up this process by offloading some of the high-resolution processing into a variational autoencoder.

The variational autoencoder is essentially an encoder-decoder network that is trained to encode images in a lower-dimensional space and then decode them back to reconstruct the original, high-resolution images. We can consider this as a kind of image compression model.
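
To get a feel for how much compression we are talking about, here is a bit of back-of-the-envelope arithmetic. The numbers are the ones commonly cited for Stable Diffusion v1 (8x downsampling along each side and 4 latent channels). Treat them as an assumption, since other autoencoders can use different values.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for a given image size."""
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, image_channels=3):
    """How many times fewer values the latent holds than the RGB image."""
    channels, latent_h, latent_w = latent_shape(height, width)
    return (image_channels * height * width) / (channels * latent_h * latent_w)

print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

So under these assumptions, the diffusion model works on about 48 times fewer values than it would in pixel space.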

The variational autoencoder is trained separately before training a latent diffusion model. Once it's trained, it's frozen, and a diffusion model is trained in its lower-dimensional latent space.

During training, instead of adding noise to images, latent diffusion first runs them through the encoder of the variational autoencoder to move them to the latent space. Then, it adds noise in this lower-dimensional space and trains a model to reverse this process.

Once the model is trained, again, we start with pure noise, just like we did with images, and gradually denoise it. At this point, what we are denoising are not really images but lower-dimensional feature maps. The diffusion process here turns noise into valid latent vectors, so that we can decode them into high-resolution images using the decoder part of the pre-trained variational autoencoder.

We don't need the encoder part of the autoencoder to generate images from scratch, but it can still be used in image-to-image tasks. To modify an image, for example, we can pass it through the encoder and run the diffusion process for some number of steps on this encoded latent vector rather than starting with pure noise. We can inpaint or expand images this way too. We can mask images, encode them, and run the diffusion process to fill in the gaps.
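
One way to picture "some number of steps" is as a fraction of the full schedule. In the sketch below, `strength` is a made-up knob for this example: the encoded image is assumed to be noised up to the first time step in the returned list, and only those steps are then denoised.

```python
def img2img_timesteps(num_steps, strength):
    """Time steps to run when starting from an encoded image.

    strength=1.0 runs the whole schedule (the input is fully noised),
    strength=0.0 runs nothing (the input comes back unchanged).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = int(num_steps * strength)
    timesteps = list(range(num_steps - 1, -1, -1))  # noisiest step first
    return timesteps[num_steps - steps_to_run:]

print(img2img_timesteps(10, 0.5))  # [4, 3, 2, 1, 0]
```

The lower the strength, the more of the original image survives, because we skip the noisiest part of the process.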

So far, we've covered how to generate images, but how do we get them to generate what we want? How do we go from text prompts to images?

The short answer is that we use a tokenizer and text encoder to turn text into tokens and then into embedding vectors. Then we condition the diffusion model on those text embeddings.

Stable Diffusion uses OpenAI's CLIP as its text encoder. CLIP is a text and image encoder that was pre-trained on text-image pairs to learn a shared embedding space for text and images, where related images and text are close to each other. In this context, though, this multi-modality is not strictly necessary. It's possible to replace CLIP with a text-only encoder, which is what Google's Imagen did.

One advantage of using a multimodal encoder like CLIP is that it allows for both text and images as inputs. That's more or less how DALL-E 2 takes an image as input and generates variations of it. DALL-E 2 further aligns the text and image embeddings using a prior, but let's not digress.

Unlike DALL-E 2, Stable Diffusion uses CLIP purely as a text encoder. It uses the embeddings from the layer before the last one, which is not shared with the image encoder at that point.

These embeddings are used to condition the U-Net based diffusion model on the input text prompt.

If you've tried Stable Diffusion before, you may have noticed a parameter named the guidance scale. The higher the scale, the stronger the effect the prompt has on the generation, while a lower scale results in a more subtle influence.

Under the hood, this is how it works: given a random noise vector as input, we run the model twice at every step, once conditioned on the prompt and once unconditionally. We take the difference between the two noise predictions, multiply it by the guidance scale, and add it to the unconditional prediction.

This method is called classifier-free guidance and it essentially amplifies the effect of the prompt on the results, by moving further in the direction of the prompt.
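
In code, the core of it is basically one line. This is just a sketch on plain lists of numbers, where the two inputs stand for the model's noise predictions with and without the prompt. The function and argument names are made up for this example.

```python
def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale):
    """Start at the unconditional prediction and move toward the prompt."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

noise_uncond = [0.0, 0.5]
noise_cond = [1.0, 1.0]
print(classifier_free_guidance(noise_uncond, noise_cond, 1.0))  # [1.0, 1.0]
print(classifier_free_guidance(noise_uncond, noise_cond, 7.5))  # [7.5, 4.25]
```

A scale of 1 gives back the conditional prediction as-is, a scale of 0 ignores the prompt, and anything above 1 overshoots in the direction of the prompt.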

With a small tweak, we can use this technique to have negative prompts as well. Instead of using an unconditional prediction as the baseline, we can use the prediction for a negative prompt there. We can push the images further away from the negative prompt by scaling the difference between the noise predictions for the positive and negative prompts. Negative prompts can be used to remove objects, properties, styles, or qualities from generated images.
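
Here is a tiny sketch of that tweak, again on plain lists of numbers with made-up names. The only change from regular guidance is which prediction serves as the baseline.

```python
def negative_prompt_guidance(noise_negative, noise_positive, guidance_scale):
    """Use the negative-prompt prediction as the baseline to move away from."""
    return [
        n + guidance_scale * (p - n)
        for n, p in zip(noise_negative, noise_positive)
    ]

noise_negative = [1.0]
noise_positive = [0.0]
print(negative_prompt_guidance(noise_negative, noise_positive, 2.0))  # [-1.0]
```

In this toy example, the result lands past the positive prediction, on the side opposite the negative one, which is exactly the "push away" effect.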

Alright, that was pretty much it. I hope you found it helpful and interesting. Thanks for watching, and see you next time.