Image Filters Explained
April 16, 2021
Hi! Have you ever wondered how image filters work? That's what we'll find out in this video.
First of all, what is an image filter? An image filter may mean many things. The types of image filters that we will cover in this video broadly fall into three groups.
The first type is traditional image filters, which manipulate images without adding any additional information, such as blurring, sharpening, and tone-mapping filters. This type relies on basic image processing methods.
The second type is the algorithms that do all sorts of tricks, like giving you puppy ears, big curly hair like I used to have, or a fancy beard. This type typically makes use of computer vision techniques like face landmark detection or body pose estimation.
The third type is neural filters, which can transfer styles between images, manipulate properties of images, colorize grayscale images, and so on. As their name suggests, neural filters use neural networks to do all of these cool things.
In practice, many filters fall into more than one of these categories. For example, this anime face filter most likely uses traditional image filters to adjust the colors, facial landmark detectors to track the face, and neural filters to transform your face into an anime face.
Alright, let's first look at the first type: traditional image filters. Before we understand how image filters work, I think it's important to first understand how images are represented. We can represent an image as a matrix, where the elements correspond to pixel intensity values. A larger number means a brighter pixel, and a smaller number means a darker pixel. An image can have more than one channel. For example, color images use a separate channel for each color component, such as red, green, and blue.
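To make the matrix representation concrete, here is a minimal NumPy sketch of a tiny grayscale image and a color image with separate channels (the pixel values are made up for illustration):

```python
import numpy as np

# A tiny 3x3 grayscale image: 0 is black (darkest), 255 is white (brightest).
gray = np.array([
    [  0, 128, 255],
    [ 64, 128, 192],
    [255, 128,   0],
], dtype=np.uint8)

# A color image adds a channel axis: height x width x 3 (red, green, blue).
color = np.zeros((3, 3, 3), dtype=np.uint8)
color[..., 0] = 255  # set only the red channel: a fully red image
```

Most image libraries hand you exactly this kind of array, so everything that follows is just arithmetic on matrices.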
An image filter is a small matrix that we apply to image matrices to achieve certain effects, such as blurring, sharpening, and edge detection.
To apply a filter to an image, we multiply the filter with an image patch element-wise, sum the products, and repeat this over the entire image by sliding the filter across it. This operation is called convolution or cross-correlation. Although convolution and cross-correlation are not exactly the same, they are usually used interchangeably in the context of image processing.
What we do here is essentially a weighted averaging of neighboring pixels. For example, if we use a filter of ones divided by the number of elements, what we get is a pure mean of the pixels in this window. This is called a box filter, also known as "box blur" or "mean blur" in some applications. The box filter can be used to blur images, but it's rarely used in practice because of the boxy artifacts it creates. We can get rid of those artifacts by putting a larger weight at the center and decaying the weights as the pixels get farther away. This is how Gaussian blur filters work. You can also approximate a Gaussian filter by applying a box filter multiple times. The Gaussian filter is commonly used to reduce noise in images or other kinds of signals. It's simple and fast, but it blurs out everything and doesn't preserve the edges.
What if we don't limit ourselves to linear operations like taking the weighted average of the neighboring pixels? What if we took the median of the pixels within this window instead of the average? That's what the median filter does. The median filter somewhat preserves the edges and is more effective against impulse-like noise, such as shot noise or salt-and-pepper noise. The larger the window size, the more cartoonish the image becomes.
Another commonly used non-linear, edge-preserving filter is the bilateral filter. The bilateral filter is commonly used in skin enhancement because it smooths out small imperfections without blurring the face altogether. Like the other filters we discussed, the bilateral filter also computes a weighted average of neighboring pixels. As its name suggests, it has two components. The first component puts a larger weight on closer pixels, just like the Gaussian filter. The second component puts a larger weight on pixels that have a similar intensity, given the center pixel as a reference. This prevents blurring out the edges too much, since pixels differ more around the edges.
Let's take a look at this example. If you visualize the pixel intensities in this image, it would look something like this. For this pixel here, the weights for pixel distance would look like this: the farther a neighboring pixel is, the smaller its weight. The weights for the pixel intensity difference would look like this: the more different the intensity, the smaller the weight. Multiplied together, this is what the bilateral filter weights would look like. We use these weights to average all the neighboring pixels to get the value of this pixel in the filtered image. When we do this to every pixel in this image, we get smoother surfaces while preserving the hard edges.
There are also more advanced algorithms, such as non-local methods, which make use of similar parts of the image that are not necessarily neighbors. There are also temporal denoisers, which use information from other frames in the same video.
Blurring an image has its use cases, such as reducing noise, enhancing skin, or creating a fake depth-of-field effect, but how about sharpening? People usually like sharper images better than blurry ones. We can sharpen an image using an image kernel, just like the ones we use for blurring. To sharpen an image, we can simply assign negative weights to the neighboring pixels. We can also subtract a blurred version from the original image with some weight to enhance local contrast. This is called unsharp masking.
A filter is called a spatial filter when it uses neighbor pixels in space. It's called a temporal filter if it uses neighbor pixels in time. Some filters use neither spatial nor temporal information. They input a pixel and output a pixel without needing anything else. Those filters are called point operations. They take a pixel and map its value to another value. For example, adding a constant value to all pixels increases brightness. The mean intensity of all pixels tells us about the brightness of an image, whereas their standard deviation tells us about the contrast. So you can adjust the brightness and contrast of an image by adjusting these alpha and beta parameters, which control the mean and standard deviation of pixel intensities.
You can have more control over the mapping functions using the curves adjustment tool, which is available in many image editing applications. These curves basically map the values in the x-axis to the values in the y-axis. If it's a straight diagonal line, it means x equals y, so no changes are made. If you clip it here, for example, all these values will be mapped to the same black value. If you want to add some color tint, you can add some constant value to that color channel.
Saturation and vibrance filters are also point operations, but they operate in a color space different from the standard RGB color space. Instead of representing colors with their red, green, and blue components, sometimes it's more useful to represent them with their hue, saturation, and brightness. In this color space, we can increase the saturation by increasing the values in the saturation channel. Vibrance doesn't have a standard definition, but what image editing tools usually call vibrance is a saturation-to-saturation curve that increases the saturation more for low-saturation pixels than for the ones that are already well saturated.
Alright, now that we know the basics, let's move on to the second group of image filters, the ones that give you puppy ears, a new hairstyle, or facial hair. What those filters have in common is facial landmark detection. Facial landmark detection is an active area of research. Many modern detectors use neural networks. Neural networks are capable of approximating any function given inputs and outputs. In this case, the inputs are images and the outputs are the heatmaps or coordinates of the face landmarks. When trained on annotated face datasets, neural networks can identify predefined points on the face, such as the corners of the eyes, nose, mouth, and so on. When you connect those points together, you get a 3D mesh approximation of your face.
Once you detect facial landmarks, the rest is up to your imagination. You can overlay objects on your face, warp the image to move the landmarks, or add animations triggered by some actions.
Alright, finally let's talk about neural filters. There's no clear definition of neural filters but we can consider any image-to-image transformation that uses a neural network as a neural filter. Given a dataset of pairs of images, neural networks can learn to colorize images, upscale images, turn photos into paintings, and so on. Some neural networks can also learn to disentangle properties of images in some embedding space. For example, when trained on faces, we can manipulate the age, gender, and facial expression of the subject by moving around in this embedding space.
I won't go more into detail on how those neural networks work in this video. If you are interested in learning more about neural networks, you can check out my earlier videos. I'll put the links to the relevant videos in the description.
Alright, that was it for this video. I hope you liked it. Subscribe for more videos and I'll see you next time.