Transcript

What's up, everybody! Today we're talking about how digital cameras convert raw images, like this one, into natural-looking pictures, like this one. It's natural to think that what we see on a display is what the camera actually sees right at the sensor. That's not really the case. You wouldn't want to look at a picture that comes directly from the sensor. Several steps of processing need to be done before it looks natural to us.

First, the sensor doesn't really see colors. Before the photons hit the sensor and create a current, they go through a physical color filter called a color filter array. So, in that sense, there is no such thing as #nofilter if you are using a digital camera. All those pictures tagged as #nofilter on Instagram? That's a big lie.

Typically, this filter is a Bayer filter, which has a checkerboard pattern that consists of red, green, blue, and green tiles. I counted the green twice because there are twice as many green tiles as red or blue ones. This was loosely motivated by the distribution of the cone cells in the human eye, as our eyes are more sensitive to green.

When the light goes through this filter, each pixel behind these tiles sees only one color, either red, green, or blue. When we separate the colors out, we end up with separate color channels with holes. These holes are filled by interpolating between the existing pixels. You can think of it as computing the average of the neighboring pixels to fill in the missing ones. This process is called demosaicing. Modern demosaicing algorithms are more sophisticated than simple pixel averaging though. For example, they employ smarter interpolation methods to prevent false color and zipper artifacts.
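
As a rough sketch of the simple averaging idea described above (not what a modern camera actually runs), the following Python function fills in the missing colors of a Bayer mosaic. The function name, the RGGB tile order, and the 3x3 averaging window are assumptions made for illustration, and the input is assumed to be at least 2x2 pixels.

```python
def demosaic_rggb(raw):
    """Fill in the missing colors of a Bayer mosaic by neighbor averaging.

    raw is a list of rows of sensor values (at least 2x2).
    Returns rows of (r, g, b) tuples.
    """
    height, width = len(raw), len(raw[0])

    def channel_at(y, x):
        # Assumed RGGB layout: even rows are R G R G..., odd rows are G B G B...
        if y % 2 == 0:
            return 0 if x % 2 == 0 else 1
        return 1 if x % 2 == 0 else 2

    image = []
    for y in range(height):
        row = []
        for x in range(width):
            sums, counts = [0.0, 0.0, 0.0], [0, 0, 0]
            # Collect every sample in the 3x3 window, grouped by its color.
            for ny in range(max(0, y - 1), min(height, y + 2)):
                for nx in range(max(0, x - 1), min(width, x + 2)):
                    c = channel_at(ny, nx)
                    sums[c] += raw[ny][nx]
                    counts[c] += 1
            pixel = [sums[c] / counts[c] for c in range(3)]
            # Keep the value the sensor actually measured at this location.
            pixel[channel_at(y, x)] = float(raw[y][x])
            row.append(tuple(pixel))
        image.append(row)
    return image
```

Plain averaging like this blurs across edges, which is where the false color and zipper artifacts mentioned above come from.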

After demosaicing, we get a color image, but the colors don't look the way we see them when we take a picture. The picture might not look beautiful right after demosaicing, but in fact, this is what the world actually looks like to a neutral observer. Different types of lighting, such as natural and artificial light sources, have a pronounced impact on the color.

Our vision system has the ability to adjust to changes in illumination in order to preserve the appearance of the colors. For example, a white shirt looks white to us, whether we see it indoors or outdoors. This is what white-balancing algorithms aim to achieve. They compensate for the color differences based on lighting by applying a gain factor to the color channels so that a white shirt appears white regardless of the source of light.

A very simple way to do that is to normalize every color channel by its average so that the mean of all colors in a scene is a perfect gray. This method is called the gray-world algorithm. The gray-world algorithm is simple but far from perfect. It fails in many cases where the gray-world assumption doesn't hold. Modern cameras use more advanced algorithms to handle these cases. These algorithms usually do a decent job, although failures are sometimes inevitable.
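
Here is a minimal sketch of the gray-world idea in Python. The function name and the flat list-of-tuples image layout are assumptions for illustration, and it assumes no channel has a mean of zero.

```python
def gray_world(pixels):
    """Scale each channel so that all three channel means become equal.

    pixels is a flat list of (r, g, b) tuples.
    """
    n = len(pixels)
    # Average of each color channel over the whole scene.
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    # Target gray level: the average of the three channel means.
    gray = sum(means) / 3
    # One gain per channel; a channel that is too strong gets a gain below 1.
    gains = [gray / m for m in means]
    return [tuple(p[c] * gains[c] for c in range(3)) for p in pixels]
```

A scene dominated by one color, say a close-up of green grass, breaks the assumption: the gains would push the grass toward gray.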

In fact, even the human visual system fails to balance colors in some cases. Ambiguities in the perceived lighting might lead to a disagreement between different observers about the color of a certain object, such as this dress. The color of this dress is perceived differently depending on whether your brain assumes that it's under yellow or blue lighting.

Now we have a color-balanced image, but there's still something about the colors that doesn't seem right. The colors seem a little washed out and they don't look very accurate. That's because different sensors produce different results for the same color. These sensor-dependent color values are corrected by using a color correction matrix. A color correction matrix maps the colors from a camera's color space into a standard color space. This matrix is computed by taking pictures of color charts and solving for a 3x3 matrix that minimizes the error between the actual and produced colors.
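
Applying such a matrix is just a 3x3 matrix multiplication per pixel. The sketch below only shows the application step, not the fitting against a color chart. The matrix values in EXAMPLE_CCM are invented for illustration and do not come from any real sensor; each row sums to 1 so that gray pixels stay gray.

```python
# Hypothetical color correction matrix; real ones are fitted per sensor.
EXAMPLE_CCM = [
    [1.6, -0.4, -0.2],
    [-0.3, 1.5, -0.2],
    [-0.1, -0.5, 1.6],
]


def apply_ccm(pixel, ccm=EXAMPLE_CCM):
    """Map one (r, g, b) pixel from camera space to the target space."""
    # Each output channel is a weighted mix of all three input channels.
    return tuple(sum(ccm[i][j] * pixel[j] for j in range(3)) for i in range(3))


gray = apply_ccm((0.5, 0.5, 0.5))      # stays gray, up to rounding
reddish = apply_ccm((0.6, 0.4, 0.4))   # red is pushed further from gray
```

The negative off-diagonal entries are what undo the washed-out look: they subtract a bit of the other channels from each output channel, which increases saturation.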

Camera sensors work in a linear space, meaning that the difference between no light and one light bulb is registered as more or less the same as the difference between 49 and 50 light bulbs. That's obviously not the case for human vision. Switching on the first bulb clearly makes a bigger difference for us than switching on the 50th one. The relation between the actual change in a physical stimulus and the perceived change is roughly logarithmic.

This phenomenon applies to many types of senses. For example, you can easily notice the weight difference between a feather and a book, but you wouldn't be able to tell whether a feather or a book had been added to your backpack if it already had a lot of stuff in it. This is known as Weber's law in the field of psychophysics.

Cameras take advantage of this phenomenon to optimize the usage of bits when encoding an image. They use more bits to encode the dark pixels and fewer bits to encode the bright ones. This is simply done by taking roughly the square root of the intensity values. This process is called gamma compression.
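
To make the bit-allocation point concrete, here is a small Python sketch. The function names are made up for illustration, and gamma=2.0 corresponds to the plain square root mentioned above; real encoding curves are close to this but not identical.

```python
def gamma_encode(linear, gamma=2.0):
    """Compress a linear intensity in [0, 1]; gamma=2.0 is a plain square root."""
    return linear ** (1.0 / gamma)


def gamma_decode(encoded, gamma=2.0):
    """Invert gamma_encode to get back to linear intensity."""
    return encoded ** gamma


def to_8bit(value):
    """Quantize a value in [0, 1] to an integer code in 0..255."""
    return round(value * 255)


# Highest 8-bit code reached by the darkest 1% of the linear range,
# without and with gamma compression.
codes_linear = to_8bit(0.01)
codes_gamma = to_8bit(gamma_encode(0.01))
```

With the square root, the darkest one percent of the linear range is spread over about a tenth of the available codes instead of about one percent of them, so dark tones keep far more distinct levels.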

Raw images are usually noisy, especially in low light, where amplifying the signal also amplifies the noise. There are many advanced denoising algorithms that aim to reduce noise while preserving the content. Some use traditional methods, and some use deep learning. What they all have in common is that they do some sort of pixel averaging.

One strategy is, for example, to capture a burst of frames and average the pixels across the frames in the near-identical parts. When these near-identical parts have a different noise pattern, the noise gets averaged out, resulting in a cleaner image. This approach is called temporal denoising since it averages the pixels through time.
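
A minimal sketch of this burst-averaging idea in Python follows. It assumes the frames are already aligned grayscale images, treats the first frame as the reference, and uses a made-up max_diff threshold to decide which parts count as near-identical. Real pipelines do the alignment and the similarity test far more carefully.

```python
def temporal_denoise(frames, max_diff=20):
    """Average each pixel across a burst of aligned grayscale frames.

    frames is a list of images, each a list of rows of intensities.
    frames[0] is used as the reference frame.
    """
    reference = frames[0]
    result = []
    for y, ref_row in enumerate(reference):
        out_row = []
        for x, ref_value in enumerate(ref_row):
            # Keep only samples close to the reference; large differences
            # are treated as motion rather than noise and are left out.
            samples = [
                frame[y][x]
                for frame in frames
                if abs(frame[y][x] - ref_value) <= max_diff
            ]
            out_row.append(sum(samples) / len(samples))
        result.append(out_row)
    return result


burst = [
    [[100, 100]],
    [[104, 96]],
    [[96, 200]],  # second pixel changed a lot, e.g. something moved
]
clean = temporal_denoise(burst)
```

In this tiny burst, the first pixel is averaged over all three frames, while the second pixel skips the third frame because its value differs too much from the reference.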

Another approach is to average the pixels within a single frame according to some criteria, such as averaging similar neighboring pixels. This type of denoising is called spatial, or single-frame, noise reduction. Spatial denoising algorithms can average pixels that are spatially close to each other, similar to each other, or aligned in image patches that are similar to each other. Many modern camera systems, particularly smartphone cameras, use a combination of both spatial and temporal noise reduction.
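
The simplest version of averaging similar neighboring pixels can be sketched in Python as follows. The function name, the 3x3 window, and the max_diff threshold are assumptions for illustration; it is a crude stand-in for the more advanced spatial filters, not any camera's actual algorithm.

```python
def spatial_denoise(image, max_diff=20):
    """Average each pixel with the similar pixels in its 3x3 neighborhood.

    image is a grayscale image given as a list of rows of intensities.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            center = image[y][x]
            # Neighbors that differ too much are assumed to lie across an
            # edge and are skipped, so edges are not blurred.
            similar = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if abs(image[ny][nx] - center) <= max_diff
            ]
            out_row.append(sum(similar) / len(similar))
        result.append(out_row)
    return result


noisy_edge = [
    [10, 14, 200, 204],
    [12, 12, 202, 202],
]
smoothed = spatial_denoise(noisy_edge)
```

Here the dark and bright halves are each smoothed on their own, and the sharp edge between them is preserved.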

This chain of processes is generally referred to as the camera pipeline. Each camera does it a little bit differently, but this is what the overall pipeline usually looks like. The order of these processing blocks might be different. For example, denoising can be done earlier. There can be additional blocks as well. The ones I showed here are the most basic processing blocks that many cameras use. Other blocks might do things like sharpening, contrast enhancement, local tone mapping, compensating for lens shading, capturing multiple images and combining them to improve the dynamic range or the low-light performance, and so on.

There's still a lot of room for improvement in the camera pipeline. In fact, this entire process might change in the not-so-distant future. There's been a lot of research at the intersection of computational photography and machine learning. You can find some of this work in the description below. Using camera systems that can learn, it would be possible to effectively unify these processing steps by jointly optimizing them, leading to an overall better pipeline. A better image processing pipeline would particularly help improve the image quality that we get from cameras that have stricter physical constraints, such as smartphone cameras. I mean, they already do an amazing job despite having very small sensors. Who would have thought ten years ago that this would be possible? We will see in the future how much more we can push the boundaries of what's possible with digital cameras.

That's all for today. I hope you liked it. Give a thumbs up if you liked this video. Subscribe for more videos like this. If you have any comments, questions, or recommendations for my next videos, let me know in the comments below. As always, thanks for watching, stay tuned, and see you next time.