VisionISP: an Image Processing Pipeline for Computer Vision Applications

September 4, 2019

When we think of cameras, we usually think of humans using them to take pictures. Humans are still the primary consumers of the images cameras capture, but machines have also started consuming images on a very large scale.

Today, virtually all computer vision applications use cameras optimized for human viewers. The image signal processors (ISPs) in those cameras are usually tuned for photography-driven image quality characteristics that matter to the human visual system. However, tuning an imaging pipeline for optimal perceptual image quality does not guarantee optimal results for computer vision tasks. For instance, heavier noise reduction might produce better-looking images, whereas a machine vision system might benefit from tolerating a higher level of noise in exchange for retaining more information. An image processed for optimal computer vision performance can look vastly different from an image processed for optimal perceptual image quality.

We propose a set of processing blocks, which we collectively call the VisionISP, to build an image processing pipeline optimized for machine consumption.

The blocks in VisionISP are simple, content-aware, and trainable.

The first block is a computer-vision-driven denoiser that tunes an existing ISP without modifying the underlying algorithms or hardware design. Our tuning algorithm adjusts the denoising parameters to minimize a high-level content loss on the denoised images. We compute this loss on feature maps extracted from a target neural network that is pre-trained to perform a particular computer vision task. In this way, the denoiser learns to preserve what's important in the image for the target machine vision task.
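To make the idea concrete, here is a minimal PyTorch sketch of feature-space denoiser tuning. The Gaussian-blur "denoiser", the MobileNetV2 backbone, and the simple grid search over sigma are all placeholder assumptions chosen for illustration; they are not the actual ISP denoiser, target network, or tuning procedure used in VisionISP.

    import torch
    import torchvision.transforms.functional as TF
    from torchvision.models import mobilenet_v2

    # Placeholder for a hardware denoiser: a Gaussian blur whose strength
    # (sigma) stands in for a tunable ISP denoising parameter.
    def isp_denoise(img, sigma):
        return TF.gaussian_blur(img, kernel_size=7, sigma=sigma)

    # Feature extractor from a network pre-trained on the target task
    # (an ImageNet-trained backbone is used here purely as a stand-in).
    backbone = mobilenet_v2(weights="IMAGENET1K_V1").features.eval()

    def content_loss(denoised, reference):
        # High-level content loss: distance between feature maps rather
        # than between pixels.
        with torch.no_grad():
            return torch.nn.functional.mse_loss(backbone(denoised), backbone(reference))

    def tune_denoiser(noisy, reference, candidate_sigmas):
        # Pick the denoiser setting that best preserves the features the
        # target network relies on.
        losses = [content_loss(isp_denoise(noisy, s), reference) for s in candidate_sigmas]
        return candidate_sigmas[int(torch.tensor(losses).argmin())]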

The second block in our pipeline is a trainable local tone mapping operator, which reduces the bit depth of its input. Reducing the bit depth translates into simpler hardware, reduced bandwidth, and significant power savings. Unlike uniform bit-depth reduction, our method ensures that the features that are essential for computer vision applications survive the bit-depth reduction. We do this by applying a global non-linear transformation followed by a local detail boosting operator before the bit depth is reduced. The non-linear transform acts as a trainable global tone mapping operator, while the detail boosting operator acts locally to preserve the details in the low-bit-per-pixel output.
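The following sketch illustrates the general shape of such a block, assuming a learnable gamma curve as the global transform, an unsharp-mask-style gain as the local detail boost, and a straight-through estimator for the quantization step. These specific choices are illustrative assumptions, not the exact operator used in VisionISP.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToneMapBitReduce(nn.Module):
        def __init__(self, out_bits=6):
            super().__init__()
            self.gamma = nn.Parameter(torch.tensor(0.45))        # global non-linear transform
            self.detail_gain = nn.Parameter(torch.tensor(1.5))   # local detail boosting
            self.levels = 2 ** out_bits - 1

        def forward(self, x):                                    # x in [0, 1], high bit depth
            tone = x.clamp(1e-6, 1.0) ** self.gamma              # trainable global tone curve
            base = F.avg_pool2d(tone, 5, stride=1, padding=2)    # local low-pass estimate
            boosted = base + self.detail_gain * (tone - base)    # boost local detail
            boosted = boosted.clamp(0.0, 1.0)
            # Quantize to the target bit depth; the straight-through trick
            # keeps the block trainable end to end.
            q = torch.round(boosted * self.levels) / self.levels
            return boosted + (q - boosted).detach()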

The last block in our pipeline is a very simple convolutional neural network that acts as a preprocessor for a subsequent computer vision task. The first layer in this block is a 1x1 convolution that can be thought of as a trainable color space converter, finding an optimal color space automatically. The next layer is a 7x7 convolution that extracts low-level features, such as edges and textures, while reducing the input resolution. This layer has a flexible stride parameter that allows the downscaling factor to be adjusted without retraining.

Many computer vision systems use downscaled images to be able to run in real time. However, conventional downscaling methods, such as bilinear interpolation, are content-agnostic, so small but important details in a scene, such as pedestrians, can easily be lost during downscaling. Our feature extraction layer processes full-resolution data and downscales the image without dropping the features that later stages will need.

The final layer in the block projects the output feature maps into three channels since computer vision systems typically expect 3-channel images as inputs.
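Putting the three layers together, a minimal sketch of this preprocessor could look like the following. Only the 1x1 / strided 7x7 / 3-channel-projection structure comes from the description above; the channel count, activation, and stride values are illustrative assumptions.

    import torch
    import torch.nn as nn

    class VisionPreprocessor(nn.Module):
        def __init__(self, in_ch=3, mid_ch=16, stride=2):
            super().__init__()
            # 1x1 convolution: a trainable color space conversion.
            self.color = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
            # 7x7 convolution: low-level feature extraction with a flexible
            # stride that sets the downscaling factor.
            self.features = nn.Conv2d(mid_ch, mid_ch, kernel_size=7,
                                      stride=stride, padding=3)
            # Final projection to 3 channels, since downstream vision models
            # typically expect 3-channel inputs.
            self.project = nn.Conv2d(mid_ch, 3, kernel_size=1)

        def forward(self, x):
            x = torch.relu(self.color(x))
            x = torch.relu(self.features(x))
            return self.project(x)

    model = VisionPreprocessor(stride=2)
    out = model(torch.rand(1, 3, 512, 512))    # -> (1, 3, 256, 256)

    # The stride of the 7x7 layer can be changed after training to trade
    # resolution for speed without touching the learned weights.
    model.features.stride = (4, 4)
    out2 = model(torch.rand(1, 3, 512, 512))   # -> (1, 3, 128, 128)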

Although those 3-channel inputs do not look natural to human viewers when visualized as pseudo-color images, they "look good" to machines in the sense that they provide a very efficient representation of what a camera should feed into a computer vision system to perform well.

For example, this demo video shows how object detection results look on frames processed by the VisionISP as compared to a basic image signal processor that is not optimized for computer vision.

The VisionISP significantly reduces data transmission needs between an image signal processor and a computer vision engine while preserving the information relevant for a computer vision system.

Take a look at our paper to see our experimental results and learn more about how VisionISP works: