At KickView, almost all our multi-sensor processing applications utilize machine learning (and deep learning in particular) to make sense of data. In previous blog posts, David showed how to train and apply Fully Convolutional Networks (FCNs) to detect cars and pedestrians, Convolutional Neural Networks (CNNs) to detect signals and classify modulation type, and more. There is little doubt among academics and industry leaders that deep learning has set the standard for the current state-of-the-art, particularly for any sort of labeling or object recognition task.
At KickView, we like the idea of using an ensemble of different techniques to solve a problem. The real world is messy - there is very rarely a case where a single algorithm or method can solve your entire problem in one shot. More often than not, it's a combination of algorithms working in-tandem that yields a solution that is robust to dynamic conditions that you'd expect to see in the field.
In this blog post, we'll talk through the math and implementation behind a classical computer vision technique called background subtraction, which can be used to detect moving objects in videos. We'll show how to actually generate an estimate of the background, and then use it to perform detection on moving objects.
Problem Motivation and Theoretical Formulation
Imagine a stationary camera overlooking a traffic intersection, collecting color video (RGB) over time. Let's say we're interested in detecting moving objects in the scene over time. Since we're dealing with a stationary camera, background subtraction is a great tool to help us solve this problem.
The idea behind background subtraction (also commonly referred to as foreground detection) is to separate the image's foreground from the background. If we have a good idea of what the foreground is, we can extract these segments from the image and perform any follow-on processing that we choose.
How would we go about separating the background from the foreground? We can tackle this problem statistically: by building a histogram of the RGB values at each pixel over a specified number of frames, we can fit that histogram to a probability distribution. A common choice is the Normal distribution, and we'll make that assumption here. We'll take it one step further and assume the Normal distribution at each pixel is unimodal. This assumption turns out to be valid for non-complex regions in RGB space (such as roads), which works out well for this example because we'll be using this technique to detect cars on roads.
Once we have an estimate for the mean at each pixel, we can assemble those means into an average image that represents the statistical background. By also tracking the variance at each pixel, we can flag foreground pixels by measuring how far a new intensity value falls from the distribution. There are many ways to define this distance; for this example, we use a Mahalanobis distance scaled across all channels. Our formulation for how we approach background subtraction is shown above in Figure 1.
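To make this concrete, here's a minimal sketch of the idea in NumPy. The function names (`fit_background`, `mahalanobis_distance`) are our own for illustration, and we assume a diagonal covariance (independent variances per channel), which is what a per-channel scaled Mahalanobis distance implies:

```python
import numpy as np

def fit_background(frames):
    """Estimate a per-pixel background model from a stack of frames.

    frames: array of shape (T, H, W, 3).
    Returns (mean, var), each of shape (H, W, 3).
    """
    stack = np.asarray(frames, dtype=np.float64)
    mean = stack.mean(axis=0)
    var = stack.var(axis=0) + 1e-6  # small floor to avoid division by zero
    return mean, var

def mahalanobis_distance(frame, mean, var):
    """Per-pixel distance from the background model, scaled across
    all channels (diagonal covariance assumption)."""
    d2 = ((frame.astype(np.float64) - mean) ** 2 / var).sum(axis=-1)
    return np.sqrt(d2)
```

A pixel whose distance is large relative to the model is a candidate foreground pixel; we'll put a concrete threshold on that distance later.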
Let's see how this technique works on some real video. We collected video of the road outside the parking lot at our office in DTC, and we'll use it to demonstrate the technique below.
Computing the Background
We extracted image frames from this video, and then ran the background subtraction algorithm with a history of 25 frames. This means that we accumulate values into the RGB histogram for 25 frames before computing a mean image. For every subsequent frame, we add its values to the histogram and remove the oldest frame's values, so the background adapts to changing conditions in the video. Figures 2 and 3 below show the original collected video, as well as the video generated from background subtraction.
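The sliding history can be maintained efficiently with running sums rather than re-averaging the whole window each frame. The class below is a sketch of that bookkeeping (the name `SlidingBackground` and the running-sum trick are our illustration, not a specific library API):

```python
import collections
import numpy as np

class SlidingBackground:
    """Background model over the last `history` frames.

    Keeps running sums of the frames and their squares, so the mean and
    variance update in O(pixels) per frame instead of O(pixels * history).
    """
    def __init__(self, history=25):
        self.history = history
        self.frames = collections.deque()
        self.sum = None
        self.sum_sq = None

    def update(self, frame):
        f = np.asarray(frame, dtype=np.float64)
        if self.sum is None:
            self.sum = np.zeros_like(f)
            self.sum_sq = np.zeros_like(f)
        self.frames.append(f)
        self.sum += f
        self.sum_sq += f * f
        if len(self.frames) > self.history:
            old = self.frames.popleft()  # drop the oldest frame's contribution
            self.sum -= old
            self.sum_sq -= old * old

    @property
    def mean(self):
        return self.sum / len(self.frames)

    @property
    def var(self):
        n = len(self.frames)
        return self.sum_sq / n - self.mean ** 2
```

With `history=25` the model reacts quickly to scene changes; a longer history (say, 100 frames) smooths out transient objects at the cost of slower adaptation.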
Pretty cool! We see a neat ghosting effect of a red truck that pulls into the parking lot, drives around for a bit, and pulls out. This ghosting effect is purely a function of the number of frames we are using to compute the histogram. If we had selected 100 frames instead of 25, this effect would be much less pronounced, perhaps not visible at all.
Detecting Moving Objects
Now that we have a good sense of how the background subtractor works on real data, let's take the next step and use it to detect moving objects in the frame. Since we've already got an estimate of the background, we can take one of our original frames, and for each pixel in this frame, compute the Mahalanobis distance across each channel. If this distance crosses a specified threshold, we can declare the pixel a foreground pixel.
Figure 4 above shows the processing steps we've taken to obtain bounding boxes on the moving cars for a single frame. First, we start with the original image shown in (1). In (2), we apply a Mahalanobis distance threshold of 3 (essentially flagging any pixel that falls more than 3 standard deviations from the distribution mean). In (3), we clean up the raw thresholded image by applying a median filter to remove speckle and dilating the pixels to obtain blobs. In (4), we cluster these blobs together and draw bounding boxes around them.
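Those cleanup steps can be sketched with `scipy.ndimage`. This is one reasonable implementation of the pipeline, assuming a per-pixel distance map as input; the threshold, filter size, dilation count, and minimum blob area are illustrative knobs you'd tune for your own footage:

```python
import numpy as np
from scipy import ndimage

def detect_moving_objects(dist, threshold=3.0, min_area=50):
    """From a per-pixel Mahalanobis-distance map, produce bounding boxes.

    Mirrors the steps in the figure: threshold, median-filter to remove
    speckle, dilate into blobs, then label connected blobs and box them.
    Returns boxes as (x0, y0, x1, y1) tuples.
    """
    mask = dist > threshold                                   # (2) 3-sigma threshold
    mask = ndimage.median_filter(mask.astype(np.uint8), size=3) > 0  # (3) despeckle
    mask = ndimage.binary_dilation(mask, iterations=3)        # (3) grow into blobs
    labels, _ = ndimage.label(mask)                           # (4) cluster blobs
    boxes = []
    for ys, xs in ndimage.find_objects(labels):
        h, w = ys.stop - ys.start, xs.stop - xs.start
        if h * w >= min_area:                                 # drop tiny detections
            boxes.append((xs.start, ys.start, xs.stop, ys.stop))
    return boxes
```

The `min_area` filter is a cheap way to reject residual noise blobs that survive the median filter.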
The detection performance isn't bad! One nice thing about this approach is that we only form bounding boxes on the moving cars. In contrast, something like an FCN or CNN would put a bounding box around the parked cars as well as the moving cars.
Classical computer vision techniques like background subtraction can be great additions to have in your back pocket to use when appropriate. While there's no doubt that modern deep learning techniques outperform classical techniques generally, it's all about using the right tool for the job. The real magic happens when you figure out the right set of tools to use, and critically, how to combine them intelligently.
We'd love to hear from you! Have you had success applying classical computer vision techniques to your problems? Have you found it a good augment to deep learning?