Choosing a Convolutional Neural Network Architecture for Real-Time Object Tracking (Part 1)

In a previous blog post we talked about how to train a convolutional neural network (CNN) for object detection in images. Object detection combines classification and localization. One use for object detection is in the problem of object tracking in video data. Advances in deep learning have opened doors to new frontiers in the ability to track objects over time. Applications for automated tracking systems include security networks, autonomous vehicles, and defense systems, among many others. There are, however, some significant challenges in the implementation of real-time object tracking, and the constraints they impose can guide us toward effective CNN architectures.

The Convolution Revolution

Ever since AlexNet used a convolutional neural network approach to win the classification challenge of the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2012, there has been an explosion of interest in CNNs and deep learning in general. Iterative advances on AlexNet such as VGGNet, InceptionNet, and ResNet have become so good at image classification that it is now considered by many to be a solved problem in the field of computer vision. Many researchers have turned their attention to the much more difficult problems of object detection and similar tasks such as semantic segmentation and instance segmentation.

The application of CNNs to object detection began shortly after the introduction of AlexNet. There are now a multitude of object detection strategies that utilize deep learning, and state-of-the-art performance on standard datasets has become quite impressive. While object detection by itself may be useful for certain real-world applications, it is still mostly an academic curiosity without additional machinery to put it to work in the real world. At KickView we fuse many of these cutting-edge detection technologies with additional hardware and software solutions to perform real-time object tracking.

Real-Time Tracking Presents Unique Challenges

Object tracking requires an iterative and intelligent combination of detecting objects and following them. Once objects have been detected, the software must keep track of which object is which and when objects enter or leave the area of monitoring. This often involves predicting where an object is likely to be found in each successive frame, which can be done using Kalman filters or similar statistical methods.
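To make the prediction step a little more concrete, here is a minimal sketch of a one-dimensional, constant-velocity Kalman filter in plain Python. It is an illustration only, not the tracker we run in production: the function names, the noise values, and the example track are all assumptions chosen for the demo, and a real tracker would filter both image axes (and often the box size) per object.

```python
def predict(state, cov, dt=1.0, process_noise=0.01):
    """Project the state (position, velocity) one frame ahead.

    Uses the constant-velocity model F = [[1, dt], [0, 1]] and adds
    process_noise (an assumed value) to the diagonal of the covariance.
    """
    p, v = state
    (a, b), (c, d) = cov
    new_state = (p + v * dt, v)
    # P' = F P F^T + Q, written out for the 2x2 case.
    new_cov = (
        (a + dt * c + dt * (b + dt * d) + process_noise, b + dt * d),
        (c + dt * d, d + process_noise),
    )
    return new_state, new_cov


def update(state, cov, measurement, measurement_noise=1.0):
    """Correct the prediction with a measured position (H = [1, 0])."""
    p, v = state
    (a, b), (c, d) = cov
    residual = measurement - p
    s = a + measurement_noise          # innovation variance
    k_p, k_v = a / s, c / s            # Kalman gain
    new_state = (p + k_p * residual, v + k_v * residual)
    # P = (I - K H) P', written out for the 2x2 case.
    new_cov = (
        ((1 - k_p) * a, (1 - k_p) * b),
        (c - k_v * a, d - k_v * b),
    )
    return new_state, new_cov


# Hypothetical track: a detection that moves 2 pixels per frame along one axis.
measurements = [2.0 * k for k in range(1, 11)]
state, cov = (0.0, 0.0), ((100.0, 0.0), (0.0, 100.0))
for z in measurements:
    state, cov = predict(state, cov)
    state, cov = update(state, cov, z)

# Where should we look for the object in the next frame?
next_state, _ = predict(state, cov)
print(f"predicted position in the next frame: {next_state[0]:.2f}")
```

The predicted position gives the tracker a place to look for the matching detection in the next frame, which is what lets it keep object identities straight from frame to frame.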

In this post we will only discuss the challenges of performing the detections. Clearly the performance of a tracking system is dependent on the ability to produce fast and accurate detections on incoming video frames. In particular, there are three significant challenges that we face when applying an object detector to the task of object tracking.

1. Tracking requires high recall and high precision.

When discussing machine learning algorithms we sometimes talk about performance in terms of accuracy, but for real-world applications it is often more useful to consider an algorithm’s precision and recall. In this case, precision is the proportion of detections that correspond to real objects. Recall is the ratio of objects detected to the total number of objects that were detectable in the data. To understand how these apply to object tracking, we can look at the example case of an automated security system. In such a system, low precision in the detection algorithm would lead to a lot of false alarms. Conversely, low recall would mean that some real security threats are never detected at all. Clearly we need both precision and recall to be high for an automated tracker.
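The two definitions above reduce to a few lines of arithmetic. The sketch below is only an illustration: the function name and the counts are made up, and in practice the counts come from matching detections against labeled boxes.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from detection counts."""
    detections = true_positives + false_positives   # everything the detector reported
    objects = true_positives + false_negatives      # everything that was really there
    precision = true_positives / detections if detections else 0.0
    recall = true_positives / objects if objects else 0.0
    return precision, recall


# Hypothetical counts: 80 correct detections, 20 false alarms, 40 missed objects.
precision, recall = precision_recall(80, 20, 40)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the security example, the 20 false positives are the false alarms and the 40 false negatives are the threats that were never flagged.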

In the literature, object detection algorithms are normally compared on the metrics of mean Average Precision (mAP) and Intersection over Union (IoU). We will not go into the details of these metrics in this post, but it is important to realize that a high score on one of these metrics does not guarantee that the algorithm will perform well in terms of both recall and precision on your data. We plan to compare accuracy metrics of various networks in Part 2 of this blog post. However, before accuracy can even be considered, the next two challenges must be addressed.
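Without going into the details of mAP, the IoU part is simple enough to show. The following is a small sketch, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max); the function name and the box format are our own choices for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# A predicted box shifted by half its size relative to the labeled box.
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))
```

A detection is usually counted as correct only when its IoU with a labeled box exceeds some threshold, so the choice of threshold feeds directly into the precision and recall numbers.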

2. Tracking must run in real-time.

This one is pretty self-explanatory. In the case of an autonomous vehicle, the car clearly needs to be able to track other cars in real time in order to avoid collisions. To do this, the detection part of the tracking program needs to be able to keep up with the video frames as they come in. Frame rates vary, but any useful network must be able to run forward passes on single images at a rate of at least several frames per second. The exact speed requirements will depend on the application. Later in this post we will compare the results of runtime tests we performed on several different networks.

3. It may be difficult or impossible to obtain a large amount of labeled data for your tracking application.

Labeling images for classification is a labor-intensive process. Labeling images for object detection is another thing entirely. This requires manually drawing and recording bounding boxes for every object that should be detected in each image. In many cases it may not be possible to obtain the thousands or millions of labeled images required to effectively train a deep CNN.

Fortunately, there are techniques available to mitigate this problem. As with any deep learning problem, we can employ a variety of data augmentation techniques such as flipping images and introducing Gaussian noise. We can also employ ensemble methods to boost accuracy. Most importantly, we can use transfer learning: pre-train a network on a large dataset and then fine-tune it on a smaller one.
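One detail that is easy to miss when augmenting detection data is that the bounding box labels have to be transformed along with the pixels. Here is a minimal sketch of the two augmentations mentioned above, using plain nested lists as a stand-in for an image; the function names, the box format, and the noise level are assumptions for illustration.

```python
import random


def flip_horizontal(image, boxes):
    """Mirror an image (a list of pixel rows) and its boxes left to right.

    Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # The left and right edges swap roles after mirroring.
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for (x_min, y_min, x_max, y_max) in boxes
    ]
    return flipped_image, flipped_boxes


def add_gaussian_noise(image, sigma=5.0, rng=random):
    """Add zero-mean Gaussian noise to every pixel, clamped to 0..255."""
    return [
        [min(255.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


# A tiny 2x4 grayscale "image" with one labeled box on its left side.
image = [[10, 20, 30, 40], [50, 60, 70, 80]]
boxes = [(0, 0, 1, 2)]

flipped_image, flipped_boxes = flip_horizontal(image, boxes)
noisy_image = add_gaussian_noise(image, sigma=5.0, rng=random.Random(0))
print(flipped_boxes)
```

Noise leaves the boxes untouched, while any geometric augmentation (flips, crops, scaling) needs a matching update to the labels.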

Recent work has shown that it is possible to convert a CNN trained for image classification into a Fully Convolutional Network (FCN)[1]. This is done by converting the fully-connected layers into convolutional layers. When doing so, we can keep the weights learned during training for classification. Then the modified network can be fine-tuned for use in an object detection architecture. The details of creating an application for FCNs are beyond the scope of this post. However, there are two important benefits of using FCNs as part of an object detector network: one, FCNs can be employed on images that vary in shape and size, and two, FCNs can be pretrained on huge classification datasets and fine-tuned with a much smaller amount of labeled object detection data.
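The conversion is easier to believe once you see it on a toy example. The sketch below uses a single-channel feature map and plain Python lists, and all names and numbers are made up for illustration. It reshapes each row of a fully-connected weight matrix into a kernel, then checks that the convolution gives the same answer as the original layer on an input of the original size, while also accepting a larger input.

```python
def fully_connected(flat_input, weights, biases):
    """A plain fully-connected layer: one dot product per output unit."""
    return [
        sum(w * x for w, x in zip(row, flat_input)) + b
        for row, b in zip(weights, biases)
    ]


def fc_to_conv_kernels(weights, height, width):
    """Reshape each FC weight row (length height*width) into a height x width kernel."""
    return [
        [row[r * width:(r + 1) * width] for r in range(height)]
        for row in weights
    ]


def conv2d_valid(image, kernels, biases):
    """Slide each kernel over the image (no padding); output[filter][row][col]."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            [
                sum(kernel[i][j] * image[r + i][c + j]
                    for i in range(kh) for j in range(kw)) + bias
                for c in range(out_w)
            ]
            for r in range(out_h)
        ]
        for kernel, bias in zip(kernels, biases)
    ]


# A toy FC layer with 2 outputs that expects a flattened 2x2 feature map.
weights = [[1.0, 0.0, -1.0, 2.0], [0.5, 0.5, 0.5, 0.5]]
biases = [0.0, 1.0]
kernels = fc_to_conv_kernels(weights, height=2, width=2)

patch = [[1.0, 2.0], [3.0, 4.0]]
fc_out = fully_connected([1.0, 2.0, 3.0, 4.0], weights, biases)
conv_out = conv2d_valid(patch, kernels, biases)      # 1x1 map per output unit

larger = [[1.0, 2.0, 0.0], [3.0, 4.0, 1.0], [0.0, 1.0, 2.0]]
larger_out = conv2d_valid(larger, kernels, biases)   # 2x2 map per output unit
print(fc_out, conv_out, len(larger_out[0]), len(larger_out[0][0]))
```

The weights are reused unchanged, which is why the classification pre-training carries over, and the larger input simply produces a spatial map of outputs instead of an error.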

Figure 1: An example of a Fully Convolutional Network (from [1])

Comparison of Three Object Detection Architectures

Note: This section draws heavily on [2], a paper by Google Research comparing object detection networks.

Now that we have discussed the challenges of applying an object detection network to tracking tasks, we can ask: what options are available for state-of-the-art object detection? As described in [2], there are essentially three main types (“meta-architectures”) of object detection networks, each with its own advantages and disadvantages. These are variants of:

  • Faster R-CNN
  • R-FCN
  • SSD (Single Shot Detector)

First let us discuss what features they all have in common. Each of these uses a fully convolutional version of an image classification network as a feature extractor. The feature extractor networks are pretrained on a large image classification dataset such as ImageNet. In the literature that introduced the various object detectors, a variety of networks have been used as feature extractors.

The purpose of the feature extractor is to efficiently reduce the dimensionality of the images, while preserving important visual and spatial information. This is a difficult concept to explain without a digression into linear algebra, but the important point here is that any good classification network can be converted to an FCN and used (with varying degrees of performance) as the feature extractor in any of these three object detection meta-architectures.

Figure 2: High level architectures of the three types of object detectors

There are many implementation details that vary between these three types of detectors. One important difference is how the networks handle region proposal. Faster R-CNN and R-FCN both have an integrated FCN that is trained to propose regions of interest (RoIs). These regions are then passed along to the last layers of the net for classification and bounding box regression. R-FCN is so named because it maintains its fully convolutional structure right down to the final classification layer, whereas Faster R-CNN does not. This gives R-FCN a slight speed advantage over Faster R-CNN, but otherwise the two are quite similar.

SSD detectors (including YOLO) remove the need for a region proposal network. Instead, they start with a grid of predetermined regions, each of which is classified and refined with bounding box regression. Inference takes place in a single forward pass, thus the name ‘Single Shot’. SSD detectors are much faster than the other types and achieve moderately good accuracy scores on certain types of data. Due to the built-in spatial limitations of the initial proposal grid, SSD detectors struggle to pick out small objects within the test images.
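To give a feel for what a grid of predetermined regions looks like, here is a simplified sketch that lays out default boxes at the center of every grid cell. It is not taken from any particular implementation: the function name, the grid size, and the scale are assumptions, and real detectors typically repeat this over several feature maps with different scales. The width and height follow the common convention of scale times (or divided by) the square root of the aspect ratio.

```python
def default_boxes(grid_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return default boxes as (cx, cy, w, h) in normalized [0, 1] image coordinates."""
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            # Center of this grid cell.
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            for ratio in aspect_ratios:
                width = scale * ratio ** 0.5
                height = scale / ratio ** 0.5
                boxes.append((cx, cy, width, height))
    return boxes


# Hypothetical 4x4 grid with boxes covering roughly 20% of the image side.
boxes = default_boxes(grid_size=4, scale=0.2)
print(len(boxes), boxes[0])
```

Every one of these boxes is scored in the same forward pass, which is where the speed comes from. It also hints at the small-object problem: an object much smaller than a grid cell has no default box that fits it well.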

In general, [2] found that there is a tradeoff between speed and accuracy that comes from the overhead of generating and testing RoIs, and that no one network can be considered best for all use cases.

Time Trials on DGX

As we excitedly announced in a previous post, KickView recently received our own NVIDIA DGX Station, which as of the time of this writing is pretty much the fastest deep learning platform that you can get your hands on. So naturally, we were curious to test some of these networks on our own beast. Below is a summary of some selected results of the time trials.

Table 1: Results of Time Trials on NVIDIA DGX Station

Name        Meta-Architecture  Library     Feature Extractor  Pre-Training Data  Frames/Sec
SSD         SSD                TensorFlow  VGGNet             Pascal VOC         69.4
YOLO        SSD                TensorFlow  AlexNet            Pascal VOC         28.7
Mask R-CNN  Faster R-CNN       TensorFlow  ResNet101          COCO               4.2
DetectNet   SSD                Caffe       InceptionNet       KITTI              31.3

These numbers are more or less in line with the findings in [2]. Clearly these few networks represent only a tiny fraction of all the possible combinations of architectures, libraries, and feature extractors. But we can see that the SSD detectors are much faster than Mask R-CNN (which in addition to detection also performs segmentation, as seen below in Figure 3).

Figure 3: The output on one of our test images using Mask R-CNN

Note that these times represent inference with a batch size of 1 (one image at a time) on only one of the DGX’s four GPUs. If given the opportunity to process multiple images in parallel, the DGX can achieve a much higher frame rate. But in order to perform true real-time tracking, the algorithm must be able to process video frames as they come in.
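For readers who want to run a similar test, a batch-size-1 time trial can be as simple as the sketch below. This is not the harness we used: measure_fps and dummy_detect are hypothetical names, and the dummy detector just sleeps for a few milliseconds in place of a real forward pass.

```python
import time


def measure_fps(detect, frames):
    """Run detect on one frame at a time and return frames processed per second."""
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def dummy_detect(frame):
    """Stand-in for a network forward pass (assumed to take about 5 ms)."""
    time.sleep(0.005)
    return []


# Ten placeholder frames; a real test would feed decoded video frames.
frames = [None] * 10
fps = measure_fps(dummy_detect, frames)
print(f"{fps:.1f} frames/sec at batch size 1")
```

With a real network it is worth running a few warm-up frames before starting the clock, since the first forward passes often include one-time setup costs.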

As of now it seems that only the Single Shot Detectors are fast enough to perform true real-time tracking. The region-proposal based detectors may still be of use if tuned properly. However, at only a few frames per second on a very fast computer, it is unlikely that these nets will be of much use in the low power embedded systems that are becoming more prevalent as the edge-computing paradigm catches on.
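One way to read Table 1 is to turn each frame rate into a per-frame latency and compare it with the time available between frames. The sketch below does this for the measured rates, assuming a 30 frames-per-second camera; that camera rate is our assumption for illustration, and your application may need more or less.

```python
# Frames/sec measured in Table 1 (batch size 1, one GPU).
measured_fps = {"SSD": 69.4, "YOLO": 28.7, "Mask R-CNN": 4.2, "DetectNet": 31.3}

camera_fps = 30.0                   # assumed video frame rate
budget_ms = 1000.0 / camera_fps     # time available per frame

latency_ms = {name: 1000.0 / fps for name, fps in measured_fps.items()}
keeps_up = [name for name, ms in latency_ms.items() if ms <= budget_ms]

for name, ms in latency_ms.items():
    verdict = "keeps up" if name in keeps_up else "falls behind"
    print(f"{name}: {ms:.1f} ms per frame ({verdict} at {camera_fps:.0f} fps)")
```

Note that the detector is only one stage of a tracker, so in practice it needs to finish with time to spare for the rest of the pipeline.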

Now that we have taken a look at the speed considerations, in Part 2 of the blog we will discuss accuracy metrics and compare the performance of various types of detectors as we continue to share our explorations into real-time object tracking.


[1] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.

[2] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.