Choosing a Convolutional Neural Network Architecture for Real-Time Object Tracking (Part 2)

This is part 2 of 3 in a series about selecting appropriate network architectures for real-time object tracking. In part 1 we compared the inference speed of various existing object detection networks. Now we turn our attention to accuracy metrics. We will examine some metrics commonly used in object detection, such as mean Average Precision (mAP) and Intersection over Union (IoU), as well as the more general metrics precision and recall. We will try to gain some insight into the value provided by each metric and which are the most important for the specific application of real-time object tracking. Finally, we will discuss the results of a series of tests to compare the performance of several object detectors.

The Difficulties of Comparing Object Detectors

Before launching into a detailed description of various performance metrics, let’s first discuss why there are so many ways to compare different object detectors and why it is often difficult to do so. First of all, consider the term accuracy. In school you were required to take exams where each answer was either correct or incorrect. In this case you were graded in terms of accuracy. You answered 92% of the questions correctly, so you got an 'A'. It didn’t matter what your incorrect answers were. However, in detection tasks it is often valuable to consider the breakdown of the instances that were classified incorrectly.

To understand why, consider the classic case of medical testing. If you get tested for a deadly but treatable disease, it is probably more important to avoid a false negative (in which case you may not receive life-saving treatment) than it is to avoid a false positive (in which case you may receive unnecessary treatment, or be tested again until a better diagnosis is obtained). In general, any binary classifier has a tradeoff between its ability to avoid false positives and its ability to avoid false negatives. Although ideally we would like a classifier to perfectly avoid both types of error, this is generally not attainable, so we must decide which one is more important to the situation at hand. We will discuss this tradeoff further in the next section.

There are two aspects of object detection that further complicate comparison of different detectors. One, object detection is often a multi-class problem. It is important to consider the increasing difficulty that comes from adding additional classes of objects to be detected. A specific detector is trained to detect only certain object types, and it is difficult to properly compare it to a detector that has been trained to detect a different set of objects.

The other complicating factor specific to object detection (but not to image classification) is the lack of a well-defined set of true negatives. In image classification, each image either does or does not belong to a specific class. Conversely, in object detection, we have to search each image in an almost infinite combination of locations for an object that may or may not be in the image at all. As was discussed in part 1, different networks handle this hurdle in different ways. The overwhelming (and potentially uncountable) abundance of true negatives forces us to carefully consider how to define ‘accuracy’.

Precision and Recall

Because precision and recall are fundamental concepts in the field of pattern recognition, we won’t go into great detail defining them. There are several great resources on this topic - [3] is a good one specific to image classification. The figure below is a good reference for a visual intuition of the concepts. Let’s consider how these terms apply to object tracking.

Figure 1: Graphical representation of precision and recall in the task of binary classification.

In the case of real-time tracking, low recall means that in some frames important objects are not detected at all. Low precision means that the tracker detects a lot of phantom objects that aren’t actually there. Although we would like to have both high recall and high precision, in practice there is a tradeoff between them. Every object detector has its own precision-recall curve (or, similarly, a receiver operating characteristic (ROC) curve). For a given detector, the detection threshold can be varied to choose any point on this curve. The goal is to find the point on the curve that is optimal for the task at hand.
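To make the threshold tradeoff concrete, here is a minimal Python sketch. It uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The detection scores, the correctness flags, and the number of labeled objects are made-up values for illustration, and the function name is ours rather than part of any library.

```python
def precision_recall(scores, is_true_object, num_ground_truth, threshold):
    """Precision and recall when only detections scoring >= threshold are kept.

    scores           -- confidence of each detection
    is_true_object   -- True if that detection corresponds to a real object
    num_ground_truth -- number of real (labeled) objects in the data
    """
    kept = [hit for s, hit in zip(scores, is_true_object) if s >= threshold]
    tp = sum(kept)               # real objects that were found
    fp = len(kept) - tp          # phantom detections
    fn = num_ground_truth - tp   # real objects that were missed
    # Convention assumed here: precision is 1.0 when nothing is reported.
    precision = tp / (tp + fp) if kept else 1.0
    recall = tp / (tp + fn) if num_ground_truth else 0.0
    return precision, recall


# Hypothetical detections: confidence score and whether each one is correct.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
hits = [True, True, False, True, False, True]
num_objects = 5  # assumed number of labeled objects

for t in (0.85, 0.5, 0.2):
    p, r = precision_recall(scores, hits, num_objects, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With these made-up numbers, lowering the threshold from 0.85 to 0.2 raises recall from 0.40 to 0.80 while precision falls from 1.00 to 0.67. Each threshold corresponds to one point on the detector's precision-recall curve.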

Figure 2: Example of precision-recall curves for two different detectors. This allows for direct comparison at different thresholds.

In the figure above, detectors A and B are compared. Over most of the range detector B outperforms detector A, but there are some threshold levels at which A is superior. So which detector is better? Well, it depends on the application. Let us consider two possible tracking scenarios.

Scenario 1: Self-Driving Car

A self-driving car needs to quickly detect and track the objects around it in order to avoid collisions and to take evasive actions in emergency situations. The ability of the car to quickly and accurately identify a pedestrian who has stepped out into the road is fundamental to the acceptability of the car on public roads. In this case, nearly perfect recall is required. If the car thinks that a person may have stepped out in front of it, the car must act now and ask questions later. The detriment of a false positive (in which the car hits the brakes for no reason) is nowhere near as severe as the detriment of a false negative (in which the car fails to detect the person until it is too late to start braking). As with the case of testing for a deadly disease, high recall will be important in cases where quick action is vital to public safety.

Scenario 2: Automated Campus Security System

An automated security system installed on a university campus allows a small number of security professionals to monitor and safeguard a large area by helping to point out possible security threats. The most common threats are likely to be people on foot. In this case an intruding individual would likely be moving slowly enough to appear on many camera frames before causing too much trouble. This means that the detector would have a lot of tries to get the detection right before harm is done. In this case, low precision (lots of false alarms) would be more detrimental because the limited security staff might end up spending all of their time sifting through footage or running around campus looking for intruders that aren’t actually there.

So, the optimal tradeoff between precision and recall depends on the application. Therefore, what we need is a good way to compare different detectors over the range of possible detection thresholds. One way to do this is by comparing precision-recall (p-r) curves, as shown in figure 2. A more concise way is to compare mean average precision (mAP).

Average Precision and Mean Average Precision (mAP)

Average precision is a single number that characterizes a detector’s overall p-r performance. Essentially, average precision is the area under the p-r curve (AUC). There are a few different ways that it can be calculated, so when comparing detectors, care should be taken to ensure that it was calculated in the same way for each [3]. The area under the p-r curve describes how well the detector balances precision and recall.
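As a rough illustration, the sketch below computes average precision in one of the possible ways: it ranks detections by confidence and sums the area under the p-r curve one detection at a time, without interpolation. Benchmarks such as PASCAL VOC use interpolated variants, so numbers from this sketch are not directly comparable to published results. The detections and the object count are made-up values, and the function name is ours.

```python
def average_precision(scores, is_true_object, num_ground_truth):
    """Area under the precision-recall curve, summed rank by rank (no interpolation)."""
    # Rank detections from most to least confident.
    ranked = sorted(zip(scores, is_true_object), key=lambda pair: pair[0], reverse=True)
    tp = 0
    ap = 0.0
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            tp += 1
            precision = tp / rank
            # Each true positive raises recall by 1 / num_ground_truth,
            # so it adds a strip of that width and height `precision`.
            ap += precision / num_ground_truth
    return ap


# Hypothetical detections: confidence score and whether each one is correct.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
hits = [True, True, False, True, False, True]
num_objects = 5  # assumed number of labeled objects

print(f"AP = {average_precision(scores, hits, num_objects):.3f}")
```

Because one of the five assumed objects is never detected, the curve stops at a recall of 0.8, which caps the achievable area. A perfect detector on the same data would score 1.0.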

Mean average precision is simply the mean of the average precisions of each class of object that the detector is looking for. A detector with a high mAP should be able to achieve relatively high precision and recall at the same time. Because of its general applicability, mAP is one of the best ways to get an overall sense of how ‘good’ a certain detector is on a certain dataset without needing to talk about the specific precision and recall requirements of the situation.
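The averaging step itself is short. In the sketch below the per-class average precisions are made-up numbers, and the class names are only examples.

```python
def mean_average_precision(ap_by_class):
    """Unweighted mean of the per-class average precisions."""
    return sum(ap_by_class.values()) / len(ap_by_class)


# Hypothetical per-class average precisions for one detector on one dataset.
ap_per_class = {"car": 0.62, "truck": 0.48, "pedestrian": 0.25}

print(f"mAP = {mean_average_precision(ap_per_class):.2f}")
```

Note that the mean is unweighted, so a single summary number can hide a class on which the detector does poorly (the pedestrian class in this made-up example).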

Being the primary metric used in both the PASCAL VOC [1] and ImageNet LSVRC [2] object detection challenges, mAP is the most widely used metric in the literature on object detection. Its general applicability makes it perfect for academic competitions. However, while it is probably the best single metric of a detector’s performance on a given dataset, its use may not give sufficient information for all use-cases.

Intersection Over Union (IoU)

The other metric commonly found in object detection literature is intersection over union (IoU). IoU allows us to see how well the predicted bounding boxes spatially overlap with the ground-truth bounding boxes. IoU is therefore simply the sum of all the pixels that are shared between the detected bounding boxes and their corresponding ground-truth labels (the intersections), divided by the sum of all the pixels that make up the unions of the same boxes.
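For a single detection and its ground-truth box, the calculation can be sketched as follows. The box format (x_min, y_min, x_max, y_max) in pixel coordinates is an assumption made for this example, and the coordinates are made up.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero so disjoint boxes give no overlap.
    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


detection = (10, 10, 50, 50)     # hypothetical predicted box
ground_truth = (30, 30, 70, 70)  # hypothetical labeled box

print(f"IoU = {iou(detection, ground_truth):.3f}")
```

Here the two 40x40 boxes share a 20x20 region, so the IoU is 400 / 2800, or about 0.143. Identical boxes give 1.0 and boxes that do not touch give 0.0.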

Like mAP, this metric can be calculated over the entire dataset. However, it is more often applied to a single detection at a time. In fact, an implicit part of the mAP metric is that a true positive is defined as a detection and ground-truth bounding box pair with an IoU above some arbitrary threshold (often 0.5). This threshold is another hyperparameter that can be varied when tuning a detector to a specific application.
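The sketch below shows one way the IoU threshold can be used to label detections as true or false positives before computing precision, recall, or mAP. The greedy matching rule (most confident detection first, each ground-truth box claimed at most once) is an assumption for illustration. Evaluation toolkits differ in the details, and all boxes and scores here are made up.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, ground_truth, iou_threshold=0.5):
    """Return (score, is_true_positive) pairs, most confident detection first.

    detections   -- list of (score, box)
    ground_truth -- list of boxes; each can be claimed by at most one detection
    """
    claimed = set()
    labels = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Find the best-overlapping ground-truth box that is still unclaimed.
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in claimed:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            claimed.add(best_idx)
            labels.append((score, True))
        else:
            labels.append((score, False))
    return labels


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [
    (0.9, (1, 1, 10, 10)),     # good overlap with the first object
    (0.8, (0, 0, 9, 9)),       # duplicate detection of the first object
    (0.7, (24, 24, 34, 34)),   # loose overlap with the second object
]

print(label_detections(detections, ground_truth, iou_threshold=0.5))
print(label_detections(detections, ground_truth, iou_threshold=0.2))
```

At a threshold of 0.5 only the first detection counts as a true positive. Relaxing the threshold to 0.2 also accepts the loosely localized third detection, which shows how this hyperparameter changes the reported precision and recall without any change to the detector itself.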

Figure 3: Visual representation of three different IoU values. The red boxes represent object detections, and green boxes represent ground-truth labels.

Setting up a Framework for Comparison

At KickView, we often use object detectors such as those described in part 1 as integral parts of much larger systems. In part 3 of this blog post series, we will discuss how a detector can be used as part of a tracking system. In order to make decisions about which detectors to use, and to help tune those detectors, KickView engineers developed a standardized test framework. This framework contains two parts: a standard test dataset, and a set of Python modules that allows flexible and rapid calculation of performance metrics on any detector and over any set of hyperparameters.

Figure 4: Sample image from KickView dataset.

The figure above shows a sample image from the KickView dataset. The dataset contains vehicles of various types and pedestrians. The perspective and resolution were chosen to represent conditions that might arise in a real-world security camera situation. The dataset contains objects that are very small (e.g. the pedestrian on the sidewalk). Very small objects have traditionally been difficult to detect, and are often ignored in object detection competitions. However, they are likely to arise in many real-world situations, and therefore should be considered.

The KickView dataset also has multiple lanes of vehicles that pass each other, as well as an occlusion (the pine trees). These aspects allow us to test important cases that arise in tracking problems. In general, the detectors that we test have been trained on larger datasets. The KickView dataset simply provides a convenient way to validate models and compare metrics. In this case, we have decided to ignore the vehicles that are parked in the rows of parking spots, so these objects are filtered out before comparison to the ground-truth labels.

Figure 5: Real-time tracking of vehicles and RF signals using KickView's Intelligent Multi-Sensor Analytics platform.

Comparison of Four Object Detection Architectures

Now for the fun part. In part 1 we tested the speed of the following architectures:

  • SSD
  • YOLO
  • Mask R-CNN
  • DetectNet

It makes sense to compare the same detectors for IoU and mAP against the labeled KickView dataset. Table 1 below shows the results from our tests. Both mAP and IoU values range from 0 to 1, with 1 being a perfect score; Table 1 reports them scaled by 100 (as percentages). The testing was run on KickView's NVIDIA DGX Station.

Table 1: Results of Metrics Run on KickView Dataset

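| Name       | Meta Architecture | Feature Extractor | Frames/Sec | mAP  | IoU  |
|------------|-------------------|-------------------|------------|------|------|
| SSD        | SSD               | VGGNet            | 69.4       | 15.2 | 14.6 |
| YOLO       | SSD               | AlexNet           | 28.7       | 12.7 | 11.8 |
| Mask R-CNN | Faster R-CNN      | ResNet101         | 4.2        | 43.7 | 39.9 |
| DetectNet  | SSD               | InceptionNet      | 31.3       | 35.4 | 26.3 |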

As might be expected, there is a tradeoff between speed and accuracy [4]. SSD and YOLO are both fast detectors that can potentially keep up with incoming camera frames in real time, but they both show poor accuracy, with the lowest mAP and IoU scores in Table 1. When looking even more closely at the results, it was clear that both of these detectors perform particularly poorly on the smaller objects, of which there are many in this dataset.

Conversely, the Mask R-CNN detector showed the highest mAP and IoU, but at a speed of less than 5 frames per second, it is unable to keep up with the rate of incoming images from many real camera systems. However, as we will discuss in part 3, it may still be useful as a part of a real-time tracking system. The DetectNet model has mid-level accuracy and is fast enough to run online in real time, and may therefore be the sweet spot for certain tracking problems.
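One simple way to act on these results is to fix a minimum frame rate and pick the most accurate detector that meets it. The sketch below does this with the speed and mAP values from Table 1. The 30 frames-per-second requirement is a hypothetical example, not a KickView specification, and the helper function is ours.

```python
# (name, frames per second, mAP) copied from Table 1.
results = [
    ("SSD", 69.4, 15.2),
    ("YOLO", 28.7, 12.7),
    ("Mask R-CNN", 4.2, 43.7),
    ("DetectNet", 31.3, 35.4),
]


def best_detector(results, min_fps):
    """Highest-mAP detector that runs at min_fps or faster, or None if none does."""
    fast_enough = [r for r in results if r[1] >= min_fps]
    return max(fast_enough, key=lambda r: r[2]) if fast_enough else None


print(best_detector(results, min_fps=30))  # hypothetical real-time requirement
print(best_detector(results, min_fps=1))   # effectively no speed requirement
```

With a 30 frames-per-second requirement this selects DetectNet, and with no meaningful speed requirement it selects Mask R-CNN. A real selection would likely also weigh IoU and per-class behavior, such as accuracy on small objects.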

Proper selection of an object detector requires careful consideration of the requirements of the situation. In real-world situations it is important to consider the balance of multiple metrics. In part 3 of this blog series, we will consider how an object detector with certain speed and accuracy traits can function as a part of a larger tracking system.


[1] Everingham, Mark, et al. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88.2 (2010): 303-338.

[2] Russakovsky, Olga, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115.3 (2015): 211-252.

[3] Sancho McCann. It’s a bird… it’s a plane… it… depends on your classifier’s threshold. Blog post. 2011. https://sanchom.wordpress.com/2011/09/01/precision-recall/

[4] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.