Training an FCN for Object Detection

One of the many useful tasks that can be accomplished using deep learning is visual object detection. For example, a Deep Neural Network (DNN) can be trained to detect an object (such as a vehicle, pedestrian, or bicycle). Similar to traditional computer vision systems, bounding boxes can be placed around the detected objects and the objects passed on for further processing. We have used DNNs for object detection in video analytics and multi-sensor processing applications such as event detection and object tracking. In this post, I will describe how to get started with object detection and point you to some useful resources for learning.

Fig1. Bounding boxes output from an FCN trained to detect vehicles. Training process utilized the KITTI public dataset.

Fully-Convolutional Network (FCN)

NVIDIA has provided a quick way to get you up and running with object detection using DIGITS. The new version of DIGITS includes an example neural network model architecture called DetectNet. There is a very good blog post about this network called DetectNet: Deep Neural Network for Object Detection in DIGITS. The detection network architecture is based around a Fully-Convolutional Network (FCN) and is implemented in the Caffe framework. For training, there are three important processes:

  1. Data layers ingest the training images and labels and a transform layer applies data augmentation. Note - Augmentation is important to the training of a network in order for it to generalize well to new data.
  2. An FCN performs the feature extraction and object classification, and then determines bounding boxes.
  3. Loss functions measure the error in the tasks of predicting object coverage (see DetectNet link for a detailed description) and bounding box corners per grid square.

For training, input data can consist of images or video frame images that contain multiple objects. For each object in the image the training label includes the class of the object and the coordinates of the corners of its bounding box. For each image, there needs to be an associated text file that includes a fixed 3-dimensional label format that enables the network to ingest images of any size with a variable number of objects present. The NVIDIA DetectNet implementation uses the KITTI data format. This format along with the KITTI dataset can be downloaded here. This dataset is useful for training your first FCN.
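To make the label format concrete, here is a minimal sketch of reading one KITTI label file. It assumes the standard KITTI object label layout (15 whitespace-separated fields per line, with the class in field 0 and the 2D box as left, top, right, bottom in fields 4 to 7). The names `KittiObject` and `parse_kitti_labels` and the sample values are made up for illustration.

```python
from typing import List, NamedTuple


class KittiObject(NamedTuple):
    """One object from a KITTI label file (2D fields only)."""
    obj_type: str
    left: float
    top: float
    right: float
    bottom: float


def parse_kitti_labels(text: str) -> List[KittiObject]:
    """Parse the contents of one KITTI label file into objects."""
    objects = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        # Field 0 is the class; fields 4-7 are the 2D box in pixels.
        left, top, right, bottom = (float(v) for v in fields[4:8])
        objects.append(KittiObject(fields[0], left, top, right, bottom))
    return objects


# Illustrative label text (values are invented, not taken from the dataset).
sample = (
    "Car 0.00 0 -1.57 100.0 150.0 200.0 220.0 1.5 1.6 3.9 1.0 1.7 20.0 -1.55\n"
    "DontCare -1 -1 -10 500.0 160.0 530.0 180.0 -1 -1 -1 -1000 -1000 -1000 -10\n"
)
objects = parse_kitti_labels(sample)
cars = [o for o in objects if o.obj_type == "Car"]
```

Because each object is one line, an image with any number of objects is simply a label file with that many lines.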

The process used by the detection network to ingest labeled training images can be visualized by considering a rectangular grid overlaid on the input image. The grid spacing should be slightly smaller than the smallest object you want to detect (e.g., if the car you want to detect is approximately 50x50 pixels, use a grid square of approximately 40x40 pixels). This is a key concept when designing your detection network. You will need to adjust the appropriate parameters to best match the range of object sizes you want to detect. Of course, the input image resolution also plays a role in this decision. You can experiment with downsampling and upsampling the original images to determine the best detection performance for a specific target size.

You can imagine the detection network providing two pieces of information for each grid square: the class of the object in the square, and the pixel coordinates of the corners of that object's bounding box relative to the center of the square. When no object is present in the grid square, a "dontcare" class is used. A coverage value of 0 or 1 indicates whether an object is present within the grid square. When multiple objects are present in the same grid square, the object that occupies the largest number of pixels within the square is selected. When multiple objects occupy the same number of pixels (e.g., two cars), the object whose bounding box has the lowest y-value is selected. This is a good choice for car-based cameras or ground-based cameras at low elevation angles (a lower y-value can be associated with an object that is closer to the camera), but may offer less advantage at higher angles (e.g., surveillance cameras).

To get the network to perform optimally for your application, it is important to experiment with several parameters related to the overlay grid size, sensitivity, and bounding box determination. Also, keep in mind that the variation in your bounding boxes may depend on your input image resolution. You may need to dig into the source code to get a full understanding of how to make those adjustments.
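The grid idea above can be sketched in a few lines. This is a simplified illustration, not the DetectNet implementation: the function names are hypothetical, coverage is set to 1 for any square the box overlaps, and "lowest y-value" is taken to mean the top edge of the box, which is an assumption.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def overlap_area(box: Box, cell: Box) -> float:
    """Area in pixels shared by a bounding box and a grid square."""
    width = min(box[2], cell[2]) - max(box[0], cell[0])
    height = min(box[3], cell[3]) - max(box[1], cell[1])
    return max(0.0, width) * max(0.0, height)


def grid_labels(boxes: List[Box], image_w: int, image_h: int, stride: int):
    """Return per-square coverage (0 or 1) and corner offsets from the square center."""
    cols, rows = image_w // stride, image_h // stride
    coverage = [[0] * cols for _ in range(rows)]
    corners: List[List[Optional[Box]]] = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            cell = (c * stride, r * stride, (c + 1) * stride, (r + 1) * stride)
            covering = [b for b in boxes if overlap_area(b, cell) > 0]
            if not covering:
                continue
            # Largest overlap wins; ties go to the lowest y-value
            # (taken here as the top edge of the box, an assumption).
            best = min(covering, key=lambda b: (-overlap_area(b, cell), b[1]))
            cx, cy = cell[0] + stride / 2, cell[1] + stride / 2
            coverage[r][c] = 1
            corners[r][c] = (best[0] - cx, best[1] - cy, best[2] - cx, best[3] - cy)
    return coverage, corners


# One 50x50 "car" on a 160x80 image with 40-pixel grid squares.
coverage, corners = grid_labels([(45.0, 10.0, 95.0, 60.0)], 160, 80, 40)
```

In this toy case the car touches the four central squares, so those squares get a coverage of 1 and each stores the same box expressed relative to its own center.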

For validation, the detection network utilizes two more processes:

  1. A clustering algorithm computes the final set of predicted bounding box coordinates.
  2. A simple mean Average Precision (mAP) metric is computed to determine the performance.
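As a rough illustration of the clustering step, the sketch below groups rectangles whose edges are close to each other (relative to their size), drops groups with too few members, and averages the rest. It is a pure-Python approximation of the rectangle-grouping idea, not the OpenCV code the network actually calls. The `eps` and `min_count` parameters and all function names are assumptions.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)


def similar(a: Rect, b: Rect, eps: float) -> bool:
    """True when all four edges differ by no more than a size-scaled tolerance."""
    min_w = min(a[2] - a[0], b[2] - b[0])
    min_h = min(a[3] - a[1], b[3] - b[1])
    delta = eps * 0.5 * (min_w + min_h)
    return all(abs(p - q) <= delta for p, q in zip(a, b))


def cluster_boxes(rects: List[Rect], min_count: int, eps: float = 0.2) -> List[Rect]:
    """Group similar rectangles; return one averaged box per large-enough group."""
    parent = list(range(len(rects)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if similar(rects[i], rects[j], eps):
                parent[find(j)] = find(i)

    groups = {}
    for i, rect in enumerate(rects):
        groups.setdefault(find(i), []).append(rect)

    merged = []
    for members in groups.values():
        if len(members) < min_count:
            continue  # too few supporting rectangles: reject the cluster
        merged.append(tuple(sum(m[k] for m in members) / len(members) for k in range(4)))
    return merged


candidates = [
    (100.0, 100.0, 150.0, 150.0),
    (102.0, 98.0, 152.0, 148.0),
    (98.0, 102.0, 148.0, 152.0),
    (300.0, 40.0, 340.0, 80.0),  # isolated box with no support
]
detections = cluster_boxes(candidates, min_count=2)
```

Raising `min_count` makes the detector more conservative, since a detection then needs more agreeing grid squares before a box is emitted.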

To improve performance, several parameters can be adjusted during training. As mentioned above, the grid spacing and the stride in pixels are used to determine the object size sensitivity. Input data augmentation parameters can also be adjusted. Augmentation is essential to successfully train for high sensitivity and accurate object detection. Augmentation is a process that applies random transformations, like pixel shifts and flips, to the training images each time they are input into the training process. The benefit of using augmentation is that the network never "sees" the same training image twice, so it is more resilient to overfitting.
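The key detail with augmentation for detection is that the bounding boxes must be transformed together with the pixels. Here is a minimal sketch of a random horizontal flip and shift on a tiny image stored as a list of rows. The `augment` function and its parameters are hypothetical, and a real pipeline would apply this inside the data layer rather than in Python lists.

```python
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def augment(image: List[List[int]], boxes: List[Box], max_shift: int,
            rng: random.Random):
    """Randomly flip and shift an image (list of pixel rows) and its boxes."""
    width = len(image[0])

    if rng.random() < 0.5:
        # Horizontal flip: mirror each row and mirror the box x-coordinates.
        image = [row[::-1] for row in image]
        boxes = [(width - r, t, width - l, b) for l, t, r, b in boxes]

    # Horizontal shift by dx pixels, padding the exposed edge with zeros.
    dx = rng.randint(-max_shift, max_shift)
    if dx > 0:
        image = [[0] * dx + row[:width - dx] for row in image]
    elif dx < 0:
        image = [row[-dx:] + [0] * (-dx) for row in image]
    # Shift the boxes by the same amount and clip them to the image bounds.
    boxes = [(min(max(l + dx, 0), width), t, min(max(r + dx, 0), width), b)
             for l, t, r, b in boxes]
    return image, boxes


# An 8x4 image with a 3x2 "object" at columns 2-4, rows 1-2.
image = [[0] * 8 for _ in range(4)]
for row in image[1:3]:
    row[2:5] = [1, 1, 1]
new_image, new_boxes = augment(image, [(2, 1, 5, 3)], max_shift=2,
                               rng=random.Random(0))
```

Each call draws a new flip and shift, so the same source image produces a different training sample every epoch while its label stays aligned with the object.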

Pre-training and Fine-Tuning

Training your own FCN involves some patience and effort. However, it is well worth it to gain the insight and intuition needed for more complex models. In our experience, training an FCN is easier when it only has to detect a single object class (e.g., vehicles). Although an FCN can be trained for multiple object classes, it is often best to train multiple single-class models and run them in parallel on your target dataset.

For this example we will use the NVIDIA DIGITS platform, since it has an FCN model called DetectNet built in to the latest versions. You should not feel restricted to this model, and it can often be useful to make modifications to better suit your intended application. We have developed several modified versions of this general network for both experimentation and application. On specific projects we have started with DetectNet and changed its layers to customize performance for the application.

Fig2. Visualization of KITTI dataset. Dataset consists of car-based camera images.

One thing that reduces frustration when training your FCN is to filter your KITTI dataset into a new dataset that only includes images of objects you are interested in detecting. For example, a simple Python script like the following takes a text file listing the label files for a specific category and copies those labels and their images into a new dataset directory:

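import os
import shutil
import sys

# Text file listing the label files to keep, one path per line.
list_file = sys.argv[1]
# Root directory of the filtered dataset to create.
outDir = sys.argv[2]

# Create the output folders if they do not exist yet.
lbl_dir = os.path.join(outDir, "labels")
image_dir = os.path.join(outDir, "images")
os.makedirs(lbl_dir, exist_ok=True)
os.makedirs(image_dir, exist_ok=True)

with open(list_file) as inFile:
  # Strip newlines and skip blank lines.
  lines = [line.strip() for line in inFile if line.strip()]

for line in lines:
  basefile = os.path.basename(line)

  # The image shares the label's base name, with a .png extension.
  name, ext = os.path.splitext(basefile)
  image_file = "./images/" + name + ".png"
  txt_file = "./labels/" + basefile

  shutil.copy2(txt_file, lbl_dir)
  shutil.copy2(image_file, image_dir)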
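The script above expects a list of label files as its first argument. One possible way to build that list, assuming KITTI-style label files where the object class is the first field on each line, is sketched below. The function name, the demo file names, and the label values are hypothetical.

```python
import os
import tempfile


def labels_with_category(label_dir: str, category: str):
    """Return sorted paths of label files with at least one object of `category`."""
    matches = []
    for name in sorted(os.listdir(label_dir)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(label_dir, name)
        with open(path) as label_file:
            # The object class is the first whitespace-separated field.
            if any(line.split()[:1] == [category] for line in label_file):
                matches.append(path)
    return matches


# Tiny demonstration with two made-up label files in a temporary directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "000001.txt"), "w") as f:
    f.write("Car 0.00 0 -1.57 100.0 150.0 200.0 220.0 1.5 1.6 3.9 1.0 1.7 20.0 -1.55\n")
with open(os.path.join(demo_dir, "000002.txt"), "w") as f:
    f.write("Pedestrian 0.00 0 0.2 400.0 160.0 430.0 240.0 1.8 0.6 0.9 5.0 1.6 12.0 0.1\n")

car_labels = labels_with_category(demo_dir, "Car")

# Write one path per line, the format the copy script reads.
with open(os.path.join(demo_dir, "car_list.lst"), "w") as list_file:
    list_file.write("\n".join(car_labels) + "\n")
```

Note that this keeps every image containing at least one object of the category. The label files are copied unchanged, so any other classes in those images remain in the labels.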

Training and Validation

The FCN we used in Fig3 was initialized using a pre-trained model from our initial KITTI baseline training. This is a common trick that improves final model accuracy and reduces training time. It is worth noting that an FCN is a Convolutional Neural Network (CNN) with no fully-connected layers. You can think of it as a CNN applied in a strided sliding-window fashion across the image. This allows varying input image sizes, with an output that is a real-valued multi-dimensional array containing classification labels for each grid square. The output array can be overlaid directly on the input image.

The detection network uses a linear combination of two loss functions as its training objective. The first is the coverage loss, which is the sum of the squared differences between the true and predicted object coverage over all grid squares in a training image. The second is the bounding box loss, which is the mean absolute difference between the true and predicted bounding box corners for the object covered by each grid square. During training, the weighted sum of these two loss values is minimized.
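The two loss terms can be written out directly. The sketch below is a plain-Python illustration of the definitions in the text, not the Caffe layers themselves. The function names and the weights `cov_weight` and `box_weight` are placeholders, since the actual weighting is set in the network definition.

```python
def coverage_loss(true_cov, pred_cov):
    """Sum of squared differences between true and predicted coverage."""
    return sum((t - p) ** 2 for t, p in zip(true_cov, pred_cov))


def bbox_loss(true_corners, pred_corners):
    """Mean absolute difference over the corner coordinates of covered squares."""
    diffs = [abs(t - p)
             for t_box, p_box in zip(true_corners, pred_corners)
             for t, p in zip(t_box, p_box)]
    return sum(diffs) / len(diffs) if diffs else 0.0


def total_loss(cov_l, box_l, cov_weight=1.0, box_weight=1.0):
    """Weighted sum of the two terms (weights here are illustrative)."""
    return cov_weight * cov_l + box_weight * box_l


# Four grid squares, two of which truly contain an object.
true_cov = [1, 0, 0, 1]
pred_cov = [0.5, 0.0, 0.5, 1.0]
cov_l = coverage_loss(true_cov, pred_cov)

# Corner offsets (left, top, right, bottom) for one covered square.
true_corners = [(-15, -10, 35, 40)]
pred_corners = [(-13, -10, 34, 44)]
box_l = bbox_loss(true_corners, pred_corners)

loss = total_loss(cov_l, box_l, cov_weight=1.0, box_weight=2.0)
```

Changing the relative weights shifts the emphasis between finding objects at all (coverage) and placing their boxes precisely (corners).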

Fig3. FCN training using filtered KITTI dataset. The "car" category was filtered from the larger dataset.

At the final stage of the detection network, a clustering algorithm filters the multiple bounding boxes generated for grid squares whose predicted coverage values exceed a set threshold. It is worth looking at the source code for the clustering, but keep in mind that it is implemented in OpenCV and clusters bounding boxes using a rectangle equivalence criterion that combines rectangles with similar sizes and locations. Clusters with fewer than a set threshold of rectangles are rejected before a bounding box is generated. For some applications you may want to change this or use a different methodology.

A score for the final bounding box output is generated using a simplified mean Average Precision (mAP) calculation. There is some debate about how to properly compute this, but we suggest you strive for consistency. For each pair of predicted and ground truth bounding boxes, the Intersection over Union (IoU) score is computed. IoU is the ratio of the area where the two bounding boxes overlap to the area of their union. According to the NVIDIA documentation, predicted bounding boxes are designated as either true positives or false positives with respect to the ground truth bounding boxes using an IoU threshold. If a ground truth bounding box cannot be paired with a predicted bounding box such that the IoU exceeds the threshold, then that bounding box is a false negative (i.e., it represents an undetected object). In DIGITS, the simplified mAP score is the product of the precision (ratio of true positives to true positives plus false positives) and the recall (ratio of true positives to true positives plus false negatives). See Fig3. The mAP is a metric for how sensitive the detection network is to objects of interest and how precise the bounding box estimates are.
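As a worked example of the scoring, the sketch below computes IoU using the standard definition (intersection area divided by union area), greedily pairs predictions with ground truth boxes, and returns precision, recall, and their product. The greedy matching rule, the function names, and the 0.7 threshold are assumptions for illustration and may differ from what DIGITS does internally.

```python
def area(box):
    """Area of a (left, top, right, bottom) box."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over Union of two (left, top, right, bottom) boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def simplified_map(predictions, ground_truth, threshold=0.7):
    """Return (precision, recall, precision * recall) for one image."""
    matched = set()
    true_pos = 0
    for pred in predictions:
        best, best_iou = None, threshold
        for i, truth in enumerate(ground_truth):
            if i in matched:
                continue  # each ground truth box can be matched only once
            score = iou(pred, truth)
            if score > best_iou:
                best, best_iou = i, score
        if best is not None:
            matched.add(best)
            true_pos += 1
    false_pos = len(predictions) - true_pos
    false_neg = len(ground_truth) - true_pos
    precision = true_pos / (true_pos + false_pos) if predictions else 0.0
    recall = true_pos / (true_pos + false_neg) if ground_truth else 0.0
    return precision, recall, precision * recall


ground_truth = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
predictions = [(0.0, 0.0, 10.0, 8.0), (50.0, 50.0, 60.0, 60.0)]
precision, recall, score = simplified_map(predictions, ground_truth)
```

Here the first prediction overlaps its ground truth box with an IoU of 0.8 and counts as a true positive. The second prediction is a false positive, and the second ground truth box is a false negative, so precision and recall are both 0.5 and the score is 0.25.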

Fig4. FCN trained to detect pedestrians only.

We have used FCNs like DetectNet to provide measurements in object tracking applications from video. In these applications, it takes some patience to train the initial network using the filtered KITTI dataset and to understand when to stop the training, save the weights, and initialize a new training session using a custom dataset. We have created many tools to enable the efficient generation of custom datasets from customer-provided data or data we collect ourselves. Although training and seeing the results from the FCN is a lot of fun, the bulk of the work is often in creating, formatting, and filtering custom datasets.

Fig5. FCN detector used for targeted vehicle detection and tracking from video.


In addition to the benefits already mentioned, using an FCN is more efficient than using a CNN as a sliding window detector, since it does not repeat calculations for overlapping windows. Using dual Titan X GPUs, we have trained detection networks for vehicle detection on images ranging from 384x1248 to 1536x1024 pixels. Although training can take several hours, the deployed network can process frames in real time or near real time on a gaming laptop with a GPU. Contact us at KickView if you are interested in learning more about our advanced video and multi-sensor analytics capabilities.