Localizing Objects Fast and Accurately

Last Modified
  • March 20, 2019
  • Liu, Wei
    • Affiliation: College of Arts and Sciences, Department of Computer Science
  • A fundamental problem in computer vision is knowing what is in an image and where it is. We develop models that localize objects of multiple categories, such as people and cars, quickly and accurately. In particular, we focus on designing deep convolutional neural networks (CNNs) for object detection and semantic segmentation. A central theme of this dissertation is exploring network design choices that combine the full power of CNNs with the characteristics of each task, so as to achieve high-quality results while keeping the models relatively simple and fast. At the heart of object detection is the question of how to search efficiently through a continuous space of 2D bounding boxes of various scales and aspect ratios at every possible location in an image. A brute-force search over all possibilities is clearly not scalable. An alternative is to propose a set of candidate locations that might contain objects and then classify each proposal. Because the search space is much smaller after the proposal step, more powerful features can be used to describe each proposal. A first contribution of this dissertation is to show that fine-tuning a much deeper network boosts detection performance significantly compared to a shallower one. A second contribution is to show that the search can be approximated by discretizing the search space and then regressing the residual difference between a discrete box and a target box. This is a departure from the propose-then-classify family of methods. We present a single-stage framework, SSD, which detects and classifies objects simultaneously, fast and accurately. SSD splits the space of small boxes more densely and the space of larger boxes more sparsely; as a result, it discretizes the space more efficiently and eases training notably.
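The discretize-then-regress idea can be illustrated with a minimal NumPy sketch. This is not the dissertation's actual implementation; the function names, the 8x8 feature-map size, and the scale and aspect-ratio values are illustrative assumptions, and the offset encoding follows the common center/size parameterization used in single-stage detectors.

```python
import numpy as np

def default_boxes(feature_map_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Tile default boxes over a square feature map, one set per cell.

    Boxes are in normalized (cx, cy, w, h) coordinates. Larger feature
    maps paired with smaller scales tile small boxes densely, while
    smaller maps with larger scales tile large boxes sparsely -- the
    discretization described above. (Illustrative sketch only.)
    """
    boxes = []
    for i in range(feature_map_size):
        for j in range(feature_map_size):
            cx = (j + 0.5) / feature_map_size
            cy = (i + 0.5) / feature_map_size
            for ar in aspect_ratios:
                boxes.append([cx, cy, scale * np.sqrt(ar), scale / np.sqrt(ar)])
    return np.array(boxes)

def encode_offsets(target, default):
    """Residual between a target box and a matched default box --
    the quantity the network regresses instead of raw coordinates."""
    tcx, tcy, tw, th = target
    dcx, dcy, dw, dh = default
    return np.array([(tcx - dcx) / dw, (tcy - dcy) / dh,
                     np.log(tw / dw), np.log(th / dh)])

# An 8x8 map with 3 aspect ratios yields 8 * 8 * 3 = 192 default boxes.
boxes = default_boxes(8, 0.2)
print(boxes.shape)  # (192, 4)
```

A target box that coincides exactly with its matched default box encodes to a zero residual, which is what makes the regression well conditioned near good matches.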
We show empirically that it is as accurate as, or better than, two-stage methods while being much faster. Unlike object detection, semantic segmentation is usually treated as a per-pixel classification problem, especially in the era of deep networks. A major issue, however, is how to incorporate global semantic context when making local decisions. While concurrent work uses techniques from graphical models, such as conditional random fields (CRFs), to introduce context and structure, we present a simple yet effective method, ParseNet, which augments the features at each location with the layer's average feature. Experimental results show that this can be as effective as using CRFs as a post-processing step to include context. To make the above methods useful for real-time systems such as mobile devices and self-driving cars, we have collected large-scale video datasets for multiple categories, in the hope that temporal consistency in video can further boost performance and speed up operations while lowering power consumption.
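The feature-augmentation idea can be sketched in a few lines of NumPy. This is a simplified assumption-laden sketch, not ParseNet itself: the published method also L2-normalizes the local and global branches before fusing them, which this sketch omits, and the function name and tensor shapes here are illustrative.

```python
import numpy as np

def add_global_context(features):
    """Augment each spatial location with the layer's global average
    feature, in the spirit of the approach described above.

    features: (C, H, W) activation map from one layer.
    returns:  (2C, H, W) map with the pooled context broadcast back
              to every location and concatenated channel-wise.
    """
    c, h, w = features.shape
    global_avg = features.mean(axis=(1, 2), keepdims=True)   # (C, 1, 1)
    context = np.broadcast_to(global_avg, (c, h, w))          # same avg everywhere
    return np.concatenate([features, context], axis=0)

feat = np.random.rand(16, 8, 8)
out = add_global_context(feat)
print(out.shape)  # (32, 8, 8)
```

Because every location sees the same pooled vector, a per-pixel classifier on the augmented map can condition its local decision on scene-level evidence without any graphical-model machinery.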
Rights statement
  • In Copyright
Advisor
  • Berg, Alexander
  • Anguelov, Dragomir
  • Frahm, Jan-Michael
  • Niethammer, Marc
  • Berg, Tamara
Degree
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill Graduate School
Graduation year
  • 2016
