Localizing Objects Fast and Accurately

Liu, Wei

Last Modified

March 20, 2019

Creator

Liu, Wei
- Affiliation: College of Arts and Sciences, Department of Computer Science

Abstract

A fundamental problem in computer vision is knowing what is in an image and where it is. We develop models that localize objects of multiple categories, such as person and car, fast and accurately. In particular, we focus on designing deep convolutional neural networks (CNNs) for object detection and semantic segmentation. A central theme of this dissertation is exploring design choices in network structure that combine the full power of CNNs with the characteristics of each task, so as to achieve high-quality results while keeping the model relatively simple and fast. At the heart of object detection is the question of how to search efficiently through a continuous 2D space of bounding boxes of various scales and aspect ratios at every possible location in an image. A brute-force approach would search over all possibilities, but it is clearly not scalable and is quite difficult. An alternative is to propose a set of potential locations that might contain objects and then classify each proposal. Because the search space is much smaller after the proposal step, we can use more powerful features to describe each proposal. The first contribution of this dissertation is to show that fine-tuning a much deeper network boosts detection performance significantly compared with a shallower network. The second contribution is to show that the search can be approximated by discretizing the search space and then regressing the residual difference between a discrete box and a target box. This departs from the propose-then-classify family of methods. We present a single-stage framework, SSD, which simultaneously detects and classifies objects fast and accurately. SSD splits the space of small boxes more densely and the space of larger boxes more sparsely. As a result, it discretizes the space more efficiently and makes training notably easier. We show empirically that SSD is as accurate as, or even more accurate than, two-stage methods while being much faster.

Unlike object detection, semantic segmentation is usually treated as a per-pixel classification problem, especially in the era of deep networks. A major issue, however, is how to incorporate global semantic context when making local decisions. Although there is concurrent work that uses techniques from graphical models, such as conditional random fields (CRFs), to introduce context and structure information, we present a simple yet effective method, ParseNet, which uses the average feature of a layer to augment the features at each location. Experimental results show that it can be as effective as a method that uses CRFs as a post-processing step to include context information. To make the above methods useful for real-time systems, such as mobile devices or self-driving cars, we have collected large-scale video datasets covering multiple categories, in the hope that temporal consistency in video can further boost performance and speed up operation while lowering power consumption.
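
As an illustration only, the two mechanisms named in the abstract can be sketched in a few lines of Python. This is not the dissertation's implementation. The function names, the (cx, cy, w, h) box format, the center-offset and log-size parameterization of the residual, and the choice to append the average feature as extra channels are all assumptions made for this sketch; the abstract does not specify them, and it also does not say whether any normalization is applied.

```python
import math

# Boxes are (cx, cy, w, h) tuples; this format is an assumption of the sketch.

def encode_residual(default_box, target_box):
    """Residual of a target box relative to a discrete (default) box.

    Assumed parameterization: center offsets scaled by the default box size,
    and log ratios of width and height.
    """
    dcx, dcy, dw, dh = default_box
    tcx, tcy, tw, th = target_box
    return ((tcx - dcx) / dw, (tcy - dcy) / dh,
            math.log(tw / dw), math.log(th / dh))

def decode_residual(default_box, residual):
    """Invert encode_residual: recover a box from a default box and a residual."""
    dcx, dcy, dw, dh = default_box
    rx, ry, rw, rh = residual
    return (dcx + rx * dw, dcy + ry * dh,
            dw * math.exp(rw), dh * math.exp(rh))

def add_global_context(features):
    """Augment each location with the per-channel average feature.

    `features` is a list of C channels, each an H x W list of lists.
    Returns 2C channels: the originals, followed by one constant map per
    channel holding that channel's spatial mean (assumed: concatenation).
    """
    augmented = [[row[:] for row in channel] for channel in features]
    for channel in features:
        h, w = len(channel), len(channel[0])
        mean = sum(sum(row) for row in channel) / (h * w)
        augmented.append([[mean] * w for _ in range(h)])
    return augmented
```

In this sketch a detector would predict the four residual values for each discrete box and recover the final box with decode_residual, while a segmentation network would classify each pixel from the augmented channels so that every local decision also sees the image-wide average.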

Date of publication

December 2016

Keyword

DOI

https://doi.org/10.17615/secb-b910

Resource type

Dissertation

Rights statement

In Copyright

Advisor

Berg, Alexander
Anguelov, Dragomir
Frahm, Jan-Michael
Niethammer, Marc
Berg, Tamara

Degree

Doctor of Philosophy

Degree granting institution

University of North Carolina at Chapel Hill Graduate School

Graduation year

2016

Language

English

Date uploaded

July 5, 2017

Relations

Parents:

This work has no parents.

In Collection:

UNC-Chapel Hill Artificial Intelligence Resources

Items

Title	Date Uploaded	Visibility
Liu_unc_0153D_16675.pdf	July 5, 2017	Public
