Language and Perceptual Categorization in Computational Visual Recognition Public Deposited

Downloadable Content

Download PDF
Last Modified
  • March 19, 2019
  • Ordonez Roman, Vicente
    • Affiliation: College of Arts and Sciences, Department of Computer Science
  • Computational visual recognition or giving computers the ability to understand images as well as humans do is a core problem in Computer Vision. Traditional recognition systems often describe visual content by producing a set of isolated labels, object locations, or by even trying to annotate every pixel in an image with a category. People instead describe the visual world using language. The rich visually descriptive language produced by people incorporates information from human intuition, world knowledge, visual saliency, and common sense that go beyond detecting individual visual concepts like objects, attributes, or scenes. Moreover, due to the rising popularity of social media, there exist billions of images with associated text on the web, yet systems that can leverage this type of annotations or try to connect language and vision are scarce. In this dissertation, we propose new approaches that explore the connections between language and vision at several levels of detail by combining techniques from Computer Vision and Natural Language Understanding. We first present a data-driven technique for understanding and generating image descriptions using natural language, including automatically collecting a big-scale dataset of images with visually descriptive captions. Then we introduce a system for retrieving short visually descriptive phrases for describing some part or aspect of an image, and a simple technique to generate full image descriptions by stitching short phrases. Next we introduce an approach for collecting and generating referring expressions for objects in natural scenes at a much larger scale than previous studies. Finally, we describe methods for learning how to name objects by using intuitions from perceptual categorization related to basic-level and entry-level categories. The main contribution of this thesis is in advancing our knowledge on how to leverage language and intuitions from human perception to create visual recognition systems that can better learn from and communicate with people.
Date of publication
Resource type
Rights statement
  • In Copyright
  • Frahm, Jan-Michael
  • Efros, Alexei
  • Berg, Tamara
  • Berg, Alexander
  • Choi, Yejin
  • Doctor of Philosophy
Degree granting institution
  • University of North Carolina at Chapel Hill Graduate School
Graduation year
  • 2015
Place of publication
  • Chapel Hill, NC
  • There are no restrictions to this item.

This work has no parents.