Learning mid-level image representations with little or no supervision

Abstract

A good image representation is essential for any high-level computer vision task, including object recognition, image retrieval, and scene classification. Building an expressive, multi-purpose image representation has, however, proved extremely challenging. Low-level representations (e.g., edge maps, histograms of gradients) are generally too simple to allow reliable inferences about images, while building high-level representations (e.g., object detectors, scene classifiers, deep neural networks) classically requires huge amounts of hand-labeled data. Our project therefore focuses on "mid-level" representations, which capture moderately complicated, informative patterns in images, between these two extremes. We call the basic units of our representation "mid-level visual elements." Visual elements may be objects, object parts, or other visual patterns, but they must satisfy two criteria: (1) they occur frequently in the visual world, and (2) they are informative, in the sense that they tell us something about the image. Empirically, we have found that optimizing elements for these two criteria leads to an intuitive and powerful representation, and furthermore that the training can be done using only inexpensive labels. For instance, in [1] we used geotagged images (collected automatically from Google Street View) to optimize elements to be specific to particular geographic locations, e.g., the city of Paris. Surprisingly, the algorithm learned to detect many of the distinctive stylistic details of Paris, which are not only geographically informative, but also let us reason about the positions of facades within the images and subdivide the city into different architectural styles. In [2,3], we applied similar mid-level representations to indoor scene classification, and in [4] we inferred the stylistic changes of cars over several decades.

[1] "What Makes Paris Look Like Paris?", Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros, in SIGGRAPH 2012.
[2] "Unsupervised Discovery of Mid-Level Discriminative Patches", Saurabh Singh, Abhinav Gupta, and Alexei A. Efros, in ECCV 2012.
[3] "Mid-Level Visual Element Discovery as Discriminative Mode Seeking", Carl Doersch, Abhinav Gupta, and Alexei A. Efros, in NIPS 2013.
[4] "Style-Aware Mid-Level Representation for Discovering Visual Connections in Space and Time", Yong Jae Lee, Alexei A. Efros, and Martial Hebert, in ICCV 2013.
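
To make the two criteria concrete, the sketch below shows one way such elements could be mined: candidate elements are seeded by clustering patch descriptors from a target image set (criterion 1, frequency) and then iteratively refined with a linear SVM trained against a large "rest of the world" negative set (criterion 2, informativeness). This is only a minimal illustration in the spirit of [1,2]; the function and variable names (mine_elements, pos_patches, neg_patches, etc.) and all parameter values are hypothetical and do not describe our actual pipeline.

    # Minimal, illustrative sketch of discriminative element mining.
    # Assumed inputs: pos_patches and neg_patches are (N, D) arrays of patch
    # descriptors (e.g., HOG) from the target set (e.g., Paris) and from a
    # generic "rest of the world" set, respectively.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def mine_elements(pos_patches, neg_patches, n_elements=20, n_iters=3, top_k=50):
        # Seed candidate elements by clustering patches from the positive set,
        # so every candidate starts from a frequently occurring pattern.
        seeds = KMeans(n_clusters=n_elements, n_init=10).fit(pos_patches)
        detectors = []
        for c in range(n_elements):
            members = pos_patches[seeds.labels_ == c]
            if len(members) < 5:
                continue  # too rare to satisfy the frequency criterion
            svm = LinearSVC(C=0.1)
            for _ in range(n_iters):
                # Train this element's detector against the negative world,
                # which is what makes the element informative.
                X = np.vstack([members, neg_patches])
                y = np.hstack([np.ones(len(members)), np.zeros(len(neg_patches))])
                svm.fit(X, y)
                # Re-select the top-scoring positive patches as the new members,
                # so the element converges to a coherent visual pattern.
                scores = svm.decision_function(pos_patches)
                members = pos_patches[np.argsort(-scores)[:top_k]]
            detectors.append(svm)
        return detectors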

Intellectual Merit

While our initial experiments have already shown state-of-the-art results, our prototype pipelines were designed to handle at most 100,000 images, whereas modern datasets such as ImageNet contain millions. We are currently investigating approximations that would allow us to scale to datasets of that size. We also hope to investigate further applications of this technology, including 3D reconstruction of objects using point correspondences derived from visual elements.

Broader Impact

Our long-term goal for this project is to create an image representation that improves state-of-the-art performance across a wide range of computer vision tasks, including image classification, object localization, dataset visualization, and scene understanding. Beyond computer vision, we are investigating a number of applications, including large-scale analysis of painting styles, visual data mining of historical records, and changes in the appearance of faces in populations over the last 100 years.

Use of FutureGrid

Like all algorithms aimed at big visual data, ours is computationally intensive; so far, increasing the amount of data and compute time has improved performance seemingly without bound. Unlike many existing representation-learning algorithms, however, our element-based algorithm is highly parallelizable in practice, simply because each element can be learned independently of the others (in stark contrast to algorithms such as the recently popular deep neural networks, where the entire representation must be learned jointly). We believe FutureGrid will be an excellent platform for running our pipeline quickly and economically.
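
As a rough illustration of this independence, the sketch below trains each element's detector in a separate worker process; it is an assumed example rather than our actual code, and train_one_element is a hypothetical stand-in for the per-element training loop sketched in the Abstract above. Because no element depends on another, the workers never need to communicate, which is what makes the pipeline easy to distribute across nodes.

    # Illustrative sketch of embarrassingly parallel element training.
    # seed_clusters is assumed to be a list of (N_i, D) descriptor arrays,
    # one per candidate element; neg_patches is the shared negative set.
    import numpy as np
    from multiprocessing import Pool
    from sklearn.svm import LinearSVC

    def train_one_element(args):
        # Train a single element's detector against the shared negative set.
        # Nothing here depends on any other element.
        element_patches, neg_patches = args
        X = np.vstack([element_patches, neg_patches])
        y = np.hstack([np.ones(len(element_patches)), np.zeros(len(neg_patches))])
        return LinearSVC(C=0.1).fit(X, y)

    def train_all_elements(seed_clusters, neg_patches, n_workers=8):
        # Each worker trains one element at a time; results are collected in order.
        # (On platforms that spawn workers, call this under an
        # if __name__ == "__main__": guard.)
        with Pool(n_workers) as pool:
            return pool.map(train_one_element,
                            [(c, neg_patches) for c in seed_clusters])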

Scale Of Use

We would like to request 3 to 4 GPU-equipped nodes. Our storage requirements will be moderate, on the order of a few terabytes.

Publications


FG-450
Alexei Efros
Tinghui Zhou
UC Berkeley
Active

Project Members

Abhinav Shrivastava
Dinesh Jayaraman
Shiry Ginosar
Stefan Lee
Tinghui Zhou
