Laboratory for Cosmological Data Mining

Abstract

We will evaluate the use of Hadoop, and of cloud computing in general, for large-scale cosmological data mining. Specifically, we will explore the use of Mahout classification and clustering codes to determine source classifications and distance estimates for objects detected in large photometric surveys. We will also explore porting specific clustering measurement codes, such as the two-point correlation function, to the Hadoop MapReduce framework. Finally, we will look to push the machine learning tasks down to the calibrated image data themselves in order to obtain more accurate classifications.
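To illustrate how a clustering measurement such as the two-point correlation function might map onto a MapReduce pattern, the following is a minimal sketch in plain Python rather than Hadoop code. It assumes 2-D Euclidean positions and fixed-width separation bins; the function names (`bin_index`, `map_block_pair`, `pair_counts`) are hypothetical and chosen only for this illustration. Each map task counts pairs for one pair of data blocks, and the reduce step sums the per-bin counts.

```python
import math
from collections import Counter
from functools import reduce
from itertools import combinations, product


def bin_index(p, q, bin_width):
    """Separation bin for two 2-D points (assumed Euclidean distance)."""
    return int(math.dist(p, q) // bin_width)


def map_block_pair(block_a, block_b, bin_width, same_block):
    """Map step: count pairs for one pair of blocks, keyed by separation bin."""
    if same_block:
        # Within a single block, count each unordered pair once.
        pairs = combinations(block_a, 2)
    else:
        # Across two distinct blocks, every cross pair is unique.
        pairs = product(block_a, block_b)
    return Counter(bin_index(p, q, bin_width) for p, q in pairs)


def pair_counts(points, n_blocks, bin_width):
    """Split points into blocks, map over block pairs, reduce by summing."""
    blocks = [points[i::n_blocks] for i in range(n_blocks)]
    partials = [
        map_block_pair(blocks[i], blocks[j], bin_width, i == j)
        for i in range(n_blocks)
        for j in range(i, n_blocks)
    ]
    # Reduce step: merge the partial histograms bin by bin.
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
    print(pair_counts(points, n_blocks=2, bin_width=0.8))
```

A correlation function estimate would additionally need the same pair counts for a random catalog so that the data counts can be normalized; that step is omitted here.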

Intellectual Merit

Our project will explore large-scale data mining on the FutureGrid system. Most of the algorithms that we will use are not traditional MapReduce tasks, so our work will help extend the cloud computing approach to general-purpose data mining. In addition, our image data mining will help lead the way for other researchers who need to perform bulk image analysis and mining.

Broader Impact

Beyond guiding others, both inside and outside our field, who may be interested in our data mining efforts, we will also teach students in our research group how to use FutureGrid and Hadoop, as well as the specific data mining algorithms and implementations available in Mahout.

Use of FutureGrid

We want to test the deployment of data mining virtual machines and the use of Mahout on Hadoop.

Scale Of Use

We wish to scale as large as possible. We will also try Mahout on GPUs if that is deemed feasible.

Publications


FG-397
Robert Brunner
University of Illinois
Active

Project Members

Edward Kim
Fanshi Liu
Nick Ciaglia
Robert Santucci

Timeline

35 weeks 3 days ago