Laboratory for Cosmological Data Mining

Project Information

Discipline
Astronomy (201) 
Subdiscipline
40.03 Astrophysics 
Orientation
Research 
Abstract

We will evaluate the use of Hadoop, and cloud computing in general, for the task of large-scale cosmological data mining. Specifically, we will explore the use of Mahout classification and clustering codes to determine source classifications and distance estimates for objects detected in large photometric surveys. We will also explore porting specific clustering measurement codes, such as the two-point correlation function, to the Hadoop MapReduce framework. Finally, we will look to push the machine learning tasks to the calibrated image data themselves, in order to obtain more accurate classifications.
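To illustrate why the two-point correlation function maps onto the MapReduce model, consider its core operation: counting galaxy pairs in bins of angular separation. A map step can emit a (bin, 1) record for each pair, and a reduce step sums the counts per bin. The sketch below (plain Python, not the project's actual Hadoop code; point values and bin width are illustrative) shows this decomposition on a toy catalog:

```python
import math
from collections import defaultdict

def angular_separation(p, q):
    """Great-circle separation in degrees between two (ra, dec) points (degrees)."""
    ra1, dec1 = map(math.radians, p)
    ra2, dec2 = map(math.radians, q)
    cos_sep = (math.sin(dec1) * math.sin(dec2)
               + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_sep))))

def map_pairs(points, bin_width=1.0):
    """Map step: emit (bin_index, 1) for every unique pair of points."""
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            sep = angular_separation(points[i], points[j])
            yield int(sep // bin_width), 1

def reduce_counts(emitted):
    """Reduce step: sum the pair counts within each separation bin."""
    counts = defaultdict(int)
    for bin_index, count in emitted:
        counts[bin_index] += count
    return dict(counts)

# Toy catalog of (ra, dec) positions in degrees, all on the equator.
catalog = [(10.0, 0.0), (10.4, 0.0), (12.3, 0.0)]
dd = reduce_counts(map_pairs(catalog, bin_width=1.0))  # data-data pair counts per bin
```

The same map and reduce steps, applied to data-random and random-random catalogs, yield the pair counts that enter a correlation-function estimator; the naive all-pairs map step above is O(N^2), which is exactly the scaling cost that motivates distributing the work across a Hadoop cluster.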

Intellectual Merit

Our project will explore large-scale data mining on the FutureGrid system. Most algorithms that we will use are not traditional MapReduce tasks; thus we will help develop the cloud computing approach to general-purpose data mining. In addition, our image data mining will help lead the way for other researchers who need to perform bulk image analysis and mining.

Broader Impacts

Beyond guiding others, both in our field and outside it, who may be interested in our data mining efforts, we will also teach the students in our research group how to use FutureGrid and Hadoop, as well as the specific data mining algorithms and implementations that exist in Mahout.

Project Contact

Project Lead
Robert Brunner (bigdog) 
Project Manager
Robert Brunner (bigdog) 
Project Members
Edward Kim, Robert Santucci, Fanshi Liu, Nick Ciaglia  

Resource Requirements

Hardware System
  • Not sure
 
Use of FutureGrid

We want to test the deployment of data mining virtual machines and the use of Mahout on Hadoop.

Scale of Use

We wish to scale as large as possible. We will also try Mahout on GPUs if deemed feasible.

Project Timeline

Submitted
11/22/2013 - 11:03