Large scale data analytics

Project Information

Discipline
Electrical and Related Engineering (106) 
Subdiscipline
11.04 Information Sciences and Systems 
Orientation
Research 
Abstract

The pervasive deployment of environmental sensors and instruments that monitor natural and human activities is leading to the generation of large data sets at fine granularities of time and space. eScience and eEngineering applications can shift from an approach based on modeling and empirical testing of hypotheses to one that defines predictive models from current and historical information. Such data analytics applications leverage data mining and machine learning methods to analyze large-scale information in support of research, development and even operations. However, these applications are data and compute intensive, and because the underlying data keeps changing they must be rerun often. In this project, we propose to develop and scale machine learning algorithms onto elastic Cloud infrastructure to build predictive models for power forecasting in smart electricity grids. These models can subsequently be used to make real-time predictions of energy usage at campus and city scales for energy conservation and planning. Programming models such as Map-Reduce/Hadoop and directed acyclic graphs (DAGs) will be used to describe these applications and execute them on public Cloud platforms such as Eucalyptus to evaluate their efficacy for both static and streaming datasets.

Intellectual Merit

Large-scale data mining and machine learning are compute and data intensive, but their execution on distributed systems is less studied. This limits users to running them on smaller samples of a dataset that fit on a single machine, even though larger datasets are available. Most algorithms that have been ported to the Cloud are inherently loosely coupled, but several commonly used modelling techniques are not natively loosely coupled. We will study scalable machine learning models for classification, such as regression trees and artificial neural networks (ANNs), that use novel algorithms or mapping techniques for scalable execution on the Cloud.
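As an illustration of the kind of mapping involved, the sketch below shows one well-known way to express a single regression-tree split search in a map/reduce style. It is not the project's algorithm. Each mapper summarizes its data partition as per-threshold sufficient statistics (count, sum, and sum of squares of the target), and a reducer merges these summaries so the best split can be chosen without moving raw rows. The function names, the single-feature setup, and the fixed candidate thresholds are all assumptions made for this example.

```python
from functools import reduce


def map_partition(rows, thresholds):
    """Mapper (hypothetical): summarize one partition of (feature, target) rows.

    For every candidate threshold, accumulate [count, sum, sum of squares]
    of the target on the left (feature <= threshold) and right sides.
    """
    stats = {t: [[0, 0.0, 0.0], [0, 0.0, 0.0]] for t in thresholds}
    for x, y in rows:
        for t in thresholds:
            side = stats[t][0] if x <= t else stats[t][1]
            side[0] += 1
            side[1] += y
            side[2] += y * y
    return stats


def merge(a, b):
    """Reducer (hypothetical): add the statistics of two partitions element-wise."""
    return {
        t: [[p + q for p, q in zip(a[t][s], b[t][s])] for s in (0, 1)]
        for t in a
    }


def sse(n, s, ss):
    """Sum of squared errors around the mean, computed from the statistics."""
    return ss - s * s / n if n else 0.0


def best_split(partitions, thresholds):
    """Pick the threshold that minimizes the total SSE of the two children."""
    merged = reduce(merge, (map_partition(p, thresholds) for p in partitions))
    return min(thresholds, key=lambda t: sse(*merged[t][0]) + sse(*merged[t][1]))


# Toy, made-up values: (feature, target) pairs spread over two partitions.
partitions = [[(1, 10.0), (2, 11.0)], [(8, 50.0), (9, 52.0)]]
thresholds = [1.5, 5.0, 8.5]
chosen = best_split(partitions, thresholds)
```

Because the merged statistics are plain sums, the chosen split does not depend on how rows are distributed across partitions. That independence is what makes this step loosely coupled; growing the full tree still needs coordination between levels.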

Broader Impacts

The results of our work will allow broader and more effective use of data mining for eScience and eEngineering. We will apply the tools and algorithms we develop to the smart power grid domain for energy use forecasting, but the machine learning algorithms themselves will be generally applicable. All our research and development will be publicly available, and the research results will be published in workshops and conferences for access by the broader community.

Project Contact

Project Lead
Yogesh Simmhan (simmhan) 
Project Manager
Yogesh Simmhan (simmhan) 
Project Members
Alok Gautam Kumbhare, Charith Wickramaarachchi, Nam Ma, Hsuan-Yi Chu  

Resource Requirements

Hardware Systems
  • india (IBM iDataPlex at IU)
  • sierra (IBM iDataPlex at SDSC)
 
Use of FutureGrid

We will use FutureGrid resources to deploy, test and run scalable data analytics applications. We plan to investigate mapping these algorithms and applications to Map-Reduce/Hadoop running on top of Eucalyptus, and also to generic DAG models on Eucalyptus and Nimbus.

Scale of Use

We expect to use a few VMs for regular (daily) experiments and hundreds of VMs for scalability testing once a week.

Project Timeline

Submitted
06/26/2011 - 20:44