Large scale data analytics

Project Information

Discipline: Electrical and Related Engineering (106)
Subdiscipline: 11.04 Information Sciences and Systems
Orientation: Research

Abstract

The pervasive deployment of environmental sensors and instruments that monitor natural and human activities is leading to the generation of large data sets at fine granularities of time and space. eScience and eEngineering applications can shift from an approach based on modeling and empirical hypothesis testing to one that defines predictive models from current and historical information. Such data analytics applications for eScience and eEngineering leverage data mining and machine learning methods to analyze large scale information in support of research, development and even operations. However, such applications are data and compute intensive, and the changing nature of the data requires them to be rerun often. In this project, we propose to develop and scale machine learning algorithms onto elastic Cloud infrastructure to build predictive models for power forecasting in smart electricity grids. These models can subsequently be used to make realtime predictions of energy usage at campus and city scales for energy conservation and planning. Programming models such as Map-Reduce/Hadoop and DAGs will be used to describe these applications and execute them on public Cloud platforms such as Eucalyptus to evaluate their efficacy for both static and streaming datasets.

Intellectual Merit

Large scale data mining and machine learning are compute and data intensive, but their execution on distributed systems is less studied. This limits users to running them on smaller samples of a dataset that fit on a single machine, even when larger datasets are available. Most algorithms that are ported to the Cloud are inherently loosely coupled, but several commonly used modelling techniques are not naively loosely coupled. We will study scalable machine learning models for classification, such as regression trees and artificial neural networks (ANN), that use novel algorithms or mapping techniques for scalable execution on the Cloud.
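As an illustration of what such a mapping can look like, the following is a minimal Python sketch, not the project's implementation, of one well-known way to make a regression tree split search loosely coupled. Each data partition computes sufficient statistics (count, sum and sum of squares of the target) per candidate threshold in a map step, and a reduce step adds them up, so no single machine has to hold the full dataset. The function names, the single-feature input layout and the sample readings are all assumptions made for this example.

```python
from functools import reduce


def map_partition(rows, thresholds):
    """Map step: per-threshold sufficient statistics for one data partition.

    rows is a list of (feature_value, target) pairs. For each candidate
    threshold t, collect [n, sum_y, sum_y2] for the left side (x <= t)
    and the right side (x > t).
    """
    stats = {t: [[0, 0.0, 0.0], [0, 0.0, 0.0]] for t in thresholds}
    for x, y in rows:
        for t in thresholds:
            side = stats[t][0] if x <= t else stats[t][1]
            side[0] += 1
            side[1] += y
            side[2] += y * y
    return stats


def reduce_stats(a, b):
    """Reduce step: element-wise sum of two partial statistics tables."""
    return {
        t: [[p + q for p, q in zip(a[t][s], b[t][s])] for s in (0, 1)]
        for t in a
    }


def sse(n, s, s2):
    """Sum of squared errors around the mean, from sufficient statistics."""
    return s2 - s * s / n if n else 0.0


def best_split(partitions, thresholds):
    """Pick the threshold that minimises the total squared error of both sides."""
    merged = reduce(reduce_stats, (map_partition(p, thresholds) for p in partitions))
    return min(thresholds, key=lambda t: sse(*merged[t][0]) + sse(*merged[t][1]))


# Hypothetical readings: (temperature-like feature, energy use), in two partitions.
partitions = [
    [(1.0, 10.0), (2.0, 11.0)],
    [(8.0, 50.0), (9.0, 52.0)],
]
thresholds = [1.5, 5.0, 8.5]
print(best_split(partitions, thresholds))
```

Because the statistics are additive, the result is the same whether the rows sit in one partition or many, which is what makes this step loosely coupled. Growing a full tree still needs one such pass per level, and that dependency between passes is the kind of coupling that makes these models harder to map than embarrassingly parallel workloads.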

Broader Impacts

The result of our work will allow for a broader and more effective use of data mining for eScience and eEngineering. We will apply the tools and algorithms we develop to the smart power grid domain for energy use forecasting, but the machine learning algorithms will themselves be generally applicable. All our research and development will be publicly available and the research results published in workshops and conferences for access by the broader community.

Project Contact

Project Lead: Yogesh Simmhan (simmhan)
Project Manager: Yogesh Simmhan (simmhan)
Project Members: Alok Gautam Kumbhare, Charith Wickramaarachchi, Nam Ma, Hsuan-Yi Chu

Resource Requirements

Hardware Systems

india (IBM iDataPlex at IU)
sierra (IBM iDataPlex at SDSC)

Use of FutureGrid

We will use FutureGrid resources to deploy, test and run scalable data analytics applications. We plan to investigate mapping these algorithms and applications to Map-Reduce/Hadoop running on top of Eucalyptus, and also to generic DAG models on Eucalyptus and Nimbus.

Scale of Use

We expect to use a few VMs for regular (daily) experiments and hundreds of VMs for scalability testing once a week.

Project Timeline

Submitted: 06/26/2011 - 20:44
