Data Semantics Aware Clouds for High Performance Analytics

Project Information

Discipline
Computer Science (401) 
Subdiscipline
14.09 Computer Engineering 
Orientation
Research 
Abstract

Today's cutting-edge research must cope with the increasing volume and complexity of data produced by ultra-scale simulations, high-resolution scientific instruments, and experiments. Representative examples include analytics- and simulation-driven applications such as astrophysics data analysis and bioinformatics. In these fields, scientists work with large amounts of data and process (analyze) them to explore new concepts and ideas. Many scientists are exploring the possibility of deploying large-scale data applications on cloud computing platforms such as Amazon EC2 and Windows Azure. The recent successful deployment of eScience applications on clouds motivates us to deploy HPC analytics applications, especially MapReduce-enabled ones, to the cloud. The reason lies in the fact that eScience applications and HPC analytics applications share some important features: terascale or petascale data sizes and a high cost to run on one or several supercomputers or large platforms. HPC analytics applications, however, bear distinct characteristics, such as complex data access patterns and interest locality, that pose new challenges to their adoption of clouds, and current solutions do not deal well with these challenges and have several limitations. This project develops a data-semantics-aware framework to support HPC analytics applications on clouds.

Intellectual Merit

The intellectual merit of this proposal is the development of a data-semantics-aware framework to support HPC analytics applications on clouds. Our approach is geared toward data-semantics-aware processing of big scientific data. The infrastructure consists of three components: 1) a MapReduce API with data semantics awareness (MARS) for developing high-performance MapReduce applications, 2) a translation layer equipped with data-semantics-aware HPC interfaces (HIAS), and 3) a neural-network-based Data-Affinity-Aware data placement scheme (NDAFA).
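To make the intended programming interface more concrete, the minimal sketch below shows how a data-semantics-aware MapReduce job might be configured on top of the standard Hadoop Java API. The "mars.*" configuration keys and the dataset label are illustrative assumptions, not published interfaces of this project or of Hadoop; they only indicate where semantics hints such as interest locality could be attached to a job so that a placement scheme like NDAFA could consume them.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Illustrative sketch only: the "mars.*" keys are hypothetical placeholders
 * for the semantics hints a MARS-style API could expose; they are not part
 * of Hadoop or a published interface of this project.
 */
public class SemanticsAwareJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical semantics hints that a data-affinity-aware placement
        // scheme could use to co-locate blocks that are analyzed together
        // (interest locality).
        conf.set("mars.semantics.dataset", "halo-catalog");                 // assumed key
        conf.set("mars.semantics.access-pattern", "spatial-neighborhood");  // assumed key

        Job job = Job.getInstance(conf, "semantics-aware-analytics");
        job.setJarByClass(SemanticsAwareJobSketch.class);
        // Default identity Mapper/Reducer are used here; a real analytics
        // job would plug in its own map and reduce classes.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In this sketch the hints travel with the job configuration, so a semantics-aware scheduler or placement layer could read them without changing the user-visible MapReduce programming model.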

Broader Impacts

The broader impacts of the proposal are as follows: improving productivity and economic impact by supporting cost-effective scientific and engineering big data processing; delivering open source software to speed up the 21st-century scientific discovery process in many areas of data-intensive HPC analytics such as cosmology, astrophysics, chromodynamics, and bioinformatics; potential for rapid dissemination of research outcomes that reduce costs and increase operational efficiency in high-end data and computing systems; educational benefits arising from broadening the experience of underrepresented students through our collaborative effort with several NSF-funded STEM projects, such as developing artistic nuggets for data-semantics-aware clouds; and broadening external collaboration and community ties through integration with the NSF-funded FutureGrid, DOE's Scientific Computing Cloud (Magellan), and others. The synergy among the University of Central Florida, the Department of Energy national laboratories, and other institutions such as the University of Florida will help provide new and beneficial perspectives in students' graduate education and prepare a workforce in both cloud computing and computer systems.

Project Contact

Project Lead
Jun Wang (wangjun) 
Project Manager
Jun Wang (wangjun) 

Resource Requirements

Hardware Systems
  • alamo (Dell PowerEdge cluster at TACC)
  • foxtrot (IBM iDataPlex at UF)
  • hotel (IBM iDataPlex at U Chicago)
  • india (IBM iDataPlex at IU)
  • sierra (IBM iDataPlex at SDSC)
  • xray (Cray XT5m at IU)
  • bravo (large memory machine at IU)
 
Use of FutureGrid

Development, debugging, testing, and evaluation of our proposed data-semantics-aware software systems and tools.

Scale of Use

We need a stable testing environment that enables us to set up a Hadoop cluster testbed of up to 128 or 256 nodes. We will perform sensitivity tests on Hadoop clusters varying from 16 to 256 nodes. The storage capacity should be on the order of tens to hundreds of terabytes. We plan to use FutureGrid for 3 years.
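As a rough illustration of how the requested storage relates to the planned 16-to-256-node sensitivity sweep, the sketch below prints the per-node disk capacity needed at each cluster size. The 20 TB working-set size is an assumed example value, not a figure from the proposal; the replication factor of 3 is the HDFS default.

/**
 * Back-of-the-envelope sizing sketch for the planned sensitivity sweep.
 * The 20 TB dataset size is an assumed example, not a project figure.
 */
public class TestbedSizingSketch {
    public static void main(String[] args) {
        double datasetTb = 20.0;  // assumed working-set size in terabytes
        int replication = 3;      // HDFS default replication factor
        for (int nodes = 16; nodes <= 256; nodes *= 2) {
            double rawTb = datasetTb * replication;
            double perNodeTb = rawTb / nodes;
            System.out.printf("%3d nodes: %.1f TB raw HDFS capacity, %.2f TB per node%n",
                    nodes, rawTb, perNodeTb);
        }
    }
}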

Project Timeline

Submitted
02/21/2012 - 09:21