Data Semantics Aware Clouds for High Performance Analytics

Abstract

Today's cutting-edge research deals with the increasing volume and complexity of data
produced by ultra-scale simulations, high resolution scientific equipment and experiments.
Representative examples include analytics- and simulation-driven applications such as
astrophysics data analysis and bioinformatics. In these fields, scientists process and
analyze large amounts of data to explore new concepts and ideas.

Many scientists are exploring the possibility of deploying large-scale data applications
on cloud computing platforms such as Amazon EC2 and Windows Azure. The recent successful
deployment of eScience applications on clouds motivates us to deploy HPC analytics
applications to the cloud, especially MapReduce-enabled ones. The reason is that
eScience applications and HPC analytics applications share some important features:
terascale or petascale data sizes and a high cost to run on one or several supercomputers
or large platforms. However, HPC analytics applications have some distinct characteristics,
such as complex data access patterns and interest locality, which pose new challenges to
their adoption on clouds. Current solutions do not deal well with these challenges and
have several limitations.

This project develops a data semantics aware framework to support HPC analytics applications on clouds.

Intellectual Merit

The intellectual merit of this proposal is the development of a data semantics aware
framework to support HPC analytics applications on clouds. Our approach is geared toward
data-semantics aware processing of big scientific data. The infrastructure consists of three
components: 1) a MapReduce API with data semantics awareness (MARS) for developing
high-performance MapReduce applications, 2) a translation layer equipped with
data-semantics aware HPC interfaces (HIAS), and 3) a neural network based
Data-Affinity-Aware data placement scheme (NDAFA).

Broader Impact

The broader impacts of the proposal are as follows: economic impact through higher
productivity, achieved by supporting cost-effective scientific and engineering big data
processing; delivery of open source software to speed up the 21st century scientific discovery
process in many areas of data-intensive HPC analytics, such as cosmology, astrophysics,
chromodynamics, and bioinformatics; potential for rapid dissemination of research outcomes that
reduce costs and increase operational efficiency in high-end data and computing systems;
educational benefits from broadening the experience of underrepresented students through
our collaborative effort with several NSF-funded STEM projects, such as developing artistic
nuggets for data semantics aware clouds; and broader external collaboration and
community ties through integration into the NSF-funded FutureGrid, DOE’s Scientific
Computing Cloud (Magellan), and others. The synergy among the University of Central Florida,
the Department of Energy national laboratories, and other institutions such as the University of
Florida will help provide new and beneficial perspectives in the students' graduate
education and prepare a workforce in both cloud computing and computer systems.

Use of FutureGrid

We will use FutureGrid to develop, debug, test, and evaluate our proposed data-semantics aware software systems and tools.

Scale Of Use

We need a stable testing environment in which we can set up a Hadoop cluster testbed of up to 128 or 256 nodes. This would let us perform sensitivity tests on Hadoop clusters ranging from 16 to 256 nodes. The storage capacity should be tens to hundreds of terabytes. We plan to use the testbed for 3 years.

Publications


FG-190
Jun Wang
University of Central Florida
Active

FutureGrid Experts

Gregor von Laszewski
Zhenhua Guo
