Performance Evaluation of Data Intensive Scientific Applications

Abstract

We propose a detailed benchmarking effort for data-intensive applications using the resources provided by FutureGrid. This work is synergistic with our other research funded by the NSF Cluster Exploratory (CluE) and SDSC's Triton Resource Opportunity (TRO) programs, and it will showcase the use of supercomputing facilities for large-scale data processing, complementing work that we will perform on other private and public grid and cloud-based environments. Under the scope of this project, we wish to install our benchmark data sets and execute a number of benchmark queries on them, using both "traditional" and Hadoop-based solutions for serving the data. The application domains that we are interested in span multiple scientific disciplines, especially geosciences and bioinformatics. In the geosciences, we are interested in benchmarking the performance of database and Hadoop-based implementations for serving high-resolution geospatial data sets. In bioinformatics, we are interested in studying the performance of various traditional and cloud-enabled codes for next-generation sequencing.

Intellectual Merit

The intellectual merit of this research lies in its contribution to understanding the performance tradeoffs and feasibility of the shared-nothing, Hadoop-style programming model for large-scale, data-intensive scientific computing, and in comparing that model with traditional HPC-style approaches. The results of this study will provide a basis for developing data-intensive scientific codes that make optimal use of grid and cloud resources.

Broader Impact

The broader impact of this study is a reassessment of how data-intensive scientific applications are implemented, and of how data sets are hosted and served to a broad community. A direct impact is the development of scientific codes in the geosciences and bioinformatics that are optimized to leverage the capabilities of grid and cloud resources and their associated programming models.

Use of FutureGrid

FutureGrid resources will be used to install our benchmark data sets and to execute a number of benchmark queries on them, using both "traditional" and Hadoop-based solutions for serving the data. Since FutureGrid provides a variety of environments, we will use it to experiment with both shared-nothing and traditional HPC-style environments.

Scale Of Use

We anticipate needing around 50,000 hours over the course of the next year.

Publications

Project Number: FG-98

Project Lead: Sriram Krishnan

Institution: University of California, San Diego

Project Status: Closed

Keywords

bioinformatics, data intensive computing, geosciences, performance

Timeline

Completed: 1 year 27 weeks ago

Updated: 1 year 27 weeks ago
