Exploring HPC Fault Tolerance on the Cloud

Project Information

Discipline: Computer Science (401)
Subdiscipline: 11.07 Computer Science
Orientation: Research

Abstract

The potential of running traditional High-Performance Computing (HPC) applications in the emerging cloud computing environment has attracted intense interest recently. While most existing studies focus on the performance, scalability, and productivity of HPC applications on the cloud, reliability has rarely been studied in this context. The HPC community already recognizes reliability as a critical challenge that limits scalability, especially for upcoming exascale computing environments. The cloud environment introduces a more complicated software architecture and exposes the system to a higher risk of failures. Distinct applications running on different virtual machines may share the same physical resources, which causes resource contention and interaction between applications and further threatens reliability. Checkpointing is the most widely used mechanism for supporting fault tolerance in HPC applications. While checkpointing and its behavior have been well studied on high-end computing (HEC) systems, few studies evaluate the impact of the cloud on checkpointing performance. Checkpointing on the cloud may exhibit features that distinguish it from traditional HEC checkpointing. For example, in the cloud the hard disk of one physical machine is shared by multiple virtual machines, which hurts performance because checkpointing requests are usually issued in bursts. Also, if the checkpoint images go to a dedicated storage service in the cloud, more variation may appear because multiple virtual machines on one node share one NIC, which makes performance difficult to predict. The objective of this project is to study the reliability implications of running HPC applications on the cloud. In the first stage of the project, we will focus on evaluating the performance of parallel checkpointing on the cloud.
On the FutureGrid (FG) platform, we will scale up the application size used in our previous experiments. Performance will be tested on both the local storage of each virtual machine and the shared storage.
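As a rough illustration of the kind of measurement described above, the following minimal sketch times a single checkpoint-style write to a given directory. It is not the project's benchmark: the function name `time_checkpoint_write`, the file name `checkpoint.img`, and the chunk size are all hypothetical choices, and a real study would issue such writes concurrently from many virtual machines. Pointing `directory` at local or shared storage would let the two cases be compared.

```python
import os
import tempfile
import time


def time_checkpoint_write(directory: str, size_bytes: int,
                          chunk_bytes: int = 1 << 20) -> float:
    """Write size_bytes of dummy data to directory; return elapsed seconds.

    Hypothetical helper for illustration only. The data is forced to
    storage with fsync so the timing is not dominated by the page cache.
    """
    payload = b"\0" * chunk_bytes
    path = os.path.join(directory, "checkpoint.img")  # hypothetical file name
    start = time.perf_counter()
    with open(path, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            n = min(chunk_bytes, remaining)
            f.write(payload[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())
    return time.perf_counter() - start


# Example: time a 4 MiB write into a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    size = 4 * 1024 * 1024
    elapsed = time_checkpoint_write(tmp, size)
    print(f"wrote {size} bytes in {elapsed:.4f} s")
```

Repeating the measurement many times and recording the spread, not only the mean, would expose the variability that shared disks and NICs are expected to introduce.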

Intellectual Merit

This project will evaluate the performance of parallel checkpointing, identify the bottlenecks, and propose solutions to potential problems. The project will deliver a paper submission to a refereed conference and form part of the Ph.D. thesis of Hui Jin.

Broader Impacts

This project will present a comprehensive study of the reliability potential and issues of the emerging cloud computing environment. The project will provide guidelines for designing and deploying HPC applications on the cloud from the perspective of reliability. The project will also propose possible solutions for boosting application performance under failures in the cloud environment.

Project Contact

Project Lead: Hui Jin (hjin)
Project Manager: Hui Jin (hjin)
Project Members: Xi Yang

Resource Requirements

Hardware System

I don't care (what I really need is a software environment and I don't care where it runs)

Use of FutureGrid

We will use the FutureGrid platform to examine the performance of parallel checkpointing in the cloud environment. We will also study the performance of MapReduce under failures and look into the causes of performance degradation in the presence of failures.

Scale of Use

We expect to reach scales of at least 2048 VM instances and evaluate the corresponding checkpointing performance. The initial part of the project is planned to last about 3 months, with an estimated total of 409,600 CPU hours (2048 VMs * 4 hrs * 50 runs).
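The CPU-hour estimate above can be reproduced with a short calculation. This is a sketch under one stated assumption that the source formula implies but does not spell out: each VM is charged as a single CPU core. The function name and the `cores_per_vm` parameter are hypothetical.

```python
def estimated_cpu_hours(vm_count: int, hours_per_run: int, runs: int,
                        cores_per_vm: int = 1) -> int:
    """Total CPU hours for repeated runs on a fixed number of VMs.

    cores_per_vm defaults to 1, an assumption that matches the
    2048 * 4 * 50 formula in the text.
    """
    return vm_count * hours_per_run * runs * cores_per_vm


total = estimated_cpu_hours(vm_count=2048, hours_per_run=4, runs=50)
print(total)  # 409600
```

If the VMs were allocated more than one core each, the request would scale linearly with `cores_per_vm`.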

Project Timeline

Submitted: 01/26/2011 - 19:46
