Fault Tolerance of HPC systems

Project Information

Discipline: Computer Science (401)
Subdiscipline: 14.10 Electrical, Electronics, and Communications Engineering
Orientation: Research

Abstract

Cloud computing offers new capacity and flexibility to high-performance computing (HPC) applications by provisioning large numbers of virtual machines for computationally intensive workloads. Fault tolerance allows an HPC system running on multiple cloud nodes to complete the execution of computationally intensive applications (HPC applications) in the presence of faults. The most commonly used fault tolerance technique for HPC is checkpoint/restart. However, checkpoint/restart increases the wall-clock time of execution, which in turn increases the cost of running the application. We want to develop middleware that reduces the execution time of computationally intensive applications in the presence of faults.
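
To illustrate the checkpoint/restart technique mentioned above, here is a minimal single-process sketch in Python. It is not the proposed middleware: the function names (save_checkpoint, load_checkpoint, run), the pickle-based state file, and the simulated failure are all assumptions made for illustration. Each checkpoint write adds time to the run, which is the wall-clock overhead the abstract refers to.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, interval, fail_at=None):
    """Sum the squares of 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical knob that simulates a node failure at a step.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"] ** 2
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    try:
        run(ckpt, n_steps=10, interval=5, fail_at=7)
    except RuntimeError:
        pass  # work done after the last checkpoint (steps 5 and 6) is lost
    print(run(ckpt, n_steps=10, interval=5))  # restarts from step 5
```

In this sketch, a shorter checkpoint interval loses less work on failure but spends more time writing state; a longer interval does the opposite.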

Intellectual Merit

The activities will create a low-latency middleware that reduces the impact of failures on computationally intensive applications running on HPC systems.

Broader Impacts

Applications running on HPC systems will benefit greatly, because most computationally intensive applications are scientific applications, which are usually affected by failures and faults.

Project Contact

Project Lead: Ifeanyi Egwutuoha (egwutuoha)
Project Manager: Ifeanyi Egwutuoha (egwutuoha)
Project Members: David Levy

Resource Requirements

Hardware Systems

alamo (Dell OptiPlex at TACC)
foxtrot (IBM iDataPlex at UF)
hotel (IBM iDataPlex at U Chicago)
sierra (IBM iDataPlex at SDSC)

Use of FutureGrid

For my research. I am currently a PhD student.

Scale of Use

I will need to provision VMs across different systems and nodes for an experiment.

Project Timeline

Submitted: 03/23/2012 - 23:02
