e-Science 2008 4th IEEE International Conference on e-Science

Exhibits, Demos & Posters

Scientific Computing Autonomic Reliability Framework

Authors

  • Abhishek Dubey, Department of EECS, Vanderbilt University
  • Sandeep Neema, Department of EECS, Vanderbilt University
  • Jim Kowalkowski, Fermi National Accelerator Laboratory, Batavia, Illinois
  • Amitoj Singh, Fermi National Accelerator Laboratory, Batavia, Illinois

Abstract

Large scientific computing clusters require a distributed dependability subsystem that provides fault isolation and recovery and that can learn to predict failures, improving the reliability of scientific workflows. In this paper, we outline the key ideas in the design of the Scientific Computing Autonomic Reliability Framework (SCARF) for the large computing clusters used in the Lattice Quantum Chromodynamics (LQCD) project at Fermilab.

The framework enables the specification of workflows, mitigation strategies, and the state variables to monitor in the system. We have currently deployed the runtime framework on three LQCD clusters. The sensor data collected so far lets us analyze failures when they occur and develop the correlation models and nominal behavior models that the analyzer can use.
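To make the idea of monitored state variables, nominal behavior models, and mitigation strategies concrete, the following is a minimal sketch, not SCARF's actual design or API. It assumes a nominal model is simply the mean and standard deviation of past sensor readings, and that a mitigation strategy is a callback keyed by state variable. The names `NominalModel`, `check_node`, and `cpu_temp_c`, the sample readings, and the three-sigma threshold are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable, Dict, List


@dataclass
class NominalModel:
    """Nominal behavior of one state variable (hypothetical: mean/std of past data)."""
    mean: float
    std: float
    threshold: float = 3.0  # assumed: allowed deviation, in standard deviations

    @classmethod
    def fit(cls, samples: List[float], threshold: float = 3.0) -> "NominalModel":
        # Learn the nominal range from previously collected sensor readings.
        return cls(mean(samples), stdev(samples), threshold)

    def is_anomalous(self, value: float) -> bool:
        # A reading is anomalous if it lies too far from the nominal mean.
        if self.std == 0:
            return value != self.mean
        return abs(value - self.mean) / self.std > self.threshold


def check_node(
    readings: Dict[str, float],
    models: Dict[str, NominalModel],
    mitigations: Dict[str, Callable[[float], str]],
) -> List[str]:
    """Return the mitigation actions triggered by one node's current readings."""
    actions = []
    for name, value in readings.items():
        model = models.get(name)
        mitigate = mitigations.get(name)
        if model is not None and mitigate is not None and model.is_anomalous(value):
            actions.append(mitigate(value))
    return actions


# Hypothetical sensor history and mitigation strategy for a single state variable.
history = {"cpu_temp_c": [61.0, 63.0, 62.0, 64.0, 60.0]}
models = {name: NominalModel.fit(samples) for name, samples in history.items()}
mitigations = {"cpu_temp_c": lambda v: f"drain node: cpu_temp_c={v}"}

normal_actions = check_node({"cpu_temp_c": 62.5}, models, mitigations)
alert_actions = check_node({"cpu_temp_c": 85.0}, models, mitigations)
```

In this sketch a reading near the learned mean triggers nothing, while a reading far outside the nominal range selects the mitigation registered for that variable. A real analyzer would presumably also use correlation models across variables, which this example leaves out.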
