Exhibits, Demos & Posters
Scientific Computing Autonomic Reliability Framework
Authors
- Abhishek Dubey, Department of EECS, Vanderbilt University
- Sandeep Neema, Department of EECS, Vanderbilt University
- Jim Kowalkowski, Fermi National Accelerator Laboratory, Batavia, Illinois
- Amitoj Singh, Fermi National Accelerator Laboratory, Batavia, Illinois
Abstract
Large scientific computing clusters require a distributed dependability subsystem that provides fault isolation and recovery and that can learn from and predict failures, improving the reliability of scientific workflows. In this paper, we outline the key ideas in the design of a Scientific Computing Autonomic Reliability Framework (SCARF) for the large computing clusters used in the Lattice Quantum Chromodynamics (LQCD) project at Fermilab.
The framework lets users specify workflows, mitigation strategies, and the system state variables to monitor. We have deployed the runtime framework on three LQCD clusters. The sensor data collected so far allows us to analyze failures after they occur and to develop the correlation models and nominal behavior models that can be used in the analyzer.
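As an illustration only, the pairing of monitored state variables, nominal behavior models, and mitigation strategies described above might be sketched as follows. This is not SCARF's actual interface: every name here (`NominalModel`, `StateVariable`, `analyze`, `drain_node`, the `cpu_temp_c` variable, and the three-sigma threshold) is a hypothetical assumption chosen for the sketch, and a simple mean and standard deviation stands in for whatever models the analyzer really uses.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable, Dict, List


@dataclass
class NominalModel:
    """Hypothetical nominal behavior model: mean and sample standard
    deviation learned from sensor readings taken during normal operation."""
    mu: float
    sigma: float

    @classmethod
    def fit(cls, samples: List[float]) -> "NominalModel":
        # stdev() needs at least two samples.
        return cls(mean(samples), stdev(samples))

    def is_anomalous(self, value: float, k: float = 3.0) -> bool:
        # Flag readings more than k standard deviations from the mean
        # (k = 3.0 is an assumed default, not a value from the source).
        return abs(value - self.mu) > k * self.sigma


@dataclass
class StateVariable:
    """A monitored variable with its model and an assumed mitigation."""
    name: str
    model: NominalModel
    mitigation: Callable[[str], str]


def drain_node(node: str) -> str:
    # Placeholder mitigation: a real system would act on the cluster here.
    return f"drain {node}"


def analyze(node: str, readings: Dict[str, float],
            variables: Dict[str, StateVariable]) -> List[str]:
    """Return the mitigation actions triggered by anomalous readings."""
    actions = []
    for name, value in readings.items():
        var = variables.get(name)
        if var is not None and var.model.is_anomalous(value):
            actions.append(var.mitigation(node))
    return actions


# Made-up temperature samples (degrees Celsius) used to fit the model.
variables = {
    "cpu_temp_c": StateVariable(
        name="cpu_temp_c",
        model=NominalModel.fit([60.0, 61.0, 59.0, 60.0, 62.0]),
        mitigation=drain_node,
    )
}
```

Calling `analyze("node01", {"cpu_temp_c": 90.0}, variables)` would then return the drain action for that node, while a reading close to the fitted mean returns an empty list. Keeping the model and the mitigation attached to each state variable is one possible design; it keeps the analyzer loop independent of any particular sensor.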