Main Conference Sessions
Distributed Replica-Exchange Simulations on Production Environments Using SAGA and Migol
Authors
- Shantenu Jha, LSU
- Andre Luckow
Abstract
There exists a class of scientific applications for which utilizing distributed resources is critical for reducing the time-to-solution. However, orchestrating many distributed jobs in dynamic and inherently unreliable distributed environments is a major challenge. The more resources and components involved, the more complicated and error-prone the system becomes. We discuss a specific class of applications, Replica-Exchange simulations, where utilizing as many (often heterogeneous) distributed resources as possible is critical for the effective solution of the scientific problem. Such applications require effective mechanisms to handle the unreliability inherent in dynamic distributed systems. In this paper, we describe the design, development, and deployment of a unique framework for constructing fault-tolerant distributed simulations. The framework is scalable, general-purpose, and extensible, and consists of two primary components: SAGA and Migol. SAGA is a high-level programmatic abstraction layer that provides a standardized interface for the primary distributed functionality required for application development. We provide details of newly developed functionality in SAGA: the Checkpoint and Recovery (CPR) API. Migol is an adaptive Grid middleware that addresses the fault tolerance of Grid applications and services by providing the capability to recover applications from checkpoint files transparently. In addition to describing the integration of SAGA-CPR with the Migol infrastructure, we outline our experiences with running a large-scale, general-purpose, SAGA-CPR-based Replica-Exchange application in a production distributed environment.
Date and Time
Wednesday, December 10, 10:45 a.m. to 11:15 a.m.
Room Number
206