STAMPEDE

Abstract

Large-scale applications today rely on distributed resources for their computations and generate large amounts of log information as they execute. Until now, we have used the NetLogger analysis tools for offline log analysis. Stampede extends this offline workflow log analysis capability into a comprehensive middleware solution that will allow users of complex scientific applications to track the status of their jobs in real time, detect execution anomalies automatically, and troubleshoot online without logging in to remote nodes or searching through thousands of log files.

Intellectual Merit

The system will be able to capture application-level logs from jobs as they are executing on the cyberinfrastructure. At the same time, it will also collect log information from the underlying cyberinfrastructure services, such as resource management and data transfer. These end-to-end logs will be combined and brokered through a subscription interface. External components will use the subscription interface to provide monitoring services.
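
The brokering idea described above can be illustrated with a small sketch. This is not Stampede's implementation or its event schema: the class name LogBroker, the event names (app.job.end, infra.transfer.end), and the field names (job_id, status) are all hypothetical, chosen only to show how application-level and infrastructure-level log events might pass through one subscription interface that an external monitoring component listens on.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical event shape: a flat dict with an "event" name plus free-form fields.
Event = Dict[str, str]
Handler = Callable[[Event], None]


class LogBroker:
    """Toy in-process broker that routes log events to subscribers by name prefix."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, prefix: str, handler: Handler) -> None:
        """Register a handler for every event whose name starts with prefix."""
        self._subscribers[prefix].append(handler)

    def publish(self, event: Event) -> None:
        """Deliver one event to all handlers whose prefix matches its name."""
        name = event["event"]
        for prefix, handlers in self._subscribers.items():
            if name.startswith(prefix):
                for handler in handlers:
                    handler(event)


def demo() -> List[str]:
    """Publish a few made-up events and return the ids of jobs that failed."""
    broker = LogBroker()
    failed_jobs: List[str] = []

    def on_job_end(event: Event) -> None:
        # An external monitoring component: flag jobs with a non-zero exit status.
        if event.get("status") != "0":
            failed_jobs.append(event["job_id"])

    broker.subscribe("app.job.end", on_job_end)

    # Application-level events (hypothetical names and fields).
    broker.publish({"event": "app.job.start", "job_id": "j1"})
    broker.publish({"event": "app.job.end", "job_id": "j1", "status": "0"})
    broker.publish({"event": "app.job.end", "job_id": "j2", "status": "1"})
    # Infrastructure-level event; the job-end subscriber does not receive it.
    broker.publish({"event": "infra.transfer.end", "job_id": "j2", "status": "1"})
    return failed_jobs


if __name__ == "__main__":
    print(demo())
```

Prefix matching is used here only because it keeps the routing rule easy to read. A real deployment would more likely sit on a message bus with its own topic semantics.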

Broader Impact

We build on an important class of applications, scientific workflows, which are used today in many scientific disciplines, including astronomy, biology, ecology, earthquake science, and gravitational-wave physics, and which run on today's large-scale infrastructure such as the OSG and the TeraGrid. The solution will be modular, distributed, and reusable across a broad class of applications and workflow systems.

Use of FutureGrid

Large-scale workflow experiments with induced failures.

Scale Of Use

From one to hundreds of VMs for hours at a time.
See also http://pegasus.isi.edu/projects/stampede

Publications


FG-180
Dan Gunter
LBNL
Active

Project Members

Ahmed El-Hassany
Gaurang Mehta
Karan Vahi
Taghrid Samak

Keywords

Timeline

2 years 32 weeks ago