HPC Scheduling

Project Information

Discipline
Biosciences, n.e.c. (617) 
Subdiscipline
11.01 Computer and Information Sciences, General 
Orientation
Research 
Abstract

Catalina is an open source external scheduler for use with resource managers such as LoadLeveler, PBS, and SLURM. Its capabilities include system reservations, user reservations, user-settable reservations, standing reservations, priority-ordered queueing, and scheduling policies. Grid Universal Remote (GUR) and Master Control Program (MCP) are open source metascheduling tools currently used in TeraGrid.

Catalina has been used in production for almost a decade for local scheduling on NSF supercomputers (Blue Horizon, DataStar, SDSC IA64). Extending Catalina with topology-aware scheduling, to accommodate network topologies such as a 3D torus with leaf nodes, would greatly benefit the scheduling of workloads on 3D-torus machines.

GUR is a tool for conducting multi-site, coordinated calculations. It has been used to create synchronized reservations across separate compute clusters to run single MPI jobs. Originally, GUR had the capability to stage data and remotely compile code; these capabilities are being revamped. GUR requires user-settable reservations on the local scheduler. Moab has this feature, as does the Catalina scheduler. If virtualization is available, we would use it to bring up virtual Catalina clusters for testing and development. See http://www.teragrid.org/userinfo/jobs/gur.php

MCP is a command-line utility that provides automatic resource selection for running a single parallel job on high performance computing resources. MCP optimizes job start times by submitting copies of a job and canceling the extra job requests after one copy begins to execute. We would use FutureGrid to explore using MCP for ensemble runs. See http://www.teragrid.org/userinfo/jobs/mcp.php

We would like to use FutureGrid as a test bed to further develop Catalina/GUR/MCP with access to a wide variety of experimental and production architectures.
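The synchronized reservations that GUR creates across clusters can be illustrated with a small sketch. It is not GUR's actual code: the function name, the cluster names, and the representation of free time as (start, end) intervals are all assumptions made for illustration. It finds the earliest start time at which every cluster has a free window long enough for the job.

```python
def earliest_common_start(free_windows, duration):
    """Return the earliest time t at which every cluster is free for
    [t, t + duration], or None if no such time exists.

    free_windows is a hypothetical structure: a dict mapping a cluster
    name to a list of (start, end) free intervals on that cluster.
    """
    # If any feasible start exists, the earliest one coincides with the
    # start of some free window, so only those times need to be checked.
    candidates = sorted(
        {start for windows in free_windows.values() for start, _ in windows}
    )
    for t in candidates:
        fits_everywhere = all(
            any(start <= t and t + duration <= end for start, end in windows)
            for windows in free_windows.values()
        )
        if fits_everywhere:
            return t
    return None


# Illustrative data only; times are in arbitrary units.
clusters = {
    "cluster_a": [(0, 100), (200, 500)],
    "cluster_b": [(50, 120), (250, 400)],
}
common_start = earliest_common_start(clusters, 100)
```

A real co-scheduler would then request a user-settable reservation at that time on each local scheduler and retry if any request fails.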
Software requirements include Python 2.2 or greater (2.6.4 is the current version), expect, a C compiler, and OpenSSH for communication. It would be nice to have the Remote Login kit from CTSS, but we can get by with OpenSSH instead.
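The MCP strategy described in the abstract (submit copies of a job, then cancel the extras once one copy starts) can be sketched as a small simulation. This is not MCP's implementation: the JobCopy class, the site names, and the simulated wait times are hypothetical, and no real scheduler commands are issued.

```python
from dataclasses import dataclass


@dataclass
class JobCopy:
    site: str            # hypothetical name of the resource receiving the copy
    wait_seconds: int    # simulated queue wait before this copy would start
    state: str = "queued"


def run_first_to_start(copies):
    """Mark the copy with the shortest wait as running and cancel the rest."""
    if not copies:
        raise ValueError("at least one job copy is required")
    winner = min(copies, key=lambda job: job.wait_seconds)
    for job in copies:
        job.state = "running" if job is winner else "cancelled"
    return winner


# Illustrative data only: one copy of the same job per site.
copies = [
    JobCopy("site_a", 3600),
    JobCopy("site_b", 600),
    JobCopy("site_c", 1800),
]
winner = run_first_to_start(copies)
```

In practice the wait times are not known in advance; a tool of this kind would poll each queue and issue the cancellations when the first copy is observed to be running.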

Intellectual Merit

Catalina, GUR, and MCP development will lead to novel strategies of job scheduling on local clusters and metascheduling across geographically distributed clusters. Currently, GUR and MCP are the TeraGrid CTSS components deployed for co-scheduling and metascheduling. Catalina is the scheduler for the Trestles machine.

Broader Impacts

Catalina scheduling would benefit all jobs with improved turnaround, better utilization, and increased performance (from correct topology scheduling for efficient communication). GUR and MCP have been integrated with Globus infrastructure and are therefore suitable for the wide range of compute resources accessible through that middleware. Any scientific application that requires co-scheduling or metascheduling can benefit from GUR and MCP capabilities. Improvements in GUR and MCP would allow more efficient use of grids.

Project Contact

Project Lead
Kenneth Yoshimoto (kenneth) 
Project Manager
Kenneth Yoshimoto (kenneth) 
Project Members
Subhashini Sivagnanam, Ismael Farfán  

Resource Requirements

Hardware Systems
  • alamo (Dell optiplex at TACC)
  • foxtrot (IBM iDataPlex at UF)
  • hotel (IBM iDataPlex at U Chicago)
  • india (IBM iDataPlex at IU)
  • sierra (IBM iDataPlex at SDSC)
  • xray (Cray XM5 at IU)
  • Network Impairment Device
 
Use of FutureGrid

FutureGrid resources will be used for development and testing of scheduling and metascheduling software. FutureGrid VMs will serve as scheduling targets.

Scale of Use

A moderate to high number of very small footprint VMs to simulate an HPC cluster. (I'm not actually going to run work on the VMs, so they don't need much CPU time or memory.)

Project Timeline

Submitted
04/26/2011 - 16:50