Optimizing Shared Resource Contention in HPC Clusters

Abstract

Contention for shared resources in HPC clusters arises when jobs execute concurrently on the same multicore node (contending for allocated CPU time, shared caches, the memory bus, memory controllers, etc.) and when jobs concurrently access the cluster interconnect as their processes exchange data. In a virtualized environment, the cluster network is also used by the cluster scheduler to migrate job virtual machines across nodes. We argue that contention for shared cluster resources severely degrades workload performance and stability and hence must be addressed. We also found that state-of-the-art HPC cluster schedulers are not contention-aware. The goal of this work is the design, implementation and evaluation of a scheduling framework that mitigates shared resource contention in a virtualized HPC cluster environment.

Intellectual Merit

The proposed research demonstrates how shared resource contention in HPC clusters can be addressed via contention-aware scheduling of HPC jobs. The proposed framework comprises a novel scheduling algorithm and a set of open-source software that includes original code and patches to widely used tools in the field. The solution (a) enables online monitoring of the cluster workload and (b) provides a way to make and enforce contention-aware scheduling decisions in practice.

Broader Impact

This research suggests a way to upgrade the HPC infrastructure used by U.S. academic institutions, industry and government. The goal of the upgrade is better performance for general cluster workloads.

Use of FutureGrid

I would like to perform experiments on FutureGrid hardware to test the efficiency of the proposed solution. To do that, I plan the following:
1) Create an image that contains a Linux distribution, a set of widely used software tools and the code of my framework, as follows:
(a) The Linux distribution is preferably Gentoo (but can be different). In it, I need to install a number of standard Linux packages, enable several kernel options and patch the stock Linux perf tool with my framework patches.
(b) The cluster software I use is Torque as the resource manager and Maui as the cluster scheduler. I need to apply my framework patches to both. For the scheduling algorithm, I use Matlab and/or Choco.
(c) I also need to install a user-level monitoring daemon (original code); a sketch of what such a daemon does is given after this list.
2) Book 16-64 nodes from any FutureGrid resource that supports HPC (e.g., india or alamo) in exclusive mode for several hours.
3) Netboot them bare-metal with the created image (via xCAT), thus turning the booked nodes into a small HPC cluster.
4) Run MPI workloads (mostly SPEC MPI2007) on this cluster with and without contention awareness (see the job submission sketch after this list).
5) Gather the workload execution times and the amount of resources consumed. The ability to measure power consumption via management controllers such as HP iLO3 or Dell iDRAC is highly desirable (see the power measurement sketch after this list).
6) Analyze the results and make modifications as necessary. I plan to run only the multi-node experiments on FutureGrid resources; small modifications and testing can be done on my lab machines. It is the scale (16-64 nodes) that I am looking for in FutureGrid.
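For item 1(c), below is a minimal sketch of the kind of sampling loop such a monitoring daemon performs, assuming the counters are read with the stock perf tool. The event list, sampling window and log location here are placeholders for illustration, not the framework's actual code.

    #!/usr/bin/env python3
    # Illustrative sketch of a user-level monitoring daemon: it periodically
    # samples node-wide hardware counters with the stock perf tool and appends
    # them to a log that a contention-aware scheduler could consume.
    # The event list and log location are assumptions, not the framework's code.
    import subprocess
    import time

    EVENTS = "cycles,instructions,cache-references,cache-misses"
    SAMPLE_SECONDS = 5                       # length of each sampling window
    LOG_PATH = "/tmp/contention-monitor.log" # placeholder log location

    def sample_counters():
        """Run 'perf stat' system-wide for one window and return its CSV output."""
        # perf stat prints its statistics to stderr; -x, requests CSV, -a is system-wide.
        result = subprocess.run(
            ["perf", "stat", "-a", "-x,", "-e", EVENTS,
             "--", "sleep", str(SAMPLE_SECONDS)],
            capture_output=True, text=True)
        return result.stderr

    def main():
        # The perf invocation itself lasts SAMPLE_SECONDS, so the loop is self-paced.
        while True:
            csv_lines = sample_counters()
            with open(LOG_PATH, "a") as log:
                log.write("%f\n%s\n" % (time.time(), csv_lines))

    if __name__ == "__main__":
        main()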
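For step 4, the benchmarks would be submitted through the (patched) Torque/Maui stack in the usual way. The sketch below illustrates such a submission by piping a PBS job script to qsub from Python; the resource request, walltime and benchmark binary are placeholders, not the actual experiment scripts.

    #!/usr/bin/env python3
    # Illustrative sketch: submit an MPI benchmark run through Torque by piping
    # a PBS job script to qsub. Resource requests, paths, and the benchmark
    # binary name are placeholders.
    import subprocess

    JOB_SCRIPT = """#!/bin/bash
    #PBS -N specmpi-run
    #PBS -l nodes=16:ppn=8
    #PBS -l walltime=02:00:00
    cd $PBS_O_WORKDIR
    mpirun -np 128 -machinefile $PBS_NODEFILE ./specmpi_benchmark
    """

    def submit():
        """Pipe the job script to qsub and return the assigned job id."""
        result = subprocess.run(["qsub"], input=JOB_SCRIPT,
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()

    if __name__ == "__main__":
        print("submitted job:", submit())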
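For step 5, one possible way to collect the power readings is to poll each node's management controller (iLO3/iDRAC) over IPMI, assuming the controllers answer DCMI power-reading requests from ipmitool. The hostnames and credentials below are placeholders; this is only an illustration of the measurement loop, not part of the framework.

    #!/usr/bin/env python3
    # Illustrative sketch: poll per-node power consumption during a workload run
    # by querying each node's management controller (iLO3/iDRAC) over IPMI.
    # Assumes the BMCs answer 'ipmitool dcmi power reading' over the lanplus
    # interface; hostnames and credentials below are placeholders.
    import subprocess
    import time

    BMC_HOSTS = ["node01-bmc", "node02-bmc"]   # placeholder BMC addresses
    BMC_USER = "admin"                         # placeholder credentials
    BMC_PASS = "password"
    POLL_SECONDS = 10

    def read_power(bmc_host):
        """Return raw 'dcmi power reading' output for one BMC (or an error note)."""
        cmd = ["ipmitool", "-I", "lanplus", "-H", bmc_host,
               "-U", BMC_USER, "-P", BMC_PASS, "dcmi", "power", "reading"]
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.stdout if result.returncode == 0 else "error: " + result.stderr

    def main():
        while True:
            stamp = time.time()
            for host in BMC_HOSTS:
                print(stamp, host, read_power(host))
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        main()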

I do require bare-metal access to the nodes as opposed to VM access with Grid Appliances. The reasons are:
(a) I have made various modifications to the source code of Linux, Maui and Torque to implement my solution, so I would like to recreate the experimental setup from my lab machines as fully as possible.
(b) It is usually very hard to access the bare-metal hardware performance counters from guest OSes (running on Xen or KVM), and my solution relies on these counters (see the sanity-check sketch after this list).
(c) I would like to use a container-based virtualization solution (OpenVZ) as opposed to Xen or KVM.
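Regarding reason (b), a quick sanity check of whether the hardware performance counters are usable in a given environment (bare metal, an OpenVZ container, or a Xen/KVM guest) is to run perf on a trivial command and see whether the hardware events report real counts. This is an illustrative check, not part of the framework.

    #!/usr/bin/env python3
    # Sanity check: are hardware performance counters usable in this environment?
    # On bare metal (and typically inside OpenVZ containers, which share the host
    # kernel) the hardware events return real counts; in many Xen/KVM guests they
    # show up as '<not counted>' or '<not supported>'.
    import subprocess

    def counters_available():
        # perf stat writes its statistics to stderr; -x, gives CSV output.
        result = subprocess.run(
            ["perf", "stat", "-x,", "-e", "cycles,instructions",
             "--", "sleep", "1"],
            capture_output=True, text=True)
        out = result.stderr
        return "<not supported>" not in out and "<not counted>" not in out

    if __name__ == "__main__":
        print("hardware counters usable:", counters_available())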

Scale Of Use

16-64 nodes from any FutureGrid resource that supports HPC (e.g., india or alamo), booked in exclusive mode for several hours.

Results

Accepted publications:

Tyler Dwyer, Alexandra Fedorova, Sergey Blagodurov, Mark Roth, Fabien Gaud and Jian Pei,
A Practical Method for Estimating Performance Degradation on Multicore Processors and its
Application to HPC Workloads, in Supercomputing Conference (SC), 2012. Acceptance rate 21%.
MAS rank: 51/2872 (top 2%)
http://www.sfu.ca/~sba70/files/sc12.pdf

Presented posters:

Sergey Blagodurov, Alexandra Fedorova, Fabien Hermenier, Clavis-HPC: a Multi-Objective Virtualized Scheduling Framework for HPC Clusters, in OSDI 2012.

Public software releases:

Clavis-HPC: a multi-objective virtualized scheduling framework for HPC clusters.
http://hpc-sched.cs.sfu.ca/

The source code is available for download from the GitHub repository:
https://github.com/blagodurov/clavis-hpc

Documentation:

Below is a link to our project report for the FutureGrid Project Challenge. A shorter version of it will appear in the HPCS 2012 proceedings as a work-in-progress paper:
http://www.sfu.ca/~sba70/files/report188.pdf

A very brief outline of the problem, the framework and some preliminary results:
http://www.sfu.ca/~sba70/files/ClusterScheduling.pdf

Project ID: FG-188
Project Lead: Sergey Blagodurov
Institution: Simon Fraser University
Status: Active

FutureGrid Experts

Gregor von Laszewski
