GridProphet, A workflow execution time prediction system for the Grid

Abstract

Workflow applications have provided a dynamic and heterogeneous paradigm for execution of computationally large experiments on the Grid and have greatly increased the pace of scientific work. Through their distributed task based execution mechanism, they have eliminated the need for resource homogeneity. A Grid workflow application represents a collection of computational tasks (activities) interconnected in a directed graph through control and data flow dependencies that are suitable for execution on the Grid. The complexity of the workflows has increased over the years with the increasing complexity of scientific applications. A common measure for the performance of scientific workflow applications is the total execution time needed to finish the entire workflow. The objective of this project is to develop a grid performance prediction system, which can estimate the execution time of individual workflow tasks, single-entry-single-exit sub-workflows (e.g. loops), and entire workflows for scientific applications such that the prediction technology can be used to rank different workflow transformations or workflow versions with respect to their execution time behavior. The proposed system can be used for optimization of workflow applications, thus enabling scientists to better utilize computing resources and reach their scientific results in shorter time.

Intellectual Merit

The intellectual merits of this research lie in the following contributions to the fields of scientific workflows and Grid computing:
The development of a prediction model based on advanced statistical techniques and machine learning methods to support:
1. The modeling of the execution behaviour of highly distributed Grid workflows.
2. The development of an execution trace collection and performance prediction system for Grid workflow execution environments.
3. On-the-fly querying and use of historical trace data for accurate prediction of Grid workflow execution time with a machine-learning-based prediction system.

Broader Impact

The success of this project will provide a general-purpose tool for predicting the execution time of Grid workflows and will help Grid users utilize Grid resources efficiently. The tool will be customizable for use with other Grid workflow systems as well.

Use of FutureGrid

The primary use of the FutureGrid infrastructure is to execute a large number of Grid workflow applications and collect their execution traces, which are then used to train the machine learning system.

If the required software deployments are available, the existing Globus-based grid infrastructure will be used after deploying the necessary software and services onto the Globus Grid server. Otherwise, we will deploy customized OS images on multiple VMs to create a dedicated grid environment for our experiments.

Scale Of Use

The project will be completed in two phases. The first phase collects execution traces by executing a large number of Grid workflows, which calls for extensive resource reservation and utilization. The planned resource usage during this phase will start with a grid setup of around 128 cores and will gradually increase up to 2048 cores, or more if available.

In the second phase, the collected trace data will be used to develop the machine-learning-based prediction system. This phase will use a small set of FutureGrid resources, but for longer periods, as required by the training phase of the machine learning algorithm.

Publications


Results

Project brief:

This project was initiated as part of a larger project titled “A provenance and performance prediction system for Grid systems”. The objective of the main project is to develop a grid performance prediction system, which can estimate the execution time of individual workflow tasks, single-entry-single-exit sub-workflows (e.g. loops), and entire workflows for scientific applications such that the prediction technology can be used to rank different workflow transformations or workflow versions with respect to their execution time behavior. The proposed system can be used for optimization of workflow applications, thus enabling scientists to better utilize computing resources and reach their scientific results in shorter time.
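As an illustration of how per-task predictions could be combined into a whole-workflow estimate and used to rank workflow versions, the following is a minimal sketch. It is not the project's method: it assumes that each task already has a predicted execution time, that tasks run as soon as their predecessors finish (unlimited resources, no data transfer delays), and that the workflow estimate is the length of the longest dependency path. The task names, times, and the function `predicted_makespan` are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def predicted_makespan(task_times, dependencies):
    """Length of the longest dependency path through a workflow DAG.

    task_times:   {task: predicted execution time in seconds}
    dependencies: {task: set of tasks that must finish before it starts}
    """
    # Make sure every task is a node, even if it has no dependencies.
    graph = {task: dependencies.get(task, set()) for task in task_times}
    finish = {}
    # static_order() yields each task only after all of its predecessors.
    for task in TopologicalSorter(graph).static_order():
        ready = max((finish[p] for p in graph[task]), default=0.0)
        finish[task] = ready + task_times[task]
    return max(finish.values(), default=0.0)


# Hypothetical per-task predictions (seconds); the values are made up.
times = {"prepare": 10.0, "sim_a": 120.0, "sim_b": 80.0, "merge": 5.0}

# Two hypothetical versions of the same workflow.
versions = {
    "serial": {"sim_a": {"prepare"}, "sim_b": {"sim_a"}, "merge": {"sim_b"}},
    "parallel": {"sim_a": {"prepare"}, "sim_b": {"prepare"},
                 "merge": {"sim_a", "sim_b"}},
}

# Rank versions from shortest to longest predicted execution time.
ranking = sorted(versions, key=lambda name: predicted_makespan(times, versions[name]))
```

A real Grid scheduler would also have to account for queue waiting times, limited cores, and data transfers, which is part of what makes the prediction problem hard.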

The FutureGrid resources were used to collect trace data for training the machine learning system. The data collected on FutureGrid is used together with the traces collected on the Austrian Grid and Grid5000.

Experimental setup:

The Grid-Appliance provided by the FutureGrid portal is used in varying configurations to set up the virtual Grid required for the project.

Based on the project requirements, traces were to be collected for the following applications:

  • MeteoAG (Meteorology Domain)
  • Wien2K (Material Science Domain)
  • InvMod (Alpine River Modeling)

The goal was to record trace data for at least 5000 workflow runs in total, with varying background load and dynamic distribution of tasks across different sites in the virtual Grid.

For this purpose, the Grid-Appliance was customized in several respects. Additional software packages required by the workflow execution system (ASKALON) and by the workflows themselves were added. A database server was installed to collect the trace data during the experiments.

Trace Data:

A set of key features that noticeably influence the execution of these workflows on the Grid infrastructure was identified. The selected features cover most of the factors associated with Grid workflow execution, such as the input to the application workflow, the size of the input data, the size of the application executables, and network-related features like available bandwidth, background bandwidth load, and the time required to transfer application data between compute nodes. Both dynamic and static environment parameters are also collected, including the machine architecture, compute power, cache memory, and disk space. In total, 65 parameters are recorded to provide rich training data for the prediction model and to support accurate predictions.

Optimization of the Feature Vectors:

For use with the machine learning system, the main feature vector is reduced to a small number of parameters so that training can be carried out quickly and accurately. A large number of input parameters leads to very long training times and introduces considerable noise into the data.

We recorded a large number of run-time parameters so as not to miss any important feature. For training the model, however, the feature space had to be reduced to avoid the problems of noise and long training durations. Principal component analysis and Principal Feature selection algorithms are used to reduce the feature space and to generate an optimized feature vector containing the parameters that have the greatest influence on the execution of tasks in distributed environments.
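The report does not describe how the two algorithms were configured, so the following is only a minimal sketch of the principal component analysis idea. It assumes the leading principal component is approximated by power iteration on the sample covariance matrix and that features are then ranked by the magnitude of their loading on that component. The feature names and values are made up for illustration, and in practice the features would normally be standardized first so that differences in units do not dominate the result.

```python
import math


def first_principal_component(rows, iterations=200):
    """Approximate the leading principal component by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centered data (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Start from a uniform unit vector; a start that happens to be
    # orthogonal to the leading component would need a different vector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all features are constant: no direction to find
            break
        v = [x / norm for x in w]
    return v


def rank_features(names, loadings):
    """Order feature names by the absolute size of their loading."""
    pairs = sorted(zip(names, loadings), key=lambda p: abs(p[1]), reverse=True)
    return [name for name, _ in pairs]


# Hypothetical trace rows: one row per run, one column per feature.
names = ["input_size_mb", "transfer_time_s", "cache_mb"]
rows = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
    [5.0, 10.1, 0.4],
]

loadings = first_principal_component(rows)
ranking = rank_features(names, loadings)
```

In this toy data the third feature barely varies, so it receives a loading close to zero and would be the first candidate to drop from the feature vector.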

Utilization of Trace Data:

A neural-network-based machine learning model known as the Multilayer Perceptron (MLP) is used. An MLP is a feedforward neural network that is used for pattern matching in non-linear problem spaces. It maps the inputs presented at the input layer of the network to outputs at the output layer. In contrast to a single-layer perceptron, an MLP has one or more hidden layers. An activation function determines the output of each node, which acts as a neuron of the network.
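The mapping from input layer to output layer can be sketched as a forward pass. This is an illustrative sketch only: the report does not state the network architecture or the activation function, so the use of `tanh` in the hidden layers, a linear output layer, and all weights below are assumptions.

```python
import math


def mlp_forward(inputs, layers):
    """Propagate an input vector through a multilayer perceptron.

    layers: list of (weights, biases) pairs, one per layer, where
            weights[i][j] connects input j to neuron i of that layer.
    Hidden layers use tanh (an assumption); the output layer is linear.
    """
    activation = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        # Weighted sum of the previous layer's outputs plus a bias per neuron.
        sums = [sum(w * a for w, a in zip(row, activation)) + b
                for row, b in zip(weights, biases)]
        is_output_layer = index == len(layers) - 1
        activation = sums if is_output_layer else [math.tanh(s) for s in sums]
    return activation


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
hidden = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
output = ([[1.0, 1.0]], [0.5])
prediction = mlp_forward([0.0, 0.0], [hidden, output])
```

Training would adjust the weights and biases, typically by backpropagation, so that the output approximates the measured execution times in the trace data.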

For our experiments, the trace data collected from the FutureGrid infrastructure was used along with data collected from other Grid infrastructures such as the Austrian Grid and Grid5000.

The training results presented here are therefore not specific to FutureGrid.

Performance Prediction Results:

Based on our experiments with the machine learning system described above, the following activity-level prediction accuracy was achieved.

Workflow: Wien2k
Total successful runs: 700
One activity maximum prediction accuracy: 65.70%
Two activities maximum prediction accuracy: 52.70%

Activity        Cluster   Prediction Accuracy
LAPW1           1         65.70%
LAPW1           2         63.00%
LAPW2           1         64.00%
LAPW2           2         60.00%
LAPW1,LAPW2     1         58.00%
LAPW1,LAPW2     2         56.00%
LAPW1,LAPW2     1         54.00%
LAPW1,LAPW2     2         52.00%

Single workflow prediction accuracy

The results presented above are promising for an initial investigation, and we intend to continue this research to improve them further. Experimental workflow runs are in progress on the FutureGrid resources to collect more trace data for improved prediction accuracy.

FG-175
Thomas Fahringer
University of Innsbruck
Active

Project Members

Juan J. Durillo
Muhammad Junaid Malik
Simon Ostermann

FutureGrid Experts

Gregor von Laszewski