Pegasus on FutureGrid Tutorial

Table of Contents

Start Your Virtual Machine

Please refer to "Starting The Virtual Machine" to start a virtual machine on FutureGrid.

The referenced section also covers the first step: logging into your virtual machine instance as user tutorial using ssh. Please do so now.
 

Mapping and Executing Workflows using Pegasus

In this section you will be introduced to planning and executing a workflow through Pegasus WMS locally. Once logged into the VM you started, you can use cd to return to your home directory, and quickly check your current working directory with the pwd command.

tutorial@vm-126.uc.futuregrid.org:~ $ cd
tutorial@vm-126.uc.futuregrid.org:~ $ pwd
/home/tutorial

Note: The full prompt is shown once here for illustration purposes. Your prompt will be different. We will abbreviate the prompt to just the dollar ($) sign. You do not type the dollar sign at the beginning, or any other part of the prompt.

All the exercises in this chapter will be run from the $HOME/hello_world/ directory. All the files that are required reside in this directory.

$ cd $HOME/hello_world

The files for this exercise are laid out as follows:

$ ls 
bin  dax-generator.py  input  pegasus.conf  rc.dat  sayhiinquire.dax  sites.xml  tc.dat

 

Creating a DAX

A DAX is the workflow description used by Pegasus. There is a small piece of Python code that uses the DAX API to generate the DAX. Open the file dax-generator.py in a file editor:

$ nano dax-generator.py

In your editor, either search for, or scroll down to, the comment reading

# Add the inquire job (depends on the sayhi job)

which marks the code that constructs this part of the DAX. Towards the end of the function there is some code that has been commented out. Your first exercise is to remove the comment marker (#) in front of each of these lines. You should end up with:

# Add the inquire job (depends on the sayhi job)
inquire = Job(name="inquire")
inquire.addArguments('f.b')
c = File("f.c")
inquire.uses(b, link=Link.INPUT)
inquire.uses(c, link=Link.OUTPUT)
dax.addJob(inquire)
# Add control-flow dependencies
dax.addDependency(Dependency(parent=sayhi, child=inquire)) 

The above snippet of code adds a job to the DAX. It illustrates how to specify (see the sketch after this list for how it fits into the full generator):

  1. the arguments for the job

  2. the logical files used by the job

  3. the dependencies to other jobs

  4. the addition of the job to the DAX
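
To put this in context, here is a minimal sketch of how a generator in the style of dax-generator.py fits together, using the same DAX API. It is an illustration, not the exact contents of the file on the VM; in particular, the sayhi job's argument is assumed:

#!/usr/bin/env python
from Pegasus.DAX3 import ADAG, Job, File, Link, Dependency

# Create the abstract workflow (ADAG) object
dax = ADAG("sayhi_inquire")

# Add the sayhi job, which reads f.a and produces f.b
a = File("f.a")
b = File("f.b")
sayhi = Job(name="sayhi")
sayhi.addArguments("f.a")        # assumed argument
sayhi.uses(a, link=Link.INPUT)
sayhi.uses(b, link=Link.OUTPUT)
dax.addJob(sayhi)

# ... the inquire job and its dependency (the code you just
# uncommented) go here ...

# Serialize the workflow description to the DAX file
with open("sayhiinquire.dax", "w") as f:
    dax.writeXML(f)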

After uncommenting the code, run the dax-generator.py program.

$ ./dax-generator.py

Let us view the generated sayhiinquire.dax.

$ cat sayhiinquire.dax

We won't repeat the full output here, but inside the DAX you will notice several sections (a representative fragment is shown after this list):

  1. list of input file locations

  2. list of executable locations

  3. list of aggregate executables

  4. definitions of every job in the workflow. There are two jobs in total in the current workflow.

  5. list of control-flow dependencies. This section specifies a partial order in which the jobs are to be executed.
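
For orientation, a representative fragment of the job and dependency sections looks roughly like this (the id values in your generated file may differ):

<job id="ID0000002" name="inquire">
   <argument>f.b</argument>
   <uses name="f.b" link="input"/>
   <uses name="f.c" link="output"/>
</job>
<child ref="ID0000002">
   <parent ref="ID0000001"/>
</child>

The <child>/<parent> elements encode the control-flow dependency: the inquire job runs only after the sayhi job has finished.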
     

The Replica Catalog

In this exercise you will work with entries in the Replica Catalog. The replica catalog that we will use today is a simple file-based catalog. We also support, and recommend for production runs, the following:

  • Globus RLS

  • JDBC implementation

A replica catalog maintains the LFN (Logical File Name) to PFN (Physical File Name) mapping for the input files of your workflow. Pegasus queries it to determine the locations of the raw input data files required by the workflow. Additionally, all the materialized data is registered into the replica catalog for data reuse later on.

A pre-populated replica catalog has been provided. Let us see what the file looks like.

$ cat rc.dat 
# file-based replica catalog: 2012-09-26T18:36:59.169-04:00
f.a file:///home/tutorial/hello_world/input/f.a pool="local"
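
Each non-comment line maps one LFN to one PFN, followed by attributes such as the pool (site) handle. For illustration only, an entry for a hypothetical second input file f.d would follow the same pattern:

f.d file:///home/tutorial/hello_world/input/f.d pool="local"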

 

The Site Catalog

The site catalog contains information about the layout of the grid and cloud sites where you want to run your workflows. For each site, the following information is maintained:

  • grid gateways

  • head node filesystem

  • worker node filesystem

  • scratch and shared file systems on the head nodes and worker nodes

  • replica catalog URL for the site

  • site-wide information, such as environment variables to be set when a job is run

In-depth details about the Pegasus site catalog can be found in the Pegasus documentation. This section provides a quick glance at the pre-installed site catalog.

A pre-populated site catalog has been provided. Let's examine the site catalog for the Pegasus VM. It refers to two sites, local and PegasusVM:

$ cat sites.xml
(output not included)
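
Although the full file is not reproduced here, an abbreviated, illustrative sketch of the XML3 site catalog format looks roughly like the following. The site handles match the tutorial, but the mount points shown are hypothetical:

<sitecatalog xmlns="http://pegasus.isi.edu/schema/sitecatalog" version="3.0">
   <site handle="local" arch="x86_64" os="LINUX">
      <head-fs>
         <scratch>
            <shared>
               <file-server protocol="file" url="file://" mount-point="/home/tutorial/scratch"/>
               <internal-mount-point mount-point="/home/tutorial/scratch"/>
            </shared>
         </scratch>
         <storage>
            <shared>
               <file-server protocol="file" url="file://" mount-point="/home/tutorial/output"/>
               <internal-mount-point mount-point="/home/tutorial/output"/>
            </shared>
         </storage>
      </head-fs>
      <profile namespace="env" key="PEGASUS_HOME">/usr</profile>
   </site>
   <!-- a second site entry with handle="PegasusVM" follows the same pattern -->
</sitecatalog>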

Please refer to the Pegasus manual for more details about the various site catalog implementations that Pegasus supports. However, we recommend sticking with the XML implementation, as it is the most versatile for describing remote resources. You may also want to check out the pegasus-config tool to write the information for the special local site.
 

The Transformation Catalog

The transformation catalog maintains information about where the application code resides on the grid. It also provides additional information about the transformations, such as what system they are compiled for, and what profiles or environment variables need to be set when a transformation is invoked.

In our case, the pre-populated transformation catalog contains the locations where the sayhi and inquire codes are installed in the Pegasus VM. Let us examine the transformation catalog.

For each transformation, the following information is captured:

  1. tr - A transformation identifier. (Normally a Namespace::Name:Version. The Namespace and Version are optional.)

  2. pfn - URL or file path for the location of the executable. The pfn is a file path if the transformation is of type INSTALLED and generally a URL (file:/// or http:// or gridftp://) if of type STAGEABLE

  3. site - The site identifier for the site where the transformation is available

  4. type - The type of transformation: whether it is installed ("INSTALLED") on the remote site or available to stage ("STAGEABLE") to the remote site.

  5. arch, os, osrelease, osversion - These attributes specify the transformation further. osrelease and osversion are optional.

    arch can have one of the following values: x86, x86_64, sparcv7, sparcv9, ppc, aix. The default value is x86.

    os can have one of the following values: linux, sunos, macosx. The default value, if none is specified, is linux.

  6. Profiles - One or more profiles can be attached to a transformation for all sites, or to a transformation on a particular site (an illustrative example follows the listing below).

$ cat tc.dat
tr sayhi {
   site PegasusVM { 
      pfn "/home/tutorial/hello_world/bin/sayhi"
      arch "x86_64"
      os "linux"
      type "INSTALLED"
   }
}


tr inquire {
   site PegasusVM {
      pfn "/home/tutorial/hello_world/bin/inquire"
      arch "x86_64"
      os "linux"
      type "INSTALLED"
   }
}
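
Our tc.dat attaches no profiles, but for illustration, a profile line could be added inside a site block of this Text format. The environment variable shown here is hypothetical:

tr sayhi {
   site PegasusVM {
      profile env "JAVA_HOME" "/usr"
      pfn "/home/tutorial/hello_world/bin/sayhi"
      arch "x86_64"
      os "linux"
      type "INSTALLED"
   }
}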
 

The Configuration File (Properties)

The Pegasus Workflow Planner is configured in the language of Java properties.

$ cat pegasus.conf   
# This tells Pegasus where to find the Site Catalog
pegasus.catalog.site=XML3
pegasus.catalog.site.file=sites.xml

# This tells Pegasus where to find the Replica Catalog
pegasus.catalog.replica=File
pegasus.catalog.replica.file=rc.dat

# This tells Pegasus where to find the Transformation Catalog
pegasus.catalog.transformation=Text
pegasus.catalog.transformation.file=tc.dat

# Use Condor file transfers to handle data transfers
pegasus.data.configuration = Condor

The Pegasus manual contains several sections dealing with properties; you will need to know about them to take full advantage of the features of Pegasus.
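
As an aside, most Pegasus command-line tools also accept individual property overrides using the Java-style -D syntax, given before any other options. A sketch of such an invocation, assuming you wanted to point at an alternate replica catalog file:

$ pegasus-plan -Dpegasus.catalog.replica.file=rc.dat --conf pegasus.conf \
               --dax sayhiinquire.dax --dir dags --sites PegasusVM --output local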
 

Planning and Running Workflows

In this exercise we are going to run pegasus-plan to generate an executable workflow from the abstract workflow sayhiinquire.dax that we created earlier. The mapping of our logical description of the work, using the catalogs we modified earlier, generates a set of Condor submit files and a Condor DAGMan workflow control file. These are submitted, and thus executed, using the pegasus-run command.

  • Run pegasus-plan to generate the Condor submit files from the DAX.

  • Run pegasus-run to submit the workflow locally.

Let us run pegasus-plan on the sayhi_inquire DAX:

$ pegasus-plan --conf pegasus.conf --dax sayhiinquire.dax --force \
               --dir dags --sites PegasusVM --output local --nocleanup

The above command plans the Hello World workflow. The Condor submit files are to be generated in a directory structure whose base is dags. We also request that no cleanup jobs be added, as we require the intermediate data to be saved. Here is the output of pegasus-plan:

I have concretized your abstract workflow. The workflow has been entered  
into the workflow database with a state of "planned". The next step is
to start or execute your workflow. The invocation required is

pegasus-run /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001 

2012.10.02 17:46:06.423 EDT:   Time taken to execute is 0.656 seconds 
  

Now invoke pegasus-run as mentioned in the output of pegasus-plan. Do not copy the command below verbatim - it is just for illustration purposes. You will have to copy and paste the directory shown in the output of your own pegasus-plan run.

$ pegasus-run /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001
  

-----------------------------------------------------------------------

File for submitting this DAG to Condor           : sayhi_inquire-0.dag.condor.sub 
Log of DAGMan debugging messages                 : sayhi_inquire-0.dag.dagman.out
Log of Condor library output                     : sayhi_inquire-0.dag.lib.out
Log of Condor library error messages             : sayhi_inquire-0.dag.lib.err
Log of the life of condor_dagman itself          : sayhi_inquire-0.dag.dagman.log 

Submitting job(s).
1 job(s) submitted to cluster 44.

-----------------------------------------------------------------------  

Your Workflow has been started and runs in base directory given below  

cd /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001  

*** To monitor the workflow you can run ***  

pegasus-status -l /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001  

*** To remove your workflow run *** 

pegasus-remove /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001


Monitoring, Debugging and Statistics

In this section, we list ways to track your workflow, how to debug a failed workflow, and how to generate statistics and plots for a workflow run.

Tracking the progress of the workflow and debugging it

We will change into the directory mentioned in the output of the pegasus-run command.

$ cd  /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001

Again, do not copy the above statement verbatim; use the directory stated in the output of pegasus-run. For Pegasus, this directory is the handle to that particular workflow instance.

In this directory you will see a lot of files. That should not scare you: unless things go wrong, you only need to look at a few of them to track the progress of the workflow.

Run the pegasus-status command as mentioned by pegasus-run above to check the status of your jobs. It comes with built-in watch functionality that by default refreshes the output every 30 seconds. However, we are keen, so we set it to refresh every 2 seconds by providing a number as the optional argument to the -w option. For large workflows, you should not set it that low.

$ pegasus-status -w 2 -l /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001

Tip

If your current working directory is the workflow instance directory, as ensured by the cd above, you can omit the directory handle argument to pegasus-status.

It will clear the screen periodically and provide an output similar to the following:

Press Ctrl+C to exit

STAT  IN_STATE  JOB
Run      00:59  sayhi_inquire-0
Idle     00:13   ┗━sayhi_ID0000001
Summary: 2 Condor jobs total (I:1 R:1)

UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME
5     0     0     2     0     3     0  30.0 Running *sayhi_inquire-0.dag
Summary: 1 DAG total (Running:1)
  

Tip

The watch mode does not end with ESC or (q)uit, but with Ctrl+C.

The above output shows that a couple of jobs are running under the main DAGMan process. Keep a lookout to track whether the workflow is still running. If you do not see any of your jobs in the output for some time (say 30 seconds), the workflow has most likely finished. We need to wait, as there might be a delay before Condor DAGMan releases the next job into the queue after a job has finished successfully. Do not be too impatient.

If the queue portion of the output of pegasus-status is empty, then your workflow has either

  1. successfully completed

  2. stopped midway due to a non-recoverable error.

We can now run pegasus-analyzer to analyze the workflow. Use the same directory as before, since the directory is the handle to your workflow. It will very likely differ from the example below.

$ pegasus-analyzer /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001
pegasus-analyzer: initializing...

************************************Summary*************************************

 Total jobs         :      9 (100.00%)
 # jobs succeeded   :      9 (100.00%)
 # jobs failed      :      0 (0.00%)
 # jobs unsubmitted :      0 (0.00%)
 

Successfully Completed

Let us now look at jobstate.log. We need to look at the last few lines of jobstate.log to determine the status of the workflow:

$ tail jobstate.log
1331684922 stage_out_local_local_2_0 POST_SCRIPT_SUCCESS 0 local - 7
1331684929 register_local_2_0 SUBMIT 28.0 local - 8
1331684934 register_local_2_0 EXECUTE 28.0 local - 8
1331684934 register_local_2_0 JOB_TERMINATED 28.0 local - 8
1331684934 register_local_2_0 JOB_SUCCESS 0 local - 8
1331684934 register_local_2_0 POST_SCRIPT_STARTED 28.0 local - 8
1331684939 register_local_2_0 POST_SCRIPT_TERMINATED 28.0 local - 8
1331684939 register_local_2_0 POST_SCRIPT_SUCCESS 0 local - 8
1331684939 INTERNAL *** DAGMAN_FINISHED 0 ***
1331684940 INTERNAL *** MONITORD_FINISHED 0 ***

Looking at the last two lines, we see that both DAGMan and pegasus-monitord finished successfully with a status of 0. This means the workflow ran successfully. Congratulations, you ran your workflow on the local site!

The workflow generates a final output file f.c, which resides at /home/tutorial/output/f.c. To view the file, you can do the following:

$ cat /home/tutorial/output/f.c
Hello Pete! How are you?
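
For the curious: the two executables are trivial. Judging from this output, they plausibly behave like the following shell sketches; the actual programs in bin/ may be implemented differently:

#!/bin/bash
# bin/sayhi (sketch): writes a greeting into its output file f.b
echo "Hello Pete!" > f.b

#!/bin/bash
# bin/inquire (sketch): reads f.b and appends a question, producing f.c
echo "$(cat f.b) How are you?" > f.c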
 

 

Debugging a failed workflow using pegasus-analyzer

In this section, we will run the same workflow, but remove the input file so that the workflow fails during execution. This highlights how to use pegasus-analyzer to debug a failed workflow.

First of all, let's rename the input file f.a:

$ cd ~/hello_world
$ mv input/f.a input/f.a.old 

We will now repeat the exercise and submit the workflow again.

Plan and submit the workflow. Passing --submit to pegasus-plan submits the workflow automatically upon successful planning. This eliminates the separate invocation of pegasus-run.

$ pegasus-plan --conf pegasus.conf --dax sayhiinquire.dax --force \
               --dir dags --sites PegasusVM --output local \
               --nocleanup --submit

Use pegasus-status to track the workflow and wait for it to fail:

$ pegasus-status -w 2 /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001

(no matching jobs found in Condor Q)
UNREADY  READY  PRE  QUEUED  POST  SUCCESS  FAILURE  %DONE
      9      0    0       0     0        2        2   15.4
Summary: 1 DAG total (Failure:1)

The -l (long) option to pegasus-status gives more detail:

$ pegasus-status -l /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001
(no matching jobs found in Condor Q)
UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME
    9     0     0     0     0     2     2  15.4 Failure *sayhi_inquire-0.dag
Summary: 1 DAG total (Failure:1)  

We will now run pegasus-analyzer on the failed workflow's submit directory, the Pegasus handle to your workflow instance, to see which job failed.

$ pegasus-analyzer /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001
...

The above tells us that the stage-in job for the workflow failed, and points us to the stdout of the job. By default, all jobs in Pegasus are launched via pegasus-kickstart, which captures runtime provenance of the job and helps in debugging. Hence, the stdout of the job is the kickstart record, which is in XML.
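
To inspect such a kickstart record by hand, open the .out file that pegasus-analyzer points you to. The exact file name depends on the job; a sketch using a shell glob (the job name pattern is assumed here):

$ less *stage_in*.out*

Inside the XML, the <status> element records the exit code, and the <statcall> elements record information about the job's files and standard streams.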

Removing a running workflow

Sometimes you may want to halt the execution of a workflow, or just permanently remove it. You can do so by running the pegasus-remove command. The handle, and in fact the proper invocation, is mentioned in the output of pegasus-run:

$ pegasus-remove /home/tutorial/hello_world/dags/tutorial/pegasus/sayhi_inquire/run0001
 Job 2788.0 marked for removal


Generating statistics and plots of a workflow run

In this section, we will generate statistics and plots of the workflow we ran, using pegasus-statistics and pegasus-plots.

Generating Statistics Using pegasus-statistics

The pegasus-statistics tool generates workflow execution statistics. To generate them, run the command shown below, using the appropriate workflow instance handle (path):

$ cd $HOME/hello_world
$ pegasus-statistics -s all dags/tutorial/pegasus/sayhi_inquire/run0001/

Note that the sample output below, and the tables in the following subsections, were captured from a run of a larger diamond-shaped example workflow, so your numbers, job names, and paths will differ.

**********************************************SUMMARY***********************************************
#legends

Workflow summary - Summary of the workflow execution. It shows total
                tasks/jobs/sub workflows run, how many succeeded/failed etc.
                In case of hierarchical workflow the calculation shows the
                statistics across all the sub workflows.It shows the following
                statistics about tasks, jobs and sub workflows.
                * Succeeded - total count of succeeded tasks/jobs/sub workflows.
                * Failed - total count of failed tasks/jobs/sub workflows.
                * Incomplete - total count of tasks/jobs/sub workflows that are
                not in succeeded or failed state. This includes all the jobs
                that are not submitted, submitted but not completed etc. This
                is calculated as  difference between 'total' count and sum of
                'succeeded' and 'failed' count.
                * Total - total count of tasks/jobs/sub workflows.
                * Retries - total retry count of tasks/jobs/sub workflows.
                * Total Run - total count of tasks/jobs/sub workflows executed
                during workflow run. This is the cumulative of retries,
                succeeded and failed count.

Workflow wall time - The walltime from the start of the workflow execution
                to the end as reported by the DAGMAN.In case of rescue dag the value
                is the cumulative of all retries.

Workflow cumulative job wall time - The sum of the walltime of all jobs as
                reported by kickstart. In case of job retries the value is the
                cumulative of all retries. For workflows having sub workflow jobs
                (i.e SUBDAG and SUBDAX jobs), the walltime value includes jobs from
                the sub workflows as well.

Cumulative job walltime as seen from submit side - The sum of the walltime of
                all jobs as reported by DAGMan. This is similar to the regular
                cumulative job walltime, but includes job management overhead and
                delays. In case of job retries the value is the cumulative of all
                retries. For workflows having sub workflow jobs (i.e SUBDAG and
                SUBDAX jobs), the walltime value includes jobs from the sub workflows
                as well.

-------------------------------------------------------------------------------------------------------------------------------------------------
Type                Succeeded           Failed              Incomplete          Total                    Retries             Total Run (Retries Included)
Tasks               4                   0                   0                   4                   ||   0                   4
Jobs                8                   0                   0                   8                   ||   0                   8
Sub Workflows       0                   0                   0                   0                   ||   0                   0
-------------------------------------------------------------------------------------------------------------------------------------------------

Workflow wall time                               : 5 mins, 6 secs,      (total 306 seconds)

Workflow cumulative job wall time                : 4 mins, 0 secs,      (total 240 seconds)

Cumulative job walltime as seen from submit side : 4 mins, 5 secs,      (total 245 seconds)

Summary                           : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/statistics/summary.txt

Workflow execution statistics     : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/statistics/workflow.txt

Job instance statistics           : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/statistics/jobs.txt

Transformation statistics         : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/statistics/breakdown.txt

Time statistics                   : /home/tutorial/pegasus-wms/dags/tutorial/pegasus/blackdiamond/run0001/statistics/time.txt

****************************************************************************************************

The final lines in the above output show the names and locations of the various statistics files generated by pegasus-statistics. You can use the more or less tool to view these text files. Alternatively, you may point your web browser at the tutorial VM by replacing the /home prefix in the paths with http://vm-127.uc.futuregrid.org, substituting the IP address or name of your own tutorial VM for the latter.

 

Summary Table (summary.txt)

The summary table in file summary.txt contains high-level information about the workflow run, such as total execution time and the number of jobs that failed. It effectively repeats the information shown in the output of pegasus-statistics, minus the file locations.

Table 4.1. Table: Summary Statistics

Workflow wall time                                  5 mins, 6 secs
Workflow cumulative job wall time                   4 mins, 0 secs
Cumulative job walltime as seen from submit side    4 mins, 5 secs
Total jobs                                          8
# jobs succeeded                                    8
# jobs failed                                       0
# jobs incomplete                                   0

 

Job Instance Statistics Table (jobs.txt)

The job statistics table in file jobs.txt contains a break-down of job-level statistics. Jobs are constituents of the planned-out workflow (DAG file).

  • Job - the name of the job

  • Try - the number representing the job instance run count.

  • Site - the site where the job ran

  • Kickstart - the actual duration of the job instance in seconds on the remote compute node

  • Post - the postscript time as reported by DAGMan

  • CondorQTime - the time between submission by DAGMan and the remote Grid submission. It is an estimate of the time spent in the condor queue on the submit node

  • Resource - the time between the remote Grid submission and start of remote execution. It is an estimate of the time job spent in the remote queue

  • Runtime - the time spent on the resource as seen by Condor DAGMan. It is always >= the Kickstart time

  • Seqexec - the time taken for the completion of a clustered job

  • Seqexec-Delay - the difference between the completion time of a clustered job and the sum of the kickstart times of all its individual tasks

Table 4.2. Table: Job Statistics

Job                              Try  Site   Kickstart  Post  CondorQTime  Resource  Runtime  Seqexec  Seqexec-Delay  Exitcode  Hostname
analyze_j4                         1  local     60.002   5.0          5.0         -     60.0        -              -         0  vm-127.uc.futuregrid.org
create_dir_blackdiamond_0_local    1  local      0.018   5.0          5.0         -      0.0        -              -         0  vm-127.uc.futuregrid.org
findrange_j2                       1  local     60.002   5.0          0.0         -     65.0        -              -         0  vm-127.uc.futuregrid.org
findrange_j3                       1  local     60.002   5.0         10.0         -     60.0        -              -         0  vm-127.uc.futuregrid.org
preprocess_j1                      1  local     60.002   5.0          5.0         -     60.0        -              -         0  vm-127.uc.futuregrid.org
register_local_2_0                 1  local      0.263   5.0          5.0         -      0.0        -              -         0  vm-127.uc.futuregrid.org
stage_in_local_local_0             1  local      0.127   5.0          5.0         -      0.0        -              -         0  vm-127.uc.futuregrid.org
stage_out_local_local_2_0          1  local      0.082   5.0          5.0         -      0.0        -              -         0  vm-127.uc.futuregrid.org

 

Transformation Statistics Table (breakdown.txt)

The logical transformation statistics table in file breakdown.txt contains information about each type of transformation in the workflow. A transformation is the abstract-level task that you see in the DAX file. Pegasus adds some auxiliary transformations as part of the planning process, which also show up in the logical transformation statistics.

Table 4.3. Table: Transformation Statistics

Transformation             Count  Succeeded  Failed     Min     Max     Mean    Total
dagman::post                   8          8       0     5.0     5.0      5.0     40.0
pegasus::analyze:4.0           1          1       0  60.002  60.002   60.002   60.002
pegasus::dirmanager            1          1       0   0.018   0.018    0.018    0.018
pegasus::findrange:4.0         2          2       0  60.002  60.002   60.002  120.004
pegasus::pegasus-transfer      2          2       0   0.082   0.127    0.105    0.209
pegasus::preprocess:4.0        1          1       0  60.002  60.002   60.002   60.002
pegasus::rc-client             1          1       0   0.263   0.263    0.263    0.263

 

Workflow Statistics Table (workflow.txt)

The workflow statistics table in file workflow.txt is not very interesting for the diamond workflow. It becomes more interesting when you execute nested workflows and desire statistics per (sub-)workflow. In the current example it reflects the summary information, as there was only one workflow.

 

Time Statistics Table (time.txt)

The timing statistics in file time.txt aggregate run-time information per time interval and per site. In the current simple and short-running example, the output is not particularly interesting. If you have long-running workflows across multiple sites, the statistics contained in this file will become more interesting.
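
The introduction to this section also mentioned pegasus-plots, which renders charts, such as workflow Gantt charts, from the same monitoring data. A sketch of a typical invocation follows; verify the exact option names with pegasus-plots --help on your VM, and note that the output directory plots is arbitrary:

$ pegasus-plots -p all -o plots dags/tutorial/pegasus/sayhi_inquire/run0001/

Point your browser at the index.html generated inside the output directory to browse the plots.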

 

Terminate Your Virtual Machine

Please refer to chapter "Terminate Your Virtual Machine" to shut down the currently running virtual machine.

However, if you feel like going on, skip the shutdown, go to the next chapter, and skip the start-up and log-in at the beginning of the next chapter.