Portable Batch System
Administrator Guide

Albeaus Bayucan
Robert L. Henderson
Lonhyn T. Jasinskyj
Casimir Lesiak
Bhroam Mann
Tom Proett
Dave Tweten

Numerical Aerospace Simulation Systems Division
NASA Ames Research Center

Release: 2.0
Printed: November 12, 1998

MRJ Technology Solutions, NASA Contract NAS 2-14303, Moffett Field, CA

Portable Batch System, Rights to Use and Redistribute

This program is confidential and proprietary to MRJ Technology Solutions and may not be reproduced, published or disclosed to others without written authorization from MRJ.

Copyright (c) 1998 MRJ Technology Solutions, Inc.
All Rights Reserved.

This program is derived from prior work, some of which was originally written as a joint project between the Numerical Aerospace Simulation (NAS) Systems Division of NASA Ames Research Center and the National Energy Research Supercomputer Center (NERSC) of Lawrence Livermore National Laboratory.

This product includes software developed by the NetBSD Foundation, Inc. and its contributors.

PBS Revision History

    Revision 1.0      June, 1994 -- Alpha Test Release
    Revision 1.1      March 15, 1995
    ...
    Revision 1.1.7    June 6, 1996
    Revision 1.1.8    August 19, 1996
    Revision 1.1.9    December 20, 1996
    Revision 1.1.10   July 31, 1997
    Revision 1.1.11   December 19, 1997
    Revision 1.1.12   July 9, 1998
    Revision 2.0      October 14, 1998

Table of Contents

Revision History
1. Introduction
1.2. Installation
1.3. Release Contents
1.4. Build Overview
1.5. Build Details
1.5.1. Make File Targets
1.5.2. Configure Options
1.6. Machine Dependent Build Instructions
1.6.1. Cray Systems
1.6.2. IBM Workstations
1.6.3. IBM SP
1.6.4. SGI Workstations Running IRIX 5
1.6.5. SGI Workstations Running IRIX 6
1.6.6. FreeBSD and NetBSD
1.6.7. Linux
1.6.8. SUN Running SunOS
1.7. Additional Build Options
1.7.1. pbs_ifl.h
1.7.2. server_limits.h
1.8. Site Modifiable Source Files
2. Batch System Configuration
2.1. Nodes
2.1.1. Defining Nodes
2.1.2. Where Jobs May Be Run
2.1.3. Specifying Nodes
2.2. Network Addresses
2.3. Starting Daemons
2.4. Configuring the Job Server, pbs_server
2.4.1. Server Configuration
2.4.2. Queue Configuration
2.4.3. Recording Server Configuration
2.5. Configuring the Execution Server, pbs_mom
2.6. Configuring the Scheduler, pbs_sched
2.7. Alternate Test Systems
2.8. Installing an Updated Batch System
3. Scheduling Policies
3.1. Scheduler - Server Interaction
3.2. BaSL Scheduling
3.3. Tcl Based Scheduling
3.4. C Based Scheduling
3.5. C Based Sample Schedulers
3.5.1. FIFO Scheduler
3.6. Scheduling and File Staging
4. GUI System Administrator Notes
4.1. xpbs
4.2. xpbsmon
5. Operational Issues
5.1. Security
5.1.1. Internal Security
5.1.2. Host Authentication
5.1.3. Host Authorization
5.1.4. User Authentication
5.1.5. User Authorization
5.1.6. Group Authorization
5.1.7. Root Owned Jobs
5.2. Job Prologue/Epilogue Scripts
5.3. Use and Maintenance of Logs
5.4. Problem Solving
5.4.1. Clients Unable to Contact Server
5.4.2. Non Delivery of Output
5.4.3. Job Cannot be Executed
5.4.4. Running Jobs with No Active Processes
5.5. Communication with the User
6. Advice for Users
6.1. Modification of User shell initialization files
6.2. Shell Invocation
6.3. Job Exit Status
6.4. Delivery of Output Files
6.5. Stage in and Stage out problems
6.6. Dependent Jobs and Test Systems

1. Introduction

This document is intended to provide the system administrator with the information required to build, install, configure, and manage the Portable Batch System. It is very likely that some important tidbit of information has been left out. No document of this sort can ever be complete, and until it has been updated by several different administrators at different sites, it is sure to be lacking.

You are strongly encouraged to read the PBS External Reference Specification, ERS, included with the release. Look for pbs_ers.ps in the src/doc directory.

1.1. What is PBS

The Portable Batch System, PBS, is a batch job and computer system resource management package. It was developed with the intent of being conformant with the POSIX 1003.2d Batch Environment Standard. As such, it will accept batch jobs (a shell script and control attributes), preserve and protect the job until it is run, run the job, and deliver output back to the submitter.

PBS may be installed and configured to support jobs run on a single system, or on many systems grouped together. Because of the flexibility of PBS, the systems may be grouped in many fashions.

1.2. Components of PBS

PBS consists of four major components: commands, the job server, the job executor, and the job scheduler. A brief description of each is given here to help you make decisions during the installation process.

Commands
    PBS supplies both command line commands that are POSIX 1003.2d conforming and a graphical interface. These are used to submit, monitor, modify, and delete jobs. The commands can be installed on any system type supported by PBS and do not require the local presence of any of the other components of PBS. There are three classifications of commands: user commands, which any authorized user can use; operator commands; and manager (or administrator) commands. Operator and manager commands require different access privileges.
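    As an illustration, a typical user session with the command line interface might look like the following; the job script name and the job identifier returned are hypothetical:

        $ qsub myjob.sh              # submit a job script
        123.server.example.com       # qsub prints the new job's identifier
        $ qstat 123                  # monitor the job's status
        $ qdel 123                   # delete the job if no longer wanted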
Job Server
    The Job Server is the central focus for PBS. Within this document, it is generally referred to as the Server or by the execution name pbs_server. All commands and the other daemons communicate with the Server via an IP network. The Server's main function is to provide the basic batch services such as receiving/creating a batch job, modifying the job, protecting the job against system crashes, and running the job (placing it into execution).

Job Executor
    The job executor is the daemon which actually places the job into execution. This daemon is informally called Mom, as it is the mother of all executing jobs; its execution name is pbs_mom. Mom places a job into execution when it receives a copy of the job from a Server. Mom creates a new session as identical to a user login session as is possible. For example, if the user's login shell is csh, then Mom creates a session in which .login is run as well as .cshrc. Mom also has the responsibility for returning the job's output to the user when directed to do so by the Server.

Job Scheduler
    The Job Scheduler is another daemon, which contains the site's policy controlling which job is run and where and when it is run. Because each site has its own ideas about what is a good or effective policy, PBS allows each site to create its own scheduler. When run, the scheduler can communicate with the various Moms to learn about the state of system resources, and with the Server to learn about the availability of jobs to execute. The interface to the Server is through the same API as used by the commands. In fact, the Scheduler just appears as a batch Manager to the Server.

Installation

This section explains the steps needed to build and install PBS. PBS installation is accomplished via the GNU autoconf process. This installation procedure requires more manual configuration than is "typical" for many packages. There are a number of options which involve site policy and therefore cannot be determined automagically.

1.3. Release Contents

1.3.1. Tar File

PBS is provided as a single tar file. The tar file contains:

- This document in both postscript and text form.
- All source code, header files, and make files required to build and install PBS.
- A full set of documentation sources. These are troff input files.

1.3.2. Additional Requirements

PBS uses a configure script generated by GNU autoconf to produce makefiles. If you have a POSIX make program, the makefiles generated by configure will try to take advantage of POSIX make features. If your make is unable to process the makefiles while building, you may have a broken make. Should make fail during the build, try using GNU make.

If the Tcl based GUI (xpbs and xpbsmon) or the Tcl based scheduler is used, the Tcl header and library are required. If the BaSL scheduler is used, yacc and lex (or GNU bison and flex) are required.

The official site for Tcl is:

    http://www.scriptics.com/
    ftp://ftp.scriptics.com/pub/tcl/tcl8_0

Versions of Tcl prior to 8.0 can no longer be used with PBS. Tcl and Tk version 8.0 or greater must be used.

A possible site for yacc and lex is:

    prep.ai.mit.edu:/pub/gnu
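If you are unsure which Tcl level is installed on the build host, one quick check (assuming tclsh is on your PATH) is to ask the interpreter itself; PBS requires the result to be 8.0 or greater:

    # Print the version of the installed Tcl interpreter
    echo 'puts $tcl_version' | tclsh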
1.4. Build Overview

The normal PBS build procedure is to separate the source from the target. This allows the placement of a single copy of the source on a shared file system from which multiple different target systems can be built. Also, the source can be protected from accidental destruction or modification by making the source read-only. However, if you choose, objects may be made within the source tree.

In the following descriptions, the source tree is the result of un-tar-ing the tar file into a directory (and subdirectories). A diagram of the source tree is shown in figure 1-1.

    Figure 1-1: Source Tree Structure. The top of the source tree holds
    configure and a Makefile, with doc and src subdirectories. doc contains
    admin, ers, and ids; src contains include, server, sched*, lib, cmds,
    resmom (with machine subdirectories such as aix4, irix5, and unicos8),
    net, and log, each with its own Makefile.

The target tree is a set of parallel directories in which the object modules are actually compiled. This tree may be separate from the source tree.

An overview of the installation steps is listed here. Detailed explanation of symbols will follow. It is recommended that you read completely through these instructions before beginning the installation.

To install PBS:

1. Place the tar file on the system where you would like to maintain the source.

2. Untar the tar file. It will untar in the current directory, producing several files and subdirectories. This is the source tree. You may write protect the source tree at this point should you so choose.

   In the top directory are two files, named "How_to_Install" and "INSTALL". The "How_to_Install" file is actually a set of release notes containing information about the release contents and pointing to this guide for installation instructions. The "INSTALL" file consists of standard notes about the use of GNU's configure.

3. If you choose to have separate build and source trees, create the top level directory of what will become the target tree at this time. This need not be within the source tree. It must reside on a system of the same architecture as the target system for which you are generating the PBS binaries. This may well be the same system that holds the source, or it may not. Change directories to the top of the target tree.

4. Make a job scheduler choice. A unique feature of PBS is its external scheduler module. This allows a site to implement any policy of its choice. To provide even more freedom in implementing policy, PBS provides three scheduler frameworks. Schedulers may be developed in the C language, the Tcl scripting language, or PBS's very own C language extensions, the Batch Scheduling Language, or BaSL.

   As distributed, configure will default to a C language based scheduler known as fifo. This scheduler can be configured to several common simple scheduling policies, not just first in - first out as the name suggests. When this scheduler is installed, certain configuration files are installed in PBS_HOME/scheduler_priv/. You will need to modify these files for your site. These files are discussed in section 3.5, "C Based Sample Schedulers", in the subsection "FIFO Scheduler".

   To change the selected scheduler, see the configure options --set-sched and --set-sched-code in the Features and Package Options section of this chapter, and the example below. Additional information on the types of scheduler and how to configure fifo can be found in the Scheduling Policies chapter later in this guide.
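   For example, a site choosing the BaSL framework with one of the shipped sample policies could pass the following options to configure; the {source_tree} path is as described in step 5:

       # Select the BaSL scheduler framework and a sample policy file
       {source_tree}/configure --set-sched=basl \
                               --set-sched-code=fifo_byqueue.basl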
5. From within the top of the target tree, type the following command:

       {source_tree}/configure [options]

   where {source_tree} is the full relative or absolute path to the configure script in the source tree. If you are building in the source tree, type ./configure at the top level of the source tree where the configure script is found.

   This will generate the complete target tree and a set of header files and makefiles used to build PBS. Note, unlike in prior versions of PBS, the top two subdirectories will be doc and src rather than doc and obj; this is a result of the one-for-one copy of the tree made by configure. This step will only need to be redone if you choose to change options specified on the configure command line. See section 1.5, Build Details, for information on the configure options.

   There are certain options which you must specify. Because the number of options may be large and each option is very wordy, you may wish to create a shell script consisting of the configure command and the selected options (see the sketch following step 8).

6. The next step is to compile PBS by typing

       make

   from the top of the target tree.

7. The documentation is not generated by default. You may make it by specifying the --enable-docs option to configure, or by changing into the doc subdirectory in the target tree and typing make. In order to build and print PostScript copies of the documentation from the included source, you will need the GNU groff formatting package, including the "ms" formatting macro package. You may choose to print using different font sets. In the source tree is a file "doc/doc_fonts" which may require editing. Please read the comments in that file. Note that font position 4 is left with the symbol font mounted.

8. To install PBS you must be running with root privileges. As root, type

       make install

   This generates the working directory structures required for running PBS and installs the programs in the proper executable directories.

   When the directories are made, they are also checked to ensure that they have been set up with the correct ownership and permissions. This is performed to ensure that files are not tampered with and the security of PBS compromised. Part of the check is to ensure that all parent directories and all files are:

   - owned by root (bin, sys, or any uid < 10), EPERM returned if not;
   - that group ownership is by a gid < 10, EPERM returned if not;
   - that the directories are not world writable, or where required to be world writable that the sticky bit is set, EACCESS returned if not; and
   - that the file or directory is indeed a file or directory, ENOTDIR returned if not.

   The various PBS daemons will also perform similar checks when they are started.
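   As suggested in step 5, the selected configure options can be recorded in a small shell script so that an identical build can be repeated later. A minimal sketch, with purely illustrative paths and option values:

       #!/bin/sh
       # build_pbs.sh - one place to record this site's configure options
       SRC=/usr/local/src/pbs                  # hypothetical source tree location
       $SRC/configure --set-server-home=/var/spool/pbs \
                      --set-default-server=server.example.com
       make
       # 'make install' must then be run as root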
9. The three daemons, pbs_server, pbs_sched and pbs_mom, must be run by root in order to function. Typically they should be started at system boot time. To have a fully functional system, each of the daemons will require certain configuration information. This is explained in detail in chapter 2, Batch System Configuration.

   Note that not all three daemons must be, or even should be, present on all systems. In the case of a large system, all three may be present. In the case of a cluster of workstations, you may have the server (pbs_server) and the scheduler (pbs_sched) on one system only, and a copy of Mom (pbs_mom) on each node where jobs may be executed. At this point, it is assumed that you plan to have all three daemons running on one system.

   A. One time only, start pbs_server with the "-t create" option. See the ERS for command details. This option causes the server to initialize various files. This option will not be required after the first time, unless you wish to clear the server database and start over. See the pbs_server(8) man page for more information. A copy of the section 8 man pages can be found in the External Reference Spec, ERS. For man pages on pbs_server and pbs_mom, see ERS chapter 6. For man pages on the scheduler, see ERS chapter 7.

   B. Start the execution server, pbs_mom. No options or arguments are required. Mom will look for a configuration file in PBS_HOME/mom_priv/config. If it is not found, Mom will continue to function, but certain requests may be made of Mom from the local system only. See the pbs_mom(8) man page for more information on the config file.

   C. Start the selected job scheduler, pbs_sched.

      a. For C language based schedulers, such as the default, see the man page pbs_sched_cc(8) for more detail.

      b. For the BaSL scheduler, the scheduling policy is written in a specialized batch scheduling language that is similar to C. The scheduling code, containing BaSL constructs, must first be converted into C using the basl2c utility. This is done by setting the configure option --set-sched-code=file, where file is the relative (to src/scheduler.basl/samples) or absolute path of a BaSL source file. The file name should end in .basl. A good sample program is "fifo_byqueue.basl", which can schedule jobs in a single-server, single-execution-host environment, or a single-server, multiple-node-host environment. Read the header of this sample scheduler for more information about the algorithm used.

         The scheduler configuration file is an important entity in BaSL because it is where the list of servers and host resources resides. Execute the BaSL based scheduler by typing:

             pbs_sched -c config_file

         The scheduler searches for the config file in PBS_SERVER_HOME/sched_priv by default. More information can be found in the man page pbs_sched_basl(8).

      c. The Tcl scheduler requires the Tcl code policy module. Samples of Tcl scripts may be found in

             src/scheduler.tcl/sample_scripts

         For the Tcl based scheduler, the Tcl body script should be placed in PBS_HOME/sched_priv/some_file and the Scheduler run via

             pbs_sched -b PBS_HOME/sched_priv/some_file

         More information can be found in the man page pbs_sched_tcl(8).

10. Log onto the system as root and define yourself to pbs_server as a manager by typing:

        # qmgr
        Qmgr: set server managers=your_name@your_host

    From this point, you no longer need root privilege. Note, "your_host" can be any host on which PBS's qmgr command is installed. You can now configure and manage a remote batch system from the comfort of your own workstation.

    Now you need to define at least one queue. Typically it will be an execution queue, unless you are using this server purely as a gateway. You may choose to establish queue minimum, maximum, and/or default resource limits for some resources.
    For example, to establish a minimum of 1 second, a maximum of 12 cpu hours, and a default of 30 cpu minutes on a queue named "qname", issue the following commands inside of qmgr:

        Qmgr: create queue qname queue_type=e
        Qmgr: s q qname resources_min.cput=1,resources_max.cput=12:00:00
        Qmgr: s q qname resources_default.cput=30:00
        Qmgr: s q qname enabled=true

    Lastly, allow the scheduling of jobs by pbs_sched by issuing:

        Qmgr: s s scheduling=true

    When the attribute scheduling is set to true, the server will call the job scheduler; if false, the job scheduler is not called. The value of scheduling may be specified on the pbs_server command line with the -a option.

1.5. Build Details

While the overview gives sufficient information to build a basic PBS system, there are many options available to you and custom tailoring that should be done.

1.5.1. Make File Targets

The following target names are applicable for make:

all
    The default target; it compiles everything.
build
    Same as all.
depend
    Builds the header file dependency rules.
install
    Installs everything.
clean
    Removes all object and executable program files in the current subtree.
distclean
    Leaves the object tree very clean. It will remove all files that were created during a build.

1.5.2. Configure Options

The following is detailed information on the options to the configure script.

1.5.2.1. Generic Configure Options

The following are generic configure options that do not affect PBS.

--cache-file=file
    Cache the system configuration test results in file.
    Default: config.cache

--help
    Prints out information on the available options.

--no-create
    Do not create output files.

--quiet, --silent
    Do not print "checking" messages.

--version
    Print the version of autoconf that created configure.

--enable-depend-cache
    This turns on configure's ability to cache makedepend information across runs of configure. This can be bad if the user makes certain configuration changes when rerunning configure, but it can save time in the hands of experienced developers.
    Default: disabled

1.5.2.2. Directory and File Names

These options specify where PBS objects will be placed.

--prefix=PREFIX
    Install files in subdirectories of the PREFIX directory.
    Default: /usr/local

--exec-prefix=EPREFIX
    Install architecture dependent files in subdirectories of EPREFIX.
    Default: see PREFIX

--bindir=DIR
    Install user executables (commands) in subdirectory DIR.
    Default: EPREFIX/bin (/usr/local/bin)

--sbindir=DIR
    Install System Administrator executables in subdirectory DIR. This includes certain administrative commands and the daemons.
    Default: EPREFIX/sbin (/usr/local/sbin)

--libdir=DIR
    Object code libraries are placed in DIR. This includes the PBS API library, libpbs.a.
    Default: PREFIX/lib (/usr/local/lib)

--includedir=DIR
    C language header files are installed in DIR.
    Default: PREFIX/include (/usr/local/include)

--mandir=DIR
    Install man pages in DIR.
    Default: PREFIX/man (/usr/local/man)

--srcdir=SOURCE_TREE
    PBS sources can be found in directory SOURCE_TREE.
    Default: location of the configure script.
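For instance, a site that keeps optional packages under /opt might combine these directory options as follows; the paths are purely illustrative:

    # Install commands, daemons, libraries, and man pages under /opt/pbs
    {source_tree}/configure --prefix=/opt/pbs --mandir=/opt/pbs/man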
1.5.2.3. Features and Package Options

In general, these options take the following forms:

--disable-FEATURE
    Do not compile for FEATURE, same as --enable-FEATURE=no
--enable-FEATURE
    Compile for FEATURE
--with-PACKAGE
    Compile to include PACKAGE
--without-PACKAGE
    Do not compile to include PACKAGE, same as --with-PACKAGE=no
--set-OPTION
    Set the value of OPTION

For PBS, the recognized --enable/disable, --with/without, and --set options are:

--enable-docs
    Build (or not build) the PBS documentation. To do so, you will need the following GNU utilities: groff, gtbl and gpic. Even if this option is not set, the man pages will still be installed.
    Default: disabled

--enable-server
    Build (or not build) the PBS job server, pbs_server. Normally all components (Commands, Server, Mom, and Scheduler) are built.
    Default: enabled

--enable-mom
    Build (or not build) the PBS job execution daemon, pbs_mom.
    Default: enabled

--enable-clients
    Build (or not build) the PBS commands.
    Default: enabled

--with-tcl=DIR_PREFIX
    Use this option if you wish Tcl based PBS features compiled. These features include the GUI interface, xpbs. If the following option, --with-tclx, is set, use this option only if the Tcl libraries are not co-located with the Tclx libraries. When set, DIR_PREFIX must specify the absolute path of the directory containing the Tcl libraries.
    Default: without; Tcl utilities are not built.

--with-tclx=DIR_PREFIX
    Use this option if you wish the Tcl based PBS features to be based on Tclx. This option implies --with-tcl.
    Default: Tclx is not used.

--enable-gui
    Build the xpbs GUI. Only valid if --with-tcl is set.
    Default: enabled

--set-cc[=ccprog]
    Specify which C compiler should be used. This will override the CC environment setting. If only --set-cc is specified, then CC will be set to cc.
    Default: gcc (after all, configure is from GNU also)

--set-cflags[=FLAGS]
    Set the compiler flags. This is used to set the CFLAGS variable. If only --set-cflags is specified, then CFLAGS is set to "". This must be set to -64 to build 64 bit objects under Irix 6, e.g. --set-cflags=-64. Note, multiple flags, such as -g and -64, should be enclosed in quotes, e.g. --set-cflags='-g -64'.
    Default: CFLAGS is set to a best guess for the system type.

--enable-debug
    Builds PBS with debug features enabled. This allows the daemons to remain attached to standard output and produce vast quantities of messages.
    Default: disabled

--set-tmpdir=DIR
    Set the tmp directory in which pbs_mom will create temporary scratch directories for jobs. Used on Cray systems only.
    Default: /tmp

--set-server-home=DIR
    Sets the top level directory name for the PBS working directories, PBS_HOME. This directory MUST reside on a local file system. PBS uses synchronous writes to files to maintain state.
    Default: /usr/spool/pbs

--set-server-name-file=FILE
    Set the file name which will contain the name of the default server. This file is used by the commands to determine which server to contact. If FILE is not an absolute path, it will be evaluated relative to the value of --set-server-home.
    Default: server_name

--set-default-server=HOSTNAME
    Set the name of the host that clients will contact when not otherwise specified in the command invocation. It must be the primary network name of the host.
    Default: the name of the host on which PBS is being compiled.
--set-environ=PATH
    Set the path name of the file containing the environment variables used by the daemons and passed to the jobs. For AIX based systems, we suggest setting this option to /etc/environment. Relative path names are interpreted relative to the value of --set-server-home, PBS_HOME.
    Default: the file pbs_environment in the directory PBS_HOME.
    For a discussion of this file and the environment, see section 5.1.1, Internal Security. You may edit this file to modify the path or add other environment variables.

--enable-plock-daemons=WHICH
    Enable daemons to lock themselves into memory to improve performance. The argument WHICH is the logical-or of 1 for pbs_server, 2 for pbs_sched, and 4 for pbs_mom (7 is all three daemons). This option is recommended for Unicos systems. It should not be used for AIX systems.
    Default: disabled

--enable-syslog
    Enable the use of syslog for error reporting. This is in addition to the normal PBS logs.
    Default: disabled

--set-sched=TYPE
    Set the scheduler (language) type. If set to c, a C based scheduler will be compiled. If set to tcl, a Tcl based scheduler will be used. If set to basl, a BAtch Scheduler Language scheduler will be generated. If set to no, no scheduler will be compiled and jobs will have to be run by hand.
    Default: c

--set-sched-code=PATH
    Sets the name of the file or directory containing the source for the scheduler. This is only used for C and BaSL schedulers, where --set-sched is set to either c or basl. For C schedulers, this should be a directory name. For BaSL schedulers, it should be a file name ending in .basl. If the path is not absolute, it will be interpreted relative to SOURCE_TREE/src/schedulers.SCHED_TYPE/samples. For example, if --set-sched is set to basl, then set --set-sched-code to fifo_byqueue.basl.
    Default: fifo (C based scheduler)

--enable-tcl-qstat
    Builds qstat with the Tcl interpreter extensions. This allows site and user customizations. Only valid if --with-tcl is already present.
    Default: disabled

--set-tclatrsep=CHAR
    Set the character to be used as the separator character between attribute and resource names in Tcl/Tclx scripts.
    Default: "."

--set-qstatrc-file=FILE
    Set the name of the file that qstat will use if there is no .qstatrc file in the user's home directory. This option is only valid when --enable-tcl-qstat is set. If FILE is a relative path, it will be evaluated relative to the PBS home directory; see --set-server-home.
    Default: PBS_HOME/qstatrc

--with-scp
    Directs PBS to attempt to use the Secure Copy Program, scp, when copying files to or from a remote host. This applies to delivery of output files and stage-in/stage-out of files. If scp is to be used and the attempt fails, PBS will then attempt the copy using rcp, in case scp did not exist on the remote host. For local delivery, "/bin/cp -r" is always used. For remote delivery, a variant of rcp is required. The program must always provide a non-zero exit status on any failure to deliver files. This is not true of all rcp implementations, hence a copy of a known good rcp is included in the source; see mom_rcp. More information can be found in chapter 6.4, "Delivery of Output Files."
    Default: sbindir/pbs_rcp (from the mom_rcp source directory) is used, where sbindir is the value from --sbindir.
--enable-shell-pipe
    When enabled, pbs_mom passes the name of the job script to the top level shell via a pipe rather than placing the script file as the shell's standard input. This does result in a second shell being created to process the script. However, a command in the script that reads from standard input will not be able to access the script contents, which can cause some shells to bypass commands in the script.
    Default: enabled

--enable-sp2
    Turn on special features for the IBM SP. This requires access to a special IBM supplied library, libSDR.a. This library is not provided by IBM except on special request. This option is only valid when the PBS machine type is aix4. The PBS machine type is automatically determined by the configure script.
    Default: disabled

--enable-srfs
    This option enables support for Session Reservable File Systems. It is only valid on Cray systems with the NASA modifications to support the Session Reservable File System, SRFS.
    Default: disabled

--enable-array
    Setting this under Irix 6.x enables the use of SGI Array Session tracking. Enabling this feature is recommended if MPI jobs use the Array Services Daemon. The PBS machine type is set to irix6array.
    Default: disabled
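Pulling several of the preceding options together, a complete configure invocation for a hypothetical site might look like the following; all values are illustrative, not recommendations:

    # Build with the Tcl GUI, syslog error reporting, and a PBS_HOME
    # on a local file system
    {source_tree}/configure --set-server-home=/var/spool/pbs \
                            --with-tcl=/usr/local/lib \
                            --enable-syslog \
                            --set-default-server=server.example.com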
1.6. Machine Dependent Build Instructions

There are a number of possible variables that are only used for a particular type of machine. If you are not building for one of the following types, you may ignore this section.

1.6.1. Cray Systems

1.6.1.1. Cray C90, J90, and T90 Systems

On the traditional Cray systems such as the C90, PBS supports Unicos versions 8, 9 and 10. If your system supports the Session Reservable File System enhancement by NASA, run configure with the --enable-srfs option. If enabled, the server and MOM will be compiled to have the resource names srfs_tmp, srfs_big, srfs_fast, and srfs_wrk. These may be used from qsub to request SRFS allocations. The file /etc/tmpdir.conf is the configuration file for this. An example file is:

    # Shell environ var   Filesystem
    TMPDIR
    BIGDIR                /big/nqs
    FASTDIR               /fast/nqs
    WRKDIR                /big/nqs

The directory for TMPDIR will default to that defined by JTMPDIR in Unicos's /usr/include/tmpdir.h.

Without the SRFS mods, Mom under Unicos will create a temporary job scratch directory. By default, this is placed in /tmp. The location can be changed via --set-tmpdir=DIR.

1.6.1.2. Unicos 10 with MLS

If you are running Unicos MLS, required in Unicos 10.0 and later, the following action is required after the system is built and installed. MOM updates ue_batchhost and ue_batchtime in the UDB for the user. In an MLS system, MOM must have the security capability to write the protected UDB. To grant this capability, change directory to wherever pbs_mom has been installed and type:

    spset -i 16 -j daemon -k exec pbs_mom

You, the administrator, must have capabilities secadm and class 16 to issue this command. You use the setucat and setucls commands to get to these levels if you are authorized to do so. The UDB reclsfy permission bit gives a user the proper authorization to use the spset command.

WARNING: There has been only limited testing in the weakest of MLS environments; problems may appear because of differences in your environment.

1.6.1.3. Cray T3E Systems

For Cray T3E systems, TBD.

1.6.2. IBM Workstations

PBS supports IBM workstations running AIX 4.x. When man pages are installed in mandir, the default man page file name suffix, "B", must be removed. Currently, this must be done by hand. For example, change man3/qsub.3B to man3/qsub.3.

Do not use the configure option --enable-plock. It will crash the system by using up all of memory.

1.6.3. IBM SP

Everything under the IBM Workstations section above applies to the IBM SP. Be sure to read section 2.1, Nodes, before configuring the Server. Set special SP-2 code to interface with the IBM Resource (Job) Manager with --enable-sp2.

1.6.4. SGI Workstations Running IRIX 5

If, and only if, your system is running Irix 5.3, you will need to add -D_KMEMUSER to CFLAGS because of a quirk in the Irix header files.

1.6.5. SGI Workstations Running IRIX 6

If built for Irix 6.x, pbs_mom will track which processes are part of a PBS job using POSIX session numbers. This method is fine for workstations and multiprocessor boxes not running the SGI Array Services Daemon (arrayd) and not using SGI's mpirun command. The PBS machine type (PBS_MACH) is set to irix6.

Where arrayd and mpirun are being used, the tasks of a parallel job are started through requests to arrayd and hence are not part of the job's POSIX session. In order to relate processes to the job, the SGI Array Session Handle (ASH) must be used. This feature is enabled by setting --enable-array. The PBS machine type (PBS_MACH) is set to irix6array.

IRIX 6 supports both 32 and 64 bit objects. In prior versions of PBS, PBS was typically built as a 32 bit object. Irix 6.4 introduced system supported checkpoint/restart; PBS will include support for checkpoint/restart if the file /usr/lib64/libcpr.so is detected during the build process. To interface with the SGI checkpoint/restart library, PBS must be made as a 64 bit object. Add -64 to the CFLAGS. This can be done via the configure option --set-cflags=-64.

WARNING: Because of changes in structure size, PBS will not be able to recover any server, queue, or job information recorded by a PBS built with 32 bit objects, or vice versa. Please read section 2.8, Installing an Updated Batch System, for instructions on dealing with this incompatibility.

If libcpr.so is not present, PBS may be built as either a 32 bit or a 64 bit object. To build as 32 bit, add -n32 instead of -64 to CFLAGS.

1.6.6. FreeBSD and NetBSD

There is a problem with FreeBSD up to at least version 2.2.6. It is possible to lose track of which session a set of processes belongs to if the session leader exits. This means that if the top shell of a job leaves processes running in the background and then exits, MOM will not be able to find them when the job is deleted. This should be fixed in a future version.

1.6.7. Linux

Redhat version 4.x is supported. Version 5.x is not currently supported; there are system header file differences.

1.6.8. SUN Running SunOS

The native SunOS C compiler is not ANSI and cannot be used to build PBS. GNU gcc is recommended.

1.7. Additional Build Options

Two header files within the subdirectory src/include provide additional configuration control over the server and MOM. The modification of any symbols in the two files should not be undertaken lightly.

1.7.1. pbs_ifl.h

This header file contains structures, symbols and constants used by the API, libpbs.a, and the various commands as well as the daemons. Very little here should ever be changed.
Possible exceptions are the following symbols. They must be consistent between all batch systems which might interconnect.

PBS_MAXHOSTNAME
    Defines the length of the maximum possible host name. This should be set at least as large as MAXHOSTNAME, which may be defined in sys/params.h.

PBS_MAXUSER
    Defines the length of the maximum possible user login name.

PBS_MAXGRPN
    Defines the length of the maximum possible group name.

PBS_MAXQUEUENAME
    Defines the length of the maximum possible PBS queue name.

PBS_USE_IFF
    If this symbol is set to zero (0) before the library and commands are built, the API routine pbs_connect() will not attempt to invoke the program pbs_iff to generate a secure credential to authenticate the user. Instead, a clear text credential will be generated. This credential is completely subject to forgery and is useful only for debugging the PBS system. You are strongly advised against using a clear text credential.

PBS_BATCH_SERVICE_PORT
    Defines the port number at which the server listens.

PBS_MOM_SERVICE_PORT
    Defines the port number at which MOM, the execution miniserver, listens.

PBS_SCHEDULER_SERVICE_PORT
    Defines the port number at which the scheduler listens.

1.7.2. server_limits.h

This header file contains symbol definitions used by the server and by MOM. Only those that might be changed are listed here. These should be changed with care. It is strongly recommended that no other symbols in server_limits.h be changed. If server_limits.h is to be changed, it may be copied into the include directory of the target (build) tree and modified before compiling.

NO_SPOOL_OUTPUT
    If defined, directs MOM to not use a spool directory for the job output, but to place it in the user's home directory while the job is running. This allows a site to invoke quota control over the output of running batch jobs.

PBS_BATCH_SERVICE_NAME
    This is the service name used by the server to determine to which port number it should listen. It is set to "pbs", in quotes as it is a character string. Should you wish to assign PBS a service port in /etc/services, change this string to the service name assigned. You should also update PBS_SCHEDULER_SERVICE_NAME as required.

PBS_DEFAULT_ADMIN
    Defined to the name of the default administrator, typically "root". Generally only changed to simplify debugging.

PBS_DEFAULT_MAIL
    Set to the user name from which mail will be sent by PBS. The default is "adm". This is overridden if the server attribute mail_from is set.

PBS_JOBBASE
    The length of the job id string used as the basename for job associated files stored in the spool directory. It is set to 11, which is 14 minus the 3 characters of suffixes like .JB and .OU. Fourteen is the guaranteed length for a file name under POSIX. The actual length that a file name can be depends on the file system and must be determined at run time, but PBS is too lazy to go to that trouble. If the server and MOM run on a file system that supports longer names (most do), then you may up this value so that the names are more readable.

PBS_MAX_HOPCOUNT
    Used to limit the number of hops taken when being routed from queue to queue. It is mainly to detect loops.

PBS_NET_MAX_CONNECTIONS
    The maximum number of open file descriptors and sockets supported by the server.

PBS_NET_RETRY_LIMIT
    The limit on retrying requests to remote servers.
PBS_NET_RETRY_TIME
    The time between network routing retries to remote queues and for requests between the server and MOM.

PBS_RESTAT_JOB
    To refrain from overburdening any given MOM, the server will wait this amount of time (default 30 seconds) between asking her for updates on running jobs. In other words, if a user asks for status of a running job more often than this value, the prior data will be returned.

PBS_ROOT_ALWAYS_ADMIN
    If defined (set to 1), "root" is an administrator of the batch system even if not listed in the managers attribute.

PBS_SCHEDULE_CYCLE
    The default value for the elapsed time between scheduling cycles with no change in jobs queued. This is the initial value used by the server, but it can be changed via qmgr(1B).

1.8. Site Modifiable Source Files

It is safe to skip this section until you have played with PBS for a while and want to start tinkering.

Dave Tweten of NASA has said, "If it ain't source, it ain't software." This is part of PBS's philosophy that source distribution should be a major part of any software product; otherwise, the product becomes "hard"-ware. The first example of this philosophy is the PBS job scheduler. The implementation of the site policy is left to the site. PBS provides three tools for that implementation: the BaSL scheduler, the Tcl scheduler, and the C scheduler.

The philosophy does not stop with the scheduler. With distribution of the source, a site has the ability to modify any part of PBS as it so chooses. Of course, indiscriminate modification is not without dangers, not the least of which is conflicts with future releases by the developers.

Certain functions of PBS appear to be likely targets of widespread modification by sites for a number of reasons. When identified, the developers of PBS have attempted to improve the ease of modification in these areas by the inclusion of special site specific modification routines. These are identified in the IDS under chapter headings of "Site Modifiable Files" in the sections on the Server and MOM.

The distributed default versions of these files build a private library, libsite.a, which is included in the linking phase for the server and for MOM. They may be replaced as needed by a site. The procedure is described in the IDS under "libsite.a - Site Modifiable Library" in chapter 10. The files include:

Server

site_allow_u.c
    The routine in this file, site_allow_u(), provides an additional point at which a user can be denied access to the batch system (server). It may be used instead of, or in addition to, the server Acl_User list.

site_alt_rte.c
    The function site_alt_router() allows a site to add decision capabilities to job routing. This function is called on a per-queue basis if the queue attribute alt_router is true. As provided, site_alt_router() just invokes the default router, default_router().

site_check_u.c
    The routine in this file, site_check_user_map(), provides the service of authenticating that the job owner is privileged to run the job under the user name specified or selected for execution on the server system. Please see the IDS for the default authentication method.

site_map_usr.c
    For sites without a common user name/uid space, this function, site_map_user(), provides a place to add a user name mapping function. The mapping occurs at two times:
    first, to determine if a user making a request against a job is the job owner (see "User Authorization"); and second, to map the submitting user (job owner) to an execution uid on the local machine.

site_*_attr_*.h
    These files provide a site with the ability to add local attributes to the server, queues, and jobs. The files are installed into the target tree "include" subdirectory during the first make. As delivered, they contain only comments. If a site wishes to add attributes, these files can be carefully modified.

    The files are in three groups: by server, queue, and job. In each group are site_*_attr_def.h files, which are used to define the name and support functions for the new attribute or attributes, and site_*_attr_enum.h files, which insert an enumerated label into the set for the corresponding parent object. For server, queue, and node attributes, there is also an additional file that defines whether the qmgr(1) command will include the new attribute in the set "printed" with the print server, print queue, or print node subcommands.

    Detailed information on how to modify these files can be found in the IDS under the "Site Modifiable Files" section of the Server chapter, chapter 5.

    You should note that just adding attributes will have no effect on how PBS processes jobs. The main usage for new attributes would be in providing new scheduler controls and/or information. The scheduling algorithm will have to be modified to use the new attributes. If you need MOM to do something different with a job, you will still need "to get down and dirty" with her source code.

MOM

site_mom_chu.c
    If a server is feeding jobs to more than one MOM, additional checking for execution privilege may be required at MOM's level. It can be added in this function, site_mom_chkuser().

site_mom_ckp.c
    Provides post-checkpoint, site_mom_postchk(), and pre-restart, site_mom_prerst(), "user exits" for the Cray.

site_mom_jset.c
    The function site_job_setup() allows a site to perform specific actions once the job session has been created and before the job runs.

2. Batch System Configuration

Now that the system has been built and installed, the work has just begun. The scheduling policy must be implemented and the Server and Moms configured. These items are closely coupled. Managing which and how many jobs are scheduled into execution can be done in several ways. Each method has an impact on the implementation of the scheduling policy and the server attributes. An example is the decision to schedule jobs out of a single pool (queue) or to divide jobs into one of multiple queues, each of which is managed differently. More discussion of this type is covered under the sections Server Management and Scheduling Policies.

2.1. Nodes

Before proceeding, you will need to make some decisions about how the overall batch system is to be arranged. To do that, we need to define some terms:

Node
    A computer system with a single Operating System image, a unified virtual memory image, and one or more IP addresses. Frequently, the term execution host is used for node. A box like the SGI Origin 2000, which contains multiple processing units running under a single OS copy, is one node. A box like the IBM SP, which contains many units, each with its own copy of the OS, is a collection of many nodes. Under PBS, a node may be allocated exclusively, shared, or timeshared.

Cluster
    A collection of nodes managed by one batch system.
    A cluster may be made up of nodes that are allocated to only one job at a time, of nodes that have many jobs executing on each at once, or of a combination of both.

Cluster Node
    To confuse the issue, we use the term cluster node to identify a node that is allocated specifically to one job (see exclusive nodes) or to a few jobs (see shared nodes). This type of node may also be called space shared. Hosts that are timeshared among many jobs are called "timeshared."

Exclusive Nodes
    An exclusive node is one that is used by one and only one job at a time. A set of nodes is assigned exclusively to a job for the duration of that job. This is typically done to improve the performance of message passing programs.

Shared Nodes
    A shared node is one which is shared by multiple jobs. If several jobs which requested multiple shared nodes are running, some nodes may be common to the subset allocated to each job and some may be unique. When a node is allocated as a shared node, it remains so until all jobs using it are terminated. Then the node may be next allocated for shared or exclusive use.

Timeshare
    In our context, to always allow multiple jobs to run concurrently on an execution host or node. Often the term host rather than node is used in conjunction with timeshare, as in timeshared host. If the term node is used without the timeshare prefix, the node is one which is allocated either exclusively or shared. If a host, or node, is indicated to be timeshared, it will never be allocated (by the server) exclusively or shared.

Load Balance
    A policy wherein jobs are distributed across multiple timeshared hosts to even out the work load on each host. Being a policy, the distribution of jobs across execution hosts is solely a function of the Job Scheduler.

Node Property
    In order to have a means of grouping nodes for allocation, a set of zero or more node properties may be given to each node. The property is nothing more than a string of printable characters without meaning to PBS. You, as the PBS administrator, may choose whatever property names you wish. Your choices for property names should be relayed to the users.

Batch System
    A PBS Batch System consists of one Job Server (pbs_server), one or more Job Schedulers (pbs_sched), and one or more execution servers (pbs_mom). With prior versions of PBS, a Batch System could be set up to support only a cluster of exclusive nodes or to support one or more timeshared hosts. There was no support for shared nodes. With this release, a PBS Batch System may be set up to feed work to one large timeshared system, multiple timeshared systems, a cluster of nodes to be used exclusively or shared, or any combination of the preceding.

2.1.1. Defining Nodes

In prior PBS systems, nodes were only specified when the system was built and configured to support exclusive nodes. A single pbs_mom was given a list of nodes. When jobs were run, the top level shell for all jobs ran on the host where that Mom lived. Mom allocated nodes to the job and provided the job a file containing the list of nodes allocated to it. If the PBS batch system was supporting one or more timeshared hosts, only the Job Scheduler knew of those hosts. It directed where the server should send the job for execution.

In this version of PBS, allocation of nodes is handled by the Server instead of Mom. Each node to be used by a job must have its own copy of Mom running on it.
If only timeshared hosts are to be served by the PBS batch system, then as before, the Job Scheduler must direct where the job should be run. If unspecified, the Server will execute the job on the host where it is running. See the next section for full details.

If nodes are to be allocated exclusively or shared, a list of the nodes must be specified to the server. This list may also contain timeshared nodes. Nodes marked as timeshared will be listed by the server in a node status report along with the other nodes. However, the server will not attempt to allocate them to jobs. The presence of timeshared nodes in the list is solely a convenience to the Job Scheduler and other programs, such as xpbsmon.

The node list is given to the server in a file named nodes in the server's home directory, PBS_HOME/server_priv. This is a simple text file with the specification of a single node per line in the file. The format of each line in the file is:

    node_name[:ts] [property ...]

The node name is the network name of the node (host name). The optional :ts appended to the name indicates that the node is a timeshared node. Zero or more properties may be specified. Each item on the line must be separated by white space. Comment lines may be included if the first non-white space character is the pound sign '#'. This is the same format used by Mom in earlier versions, with the addition of the :ts suffix.

The following is an example of a possible nodes file:

    # The first set of nodes are space shared (cluster) nodes.
    # Note that the properties are provided to group
    # certain nodes together.
    curly      stooge odd
    moe        stooge even
    larry      stooge even
    harpo      marx   odd
    groucho    marx   odd
    chico      marx   even
    # And for fun we throw in one timeshared node.
    chaplin:ts

After the pbs_server is started with a nodes file containing at least one node definition, the list of nodes may be altered via the qmgr command.

Add nodes:

    create node node_name [attributes=values]

where the attributes are state, properties, and ntype. The possible values for the attributes are:

state
    Which can take the values: free, down, or offline.

properties
    Which can be any string, or comma separated strings, which must be enclosed in quotes. For example:
        properties="green,blue,yellow"

ntype
    Which can take the values: cluster or time-shared.

Delete nodes:

    delete node node_name

Modify nodes:

    set node node_name [attributes=values]

where the attributes are the same as for create.
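For example, using the node names from the sample nodes file above, a qmgr session to adjust the node list might look like the following; the particular changes shown are purely illustrative:

    # qmgr
    Qmgr: create node zeppo properties="marx,even"
    Qmgr: set node curly state=offline
    Qmgr: delete node chaplin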
2.1.2. Where Jobs May Be Run

Where jobs may be or will be run is determined by an interaction between the Scheduler and the Server. This interaction is affected by the existence of the nodes file.

2.1.2.1. No Node File

If a nodes file does not exist, the server only directly knows about its own host. It will assume that jobs may be executed on it. When told to run a job without a specific execution host named, it will default to its own host. Otherwise, it will attempt to execute the job where directed in the Run Job request.

2.1.2.2. Node File Exists

If a nodes file exists, then the following rules come into play:

1. If a specific host is named in the Run Job request and the host is specified in the nodes file as a timeshared host, the Server will attempt to run the job on that host.

2. If a specific host is named in the Run Job request and the named node is not in the nodes file as a timeshared host, or if there are multiple nodes named in the Run Job request, then the Server attempts to allocate the named cluster node or nodes to the job. All of the named nodes must appear in the server's nodes file. If the allocation succeeds, the job is run directly on the first of the nodes allocated.

3. If no location was specified on the Run Job request, but the job requests nodes, then cluster nodes which match the request are allocated if possible. If the allocation succeeds, the job is run on the node allocated to match the first specification in the node request. Note, the Scheduler may modify the job's original node request; see the job attribute neednodes.

4. Otherwise, a set of nodes to allocate is specified by a server attribute, default_node. By default, this is one shared node, 1#shared. It may be set to any valid node request or to any single timeshared node listed in the nodes file.

What the above all means can be boiled down into the following set of guidelines:

- If the batch system consists of a single timeshared host on which the server and Mom are running, no problem - all the jobs run there. The scheduler only needs to say which job it wants run.

- If you are running a timeshared complex with one or more back-end hosts, where Mom is on a different host than the server, then load balancing jobs across the various hosts is a matter of the scheduler determining on which host to place the selected job. This is done by querying the resource monitor side of Mom using the resource monitor API - the addreq() and getreq() calls. The Scheduler tells the server where to run each job. You should change the server attribute default_node to a value of one of the timeshared hosts (the host, if you only have one).

- If your cluster is made up of cluster nodes and you are running distributed (multiple node) jobs, as well as serial jobs, the Scheduler typically uses the Query Resource, or Avail, request to the server for each queued job under consideration. The Scheduler then selects one of the jobs that the Server replied could run, and directs that the job should be run. The Server will then allocate the nodes to the job. By leaving the server attribute default_node set to one shared node, 1#shared, jobs which do not request nodes will be placed together on a few nodes running shared jobs.

- If you have a batch system supporting both cluster nodes and one timeshared node, the situation is like the above, only you may wish to change default_node to point to the timeshared host. Jobs that do not ask for nodes will end up running on the timeshared host.

- If you have a batch system supporting both cluster nodes and multiple timeshared hosts, you have a complex system which requires a smart scheduler. The Scheduler must recognize which jobs request nodes and use the Avail request to the Server. It must also recognize which jobs are to be load balanced among the timeshared hosts, and provide the host name to the Server when directing that the job be run. The supplied fifo scheduler has this capability.

2.1.3. Specifying Nodes

The nodes resource is set by the user to declare the node requirements for the job. It is a string of the form

    node_spec[+node_spec...]

where node_spec is

    number | property[:property...] | number:property[:property...]

The node_spec may have an optional global modifier appended. This is of the form #property. For example:

    6+3:fat+2:fat:hippi+disk
or
    6+3:fat+2:fat:hippi+disk#prime

where fat, hippi, and disk are examples of property names assigned by the administrator in the "nodes" file. The above example translates as the user requesting 6 plain nodes, plus 3 "fat" nodes, plus 2 nodes that are both "fat" and "hippi", plus one "disk" node, a total of 12 nodes. Where #prime is appended as a global modifier, the global property "prime" is appended by the Server to each element of the spec. It would be equivalent to:

    6:prime+3:fat:prime+2:fat:hippi:prime+disk:prime

A major use of the global modifier is to provide the shared keyword. This specifies that all the nodes are to be shared nodes. The keyword shared is only recognized as such when used as a global modifier.
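From the user's point of view, such a specification is typically given with the -l option to qsub. A hypothetical request for the twelve-node mix described above, with all nodes to be shared, might be:

    # Request 6 plain, 3 "fat", 2 "fat"+"hippi", and 1 "disk" node,
    # all as shared nodes
    qsub -l nodes=6+3:fat+2:fat:hippi+disk#shared myjob.sh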
2.1.3. Specifying Nodes

The nodes resource is set by the user to declare the node requirements for the job. It is a string of the form

    node_spec[+node_spec...]

where node_spec is

    number | property[:property...] | number:property[:property...]

The node_spec may have an optional global modifier appended. This is of the form #property. For example:

    6+3:fat+2:fat:hippi+disk

or

    6+3:fat+2:fat:hippi+disk#prime

where fat, hippi, and disk are examples of property names assigned by the administrator in the nodes file. The above example translates as the user requesting 6 plain nodes, plus 3 "fat" nodes, plus 2 nodes that are both "fat" and "hippi", plus one "disk" node; a total of 12 nodes. Where #prime is appended as a global modifier, the global property "prime" is appended by the Server to each element of the spec. It would be equivalent to

    6:prime+3:fat:prime+2:fat:hippi:prime+disk:prime

A major use of the global modifier is to provide the shared keyword. This specifies that all the nodes are to be shared nodes. The keyword shared is only recognized as such when used as a global modifier.

Two additional read-only resources exist for jobs. They are nodect and neednodes. Nodect (node count) is set by the server to the integer number of nodes desired by the user as declared in the "nodes" resource specification. That declaration is parsed and the resulting total number of nodes is set in nodect. This is useful when an administrator wishes to place an integer limit, resources_min or resources_max, on the number of nodes used by a job entering a queue. Based on the above example, it would be set to 12 (6+3+2+1).

Neednodes is initially set by the Server to the same value as nodes. Neednodes may be modified by the job scheduler for special policies. The contents of neednodes determines which nodes are actually assigned to the job. Neednodes is visible to the administrator but not to the user.

If you wish to set up a queue default value for "nodes" (a value to which the resource is set if the user does not supply one), corresponding default values must be set for "nodect" and "neednodes". For example:

    Qmgr: set queue foo resources_default.nodes=1
    Qmgr: set queue foo resources_default.nodect=1
    Qmgr: set queue foo resources_default.neednodes=1

Minimum and maximum limits are set for "nodect" only. For example:

    Qmgr: set queue foo resources_min.nodect=1
    Qmgr: set queue foo resources_max.nodect=15

Minimum and maximum values must not be set for nodes or neednodes, as those are string values.

2.2. Network Addresses

PBS makes use of fully qualified host names for identifying the jobs and their location. A PBS batch system is known by the host name on which the server, pbs_server, is running. If the server is started with a non-standard port number (see the -p option in the pbs_server(8) man page), the server "name" becomes host_name.domain:port, where port is the numeric port number being used. See the discussion of Alternate Test Systems later in this section.

The three daemons and the commands will attempt to use /etc/services to identify the standard port numbers to use for communication. The port numbers need not be below the magic 1024 number. The service names that should be added to /etc/services are:

    pbs         15001/tcp   # pbs server (pbs_server)
    pbs_mom     15002/tcp   # mom to/from server
    pbs_resmom  15003/tcp   # mom resource management requests
    pbs_resmom  15003/udp   # mom resource management requests
    pbs_sched   15004/tcp   # scheduler

The numbers listed are the default numbers used by this version of PBS. If you change them, be careful to use the same numbers on all systems.
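A quick way to verify that the entries are in place on a given host is a simple grep against /etc/services:

    grep '^pbs' /etc/services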
If the services cannot be found in /etc/services, the PBS components will default to the above listed numbers.

As of release 1.1.9, there is one additional port used: pbs_dis is used for requests to the server in the new Data Is Strings data encoding. As of release 1.1.10, port 15003/tcp is used by pbs_mom for task management (and alternately for scheduler communication) as well as 15003/udp.

2.3. Starting Daemons

All three of the daemon processes, Server, Scheduler and MOM, must run with the real and effective uid of root. Typically, the daemons are started from the system's boot files, e.g. /etc/rc.local. However, it is recommended that the Server be brought up "by hand" the first time and configured before being run at boot time.

2.3.1. Starting the Server

The initial run of the Server, or any first time run after recreating the home directory, should be with the -t create option. This option directs the Server to create a new server database. If one is already present, it is discarded after receiving a positive validation response. At this point it is necessary to configure the Server; see section 2.4.1, Server Configuration. The create option leaves the server in an "idle" state. In this state the server will not contact the scheduler and jobs are not run, except manually via the qrun(1B) command. Once the server is up, it can be placed in the "active" state by setting the server attribute scheduling to a value of true:

    qmgr -c "set server scheduling=true"

The value of scheduling is retained across server terminations/starts.

After the server is configured it may be placed into service. Normally it is started in the system boot file. The -t start_type option may be specified, where start_type is one of the options specified in the ERS (and the pbs_server man page). The default is warm. Another useful option is the -a true|false option. This turns on or off the invocation of the PBS job scheduler.

2.3.2. Starting the Scheduler

The Scheduler should also be started at boot time. Typically the only required option for the BaSL based scheduler is the -r script_file option, specifying the script file. For the Tcl based scheduler, the option -b script_file is used to specify the Tcl script to be called.

2.3.3. Starting MOM

MOM should also be started at boot time. Typically there are no options unless MOM is being restarted on a running system, or pbs_server or pbs_sched are running on a different host.

If MOM is taken down and the host system continues to run, MOM should be restarted with the -r option. This directs MOM to kill off any jobs which were left running. See the ERS for a full explanation.

By default, MOM will only accept connections from a privileged port on her system, either the port associated with "localhost" or the name returned by gethostname(2). If the server or scheduler are running on a different host, the host name(s) must be specified in MOM's configuration file. See the -c option on the pbs_mom(8B) man page, and section 2.5, Configuring the Execution Server, pbs_mom, in this guide for more information on the configuration file.

Should you wish to make use of the prologue and/or epilogue script features, please see section 5.2, Job Prologue/Epilogue Scripts.
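As a sketch, boot-time startup lines added to /etc/rc.local might look like the following; the installation directory is illustrative, and the scheduler options depend on which scheduler flavor is in use, as noted above:

    /usr/local/pbs/sbin/pbs_mom
    /usr/local/pbs/sbin/pbs_sched
    /usr/local/pbs/sbin/pbs_server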
2.4. Configuring the Job Server, pbs_server

Server management consists of configuring the server attributes and establishing queues and their attributes. Unlike Mom and the Job Scheduler, the Job Server (pbs_server) is configured while it is running, except for the nodes file. Configuring server and queue attributes and creating queues is done with the qmgr(1B) command. This must be done either as root or as a user who has been granted PBS Manager privilege, as shown in the last step in the Build Overview section of this guide. Exactly what needs to be set depends on your scheduling policy and how you chose to implement it. The system needs at least one queue established and certain server attributes initialized. The server attributes are discussed in section 2.4 of the ERS.

The following are the "minimum required" server attributes and the recommended attributes. For the sake of examples, we will assume that your site is a sub-domain of a large network. All hosts at your site have names of the form

    host.foo.bar.com

and the batch system consists of a single large machine named big.foo.bar.com.

2.4.1. Server Configuration

The following attributes are required or recommended. They are set via the set server (s s) subcommand of the qmgr(1B) command. Not all of the server attributes are discussed here, only what is needed to get a reasonable system up and running. See the pbs_server_attributes man page for a complete list.

2.4.1.1. Required Server Attributes

default_queue
    Declares the default queue to which jobs are submitted if a queue is not specified on the qsub(1B) command. The queue must be created first. Example:
        c q dque queue_type=execution
        s s default_queue=dque

2.4.1.2. Recommended Server Attributes

acl_hosts
    A list of hosts from which jobs may be submitted. For example, if you wish to allow all the systems on your sub-domain, plus one other host, boss, at headquarters, to submit jobs, then set:
        s s acl_hosts=*.foo.bar.com,boss.hq.bar.com

acl_host_enable
    Enables the server host access control list, see above.
        s s acl_host_enable=true

default_node
    Defines the node on which jobs are run if not otherwise directed. Please see section 2.1.2, Where Jobs May Be Run, for a discussion of how to set this attribute depending on your system. The default value (also the value assumed if the attribute is unset) is 1#shared.
        s s default_node=big.foo.bar.com

managers
    Defines which users, at a specified host, are granted batch system administrator privilege. For example, to grant privilege to "me" at all systems on the sub-domain, and to "sam" only from this system, big:
        s s managers=me@*.foo.bar.com,sam@big.foo.bar.com

operators
    Defines which users, at a specified host, are granted batch system operator privilege. Specified as are the managers.

resources_cost
    If you are planning to use the "synchronous job starts" feature across multiple execution hosts, you may wish to establish arbitrary costs for various resources on each system. See the ERS section on Synchronize Job Starts (section 3.2.2).

resources_default
    This attribute establishes the resource limits assigned to jobs that were submitted without a limit and for which there are no queue limits. It is important that a default value be assigned for any resource requirement used in the scheduling policy.
        s s resources_default.cput=5:00
        s s resources_default.mem=4mb

resources_max
    This attribute sets the maximum amount of resources which can be used by a job entering any queue on the server. This limit is checked only if there is not a queue specific resources_max attribute defined for the specific resource.
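For example, hypothetical server-wide ceilings on cpu time and memory could be set as:

    s s resources_max.cput=10:00:00
    s s resources_max.mem=512mb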
system_cost
    See resources_cost.

2.4.2. Queue Configuration

Each server must have at least one queue defined. It may be either a routing queue or an execution queue. Typically it will be an execution queue; jobs cannot be executed while residing in a routing queue.

Queue attributes fall into three groups: those which are applicable to both types of queues, those applicable only to execution queues, and those applicable only to routing queues. If an "execution queue only" attribute is set for a routing queue, or vice versa, it is simply ignored by the system. However, as this situation might indicate that the administrator made a mistake, the Server will issue a warning message about the conflict. The same message will be issued if the queue type is changed and there are attributes that do not apply to the new type.

Not all of the queue attributes are discussed here, only what is needed to get a reasonable system up and running. See the pbs_queue_attributes man page for a complete list.

2.4.2.1. Required Attributes for All Queues

queue_type
    Must be set to either execution or routing (e or r will do). The queue type must be set before the queue can be enabled. If the type conflicts with certain attributes which are valid only for the other queue type, the set request will be rejected by the server.
        s q dque queue_type=execution

enabled
    If set to true, jobs may be enqueued into the queue. If false, jobs will not be accepted.
        s q dque enabled=true

started
    If set to true, jobs in the queue will be processed: routed by the server if the queue is a routing queue, or scheduled by the job scheduler if an execution queue.
        s q dque started=true

2.4.2.2. Required Attributes for Routing Queues

route_destinations
    Lists the local queues, or queues at other servers, to which jobs in this routing queue may be sent. For example:
        s q routem route_destinations=dque,overthere@another.foo.bar.com

2.4.2.3. Recommended Attributes for All Queues

resources_max
    If you chose to have more than one execution queue based on the size or type of job, you may wish to establish maximum and minimum values for various resource limits. This will restrict which jobs may enter the queue. A routing queue can be established to "feed" the execution queues, and jobs will be distributed by those limits automatically. A resources_max value defined for a specific resource at the queue level will override the same resource's resources_max defined at the server level. Therefore, it is possible to define a higher as well as a lower value for a queue limit than the server's corresponding limit. If there is no maximum value declared for a resource type, there is no restriction on that resource. For example:
        s q dque resources_max.cput=2:00:00
    places a restriction that no job requesting more than 2 hours of cpu time will be allowed in the queue. There is no restriction on the memory, mem, limit a job may request.

resources_min
    Defines the minimum value of a resource limit specified by a job before the job will be accepted into the queue. If not set, there is no minimum restriction.
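Putting the required server and queue attributes together, a minimal working configuration could be entered in one qmgr session along the following lines; the queue name dque is the one used in the examples above:

    Qmgr: create queue dque queue_type=execution
    Qmgr: set queue dque enabled=true
    Qmgr: set queue dque started=true
    Qmgr: set server default_queue=dque
    Qmgr: set server scheduling=true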
2.4.2.4. Recommended Attributes for Execution Queues

resources_default
    Defines a set of default values for jobs entering the queue that did not specify certain resource limits. There is a corresponding server attribute which sets a default for all jobs.

The limit for a specific resource usage is established by checking various job, queue, and server attributes. The following list shows the attributes and their order of precedence:

1. The job attribute Resource_List, i.e. what was requested by the user.
2. The queue attribute resources_default.
3. The server attribute resources_default.
4. The queue attribute resources_max.
5. The server attribute resources_max.

* Under Unicos, a user supplied value must be within the system's User Data Base, UDB, limit for the user. If the user does not supply a value, the lower of the defaulted value from the above list and the UDB limit is used.

Please note, an unset resource limit for a job is treated as an infinite limit.

2.4.2.5. Selective Routing of Jobs into Queues

Often it is desirable to route jobs to various queues on a server, or even between servers, based on the resource requirements of the jobs. The queue resources_min and resources_max attributes discussed above make this selective routing possible. As an example, let us assume you wish to establish two execution queues, one for short jobs of less than 1 minute cpu time, and the other for long running jobs of 1 minute or longer. Call them short and long. Apply the resources_min and resources_max attributes as follows:

    set queue short resources_max.cput=59
    set queue long resources_min.cput=60

When a job is being enqueued, its requested resource list is tested against the queue limits:

    resources_min <= job_requirement <= resources_max

If the resource test fails, the job is not accepted into the queue. Hence, a job asking for 20 seconds of cpu time would be accepted into queue short but not into queue long. Note, if the min and max limits are equal, only that exact value will pass the test.

You may wish to set up a routing queue to feed jobs into the queues with resource limits. That will place the job into the appropriate queue. For example:

    create queue feed queue_type=routing
    set queue feed route_destinations="short,long"
    set server default_queue=feed

A job will end up in either short or long depending on its cpu time request. You should always list the destination queues in order of the most restrictive first, as the first queue which meets the job's requirements will be its destination (assuming that queue is enabled). Extending the above example to three queues:

    set queue short resources_max.cput=59
    set queue long resources_min.cput=1:00,resources_max.cput=1:00:00
    create queue verylong queue_type=execution
    set queue feed route_destinations="short,long,verylong"

A job asking for 20 minutes (20:00) of cpu time will be placed into queue long. A job asking for 1 hour and 10 minutes (1:10:00) will end up in queue verylong by default.

One word of caution is required: if a job does not specify a resource list (requirement) for the tested resource, no test is made. In the above case, a job without a cpu time limit will be allowed into queue short. For this reason, together with the fact that an unset limit is considered to be an infinite limit, you may wish to add a default value to the queues. The command

    set queue short resources_default.cput=40

will see that a job without a cpu time specification is limited to 40 seconds if it enters the short queue.
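Selective routing can be verified quickly by submitting test jobs with different cpu time requests and observing which queues they land in; a sketch, where job is a trivial test script:

    qsub -l cput=20 job       # should be routed to queue short
    qsub -l cput=20:00 job    # should be routed to queue long
    qstat -a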
Be aware of two facts:

1. If a default value is assigned, it is done so after the tests against min and max.

2. Default values assigned to a job from a queue resources_default are not carried with the job if the job moves to another queue. Those resource limits become unset, as they were when the job was submitted. If the new queue specifies default values, those values are assigned to the job while it is in the new queue.

Minimum and maximum queue limits work with numerical valued resources, including time and size values. Generally, they do not work with string valued resources because of character comparison order. However, setting the min and max to the same value to force an exact match will work even for string valued resources. For example,

    set queue big resources_max.arch=unicos8
    set queue big resources_min.arch=unicos8

can be used to limit jobs entering queue big to those specifying arch=unicos8. Again, remember that if arch is not specified by the job, no test is performed and the job will be accepted.

2.4.3. Recording Server Configuration

Should you wish to record the configuration of a server for re-use, you may use the print subcommand of qmgr(8B). For example,

    qmgr -c "print server" > /tmp/server.con

will record in the file server.con the qmgr subcommands required to recreate the current configuration, including the queues. The commands can be fed back into qmgr via standard input:

    qmgr < /tmp/server.con
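The same command also makes a convenient periodic backup of the configuration; for example, a nightly cron entry such as the following (paths illustrative) keeps a current copy on hand:

    0 2 * * * /usr/local/pbs/bin/qmgr -c "print server" > /var/adm/pbs/server.con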
2.5. Configuring the Execution Server, pbs_mom

Mom is configured via a configuration file which she reads at initialization time and when sent the SIGHUP signal. This file is described in the pbs_mom(8) man page as well as in the following sections. If the -c option is not specified when Mom is run, she will open PBS_HOME/mom_priv/config if it exists; if it does not, Mom will continue anyway. The file may be placed elsewhere or given a different name, in which case pbs_mom must be started with the -c option.

The file provides several types of run time information to pbs_mom: static resource names and values, external resources provided by a program to be run on request via a shell escape, and values to pass to internal set up functions at initialization (and re-initialization).

Each item type is on a single line, with the component parts separated by white space. If the line starts with a hash mark (pound sign, #), the line is considered to be a comment and is skipped.

2.5.1. Access Control and Initialization Values

An initialization value directive has a name which starts with a dollar sign ($) and must be known to MOM via an internal table. Currently the entries in this table are:

clienthost
    A $clienthost entry causes a host name to be added to the list of hosts which will be allowed to connect to MOM as long as they are using a privileged port. For example, here are two configuration file lines which will allow the hosts "fred" and "wilma" to connect:
        $clienthost fred
        $clienthost wilma
    Two host names are always allowed to connect to pbs_mom: "localhost" and the name returned to pbs_mom by the system call gethostname(). These names need not be specified in the configuration file.
    The hosts listed as "clienthosts" comprise a "sisterhood" of hosts. Any one of the sisterhood will accept connections from a scheduler [Resource Monitor (RM) requests] or server [jobs to execute] from within the sisterhood. They will also accept Internal MOM (IM) messages from within the sisterhood. For a sisterhood to be able to communicate IM messages to each other, they must all share the same RM port. For a scheduler to be able to query resource information from a Mom, the scheduler's host must be listed as a clienthost.
    If the Server is provided with a nodes file, the IP addresses of the hosts (nodes) in the file will be forwarded by the Server to the Mom on each host listed in the nodes file. These hosts need not be in the various Moms' configuration files, as they will be added internally when the list is received from the Server. The Server's host must either be the same host as the Mom or be listed as a clienthost entry in each Mom's config file.

restricted
    A $restricted entry causes a host name to be added to the list of hosts which will be allowed to connect to MOM without needing to use a privileged port. These names allow for wildcard matching. For example, here is a configuration file line which will allow queries from any host in the domain "ibm.com":
        $restricted *.ibm.com
    Connections from the specified hosts are restricted in that only internal queries may be made. No resources from a config file will be reported and no control requests can be issued. This is to prevent any shell commands from being run by a non-root process. This type of entry is typically used to specify hosts on which a monitoring tool, such as xpbsmon, can be run. Xpbsmon will query Mom for general resource information.

logevent
    A $logevent entry sets the mask that determines which event types are logged by pbs_mom. For example:
        $logevent 0x1ff
        $logevent 255
    The first example would set the log event mask to 0x1ff (511), which enables logging of all events, including debug events. The second example would set the mask to 0x0ff (255), which enables all events except debug events. The values of events are listed in section 5.3, Use and Maintenance of Logs, of this guide.

cputmult
    A $cputmult entry sets a factor used to adjust cpu time used by a job. This is provided to allow adjustment of time charged and limits enforced where the job might run on systems with different cpu performance. If Mom's system is faster than the reference system, set cputmult to a decimal value greater than 1.0. If Mom's system is slower, set cputmult to a value between 1.0 and 0.0. The value is given by
        value = speed_of_this_system / speed_of_reference_system
    For example:
        $cputmult 1.5
        $cputmult 0.75

wallmult
    A $wallmult entry sets a factor used to adjust wall time usage by a job to a common reference system. The factor is used for walltime calculations and limits in the same way as cputmult is used for cpu time.

2.5.2. Static Resources

For static resource names and values, the configuration file contains a list of resource name/value pairs, one pair per line, separated by white space. An example of static resource names and values could be the number of tape drives of different types, specified by:

    tape3480 4
    tape3420 2
    tapedat  1
    tape8mm  1

The names can be anything and are not restricted to actual hardware. For example, the entry

    pong 1

could be used to indicate to the scheduler that a certain piece of software is available on this system.
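Pulling the directive types above together, a complete, if hypothetical, mom_priv/config file might look like:

    # sample pbs_mom config file; host names and values are illustrative
    $clienthost schedhost.foo.bar.com
    $restricted *.foo.bar.com
    $logevent 0x0ff
    $cputmult 1.0
    tape3480 4
    pong 1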
2.5.3. Shell Commands

If the first character of the value portion of a name/value pair is the exclamation mark (!), the entire rest of the line is saved to be executed through the services of the system(3) standard library routine. The first line of output from the shell command is returned as the response to the resource query.

The shell escape provides a means for the resource monitor to yield arbitrary information to the scheduler. Parameter substitution is done such that the value of any qualifier sent with the resource query, as explained below, replaces a token consisting of a percent sign (%) followed by the name of the qualifier. For example, here is a configuration file line which gives a resource name of "escape":

    escape !echo %xxx %yyy

If a query for "escape" is sent with no qualifiers, the command executed would be "echo %xxx %yyy". If one qualifier is sent, "escape[xxx=hi there]", the command executed would be "echo hi there %yyy". If two qualifiers are sent, "escape[xxx=hi][yyy=there]", the command executed would be "echo hi there". If a qualifier is sent with no matching token in the command line, "escape[zzz=snafu]", an error is reported.

A possible use of the shell command configuration entry is to provide a means by which the use of floating software licenses may be tracked. If a program can be written to query the license server, the number of available licenses could be returned to tell the scheduler if it is possible to run a job that needs a certain licensed package. [You get the fun and games of writing this program.]
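As a sketch, assuming a hypothetical site-written program query_licenses that prints the number of free seats for a named package, such a configuration entry might read:

    fluent_lic !/usr/local/bin/query_licenses fluent

A scheduler could then query the resource fluent_lic and run a job needing that package only when the reply is non-zero.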
2.6. Configuring the Scheduler, pbs_sched

The configuration required for a scheduler depends on the scheduler itself. If you are starting with the delivered fifo scheduler, please jump ahead to section 3.5.1, FIFO Scheduler, in this guide.

2.7. Alternate Test Systems

Alternate or test copies of the various daemons may be run through the use of the command line options which set their home directory and service port. For example, the following commands would start the three daemons with a home directory of /tmp/altpbs and four ports around 13000:

    pbs_server -t create -d /tmp/altpbs -p 13000 -M 13002 -S 13004
    pbs_mom -d /tmp/altpbs -M 13002 -R 13003
    pbs_sched -d /tmp/altpbs -S 13004 -r script_file

The home directories must be pre-built. The easiest method is to alter the PBS_SERVER_HOME variable by use of the --set-server-home option to configure, rerun configure, and remake PBS.

Jobs may be directed to the test system by using the server:port syntax on the -q option. Status is also obtained using the :port syntax. For example, to submit a job to the default queue on the above test server, request the status of the test server, and request the status of jobs at the test server:

    qsub -q @host:13000 job
    qstat -Bf host:13000
    qstat @host:13000

If you or your users are using job dependencies on or between test systems, there are minor problems of which you (and the users) need to be aware. The syntax of both the dependency string, depend_type:job_id:job_id, and the job id, seq_number.host:port, use colons in an indistinguishable manner. The way to work around this is covered in the Advice for Users section at the end of this guide.

2.8. Installing an Updated Batch System

Once you have a running batch system, there will come a time when you wish to update it or install a new version. It is assumed that you will wish to build and test the new version using the alternative directories and port numbers described above. You may change the location of PBS_SERVER_HOME for the test version; see the configure option --set-server-home.

Once you are satisfied with the new system, it is suggested that you rebuild the three daemons with PBS_SERVER_HOME set to the directory which will be used in normal operation. Otherwise you will always have to use the -d option when starting the daemons.

When the new batch system is ready to be placed into service, you will wish to move jobs from the old system to the new. The following procedure is suggested. All servers must be run by root. The qmgr and qmove commands should be run by a batch administrator (likely, root is good).

1. With the old batch system running, disable the queues and stop scheduling by setting "scheduling=false".

2. Back up the pool of jobs in PBS_SERVER_HOME(old)/server_priv/jobs. Tar may be used for this; see the sketch following this procedure. Assuming the change is a minor update (change in the third digit of the release version number), or a local change where the job structure did not change from the old version to the new, it is likely that you could start the new system in the old HOME and all jobs would be recovered. However, if the job structure has changed, you will need to move the jobs from the old system to the new. The release notes will contain a warning if the job structure has changed or the move is required for other reasons. To move the jobs, continue with the following steps:

3. It is likely that PBS_SERVER_HOME will have changed and have been made during testing. If not, build a (temporary) server directory tree by changing PBS_SERVER_HOME using --set-server-home and typing buildutils/pbs_mkdirs server in the top of the object tree.

4. Start the new PBS server in its new home. If the new home is different from the directory specified when it was compiled, use the -d option. Use the -t option if the server has not been configured for the new directory. Also start with an alternative port using the -p option. Turn off attempts to schedule with the -a option:

       pbs_server -t create -d new_home -p 13000 -a false

   Remember, you will need to use the :port syntax when commanding the new server.

5. Duplicate on the new server the current queues and server attributes (assuming you wish to do so). Enable each queue which will receive jobs at the new server.

       qmgr -c "print server" > /tmp/config
       qmgr host:13000 < /tmp/config
       qenable queue1@host:13000
       qenable queue2@host:13000

6. Now list the jobs at the original server and move a few jobs one at a time from the old to the new server:

       qstat
       qmove queue@host:13000 job
       qstat @host:13000

   If all is going well, move the remaining jobs a queue at a time:

       qmove queue1@host:13000 `qselect -qqueue1`
       qstat queue1@host:13000
       qmove queue2@host:13000 `qselect -qqueue2`
       qstat queue2@host:13000

7. At this point, all of the jobs should be under control of the new server and located in the new server home. If the new server home is a temporary directory, shut down the new server and move everything to the real home using

       cp -R new_home real_home

   or, if the real (new) home is already set up,

       cd new_home/server_priv/jobs
       cp * real_home/server_priv/jobs

   to copy just the jobs. At this point, you are ready to bring up and enable the new batch system.
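The backup in step 2 might be taken with tar along these lines; the old server home path is illustrative:

    cd /usr/spool/PBS.old/server_priv
    tar cf /tmp/pbs_jobs_backup.tar jobs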
You should be aware of one quirk when using qmove. If you wish to move a job from a server running on a test port to the server running on the normal port (15000), you may attempt, unsuccessfully, to use the following command:

    qmove queue@host 123.job.host:13000

However, that will only move the job to the end of the queue it is already in. The server receiving the move request (13000) will compare the destination server name, host, with its own name only, not including the port. Hence it will match, and the job will not be sent where you intended. To get the job to move to the server running on the normal port, you have to specify that port in the destination:

    qmove queue@host:15000 123.job.host:13000

3. Scheduling Policies

PBS provides a separate process to schedule which jobs should be placed into execution. This is a flexible mechanism by which you may implement a very wide variety of policies. The scheduler uses the standard PBS API to communicate with the Server, and an additional API to communicate with the PBS resource monitor, pbs_mom. Should the provided schedulers be insufficient to meet your site's needs, it is possible to implement a replacement scheduler using the APIs to accomplish your heart's desires.

The first generation batch system, NQS, and many of the other batch systems use various queue based controls to limit or schedule jobs. Queues would be turned on and off to control job ordering over time, or given a limit on the number of running jobs in the queue.

While PBS supports multiple queues, and the queues have some of the "job scheduling" attributes used by other batch systems, the PBS Server does not by itself run jobs or enforce any of the restrictions implied by these queue attributes. In fact, the Server will happily run a held job that resides in a stopped queue with a zero limit on running jobs, if it is directed to do so. The direction may come from the operator, the administrator, or the Scheduler. In fact, the Scheduler is nothing more than a client with administration privilege.

If you choose to implement your site scheduling policy using a multiple queue - queue control based scheme, you may do so. The server and queue attributes used to control job scheduling may be adjusted by a client with privilege, such as qmgr(8B), or by one of your own creation. However, the controls actually reside in the Scheduler, not in the Server. The Scheduler must check the status of the Server and queues, as well as the jobs, determining the settings of the server and queue controls. It then must use the settings of those controls in its decision making.

Another approach is the "whole pool" approach, wherein all jobs are in a single pool (single queue). The Scheduler evaluates each job on its merits and decides which, if any, to run. The policy can easily include factors such as time of day, system load, and size of job. Ordering of jobs in the queue need not be considered. The PBS team believes that this approach is superior for two reasons:

1. Users are not tempted to lie about their requirements in order to "game" the queue policy.

2. The scheduling can be performed against the complete set of current jobs, resulting in better fits against the available resources.

3.1. Scheduler - Server Interaction

In developing a scheduling policy, it may be important to understand when and how the server and the scheduler interact. The server always initiates the scheduling cycle.
The server opens a connection to the scheduler and sends a command indicating the reason for the scheduling cycle. The reasons or events that trigger a cycle are:

- A job newly becomes eligible to execute. The job may be a new job in an execution queue, or a job in an execution queue that just changed state from held or waiting to queued. [SCH_SCHEDULE_NEW]

- An executing job terminates. [SCH_SCHEDULE_TERM]

- The time interval since the prior cycle, specified by the server attribute schedule_iteration, is reached. [SCH_SCHEDULE_TIME]

- The server attribute scheduling is set or reset to true. If set true, even if its value was already true, the scheduler will be cycled. This provides the administrator/operator a means of forcing a scheduling cycle. [SCH_SCHEDULE_CMD]

- If the scheduler was cycled and it requested one and only one job to be run, then the scheduler will be recycled by the server. This event is a bit abstruse. It exists to "simplify" a scheduler. The scheduler need only worry about choosing the one best job per cycle; if other jobs can also be run, it will get another chance to pick the next job. Should a scheduler run none, or more than one, job in a cycle, it is clear that it need not be recalled until conditions change and one of the above events triggers the next cycle. [SCH_SCHEDULE_RECYC]

- If the server recently recovered, the first scheduling cycle, resulting from any of the above, will be indicated uniquely. [SCH_SCHEDULE_FIRST]

Once the server has contacted the scheduler and sent the reason for the contact, the scheduler becomes a privileged client of the server. As such, it may command the server to perform any action allowed to a manager. When the scheduler has completed all activities it wishes to perform in this cycle, it will close the connection to the server. While a connection is open, the server will not attempt to open a new connection.

One point should be clarified about job ordering: queues "are" and "are not" FIFOs. What is meant is that while jobs are ordered first in - first out in the server and in each queue, that fact does NOT imply that running them in that order is mandated, required, or even desirable. That is a decision left completely up to site policy and implementation. The server will maintain the order across restarts, solely as an aid to sites that wish to use a FIFO ordering in some fashion.

3.2. BaSL Scheduling

The provided BaSL Scheduler uses a C-like procedural language to write the scheduling policy. The language provides a number of constructs and predefined functions that facilitate dealing with scheduling issues. Information about a PBS server, the queues that it owns, the jobs residing in each queue, and the computational nodes where jobs can be run is accessed via the BaSL data types Server, Que, Job, CNode, Set Server, Set Que, Set Job, and Set CNode.

The idea is that a site must first write a function (containing the scheduling algorithm) called sched_main() (and all functions supporting it) using BaSL constructs, and then translate the functions into C using the BaSL compiler basl2c, which also attaches a main program to the resulting code.
This main program performs general initialization and housekeeping chores such as setting up a local socket to communicate with the server running on the same machine, changing directory to the priv directory, opening log files, opening the configuration file (if any), setting up locks, forking the child to become a daemon, initializing a scheduling cycle (i.e. getting node attributes that are static in nature), setting up the signal handlers, executing global initialization assignment statements specified by the scheduler writer, and finally sitting in a loop waiting for a scheduling command from the server. The name of the resulting code is pbs_sched.c.

When the server sends the scheduler an appropriate scheduling command { SCH_SCHEDULE_NEW, SCH_SCHEDULE_TERM, SCH_SCHEDULE_TIME, SCH_SCHEDULE_RECYC, SCH_SCHEDULE_CMD, SCH_SCHEDULE_FIRST }, the scheduler wakes up, obtains information about server(s), jobs, queues, and execution host(s), and then calls sched_main(). The list of servers, execution hosts, and host queries to send to the hosts' MOMs are specified in the scheduler configuration file.

Global variables defined in the BaSL program retain their values between scheduling cycles, while locally-defined variables do not.

3.3. Tcl Based Scheduling

The provided Tcl based Scheduler uses the basic Tcl interpreter with some extra commands for communicating with the PBS Server and Resource Monitor. The scheduling policy is defined by a script written in Tcl. A number of sample scripts are provided in the source directory src/scheduler.tcl/sample_scripts.

The Tcl based Scheduler works, very generally, in the following way:

1. On start up, the Scheduler reads the initialization script (if it exists) and executes it. Then, the body script is read into memory. This is the file that will be executed each time a "schedule" command is received from the server. It then waits for a "schedule" command from the server.

2. When a schedule command is received, the body script is executed. No special processing is done for the script except to provide a connection to the Server. A typical script will need to retrieve information about candidate jobs to run from the server using pbsselstat or pbsstatjob. Other information from the Resource Monitor(s) will need to be retrieved by opening connections with openrm, submitting queries with addreq, and getting the results with getreq. The Resource Monitor connections must be closed explicitly with closerm, or the Scheduler will eventually run out of file descriptors. When a decision is made to run a job, a call to pbsrunjob must be made.

3. When the script evaluation is complete, the Scheduler will close the TCP/IP connection to the Server.

3.3.1. Tcl Based Scheduling Advice

The Scheduler does not restart the Tcl interpreter for each cycle. This gives it the ability to carry information from one cycle to the next. It can also cause problems if variables are not initialized or "unset" at the beginning of the script, when they are not expected to contain any information later on.

System load average is frequently used by a script. This number is obtained from the system kernel by pbs_mom. Most systems smooth the load average number over a time period. If one scheduling cycle runs one or more jobs and the next scheduling cycle occurs quickly, the impact of the newly run jobs will likely not be reflected in the load average.
This can cause the load average to shoot way up, especially when first starting the batch system. Also, when jobs terminate, the delay in lowering the load average may delay the scheduling of additional jobs.

The Scheduler redirects the output from "stdout" and "stderr" to a file. This makes it easy to generate debug output to check what your script is doing. It is advisable to use this feature heavily until you are fairly sure that your script is working well.

3.4. C Based Scheduling

The C based scheduler is similar in structure and operation to the Tcl scheduler, except that C functions are used rather than Tcl scripts.

1. On start up, the scheduler calls schedinit(argc, argv) to initialize whatever is required to be initialized. This is called one time only.

2. When a schedule command is received, the function schedule(cmd, connector) is invoked. All scheduling activities occur within that function.

3. Upon return to the main loop, the connection to the server is closed.

Several working scheduler code examples are provided in the samples subdirectory.
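As a minimal sketch of the two entry points just described - only the function names and arguments come from the text above; the bodies, headers, and exact return conventions should be taken from the sample code in src/scheduler.cc:

    /* skeleton of a C based scheduler module */

    /* called once at start up to initialize scheduler state */
    int schedinit(int argc, char *argv[])
    {
        /* read site configuration, build usage tables, etc. */
        return 0;
    }

    /* called once per scheduling cycle; cmd is the reason code
     * (SCH_SCHEDULE_NEW and friends) and connector is the open
     * connection to the server for use with the PBS API calls
     */
    int schedule(int cmd, int connector)
    {
        /* query the server for queued jobs, pick which to run,
         * and direct the server to run them, then return
         */
        return 0;
    }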
3.5. C Based Sample Schedulers

The following sections discuss simple C based sample schedulers supplied with PBS. The sources for the samples are found in src/scheduler.cc/samples under the scheduler type name, for example src/scheduler.cc/samples/fifo.

3.5.1. FIFO Scheduler

This scheduler provides several simple scheduling policies. It provides the ability to sort the jobs in several different ways, in addition to FIFO order. There is also the ability to sort on user and group priority. Mainly, this scheduler is intended to be a jumping off point for a real scheduler to be written. A good amount of code has been written to make it easier to change and add to this scheduler. Check the IDS for a more detailed view of the code.

As distributed, the fifo scheduler is configured as follows:

- All jobs in a queue will be considered for execution before the next queue is examined. (See file PBS_HOME/sched_priv/sched_config)

- The queues are sorted by queue priority. (See file PBS_HOME/sched_priv/sched_config)

- The jobs within each queue are sorted by requested cpu time (cput). The shortest job is placed first. (See file PBS_HOME/sched_priv/sched_config)

- Jobs which have been queued for more than a day will be considered starving, and heroic measures will be taken to attempt to run them.

- Any queue whose name starts with "ded" is treated as a dedicated time queue. Jobs in that queue will only be considered for execution if the system is in dedicated time as specified in the dedicated_time configuration file. If the system is in dedicated time, jobs not in a "ded" queue will not be considered. (See files PBS_HOME/sched_priv/dedicated_time and PBS_HOME/sched_priv/sched_config)

- Prime time is from 4:00 AM to 5:30 PM. Any holiday is considered non-prime. Standard federal holidays for the year 1998 are included. (See file PBS_HOME/sched_priv/holidays)

- A sample dedicated_time and resource group file are also included.

- The system resources which are checked to make sure they are not exceeded are: mem (memory requested) and ncpus (number of CPUs requested).

3.5.1.1. Installing the FIFO Scheduler

1. As discussed in the build overview, run configure with the following options: --set-sched=c and --set-sched-code=fifo, which are the defaults.

2. You may wish to read through the src/scheduler.cc/samples/fifo/config.h file. Most default values will be fine.

3. Build and install PBS.

4. Change directory into PBS_HOME/sched_priv and edit the scheduling policy config file sched_config, or use the default values. This file controls the scheduling policy (which jobs are run when). The default name of sched_config may be changed in config.h. The format of the sched_config file is:

       name: value [prime | non_prime | all]

   Neither name nor value may contain any white space; value can be true | false | number | string; any line starting with a '#' is a comment; and a blank third word is equivalent to "all", which means both prime and non-prime. The values shipped as defaults are shown in braces {}.

   round_robin
       boolean: If true, run jobs one from each queue in a circular fashion; if false, run as many jobs as possible, up to queue/server limits, from one queue before processing the next queue. The following server and queue attributes, if set, will control whether a job "can be" run: resources_max, max_running, max_user_run, and max_group_run. See the man pages pbs_server_attributes and pbs_queue_attributes. {false all}

   by_queue
       boolean: If true, the jobs will be run from their queues; if false, the entire job pool in the server is looked at as one large queue. {true all}

   strict_fifo
       boolean: If true, jobs will be run in strict FIFO order. This means that if a job fails to run for any reason, no more jobs will run from that queue/server during that scheduling cycle. If strict_fifo is not set, large jobs can be starved, i.e., not allowed to run because a never ending series of small jobs uses the available resources. Also see the server attribute resources_max in section 2.4, and the fifo parameter help_starving_jobs below. {false all}

   fair_share
       boolean: This will turn on the fair share algorithm. It will also turn on usage collecting, and jobs will be selected using a function of their usage and priority (shares). {false all}

   load_balancing
       boolean: If this is set, the scheduler will load balance the jobs between a list of timeshared hosts (:ts) obtained from the Server (pbs_server). The Server reads the list from its nodes file; see section 2.1. {false all}

   help_starving_jobs
       boolean: This bit will have the scheduler turn on its rudimentary starving jobs support. Once jobs have waited for the amount of time given by starve_max, they are considered starving. If a job is considered starving, then no jobs will run until the starving job can be run. starve_max needs to be set as well.

   sort_by
       string: Have the jobs sorted. sort_by can be set to a single sort type or to multi_sort. If set to multi_sort, multiple key fields are used. Each key field will be a key for the multi sort. The order of the key fields decides which sort type is used first.
</gr-replace>
       The sorts are: no_sort, shortest_job_first, longest_job_first, smallest_memory_first, largest_memory_first, high_priority_first, low_priority_first, multi_sort, fair_share, large_walltime_first, short_walltime_first. {shortest_job_first}

       no_sort                do not sort the jobs
       shortest_job_first     ascending by the cput attribute
       longest_job_first      descending by the cput attribute
       smallest_memory_first  ascending by the mem attribute
       largest_memory_first   descending by the mem attribute
       high_priority_first    descending by the job priority attribute
       low_priority_first     ascending by the job priority attribute
       large_walltime_first   descending by the job walltime attribute
       short_walltime_first   ascending by the job walltime attribute
       multi_sort             sort on multiple keys

       fair_share
           If fair_share is given as the sort key, the jobs are sorted based on the values in the resource group file. This is only used if strict priority sorting is needed.

   key
       Sort type, as defined above, for multiple sorts. Each sorting key is listed on a separate line starting with the word key. For example:

           sort_by: multi_sort
           key: shortest_job_first
           key: smallest_memory_first
           key: high_priority_first

   log_filter
       Which event types not to log. The value should be the sum of the event classes which should be filtered (i.e., OR them together). The numbers are defined in src/include/log.h. NOTE: those numbers are in hex, and log_filter is in base 10. {256}

       Examples: to filter PBSEVENT_DEBUG2, PBSEVENT_DEBUG, and PBSEVENT_ADMIN:

           0x100 (256) + 0x080 (128) + 0x004 (4) = 388

       To filter PBSEVENT_JOB, PBSEVENT_DEBUG, and PBSEVENT_SCHED:

           0x008 (8) + 0x080 (128) + 0x040 (64) = 200

   dedicated_prefix
       The queues with this prefix will be considered dedicated queues. For example, if the dedicated prefix is "ded", then dedicated, ded1, ded5, etc. would be dedicated queues. {ded}

   starve_max
       The amount of time before a job is considered starving. This config variable is not used if help_starving_jobs is not set.

   The following do not matter if fair share is not turned on (which it is not by default):

   half_life
       The half life of the fair share usage. {24:00:00}

   unknown_shares
       The amount of shares for the "unknown" group. {10}

   sync_time
       The amount of time between writes of the fair share usage data to disk. {1:00:00}

   The policy set by the supplied values in sched_config is: jobs are run on the basis of queue priority, both in prime and non-prime time. Jobs within each queue are sorted on the basis of smallest (memory) first. Help for starving jobs will take effect after a job is 24 hours old.

5. If fair share or strict priority is going to be used, the resource group file will need to be created. Use the following format:

       # comment
       username   cresgrp   resgrp   shares

   username
       string: the username of the user or the name of the group.

   cresgrp
       numeric: an id for the group or user; it should be unique for each. For users, the UID works well.

   resgrp
       string: the name of the parent resource group this user/group is in. The root of the entire tree is called root and is added automatically to the tree by the scheduler.

   shares
       numeric: the amount of shares (priority) the user/group has in the resource group.

6. If strict priority is wanted, a fair share tree will be needed. A really simple one will suffice: every user's resgrp will be root, and the amount of shares will be their priority. Next, set unknown_shares to one.
   Everyone who is not in the tree will then share that one share among them, ensuring that everyone in the tree has priority over them. Lastly, the main sort must be set to fair_share. This will sort by the fair share tree which was just set up.

7. Create the holidays file to handle prime time and holidays. The holidays file should use the UNICOS 8 holiday format. The ordering does matter. Any line that begins with a "*" is considered a comment.

   - YEAR YYYY is the current year.
   - Day can be weekday | saturday | sunday.
   - prime and nonprime are the times when prime or non-prime time start. They can either be HHMM, with no colons (:), or the word "all" or "none".
   - day is the day of the year, between 1 and 365.
   - date is the calendar date, e.g. Jan 1.
   - holiday is the name of the holiday, e.g. New Year's Day.

   The holiday line is repeated for each company holiday.

8. To load balance between timesharing nodes, several things need to happen. First, a nodes file needs to be set up in PBS_HOME/server_priv (see section 2.1). All timesharing nodes need to be denoted with :ts. These are the nodes which the scheduler will load balance between. Secondly, on every node there has to be a Mom. In each Mom's config file, two static values need to be set up: one for the ideal load and the other for the maximum load. This is done by putting two lines in the config file in the format "name value"; the names are ideal_load and max_load. Lastly, turn the load_balancing bit on in the scheduling policy config file. Load balancing will have the job comment changed when the job is run, to show where the job was run. Example of a Mom config file (for a 64 processor machine):

       ideal_load 60
       max_load 64

9. Space sharing is done automatically if there is both a nodes file and a job request for nodes. Make sure to set up resources_default.nodes and resources_default.nodect.

10. The scheduler honors the following attributes/node resources:

        Queue  : started
        Queue  : queue_type
        Queue  : max_running
        Queue  : max_user_run
        Queue  : max_group_run
        Job    : job state
        Server : max_running
        Server : max_user_run
        Server : max_group_run
        Server : resources_available
        Server : resources_max
        Node   : loadave
        Node   : arch
        Node   : ncpus
        Node   : physmem

    NOTE: if resources_available.res is set, it will be used; if not, resources_max.res will be used. If neither is set, infinity is assumed.

3.5.1.2. Example FIFO Configuration Files

The following are just examples and may or may not be what is shipped.

Example of a scheduling policy file, sched_config:

    # Set the boolean values which define how the scheduling policy finds
    # the next job to consider to run.
    round_robin: False ALL
    by_queue: True prime
    by_queue: false non_prime
    strict_fifo: true ALL
    fair_share: True prime
    fair_share: false non_prime
    # help jobs which have been waiting too long
    help_starving_jobs: true prime
    help_starving_jobs: false non_prime
    # Set a multi_sort
    # This example will sort jobs first by ascending cpu time requested, then
    # by ascending memory requested, and finally by descending job priority
    sort_by: multi_sort
    key: shortest_job_first
    key: smallest_memory_first
    key: high_priority_first
    # Set the debug level to only show high level messages.
    # Currently this only shows jobs being run
    debug_level: high_mess
    # a job is considered starving if it has waited for this long
    max_starve: 24:00:00
    # If the scheduler comes by a user which is not currently in the resource
    # group tree, they get added to the "unknown" group. The "unknown" group
    # is in root's resource group. This says how many shares it gets.
    unknown_shares: 10
    # The usage information needs to be written to disk in case the scheduler
    # goes down for any reason. This is the amount of time between when the
    # usage information in memory is written to disk. The example syncs the
    # information every hour.
    sync_time: 1:00:00
    # What events do you not want to log. The event numbers are defined in
    # src/include/log.h. NOTE: the numbers are in hex, and log_filter is in
    # base 10.
    # The example is not to log DEBUG2 events, which are the most prolific
    log_filter: 256

Here is an example of a holidays file:

    * the current year
    YEAR 1998
    *
    * Start and end of prime time
    *
    *            Prime    Non-Prime
    * Day        Start    Start
    weekday      0400     1730
    saturday     none     all
    sunday       none     all
    *
    * The holidays
    *
    * Day of   Calendar   Company
    * Year     Date       Holiday
    1          Jan 1      New Year's Day
    20         Jan 20     Martin Luther King Day
    48         Feb 17     President's Day
    146        May 26     Memorial Day
    185        Jul 4      Independence Day
    244        Sep 1      Labor Day
    286        Oct 13     Columbus Day
    315        Nov 11     Veteran's Day
    331        Nov 27     Thanksgiving
    359        Dec 25     Christmas Day

Example of a resource group file:

    #
    # the groups "root" and "unknown" are added by the scheduler.
    # All the parents must be added before the children. This is why all the
    # groups are added first. The cresgrp numbers the users have are their UIDs.
    #
    # name   cresgrp   resgrp   shares
    grp1     50        root     10
    grp2     51        root     20
    grp3     52        root     10
    grp4     53        grp1     20
    grp5     54        grp1     10
    grp6     55        grp2     20
    usr1     60        root     5
    usr2     61        grp1     10
    usr3     62        grp2     10
    usr4     63        grp6     10
    usr5     64        grp6     10
    usr6     65        grp6     20
    usr7     66        grp3     10
    usr8     67        grp4     10
    usr9     68        grp4     10
    usr10    69        grp5     10

Example of a strict priority resource group file:

    # this is a strict priority resource group file. These are people who
    # should get priority over everyone else. The amount of shares is the
    # priority of the user.
    sally     1000   root   4
    larry     1001   root   6
    manager   1010   root   100
    vp        1016   root   500
    ceo       2000   root   10000

Example of a dedicated_time file:

    # Format:
    # FROM               TO
    # MM/DD/YYYY HH:MM   MM/DD/YYYY HH:MM
    04/10/1998 15:30     04/11/1998 23:50
    05/15/1998 05:15     05/15/1998 08:30
    06/10/1998 23:25     06/10/1998 23:50

3.6. Scheduling and File Staging

A decision must be made about when to begin to stage in files for a job. The files must be available before the job executes. The amount of time that will be required to copy the files is unknown to PBS, that being a function of file size and network speed. If file in-staging is not started until the job has been selected to run, when the other required resources are available, either those resources are "wasted" while the stage in occurs, or another job is started which takes the resources away from the first job and might prevent it from running.
Or, a specific stage-in request may be received for a job, see pbs_stagein(3B), in which case the files are staged in but the job is not run. When the job is run, it begins execution immediately because the files are already there.

In either case, if the files could not be staged in for any reason, the job is placed into a wait state with an "execute at" time 30 minutes in the future. A mail message is sent to the job owner requesting that s/he look into the problem. The reason the job is changed into the wait state is to prevent the scheduler from constantly retrying the same job, which would likely keep on failing.

Figure 5.0 in appendix B of the ERS shows the (sub)state changes for a job involving file in-staging. The scheduler may note the substate of the job and choose to perform pre-staging via the pbs_stagein() call. The substate will also indicate completeness or failure of the operation. The scheduler developer should carefully choose a stage-in approach based on factors such as the likely source of the files, network speed, and disk capacity.

4. GUI System Administrator Notes

Currently, PBS provides two GUIs: xpbs and xpbsmon.

4.1. xpbs

xpbs provides a user-friendly point-and-click interface to the PBS commands. The system administrator can specify a global resources file, ${PBS_LIB}/xpbs/xpbsrc, which is read by the GUI if a personal .xpbsrc file is missing. Keep in mind that within an X resources file (Tk only), later entries take precedence. For example, suppose the following entries appear in order in your .xpbsrc file:

    xpbsrc*backgroundColor: blue
    *backgroundColor: green

The later entry, "green", will take precedence even though the first one is more precise and a longer match. The things that can be set in the personal preferences file are fonts, colors, and favorite server host(s) to query.

4.2. xpbsmon

xpbsmon is the node monitoring GUI for PBS. It is used to display graphical information about execution hosts in a PBS environment. Its view of a PBS environment consists of a list of sites, where each site runs one or more servers, and each server runs jobs on one or more execution hosts (nodes).

The system administrator needs to define the site information in a global X resources file, $PBS_LIB/xpbsmon/xpbsmonrc, which is read by the GUI if a personal .xpbsmonrc file is missing. A default xpbsmonrc file usually will have been created upon install, defining (under the *sitesInfo resource) a default site name, the list of servers that run at a site, the set of nodes (or execution hosts) where jobs on a particular server run, and the list of queries that are communicated to each node's pbs_mom. If node queries have been specified, the host where xpbsmon is running must have been given explicit permission by the pbs_mom daemon to post queries to it. This is done by including a $restricted entry in the MOM's config file; see section 2.5 for more information on the restricted entry, and the sketch below.

It is not recommended to manually update the *sitesInfo value in the xpbsmonrc file, as its syntax is quite cumbersome. The recommended procedure is to bring up xpbsmon, click on the "Pref.." button, manipulate the widgets in the Sites, Server, and Query Table dialog boxes, then click the "Close" button and save the settings to a .xpbsmonrc file. Then copy this file over to $PBS_LIB/xpbsmon.
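As noted above, node queries from xpbsmon require a $restricted entry in the config file of each MOM to be queried. A minimal sketch (the host name is hypothetical; see pbs_mom(8B) for the exact syntax):

    $restricted gui_host.your.domain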
5. Operational Issues

This chapter addresses a few of the "day to day" operational issues which will arise.

5.1. Security

There are three parts to security in the batch system:

Internal security
    Can the daemons be trusted?

Authentication
    How do we believe a client about who it is?

Authorization
    Is the client entitled to have the requested action performed?

5.1.1. Internal Security

An effort has been made to ensure that the various PBS daemons themselves cannot be a target of opportunity in an attack on the system. The two major parts of this effort are the security of the files used by the daemons and the security of the daemons' environment.

Any file used by PBS, especially files that specify configuration or other programs to be run, must be secure. The files must be owned by root and in general cannot be writable by anyone other than root. When the PBS directories are installed, the make file runs a program to validate ownership of and access to the files. This can be rechecked at any time by running check-tree in the top level make file. check-tree is located in the directory given by the value of bindir in configure. Each daemon also validates the most critical files and directories each time it is started.

A corrupted environment is another source of attack on a system. To prevent this type of attack, each daemon resets its own environment when it starts. The source of the environment is a file named by PBS_ENVIRON, set by the configure option --set-environ. If it does not already exist, this file is created during the install process. It should be edited to include the minimum set of variables required on your system. Please note that one variable, PATH, must be included. The value of PATH in pbs_mom's environment, from this file, will be passed on to batch jobs. To maintain security, it is important that PATH be restricted to known, safe directories. Do not include "." in PATH. Another variable which can be dangerous and should not be set is IFS. The syntax of the PBS_ENVIRON file is either

    variable_name=value

or

    variable_name

In the latter case, the value for the variable is obtained from the daemon's environment before it is reset. Other variables for the job's environment may also be obtained from MOM's environment; this list varies by system. If you are interested, see the list in the variable obtain_vnames in the source code of src/resmom/*/mom_start.c.

5.1.2. Host Authentication

PBS uses a combination of information to authenticate a host. If a request is made from a client whose socket is bound to a privileged port (less than 1024, which requires root privilege), PBS (right or wrong) believes the IP (Internet Protocol) network layer as to who the host is. If the client request is from a non-privileged port, the name of the host which is making the client request must be included in the credential sent with the request, and it must match the IP network layer's opinion as to the host's identity.

5.1.3. Host Authorization

Access to the pbs_server from another system may be controlled by an access control list (ACL). See section 10.1.1 of the ERS for details.

Access to pbs_mom is controlled through a list of hosts specified in her configuration file. By default, only "localhost" and the name returned by gethostname(2) are allowed. See the man page pbs_mom(8B) for more information on the configuration file, and the sketch below.

Access to pbs_sched is not limited, other than that requests must come from a privileged port.
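For illustration, a MOM config file that also grants access to the server and scheduler hosts might contain entries like the following (host names hypothetical; see pbs_mom(8B)):

    $clienthost server_host.your.domain
    $clienthost sched_host.your.domain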
5.1.4. User Authentication

Is the user who he/she claims to be?

The PBS server authenticates the user name included in a request with the supplied PBS credential. This credential is supplied by pbs_iff(1B); see section 10.2 of the ERS.

5.1.5. User Authorization

Is the user entitled to make the request of the server under that name?

PBS as shipped assumes a consistent user name space within the set of systems which make up a PBS cluster. Thus if a job is submitted by UserA@hostA, PBS will allow the job to be deleted or altered by UserA@hostB. The routine site_map_user() is called twice: once to map the name of the requester, and again to map the job owner to a name on the server's (local) system. If the two mappings agree, the requester is considered the job owner. See section 10.1.3 of the ERS. This behavior may be changed by a site by altering the server routine site_map_user() found in the file src/server/site_map_user.c; see the Internal Design Spec.

Is the user entitled to execute the job under that name?

A user may supply a name under which the job is to be executed on a certain system. If one is not supplied, the name of the job owner is chosen to be the execution name. See the -u user_list option of the qsub(1B) command. Authorization to execute the job under the chosen name is granted under the following conditions:

1. The job was submitted on the server's (local) host and the submitter's name is the same as the selected execution name.

2. The host from which the job was submitted is declared trusted by the execution host in the /etc/hosts.equiv file, or the submitting host and submitting user's name are listed in the execution user's .rhosts file.

The system supplied library function, ruserok(), is used to make these checks. If the above are not satisfactory to a site, the routine site_check_user_map() in the file src/server/site_check_u.c may be modified. See the IDS for more information.

In addition to the above checks, access to a PBS server and queues within that server may be controlled by access control lists. See sections 10.1.1 and 10.1.2 of the ERS for more information.

5.1.6. Group Authorization

PBS allows a user to submit jobs and specify under which group the job should be executed. The user specifies a group_list attribute for the job, which contains a list of group@host entries similar to the user list. See the group_list attribute under the -W option of qsub(1B), and the example following this section. The PBS server will ensure that the user is a member of the specified group by:

1. Checking if the group is the user's primary group in the password entry. In this case the user's name does not have to appear in the group entry for his primary group.

2. Checking for the user's name in the specified group entry in /etc/group.

The job will be aborted if both checks fail. The checks are skipped if the user does not supply a group_list attribute. In this case the user's primary group from the password file will be used.

When staging files in or out, PBS also uses the selected execution group for the copy operation. This provides normal Unix access security to the files. Since all group information is passed as a string of characters, PBS cannot determine if a numeric string is intended to be a group name or a gid. Therefore, when a group list is specified by the user, PBS places one requirement on the groups within a system: each and every group in which a user might execute a job MUST have a group name and an entry in /etc/group. If no group lists are ever used, PBS will use the login group and will accept it even if the group is not listed in /etc/group. Note in this case that the egroup attribute value is a numeric string representing the user's gid rather than the group "name".
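For example, a user who belongs to the group grpa on the execution host might submit with (names hypothetical):

    qsub -W group_list=grpa my_script

The job will then execute under group grpa, provided the membership checks above succeed.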
5.1.7. Root Owned Jobs

The server will reject any job which would execute under the UID of zero unless the owner of the job, typically root on this or some other system, is listed in the server attribute acl_roots.

5.2. Job Prologue/Epilogue Scripts

PBS provides the ability to run a site supplied script before and/or after each job runs. This provides the capability to perform initialization or cleanup of resources, such as temporary directories or scratch files. The scripts may also be used to write "banners" on the job's output files. If a script is not present, MOM continues in a normal manner. If present, the script is run with root privilege. In order to be run, the script must adhere to the following rules:

o The script must be in the (pbs_home)/mom_priv directory, with the name prologue for the script to be run before the job and the name epilogue for the script to be run after the job.

o The script must be owned by root.

o The script must be readable and executable by root.

o The script cannot be writable by anyone but root.

The script may be a shell script or an executable object file. Typically, a shell script should start with a line of the form

    #! interpreter

See the rules under execve(2) or exec(2) on your system.

5.2.1. Prologue and Epilogue Arguments

When invoked, the prologue is called with the following arguments:

    argv[1]  the job id.
    argv[2]  the user name under which the job executes.
    argv[3]  the group name under which the job executes.

The epilogue is called with the above, plus:

    argv[4]  the job name.
    argv[5]  the session id.
    argv[6]  the requested resource limits (list).
    argv[7]  the list of resources used.
    argv[8]  the name of the queue in which the job resides.
    argv[9]  the account string, if one exists.

Note: the PBS development group reserves the right in a future release to remove certain of the arguments passed to the epilogue which could be obtained within the script by calling qstat(1B) or pbs_statjob(3B).

For both the prologue and epilogue:

envp    The environment passed to the script is null.

cwd     The current working directory is the user's home directory.

input   When invoked, both scripts have standard input connected to a system dependent file. For all systems except the SP, this is /dev/null. For the SP, the file is the host file containing the list of nodes assigned to the job.

output  With one exception, the standard output and standard error of the scripts are connected to the files which contain the standard output and error of the job. If a job is an interactive PBS job, the standard output and error of the epilogue are pointed at /dev/null, because the pseudo terminal connection used was released by the system when the job terminated.

5.2.2. Prologue Epilogue Time Out

To prevent a bad script or an error condition within the script from delaying PBS, MOM places an alarm around the script's execution. This is currently set to 30 seconds. If the alarm sounds before the script has terminated, MOM will kill the script. The alarm value can be changed by changing the corresponding define within src/resmom/prolog.c.

5.2.3. Prologue Error Processing

Normally, the prologue script should exit with a zero exit status. MOM will record in her log any case of a non-zero exit from a script.
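For illustration, a minimal prologue that writes a banner and exits cleanly might look like the following sketch (not shipped with PBS; the banner text is arbitrary):

    #!/bin/sh
    # Prologue: argv[1] is the job id, argv[2] the user name, argv[3] the
    # group name (see section 5.2.1). Standard output is connected to the
    # job's output file, so the banner appears there.
    jobid=$1
    user=$2
    echo "=== PBS job $jobid for user $user starting on `hostname` ==="
    # Exit zero so the job will run.
    exit 0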
Exit status values and their impact on the job are:

    -4  The script timed out (took too long). The job will be requeued.

    -3  The wait(2) call waiting for the script to exit returned with an error. The job will be requeued.

    -2  The input file to be passed to the script could not be opened. The job will be requeued.

    -1  The script has a permission error: it is not owned by root and/or is writable by someone other than root. The job will be requeued.

    0   The script was successful. The job will run.

    1   The script returned an exit value of 1. The job will be aborted.

    >1  The script returned a value greater than one. The job will be requeued.

The above apply to normal batch jobs. Note that interactive-batch jobs (-I option) cannot be requeued on a non-zero status, because the network connection back to qsub is lost and cannot be re-established. Interactive jobs will be aborted on any non-zero prologue exit. The administrator must exercise great caution in setting up the prologue to prevent jobs from being flushed from the system.

Epilogue script exit values are logged, if non-zero, but have no impact on the state of the job.

5.3. Use and Maintenance of Logs

The PBS system tends to produce many log file entries. There are two types of logs: the event logs, which record events within each PBS daemon (pbs_server, pbs_mom, and pbs_sched), and the server's accounting log.

5.3.1. The Daemon Logs

Each PBS daemon maintains an event log file. The Server (pbs_server), Scheduler (pbs_sched), and MOM (pbs_mom) default their logs to a file with the current date as the name, in the PBS_HOME/(daemon)_logs directory. This location can be overridden with the "-L pathname" option; pathname must be an absolute path.

If the default log file name is used (no -L option), the log will be closed and reopened with the current date daily. This happens on the first message after midnight. If a path is given with the -L option, the automatic close/reopen does not take place. All daemons will close and reopen the same named log file on receipt of SIGHUP. The pid of a daemon is available in its lock file in its home directory. Thus it is possible to move the current log file to a new name and send SIGHUP to restart the file:

    cd PBS_HOME/daemon_logs
    mv current archive
    kill -HUP `cat ../daemon_priv/daemon.lock`

The amount of output in the logs depends on the selected events to log and on the presence of debug writes, which are turned on by compiling with -DDEBUG. The server and MOM can be directed to record only messages pertaining to certain event types. The specified events are logically "or-ed"; their decimal values are:

    1    Error Events
    2    Batch System/Server Events
    4    Administration Events
    8    Job Events
    16   Job Resource Usage (hex value 0x10)
    32   Security Violations (hex value 0x20)
    64   Scheduler Calls (hex value 0x40)
    128  Debug Messages (hex value 0x80)
    256  Extra Debug Messages (hex value 0x100)

Everything turned on is, of course, 511. 127 is a good value to use. The event logging mask is controlled differently for the server and MOM. The server's mask is set via qmgr(1B) by setting the log_events attribute. This can be done at any time.
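For example, to record error, system, administration, and job events plus security violations (1+2+4+8+32 = 47), one might issue:

    qmgr -c "set server log_events = 47"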
MOM's mask may be set via her configuration file with a $logevent entry; see the -c option on pbs_mom. To change her logging mask, edit the configuration file and send MOM a SIGHUP signal. The scheduler, being site written, may have a different method of changing its event logging mask, or it may not have the ability at all.

5.3.2. The Accounting Log

The PBS server daemon maintains an accounting log. The format of the log is described in chapter 3 of the ERS. The log name defaults to (pbs_home)/server_priv/accounting/yyyymmdd, where yyyymmdd is the date. The accounting log may be placed elsewhere by specifying the -A option on the pbs_server command line. The option argument is the full (absolute) path name of the file to be used. If a null string is given, for example

    pbs_server -A ""

then the accounting log will not be opened and no accounting records will be recorded.

If the default file is used, named for the date, the file will be closed and a new one opened every day at midnight. If the server receives a SIGHUP signal, it will close the accounting log and reopen it. This allows you to rename the old log and start recording anew in an empty file. For example, if the server was started on February 1, it will be writing in the file 19960201. If the current date is February 9, then the following actions will cause the current accounting file to be renamed feb1 and the server to close it and start writing 19960209:

    mv 19960201 feb1
    kill -HUP 1234     (1234 being the server's pid)

5.4. Problem Solving

The following is a very incomplete list of possible problems and how to solve them.

5.4.1. Clients Unable to Contact Server

If a client command (qstat, qmgr, ...) is unable to connect to a server, there are several possibilities to check. If the error return is 15034, "No server to connect to", check (1) that there is indeed a server running and (2) that the default server information is set correctly. The client commands will attempt to connect to the server specified on the command line if given, or, if not given, the server specified in the "default server file" specified when the commands were built and installed.

If the error return is 15007, "No permission", check for (2) as above. Also check that the executable pbs_iff is located in the search path for the client and that it is setuid root. Additionally, try running pbs_iff by typing:

    pbs_iff server_host 15000

where server_host is the name of the host on which the server is running and 15000 is the port to which the server is listening (if built with a different port number, use that number instead of 15000). pbs_iff should print out a string of garbage characters and exit with a status of 0. The garbage is the encrypted credential which would be used by the command to authenticate the client to the server. If pbs_iff fails to print the garbage and/or exits with a non-zero status, either the server is not running or it was built with a different encryption system than was pbs_iff.

5.4.2. Non Delivery of Output

If the output of a job cannot be delivered to the user, it is saved in a special directory, (pbs_home)/undelivered, and mail is sent to the user. The typical causes of non-delivery are: (1) the destination host is not trusted and the user does not have a .rhosts file, (2) an improper path was specified, (3) a directory is not writable, and (4) the user's .cshrc on the destination host generates output when executed. These are explained fully in the section "Delivery of Output Files" in the next chapter.
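A quick check for cause (4) is to run a trivial command on the destination host via rsh and verify that it produces no output at all (the host name is hypothetical):

    rsh dest_host true

If anything is printed, the user's shell start-up files are generating output that will break the file copy; see the section "Modification of User shell initialization files" in the next chapter.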
5.4.3. Job Cannot be Executed

If a user receives a mail message containing a job id and the line "Job cannot be executed", the job was aborted by MOM when she tried to place it into execution. The complete reason can be found in one of two places: MOM's log file or the standard error file of the user's job.

If the second line of the message is "See Administrator for help", then MOM aborted the job before the job's files were set up. The reason will be noted in MOM's log. Typical reasons are a bad user/group account, a checkpoint/restart file (Cray), or a system error.

If the second line of the message is "See job standard error file", then MOM had created the job's files and additional messages were written to standard error. This is typically the result of a bad resource request.

5.4.4. Running Jobs with No Active Processes

On very rare occasions, PBS may be in a situation where a job is in the Running state but has no active processes. This should never happen, as the death of the job's shell should trigger MOM to notify the server that the job exited, at which point end of job processing begins. The fact that it happens even rarely means there is a bug in PBS (gasp! Oh the horror of it all.). If this situation is noted, PBS offers a way out: use the qsig command to send SIGNULL, signal 0, to the job. If MOM notes that there are no processes, she will force the job into the exiting state.

5.5. Communication with the User

Users tend to want to know what is happening to their job. PBS provides a special job attribute, comment, which is available to the operator, manager, or the scheduler program. This attribute can be set to a string to pass information to the job owner. It might be used to display information about why the job is not being run or why a hold was placed on the job. Users are able to see this attribute, when it is set, by using the -f option of the qstat command. A scheduler program can set the comment attribute via the pbs_alterjob() API. Operators and managers may use the -W option of the qalter command, for example:

    qalter -W comment="some text" job_id

6. Advice for Users

The following sections provide information necessary to the general user community concerning use of PBS. Please make this information available.

6.1. Modification of User shell initialization files

A user's job may not run if the user's start-up files (.cshrc, .login, or .profile) contain commands which attempt to set terminal characteristics. Any such activity should be skipped by placing it inside a test of the environment variable PBS_ENVIRONMENT (or, for NQS compatibility, ENVIRONMENT). This can be done as shown in the following sample .login:

    setenv PRINTER printer_1
    setenv MANPATH /usr/man:/usr/local/man:/usr/new/man
    if ( ! $?PBS_ENVIRONMENT ) then
        do terminal stuff here
    endif

If the user's login shell is csh, the following message may appear in the standard output of a job:

    Warning: no access to tty, thus no job control in this shell

This message is produced by many csh versions when the shell determines that its input is not a terminal. Short of modifying csh, there is no way to eliminate the message. Fortunately, it is just an informative message and has no effect on the job.
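The equivalent test for Bourne shell users in .profile is a sketch along these lines (the stty command is merely a placeholder for any terminal set-up):

    # skip terminal set-up when running under PBS (or NQS)
    if [ -z "$PBS_ENVIRONMENT" ] && [ -z "$ENVIRONMENT" ]; then
        stty erase '^H'       # terminal stuff here
    fi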
6.2. Shell Invocation

When PBS starts a job, it invokes the user's login shell (unless the user submitted the job with the -S option). PBS passes the job script, which is a shell script, to the login shell in one of two ways, depending on how PBS was installed.

Script as Standard Input

The standard method for PBS (built with SHELL_INVOKE = 0) is to open the script file as standard input for the shell. This is equivalent to typing

    shell < script

This offers advantages and disadvantages:

+ The user's script will always be directly processed by the user's login shell.

+ If the user specifies a non-standard shell (any old program) with the -S option, the script can be read by that program as its input.

- If a command within the job script reads from standard input, it may read lines from the script, depending on how far ahead the shell has buffered its input. Any command line so read will not be executed by the shell. A command that reads from standard input without explicit redirection is generally unwise in a batch job.

Name of Script on Standard Input

The alternative method (PBS built with SHELL_INVOKE = 1) is to pass the name of the job script to the shell program. This is equivalent to typing the script name as a command to an interactive shell. Since this is the only line passed to the shell, standard input will be empty to any commands. This approach also offers both advantages and disadvantages:

+ Any command which reads from standard input without redirection will get an EOF.

+ The shell syntax can vary from script to script; it does not have to match the syntax of the user's login shell. The first line of the script, even before any #PBS directives, should be #!/shell, where shell is the full path to the shell of choice (/bin/sh, /bin/csh, ...). The login shell will interpret the #! line and invoke that shell to process the script.

- An extra shell process is run to process the job script.

- If a non-standard shell is used via the -S option, it will not receive the script, but its name, on its standard input.

The choice of shell invocation method is left to the site. It is recommended that all PBS execution servers (pbs_mom) within the site be built to use the same shell invocation method.

6.3. Job Exit Status

The exit status of a job is normally the exit status of the shell executing the job script. If a user is using csh and has a .logout file in the home directory, the exit status of csh becomes the exit status of the last command in .logout. This may impact the use of job dependencies, which depend on the job's exit status. To preserve the job's status, the user may either remove .logout or add the following two lines to it. Add as the first line:

    set EXITVAL = $status

and as the last executable line:

    exit $EXITVAL

6.4. Delivery of Output Files

PBS uses a version of the rcp(1) command, pbs_rcp(1B), to return output files to the user at a remote host after the user's job terminates. This version of rcp is based on the BSD 4.4 Lite distribution rcp. It is provided because, unlike some rcp implementations, it always exits with a non-zero exit status for any error. Thus pbs_mom knows whether the file was delivered. Note that the secure copy program, scp, is also based on this version of rcp and is likewise careful about exit statuses.

Using rcp, the copy of output or staged files can fail for (at least) two reasons:

1. If the user's .cshrc script outputs any characters to standard output, e.g. contains an echo command, pbs_rcp will fail.
See the section in this document entitled "Modification of User shell initialization files".

2. The user must have permission to rsh to the remote host. Output is delivered to the remote destination host with the remote file owner's name being the job owner's name (the job submitter). On the execution host, the file is owned by the user's execution name, which may be different. For more information, see the -u user_list option on the qsub(1B) command.

If the two names are identical, permission to rcp may be granted at the system level by an entry in the destination host's /etc/hosts.equiv file calling out the execution host. If the owner name and the execution name are different, or if the destination host's /etc/hosts.equiv file does not contain an entry for the execution host, the user must have a ".rhosts" file in her home directory on the system to which the output files are being returned. The .rhosts file must contain an entry for the system on which the job executed, with the user name under which the job was executed. It is wise to have two lines, one with just the "base" host name and one with the full host.domain_name.

For delivery of output files on the local host, PBS uses the /bin/cp(1) command. Local and remote delivery of output may fail for the following additional reasons:

1. A directory in the specified path does not exist.

2. A directory in the specified path is not searchable by the user.

3. The target directory is not writable by the user.

Additional information as to the cause of the delivery problem might be determined from MOM's log file; each failure is logged. The various error codes are described in requests.c/sys_copy() in the IDS.

If PBS is built to use the Secure Copy Program, scp, then PBS will first try to deliver output or stage-in/out files using scp. If scp fails, PBS will try again using rcp (assuming that scp might not exist on the remote host). If rcp also fails, the above cycle will be repeated after a delay, in case the problem was caused by a temporary network problem. All failures are logged in MOM's log.

6.5. Stage in and Stage out problems

The same requirements and hints discussed above in regard to delivery of output apply to staging files in and out. It may also be useful to note that the stage in and stage out options on qsub both take the form

    local_file@remote_host:remote_file

regardless of the direction of transfer. Thus for stage in, the direction of travel is

    local_file <-- remote_host:remote_file

and for stage out, the direction of travel is

    local_file --> remote_host:remote_file

Also note that all relative paths are relative to the user's home directory on the respective hosts. PBS uses rcp or scp (or cp if the remote host is the local host) to perform the transfer. Hence, a stage in is just a

    rcp -r remote_host:remote_file local_file

and a stage out is just a

    rcp -r local_file remote_host:remote_file

As with rcp, the remote_file may be a directory name. Also as with rcp, the local_file specified in the stage in/out directive may name a directory. For stage in, if remote_file is a directory, then local_file must also be a directory. For stage out, if local_file is a directory, then remote_file must also be a directory.

If local_file is a directory on a stage out directive, the local_file directory on the execution host, including all files and subdirectories, will be copied. At the end of the job, the directory, including all files and subdirectories, will be deleted. Users should be aware that this may create a problem if multiple jobs are using the same directory.
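For example, a job that stages a data file in from a file server and stages a result file back out might be submitted as follows (all names hypothetical):

    qsub -W stagein=data@fileserver:/archive/data \
         -W stageout=results@fileserver:/archive/results job.sh

Here data and results are relative paths, and hence relative to the user's home directory on the execution host.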
Stage in presents another problem. Assume the user wishes to stage in the contents of a single file named poo and gives the following stage in directive:

    -W stagein=/tmp/bear@somehost:poo

If /tmp/bear is an existing directory, the local file becomes /tmp/bear/poo. When the job exits, PBS will determine that /tmp/bear is a directory and append /poo to it; thus /tmp/bear/poo will be deleted. If, however, the user wishes to stage in the contents of a directory named cat and gives the following stage in directive:

    -W stagein=/tmp/dog/newcat@somehost:cat

where /tmp/dog is an existing directory, then at job end PBS will determine that /tmp/dog/newcat is a directory, append /cat, and then fail on the attempt to delete /tmp/dog/newcat/cat. On stage in, when remote_file is a directory, the user should not specify a new directory as local_file. In the above case, the user should go with

    -W stagein=/tmp/dog@somehost:cat

which will produce /tmp/dog/cat, matching what PBS will try to delete at job's end.

Wildcards should not be used in either the local_file or the remote_file name. PBS does not expand the wildcard character on the local system. If wildcards are used in the remote_file name, the expansion will occur, since rcp is launched by rsh to the remote system. However, at job end, PBS will attempt to delete the file whose name actually contains the wildcard character and will fail to find it. This will leave all the staged in files in place (undeleted).

6.6. Dependent Jobs and Test Systems

If you have users running on a test batch system using an alternative port number (the -p option to pbs_server), problems may occur with job dependency if the following requirements are not observed:

1. For a test system, the job identifier in a dependency specification must include at least the first part of the host name.

2. The colon in the port number specification must be escaped by a back slash. This is true for both the server and current server sections. For example (see also the submission sketch at the end of this section):

    123.test_host\:17000
    123.old_host@test_host\:17000
    123.test_host\:17000@diff_test_host\:18000

On a shell line, the back slash itself must be escaped from the shell, so the above become:

    123.test_host\\:17000
    123.old_host@test_host\\:17000
    123.test_host\\:17000@diff_test_host\\:18000

These rules are not documented on the qsub/qalter man pages, since the likelihood of the general user community finding themselves setting up dependencies with jobs on a test system is small, and their inclusion would be generally confusing.
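For illustration, submitting a job on a test system that must wait for job 123 to finish successfully might look like this (host and port hypothetical):

    qsub -W depend=afterok:123.test_host\\:17000 new_job.sh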