                     CCCCCC   RRRRRRR    AAAAAA   YY    YY
                   CCCCCCCC  RRRRRRRR  AAAAAAAA  YY    YY
                   CCC   CC  RR    RR  AA    AA  YYY  YYY
                   CC        RR   RR   AA    AA   YYYYYY
                   CC        RRRRRR    AAAAAAAA    YYYY
                   CC        RRRRRR    AAAAAAAA     YY
                   CC        RR   RR   AA    AA     YY
                   CCC   CC  RR   RR   AA    AA     YY
                   CCCCCCCC  RR    RR  AA    AA     YY
                    CCCCCC   RR    RR  AA    AA     YY

                               RESEARCH,  INC.

   CF77 Compiling System, Volume 4: Parallel Processing Guide (SG-3074 5.0) 

   This user's guide defines and describes the Autotasking feature of the 
   CF77 compiling system release 5.0.  Autotasking is the automatic 
   distribution of loop iterations to multiple processors.  This user's 
   guide is one manual in a set (SR-3071, SR-3072, and SG-3074) describing 
   the CF77 compiling system, which includes the Cray Fortran compiler 
   CFT77 and the Autotasking software.


                                                           Record of Revision
     ########################################################################







     The date of printing or software version number is indicated in the
     footer.  In reprints with revision, changes are noted by revision bars
     along the margin of the page.





            Version         Description

              4.0           June 1990.  Original printing.  This manual
                            replaces the UNICOS Autotasking User's Guide,
                            publication SN-2088, and the COS Autotasking
                            User's Guide, publication SN-3033.  It supports
                            the Autotasking feature of the CF77 compiling
                            system release 4.0.  The "New Features" page
                            details specific features associated with the 4.0
                            release.


              5.0           June 1991.  Reprint with revision to support the
                            Autotasking feature of the CF77 compiling system
                            release 5.0, which runs on CX/CEA and CRAY-2
                            systems under the UNICOS 6.0 release or higher.
                            Documentation for the Autotasking feature under
                            the Cray Research operating system COS,
                            previously included in this manual, is no longer
                            included.  COS users of Autotasking should see
                            revision 4.0 of this manual.

                            New fpp command options include the following:
                            -H, to specify directories that contain INCLUDE
                            files; -N80, to specify 80-column input files;
                            -P, to specify the number of lines per page, for
                            page-formatted listings; -Q, to specify the size
                            of FPP-generated temporary arrays; and -V, to
                            display current FPP version information.  New FPP
                            optimization switch f lets you enable generation
                            of debugging directives.  Optimization switch j,
                            to translate nested loop idioms into library
                            calls, is now enabled by default.  A new FPP
                            directive, CFPP$ PRIVATEARRAY, lets you specify
                            that private arrays can be autotasked.

                            New fmp command options include the following:
                            -I, to specify directories that contain INCLUDE
                            files; -N80, to specify 80-column input files;
                            and -V, to display current FMP version
                            information.  A new FMP Autotasking directive,
                            TASKCOMMON, lets you specify that common blocks
                            should be converted to task common blocks.

                            UNICOS environment variables are now documented
                            in section 2, "CF77 User Interface," rather than
                            in section 13.

     SG-3074 5.0               Cray Research, Inc.                        iii



                                                                     Contents
     ########################################################################








       v  Preface

       1  Introduction  [1]
       3  Evolution of CRI parallel processing software
       5  Using microtasking and macrotasking with Autotasking
       6  Goals of Autotasking
       6  When to use Autotasking
       7  Autotasking's effect on vectorization
       8  Speedup expected from Autotasking

      11  CF77 User Interface  [2]
      11  CF77 compiling system
      18  UNICOS user interface
      28  UNICOS environment variables

      35  Invoking FPP and FMP Directly  [3]
      35  UNICOS fpp command
      47  UNICOS fmp command

      51  Concepts and Directives  [4]
      51  Concepts
      56  Levels of user intervention with Autotasking
      57  Directives

     101  FPP Data Dependency Analysis  [5]
     102  Data dependency examples
     107  Reference reordering
     108  Ambiguous subscript resolution
     109  Loop splitting to split index set
     109  Loop peeling
     110  Using data dependency directives
     115  Loop splitting to isolate recursion
     115  Translation of linear recursion
     117  Array indexing

     123  FPP Loop Analysis and Tuning  [6]
     124  Loop analysis
     136  Loop optimizations
     145  Loop tuning parameters








     SG-3074 5.0               Cray Research, Inc.                          v


     Contents                       CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     167  Additional FPP Optimization  [7]
     167  Vectorization enhancement
     174  Inline expansion
     187  Scalars in loops

     193  FPP Source Output  [8]
     194  Names generated by FPP
     195  Temporary arrays generated by FPP
     197  Example FPP listings

     203  Autotasking Performance  [9]
     203  Performance expectations for vectorization
     205  Amdahl's Law for vectorization
     206  Performance expectations for Autotasking
     206  Amdahl's Law for multitasking
     209  Estimating the percentage of parallelism within a program
     211  Prerequisites for high performance
     212  Characteristics of parallel programs
     215  Extent of parallelism and load balancing
     220  Overhead produced by Autotasking
     224  Autotasking performance example:  NAS Kernel Benchmark

     235  Autotasking Analysis Tools  [10]
     235  Tool summary
     237  Autotasking tools
     243  Other UNICOS tools

     253  Autotasking Memory Usage  [11]
     253  Increased program space requirements
     256  Increased stack space requirements
     257  Specifying memory requirements

     263  Autotasking in a Batch Environment  [12]
     263  Realistic Autotasking performance expectations
     272  Autotasking in a heavily loaded batch environment

     275  Debugging Autotasked Programs  [13]
     275  Problems unrelated to the use of multiple processors
     276  Problems related to the use of multiple processors
     276  CDBX debugger support
     281  Environment variables for all systems

     283  UNICOS Interface to Autotasking  [14]

     287  Software Anomalies  [A]

     291  FPP TIDY Subprocessor  [B]
     291  TIDY options
     295  TIDY parameters
     298  FORMAT and DATA statements
     298  Continued lines





     vi                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide                       Contents
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     299  Interpreting FMP Intermediate Source Code  [C]

     307  UNICOS Command Pages  [D]

     309  FMP and FPP Messages  [E]

          Figures
       2  Figure 1.  Cray Research parallel processing capabilities
       4  Figure 2.  Cray Research multitasking implementations
      12  Figure 3.  CF77 compiling system
       14  Figure 4.  FPP - the dependence analysis phase
      17  Figure 5.  Summary of FMP and CFT77 roles in CF77 compiling system
      18  Figure 6.  Phases of Autotasking using cf77
      21  Figure 7.  cf77 command summary
      23  Figure 8.  cf77 command control option summary
      29  Figure 9.  Summary of multitasking environment variables
      53  Figure 10. Multitasking terminology
      54  Figure 11. Multitasking terminology (continued)
      57  Figure 12. Levels of intervention with Autotasking
      59  Figure 13. Directive summary by type
      63  Figure 14. FPP transformation directive summary
      68  Figure 15. FPP data dependency directive summary
      72  Figure 16. Miscellaneous FPP directive summary
      77  Figure 17. Autotasking versus microtasking
      78  Figure 18. FMP Autotasking directives
      82  Figure 19. Work distribution parameters for parallel loops
      89  Figure 20. FMP microtasking directive summary
     102  Figure 21. FPP data dependency analysis
     104  Figure 22. FPP data dependency analysis
     107  Figure 23. Summary of FPP techniques to optimize data dependencies
     117  Figure 24. Summary of linear recursion patterns recognized by FPP
     124  Figure 25. Summary of FPP loop analysis
     132  Figure 26. Summary of FPP loop selection criteria for vectorization
     134  Figure 27. Summary of FPP criteria for Autotasking and possible
                     inhibitors
     137  Figure 28. Summary of FPP loop optimization techniques
     146  Figure 29. Summary of FPP loop tuning parameters
     169  Figure 30. Summary of additional FPP vectorization enhancements
     188  Figure 31. Summary of FPP transformations of scalars in loops
     198  Figure 32. Summary of FPP listing features
     204  Figure 33. Summary of performance issues
     210  Figure 34. Simple technique for estimating parallelism in a program
     213  Figure 35. Characteristics of programs with a high potential for
                     parallelism
     217  Figure 36. Summary of parallelism and load balancing
     218  Figure 37. Execution scenario for example 1
     219  Figure 38. Execution scenario for example 2
     223  Figure 39. Overhead introduced by Autotasking on CX/CEA
     223  Figure 40. Overhead introduced by Autotasking on CRAY-2
     255  Figure 41. FMP source code generation
     266  Figure 42. Summary of issues influencing Amdahl's Law under
                     realistic Autotasking conditions


     SG-3074 5.0               Cray Research, Inc.                        vii


     Contents                       CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     273  Figure 43. Summary of advantages and disadvantages of Autotasking
                     in a batch system
     285  Figure 44. Parallel region overview (master process)
     285  Figure 45. Parallel region overview (slave process)

          Tables
      42  Table 1.  Optimization switches enabled and disabled by -e and -d
      46  Table 2.  Listing switches enabled and disabled by -p and -q
      60  Table 3.  Allowable scope parameters for CFPP$ directives
      61  Table 4.  CFPP$ directives
      76  Table 5.  Equivalent Autotasking and microtasking directives
      99  Table 6.  CFT77 versus FPP directives
     199  Table 7.  Loop disposition codes
     208  Table 8.  Amdahl's Law for multitasking
      226  Table 9.  NAS Kernel Benchmark - zero changes
      227  Table 10. NAS Kernel Benchmark - twenty changes
     234  Table 11. VPENTA case study results from a CRAY Y-MP system
     235  Table 12. Which tool to use?
     237  Table 13. Tool impact
     270  Table 14. Autotasking wall-clock speedups in batch environment
     291  Table 15. TIDY switches






































     viii                      Cray Research, Inc.                SG-3074 5.0



                                                                      Preface
     ########################################################################







                           This user's guide is one manual in a set
                           describing the Cray Research CF77 compiling
                           system.  The compiling system includes the Cray
                           Research Fortran compiler CFT77 and the
                           Autotasking software described in this guide.
                           Other manuals in this set include the following:

                           * CF77 Compiling System, Volume 1:  Fortran
                             Reference Manual, publication SR-3071

                           * CF77 Compiling System, Volume 2:  Compiler
                             Message Manual, publication SR-3072

                           * CF77 Compiling System, Volume 3:  Vectorization
                             Guide, publication SG-3073

                           * CF77 Compiling System Ready Reference,
                             publication SQ-3070

                           This user's guide defines and describes the
                           Autotasking feature of the CF77 compiling system.
                           Autotasking is the automatic distribution of loop
                           iterations to multiple processors.  The
                           Autotasking feature described in this manual runs
                           on CRAY Y-MP, CRAY X-MP EA, CRAY X-MP, and CRAY-2
                           computer systems running the UNICOS 6.0 release or
                           higher and the CF77 5.0 release.  Autotasking is
                           released as part of the CF77 5.0 release.

                           This user's guide describes the dependence
                           analyzer, FPP, the translation phase, FMP, and the
                           user interface to the compiling system, cf77.  FPP
                           preprocesses DO and IF loops for the CFT77
                           compiler, and improves performance of Cray Fortran
                           programs by providing Autotasking and
                           vectorization enhancement.  FMP translates
                           directives and original Fortran source for
                           Autotasking.  cf77 provides a one-step user
                           interface to the compiling system.

                           The following Cray Research, Inc. (CRI) manuals
                           provide information about related subjects.
                           Unless otherwise noted, all publications
                           referenced in this manual are CRI publications.

                           * UNICOS User Commands Reference Manual,
                             publication SR-2011

                           * UNICOS User Commands Ready Reference,
                             publication SQ-2056


     SG-3074 5.0               Cray Research, Inc.                         ix


     Preface      CF77 Compiling System, Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           * CAL Assembler Version 2 Reference Manual,
                             publication SR-2003

                           * CRAY-2 Fortran (CFT2) Reference Manual,
                             publication SR-2007

                           * Macros and Opdefs Reference Manual, publication
                             SR-0012

                           * UNICOS Macros and Opdefs Reference Manual for
                             CRAY-2 Computer Systems, publication SR-2082

                           * Volume 1:  UNICOS Fortran Library Reference
                             Manual, publication SR-2079

                           * Volume 2:  UNICOS C Library Reference Manual,
                             publication SR-2080

                           * Volume 3:  UNICOS Math and Scientific Library
                             Reference Manual, publication SR-2081

                           * Volume 4:  UNICOS System Calls Reference Manual,
                             publication SR-2012

                           * Segment Loader (SEGLDR) and ld Reference Manual,
                             publication SR-0066

                           * UNICOS Performance Utilities Reference Manual,
                             publication SR-2040

                           * UNICOS CDBX Symbolic Debugger Reference Manual,
                             publication SR-2091

                           * UNICOS CDBX Debugger User's Guide, publication
                             SG-2094

                           * CRAY Y-MP, CRAY X-MP EA, and CRAY X-MP
                             Multitasking Programmer's Manual, publication
                             SR-0222

                           * CRAY-2 Multitasking Programmer's Manual,
                             publication SN-2026

                           * Interlanguage Programming Conventions,
                             publication SN-3009




     Conventions
                           The Hardware Product Line sheet, located at the
                           end of this preface, defines the hardware naming
                           conventions used in this manual.  This sheet shows
                           both the chronological evolution of Cray Research
                           mainframes and the characteristics of each
                           mainframe group.  The reverse side of the sheet
                           contains definitions of the terms used on the
                           sheet and throughout this manual.


     xi                        Cray Research, Inc.                SG-3074 5.0


     CF77 Compiling System, Volume 4:  Parallel Processing Guide      Preface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The following typographic conventions are used
                           throughout this manual:

                           Convention   Meaning
                           ----------   -------

                           command(1)   The designation (1) following a
                                        command name indicates that the
                                        command is documented in UNICOS User
                                        Commands Reference Manual,
                                        publication SR-2011

                           system call(2)
                                        The designation (2) following a
                                        system call name indicates that the
                                        system call is documented in Volume
                                        4:  UNICOS System Calls Reference
                                        Manual, publication SR-2012

                           library routine(3X)
                                        The designation (3X) following a
                                        routine name indicates that the
                                        routine is documented in one of the
                                         CRI library reference manuals
                                         (SR-2079, SR-2080, SR-2081, SR-2057,
                                        or SM-2083).  The letter following
                                        the number 3 indicates the
                                        appropriate manual.

                                        For a list of the 3X library routine
                                        designations and their associated
                                        manuals, see the FILES section of the
                                        man(1) man page.

                           typewriter font
                                        Denotes literal items such as command
                                        names, file names, routines,
                                        directory names, path names, signals,
                                        messages, and programming language
                                        structures.

                           italic font  Denotes variable entries and words or
                                        concepts being defined.

                           bold typewriter font
                                        In screen drawings of interactive
                                        sessions, denotes literal items
                                        entered by the user.  Output is shown
                                        in nonbold typewriter font.

                           In this manual, Cray Research and CRI are used
                           interchangeably to refer to Cray Research, Inc.,
                           and/or its products.  To avoid redundancy or
                           awkwardness, Cray Research may occasionally be
                           shortened to Cray (for example, Cray disk or Cray
                           job).





     SG-3074 5.0               Cray Research, Inc.                        xii


     Preface      CF77 Compiling System, Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     Reader comments
                           If you have comments about the technical accuracy,
                           content, or organization of this manual, please
                           tell us.  You can contact us in any of the
                           following ways:

                           * Call our Software Information Services
                             department at (612) 683-5729.

                           * Send us electronic mail from a UNICOS or UNIX
                             system, using the following UUCP address:

                                uunet!cray!publications


                           * Send us electronic mail from any system
                             connected to Internet, using one of the
                             following Internet addresses:

                                 pubs3074@timbuk.cray.com (comments specific
                                 to this manual)

                                 publications@timbuk.cray.com (general
                                 comments)


                           * Send a facsimile of your comments to the
                             attention of "Software Information Services" at
                             fax number (612) 683-5599.

                           * Use the postage-paid Reader's Comment form at
                             the back of this manual.

                           * Write to us at the following address:

                                Cray Research, Inc.
                                Software Information Services Department
                                655F Lone Oak Drive
                                Eagan, MN  55121


                           We value your comments and will respond to them
                           promptly.
















     xiii                      Cray Research, Inc.                SG-3074 5.0



                                                            Introduction  [1]
     ########################################################################







                           Parallel processing capabilities exist on several
                           levels and can be used in many different ways on
                           Cray Research, Inc. (CRI) computer systems.
                           Parallel processing capabilities are summarized in
                           Figure 1, page 2, and discussed in this section.
                           At the hardware level, the following introduce
                           parallel processing:

                           * Parallel instruction execution - Most CRI
                             computer systems issue one instruction per clock
                             period, although an instruction may take several
                             clock periods to complete execution.  Also,
                             instructions are not executed serially, but in
                             parallel; for example, an addition instruction
                             may be issued during the execution of a
                             multiplication instruction.  This is parallel
                             processing at the hardware instruction level.

                           * Vector registers and segmented vector functional
                             units - The vector registers and vector
                             functional units in CRI systems use instruction
                             pipelining.  Vector functional unit segmentation
                             and vector chaining (in CRAY Y-MP systems) act
                             as parallel processing aids.  Instruction
                             pipelining occurs when an instruction begins
                             before the previous instruction has completed;
                             this is accomplished by using segmented
                             hardware.  Segmentation is the process whereby
                             an operation is divided into a discrete number
                             of steps; segmented hardware allows these
                             discrete parts to be "pipelined" through it.
                             Vector chaining in CRAY Y-MP systems allows a
                             vector register reserved for results to become
                             the operand register of a succeeding
                             instruction.  These hardware parallel processing
                             features, combined with instruction parallelism,
                             allow a significant number of operations to be
                             done in parallel.

                           * I/O subsystems or foreground processors -
                             CRAY Y-MP/8 computer systems can have up to two
                             I/O subsystems, and CRAY-2 computer systems have
                             foreground processors.  These are logically
                             separate processors that perform input and
                             output functions for the operating system in use
                             on the computer system.  These operations occur
                             in parallel with a job or process running in the
                             main processor.



     SG-3074 5.0               Cray Research, Inc.                          1


     Introduction                   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                 Cray Research Parallel Processing Capabilities


       Parallel execution at the hardware level


       * Parallel instruction execution


       * Vector registers and segmented vector functional units


       * I/O subsystems or foreground processors






       Parallel execution at the software level


       * Concurrent multiprogramming


       * Multiprogramming at the job level


       * Multiprogramming at process level


       * Multitasking
     ________________________________________________________________________

            Figure 1.  Cray Research parallel processing capabilities

                           In conjunction with the hardware-level
                           parallelism, there are also software elements that
                           introduce parallel processing, as follows:

                           * Concurrent multiprogramming - When a CRI system
                             has only one main processor, that processor
                             switches between jobs or processes in the
                             system.  This switching can make it appear that
                             many things are happening simultaneously, even
                             though only one program is being worked on at
                             any point in time.  This is desirable, because
                             the processor can work on one job while another
                             job is waiting for an I/O operation to complete.
                             Almost all operating systems can do this kind of
                             multiprogramming.

                           * Multiprogramming at the job level - With more
                             than one main processor available, as in any
                             multiprocessor CRI system, the processors are
                             working concurrently on as many different


     2                         Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide                   Introduction
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                             programs at one time as there are processors.

                           * Multiprogramming at the process level - Each
                             program that a user submits to the operating
                             system or types in at a terminal is a process to
                             the operating system.  In UNICOS, a user can
                             create separate processes simply by placing a
                             job in the background (affixing an ampersand to
                             the end of the command line), or by using the
                             pipe capability to "pipe" the output of one
                             process to another process as input.
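
                            As a brief illustration (prog1 and prog2 are
                            placeholder program names, not commands described
                            in this manual), either of the following command
                            lines creates an additional process:

                                 prog1 &
                                 prog1 | prog2

                            The first command runs prog1 in the background;
                            the second pipes the output of prog1 into prog2
                            as input.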

                           It is also possible to have more than one
                           processor work on one program.  This is
                           generically called multitasking.  CRI has
                           exploited this multitasking capability in evolving
                           software products.




     Evolution of CRI
     parallel processing
     software
     1.1
                           The evolution of CRI parallel processing software
                           consists of three implementations:  macrotasking,
                           microtasking, and Autotasking.  This evolution is
                           summarized in Figure 2, page 4.  Macrotasking
                           required programmers to modify their codes to
                           exploit parallelism by doing extensive data
                           scoping and required the insertion of library
                           calls specific to CRI.  Microtasking expanded on
                           the strengths of macrotasking; less data scoping
                           was required and compiler directives replaced
                           library calls specific to CRI.  A big advantage of
                           microtasking is that it requires programmers to
                           change working programs much less than
                           macrotasking and works well in both batch and
                           dedicated environments.




















     SG-3074 5.0               Cray Research, Inc.                          3


     Introduction                   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________


                   Cray Research Multitasking Implementations



       * Macrotasking



       * Microtasking



       * Autotasking




       Autotasking, microtasking, and macrotasking can coexist in the same
       program, but not in the same subprogram unit.




       A previously autotasked routine can be processed by FPP to detect
       additional parallelism.  FPP will analyze only loop nests that do
       not contain CMIC$ directives.




       A previously microtasked program can be processed by FPP to detect
       additional parallelism in the nonmicrotasked routines.
     ________________________________________________________________________

              Figure 2.  Cray Research multitasking implementations

                           The most recent implementation, Autotasking,
                           combines the best aspects of microtasking with two
                           fundamental enhancements:

                           * Autotasking can be fully automatic; that is, it
                             does not require programmer intervention,
                             although programmers are free to interact with
                             the Autotasking system to enhance performance.

                           * Autotasking can exploit parallelism at the DO
                             loop level without extending to subroutine
                             boundaries, as microtasking is written to do.









     4                         Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide                   Introduction
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                             The CF77 compiling system, composed of products
                              FPP, FMP, and the CFT77 compiler, provides this
                             functionality.  The cf77, fpp, and fmp UNICOS
                             commands are the interface to the CF77 compiling
                             system.  The cf77 command serves as an
                             "overcompiler," which invokes the appropriate
                              phases of Autotasking (that is, FPP, FMP, the
                             CFT77 compiler, and the loader) to build an
                             executable program based on a set of defaults
                             and options.  The UNICOS fpp command invokes the
                             dependence analysis phase of Autotasking.

                             FMP replaces PREMULT, the previous microtasking
                             preprocessor, providing a bridge to the enhanced
                             libraries and supporting both Autotasking and
                             microtasking.  Programs that have been
                             microtasked do not have to be changed to use
                             Autotasking support.




     Using microtasking
     and macrotasking with
     Autotasking
     1.2
                           On all CRI systems, Autotasking, microtasking, and
                           macrotasking can coexist in the same program.
                           Entire subprogram units can be macrotasked,
                           microtasked, or Autotasked in a given program
                           without problems.  You can also combine
                           Autotasking and microtasking in the same
                           subprogram unit, with the following restrictions:

                           * Autotasking CMIC directives inhibit FPP action
                             on any loop nest in which they appear.  Also,
                             FPP does not try to optimize anything inside a
                             parallel region (that is, anything bounded by a
                              CMIC$ PARALLEL/CMIC$ END PARALLEL pair).

                           * For microtasking, the following CMIC$ directives
                             inhibit FPP action for the entire routine:
                             DOGLOBAL, MICRO, PROCESS, ALSOPROCESS, and
                             ENDPROCESS.

                           * Microtasking directives other than those
                             previously listed (including CONTINUE and
                             GUARD/ENDGUARD) are handled the same as for
                             Autotasking CMIC directives.

                           * FPP does not change DOGLOBAL directives to DOALL
                             directives.

                           * Codes can be autotasked from within macrotasked
                             areas.
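
                            The fragment below is a minimal sketch of the
                            first restriction (the directive spelling follows
                            the forms described in "Concepts and Directives,"
                            page 51, and should be treated as illustrative).
                            Because the first loop nest already carries a
                            CMIC$ directive, FPP leaves that nest alone; the
                            second, directive-free nest in the same routine
                            is still a candidate for FPP analysis:

                                 CMIC$ DO ALL SHARED(A, B, N) PRIVATE(I)
                                       DO 10 I = 1, N
                                          A(I) = B(I) + 1.0
                                    10 CONTINUE
                                 C     No CMIC$ directives; FPP may analyze this nest
                                       DO 20 I = 1, N
                                          B(I) = 2.0*B(I)
                                    20 CONTINUE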



     SG-3074 5.0               Cray Research, Inc.                          5


     Introduction                   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Goals of Autotasking
     1.3
                           Autotasking can be generally described as the
                           automatic distribution of loop iterations to
                           multiple processors.  To do this, Autotasking
                           takes a Fortran program as input and transforms it
                           so it can run on multiple processors concurrently,
                            and (assuming multiple processors are available)
                           it makes the program run faster (wall-clock time)
                           than it does without Autotasking.  Autotasking
                           builds on the experience gained from prior CRI
                           parallel processing products, macrotasking and
                           microtasking, and makes parallel processing easier
                           for CRI system users.

                           More specifically, the goals of Autotasking
                           include the following:

                           * Detect parallelism automatically in a program
                             and exploit the parallelism without user
                             intervention.

                           * Define a syntax by which parallelism is
                             expressed, allowing users to guide the
                             Autotasking system in code segments in which the
                             user can provide additional information to the
                             Autotasking system, or where the Autotasking
                             system cannot detect parallelism automatically.

                           * Define the scope of variables when transforming
                             a program to exploit parallelism.

                           * Provide a simple command line interface to
                             Autotasking.




     When to use
     Autotasking
     1.4
                           Autotasking, like microtasking and macrotasking,
                           can reduce the wall-clock run time of CPU-
                           intensive programs.  If a program is I/O-bound,
                           using Autotasking will probably make it more I/O-
                           bound.  Long-running programs, programs that use
                           so much memory that little else can run in the
                           machine, or programs that have hard deadlines for
                           completion are particularly good candidates for
                           Autotasking.  However, most running CRI computer
                           systems have available idle time that autotasked
                           programs can employ effectively.






     6                         Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide                   Introduction
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Generally, Autotasking works best on programs in
                           which most of the work is in nested DO loops that
                           do not contain CALL statements.  To run in
                           parallel, the iterations of a DO loop must use
                           independent elements of arrays that are being
                           changed.  This property is often hard to see for
                           complex loops.  Also, Autotasking is not limited
                           to running code in parallel on the outer loop; it
                           can make transformations that arrange code so that
                           it will run in parallel on loops other than the
                           outermost loop.  Programs that are heavily
                           vectorized tend to have high potential for
                           parallelism.
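
                            As a simple sketch of this property (not an
                            example taken from this manual), the first loop
                            below updates independent elements of A, so its
                            iterations could run in parallel; in the second
                            loop, each iteration uses the result of the
                            previous one, so its iterations cannot simply be
                            distributed across processors:

                                 C     Iterations are independent
                                       DO 10 I = 1, N
                                          A(I) = B(I) + C(I)
                                    10 CONTINUE
                                 C     Each iteration depends on the previous one
                                       DO 20 I = 2, N
                                          A(I) = A(I-1) + B(I)
                                    20 CONTINUE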




     Autotasking's effect
     on vectorization
     1.5
                           High performance for many codes is achieved when
                           the compiler detects code sequences that can be
                           vectorized, and it uses the vector registers to
                           run those sequences.  Generally, vectorized code
                           for a loop runs about 10 times faster than scalar
                           code for the same loop.  Because it costs a
                           program little to use vector registers, it is
                           almost always better to run in vector mode.

                           In determining how to optimize a program,
                           Autotasking favors vectorization over parallel
                           processing.  If dependence analysis allows it,
                           Autotasking vectorizes the innermost loop of a
                           nest of DO loops and runs the outermost loop on
                           multiple processors.  In some cases, Autotasking
                           will process a single vectorizable DO loop in
                           chunks, as if it were a nested pair of loops, with
                           a vector inner loop and a parallel outer loop.

                           Loops do not need to be vectorized for Autotasking
                           to detect that a nest of loops can be run in
                           parallel.  Some codes may have scalar inner loops
                           that can be run in parallel on an outer loop or a
                           set of adjacent scalar loops that are independent
                           and can be executed in parallel.
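
                            A minimal sketch of the nesting described above
                            (assuming, for illustration, that dependence
                            analysis finds no obstacles):  the inner loop on
                            I is a candidate for vectorization, and the
                            iterations of the outer loop on J are candidates
                            for distribution across processors.

                                       DO 20 J = 1, M
                                          DO 10 I = 1, N
                                             A(I,J) = B(I,J) + C(I,J)
                                    10    CONTINUE
                                    20 CONTINUE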

                           For more information about vectorization that can
                            be achieved with the compiling system, see the
                           CF77 Compiling System, Volume 3:  Vectorization
                           Guide, publication SG-3073.









     SG-3074 5.0               Cray Research, Inc.                          7


     Introduction                   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~





     Speedup expected from
     Autotasking
     1.6
                           Because there are so many contributing factors,
                           the degree of speedup is difficult to predict.
                           The first factor to consider is the program
                           itself.  How much parallelism does it contain?  If
                           you have previously microtasked or macrotasked the
                           program, you probably have a good idea how much
                           parallelism exists.  If the program has never used
                           parallel processing, some of the guidelines in
                           "When to use Autotasking," page 6, may give you an
                           idea of the parallelism you can expect.  From a
                           known or expected amount of parallelism, you can
                           calculate a speedup based on Amdahl's Law.  For
                           example, to get a speedup of 3 on an eight-
                            processor system, the code must be 80% parallel;
                            to get a speedup of 7 on an eight-processor
                            machine, the code must be 98% parallel.  Amdahl's
                           Law and its effect on autotasked programs are
                           explained in more detail in "Autotasking
                           Performance," page 201.  (The UNICOS amlaw(1)
                           command also provides a summary of Amdahl's Law.)
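
                            As a rough check of the second figure, using the
                            standard form of Amdahl's Law (the multitasking
                            form presented in section 9 may include
                            additional terms, so treat this as a sketch),
                            with parallel fraction p = 0.98 and n = 8
                            processors:

                                 speedup = 1 / ((1 - p) + p/n)
                                         = 1 / (0.02 + 0.98/8)
                                         = 1 / 0.1425
                                         = 7.0 (approximately)

                            With p = 0.80, the same formula gives a speedup
                            of about 3.3.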


                           Every code contains some amount of parallelism.
                           Autotasking detects some types of parallelism, but
                           not others.  Parallelism found in a code sequence
                           may not be of sufficient granularity to make the
                           program run faster; therefore, Autotasking may
                           choose to ignore the parallelism in that code
                           sequence.  Vectorization already exploits
                           parallelism in most codes.  Because of these
                           various factors, Autotasking may detect and
                           exploit only part or none of the parallelism that
                           exists in the code.

                           Briefly, Autotasking a large existing application
                           rarely results in a speedup linear with the number
                           of processors in the machine.  A fair amount of
                           user assistance will probably be required to
                           achieve that level of performance.  Smaller codes
                           or those that spend almost all of their execution
                           time in small kernels (matrix multiplication,
                           basic linear algebra, and so on) have a better
                           chance of achieving near-linear speedups without
                           user assistance.

                           Vectorization almost always results in codes
                           running faster.  Autotasking generally results in
                           speedups, but it has a higher risk for slowing
                           down some codes.  As with any optimization, you
                            should apply Autotasking carefully.


     8                         Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide                   Introduction
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




























































     SG-3074 5.0               Cray Research, Inc.                          9



                                                     CF77 User Interface  [2]
     ########################################################################







                           To understand the user interface to CF77, you must
                           first understand how the parts of CF77 fit
                           together.  This section explains the CF77
                           compiling system, then describes the user
                            interface and the environment variables that you
                            can use to customize your environment.




     CF77 compiling system
     2.1
                           Autotasking, which is a part of the CF77 compiling
                           system, is made up of three phases:  the
                           dependence analysis phase, FPP; the translation
                           phase, FMP; and the code generation phase, the
                           CFT77 compiler.  Figure 3, page 12, shows how the
                           phases fit together when invoked using valid
                           options of the cf77 command.  Although the phases
                           can be invoked independently, knowing how the
                           phases fit together may help you do a better job
                           of Autotasking your program.
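
                            As a brief illustration (the -Zp option spelling
                            shown here is illustrative; see the cf77 command
                            summary in Figure 7, page 21, for the
                            authoritative option forms), a single command
                            line such as the following runs all of the
                            phases, plus the loader, on a hypothetical source
                            file myprog.f:

                                 cf77 -Zp myprog.f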




























     SG-3074 5.0               Cray Research, Inc.                         11


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     See the printed manual for this figure; it doesn't display on-line.



                         Figure 3.  CF77 compiling system



     FPP
     2.1.1
                           The dependence analysis phase, FPP, parses the
                           original Fortran source program, looks for
                           parallelism within program units, and produces a
                           transformed Fortran source file as output.  Some
                           of the transformations FPP performs are summarized
                           in Figure 4, page 14, and are as follows:

                           * Adds Autotasking directives with private and
                             shared variable lists where parallel execution
                             is possible

                           * Adds CDIR@ IVDEP directives before loops that
                             can be vectorized

                           * Expands external procedures inline if requested
                             and if possible

                           * Restructures loop nests

                           * Replaces certain code patterns with calls to
                             highly optimized, multiprocessed library
                             routines

                           * Generates a run-time threshold test for
                             autotasked loops when the amount of work cannot
                             be computed at compile time

                           * Generates conditional vector or scalar code

                           * Converts IF loops into DO loops, where possible,
                             to enhance vectorization and parallelization

                           * Rewrites over-complicated subscript expressions
                             as linear functions of the loop index

                           * Splits partially vectorizable loops into
                             separate fully vectorizable and nonvectorizable
                             loops

                           * Reorders statements to remove data dependencies
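
                            As an illustration of the IF-loop conversion
                            listed above (a sketch, not an example from this
                            manual), FPP can rewrite a counting loop coded
                            with IF and GO TO statements, such as the
                            following, as an equivalent DO loop that later
                            analysis can then vectorize or autotask:

                                       I = 1
                                    10 IF (I .GT. N) GO TO 20
                                          A(I) = B(I) + C(I)
                                          I = I + 1
                                       GO TO 10
                                    20 CONTINUE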









     12                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           FPP recognizes when iterations of DO loops operate
                           on independent elements of arrays and it inserts
                           directives to exploit this independence.  Many
                           codes have parallelism of this type.  FPP also
                           recognizes adjacent blocks of code that can be
                           executed concurrently.  Some parallelism obvious
                           to users may be difficult for FPP to recognize.
                           For example, a CALL statement inside a DO loop
                           prevents FPP from transforming the code to run in
                           parallel, because the effects of the subroutine
                           being called are unknown by FPP.
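
                            For example (a sketch; SUB is a placeholder
                            subroutine name), FPP does not transform the
                            first loop below, because it cannot see what SUB
                            does with its arguments; writing the same work
                            inline, as in the second loop, leaves FPP free to
                            analyze and autotask it:

                                       DO 10 I = 1, N
                                          CALL SUB(A, B, I)
                                    10 CONTINUE

                                       DO 20 I = 1, N
                                          A(I) = B(I)**2
                                    20 CONTINUE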


















































     SG-3074 5.0               Cray Research, Inc.                         13


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                                       FPP

       The dependence analysis phase makes the following transformations:


       * Adds Autotasking directives where parallel execution is possible

       * Adds CDIR@ IVDEP directives before loops that can be vectorized

       * Expands external procedures inline if requested and if possible

       * Restructures loop nests

       * Replaces certain code patterns with calls to highly optimized,
         multiprocessed library routines

       * Generates a run-time threshold test for autotasked loops when the
         amount of work cannot be computed at compile time

       * Generates conditional vector or scalar code

       * Converts IF loops into DO loops, where possible

       * Rewrites over-complicated subscript expressions as linear
         functions of the loop index

       * Splits partially vectorizable loops into separate fully
         vectorizable and nonvectorizable loops

       * Reorders statements to remove data dependencies
     ________________________________________________________________________

                  Figure 4.  FPP - the dependence analysis phase

                           The input to the FPP phase is Fortran source code
                           and the output is (possibly restructured) Fortran
                           source code with Autotasking and compiler
                           directives added to express the parallelism.  This
                           output may be compiled directly by CFT77.  In this
                           case, only the vectorization enhancements are
                           obtained without multitasking enhancements.

                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                  Note

                           The CDIR@ IVDEP directives inserted by FPP are
                            reserved for use by FPP only.  If you want the
                           functionality of the CDIR@ IVDEP directive, use
                           the CDIR$ IVDEP form of the directive.
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
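
                            For example, if you know that the index array IX
                            never repeats a value within the loop below, you
                            can assert that with the user form of the
                            directive (a sketch; IX, A, and B are placeholder
                            names):

                                 CDIR$ IVDEP
                                       DO 10 I = 1, N
                                          A(IX(I)) = A(IX(I)) + B(I)
                                    10 CONTINUE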







     14                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     FMP
     2.1.2
                           The primary function of the translation phase,
                           FMP, is to transform a Fortran source file for
                           multitasking.  Autotasking directives are
                           translated into Autotasking intrinsic functions or
                           library calls, master and slave code, conditional
                           Autotasking threshold tests, and Autotasking
                           initialization code.

                           The output of the translator is Fortran source
                           code with calls to machine-dependent library
                           routines and compiler intrinsic functions embedded
                           in the source code to control parallel execution.
                           FMP maintains the proper scope of variables as it
                           makes these changes.  Special intrinsic statements
                           not used by normal Fortran programs are inserted
                           by the translator.  Directives that FMP recognizes
                           are described in "Concepts and Directives," page
                           51.



     CFT77
     2.1.3
                           The code generation phase is the CFT77 compiler,
                           which takes the output of the translator and
                           produces executable machine code.  The Autotasking
                           intrinsic functions are recognized by CFT77 and
                           cause inline code to be generated.

                           Each of these phases of the Autotasking system
                           contributes to the overall compilation time for
                           Autotasking.  Generally, compilation time is a
                           function of the number of lines of source code
                            processed in each phase.  The transformations
                            produced by FPP usually result in some increase
                            in the number of lines of source code; FMP,
                            however, may increase the size of the source
                            file substantially.  This can result in much
                            longer processing time by CFT77 than compiling
                            the original Fortran source file requires.  See
                            "Master and slave tasks," page xx.x 0, for the
                            reasons for the increase.

                           Successful completion of these first three phases
                           plus SEGLDR results in the creation of an absolute
                           binary file (a.out) that reflects the contents of
                           the source files and any referenced library
                           routines.  Figure 5, page 17, summarizes the role
                           of FMP and the CFT77 compiler in the CF77
                           compiling system.  See Figure 6, page 18, for an
                           overview of the entire CF77 compiling system
                           process.




     SG-3074 5.0               Cray Research, Inc.                         15


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Compiling system
     advantages
     2.1.4
                           The effort involved in these three phases may seem
                           redundant.  After all, each phase parses Fortran,
                            analyzes the scope of variables, and so on.  It
                            might seem simpler to perform these tasks once,
                           but CRI chose to implement Autotasking this way
                           for the following reasons.  First, separate phases
                           allow the most flexibility to change one phase
                           without affecting other phases.  Second, keeping
                           separate phases had the smallest impact on the
                           existing functions of the CFT77 compiler, which
                           has many functions besides parallel processing.

                           These three phases optionally create source output
                           files and listings that let you see what
                           Autotasking is doing to your program, and they let
                           you feed your insight back into Autotasking in the
                           form of directives.

                           You can look at the generated source output to see
                           how it differs from your original program, and why
                           the dependence analyzer may not have found
                           parallelism you think exists.  You may also want
                           to look at the FPP diagnostic listing (see the
                           following subsections for details on getting a
                           listing).  You can add directives of your own at
                           this point or continue with what the dependence
                           analyzer found.






























     16                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________


                                  FMP and CFT77




       FMP - The translation phase transforms a Fortran source file for
       multitasking.  Autotasking directives are translated into
       Autotasking intrinsic functions or library calls, master and slave
       code, conditional Autotasking threshold tests, and Autotasking
       initialization code.





       CFT77 - The code generation phase is the CFT77 compiler, which
       takes the output of the translator and produces executable machine
       code.  The Autotasking intrinsic functions are recognized by CFT77
       and cause inline code to be generated.
     ________________________________________________________________________

        Figure 5.  Summary of FMP and CFT77 roles in CF77 compiling system



































     SG-3074 5.0               Cray Research, Inc.                         17


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     See the printed manual for this figure; it doesn't display on-line.



                   Figure 6.  Phases of Autotasking using cf77

                           You can also invoke the dependence analysis phase
                           and the translation phase separately.  The output
                           of FPP provides a very detailed view of the way
                           Autotasking affects your code.  However, the
                           output of FMP has calls to library routines,
                           intrinsic statements that refer to machine-
                           specific hardware registers, and other code that
                           generally make the output difficult to read.

                           The following subsection explains the CF77 user
                           interface (the cf77 command) in more detail.  See
                           "Invoking FPP and FMP Directly," page 35 for
                           details of the commands for FPP and FMP.  The
                           UNICOS cf77 man page is included in "UNICOS
                           Command Pages," page xx.x 0.

                           You may also interact with the compiling system at
                           a lower level by using compiler directives in your
                           source code.  See "Concepts and Directives," page
                           51, for more information on the use of compiler
                           directives.




     UNICOS user interface
     2.2
                           Under UNICOS, you have several choices of how to
                           interact with the CF77 compiling system.  As
                           explained previously, you can use the cf77
                           command, which functions similarly to the cc
                           command found on UNICOS systems.  cf77 provides a
                           one-line command to analyze, translate, compile,
                           and load a Fortran program, letting you ignore the
                           details of invoking the compiler and loader.
                           Figure 6, page 18, provides an overview of the
                           CF77 system.

                           The cf77 command also lets you pass options to all
                           phases of the compiling system.  You can also
                           communicate directly with each of the phases of
                           the compiling system, as follows:

                           * You can direct the actions of the dependence
                             analyzer, FPP, by using the fpp command

                           * You can direct the actions of the translation
                             phase, FMP, by using the fmp command



     18                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           * You can direct the actions of the CFT77
                             compiler, the code generation phase, by using
                             the cft77 command

                           For a complete list of fpp and fmp options, see
                           "Invoking FPP and FMP Directly," page 35.  For
                           more information about using the cft77 command,
                           see CF77 Compiling System, Volume 1:  Fortran
                           Reference Manual, publication SR-3071.

                           The following subsections describe the use of the
                           cf77 command and some of the most commonly used
                           cf77 options.  A complete description of the  cf77
                           command is provided by the cf77 man page located
                           in "UNICOS Command Pages," page xx.x 0.



     Using the cf77
     command
     2.2.1
                           The cf77 command serves as an "overcompiler,"
                           which invokes the appropriate phases of
                            Autotasking (that is, FPP, FMP, the CFT77 compiler,
                           and the loader) to build an executable program
                           based on the defaults and options.  The cf77
                           command is summarized in Figure 7, page 21, and
                           discussed in the following subsections.

                            To use cf77 to compile a nonautotasked program,
                            creating an executable file named a.out, enter
                            the following:

                                cf77 abc.f


                           To autotask the same program, add the -Zp option
                           to the cf77 command line:

                                cf77 -Zp abc.f


                           This shows the simplest way to invoke Autotasking
                           on a program; it invokes the three compiling
                           system phases (FPP, FMP, and CFT77), loads object
                           files, and produces an executable binary file
                           (a.out).

                           A simplified "expanded" version of a cf77 command
                           is as follows.  The "expanded" version does not
                            include all options for the commands, but it shows
                            the order in which the phases are invoked.

                                cf77 -Zp file.f

                           Expanded version:




     SG-3074 5.0               Cray Research, Inc.                         19


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                fpp file.f > file.m
                                fmp file.m > file.j
                                cft77  -a stack  file.j
                                rm file.m
                                rm file.j
                                segldr file.j.o
                                rm file.j.o






















































     20                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________


                              cf77 Command Summary



       cf77 - A one-line command to analyze, translate, compile and load a
       Fortran program.


       FPP, FMP, CFT77, and SEGLDR can also be invoked with separate
       commands.


       A simplified "expanded" version of a cf77 command is as follows:

       cf77 -Zp file.f


       Expanded version:

       fpp file.f > file.m
       fmp file.m > file.j
       cft77  -a stack  file.j
       rm file.m
       rm file.j
       segldr file.j.o
       rm file.j.o

     ________________________________________________________________________

                         Figure 7.  cf77 command summary

                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                  Note

                           Autotasked codes must run in STACK allocation
                           mode.  If you are trying to autotask a code that
                           has been running in STATIC allocation mode, first
                           get the program running and debugged in STACK mode
                           before trying Autotasking.
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

















     SG-3074 5.0               Cray Research, Inc.                         21


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The cf77 command has options that let you exercise
                           more control over the phases of the Autotasking
                           system.  With the cf77 command-line options, you
                           have a high degree of flexibility; for example,
                           you can do the following:

                           * Bypass different phases of the compiling system

                           * Create only an intermediate source file

                           * Create only an object file

                           * Obtain command information

                           * Pass information directly to the CFT77 compiler,
                             FPP, FMP, and SEGLDR

                           Figure 8, page 23, summarizes these control
                           options; the following subsections explain them
                           (and other, miscellaneous options) and how to use
                           them.


     Compiling system
     control options (-Z)
     2.2.1.1
                            The cf77 command provides control over the CF77
                            compiling system through the following -Z
                            options:

                           Option   Description
                           ------   -----------

                            -Zp      Selects parallel dependence analysis and
                                     translation.  The full compiling system
                                    (all three phases) is invoked.
                                    Intermediate source files are deleted by
                                    the cf77 command.

                           -Zu      Invokes only FMP, the CFT77 compiler, and
                                    SEGLDR.  Intermediate source files are
                                    deleted by the cf77 command.  For
                                    example, if you had a program that
                                    contained microtasking directives, you
                                     might want to skip the dependence
                                     analysis phase (FPP) of the compiling
                                    system by using cf77 -Zu.













     22                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                          cf77 Command Control Options


       With the cf77 command-line options, you have a high degree of
       flexibility; for example, you can:


       * Bypass different phases of the compiling system



       * Create only an intermediate source file



       * Create only an object file



       * Obtain command information



       * Pass information directly to CFT77, FPP, FMP, and SEGLDR
     ________________________________________________________________________

                  Figure 8.  cf77 command control option summary

                                    Option   Description
                                    ------   -----------

                                     -Zv      Selects vector enhancements only
                                              and invokes FPP, the CFT77
                                             compiler, and SEGLDR.  FPP
                                             performs only vectorization
                                             enhancements for any input file
                                             of the form file.f; no
                                             Autotasking directives are
                                             inserted.  (Additional input
                                             files of the form file.o are
                                             passed to the loader.  If they
                                             contain Autotasking directives,
                                             they are recognized by the
                                             compiling system.)  Intermediate
                                             source files are deleted by the
                                             cf77 command.

                                    -Zc      Invokes the compiler and loader;
                                             this is the default control
                                             option.

                                    -Zm      Selects microtasking.  Invokes
                                             PREMULT as the translation phase
                                             (rather than FMP) and the
                                             compiler (CFT or CFT2) only.
                                             Intermediate source files are


     SG-3074 5.0               Cray Research, Inc.                         23


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                             deleted by the cf77 command.

                                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Note

                                             The premult(1) command, which
                                             invokes PREMULT, will no longer
                                             be supported in the CF77 6.0
                                             release.  The fmp(1) command
                                             provides equivalent
                                             functionality to premult; you
                                             are encouraged to switch from
                                             premult to fmp at your earliest
                                              convenience.
                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                            In addition, the -Zm option will continue to be
                            supported, but it will invoke FMP rather than
                            PREMULT.  Because the -Zm option will no longer
                            be supported in a future release, you are
                            encouraged to switch to the -Zu option at your
                            earliest convenience.

                            The -ZP, -ZU, and -ZV options provide the same
                            functionality as -Zp, -Zu, and -Zv, respectively.
                            However, the uppercase versions of these options
                            force the cf77 command to leave all intermediate
                            source files intact.
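
                            For example (using a hypothetical source file
                            xyz.f), a command line such as the following
                            autotasks the program but, because the uppercase
                            -ZP option is used, leaves the intermediate
                            source files produced by FPP and FMP intact for
                            inspection:

                                 cf77 -ZP xyz.f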



     Intermediate source
     file options
     2.2.1.2
                           If you want to create intermediate source files
                           and exit the compiling system without creating an
                           object file or an executable binary file, use the
                           following options:

                           Option   Description
                           ------   -----------

                           -M       Runs only FPP and produces an
                                    intermediate source file.  The
                                    intermediate source file is named using
                                    the original source file name suffixed
                                    with .m.  Option
                                    -Zv or -Zp must also be selected.

                           -J       Runs only FPP and FMP and produces an
                                    intermediate source file.  The
                                    intermediate source file is named using
                                    the original source file name suffixed
                                    with .j.  Option -Zu or -Zp must also be
                                    selected; otherwise, the compilation
                                    aborts and returns an error message.




     24                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The output from both of these options (file.m and
                           file.j) can be used as input to the cf77 command.
                           When the file is used as input, the compiling
                           system recognizes the file type and invokes the
                           correct processing phase.

                           Examples:

                                cf77  -Zp  file.m
                                cf77  -Zp  xxx.f  yyy.m  zzz.j  www.o
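
                            To create such intermediate files in the first
                            place, command lines like the following (shown
                            with a hypothetical source file abc.f) could be
                            used; the first produces abc.m, and the second
                            produces abc.j:

                                 cf77  -Zp  -M  abc.f
                                 cf77  -Zp  -J  abc.f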



     Object file creation
     2.2.1.3
                           The -c option is provided to create an object file
                           and then exit the compiling system before invoking
                           SEGLDR.

                           Option   Description
                           ------   -----------

                           -c       Forces object files to be produced.  The
                                    object file is named using the original
                                    source file name suffixed with .o.  If
                                    you specify -c, SEGLDR is not invoked.
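
                            For example, the following command line (with a
                            hypothetical source file abc.f) autotasks the
                            program and produces an object file abc.o without
                            invoking SEGLDR:

                                 cf77 -Zp -c abc.f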


     cf77 command
     information
     2.2.1.4
                           The cf77 command executes various commands.  You
                           can obtain a log of the commands issued by cf77 by
                           using one of the following verbose mode options:

                           Option   Description
                           ------   -----------

                           -v       Specifies verbose mode.  Writes output to
                                    stderr (normally your screen) indicating
                                    each phase of the compilation as it
                                    occurs, as well as all options and
                                    arguments being passed to each phase.

                           -T       Disables the entire compiling system but
                                    displays all options currently in effect
                                    for the individual commands corresponding
                                    to each system phase.  This information
                                    is the same as that given by option -v,
                                    but with no processing.  This output is
                                    written to stderr (normally your screen).
                                    In the following example, your input is
                                    shown in typewriter bold.



     SG-3074 5.0               Cray Research, Inc.                         25


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ------------------------------------------------------------------------
     cf77  -Zp  -T  file.f
     /bin/fpp: fpp file.f > /tmp/jtmp.000526a/ct2BAAa60788
     /bin/fmp: fmp -c /tmp/jtmp.000526a/ct4DAAa60788.s \
	/tmp/jtmp.000526a/ct2BAAa60788
     /tmp/jtmp.000526a/ct3CAAa60788.f
     /bin/cft77: /bin/cft77 -b file.o -a stack \
	/tmp/jtmp.000526a/ct3CAAa60788.f
     /bin/segldr: segldr file.o
     
     ------------------------------------------------------------------------


     Using the -W option
     2.2.1.5
                           The following -W options let you pass arguments to
                           individual components of the CF77 compiling
                           system.

                           Option       Description
                           ------       -----------

                           -Wf"optstring"
                                        Passes options contained in optstring
                                        to the CFT77 compiler

                           -Wd"optstring"
                                        Passes options contained in optstring
                                        to FPP

                           -Wu"optstring"
                                        Passes options contained in optstring
                                        to FMP

                           -Wl"optstring"
                                        Passes options contained in optstring
                                        to SEGLDR

                           You must separate multiple options with spaces.

                           For example, to obtain a load map from SEGLDR, use
                           the following command:

                                cf77 -Zp -Wl"-M,f" prog1.f


                           The following example passes options to FPP using
                           the cf77 -Wd mechanism and illustrates invoking
                           FPP analysis of inner loops for Autotasking:

                                cf77 -Zp -Wd"-ei" prog2.f


                           The following example disables CFT77 double
                           precision and defines integers to consist of 64
                           bits.

                                cf77 -Zp -Wf"-dp -i64" prog3.f



     26                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           You can specify more than one -W option on the
                           same cf77 command line.

                           The following example shows the command line if
                           you want to use atexpert:

                                cf77 -ZP -Wd"-ei -d0" -Wu"-p" prog4.f


                           The previous command line specifies use of the
                           entire compiling system, enables FPP inner-loop
                           analysis and disables threshold test generation,
                           and generates FMP output suitable for use with
                           atexpert.


     Additional cf77
     options
     2.2.1.6
                            The following list shows some additional cf77
                            options.  Although the functionality of each of
                            these options can be duplicated with one or more
                            -W options, doing so is not recommended.

                           Option    Description
                           ------    -----------

                            -l name   Identifies library files.  If name
                                      begins with . or /, it is assumed to be
                                      a path name, and SEGLDR uses it as is.
                                      Otherwise, SEGLDR checks first for file
                                      /lib/libname.a, then for file
                                      /usr/lib/libname.a, and uses the first
                                      one found.  See the -L option.

                           -L libdir Passes the directory name libdir to
                                     SEGLDR as the directory in which to find
                                     default libraries during the load phase.

                           -F        Enables the Flowtrace option for CFT77.

                           -g        Invokes CFT77 debugging options -ez and
                                     -o off, but inhibits Autotasking.  You
                                     can generate debug symbols when doing
                                     Autotasking by specifying cf77 options
                                     -Wf"-ez" or
                                     -Wf"-ez -ooff".  You can enable
                                     Autotasking debugging by using the -G
                                     cf77 option.

                           -G        Invokes FPP, FMP, and the CFT77
                                     debugging option -ez.  When this option
                                     is used with the
                                     -Zp option, Autotasking debugging is
                                     enabled.
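
                            For example, a hypothetical command line such as
                            the following autotasks prog.f with Autotasking
                            debugging enabled and directs SEGLDR to look for
                            default libraries in directory /usr/local/lib:

                                 cf77 -Zp -G -L /usr/local/lib prog.f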




     SG-3074 5.0               Cray Research, Inc.                         27


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           For a complete description of the options for
                           cf77, see the cf77 manual page in "UNICOS Command
                           Pages," page xx.x 0.




     UNICOS environment
     variables
     2.3
                           Environment variables are predefined shell
                           variables that determine some of the
                           characteristics of your shell.  These environment
                           variables are taken from the execution
                           environment, and they can affect your parallel
                           processing environment.  Environment variables
                           affecting parallel processing are summarized in
                           Figure 9, page 29, and discussed in the following
                           subsections.






































     28                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                              Environment Variables

       * All systems (general compiling system)

         - NCPUS

         - CAL

         - CFT77

         - FPP

         - FMP

         - SEGLDR

       * CX/CEA systems (multitasking)

         - MP_DEDICATED

         - MP_MAXCPU

         - MP_DBACTIVE

         - MP_DBRELEAS

         - MP_HOLDTIME

         - MP_SAMPLE

         - MP_PRIORITY

         - MP_SLVPRI

         - MP_STACKSZW

         - MP_STACKINW

         - MP_SLVSSZ

         - MP_SLVSIN

       * CRAY-2 systems (multitasking)

         - MICRO_NICE

         - MICRO_TIMEOUT
     ________________________________________________________________________

             Figure 9.  Summary of multitasking environment variables







     SG-3074 5.0               Cray Research, Inc.                         29


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Environment variables
     for all systems
     2.3.1
                           The first variable available on all systems is
                           NCPUS; it specifies the number of tasks available
                           to an autotasked program.  When debugging
                           autotasked code, set NCPUS=1, which allows only
                           the master task to execute (that is, no slave
                           processes are created).  See "Debugging Autotasked
                           Programs," page xx.x 0, for more specific
                           information.

                           Generally, the default value for NCPUS is the
                           number of physical processors in the system.  If
                           you specify NCPUS to be greater than the number of
                           physical processors available, unnecessary
                           overhead will be incurred.
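
                            For example, to run only the master task while
                            debugging an autotasked program (a hypothetical
                            executable a.out, using the standard shell), you
                            might enter the following:

                                 NCPUS=1
                                 export NCPUS
                                 a.out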

                            Other environment variables that you can use to
                            specify particular versions of software are as
                            follows:

                            Variable   Description
                            --------   -----------

                            CAL        File name of as(1), the CAL assembler
                                       (default is /bin/as).

                            CFT77      File name of the CFT77 compiler
                                       (default is /bin/cft77).

                            FPP        File name of the fpp(1) dependence
                                       analyzer (default is /bin/fpp).

                            FMP        File name of the fmp(1) translator
                                       (default is /bin/fmp).

                            PREMULT    File name of premult(1), the
                                       microtasking preprocessor (default is
                                       /bin/premult).

                            SEGLDR     File name of the loader (default is
                                       /bin/segldr).

                           Setting any of these variables lets you use
                           versions of the software that are not the default;
                           for example, you could test a new release level of
                           a compiler before it becomes the default.
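
                            For example, to test a hypothetical new CFT77
                            release installed as /usr/new/cft77 (an
                            illustrative path name only), you might enter the
                            following before invoking cf77:

                                 CFT77=/usr/new/cft77
                                 export CFT77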



     Additional
     environment variables
     for CX/CEA systems
     2.3.2
                           Additional environment variables available on
                           CX/CEA systems are described in the following
                           list.  Many of these environment variables control
                           TSKTUNE tuning keywords, which let you tune the
                           system for parallel processing without rebuilding
                           libraries or other system software.  See section 5
                           in the CRAY Y-MP, CRAY X-MP EA, and CRAY X-MP
                           Multitasking Programmer's Manual, publication
                           SR-0222, for more information about TSKTUNE.







     30                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Variable   Description
                           --------   -----------

                           MP_DEDICATED
                                      If set to 1, indicates you are running
                                      a multiprocessed application in a
                                       dedicated machine environment.  Slave
                                       processors wait in user space instead
                                       of returning to the operating system.
                                       If MP_DEDICATED is set to 0 or is not
                                       set at all, slave processors return
                                       to the operating system after waiting
                                       in user space for 50,000 clock
                                       periods.
                                       When MP_DEDICATED is set to anything
                                       other than 1 or 0, the behavior is
                                       undefined.  Setting MP_DEDICATED to 1
                                       in a nondedicated machine environment
                                       degrades your program's throughput and
                                       overall system throughput.

                           MP_MAXCPU  Maximum number of CPUs allowed for
                                      macrotasking; the default is 16.

                           MP_DBACTIVE
                                      Number of additional user tasks that
                                      can be readied for execution before an
                                      additional logical CPU is acquired;
                                      this is called the activation deadband
                                      value.  The value of MP_DBACTIVE can
                                      range from 0 to the largest integer
                                      value (the number of logical CPUs is
                                      equal to the number of user tasks
                                      limited by MAXCPUS).  The initial value
                                      is 0.

                           MP_DBRELEAS
                                      Number of logical CPUs retained by the
                                      job if there are more CPUs than tasks;
                                      this is called the release deadband
                                      value.  Any CPUs in excess of this
                                      number are released to the system.  The
                                      initial value is set to 1 less than the
                                      number of physical CPUs available on
                                      the system or to 1, whichever is
                                      greater.  Setting MP_DBRELEAS to less
                                      than this value may cause an excessive
                                       number of CPUs to be released and
                                      acquired, and a correspondingly long
                                      list of CPUs in the log file.  The
                                      value of MP_DBRELEAS can range from 0
                                      (representing immediate return) to the
                                      value of MAXCPUS.

                           MP_HOLDTIME
                                      Number of clock periods (CPs) to hold a
                                      processor before giving up the CPU when
                                      no parallel work is available.  The
                                      default is 50,000 CPs.


     SG-3074 5.0               Cray Research, Inc.                         31


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Variable   Description
                           ________   ___________
                           MP_SAMPLE  Sample rate at which the ready mask is
                                      read when in the hold loop; the default
                                      is 150 CPs.  This means that a process
                                      checks for a ready task every 150 CPs
                                      while it is waiting for parallel work.

                           MP_PRIORITY
                                      Scheduling priority for macrotasks.
                                      Legal values are 0 to 63, 0 being the
                                      lowest priority.  The default is 31.
                                      When the library schedules queued
                                      tasks, higher priority tasks are
                                      scheduled first.

                           MP_SLVPRI  Scheduling priority for slave
                                      microtasks or autotasks.  Legal values
                                      are 0 to 63, 0 being the lowest
                                      priority.  The default is 0.

                           MP_STACKSZW
                                      Initial stack size for macrotasks.

                           MP_STACKINW
                                      Stack increment for macrotasks.

                           MP_SLVSSZ  Initial stack size for microtasking or
                                      Autotasking slaves.

                           MP_SLVSIN  Stack increment for microtasking or
                                      Autotasking slaves.
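
                            For example, for a run on a dedicated CX/CEA
                            system, you might keep slave processors waiting
                            in user space by setting the following (an
                            illustrative setting; see the MP_DEDICATED
                            description above):

                                 MP_DEDICATED=1
                                 export MP_DEDICATED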



     Additional
     environment variables
     for CRAY-2 systems
     2.3.3
                           Additional environment variables exist for use on
                           CRAY-2 systems, as follows:

                           Variable   Description
                           --------   -----------

                           MICRO_NICE Integer value used by the nice system
                                      call when the library starts the
                                      Autotasking slaves.  The default value
                                      is 4.  If you want to run the slaves at
                                      normal priority, MICRO_NICE should be
                                      set to 0.








     32                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Variable   Description
                           ________   ___________
                           MICRO_TIMEOUT
                                      Integer value that is the number of
                                      milliseconds that Autotasking slave
                                      tasks wait, looking for work, before
                                      they give up the CPU.  The default
                                      value is 4 ms.  If you are making
                                      dedicated runs, set MICRO_TIMEOUT to a
                                      large value (such as 10,000) to ensure
                                      that the CPUs are always connected to
                                      the job.
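
                            For example, for a dedicated run on a CRAY-2
                            system, you might run the slaves at normal
                            priority and keep them connected to the job with
                            illustrative settings such as the following:

                                 MICRO_NICE=0
                                 MICRO_TIMEOUT=10000
                                 export MICRO_NICE MICRO_TIMEOUT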















































     SG-3074 5.0               Cray Research, Inc.                         33



                                           Invoking FPP and FMP Directly  [3]
     ########################################################################







                           FPP, the dependence analysis phase of the CF77
                           compiling system, parses the original Fortran
                           source program, looks for parallelism within
                           program units, and produces a transformed Fortran
                           source file as output.  The primary function of
                           FMP, the translation phase, is to transform a
                           Fortran source file for multitasking.

                           Both of these phases can be invoked by using the
                           cf77 command, as discussed in section 2.  Both
                           phases can also be invoked directly.  This section
                           describes the commands to invoke FPP and FMP
                           directly.




     UNICOS fpp command
     3.1
                           The UNICOS fpp command invokes the dependence
                           analysis phase of Autotasking.  You can invoke FPP
                           either as a part of the CF77 compiling system, as
                           described in section 2, or separately, by using
                           the fpp command.

                           When fpp is executed as a separate command, it has
                           the following syntax:

     -----------------------------------------------------
     fpp [-C routine1,routine2,...] [-d optoff]
       [-D directive[:sub1,sub2,...]] [-e opton] [-F file]
       [-H directory] [-I routine1,routine2,...]
       [-l listingfile] [-M lines] [-N80] [-o outputfile]
       [-p liston] [-P pagelength] [-q listoff]
       [-Q tempspace] [-r formaton] [-n formatoff]
       [-S file1,file2,...] [-T threshold] [-V] file.f
     -----------------------------------------------------












     SG-3074 5.0               Cray Research, Inc.                         35


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Options to fpp are as follows:

                           Option     Description
                           ------     -----------

                           -C routine1,routine2,...
                                      Lists names of concurrently callable
                                      routines.

                           -d optoff
                           -e opton
                                      Enables (-e) or disables (-d)
                                      optimization option switches specified
                                      in optoff and opton.  The optimization
                                      switches are described in "fpp
                                      optimization switches," page 40, and
                                      also listed in Table 1, page 42.

                           -D directive[:sub1,sub2,...]
                                      Specifies a directive to be applied to
                                      certain routines, or to the whole input
                                      file if no routines are listed.

                           -F file    Specifies a file containing additional
                                      command-line options.  This option is
                                      useful when you have many command-line
                                      options for a program.  You cannot use
                                       tab characters in a -F file, nor can
                                       you use the -F option to specify
                                      the input file for fpp (it must appear
                                      on the command line).  A typical use
                                      for a -F file would be to specify -D
                                      options.  An example of a -F file
                                      follows the description of fpp options.

                           -H directory
                                      Specifies a directory in which to
                                      search for INCLUDE files.  This option
                                      can be repeated up to 10 times, but
                                      only one directory name is allowed per
                                      -H specification.  INCLUDE files are
                                      searched for first in the directory of
                                      the input source file, then in the
                                      directories named on -H options, in the
                                      order in which they were specified.

                           -I routine1,routine2,...
                                      Lists routines to be expanded inline.
                                      This option specifies only the names of
                                      routines to be expanded inline, and not
                                      their source location.  Source location
                                      can be specified by using the -S or -e8
                                      options, the SEARCH directive, or the
                                      default search method.  See "Where to
                                      find code for inline expansion, page
                                      180, for more information.  This option
                                      is different from the cf77 and cft77 -I
                                      options.
                                      You cannot specify file.f in your
                                      current working directory as an


     36                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                      argument to the -I option.

                           -l listingfile
                                      Directs the listing to file
                                      listingfile.  A listing is not produced
                                      unless this option is specified.

                           -M lines   Sets the maximum number of lines of
                                      code allowed for automatic inline
                                      expansion of any one routine; default
                                      is 50.

                           -N80       Specifies an 80-column Fortran input
                                      file rather than the default 72-column
                                      Fortran input file.

                           -o outputfile
                                      Directs the translated source to file
                                      outputfile rather than standard output.
                                      The output file is ready for processing
                                      by fmp(1) or cft77(1), the other
                                      components of the CF77 compiling
                                      system.

                           -p liston
                           -q listoff
                                      Enables (-p) or disables (-q) listing
                                      option switches specified in liston and
                                      listoff.  The listing switches are
                                      described in "fpp listing switches,"
                                      page 45, and listed in Table 2, page
                                      46.

                           -P pagelength
                                      Specifies the number of lines per page,
                                      for page-formatted listings.
                                      pagelength must be 9 or greater; the
                                      default is 66 lines.

                           -Q tempspace
                                      Specifies the size, in words, of space
                                      to be used for FPP-generated temporary
                                      arrays in any one program unit.  The
                                      default is 8191 words.  For more
                                      information, see "FPP Source Output,"
                                      page 191.

                           -r formaton
                           -n formatoff
                                      Enables (-r) or disables (-n)
                                      reformatting (TIDY) option switches
                                      specified in formaton and formatoff.
                                      The reformatting switches are described
                                      in "FPP TIDY Subprocessor," page xx.x
                                      0.








     SG-3074 5.0               Cray Research, Inc.                         37


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Option     Description
                           ______     ___________
                           -S file1,file2,...
                                      Specifies file names or complete path
                                      names of files in which routines to be
                                      expanded inline are located.  For
                                      example, any of the following
                                      specifications is acceptable:

                                        -S file.f
                                        -S /usr/fred/file.f
                                        -S /usr/fred/abc.f,xyz.f

                                      This option specifies only the source
                                      location of routines to be expanded
                                      inline.  To enable inlining, you must
                                      also specify the -I, -e6, or
                                      -e7 options, or use the AUTOEXPAND,
                                      EXPAND, or NEXPAND directives.

                           -T threshold
                                      Specifies the maximum Autotasking
                                      threshold value for comparison to the
                                      loop iteration count.  See "Threshold
                                      tests," page 149, for more information
                                      on default values and threshold
                                      testing.

                           -V         Displays current FPP version
                                      information to standard error (stderr)
                                      during execution.  If you specify the
                                      -V option on the fpp command line and
                                      do not specify an input file (for
                                      example, fpp -V), only version
                                      information is displayed.

                           By default, the translated Fortran source output
                           file is written to the standard output file
                           (usually the terminal); a listing file is not
                           produced.  If you invoke fpp without arguments, it
                           prints a short usage summary.



     fpp command examples
     3.1.1
                           This subsection contains example fpp command lines
                           and explanations for those command lines.

                           Example 1:

                           To run the Fortran source file crunch.f through
                           fpp, enter the following:

                           -------------------------------------------------
                           $ fpp crunch.f > crunch.m
                           -------------------------------------------------



     38                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The optimized output is sent to crunch.m.

                           Example 2:

                            To run crunch.f through fpp with inline expansion
                            enabled, saving the output in crunch.m; to run
                            that output (crunch.m) through fmp, saving the
                            output in file crunch.j; and then to compile the
                            translated code, enter the following:

                           -------------------------------------------------
                           $ fpp -e 78 crunch.f > crunch.m
                            $ fmp crunch.m > crunch.j
                           $ cft77 -a stack crunch.j
                           -------------------------------------------------

                           The output of the last command is crunch.o, which
                           can then be loaded.

                            Example 3:

                           The following is an example of a typical -F file
                           (fppopts):

                                -Dnoinner:sub1
                                -Dnexpand(sub2):sub1#/usr/psr
                                -Ffile2.com           (Nested command file)
                                -D relation(n.gt.32):sub2
                                -Dswitch,tdyoff=p,indal=5,renumb=1000:100


                           To run the source file prog.f through fpp using
                           the options in file fppopts and producing an fpp
                           listing (prog.l), enter the following:

                           -------------------------------------------------
                           $ fpp -F fppopts -l prog.l prog.f > prog.m
                           -------------------------------------------------

                           The output is sent to file prog.m.

                           Example 4:

                            A source file references INCLUDE files contained
                            in three directories:  the directory of the source
                            file, /usr/joe/inc, and /usr/jane/misc.  The
                            following fpp command line specifies the
                            directories that contain the INCLUDE files not
                            located in the same directory as the source file:

     ------------------------------------------------------------------------
     $ fpp -H /usr/joe/inc -H /usr/jane/misc -o myprog.m myprog.f
     ------------------------------------------------------------------------

                           File myprog.f is the main input file; file
                           myprog.m will be the output file.





     SG-3074 5.0               Cray Research, Inc.                         39


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Example 5:

                           The -S option can be used with either automatic or
                           explicit inlining.  File routines.f contains
                           subroutines that you want considered for automatic
                           inlining.  To enable automatic inlining, and to
                           specify the location of the files to be inlined,
                           enter the following:

                           -------------------------------------------------
                           $ fpp -e7 -S routines.f program.f > program.m
                           -------------------------------------------------

                           File program.f is the main input source file.
                            Specifying the -e7 option enables automatic
                            inlining, the -S option tells FPP where to look
                            for the files to be inlined, and the output goes
                            to file program.m.
                           If you have more than one file containing routines
                           to be inlined, they can be specified with the -S
                           option, separated by commas, as shown in the
                           following example command line:

     ------------------------------------------------------------------------
     $ fpp -S r1.f,r2.f -I solvx,solvy,solvj -o comp.m comp.f
     ------------------------------------------------------------------------

                           In this case, you invoked explicit inlining (-I)
                           for any calls to routines solvx, solvy, and solvj
                           that occur in input file comp.f.  The -S option
                           tells FPP to look for the source for solvx, solvy,
                           and solvj in files r1.f and r2.f.



     fpp optimization
     switches
     3.1.2
                            Switches, also called option-arguments, let you
                            control the optimizations that FPP performs.
                            These switches are called optimization switches.

                           You can pass optimization switches in any of the
                           following ways:

                           * Using the -d (disable) and -e (enable) options
                             of fpp

                           * Using the -Wd"-d" and -Wd"-e" options of cf77

                           * Using the SWITCH directive

                           For more information about the use of the SWITCH
                           directive, see "Concepts and Directives," page 51.
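
                            For example, each of the following command lines
                            disables the a switch (associative
                            transformations) for a hypothetical file prog.f;
                            the exact quoting shown for the cf77 form is
                            illustrative:

                            -------------------------------------------------
                            $ fpp -d a prog.f > prog.m
                            $ cf77 -Wd"-d a" prog.f
                            -------------------------------------------------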





     40                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Table 1, page 42, shows the optimization switches
                           that affect the transformation of the input
                           program.  For example, specifying fpp option -d el
                           means that EQUIVALENCE statements are not examined
                           for data dependency analysis, and IF loops are not
                           converted to DO loops.

                           Some switches duplicate or overlap the functions
                           of directives.  For example, the -d d switch is
                           equivalent to the NODEPCHK directive with file
                           scope (CFPP$ NODEPCHK F).
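
                            Thus the following command line and a CFPP$
                            NODEPCHK F directive placed at the top of the
                            source file (the file names shown are
                            illustrative) have the same effect:

                            -------------------------------------------------
                            $ fpp -d d prog.f > prog.m
                            -------------------------------------------------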

                           Switches that correspond to directives (a, c, d,
                           e, i, r, u, and v) may be toggled more than once
                           within a routine (using the SWITCH directive).
                           Switches that do not correspond to directives (b,
                           f, h, j, k, l, m, o, p, s, t, y, 0, 1, 4, 5, 6,
                           and 7) can have only one valid setting for any one
                           routine; if they are set more than once within a
                           routine, only the last setting is used.

                           The q, x, and 8 switches are valid only as
                           command-line option-arguments.






































     SG-3074 5.0               Cray Research, Inc.                         41


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

        Table 1.  Optimization switches enabled and disabled by -e and -d

     ________________________________________________________________________

     Switch        Description                                       Default
     ________________________________________________________________________

        a           Allows associative transformations.  -d a         ON
                    is equivalent to the NOASSOC directive
                    with file scope.

        b           Generates linear recursion library calls.        OFF

        c           Autotasks loops; all loops (inner and             ON
                    outer) that have enough work to justify
                    concurrent execution are analyzed for
                    Autotasking.  -d c is equivalent to the
                    NOCONCUR directive with file scope.

       d           Does not ignore potential data                    ON
                   dependencies.  -d d is equivalent to the
                   NODEPCHK directive with file scope.

       e           Examines EQUIVALENCE statements for data          ON
                   dependency.  -d e is equivalent to the
                   NOEQVCHK directive with file scope.

       f           Generates BTRNSFRM and ETRNSFRM markers           OFF
                   for debugging.

       h           Allows parallel case optimization.                ON
                   Ignored if the c switch (autotask) is off.

       i           Analyzes inner loops with variable                OFF
                   iteration counts at compile time to
                   determine whether they are candidates for
                   Autotasking.  By default, outer loops and
                   inner loops that obviously have enough
                   work are autotasked.  For inner loops with
                   high iteration counts and many statements,
                   enabling this option may improve
                   performance.  -e i is equivalent to the
                   INNER directive with file scope.  If the c
                   switch (autotask) is off, the i switch is
                   ignored.

       j           Translates nested loop idioms, such as            ON
                   matrix multiplication, matrix-vector
                   multiplication, and rank one update, to
                   library calls.

       k           Treats D in column 1 as a comment                 ON
                   character.  If this switch is off, a D in
                   column 1 is treated as a blank.  This
                   switch provides compatibility with a
                   debugging feature of some compilers.

       l           Transforms IF loops to DO loops.                  ON

       m           Generates alternative code for potential          ON
                   dependencies.  If this switch is off,
                   loops containing potential data
                   dependencies will not be optimized.
     ________________________________________________________________________



     42                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Table 1.  Optimization switches enabled and disabled by -e and -d
               (continued)


     ________________________________________________________________________
     Switch        Description                                       Default
     ________________________________________________________________________

       o           Specifies a minimum DO loop trip count of         OFF
                   1.  Provides compatibility with ANSI '66
                   Fortran compilers.

       p           Collapses loop nests into single loops            ON
                   when possible.

       q           Takes error exit if syntax or fatal errors        OFF
                   are found.  If this switch is on and fpp
                   detects a syntax or fatal error, it
                   returns an error code of 2.  If fpp was
                   invoked by cf77, cf77 would cease
                   processing at this point.

       r           Splits user subroutines and functions out         OFF
                    of loop nests where possible.  This
                   sometimes results in additional loops
                   being autotasked.  -e r is equivalent to
                   the SPLIT directive with file scope.

       s           Permits loop splitting to isolate                 ON
                   recursion, which permits partial
                   vectorization of loops.

       t           Specifies use of aggressive loop exchange         OFF
                   criteria.  Weights desirability of stride
                   one vectors and increased vector length
                   more heavily compared to retaining
                   original loop nest ordering.

       u           Generates final values for transformed            ON
                   scalars when appropriate.  -d u is
                   equivalent to the NOLSTVAL directive with
                   file scope.

       v           Enhances CFT77 vectorization.  -d v is            ON
                   equivalent to the NOVECTOR directive with
                   file scope.  If this switch is off, the b,
                   m, p, r, and s switches are not
                   meaningful.

       x           Creates optimized source file.  This              ON
                   switch may be turned off if only the
                   diagnostic listing is desired.  Turning
                   this switch off may speed compile time and
                   reduce disk space used.  The setting of
                   this switch does not affect the listing of
                   the transformed source in the listing
                   file.  This switch is valid only as a
                   command-line argument; it may not be
                   specified with the SWITCH directive.

       y           Reformats only restructured loops.  -d y          ON
                   causes the entire program unit to be
                   reformatted with the TIDY subprocessor.
     ________________________________________________________________________


     SG-3074 5.0               Cray Research, Inc.                         43


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Table 1.  Optimization switches enabled and disabled by -e and -d
               (continued)

     ________________________________________________________________________
     Switch        Description                                       Default
     ________________________________________________________________________

       0           Generates Autotasking threshold tests for         ON
                   comparison to loop iteration counts.

       1           Converts array syntax to DO loops.                ON

       4           Asserts that first values of private              OFF
                   arrays are not needed.  In some cases, it
                   allows more loops to be autotasked.

        5           Generates output for CRAY-2 systems.  If          ON
                    this option is enabled, strides are               (CRAY-2
                    heavily weighted as a negative factor in          systems)
                    choosing vector loops and CRAY-2 systems          OFF
                    threshold tests are generated.  See               (all
                    "Threshold tests," page 149, for more             other
                    information on threshold testing.                 systems)

       6           Automatically expands called routines             OFF
                   inline (always safe).  The subroutines and
                   functions must meet certain criteria.
                   Always produces safe code.
                   This option must be used in conjunction
                   with the -S or -e8 options to enable
                   inlining; used alone, it does not provide
                   information about the location of routines
                   to be inlined.  Information about routine
                   expansion or about why routines were not
                   expanded is sent to the file specified
                   with the -l option.

       7           Automatically expands called routines             OFF
                   inline (rarely unsafe).  The subroutines
                   and functions must meet certain criteria.
                   Usually exploits many more opportunities
                   for inline expansion than the 6 switch,
                   but in rare cases, creates incorrect code
                   because of adjustable array dimensioning
                   problems.  Using the -e 7 option is
                   equivalent to the AUTOEXPAND directive
                   with file scope.
                   This option must be used in conjunction
                   with the -S or -e8 options to enable
                   inlining; used alone, it does not provide
                   information about the location of routines
                   to be inlined.  Information about routine
                   expansion or about why routines were not
                   expanded is sent to the file specified
                   with the -l option.



     44                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

       8           Searches input file for expandable                OFF
                   routines.  This switch is valid only as a
                   command-line option; it may not be
                   specified with the SWITCH directive.
                   The only purpose of this switch is to set
                   the search path used by FPP when searching
                   for routines to be expanded inline.  To
                   enable inlining, you must also specify the -I,
                   -e 6, or -e 7 options; or you must insert
                   AUTOEXPAND, EXPAND, or NEXPAND directives
                   in your source code.
     ________________________________________________________________________




     fpp listing switches
     3.1.3
                           Switches (or option-arguments) also let you
                           control the contents of the listing file for FPP.
                           These switches are called listing switches.

                           You can pass listing switches in any of the
                           following ways:

                           * Using the -p (enable) and -q (disable) options
                             of fpp

                           * Using the -Wd"-p" and -Wd"-q" options of cf77

                           * Using the SWITCH directive

                           For more information about the use of the SWITCH
                           directive, see "Concepts and Directives," page 51.

                           Table 2, page 46, shows the switches that control
                           the format of the listing file.  For example, if
                           you want to get a 132-column printer listing
                            without warning messages or an event summary,
                            specify the -q twe option on the fpp command line.
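
                            For example, the following command line (the
                            file names are illustrative) writes such a
                            listing to file prog.l:

                            -------------------------------------------------
                            $ fpp -q twe -l prog.l prog.f > prog.m
                            -------------------------------------------------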

                           The TIDY subprocessor is another feature of FPP
                           that improves the readability of the output code,
                           either by using default standards, or according to
                           user-specified parameters.  By default, TIDY is
                           applied only to loops that require restructuring
                           in order to be vectorized and to be run in
                           parallel.  To apply TIDY to the entire program
                           unit, use the -dy option of fpp or the -Wd"-dy"
                           option of cf77.  See "FPP TIDY Subprocessor," page
                           xx.x 0, for more information on TIDY switches.
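
                            For example, either of the following command
                            lines (the file names are illustrative) applies
                            TIDY to the entire program unit:

                            -------------------------------------------------
                            $ fpp -dy prog.f > prog.m
                            $ cf77 -Wd"-dy" prog.f
                            -------------------------------------------------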






     SG-3074 5.0               Cray Research, Inc.                         45


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

           Table 2.  Listing switches enabled and disabled by -p and -q

     ________________________________________________________________________
     Switch        Description                                       Default
     ________________________________________________________________________

       b           Lists corresponding input line numbers in         ON
                   columns 73 through 80 of output listing of
                   the transformed source.  This switch is
                   valid only if the n switch is on.  This
                   listing feature is useful in relating
                   transformed source lines to original
                   source lines.

       c           Lists data dependency conflict messages.          ON

       d           Lists declarations added by FPP.  This            OFF
                   switch is valid only if the n switch is
                   on.

       e           Lists event summary at end of routine.            ON

       f           Lists fatal error messages.                       ON

       g           Lists translation diagnostics.                    ON

       h           Lists input source lines.                         ON

       i           Lists lines that come from INCLUDE files.         ON
                   When this switch is on, source lines
                   obtained from INCLUDE files are listed.
                   They are identified by a dash following
                   the line number.  This switch is valid
                   only if the h switch (list source lines) is on.

       l           Produces a listing.  -q l is equivalent to        ON
                   the NOLIST directive.

       n           Lists translated code.                            ON

       p           Lists loop summary at end of routine.             ON

       s           Lists only summary information.  If the s         OFF
                   switch is used, the c, d, e, f, g, h, i,
                   l, n, p, t, w, and y listing switches are
                   ignored.

       t           Formats FPP listing for a terminal (format        ON
                   output for 80 columns).  -q t results in a
                   wide-format listing file, with printer
                   control, pagination, and page headers,
                   suitable for a 132-column line printer.

       u           Shows extent and disposition of loops in          ON
                   source code.

       w           Lists warning messages.                           ON

       y           Lists syntax errors.                              ON
     ________________________________________________________________________





     46                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     UNICOS fmp command
     3.2
                           The UNICOS fmp command invokes the translation
                           phase of Autotasking to translate Autotasking and
                           microtasking directives.  You can invoke FMP
                           either as a part of the CF77 compiling system, as
                           described in "CF77 User Interface," page 11, or
                           separately, by using the fmp command.

                           When fmp is executed as a separate command, it has
                           the following syntax:

     ------------------------------------------------------------------------
     fmp [-c file] [-d optoff] [-e opton] [-f] [-g gvalue]
       [-i] [-I directory] [-l] [-N80] [-p] [-s file]
       [-S] [-V] [input_file] [output_file]
     ------------------------------------------------------------------------

                           Options to fmp are as follows:

                           Option     Description
                           ------     -----------

                           -c file    Specifies file for CAL source stub
                                      program; used for microtasking
                                      routines.  If you do not specify file,
                                      it defaults to file multc.s.

                           -d optoff
                           -e opton
                                      Enables (-e) or disables (-d)
                                      optimization switches specified in
                                      optoff and opton, as follows:

                                      f     Generates BTRNSFRM and ETRNSFRM
                                            markers for debugging.
                                            The default is -df, which means
                                            that no debugging markers are
                                            generated.
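
                                       For example, the following command
                                       line (the file names are
                                       illustrative) enables the debugging
                                       markers:

                                         fmp -e f prog.m prog.j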

                           -f         Selects CFT or CFT2 (depending on
                                      machine type), rather than CFT77, for
                                      Fortran output syntax.  The default is
                                      CFT77 output syntax.

                           -g gvalue  Specifies value to be used by the
                                      guided and vector scheduling
                                      algorithms.  The gvalue determines the


     SG-3074 5.0               Cray Research, Inc.                         47


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                      number of iterations to be assigned to
                                      a processor each time it is
                                      redispatched, based on the number of
                                      processors available.  A gvalue of 0
                                      causes code to be generated that reads
                                      the current number of processors
                                      acquired by the program at run time.
                                      The default gvalue is the number of
                                      physical processors on the system.
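
                                       For example, the following command
                                       line (the file names are
                                       illustrative) generates code that
                                       reads the number of processors
                                       acquired at run time:

                                         fmp -g 0 prog.m prog.j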

                           -i         Disables generation of code that uses
                                      the CFT77 Autotasking inline intrinsic
                                      functions.  External calls are
                                      generated in place of the intrinsic
                                      function calls.  This is the default
                                      for CRAY-2 systems; the CFT77
                                      Autotasking inline intrinsic functions
                                      cannot be enabled for CRAY-2 systems.

                           -I directory
                                      Specifies a directory that contains
                                      INCLUDE files for FMP to expand.  The
                                      fmp command searches first in the
                                      directory of its input file and then in
                                      directories specified by the -I option.
                                      Multiple directories may be specified
                                      on the command line by using a -I
                                      option for each directory.  If input to
                                      FMP is from stdin, FMP looks in the
                                      current working directory for the
                                      INCLUDE file.

                                      If no INCLUDE file is found in the
                                      specified directory, the fmp command
                                      generates an error message and exits.
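
                                       For example, the following command
                                       line (the file and directory names
                                       are illustrative) adds one directory
                                       to the INCLUDE search path; repeat
                                       the -I option to add others:

                                         fmp -I /usr/joe/inc prog.m prog.j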

                           -l         Replaces the last character of an 8-
                                      character name with an S or M when
                                      generating names for microtasked
                                      routines.  By default, the preprocessor
                                      creates two subroutines for each
                                      microtasked routine and appends as much
                                      of MULT or SNGL as it can, within the
                                      8-character name limit.  (This means
                                      that by default, 8-character routine
                                      names produce an FMP abort because no
                                      characters from MULT or SNGL can be
                                      added and still stay within the 8-
                                      character limit.)  With this option,
                                      however, a routine named FUNCTION would
                                      become FUNCTIOM and FUNCTIOS.  The fmp
                                      command aborts if it finds an 8-
                                      character subroutine name for a
                                      microtasked routine that already ends
                                      in M or S.

                           -N80       Specifies an 80-column Fortran input
                                      file rather than the default 72-column
                                      Fortran input file.


     48                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Option     Description
                           ______     ___________
                           -p         Generates FMP output suitable for use
                                      by atexpert, the Autotasking Expert
                                      System.

                           -s file    Specifies file name where the
                                      uniprocessor versions of the code are
                                      to be placed.  This includes routines
                                      that do not contain microtasking or
                                      Autotasking directives, uniprocessor
                                      versions of microtasked routines, and
                                      the modified version of routines
                                      containing parallel regions and DOALLs.
                                      The default is to direct output to
                                      output_file.  If output_file is not
                                      specified, the default is to direct
                                      output to stdout.
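
                                       For example, the following command
                                       line (the file names are
                                       illustrative) places the
                                       uniprocessor code in serial.f:

                                         fmp -s serial.f prog.m prog.j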

                           -S         Prints symbol table to stdout.

                           -V         Displays current FMP version
                                      information to standard error (stderr)
                                      during execution.  If you specify the
                                      -V option on the fmp command line, you
                                      must specify an input file, as follows:

                                        fmp -V file.m file.j

                           input_file Specifies optional file for input;
                                      default is stdin.

                           output_file
                                      Specifies optional file for output;
                                      default is stdout.

                           The FMP translator can interpret both Autotasking
                           and microtasking directives in Fortran code.
                           Subroutines in a program may contain Autotasking
                           and microtasking directives, subject to certain
                           restrictions.  See "Using microtasking and
                           macrotasking with Autotasking," page 5, for these
                           restrictions.  For more information about
                           directives, see "Concepts and Directives," page
                           51.














     SG-3074 5.0               Cray Research, Inc.                         49



                                                 Concepts and Directives  [4]
     ########################################################################







                           To use the remaining portions of this guide, you
                           must understand some fundamental multitasking
                           concepts and be familiar with the terms that
                           describe these concepts.  This section introduces
                           basic multitasking and Autotasking concepts, gives
                           definitions of these concepts, discusses the
                           levels of intervention you, as a programmer, have
                           with Autotasking, and describes directives that
                           can be used with Autotasking.




     Concepts
     4.1
                           The following definitions describe basic
                           multitasking and Autotasking concepts that are
                           useful when dealing with any of the CRI parallel
                           processing software products.  These terms are
                           also summarized in Figure 10, page 53, and Figure
                           11, page 54.

                           Term          Definition
                           ----          ----------

                           Multitasking  One program makes use of multiple
                                         processors to execute portions of
                                         the program simultaneously.  Because
                                         multiple processes or tasks execute
                                         at the same time, the execution of
                                         the processes is not synchronous.
                                         There is no guarantee that
                                         concurrent processes will execute in
                                         any given order or sequence unless
                                         the program contains implicit
                                         synchronizations or the programmer
                                         has inserted explicit
                                         synchronization mechanisms.  (Also
                                         called parallel processing.)

                           Autotasking   Automatic distribution of loop
                                         iterations to multiple processors
                                         (or tasks) using the CF77 compiling
                                         system.  Autotasking can exploit
                                         parallelism at the DO loop level; it
                                         can be fully automatic, but you also
                                         can interact with the CF77 compiling
                                         system on several levels.  See
                                         "Levels of user intervention with


     SG-3074 5.0               Cray Research, Inc.                         51


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                         Autotasking," page 56, for more
                                         information.



























































     52                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                            Multitasking Terminology


       Multitasking or Parallel processing
                                   One program makes use of multiple
                                   processors to execute portions of the
                                   program simultaneously.


       Autotasking                 Automatic distribution of loop
                                   iterations to multiple processors (or
                                   tasks) using the CF77 compiling system.


       Parallel region             Section of code that is executed by
                                   multiple processors.  All code within a
                                   parallel region can be classified as
                                   partitioned or redundant.


       Single-threaded code        Section of code that is executed by
                                   only one processor at a time.


       Serial code                 Section of code that is executed by
                                   only one processor.


       Partitioned code            Code within a parallel region in which
                                   multiple processors share the work that
                                   needs to be done.  Each processor does
                                   a different portion of the work.

     ________________________________________________________________________

                       Figure 10.  Multitasking terminology






















     SG-3074 5.0               Cray Research, Inc.                         53


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                            Multitasking Terminology
                                   (continued)


       Redundant code              Code in a parallel region in which
                                   processors duplicate the work that
                                   needs to be available to all
                                   processors.


       Data dependency             When a computation in one iteration of
                                   a loop requires a value computed in
                                   another iteration of the loop.


       Synchronization             Process of coordinating the steps
                                   within concurrent/parallel regions.


       Master task                 Task that executes all of the serial
                                   code, initiates parallel processing,
                                   and waits until parallel processing is
                                   finished before leaving the Autotasking
                                   region.


       Slave task                  Task initiated by the master task.


       Directives                  Special lines of code beginning with
                                   CDIR$, CDIR@, CMIC$, CMIC@, or CFPP$
                                   that give the compiling system
                                   information about a program.
     ________________________________________________________________________

                 Figure 11.  Multitasking terminology (continued)

                           Term          Definition
                           ----          ----------

                           Parallel region
                                         Section of code that is executed by
                                         multiple processors.  All code
                                         within a parallel region can be
                                         classified as partitioned or
                                         redundant.

                           Single-threaded code
                                         Section of code that is executed by
                                         only one processor at a time.
                                         Another processor may enter this
                                         section of code as soon as the
                                         current processor is finished
                                         executing the code.

                           Serial code   Section of code that is executed by


     54                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                         only one processor.

                           Partitioned code
                                         Code within a parallel region in
                                         which multiple processors share the
                                         work that needs to be done.  Each
                                         processor does a different portion
                                         of the work.  (Also called control
                                         structure.)

                           Redundant code
                                         Code within a parallel region in
                                         which multiple processors can
                                         execute the same code and make the
                                         results available to all processors.

                            Data dependency
                                          When a computation in one iteration
                                          of a loop requires a value that was
                                          computed in another iteration of the
                                          loop.  (An illustrative example
                                          follows this list of terms.)

                           Synchronization
                                         The process of coordinating the
                                         steps within concurrent/parallel
                                         regions.

                           Master task   The task that executes all of the
                                         serial code, initiates parallel
                                         processing when an Autotasking
                                         region is entered, performs  all or
                                         part (or none) of the work in the
                                         Autotasking region, and waits until
                                         parallel processing is finished
                                         before leaving the Autotasking
                                         region.  The code executed by the
                                         master task is in the original
                                         calling routine of a program that is
                                         being autotasked and contains the
                                         initialization and termination code
                                         for parallel execution.

                           Slave task    The task initiated by the master
                                         task that contains the parallel
                                         region code to be executed by slave
                                         processors.

                           Directives    Special lines of code beginning with
                                         CDIR$, CDIR@, CMIC$, CMIC@, or CFPP$
                                         that give the compiling system
                                         information about a program.  FPP
                                         automatically inserts CMIC@ and
                                         CDIR@ directives; you can manually
                                         add CDIR$, CMIC$, or CFPP$
                                         directives to the program.
                                         Directives CDIR$ and CDIR@ differ in
                                         that when you disable interpretation
                                         of directives by CFT77, CDIR@
                                         directives are still interpreted by


     SG-3074 5.0               Cray Research, Inc.                         55


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                         the compiler to optimize your code.
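
                            As an illustration of the data dependency
                            concept defined above, the following sketch of
                            a Fortran loop (with hypothetical array names)
                            requires a value from a previous iteration:

                                       DO 10 I = 2, N
                                          A(I) = A(I-1) + B(I)
                                    10 CONTINUE

                            Each iteration reads A(I-1), which is computed
                            by the preceding iteration, so the iterations
                            cannot safely be distributed to multiple
                            processors without synchronization.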




     Levels of user
     intervention with
     Autotasking
     4.2
                           As explained previously, Autotasking represents
                           the third phase in the development of CRI
                           multitasking software.  You have several options
                           for interaction with Autotasking:

                           * No intervention (use it as an automatic system).

                           * Insert Autotasking directives to identify
                             parallelism not detected automatically.

                           * Process previously microtasked code through the
                             Autotasking system to detect and exploit
                             parallelism in nonmicrotasked routines.

                           * Process previously autotasked code to exploit
                             more parallelism.  FPP can recognize parallelism
                             outside a loop nest; that is, FPP examines all
                             code except a loop nest that already contains an
                             Autotasking directive.  (Even subroutines
                             containing Autotasking directives are analyzed.)

                           * Insert Autotasking directives yourself,
                             bypassing FPP.

                           Autotasking can be used as a fully automatic
                           system.  You can simply use an error-free Fortran
                           program as input to the Autotasking system and let
                           the software automatically detect and exploit
                           parallelism.  For many programs, using this
                           "automatic mode" gives a substantial speedup.

                           These levels of intervention are summarized in
                           Figure 12, page 57.
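
                            In the fully automatic case, for example, you
                            can process a program with the direct fpp, fmp,
                            and cft77 invocations described in "Invoking FPP
                            and FMP Directly" (the file names here are
                            illustrative), or you can let the cf77 command
                            drive all three phases for you:

                            -------------------------------------------------
                            $ fpp prog.f > prog.m
                            $ fmp prog.m prog.j
                            $ cft77 -a stack prog.j
                            -------------------------------------------------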

















     56                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                     Levels of Intervention with Autotasking



       * No intervention (use it as an automatic system).



       * Insert Autotasking directives to identify parallelism not
         detected automatically.



       * Process previously microtasked code through the Autotasking
         system to detect and exploit parallelism in nonmicrotasked
         routines.



       * Process previously autotasked code to exploit more parallelism.
         FPP can recognize parallelism outside a loop nest; that is, FPP
         examines all code except a loop nest that already contains an
         Autotasking directive.  (Even subroutines containing Autotasking
         directives are analyzed.)



       * Insert Autotasking directives yourself, bypassing FPP.
     ________________________________________________________________________

               Figure 12.  Levels of intervention with Autotasking

                           However, you may sometimes know information about
                           the structure of a program and about its data that
                           is unavailable to FPP through inspection of
                           individual program units.  For this reason,
                           directives supply a way for you to guide FPP.  All
                           directives are treated as comments by Fortran
                           compilers, thus preserving code transportability.
                           The following subsections explain directives that
                           you can use to pass information to the CF77
                           compiling system.




     Directives
     4.3
                           You can pass information to all phases of the
                           compiling system by inserting directives into your
                           source code.  FPP, FMP, and CFT77 each have their
                           own set of directives.  For example, you can
                           insert directives to instruct FPP where to perform
                           or not perform parallel and vector dependency


     SG-3074 5.0               Cray Research, Inc.                         57


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           analysis.  The following types of directives are
                           available for use with the Autotasking system:

                           * FPP directives

                           * FMP directives

                           * Compiler directives

                           These directives are summarized in Figure 13, page
                           59.

                           FPP directives are special lines of code beginning
                           with CFPP$ that are interpreted by FPP to provide
                           more information about the program.  You can
                           manually add them to the program.  (These
                           directives are also called user directives.)

                           FMP directives are special lines of code beginning
                           with CMIC$  or CMIC@ that are interpreted by FMP
                           to give it more information about the program.
                           FPP automatically inserts CMIC@ directives; you
                           can also manually add CMIC$ directives to the
                           program.  (These directives are also called
                           microtasking directives.)

                           Compiler directives are special lines of code
                           beginning with CDIR$ and CDIR@ that are
                           interpreted by the CFT77 compiler as information
                           about the program.  FPP automatically inserts
                           CDIR@ directives; you can also manually add CDIR$
                           directives to the program.  FPP also interprets
                           certain compiler directives associated with vector
                           processing; see "Compiler directives," page 99,
                           for more information.
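
                            For illustration, one directive line of each
                            type is shown below.  The CFPP$ line uses the
                            NOVECTOR directive, which is described later in
                            this section; the CMIC$ and CDIR$ lines show
                            representative directives of those types and
                            are included only as examples of the form:

                                 CFPP$ NOVECTOR R
                                 CMIC$ DO ALL
                                 CDIR$ IVDEP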


























     58                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                                   Directives


       FPP Directives


       * Special lines of code beginning with CFPP$

       * Interpreted by FPP

       * Can manually add to the program

       * Also called user directives

         FMP Directives

       * Special lines of code beginning with CMIC$

       * Interpreted by FMP

       * FPP automatically inserts

       * Can also manually add them to the program

       * Also called microtasking directives

         Compiler directives


       * Special lines of code beginning with CDIR$ and CDIR@

       * Interpreted by the compiler

       * FPP automatically inserts CDIR@ directives

       * Can manually add CDIR$ directives to the program
     ________________________________________________________________________

                      Figure 13.  Directive summary by type



     FPP directives
     4.3.1
                           FPP or user directives have the following syntax:

                           -------------------------------------------------
                           CFPP$ directive scope
                           -------------------------------------------------

                           The C in column 1 makes the directive a comment
                           for all other Fortran compilers.  The FPP$ flags
                           this line as a directive to FPP.  Following the
                           directive is an optional scope parameter, scope.
                           Table 3 shows allowable scope values.


     SG-3074 5.0               Cray Research, Inc.                         59


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Table 3.  Allowable scope parameters for CFPP$
                                     directives

                           _______________________________________________
                           Value     Meaning       Description
                           _______________________________________________
                              R       Routine       Directive applies
                                                    until the end of the
                                                    current routine.

                              L       Loop          Default scope.
                                                    Directive applies to
                                                    the next loop, and
                                                    only the next loop;
                                                    that is, any inner and
                                                    outer loops are
                                                    considered
                                                    independently.

                              F       File          Directive applies
                                                    until the end of the
                                                    input file.

                              I       Immediate     Directive applies
                                                    immediately at that
                                                    point in the source
                                                    code.

                                      Blank         Same as L; directive
                                                    applies to the next
                                                    loop encountered.
                           _______________________________________________


                           Some directives ignore the scope parameter.
                           Directives affecting IF loops must have R or F
                           scope; directives with L scope apply only to DO
                           loops.  The body of the directive begins after one
                           or more blanks.  Many directives can be preceded
                           by NO, thus effecting the reverse operation.

                           The following example tells FPP to ignore
                           potential data dependencies in the next loop:

                                CFPP$ NODEPCHK


                           The following example disables the vectorization
                           enhancement for the rest of this routine.

                                CFPP$ NOVECTOR R


                           The following example enables the listing for the
                           rest of the input file.

                                CFPP$ LIST F





     60                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The full set of directives is summarized in Table
                           4.  The scope entry is either L, indicating that
                           it applies to the next loop; R, indicating that it
                           applies to the whole routine; I, indicating that
                           it applies immediately; or LRF, which indicates
                           that any of the loop, routine, or file options can
                           be used to control the scope.  If the scope is not
                           specified for these directives, the default is L,
                           or loop.  A short description of each of these
                           directives follows the table, grouped by
                           functionality.


                            Table 4.  CFPP$ directives

     _______________________________________________________________________
     Directive      Function                                 Default   Scope
     _______________________________________________________________________

     NOVECTOR/     Disables/enables vectorization               VECTOR   LRF
     VECTOR        enhancement

     NOCONCUR/     Disables/enables Autotasking                 CONCUR   LRF
     CONCUR

     SKIP          Disables Autotasking and vectorization         None   LRF

     INNER/        Enables/disables Autotasking for            NOINNER   LRF
     NOINNER       inner loops

     CNCALL        Allows concurrent calls in loop                None   LRF

     NOALTCODE/    Disables/enables generation of alternate    ALTCODE   LRF
     ALTCODE       code blocks

     NOASSOC/      Disables/enables all associative              ASSOC   LRF
     ASSOC         transformations

     SPLIT/        Enables/disables cutting out subroutine     NOSPLIT   LRF
     NOSPLIT       and function calls from loop

     SELECT        Selects which loop in a nest of loops          None   L
                   to optimize

     NOLSTVAL/     Disables/enables saving last values of       LSTVAL   LRF
     LSTVAL        transformed scalars

     UNROLL/       Enables/disables automatic or explicit     NOUNROLL   LRF
     NOUNROLL      loop unrolling

     NODEPCHK/     Disables/enables data dependency check       DEPCHK   LRF
     DEPCHK

     NOSYNC/       Enables/disables analysis of potential         SYNC   LRF
     SYNC          overlap of array sections

     _______________________________________________________________________


     SG-3074 5.0               Cray Research, Inc.                         61


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                     Table 4.  CFPP$ directives  (continued)
     _______________________________________________________________________
     Directive      Function                                 Default   Scope
     _______________________________________________________________________

     NOEQVCHK/	    Disables/enables checking of EQUIVALENCE    EQVCHK   LRF
     EQVCHK	    statements to see whether they cause data 
		    dependencies

     PERMUTATION    Declares that listed integer arrays,          None   R
                    for use as subscripts in array section names, 
		    have no repeated values

     RELATION       Specifies relationship between two 		  None   LRF
                    simple variables

     NOLIST/	    Disables/enables listing of the input 	  LIST   I
     LIST	    source file

     SWITCH         Sets global switches                          None   I

     COUNT          Supplies iteration count for loop             None   LRF

     ITERATIONS     Supplies iteration counts for classes         None   R
                    of loops

     AUTOEXPAND/    Enables/disables automatic routine    NOAUTOEXPAND   LRF
     NOAUTOEXPAND   inlining

     EXPAND         Expands particular routines inline            None   RF

     NEXPAND        Expands particular nested routines inline     None   RF

     SEARCH         Supplies location for source of routines      None   RF
		    to be expanded

     PRIVATEARRAY   Asserts that private arrays can be 		  None   LRF
     		    autotasked
     _______________________________________________________________________




































     62                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                          FPP Transformation Directives


       * NOVECTOR/VECTOR disables/enables vectorization enhancement;
         VECTOR serves only to toggle back from NOVECTOR.


       * NOCONCUR/CONCUR disables/enables Autotasking; CONCUR serves only
         to toggle back from NOCONCUR.


       * SKIP disables Autotasking and vectorization.


       * INNER/NOINNER enables/disables Autotasking for inner loops.


       * CNCALL allows concurrent calls in loop.


       * NOALTCODE/ALTCODE disables/enables generation of alternate code
         blocks.


       * NOASSOC/ASSOC disables/enables all associative transformations.


       * SPLIT/NOSPLIT enables/disables cutting out subroutine and
         function calls from loop.


       * SELECT selects which loop in a nest of loops to optimize.


       * NOLSTVAL/LSTVAL disables/enables saving last values of
         transformed scalars.


       * UNROLL/NOUNROLL enables/disables automatic or explicit loop
         unrolling.
     ________________________________________________________________________

                 Figure 14.  FPP transformation directive summary


     Transformation
     directives
     4.3.1.1
                           Transformation directives change the way FPP
                           transforms a loop.  These directives are
                           summarized in Figure 14, page 63, and discussed in
                           the following subsections.






     NOVECTOR/VECTOR
     4.3.1.1.1
                           The NOVECTOR directive disables vectorization
                           enhancement.  Despite the best efforts of FPP to
                           make the right choices, occasionally a loop may be
                           less efficient after transformation.  NOVECTOR is
                           provided to disable vectorization enhancement in
                           such cases.  VECTOR serves only to toggle back
                           from NOVECTOR; it does not force vectorization.

                           The -d v option to fpp is equivalent to NOVECTOR
                           with file scope.
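
                            For example, the following sequence (the loop
                            shown is illustrative only, not one of the
                            manual's samples) disables vectorization
                            enhancement for a single short loop; because
                            the directive has the default loop scope, later
                            loops are unaffected:

                                 CFPP$ NOVECTOR
                                       DO 10 I = 1, 3
                                          A(I) = B(I) + C(I)
                                  10   CONTINUE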

     NOCONCUR/CONCUR
     4.3.1.1.2
                           The NOCONCUR directive disables conversion of
                           loops to autotasked form.  The CONCUR directive
                           serves only to toggle back from, or locally
                           override, a previous directive or command-line
                           option that disabled concurrency analysis; it does
                           not force conversion of a specific loop.  (See the
                           SELECT directive, page 66, for information about
                           selecting a loop for concurrency analysis.)

                           Specifying the NOCONCUR directive with loop scope
                           (the default) does not inhibit FPP from making a
                            loop part of a parallel case, nor does it
                           inhibit FPP from expanding a parallel region
                           outside of a nonparallel loop.  You can use the
                           NOCONCUR directive with routine or file scope, or
                           use fpp command-line options (-d h and -d c) to
                           inhibit these transformations.

                           The -d c option to fpp is equivalent to NOCONCUR
                           with file scope.
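
                            For example (the loops and variable names here
                            are illustrative only), a NOCONCUR directive
                            with routine scope can disable Autotasking for
                            a routine, and a CONCUR directive can then
                            locally override it for one loop that should
                            still be considered for concurrency analysis:

                                 CFPP$ NOCONCUR R
                                       DO 10 J = 1, M
                                          C(J) = C(J) + D(J)
                                  10   CONTINUE
                                 CFPP$ CONCUR
                                       DO 20 I = 1, N
                                          A(I) = A(I) + S*B(I)
                                  20   CONTINUE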

     SKIP
     4.3.1.1.3
                           The SKIP directive disables Autotasking and
                           vectorization; it acts like a combined NOCONCUR
                           and NOVECTOR.

     INNER/NOINNER
     4.3.1.1.4
                           The INNER directive enables Autotasking of inner
                           loops.  For more information on the use of the
                           INNER directive, see "INNER directive use," page
                           148.






     CNCALL
     4.3.1.1.5
                           The CNCALL directive asserts that any subroutines
                           called in a loop have no recursive side effects;
                           they can be called concurrently by separate
                           iterations of the loop.  See "CNCALL directive
                           use," page 155, for more information on the use of
                            the CNCALL directive.
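
                            For example (the subroutine and its arguments
                            are illustrative only), the following directive
                            asserts that CALC has no side effects that
                            prevent separate iterations from calling it
                            concurrently:

                                 CFPP$ CNCALL
                                       DO 30 I = 1, N
                                          CALL CALC ( A(1,I), B(1,I), M )
                                  30   CONTINUE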

     NOALTCODE/ALTCODE
     4.3.1.1.6
                           The NOALTCODE directive disables the generation of
                           alternate code blocks.

                           For potentially dependent vector loops, the
                           ALTCODE directive directs FPP to generate both
                           vector and nonvector versions of the loop,
                           together with a run-time test to choose between
                           them based on the value of array subscripts.

                           For autotasked loops, ALTCODE directs FPP to
                           supply a similar threshold test for the IF clause
                           of the DO ALL or DO PARALLEL.

                           The ALTCODE directive allows an optional
                           parameter.  If the parameter is an integer
                           constant, FPP generates a test comparing the
                           loop's iteration count to the constant.  If the
                           iteration count is larger than the constant, the
                           loop is vectorized; otherwise, it is not.  If the
                           parameter is not an integer constant, the
                           parameter is echoed verbatim for the IF test.  If
                           the result of the IF test is true, the loop is
                           vectorized; otherwise, it is not.

                           The ALTCODE directive is in force by default.  The
                           -d m option to fpp is equivalent to NOALTCODE with
                           file scope.
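
                            For example (illustrative only; the parameter
                            is written here in the parenthesized form used
                            by directives such as COUNT and UNROLL), the
                            following directive requests both versions of
                            the loop, with the vector version chosen at run
                            time only when the iteration count exceeds 64:

                                 CFPP$ ALTCODE ( 64 )
                                       DO 40 I = 1, N
                                          A(I) = A(I) + B(I)
                                  40   CONTINUE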

     NOASSOC/ASSOC
     4.3.1.1.7
                            By default, FPP transforms certain constructs
                            into vector or concurrent versions in which the
                            order of operations may differ from that of the
                            original; that is, the operations have been
                            associatively transformed, as the associative
                            property of real numbers would permit.  Because
                            of the way numbers are represented internally
                            in computers, however, floating-point
                            arithmetic is not truly associative, so this
                            reordering may produce answers that differ
                            slightly from those of the scalar original.
                            The NOASSOC directive disables all associative
                            transformations, including the following:



                           * Reductions - sum, dot product, and index of
                             minimum and maximum

                           * Operation reordering when minimizing dependent
                             regions

                           * Linear recursion translation

                           The -d a option to fpp is equivalent to NOASSOC
                           with file scope.
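
                            For example (illustrative only), the following
                            directive keeps the dot product below in its
                            original scalar order of evaluation, at the
                            cost of the reduction optimization:

                                 CFPP$ NOASSOC
                                       DO 50 I = 1, N
                                          SUM = SUM + A(I)*B(I)
                                  50   CONTINUE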

     SPLIT/NOSPLIT
     4.3.1.1.8
                           The SPLIT directive asserts that subroutine and
                           function calls do not cause feedback of results
                           from one loop pass to another, and thus may be
                           "split out" from an optimized loop into a separate
                           loop.  For more information on the use of the
                           SPLIT directive, see "SPLIT directive use," page
                           159.

     SELECT
     4.3.1.1.9
                           The SELECT directive advises FPP to choose the
                           next loop as the one to vectorize or autotask in a
                           nest of loops.  If FPP cannot analyze the loop or
                           finds a dependence, the SELECT directive is
                           ignored.

                            In choosing a single loop from a nest, FPP
                            applies a heuristic algorithm that weighs loop
                            iteration count, the presence of data
                            dependence, and the amount of work within the
                            loop.  Because not all
                           pertinent information is available at compile
                           time, FPP may not always be able to make the best
                           choice.  Therefore, the SELECT directive allows
                           you to dictate the optimization mode of a specific
                           loop.  Place the SELECT directive directly before
                           the DO statement of the loop to be optimized.  An
                           optional argument indicates the mode of
                           optimization, either VECTOR or CONCUR.  The
                           default is VECTOR.
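
                            For example (illustrative only; the spelling of
                            the optional argument shown here is an
                            assumption), placing the following directive
                            immediately before the outer DO statement
                            advises FPP to autotask the outer loop of the
                            nest:

                                 CFPP$ SELECT CONCUR
                                       DO 70 J = 1, M
                                          DO 60 I = 1, N
                                             A(I,J) = A(I,J) + B(I,J)
                                  60      CONTINUE
                                  70   CONTINUE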

     NOLSTVAL/LSTVAL
     4.3.1.1.10
                           The NOLSTVAL directive advises FPP not to save the
                           final values for transformed scalars; that is, the
                           final values for transformed scalars (within the
                           directive's scope) do not need to be identical to
                           those in the scalar version.  This directive is
                           useful when FPP cannot determine by inspecting the
                           current subprogram whether a variable is
                            subsequently used.  Such variables are typically
                           in common blocks.

                           Transformed scalars are array indexes and promoted
                           scalars.  See "Last value saving," page 119,
                           "Array indexing," page 117, and "Scalars in
                           loops," page 186, for more information.

                           The LSTVAL directive causes FPP to save the last
                           values of transformed scalars; this is the
                           default.

     UNROLL/NOUNROLL
     4.3.1.1.11
                           The UNROLL directive has two functions:  the first
                           function is to enable the automatic unrolling of
                           loops with small constant iteration counts; the
                           second function is to force explicit unrolling of
                           a particular loop, regardless of iteration count.
                           Eliminating an inner loop by unrolling may allow
                           another loop to vectorize.

                           The UNROLL directive has the following syntax:

                           -------------------------------------------------
                           CFPP$  UNROLL  [(number_of_times)]  [{L,R,F}]
                           -------------------------------------------------

                           When routine or file scope is specified (R or F),
                           automatic unrolling of loops is enabled or
                           disabled over that scope.  The optional parameter
                           number_of_times, which must be a constant,
                           specifies the threshold loop iteration count for
                           automatic unrolling.  Loops with an iteration
                           count greater than this value are not unrolled.
                           By default, the threshold is 3.

                           To force a loop to be explicitly unrolled, use the
                           UNROLL directive with local scope (L) immediately
                           preceding the loop.  In this case, the optional
                           parameter is taken as the number of times to
                           unroll the loop.  If a parameter is not supplied,
                           FPP uses an internally calculated function of the
                           loop length, loop complexity, and default
                           threshold to determine the number of times to
                           unroll the loop.
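
                            For example (illustrative only), the following
                            directive forces the next loop to be unrolled
                            four times, regardless of its iteration count:

                                 CFPP$ UNROLL ( 4 )
                                       DO 80 K = 1, N
                                          A(K) = B(K) + C(K)
                                  80   CONTINUE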

                           The NOUNROLL directive disables automatic loop
                            unrolling; this is the default.

     ________________________________________________________________________

                         FPP Data Dependency Directives



       * NODEPCHK/DEPCHK disables/enables data dependency checks.



       * NOSYNC/SYNC disables/enables analysis of potential overlap of
         array sections.  (NOSYNC is a generalization of NODEPCHK to
         concurrency.)



       * NOEQVCHK/EQVCHK disables/enables checking of EQUIVALENCE
         statements to see whether they cause data dependencies.



       * PERMUTATION declares that listed integer arrays, for use as
         subscripts in array section names, have no repeated values.



       * RELATION specifies relationship between two simple variables.



       * PRIVATEARRAY asserts that private arrays can be autotasked.
     ________________________________________________________________________

                Figure 15.  FPP data dependency directive summary


     Data dependency
     directives
     4.3.1.2
                           Data dependency directives are used to help FPP
                           decide whether data dependency conflicts actually
                           exist in a loop.  If you know that an operation is
                           not recursive, you can supply one of these
                           directives to inform FPP.  These directives are
                           summarized in Figure 15, page 68, and briefly
                           discussed in the following subsections.  They are
                           discussed further and examples are given in "Using
                           data dependency directives," page 110.

     NODEPCHK/DEPCHK
     4.3.1.2.1
                           When elements of an array are modified within a
                           loop, FPP must determine the exact storage
                            relationship of these elements to all other
                            references to the array in the loop.  This must
                            be done to ensure that the references do not
                            overlap and thus can be safely executed in
                            parallel.
                           When the relationships cannot be determined, FPP
                           issues a potential dependency diagnostic.  The
                           NODEPCHK directive asserts that all such
                           potentially recursive relationships are, in fact,
                           not recursive.  You should use this capability
                           only when you know no real recursion exists.  Use
                           of the directive does not, however, force the
                           optimization of operations that are unambiguously
                           recursive.  Use the DEPCHK directive to toggle
                           back to the default state.  The -d d option to fpp
                           is equivalent to NODEPCHK with file scope.  See
                           "NODEPCHK (declaring nonrecursion)," page 111, for
                           more information.
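
                            For example (illustrative only; the program is
                            assumed to guarantee that M is at least N, so
                            the two regions of A cannot overlap), the
                            following directive suppresses the potential
                            dependency diagnostic for the next loop:

                                 CFPP$ NODEPCHK
                                       DO 10 I = 1, N
                                          A(I+M) = A(I) + B(I)
                                  10   CONTINUE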

     NOSYNC/SYNC
     4.3.1.2.2
                           The NOSYNC directive is a generalization of the
                           NODEPCHK directive to concurrency.  FPP generates
                           the diagnostic MUST SYNCHRONIZE TO PRESERVE ORDER
                           OF ACCESSES when one processor might write over
                           elements of an array before another processor
                           reads those elements.  If there is no overlap, you
                           can use the NOSYNC directive to allow full
                           optimization.

                           You can use the SYNC directive to toggle back to
                           the default state.

     NOEQVCHK/EQVCHK
     4.3.1.2.3
                           The NOEQVCHK directive tells FPP to ignore
                           relationships between variables caused by
                           EQUIVALENCE statements, when examining the data
                           dependencies in a loop.  You can use the EQVCHK
                           directive to toggle back to the default state.
                           The -d e option to fpp is equivalent to a NOEQVCHK
                           directive with file scope.  See the example
                           "NOEQVCHK (declaring nonrecursion in
                           equivalences)," page 112.

     PERMUTATION
     4.3.1.2.4
                           The PERMUTATION directive declares that an integer
                           array does not have repeated values.  This is
                           useful when the integer array is used as a
                           subscript for another array ("indirect
                           addressing").  If it is known that the integer
                           array is used merely to permute the elements of
                           the subscripted array, it can often be determined
                            that feedback does not exist with that array
                           reference.  See "PERMUTATION (declaring safe
                           indirect addressing)," page 114, for directive
                           syntax and more information.
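
                            For example (illustrative only; the list form
                            of the directive shown here is an assumption,
                            and the exact syntax is given in the section
                            cited above), if every element of IX is
                            distinct, the following declaration lets FPP
                            optimize the indirectly addressed loop:

                                 CFPP$ PERMUTATION ( IX )
                                       DO 10 I = 1, N
                                          A(IX(I)) = A(IX(I)) + B(I)
                                  10   CONTINUE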

     RELATION
     4.3.1.2.5
                           The RELATION directive advises FPP that a
                           specified relationship exists between two integer
                           variables or between an integer variable and an
                           integer constant.  This information may be useful
                           to FPP in resolving otherwise ambiguous array
                           relationships.

                            RELATION directives are informative only; they do
                           not force any action.  They can be applied at the
                           loop, routine, or file level.  If conflicting
                           relations are given, the result is unpredictable.
                           You must ensure that the relations specified are
                           correct and consistent.  See "RELATION (specifying
                           relationship between variables)," page 113, for
                           the syntax of the directive and more information.

     PRIVATEARRAY
     4.3.1.2.6
                           The PRIVATEARRAY directive tells FPP that it is
                           safe to autotask private arrays, specifically,
                           that private arrays use only values generated
                           within the autotasked loop.  The PRIVATEARRAY
                            directive has no parameters and can have loop,
                            routine, or file scope.

                           The fpp -e 4 command-line option is equivalent to
                           the PRIVATEARRAY directive with file scope.

                           Example:

                                CFPP$ PRIVATEARRAY
                                      DO 500 J = 1, M
                                         DO 100 I = 1, N
                                            X(I) = A(I,J) + B(I,J)
                                100      CONTINUE
                                         DO 200 I = 1, NM1
                                            C(I,J) = X(I) + D(I,J)
                                200      CONTINUE
                                500   CONTINUE


                           Without the PRIVATEARRAY directive, FPP does not
                           autotask the 500 loop, because it cannot tell
                           whether all the values of X used in the 200 loop
                           are generated by the 100 loop.  (If they are not,
                           the values generated outside the 500 loop are
                           required, and X cannot efficiently be made
                           private.)






     Miscellaneous
     directives
     4.3.1.3
                           FPP directives also exist that cannot be
                           categorized in any of the preceding classes.
                           These directives are summarized in Figure 16, page
                           72, and discussed in the following subsections.

     Advisory directives:
     COUNT and ITERATIONS
     4.3.1.3.1
                           Advisory directives provide information for FPP
                           that may result in a better choice of loops to be
                           optimized.

                           If the iteration count of a loop (or class of
                           loops) is variable and cannot be determined from
                           the information in the routine until execution
                           time, but you know the approximate number of
                           iterations, you can use the COUNT or ITERATIONS
                           directive to supply this information.

                           The COUNT directive has the following syntax:

                           -------------------------------------------------
                           CFPP$  COUNT (val1) [{L,R,F}]
                           -------------------------------------------------

                           The ITERATIONS directive has the format:

                           -------------------------------------------------
                           CFPP$  ITERATIONS (var1=val1 [,var2=val2] ...)
                           -------------------------------------------------

                           var1, var2...    Specifies indices of loops.

                           val1, val2...    Specifies vector length values
                                             for the given loop indices.
                                             These values do not have to be
                                             exact because they are used only
                                             as guidelines.

                           The COUNT directive can be used at the file,
                           routine, or loop levels.  The ITERATIONS directive
                           can be used only at the routine level.  A CFPP$
                           COUNT(0) F or NOITERATIONS directive returns FPP
                            to its normal iteration count processing.

     ________________________________________________________________________


                          Miscellaneous FPP Directives




       * COUNT lets you provide approximate iteration counts for loops to
         FPP.




       * ITERATIONS lets you provide approximate iteration counts for
         classes of loops to FPP.




       * NOLIST/LIST disables or enables the listing of the input source
         file.




       * SWITCH lets you set (or change) FPP global switches.
     ________________________________________________________________________

                 Figure 16.  Miscellaneous FPP directive summary

                           Example:

            SUBROUTINE OPTIM6 ( A, B, N )
            REAL A(N), B(N)
      C
      CFPP$ COUNT ( 3 ) R
      C     (The following loop is not autotasked; the asserted
      C     iteration count is too small.)
            DO 606 I = 1,N
               A(I) = B(I)
        606 CONTINUE
      C
            END

                           Example:

                                      SUBROUTINE ITERS (A,B,C,D,M,MM1,N,NP1)
                                      REAL A(M,N), B(M,N), C(M,N), D(M,N)
                                C
                                CFPP$ ITERATIONS (I=15,J=20)
                                C
                                      DO 200 J = 1, M
                                         DO 100 I = 1, N
                                            A(I,J) = B(I,J) + C(I,J)
                                 100     CONTINUE
                                 200  CONTINUE
                                C
                                      CALL CALC1
                                C
                                      DO 400 J = 1, MM1
                                         DO 300 I = 2, NP1
                                            A(I,J) = A(I,J) + D(I,J)
                                            B(I,J) = B(I,J) + D(I,J)
                                            C(I,J) = C(I,J) + D(I,J)
                                 300     CONTINUE
                                 400  CONTINUE


                           Translation:

                                      DO 200 J = 1, M
                                CDIR@ IVDEP
                                         DO 100 I = 1, N
                                            A(I,J) = B(I,J) + C(I,J)
                                 100     CONTINUE
                                 200  CONTINUE
                                C
                                      CALL CALC1
                                C
                                CMIC$ DO ALL SHARED(MM1, NP1, M, N, A, D, B,
                                CMIC$1   C) PRIVATE(J, I)
                                      DO 400 J = 1, MM1
                                CDIR@ IVDEP
                                         DO 300 I = 2, NP1
                                            A(I,J) = A(I,J) + D(I,J)
                                            B(I,J) = B(I,J) + D(I,J)
                                            C(I,J) = C(I,J) + D(I,J)
                                 300     CONTINUE
                                 400  CONTINUE


                           Without the ITERATIONS directive, both loop nests
                           would have been conditionally autotasked,
                           resulting in lower performance for the DO 200 loop
                           and unnecessary compile-time and run-time overhead
                           for the DO 400 loop.

     Listing directives:
     NOLIST and LIST
     4.3.1.3.2
                            Listing directives change the appearance of the
                           FPP listing.  The following subsections discuss
                           the FPP listing directives.

                           You can selectively suppress listing of the input
                           source code with the NOLIST/LIST directive pair.
                           If the NOLIST directive (or the -q l option to
                           fpp) is in force when the END statement is
                           encountered, the rest of the listing (messages,
                           translated source, summaries) is suppressed,
                           unless specifically enabled (with the -p option
                           switches to fpp).
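
                            For example (the declarations shown are
                            illustrative only), the following pair
                            suppresses the source listing for a block of
                            declarations and then resumes it; both
                            directives take effect immediately:

                                 CFPP$ NOLIST
                                       REAL WORK(10000)
                                       INTEGER IWORK(10000)
                                 CFPP$ LIST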

     SWITCH
     4.3.1.3.3
                           The SWITCH directive enables you to set (or
                           change) global switches, including listing
                           switches.  You can also use the SWITCH directive
                           to set optimization and reformatting switches.
                           See Table 1, page 42, for a list of the
                           optimization switches.  See Table 2, page 46, for
                            a list of listing switches.  See "FPP TIDY
                            Subprocessor" for a list of the reformatting
                            switches.

                           The format of the SWITCH directive is as follows:

     ------------------------------------------------------------------------
      CFPP$ SWITCH,OPTON=string,OPTOFF=string,LSTON=string,
        LSTOFF=string,TDYON=string,TDYOFF=string,TIDY parameters
     ------------------------------------------------------------------------

                            Parameters OPTON, OPTOFF, LSTON, LSTOFF, TDYON,
                            and TDYOFF correspond to fpp options -e, -d,
                            -p, -q, -r, and -n, respectively.

                           Blanks are not significant, and keywords and
                           switches can be in either uppercase or lowercase.
                           See "FPP TIDY Subprocessor," page xx.x 0, for
                           information on TIDY parameters specific to the
                           SWITCH directive.
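
                            For example (illustrative only; substitute
                            switch letters from Table 1 and Table 2 for the
                            placeholder letters shown here), the following
                            directive turns on one optimization switch and
                            turns off one listing switch from this point in
                            the input on:

                                 CFPP$ SWITCH,OPTON=f,LSTOFF=s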

     Inline expansion
     directives
     4.3.1.3.4
                           Inline expansion directives provide information
                           for FPP that allows expansion of the bodies of
                           certain subroutines and functions into the loops
                           that call them.  The directives are as follows:

                           * AUTOEXPAND

                           * EXPAND




                           * NEXPAND

                           * SEARCH

                           See "Inline expansion," page 174, for more
                           information about these directives and examples of
                           their use.



     FMP directives
     4.3.2
                           The FMP translator interprets CMIC$ and CMIC@
                           directives in Fortran code.

                           Autotasking and microtasking directives can be
                           used in the same subprogram unit, with
                           restrictions.  See the following note and "Using
                           microtasking and macrotasking with Autotasking,"
                           page 5, for a complete discussion of these
                           restrictions.

                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                  Note

                           Autotasking CMIC$ directives inhibit FPP action on
                           any loop nest in which they appear.  Also, FPP
                           does not try to optimize anything inside a
                           parallel region (that is, anything bounded by a
                            CMIC$ PARALLEL and CMIC$ END PARALLEL pair).
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


                           The general form of FMP directives is as follows:

                           -------------------------------------------------
                           CMIC$ GENERIC_DIRECTIVE directive_parameters
                           CMIC$*directive_parameters continued
                           -------------------------------------------------

                           User-specified FMP directives begin with "CMIC$ "
                           in columns 1 through 6 and directive text in
                           columns 7 through 72.  Directives can be continued
                           by using CMIC$ in columns 1 through 5 and any
                           nonblank, nonzero character in column 6.
                           Parameters on directives (for example, PRIVATE)
                           may be repeated as needed, and need not be
                           ordered.  Uppercase and lowercase may be used
                           freely in the directive text.  In the
                           descriptions, brackets [ ] delimit optional
                           parameters to individual directives.
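
                            For example (illustrative only), the following
                            directive is continued onto a second line by
                            placing CMIC$ in columns 1 through 5 and a
                            nonblank, nonzero character in column 6:

                                 CMIC$ DO ALL SHARED(A, B, N)
                                 CMIC$* PRIVATE(I)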







                           Certain FMP directives are the same for both
                           Autotasking and microtasking.  Common directives
                           include the following:

                                CMIC$ CONTINUE
                                CMIC$ END GUARD
                                CMIC$ GETCPUS
                                CMIC$ GUARD
                                CMIC$ RELCPUS


                           There are also FMP directives for Autotasking and
                           microtasking that perform basically the same
                           functions.  Equivalent directives are shown in
                           Table 5.

                           Table 5.  Equivalent Autotasking and microtasking
                                     directives

                           __________________________________________________
                           Autotasking                 Microtasking
                           __________________________________________________
                           CMIC$ CASE                  CMIC$ PROCESS
                           CMIC$ END CASE              CMIC$ ALSO PROCESS
                                                       CMIC$ END PROCESS
                           CMIC$ DO PARALLEL           CMIC$ DO GLOBAL
                           CMIC$ SOFT EXIT             CMIC$ STOP ALL PROCESS
                           __________________________________________________


                           The data scope rules used for FMP directives are
                           discussed in "FMP data scope rules," page 95.


     FMP Autotasking
     directives
     4.3.2.1
                           FMP Autotasking directives provide a way for you
                           to specify loop-level parallelism in your
                           programs; you can start and end parallel
                           processing at any number of suitable points within
                           a subroutine.  Using these directives eliminates
                           the microtasking requirement that parallel
                           processing always start at the first executable
                           statement of a subroutine and always end at the
                           last executable statement of the subroutine.
                           Figure 17, page 77, shows the difference between
                           parallel processing with Autotasking and
                           microtasking.

                           These directives are also useful when FPP fails to
                           recognize parallelism that you know exists.  They
                           are summarized in Figure 18, page 78, and
                           discussed in the following subsections.





     See the printed manual for this figure; it doesn't display on-line.



                   Figure 17.  Autotasking versus microtasking

     ________________________________________________________________________

                           FMP Autotasking Directives



       * CMIC$ DO ALL



       * CMIC$ PARALLEL and CMIC$ END PARALLEL



       * CMIC$ DO PARALLEL and CMIC$ END DO



       * CMIC$ CASE and CMIC$ END CASE



       * CMIC$ GUARD and CMIC$ END GUARD



       * CMIC$ CONTINUE



       * CMIC$ SOFT EXIT



       * CMIC$ TASKCOMMON
     ________________________________________________________________________

                      Figure 18.  FMP Autotasking directives

     CMIC$ DO ALL
     4.3.2.1.1
                           The CMIC$ DO ALL directive has the following
                           options:

                           -------------------------------------------------
                           CMIC$ DO ALL [IF (expr)] [SHARED (var [,...])]
                           [PRIVATE (var [,...])] [AUTOSCOPE]
                           [CONTROL(var [,...])] [SAVELAST] [MAXCPUS (val)]
                           {[SINGLE] | [CHUNKSIZE (n)] | [NUMCHUNKS (m)]
                           | [GUIDED] | [VECTOR]}
                           -------------------------------------------------


                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                Caution

                            In this description and in the remaining
                            descriptions of FMP directives, the directive
                            is wrapped onto several lines so that all
                            possible options can be shown; the syntax is
                            therefore not correct exactly as shown.  In
                            actual code, directives are continued by using
                            CMIC$ in columns 1 through 5 and any nonblank,
                            nonzero character in column 6.  See "FMP
                            directives," page 75, for a complete
                            description of the correct syntax.
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


                           The DO ALL directive indicates that the DO loop
                           that begins on the next line may be executed in
                           parallel by multiple processors.  No directive is
                            needed to end a DO ALL loop (that is, the DO ALL
                           initiates a parallel region whose only code is a
                           DO loop with independent iterations).  The loop
                           index variable for a DO ALL is PRIVATE.  Optional
                           parameters are as follows:

                           Parameter   Description
                           ---------   -----------

                           IF(expr)    Performs a run-time test to choose
                                       between uniprocessing and
                                       multiprocessing.  When not specified,
                                       multiprocessing is chosen if the loop
                                       was not called from within a parallel
                                       region.  The logical expression (expr)
                                       determines (at run-time) whether
                                       multiprocessing will occur.  When expr
                                       is true, multiprocessing is enabled.

                           SHARED(var1,var2,...)
                                       Specifies that the variables listed
                                       will have shared scope; that is, they
                                       are accessible to both the original
                                       task and all helper tasks.  The SHARED
                                       clause identifies those variables that
                                       are shared between parallel processes.
                                        The data scope rules used in a
                                       partitioned loop are discussed in "FMP
                                       data scope rules," page 95.

                           PRIVATE(var1,var2,...)
                                       Specifies that the variables listed
                                       will have private scope; that is, each
                                       task (original or helper) will have
                                       its own private copy of these
                                       variables.  The PRIVATE clause
                                       identifies those variables that are
                                       not shared between parallel processes.
                                       The data scope rules used in a
                                       partitioned loop are discussed in "FMP
                                       data scope rules," page 95.

                            AUTOSCOPE   Specifies that all variables not
                                        explicitly scoped with a PRIVATE or
                                        SHARED declaration are scoped
                                        according to the default rules
                                       for scoping variables.  The data scope
                                       rules used in a partitioned loop are
                                       discussed in "FMP data scope rules,"
                                       page 95.

                           CONTROL(var1,var2,...)
                                       Specifies that the variables listed
                                       are considered control variables for
                                       the purpose of the AUTOSCOPE
                                        parameter.  An array indexed by a
                                       control variable has shared scope.

                           SAVELAST    When present, this parameter specifies
                                       that private variables' values (from
                                       the final iteration of a DO ALL) will
                                       persist in the original task after
                                       execution of the iterations of the DO
                                       ALL.  By default, private variables
                                       are not guaranteed to retain the last
                                       iteration values.  SAVELAST can be
                                       used only with DO ALL, and if the full
                                       iteration set is not completed (for
                                       example, due to a SOFT EXIT), the
                                       values of private variables are
                                       indeterminate.

                           MAXCPUS (val)
                                       Specifies the maximum number of CPUs
                                       that the parallel region can use
                                       effectively.  Specifying MAXCPUS(val)
                                        does not ensure that val processors
                                        will be assigned; it
                                       specifies the optimal maximum.
                                       Argument val can be either a constant
                                       or a variable; both of the following
                                       are valid specifications:

                                         MAXCPUS (2)

                                         MAXCPUS (val)



                           The rest of the parameters (SINGLE, CHUNKSIZE,
                           NUMCHUNKS, GUIDED, and VECTOR) specify the work
                           distribution policy for the iterations of the
                           parallel DO loop.  These parameters are summarized
                           in Figure 19, page 82, and discussed in more
                           detail in the following paragraphs.  By default,
                           the iterations are distributed one at a time
                           (SINGLE).  You can select only one of the
                           following work distribution algorithms for a given
                           DO loop:

                           Parameter   Description
                           ---------   -----------

                            SINGLE      Specifies that the iterations are
                                        distributed one at a time to
                                        available processors.

                           CHUNKSIZE(n)
                                       Specifies the number of iterations to
                                       distribute to an available processor.
                                       n is an expression (for best
                                       performance, n should be an integer
                                       constant).  (For example, given 100
                                       iterations and CHUNKSIZE(4), 4
                                       iterations at a time are distributed
                                       to each available processor until the
                                       100 iterations are complete.)
                                       CHUNKSIZE(64) is an analog of the
                                       microtasking LONGVECTOR directive.

                           NUMCHUNKS(m)
                                        Specifies that the iterations are
                                        divided into m chunks of equal size
                                        (with a possible smaller residual
                                        chunk) and distributed to available
                                        processors.  (For example, given 100
                                       iterations and NUMCHUNKS(4), 25
                                       iterations at a time are distributed
                                       to each available processor until the
                                        100 iterations are complete.)

     ________________________________________________________________________

                 Work Distribution Parameters for Parallel Loops




       SINGLE                 Hand out the iterations one at a time to
                               available processors.  (Default
                               distribution.)



       CHUNKSIZE(n)           Number (n) of iterations to distribute to an
                               available processor.



       NUMCHUNKS(m)           Divide the iterations into m chunks of equal
                               size (with a possible smaller residual
                               chunk), and distribute these chunks to
                               available processors.



        GUIDED                 Uses "Guided Self-scheduling" to partition
                                the iteration space.



        VECTOR                 Specifies a scheduling algorithm used only
                                when "stripmining" an innermost vectorized
                                loop.  (Default for inner-loop Autotasking.)
     ________________________________________________________________________

           Figure 19.  Work distribution parameters for parallel loops

                           Parameter   Description
                           ---------   -----------

                           GUIDED      Specifies the use of "Guided Self-
                                       scheduling" to distribute the
                                       iterations to available processors.
                                       This mechanism does a good job of
                                       minimizing synchronization overhead
                                       while providing acceptable dynamic
                                       load balancing.

                           VECTOR      Specifies the use of "Guided Self-
                                       scheduling" to distribute a minimum of
                                       64 iterations to available processors.
                                       Also specifies the use of a special
                                       scheduling algorithm when
                                       "stripmining" an innermost vectorized
                                       loop.
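
                            As an illustration (the loop and variable names
                            are not taken from the manual's examples), the
                            following DO ALL combines several of the
                            preceding parameters:  a run-time test that
                            multiprocesses the loop only when it is long
                            enough, explicit scoping, and chunked work
                            distribution:

                                 CMIC$ DO ALL IF(N.GT.1000) SHARED(A, B, N)
                                 CMIC$* PRIVATE(I) CHUNKSIZE(64)
                                       DO 10 I = 1, N
                                          A(I) = A(I) + B(I)
                                  10   CONTINUE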






     CMIC$ PARALLEL and
     CMIC$ END PARALLEL
     4.3.2.1.2
                           The CMIC$ PARALLEL/END PARALLEL directives have
                           the following options:

                           -------------------------------------------------
                           CMIC$ PARALLEL [IF (expr)] [SHARED (var [,...])]
                           [PRIVATE (var [,...])] [AUTOSCOPE]
                           [CONTROL (var [,...])] [MAXCPUS (val)]

                           CMIC$ END PARALLEL
                           -------------------------------------------------

                           The PARALLEL directive marks the start of a
                           parallel region.  The END PARALLEL directive marks
                           the end of a parallel region.  The scope of a
                           variable in a parallel region is either shared or
                           private.  Shared variables are used by all
                           processors; private variables are unique to a
                           processor.  Parallel regions are combinations of
                           redundant code blocks and partitioned code blocks.
                           The PARALLEL directive indicates where multiple
                           processors enter execution, which may be different
                           from where they demonstrate a direct benefit
                           (partitioned code block).  See the descriptions of
                           the optional parameters (IF, SHARED, PRIVATE,
                           AUTOSCOPE, CONTROL, and MAXCPUS) in the DO ALL
                           directive description, page 79.
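
                            The following sketch is for illustration only and
                            is not taken from FPP output; the array names,
                            loop bounds, and IF test are hypothetical.  It
                            shows a parallel region in which a redundant code
                            block precedes a partitioned code block:

                                 CMIC$ PARALLEL  IF (N.GT.100)
                                 CMIC$1SHARED(A,B,N)  PRIVATE(I,SCALE)
                                 C     Redundant code block: every processor
                                 C     entering the region computes its own
                                 C     private copy of SCALE.
                                       SCALE = 1.0/REAL(N)
                                 CMIC$ DO PARALLEL
                                       DO 10 I = 1,N
                                       A(I) = SCALE*B(I)
                                 10    CONTINUE
                                 CMIC$ END PARALLEL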

     CMIC$ DO PARALLEL and
     CMIC$ END DO
     4.3.2.1.3
                           The CMIC$ DO PARALLEL/END DO directives have the
                           following options:

                           -------------------------------------------------
                           CMIC$ DO PARALLEL {[SINGLE]|[CHUNKSIZE (n)]|
                           [NUMCHUNKS (m)]|[GUIDED]|[VECTOR]}

                           CMIC$ END DO
                           -------------------------------------------------

                           The DO PARALLEL directive indicates that the DO
                           loop that begins on the next line may be executed
                           in parallel by multiple processors.  A directive
                           is not needed to end a DO PARALLEL loop.  A
                           control structure can be extended beyond the loop
                           by using the END DO directive.  The END DO
                           directive marks the end of a partitioned code
                           block that contains a DO PARALLEL loop.  This
                           ability to define partitioned code blocks that
                           contain DO loops as well as other code lets
                           Autotasking exploit parallelism in loops


     SG-3074 5.0               Cray Research, Inc.                         83


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           containing some forms of reduction computations.
                           These directives can be used only within a
                           parallel region, which is bounded by PARALLEL/END
                           PARALLEL directives.

                           The DO PARALLEL directive is equivalent to a CMIC$
                           DO GLOBAL microtasking directive.

                           The rest of the parameters (SINGLE, CHUNKSIZE,
                           NUMCHUNKS, GUIDED, and VECTOR) specify the work
                           distribution policy for the iterations of the
                           parallel DO loop.  By default, the iterations are
                           distributed one at a time (SINGLE).  Only one of
                           the work distribution algorithms can be chosen for
                           a given DO loop.  See the descriptions of SINGLE,
                           CHUNKSIZE, NUMCHUNKS, GUIDED, and VECTOR in the DO
                           ALL directive description, page 81.
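
                            As a hypothetical illustration (the loop, array
                            names, and chunk size are not from the manual's
                            examples), a chunked distribution could be
                            requested as follows:

                                 CMIC$ PARALLEL SHARED(A,B,C,N) PRIVATE(I)
                                 C     Each processor receives 64 iterations
                                 C     of the loop at a time.
                                 CMIC$ DO PARALLEL CHUNKSIZE(64)
                                       DO 30 I = 1,N
                                       A(I) = B(I)+C(I)
                                 30    CONTINUE
                                 CMIC$ END PARALLEL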

                           In the following example, a parallel region
                           (PARALLEL/END PARALLEL) is defined that uses a DO
                           PARALLEL/END DO pair and GUARD/END GUARD pair to
                           implement a parallel reduction computation.  A
                           description of the GUARD/END GUARD directives
                           follows the example.

                           Example:

                                      SUM = 0.0
                                      BIG = -1.0
                                CMIC$ PARALLEL  PRIVATE(XSUM,XBIG)
                                CMIC$1SHARED(SUM,BIG,AA,BB,CC)
                                      XSUM = 0.0
                                      XBIG = -1.0
                                CMIC$ DO PARALLEL
                                      DO 200   I = 1,2000
                                      :
                                      XSUM = XSUM+(AA(I)*(BB(I)-CC(AA(I))))
                                      XBIG = MAX(ABS(AA(I)*BB(I)),XBIG)
                                      :
                                200   CONTINUE
                                CMIC$ GUARD
                                      SUM = SUM+XSUM
                                      BIG = MAX(XBIG,BIG)
                                CMIC$ END GUARD
                                CMIC$ END DO
                                CMIC$ END PARALLEL


                            In this example, the GUARD/END GUARD pair
                            protects the update of the shared variables (SUM
                            and BIG), and the DO PARALLEL/END DO pair ensures
                            that all contributions to SUM and BIG are
                            included.

     CMIC$ CASE and
     CMIC$ END CASE
     4.3.2.1.4
                           The CASE directive serves as a separator between


     84                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           adjacent code blocks that are concurrently
                           executable.  The END CASE directive serves as the
                           terminator for a group of one or more parallel
                           CASE directives.  These directives can appear only
                           in a parallel region.

                           The CASE directive is equivalent to the CMIC$
                           PROCESS/ALSO PROCESS microtasking directives.  The
                           END CASE directive is equivalent to a CMIC$ END
                           PROCESS microtasking directive.

                           In the following example, CASE directives have
                           been added.  Currently, FPP does not perform
                           interprocedural analysis and would not, therefore,
                           add the CASE directives automatically for this
                           example.  Because CASE directives have been added
                           in the following example, subroutines called
                           within the CASE/END CASE directives are
                           concurrently executable:

                                CMIC$ PARALLEL
                                CMIC$   CASE
                                        CALL ABC
                                CMIC$   CASE
                                        CALL DEF
                                CMIC$   CASE
                                        CALL GHI
                                CMIC$   END CASE
                                CMIC$ END PARALLEL


                           The work in the subroutine calls completes before
                           execution continues with the code below the END
                           CASE.  A special form of the CASE/END CASE
                           directive pair forces only one processor to
                           execute a code block in a parallel region, as
                           follows:

                                CMIC$ PARALLEL
                                CMIC$ CASE
                                      CALL XYZ
                                CMIC$ END CASE
                                       :
                                CMIC$ DO PARALLEL
                                      DO 200   I = 1,IMAX
                                       :
                                200   CONTINUE
                                CMIC$ END PARALLEL


                           In the preceding example, only one processor calls
                           XYZ.

     CMIC$ GUARD and
     CMIC$ END GUARD
     4.3.2.1.5
                           The CMIC$ GUARD/END GUARD directives have the


     SG-3074 5.0               Cray Research, Inc.                         85


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           following syntax:

                           -------------------------------------------------
                           CMIC$ GUARD [n]
                           CMIC$ END GUARD [n]
                           -------------------------------------------------

                           The GUARD/END GUARD directive pair delimits a
                           critical region, and it provides the necessary
                           synchronization to protect (or guard) the code
                           inside the critical region.  A critical region is
                           a code block that is to be executed by only one
                           processor at a time, although all processors in
                           the parallel region execute it.

                           The optional n parameter is an expression that
                           serves as a mutual exclusion flag (using the low-
                           order 6 bits of the value).  That is, GUARD 1 and
                           GUARD 2 can be active concurrently, but two GUARD
                           7 directives cannot.  For optimal performance, n
                           should be an integer constant, and the general
                           expression capability is provided only for the
                           unusual case that the critical region number must
                            be passed to a lower-level routine.  When n is
                            not provided, the critical region blocks only
                            other instances of itself, not other critical
                            regions.  Critical regions may appear anywhere in
                           a program; that is, they are not limited to
                           parallel regions.
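
                            The following hypothetical fragment (the variable
                            names are illustrative only) shows how numbered
                            critical regions let unrelated updates proceed
                            concurrently; a processor inside GUARD 1 does not
                            block a processor inside GUARD 2:

                                 CMIC$ GUARD 1
                                       SUM = SUM+XSUM
                                 CMIC$ END GUARD 1
                                        :
                                 CMIC$ GUARD 2
                                       BIG = MAX(XBIG,BIG)
                                 CMIC$ END GUARD 2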

     CMIC$ CONTINUE
     4.3.2.1.6
                           The CONTINUE directive indicates that the external
                           routine called on the next line is a microtasked
                           subroutine.  The Fortran dependence analyzer (FPP)
                           does not generate this directive, and it cannot
                           prepare the called subprogram for this special
                           form of processing.
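
                            A minimal sketch follows; MSUB is a hypothetical
                            external routine that has already been prepared
                            for microtasking by hand:

                                 CMIC$ CONTINUE
                                       CALL MSUB(A,B,N)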

     CMIC$ SOFT EXIT
     4.3.2.1.7
                           The SOFT EXIT directive indicates that the branch
                           statement on the next line jumps somewhere outside
                           of the current parallel region.  Use of a SOFT
                           EXIT directive, in effect, terminates the parallel
                           region, and is typically used in search loops,
                           where if a single process reaches a specified
                           condition, all processors should stop.  Jumps can
                           have different targets.  You can use multiple SOFT
                           EXIT directives within one parallel region.

                           The SOFT EXIT directive is equivalent to a CMIC$
                           STOP ALL PROCESS microtasking directive.




     86                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Branch statements that jump around inside a
                           parallel region should not be preceded by a SOFT
                           EXIT directive.  Jumps to labels completely
                           outside of a parallel region must be preceded by a
                           SOFT EXIT directive.  Jumps from inside a DO
                           PARALLEL or a CASE structure to areas outside of
                           the structure, but still inside the parallel
                           region, are not allowed.  Jumps from one CASE into
                           another CASE of a multiple CASE/END CASE structure
                           are also not permitted.
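
                            The following sketch is hypothetical (the search
                            condition, labels, and names are illustrative
                            only).  It shows a search loop in which the first
                            processor to find the key value terminates the
                            parallel region for all processors:

                                 CMIC$ PARALLEL SHARED(A,KEY,N) PRIVATE(I)
                                 CMIC$ DO PARALLEL
                                       DO 10 I = 1,N
                                       IF (A(I) .EQ. KEY) THEN
                                 CMIC$ SOFT EXIT
                                       GO TO 20
                                       ENDIF
                                 10    CONTINUE
                                 CMIC$ END PARALLEL
                                 20    CONTINUE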

     CMIC$ TASKCOMMON
     4.3.2.1.8
                           A CMIC$ TASKCOMMON directive causes FMP to change
                           each occurrence of a specified common block into a
                           task common block (throughout all routines).
                           Converting a common block into a task common block
                            makes the contents of the block local to a task
                            but global within that task.  It also ensures
                            that processes get separate copies of the
                            contents of these blocks.

                           The CMIC$ TASKCOMMON directive has the following
                           syntax:

                           -------------------------------------------------
                           CMIC$ TASKCOMMON blocks
                           -------------------------------------------------

                           Argument blocks is a comma-separated list of
                           common blocks to be converted to task common
                            blocks.  You must specify this directive before
                            the first executable statement of a program unit
                            and before the COMMON statements that declare
                            the specified blocks.  Common
                           blocks to be converted with this directive should
                           contain only read-only variables and write-first
                           variables; otherwise, correct results cannot be
                           ensured.

                           Example:

                                CMIC$ TASKCOMMON DATA
                                      COMMON /DATA/ A(100), B(100)


                           This directive is equivalent to the following
                           code:

                                      TASK COMMON /DATA/ A(100), B(100)


                           You can also specify multiple blocks, as in the
                           following example:






     SG-3074 5.0               Cray Research, Inc.                         87


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                CMIC$ TASKCOMMON data1, data2, data3
                                      . . .
                                      COMMON /data1/ a,b,c
                                      COMMON /data2/ d,e,f
                                      COMMON /data3/ g,h,i


                           FMP also recognizes CDIR$ TASKCOMMON compiler
                           directives, which are placed before the first
                           executable statement of a program unit, and
                           specify which common blocks are to be converted to
                           task common blocks.
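
                            For example, the CDIR$ form of the first example
                            above might be written as follows (this sketch
                            reuses the block and array names from that
                            example):

                                 CDIR$ TASKCOMMON DATA
                                       COMMON /DATA/ A(100), B(100)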


     FMP microtasking
     directives
     4.3.2.2
                           In addition to the preceding Autotasking
                           directives, FMP recognizes the microtasking
                           directives.  These directives are summarized in
                           Figure 20, page 89, and described in the following
                           subsections.




































     88                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                           FMP Microtasking Directives



       * CMIC$ MICRO



       * CMIC$ DO GLOBAL



       * CMIC$ DO GLOBAL LONG VECTOR



       * CMIC$ DO GLOBAL BY expression



       * CMIC$ DO GLOBAL FOR expression



       * CMIC$ GETCPUS



       * CMIC$ PROCESS, CMIC$ ALSO PROCESS, and CMIC$ END PROCESS



       * CMIC$ RELCPUS



       * CMIC$ STOP ALL PROCESS
     ________________________________________________________________________

                  Figure 20.  FMP microtasking directive summary

     CMIC$ MICRO
     4.3.2.2.1
                           The CMIC$ MICRO directive designates a subroutine
                           to be microtasked and appears just before the
                           SUBROUTINE statement.  A subroutine introduced in
                           this way becomes a microtasked subroutine, or
                           fray.  Executing a RETURN or END statement signals
                           the end of multiprocessing work.  On exit, only
                           one processor returns to the calling routine.  A
                           function may not be microtasked, though it may, of
                           course, be rewritten as a subroutine and then
                           microtasked.  FMP microtasking directives
                           described in the following subsections provide a
                           way for you to specify subroutine-level


     SG-3074 5.0               Cray Research, Inc.                         89


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           parallelism in your programs.

                           The CMIC$ MICRO directive is an optional directive
                           and is not needed to use microtasking effectively.
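
                            A minimal sketch of a microtasked subroutine
                            follows; the routine name, arguments, and loop
                            are hypothetical:

                                 CMIC$ MICRO
                                       SUBROUTINE WORK(A,B,N)
                                       DIMENSION A(N), B(N)
                                 CMIC$ DO GLOBAL
                                       DO 10 I = 1,N
                                       A(I) = A(I)*B(I)
                                 10    CONTINUE
                                       RETURN
                                       END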

     CMIC$ DO GLOBAL
     4.3.2.2.2
                           The CMIC$ DO GLOBAL directive marks the beginning
                           of a control structure in which the iterations of
                           a DO loop comprise all of the processes.  DO
                           GLOBAL is probably the most commonly used control
                           structure.

                           The statement following the CMIC$ DO GLOBAL
                           directive is a DO statement.  The end of the
                           control structure is marked by the statement
                           containing the label referred to in the DO
                           statement; the DO GLOBAL control structure does
                           not require a preprocessor directive to close it.

                           DO GLOBAL directives may be used to create control
                           structures within a DO loop, but the path through
                           such control structures cannot be altered inside
                           the microtasked subroutine.  A DO GLOBAL statement
                           may be nested within a DO loop, but only one DO
                           GLOBAL can be executing at a time.

                           The loop variable for loops using DO GLOBAL must
                           be of type integer and the initial, final, and
                           step values must be integer expressions.

                           Example:

                                CMIC$ DO GLOBAL
                                      DO 20 J= 1, 1000
                                      DO 10 I= 1, 1000
                                      A(I,J)= X(I,J) * Y(I,J)
                                10    CONTINUE
                                20    CONTINUE


                           Three variants of the DO GLOBAL directive are
                           supplied to help you better balance microtasking
                           and vectorization.  These variants are described
                           next.

     CMIC$ DO GLOBAL LONG
     VECTOR
     4.3.2.2.3
                           The CMIC$ DO GLOBAL LONG VECTOR directive marks
                           the beginning of a control structure that permits
                           both vectorization and microtasking on an
                           innermost DO loop.  This structure divides a loop
                           into processes of 64 iterations each, microtasking
                           the "chunks" and vectorizing the iterations.  (One


     90                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           remainder chunk will have 64 or fewer iterations.)

                           To provide a speedup, the loop should be longer
                           than 64 iterations, and it should be vectorizable.
                           Two associated directives (DO GLOBAL BY and DO
                           GLOBAL FOR) let you change the iteration chunk
                           size, also known as the chunking factor.

                           Example:

                                CMIC$ DO GLOBAL LONG VECTOR
                                      DO 100 K = 1, 4096
                                      A(K) = B(K) * C(K)
                                 100  CONTINUE


                           This example divides the original loop into an
                           inner and outer loop, each consisting of 64
                           iterations.

     CMIC$ DO GLOBAL BY
     expression
     4.3.2.2.4
                           The CMIC$ DO GLOBAL BY expression directive is the
                           same as the DO GLOBAL LONG VECTOR directive except
                           that the iterations are divided into chunks of
                           size expression.  It divides a DO loop into an
                           inner loop, with expression iterations, and an
                           outer loop.  The number of iterations in the outer
                           loop is approximately the number of iterations in
                           the original DO loop divided by expression.  The
                           inner loop may be vectorized and the outer loop
                           microtasked.  Setting expression to a multiple of
                           64 maximizes the vectorization performance.

                           You must ensure that the Fortran expression
                           evaluates to an integer greater than 0.  The
                           expression is evaluated at run time and may change
                           each time the DO loop is executed, but it cannot
                           change during the execution of a DO GLOBAL.

                           Example:

                                CMIC$ DO GLOBAL BY 1024
                                      DO 100 K = 1, 4096
                                      A(K) = B(K) * C(K)
                                100   CONTINUE


                           In this example, the 4096 iterations of the DO
                           loop are divided into four pieces consisting of
                           1024 iterations each.






     SG-3074 5.0               Cray Research, Inc.                         91


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     CMIC$ DO GLOBAL FOR
     expression
     4.3.2.2.5
                           The CMIC$ DO GLOBAL FOR expression directive is
                           the same as the DO GLOBAL LONG VECTOR directive,
                           except that the iterations are divided into
                           expression number of chunks.

                           It divides a DO loop into an outer loop, with
                           expression iterations, and an inner loop.  The
                           number of iterations in the inner loop is
                           approximately the number of iterations in the
                           original DO loop divided by expression.  The inner
                           loop is then vectorized and the outer loop
                           microtasked.

                           Example:

                                CMIC$ DO GLOBAL FOR 4
                                      DO 100 K = 1, 4096
                                      A(K) = B(K) * C(K)
                                 100  CONTINUE


                           This example specifies the number of iterations
                           for the generated outer loop to be 4.  The number
                           of iterations for the inner loop is then 1024.
                           The effect is the same as for the DO GLOBAL BY
                           directive in the previous example.  The only
                           difference is whether you want to specify the
                           chunk size or the number of chunks.

     CMIC$ GETCPUS
     4.3.2.2.6
                           The CMIC$ GETCPUS directive has the following
                           syntax:

                           -------------------------------------------------
                           CMIC$ GETCPUS expression
                           -------------------------------------------------

                           This optional directive may appear anywhere in the
                           program outside a control structure.  It specifies
                           the maximum number of processors permitted to work
                           on a microtasked or autotasked program.
                           expression is an integer expression that defines
                           the number of physical CPUs that will be used for
                           the program.  The default value for expression is
                           the maximum number of physical CPUs available for
                           your program.

                           The NCPUS environment variable (if set) defines
                           the maximum number of physical CPUs available for
                           your program.  If NCPUS is not set, the default
                           number of CPUs used for the program is the number
                           of physical CPUs in the machine.
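
                            For example, the following hypothetical fragment
                            limits the program to at most four processors for
                            the microtasked work that follows:

                                 CMIC$ GETCPUS 4
                                        :
                                 CMIC$ DO GLOBAL
                                       DO 10 I = 1,N
                                       A(I) = B(I)*C(I)
                                 10    CONTINUE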


     92                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


     CMIC$ PROCESS,
     CMIC$ ALSO PROCESS,
     and CMIC$ END PROCESS
     4.3.2.2.7
                           The CMIC$ PROCESS directive marks the beginning of
                           a control structure and signals that the code
                           following it is a single process.

                           The CMIC$ ALSO PROCESS directive marks the
                           beginning of a process other than the first
                           process inside a control structure and the end of
                           the previous process.  A PROCESS directive
                           followed by any number of ALSO PROCESS directives
                           implements a classic fork-and-join multitasking
                           structure.

                           The CMIC$ END PROCESS directive marks the end of a
                           process and the end of a control structure.
                           PROCESS and END PROCESS directives can also be
                           used to ensure single-processor execution of a
                           portion of code.  The single-threaded section
                           contains a single CMIC$ PROCESS directive (that
                           is, the section does not contain an ALSO PROCESS
                           directive).

                           Example:

                                CMIC$ PROCESS
                                      DO 10 I= 1, 1000
                                      A(I) = X(I) * Y(I)
                                10    CONTINUE

                                CMIC$ ALSO PROCESS
                                      DO 20 I= 1, 1000
                                      B(I) = X(I) * Z(I)
                                20    CONTINUE

                                CMIC$ END PROCESS


                            In this example, two processors may execute the
                            DO 10 loop and the DO 20 loop simultaneously (one
                            loop each).  Both portions must be completed
                            before execution of the remainder of the program
                            continues.

     CMIC$ RELCPUS
     4.3.2.2.8
                           The CMIC$ RELCPUS directive specifies that the
                           processors acquired for microtasking should be
                           released back to the system.  It is the reverse of
                           the GETCPUS directive.  This directive should be
                           used when no microtasking is to be done for a long
                           period of time or when the program is preparing to


     SG-3074 5.0               Cray Research, Inc.                         93


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           terminate.

                           This directive is optional; if it is not used, all
                           processors acquired by the GETCPUS directive are
                           held until the program terminates.  When a STOP,
                           END, or CALL EXIT statement is encountered, the
                           microtasking slave processors are released
                           automatically before the job step is terminated.
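
                            A hypothetical sketch follows; the work between
                            the two directives is omitted:

                                 CMIC$ GETCPUS 4
                                        :
                                 C     Microtasked work is performed here.
                                        :
                                 CMIC$ RELCPUS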

     CMIC$ STOP ALL
     PROCESS
     4.3.2.2.9
                           The CMIC$ STOP ALL PROCESS directive provides a
                           way to exit from both PROCESS and DO GLOBAL
                           control structures without performing all of the
                           processes or iterations.  This directive forces
                           all processors to complete work in a process if
                           they are in one and then to accept no more work,
                           closing the control structure.

                           The CMIC$ STOP ALL PROCESS directive must be
                           followed by a branch statement.  Processors resume
                           work at the target of this branch statement.  For
                           example, you may want to end processing in a DO
                           loop when a certain solution is found.  If the
                           solution is never found, the loop is executed some
                           maximum number of iterations.  STOP ALL PROCESS
                           provides this graceful exit.  Typically, the
                           program will appear as in the following example.

                           Example:

                                CMIC$ DO GLOBAL
                                      DO 1 I = 1,10000
                                      . . .
                                      IF end-condition THEN
                                CMIC$ STOP ALL PROCESS
                                      GO TO 2
                                      ENDIF
                                      . . .
                                1     CONTINUE
                                2     CONTINUE


                           The previous section of code is portable.  You
                           must ensure that work is not done between the
                           statement that ends the DO loop and the statement
                           at which processing resumes.  You must also ensure
                           that the statement number to which the single-
                           processor version jumps and the one to which the
                           microtasked version jumps (as a result of the STOP
                           ALL PROCESS directive) are the same.  The
                           preprocessor does not check for any errors you
                           might make in using the STOP ALL PROCESS
                           directive.


      94                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     FMP data scope rules
     4.3.2.3
                           When FMP generates code to handle a CMIC$ DO ALL
                           or a CMIC$ PARALLEL statement, all the variables
                           and arrays in the region must be assigned either
                           shared or private status or the AUTOSCOPE
                           parameter must be specified.  A shared variable or
                           array is one that all the processors use.  A
                           private variable or array is one for which each of
                            the processors has its own copy.

                            If the AUTOSCOPE parameter is specified, FMP uses
                            the following rules to determine shared or
                            private status.

                            A variable or array is shared if any of the
                            following is true:

                            * It appears in a SHARED statement.

                            * It is a read-only variable or a read-only
                              array.

                            * It is an array indexed by the loop index.

                            * It is a read-then-write variable or a
                              read-then-write array.

                            A variable or array is private if any of the
                            following is true:

                            * It appears in a PRIVATE statement.

                            * It is a write-then-read variable or a
                              write-then-read array.

                           In both cases, the verbs SHARED and PRIVATE
                           override the default determination.

                           When you specify the AUTOSCOPE parameter, FMP can
                           scope data incorrectly.  Sometimes the scope of
                           data cannot be determined at compile time, because
                           of conditional blocks of code within a loop.  FMP
                           assumes the flow of all loops to be top to bottom.
                           Also, FMP cannot correctly determine the scope of
                           data passed as arguments to subroutines from
                           within a parallel region.  Therefore, if you
                           insert an Autotasking directive around a loop
                           containing subroutine calls, conditional branches,
                           or code blocks, do not use the AUTOSCOPE
                           parameter, but scope all data explicitly.






     SG-3074 5.0               Cray Research, Inc.                         95


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                            These data scoping problems apply only when you
                            use the AUTOSCOPE parameter on user-inserted
                            directives.  They do not occur with directives
                            that are inserted automatically by FPP.

     Read-only variables
     4.3.2.3.1
                           The following examples show read-only variables.

                                CMIC$ DOALL PRIVATE(I) SHARED(N1,N2,A)
                                      DO 10 I=N1, N2
                                      ...=A
                                  10  CONTINUE


                           A is a shared variable because it is a read-only
                           variable.  All processors share the same location
                           for A.

                           CMIC$ DOALL SHARED(N1,N2,M1,M2,V) PRIVATE(I,J)
                                 DO 10 I=N1, N2
                                 DO 10 J=M1, M2
                                 ... = V(J)
                             10  CONTINUE


                           V is shared because it is a read-only array.  M1
                           and M2 are also shared because they are read-only
                           variables.  I and J are written and then read, so
                           they are private variables.

     Array indexed by loop
     index
     4.3.2.3.2
                           The following example shows an array indexed by
                           the loop index:

                                CMIC$ DOALL SHARED(N1,N2,V,U,J) PRIVATE(I,T)
                                      DO 10 I=N1, N2
                                      T=V(I)
                                      U(I,J)=T
                                  10  CONTINUE


                           U and V are shared arrays because they are indexed
                           by the loop index.  All processors share the same
                           location for V and U.  T is written and then read,
                           so it is a private variable.  J is shared because
                           it is a read-only variable.

     Read-then-write
     variables
     4.3.2.3.3


     96                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The following example shows read-then-write
                           variables:

                                      SUM=0.0
                                CMIC$ DOALL SHARED(N1,N2,V,SUM) PRIVATE(I,T)
                                      DO 10 I = N1, N2
                                      T = V(I)
                                CMIC$ GUARD
                                      SUM = SUM + T
                                CMIC$ END GUARD
                                   10 CONTINUE


                           SUM is a shared variable because it is read before
                           it is written.  Special care is needed in writing
                           into a shared variable.

     Write-then-read
     variables and arrays
     4.3.2.3.4
                           The following example shows write-then-read
                           variables and arrays:

                           CMIC$ DOALL SHARED(N1,N2,M1,M2) PRIVATE(I,J,V)
                                 DO 10 I = N1, N2
                                 DO 10 J = M1, M2
                                 V(J) = ...
                                 ... = V(J)
                              10 CONTINUE


                           V is written to and then read.  It must be a
                           private array.

     User-added scope
     required
     4.3.2.3.5
                            The automatic determination also misses some
                            cases.  The flow of the loop is assumed to be top
                            to bottom.  If the code is not in this order, FMP
                            can determine the scope incorrectly.  In all
                            cases, the SHARED/PRIVATE verbs override the
                            determination by FMP.

                           The examples that follow show situations that
                           require you to add scope verbs.

                           In the following example, the flow of control is
                           confused:








     SG-3074 5.0               Cray Research, Inc.                         97


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                C     This is wrong
                                CMIC$ DO ALL AUTOSCOPE
                                      DO 10 I = N1, N2
                                      GO TO 3
                                    2 V(I) = T
                                      GO TO 10
                                    3 T = V(I)
                                      GO TO 2
                                   10 CONTINUE


                           FMP determines that T is read before it is
                           written.  It determines incorrectly that T is
                           shared.  A correction to this code is as follows:

                                C     This is correct
                                CMIC$ DOALL PRIVATE(T) AUTOSCOPE
                                      DO 10 I = N1, N2
                                      GO TO 3
                                    2 V(I) = T
                                      GO TO 10
                                    3 T = V(I)
                                      GO TO 2
                                   10 CONTINUE


                           The private declaration overrides the read-
                           before-write rule.

                           The following example shows use of a subroutine
                           call:

                                CMIC$ DOALL AUTOSCOPE
                                      DO 10 I = N1, N2
                                      CALL MMP(A(I), B, C)
                                   10 CONTINUE


                           It is not possible to determine whether A(I), B,
                           or C is read or written.  FMP assigns A as shared
                           because it is indexed by the control variable.
                           FMP assumes that B and C are read; therefore, it
                           designates them as shared variables.  FMP prints a
                           message stating that such variables as A, B, and C
                           "require a private/shared declaration."  However,
                           FMP treats them as shared.  If you want the scope
                           of arguments to such subroutine calls to be
                           private, you must explicitly declare them as
                           private.  To be certain of correct scope, all
                           variables or arrays that occur in a function or
                           subroutine call should be specified as shared or
                           private.






     98                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Compiler directives
     4.3.3
                           Besides the FPP and FMP directives already
                           described, FPP recognizes CFT77 directives, which
                           are briefly described in this subsection.  For
                           more information, see the CF77 Compiling System,
                           Volume 1:  Fortran Reference Manual, publication
                           SR-3071.

                           The FPP dependence analyzer accepts the following
                           CFT77 directives (preceded by CDIR$):  IVDEP,
                           NOVECTOR, VECTOR, SHORTLOOP, NEXTSCALAR, and
                           VFUNCTION.  Table 6 shows the correspondence
                           between CFT77 and FPP directives.

                                 Table 6.  CFT77 versus FPP directives

                           __________________________________________________
                           CFT77                            FPP treats as:
                           __________________________________________________
                           CDIR$ IVDEP                      CFPP$ NODEPCHK L
                           CDIR$ NOVECTOR                   CFPP$ NOVECTOR R
                           CDIR$ VECTOR                     CFPP$ VECTOR R
                           CDIR$ SHORTLOOP                  CFPP$ NOINNER L
                           CDIR$ NEXTSCALAR                 CFPP$ NOVECTOR L
                           CDIR$ VFUNCTION                  Recognized by FPP
                           __________________________________________________
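
                            For example, in the following hypothetical loop
                            (the array and index names are illustrative),
                            CDIR$ IVDEP asserts that the indirect addressing
                            hides no dependency; FPP honors the assertion as
                            if CFPP$ NODEPCHK L had been specified for the
                            loop:

                                 CDIR$ IVDEP
                                       DO 10 I = 1,N
                                       A(IX(I)) = A(IX(I))+B(I)
                                 10    CONTINUE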


                           Other Cray Fortran directives are treated as
                           comments by FPP.


























     SG-3074 5.0               Cray Research, Inc.                         99



                                            FPP Data Dependency Analysis  [5]
     ########################################################################







                           For certain loops, parallel or vector execution
                           would result (or could result) in incorrect
                           answers.  A loop in which results from one loop
                           pass feed back into a future pass of the same loop
                           is said to have a data dependency conflict and may
                           not be optimized completely.  (Such a loop is also
                           said to be recursive or to contain recurrences.)
                           In these cases, FPP detects the problem, reports
                           it to the user, and leaves the loop in its
                           original form.

                           You can assert that there is no recursion by using
                           a directive or switch, both of which are discussed
                           later in this section.  Indirect addressing of
                           arrays can create hidden dependency conflicts.
                           Unless you make such an assertion (that no
                           recursion exists), the following conditions apply
                           to loops containing indirectly addressed arrays:

                           * A gathered array must not also appear on the
                             left-hand side in the loop.

                           * A scattered array must not appear anywhere else
                             in the loop.
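
                            As a hypothetical illustration of these two
                            conditions (the array and index names are not
                            taken from the manual's examples), the first loop
                            below gathers through the index array IX and also
                            stores into the gathered array, so FPP must
                            assume a possible dependency; the second loop
                            gathers only, so it can be optimized:

                                 C     A is gathered and also appears on the
                                 C     left-hand side; the loop is rejected.
                                       DO 10 I = 1,N
                                       A(IX(I)) = A(IX(I))+B(I)
                                 10    CONTINUE

                                 C     A is gathered only; the loop meets the
                                 C     conditions and can be optimized.
                                       DO 20 I = 1,N
                                       C(I) = A(IX(I))+B(I)
                                 20    CONTINUE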

                           In certain cases, FPP can determine that the
                           problem is limited to a subset of the operations
                           in the loop, and it will cut the loop into
                           subloops that can be optimized and those that
                           cannot be optimized.  The "DP" field in the loop
                           summary measures how much of the loop is
                           dependent, that is, left unoptimized.  If FPP
                           determines that more than a certain percentage of
                           a loop is dependent, it will not optimize the loop
                           at all.  (The exact percentage depends on various
                           other factors.)   Partial optimization of this
                           kind is done only for vectorized loops, not for
                           autotasked loops.

                           FPP also examines EQUIVALENCE statements to see
                           whether they may be masking recursion, and it
                           suppresses any potentially unsafe transformations.

                           FPP data dependency analysis is summarized in
                           Figure 21, page 102.  The following subsections
                           expand the discussion of data dependency analysis
                           and provide examples of directives you can use to
                           communicate information to FPP.



     SG-3074 5.0               Cray Research, Inc.                        101


     FPP Data Dependency Analysis   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                            Data Dependency Analysis


       A loop in which results from one loop pass feed back into a future
       pass of the same loop is said to have a data dependency conflict
       and may not be completely optimized.  (Such a loop is also said to
       be recursive or to contain recurrences.)


       Indirect addressing of arrays can create hidden dependency
       conflicts.  Therefore, unless directives are inserted to assert
       otherwise, the following conditions must be true for optimization
       of loops containing indirectly addressed arrays:


       * A gathered array must not also appear on the left-hand side in
         the loop.


       * A scattered array must not appear anywhere else in the loop.


       In these cases, FPP detects the problem, reports it to you, and
       leaves the loop in its original form.  You can assert that there is
       no recursion by using a directive or switch.
     ________________________________________________________________________

                     Figure 21.  FPP data dependency analysis




     Data dependency
     examples
     5.1
                           Figure 22, page 104, demonstrates the concept of
                           data dependency.  Four similar loops are
                           displayed.  For each loop, the sequences of
                           instructions that would be executed in scalar mode
                           (one at a time) and in vector mode (whole arrays
                           at a time) are also shown.  Lowercase variables
                           (such as "a") stand for new values set in the
                           current loop, and uppercase variables (such as
                           "A") stand for old values that were set before the
                           loop started.











     102                       Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide   FPP Data Dependency Analysis
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           It is easy to see that the scalar and vector
                           sequences for Figure 22, part 11.1A, are not the
                           same; the vector version uses only old values of
                           A, and the scalar version uses new ones.  FPP
                           detects that this loop is not safe to optimize,
                           puts out a data dependency conflict message, and
                           leaves the loop in its original form (the loop is
                           "rejected").  In contrast, the scalar and vector
                           sequences for Figure 22, part 11.1B, are
                           identical; no feedback of results from one loop
                           pass to another is occurring here.  FPP recognizes
                           that this loop is safe to optimize and does so.

                           The situation is less clear in Figure 22, part
                           11.1C; here the use of the variable "K" in A's
                           subscript makes the proper scalar sequence
                           impossible to determine at compile time (with some
                           exceptions; see "Ambiguous subscript resolution,"
                            page 108).  If K is 1, the loop functions like the
                           recursive loop in part 11.1A; if K is -1, the loop
                           is safe to optimize, as was 11.1B.  When it is not
                           possible for FPP to tell whether a loop is
                           recursive, the loop is said to have ambiguous
                           subscripting.  Often the user knows that a loop
                           (or perhaps all the loops in a routine or program)
                           is not recursive, even though FPP cannot tell, as
                           in 11.1C.  In these cases, you can direct FPP to
                           ignore potential recursions.  Use the -Wd"-d d"
                           option to cf77, the -d d option to fpp, or the
                           NODEPCHK directive.  Examples are in "NODEPCHK
                           (declaring nonrecursion)," page 111.
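
                            For example (the source file name is
                            hypothetical), either of the following command
                            lines directs FPP to ignore potential recursions
                            throughout the file:

                                 cf77 -Wd"-d d" prog.f
                                 fpp -d d prog.f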

                           Figure 22, part 11.1D, shows the same loop as in
                           part 11.1A, but with a DO increment of 2 rather
                           than 1.  FPP detects that the loop is now not
                            recursive, because no results feed back into the
                           calculation.  These four similar examples point
                           out the sensitivity of data dependency analysis to
                           offset and stride values of arrays that appear on
                           both sides of the equal sign within a loop.

                           Data dependency analysis extends to more than just
                           single-line loops, of course; in the following
                           loop, the reference to A at the top of the loop
                           conflicts with the store into A at the bottom.
                           FPP prints a message to this effect and inserts a
                           directive to inhibit vectorization explicitly.














     SG-3074 5.0               Cray Research, Inc.                        103


     FPP Data Dependency Analysis   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

              a: new value of A           A: old value of A
       ----------------------------11.1A---------------------------------
       ORIGINAL LOOP:                     VECTOR SEQUENCE CORRECT?
           DO 71 I = 2,N                  No. We are not using updated
        71 A(I+1) = A(I)*B(I)+C(I)        values of A.

       SCALAR SEQUENCE:                   VECTOR SEQUENCE:
           a(3) = A(2)*B(2)+C(2)          a(3) = A(2)*B(2)+C(2)
           a(4) = a(3)*B(3)+C(3)          a(4) = A(3)*B(3)+C(3)
           a(5) = a(4)*B(4)+C(4)          a(5) = A(4)*B(4)+C(4)
           a(6) = a(5)*B(5)+C(5)          a(6) = A(5)*B(5)+C(5)
                      :                              :
       ----------------------------11.1B---------------------------------
       ORIGINAL LOOP:                     VECTOR SEQUENCE CORRECT?
           DO 72 I = 2,N                  Yes. Sequence is identical.
        72 A(I-1) = A(I)*B(I)+C(I)

       SCALAR SEQUENCE:                   VECTOR SEQUENCE:
           a(1) = A(2)*B(2)+C(2)          a(1) = A(2)*B(2)+C(2)
           a(2) = A(3)*B(3)+C(3)          a(2) = A(3)*B(3)+C(3)
           a(3) = A(4)*B(4)+C(4)          a(3) = A(4)*B(4)+C(4)
           a(4) = A(5)*B(5)+C(5)          a(4) = A(5)*B(5)+C(5)
                      :                              :
       ----------------------------11.1C---------------------------------
       ORIGINAL LOOP:                     VECTOR SEQUENCE CORRECT?
           DO 73 I = 2,N                  ? Depends on K. Here, if 0