                     CCCCCC   RRRRRRR    AAAAAA   YY    YY
                   CCCCCCCC  RRRRRRRR  AAAAAAAA  YY    YY
                   CCC   CC  RR    RR  AA    AA  YYY  YYY
                   CC        RR   RR   AA    AA   YYYYYY
                   CC        RRRRRR    AAAAAAAA    YYYY
                   CC        RRRRRR    AAAAAAAA     YY
                   CC        RR   RR   AA    AA     YY
                   CCC   CC  RR   RR   AA    AA     YY
                   CCCCCCCC  RR    RR  AA    AA     YY
                    CCCCCC   RR    RR  AA    AA     YY

                               RESEARCH,  INC.

   CF77 Compiling System, Volume 4: Parallel Processing Guide (SG-3074 5.0) 

   This user's guide defines and describes the Autotasking feature of the 
   CF77 compiling system release 5.0.  Autotasking is the automatic 
   distribution of loop iterations to multiple processors.  This user's 
   guide is one manual in a set (SR-3071, SR-3072, and SG-3074) describing 
   the CF77 compiling system, which includes the Cray Fortran compiler 
   CFT77 and the Autotasking software.


                                                           Record of Revision
     ########################################################################







     The date of printing or software version number is indicated in the
     footer.  In reprints with revision, changes are noted by revision bars
     along the margin of the page.





            Version         Description

              4.0           June 1990.  Original printing.  This manual
                            replaces the UNICOS Autotasking User's Guide,
                            publication SN-2088, and the COS Autotasking
                            User's Guide, publication SN-3033.  It supports
                            the Autotasking feature of the CF77 compiling
                            system release 4.0.  The "New Features" page
                            details specific features associated with the 4.0
                            release.


              5.0           June 1991.  Reprint with revision to support the
                            Autotasking feature of the CF77 compiling system
                            release 5.0, which runs on CX/CEA and CRAY-2
                            systems under the UNICOS 6.0 release or higher.
                            Documentation for the Autotasking feature under
                            the Cray Research operating system COS,
                            previously included in this manual, is no longer
                            included.  COS users of Autotasking should see
                            revision 4.0 of this manual.

                            New fpp command options include the following:
                            -H, to specify directories that contain INCLUDE
                            files; -N80, to specify 80-column input files;
                            -P, to specify the number of lines per page, for
                            page-formatted listings; -Q, to specify the size
                            of FPP-generated temporary arrays; and -V, to
                            display current FPP version information.  New FPP
                            optimization switch f lets you enable generation
                            of debugging directives.  Optimization switch j,
                            to translate nested loop idioms into library
                            calls, is now enabled by default.  A new FPP
                            directive, CFPP$ PRIVATEARRAY, lets you specify
                            that private arrays can be autotasked.

                            New fmp command options include the following:
                            -I, to specify directories that contain INCLUDE
                            files; -N80, to specify 80-column input files;
                            and -V, to display current FMP version
                            information.  A new FMP Autotasking directive,
                            TASKCOMMON, lets you specify that common blocks
                            should be converted to task common blocks.

                            UNICOS environment variables are now documented
                            in section 2, "CF77 User Interface," rather than
                            in section 13.

     SG-3074 5.0               Cray Research, Inc.                        iii



                                                                     Contents
     ########################################################################








       v  Preface

       1  Introduction  [1]
       3  Evolution of CRI parallel processing software
       5  Using microtasking and macrotasking with Autotasking
       6  Goals of Autotasking
       6  When to use Autotasking
       7  Autotasking's effect on vectorization
       8  Speedup expected from Autotasking

      11  CF77 User Interface  [2]
      11  CF77 compiling system
      18  UNICOS user interface
      28  UNICOS environment variables

      35  Invoking FPP and FMP Directly  [3]
      35  UNICOS fpp command
      47  UNICOS fmp command

      51  Concepts and Directives  [4]
      51  Concepts
      56  Levels of user intervention with Autotasking
      57  Directives

     101  FPP Data Dependency Analysis  [5]
     102  Data dependency examples
     107  Reference reordering
     108  Ambiguous subscript resolution
     109  Loop splitting to split index set
     109  Loop peeling
     110  Using data dependency directives
     115  Loop splitting to isolate recursion
     115  Translation of linear recursion
     117  Array indexing

     123  FPP Loop Analysis and Tuning  [6]
     124  Loop analysis
     136  Loop optimizations
     145  Loop tuning parameters








     SG-3074 5.0               Cray Research, Inc.                          v


     Contents                       CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     167  Additional FPP Optimization  [7]
     167  Vectorization enhancement
     174  Inline expansion
     187  Scalars in loops

     193  FPP Source Output  [8]
     194  Names generated by FPP
     195  Temporary arrays generated by FPP
     197  Example FPP listings

     203  Autotasking Performance  [9]
     203  Performance expectations for vectorization
     205  Amdahl's Law for vectorization
     206  Performance expectations for Autotasking
     206  Amdahl's Law for multitasking
     209  Estimating the percentage of parallelism within a program
     211  Prerequisites for high performance
     212  Characteristics of parallel programs
     215  Extent of parallelism and load balancing
     220  Overhead produced by Autotasking
     224  Autotasking performance example:  NAS Kernel Benchmark

     235  Autotasking Analysis Tools  [10]
     235  Tool summary
     237  Autotasking tools
     243  Other UNICOS tools

     253  Autotasking Memory Usage  [11]
     253  Increased program space requirements
     256  Increased stack space requirements
     257  Specifying memory requirements

     263  Autotasking in a Batch Environment  [12]
     263  Realistic Autotasking performance expectations
     272  Autotasking in a heavily loaded batch environment

     275  Debugging Autotasked Programs  [13]
     275  Problems unrelated to the use of multiple processors
     276  Problems related to the use of multiple processors
     276  CDBX debugger support
     281  Environment variables for all systems

     283  UNICOS Interface to Autotasking  [14]

     287  Software Anomalies  [A]

     291  FPP TIDY Subprocessor  [B]
     291  TIDY options
     295  TIDY parameters
     298  FORMAT and DATA statements
     298  Continued lines





     vi                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide                       Contents
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     299  Interpreting FMP Intermediate Source Code  [C]

     307  UNICOS Command Pages  [D]

     309  FMP and FPP Messages  [E]

          Figures
       2  Figure 1.  Cray Research parallel processing capabilities
       4  Figure 2.  Cray Research multitasking implementations
      12  Figure 3.  CF77 compiling system
       14  Figure 4.  FPP - the dependence analysis phase
      17  Figure 5.  Summary of FMP and CFT77 roles in CF77 compiling system
      18  Figure 6.  Phases of Autotasking using cf77
      21  Figure 7.  cf77 command summary
      23  Figure 8.  cf77 command control option summary
      29  Figure 9.  Summary of multitasking environment variables
      53  Figure 10. Multitasking terminology
      54  Figure 11. Multitasking terminology (continued)
      57  Figure 12. Levels of intervention with Autotasking
      59  Figure 13. Directive summary by type
      63  Figure 14. FPP transformation directive summary
      68  Figure 15. FPP data dependency directive summary
      72  Figure 16. Miscellaneous FPP directive summary
      77  Figure 17. Autotasking versus microtasking
      78  Figure 18. FMP Autotasking directives
      82  Figure 19. Work distribution parameters for parallel loops
      89  Figure 20. FMP microtasking directive summary
     102  Figure 21. FPP data dependency analysis
     104  Figure 22. FPP data dependency analysis
     107  Figure 23. Summary of FPP techniques to optimize data dependencies
     117  Figure 24. Summary of linear recursion patterns recognized by FPP
     124  Figure 25. Summary of FPP loop analysis
     132  Figure 26. Summary of FPP loop selection criteria for vectorization
     134  Figure 27. Summary of FPP criteria for Autotasking and possible
                     inhibitors
     137  Figure 28. Summary of FPP loop optimization techniques
     146  Figure 29. Summary of FPP loop tuning parameters
     169  Figure 30. Summary of additional FPP vectorization enhancements
     188  Figure 31. Summary of FPP transformations of scalars in loops
     198  Figure 32. Summary of FPP listing features
     204  Figure 33. Summary of performance issues
     210  Figure 34. Simple technique for estimating parallelism in a program
     213  Figure 35. Characteristics of programs with a high potential for
                     parallelism
     217  Figure 36. Summary of parallelism and load balancing
     218  Figure 37. Execution scenario for example 1
     219  Figure 38. Execution scenario for example 2
     223  Figure 39. Overhead introduced by Autotasking on CX/CEA
     223  Figure 40. Overhead introduced by Autotasking on CRAY-2
     255  Figure 41. FMP source code generation
     266  Figure 42. Summary of issues influencing Amdahl's Law under
                     realistic Autotasking conditions


     SG-3074 5.0               Cray Research, Inc.                        vii


     Contents                       CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     273  Figure 43. Summary of advantages and disadvantages of Autotasking
                     in a batch system
     285  Figure 44. Parallel region overview (master process)
     285  Figure 45. Parallel region overview (slave process)

          Tables
      42  Table 1.  Optimization switches enabled and disabled by -e and -d
      46  Table 2.  Listing switches enabled and disabled by -p and -q
      60  Table 3.  Allowable scope parameters for CFPP$ directives
      61  Table 4.  CFPP$ directives
      76  Table 5.  Equivalent Autotasking and microtasking directives
      99  Table 6.  CFT77 versus FPP directives
     199  Table 7.  Loop disposition codes
     208  Table 8.  Amdahl's Law for multitasking
      226  Table 9.  NAS Kernel Benchmark - zero changes
      227  Table 10. NAS Kernel Benchmark - twenty changes
     234  Table 11. VPENTA case study results from a CRAY Y-MP system
     235  Table 12. Which tool to use?
     237  Table 13. Tool impact
     270  Table 14. Autotasking wall-clock speedups in batch environment
     291  Table 15. TIDY switches






































     viii                      Cray Research, Inc.                SG-3074 5.0



                                                                      Preface
     ########################################################################







                           This user's guide is one manual in a set
                           describing the Cray Research CF77 compiling
                           system.  The compiling system includes the Cray
                           Research Fortran compiler CFT77 and the
                           Autotasking software described in this guide.
                           Other manuals in this set include the following:

                           * CF77 Compiling System, Volume 1:  Fortran
                             Reference Manual, publication SR-3071

                           * CF77 Compiling System, Volume 2:  Compiler
                             Message Manual, publication SR-3072

                           * CF77 Compiling System, Volume 3:  Vectorization
                             Guide, publication SG-3073

                           * CF77 Compiling System Ready Reference,
                             publication SQ-3070

                           This user's guide defines and describes the
                           Autotasking feature of the CF77 compiling system.
                           Autotasking is the automatic distribution of loop
                           iterations to multiple processors.  The
                           Autotasking feature described in this manual runs
                           on CRAY Y-MP, CRAY X-MP EA, CRAY X-MP, and CRAY-2
                           computer systems running the UNICOS 6.0 release or
                           higher and the CF77 5.0 release.  Autotasking is
                           released as part of the CF77 5.0 release.

                           This user's guide describes the dependence
                           analyzer, FPP, the translation phase, FMP, and the
                           user interface to the compiling system, cf77.  FPP
                           preprocesses DO and IF loops for the CFT77
                           compiler, and improves performance of Cray Fortran
                           programs by providing Autotasking and
                           vectorization enhancement.  FMP translates
                           directives and original Fortran source for
                           Autotasking.  cf77 provides a one-step user
                           interface to the compiling system.

                           The following Cray Research, Inc. (CRI) manuals
                           provide information about related subjects.
                           Unless otherwise noted, all publications
                           referenced in this manual are CRI publications.

                           * UNICOS User Commands Reference Manual,
                             publication SR-2011

                           * UNICOS User Commands Ready Reference,
                             publication SQ-2056


     SG-3074 5.0               Cray Research, Inc.                         ix


     Preface      CF77 Compiling System, Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           * CAL Assembler Version 2 Reference Manual,
                             publication SR-2003

                           * CRAY-2 Fortran (CFT2) Reference Manual,
                             publication SR-2007

                           * Macros and Opdefs Reference Manual, publication
                             SR-0012

                           * UNICOS Macros and Opdefs Reference Manual for
                             CRAY-2 Computer Systems, publication SR-2082

                           * Volume 1:  UNICOS Fortran Library Reference
                             Manual, publication SR-2079

                           * Volume 2:  UNICOS C Library Reference Manual,
                             publication SR-2080

                           * Volume 3:  UNICOS Math and Scientific Library
                             Reference Manual, publication SR-2081

                           * Volume 4:  UNICOS System Calls Reference Manual,
                             publication SR-2012

                           * Segment Loader (SEGLDR) and ld Reference Manual,
                             publication SR-0066

                           * UNICOS Performance Utilities Reference Manual,
                             publication SR-2040

                           * UNICOS CDBX Symbolic Debugger Reference Manual,
                             publication SR-2091

                           * UNICOS CDBX Debugger User's Guide, publication
                             SG-2094

                           * CRAY Y-MP, CRAY X-MP EA, and CRAY X-MP
                             Multitasking Programmer's Manual, publication
                             SR-0222

                           * CRAY-2 Multitasking Programmer's Manual,
                             publication SN-2026

                           * Interlanguage Programming Conventions,
                             publication SN-3009




     Conventions
                           The Hardware Product Line sheet, located at the
                           end of this preface, defines the hardware naming
                           conventions used in this manual.  This sheet shows
                           both the chronological evolution of Cray Research
                           mainframes and the characteristics of each
                           mainframe group.  The reverse side of the sheet
                           contains definitions of the terms used on the
                           sheet and throughout this manual.


     xi                        Cray Research, Inc.                SG-3074 5.0


     CF77 Compiling System, Volume 4:  Parallel Processing Guide      Preface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The following typographic conventions are used
                           throughout this manual:

                           Convention   Meaning
                           ----------   -------

                           command(1)   The designation (1) following a
                                        command name indicates that the
                                        command is documented in UNICOS User
                                        Commands Reference Manual,
                                        publication SR-2011

                           system call(2)
                                        The designation (2) following a
                                        system call name indicates that the
                                        system call is documented in Volume
                                        4:  UNICOS System Calls Reference
                                        Manual, publication SR-2012

                           library routine(3X)
                                        The designation (3X) following a
                                        routine name indicates that the
                                        routine is documented in one of the
                                         CRI library reference manuals
                                         (SR-2079, SR-2080, SR-2081, SR-2057,
                                        or SM-2083).  The letter following
                                        the number 3 indicates the
                                        appropriate manual.

                                        For a list of the 3X library routine
                                        designations and their associated
                                        manuals, see the FILES section of the
                                        man(1) man page.

                           typewriter font
                                        Denotes literal items such as command
                                        names, file names, routines,
                                        directory names, path names, signals,
                                        messages, and programming language
                                        structures.

                           italic font  Denotes variable entries and words or
                                        concepts being defined.

                           bold typewriter font
                                        In screen drawings of interactive
                                        sessions, denotes literal items
                                        entered by the user.  Output is shown
                                        in nonbold typewriter font.

                           In this manual, Cray Research and CRI are used
                           interchangeably to refer to Cray Research, Inc.,
                           and/or its products.  To avoid redundancy or
                           awkwardness, Cray Research may occasionally be
                           shortened to Cray (for example, Cray disk or Cray
                           job).





     SG-3074 5.0               Cray Research, Inc.                        xii


     Preface      CF77 Compiling System, Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     Reader comments
                           If you have comments about the technical accuracy,
                           content, or organization of this manual, please
                           tell us.  You can contact us in any of the
                           following ways:

                           * Call our Software Information Services
                             department at (612) 683-5729.

                           * Send us electronic mail from a UNICOS or UNIX
                             system, using the following UUCP address:

                                uunet!cray!publications


                           * Send us electronic mail from any system
                             connected to Internet, using one of the
                             following Internet addresses:

                                 pubs3074@timbuk.cray.com (comments specific
                                 to this manual)

                                 publications@timbuk.cray.com (general
                                 comments)


                           * Send a facsimile of your comments to the
                             attention of "Software Information Services" at
                             fax number (612) 683-5599.

                           * Use the postage-paid Reader's Comment form at
                             the back of this manual.

                           * Write to us at the following address:

                                Cray Research, Inc.
                                Software Information Services Department
                                655F Lone Oak Drive
                                Eagan, MN  55121


                           We value your comments and will respond to them
                           promptly.
















     xiii                      Cray Research, Inc.                SG-3074 5.0



                                                            Introduction  [1]
     ########################################################################







                           Parallel processing capabilities exist on several
                           levels and can be used in many different ways on
                           Cray Research, Inc. (CRI) computer systems.
                           Parallel processing capabilities are summarized in
                           Figure 1, page 2, and discussed in this section.
                           At the hardware level, the following introduce
                           parallel processing:

                           * Parallel instruction execution - Most CRI
                             computer systems issue one instruction per clock
                             period, although an instruction may take several
                             clock periods to complete execution.  Also,
                             instructions are not executed serially, but in
                             parallel; for example, an addition instruction
                             may be issued during the execution of a
                             multiplication instruction.  This is parallel
                             processing at the hardware instruction level.

                           * Vector registers and segmented vector functional
                             units - The vector registers and vector
                             functional units in CRI systems use instruction
                             pipelining.  Vector functional unit segmentation
                             and vector chaining (in CRAY Y-MP systems) act
                             as parallel processing aids.  Instruction
                             pipelining occurs when an instruction begins
                             before the previous instruction has completed;
                             this is accomplished by using segmented
                             hardware.  Segmentation is the process whereby
                             an operation is divided into a discrete number
                             of steps; segmented hardware allows these
                             discrete parts to be "pipelined" through it.
                             Vector chaining in CRAY Y-MP systems allows a
                             vector register reserved for results to become
                             the operand register of a succeeding
                             instruction.  These hardware parallel processing
                             features, combined with instruction parallelism,
                             allow a significant number of operations to be
                             done in parallel.

                           * I/O subsystems or foreground processors -
                             CRAY Y-MP/8 computer systems can have up to two
                             I/O subsystems, and CRAY-2 computer systems have
                             foreground processors.  These are logically
                             separate processors that perform input and
                             output functions for the operating system in use
                             on the computer system.  These operations occur
                             in parallel with a job or process running in the
                             main processor.



     SG-3074 5.0               Cray Research, Inc.                          1


     Introduction                   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                 Cray Research Parallel Processing Capabilities


       Parallel execution at the hardware level


       * Parallel instruction execution


       * Vector registers and segmented vector functional units


       * I/O subsystems or foreground processors






       Parallel execution at the software level


       * Concurrent multiprogramming


       * Multiprogramming at the job level


       * Multiprogramming at process level


       * Multitasking
     ________________________________________________________________________

            Figure 1.  Cray Research parallel processing capabilities

                           In conjunction with the hardware-level
                           parallelism, there are also software elements that
                           introduce parallel processing, as follows:

                           * Concurrent multiprogramming - When a CRI system
                             has only one main processor, that processor
                             switches between jobs or processes in the
                             system.  This switching can make it appear that
                             many things are happening simultaneously, even
                             though only one program is being worked on at
                             any point in time.  This is desirable, because
                             the processor can work on one job while another
                             job is waiting for an I/O operation to complete.
                             Almost all operating systems can do this kind of
                             multiprogramming.

                           * Multiprogramming at the job level - With more
                             than one main processor available, as in any
                             multiprocessor CRI system, the processors are
                             working concurrently on as many different


     2                         Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide                   Introduction
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                             programs at one time as there are processors.

                           * Multiprogramming at the process level - Each
                             program that a user submits to the operating
                             system or types in at a terminal is a process to
                             the operating system.  In UNICOS, a user can
                             create separate processes simply by placing a
                             job in the background (affixing an ampersand to
                             the end of the command line), or by using the
                             pipe capability to "pipe" the output of one
                             process to another process as input.
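
                            As a brief illustration (prog1 and prog2 are
                            placeholder program names, not commands described
                            in this manual), either of the following command
                            lines creates an additional process:

                                 prog1 &
                                 prog1 | prog2

                            The first command runs prog1 in the background;
                            the second pipes the output of prog1 into prog2
                            as input.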

                           It is also possible to have more than one
                           processor work on one program.  This is
                           generically called multitasking.  CRI has
                           exploited this multitasking capability in evolving
                           software products.




     Evolution of CRI
     parallel processing
     software
     1.1
                           The evolution of CRI parallel processing software
                           consists of three implementations:  macrotasking,
                           microtasking, and Autotasking.  This evolution is
                           summarized in Figure 2, page 4.  Macrotasking
                           required programmers to modify their codes to
                           exploit parallelism by doing extensive data
                           scoping and required the insertion of library
                           calls specific to CRI.  Microtasking expanded on
                           the strengths of macrotasking; less data scoping
                           was required and compiler directives replaced
                           library calls specific to CRI.  A big advantage of
                           microtasking is that it requires programmers to
                           change working programs much less than
                           macrotasking and works well in both batch and
                           dedicated environments.




















     SG-3074 5.0               Cray Research, Inc.                          3


     Introduction                   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________


                   Cray Research Multitasking Implementations



       * Macrotasking



       * Microtasking



       * Autotasking




       Autotasking, microtasking, and macrotasking can coexist in the same
       program, but not in the same subprogram unit.




       A previously autotasked routine can be processed by FPP to detect
       additional parallelism.  FPP will analyze only loop nests that do
       not contain CMIC$ directives.




       A previously microtasked program can be processed by FPP to detect
       additional parallelism in the nonmicrotasked routines.
     ________________________________________________________________________

              Figure 2.  Cray Research multitasking implementations

                           The most recent implementation, Autotasking,
                           combines the best aspects of microtasking with two
                           fundamental enhancements:

                           * Autotasking can be fully automatic; that is, it
                             does not require programmer intervention,
                             although programmers are free to interact with
                             the Autotasking system to enhance performance.

                           * Autotasking can exploit parallelism at the DO
                             loop level without extending to subroutine
                             boundaries, as microtasking is written to do.









     4                         Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide                   Introduction
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                             The CF77 compiling system, composed of products
                              FPP, FMP, and the CFT77 compiler, provides this
                             functionality.  The cf77, fpp, and fmp UNICOS
                             commands are the interface to the CF77 compiling
                             system.  The cf77 command serves as an
                             "overcompiler," which invokes the appropriate
                              phases of Autotasking (that is, FPP, FMP, the
                             CFT77 compiler, and the loader) to build an
                             executable program based on a set of defaults
                             and options.  The UNICOS fpp command invokes the
                             dependence analysis phase of Autotasking.

                             FMP replaces PREMULT, the previous microtasking
                             preprocessor, providing a bridge to the enhanced
                             libraries and supporting both Autotasking and
                             microtasking.  Programs that have been
                             microtasked do not have to be changed to use
                             Autotasking support.




     Using microtasking
     and macrotasking with
     Autotasking
     1.2
                           On all CRI systems, Autotasking, microtasking, and
                           macrotasking can coexist in the same program.
                           Entire subprogram units can be macrotasked,
                           microtasked, or Autotasked in a given program
                           without problems.  You can also combine
                           Autotasking and microtasking in the same
                           subprogram unit, with the following restrictions:

                           * Autotasking CMIC directives inhibit FPP action
                             on any loop nest in which they appear.  Also,
                             FPP does not try to optimize anything inside a
                             parallel region (that is, anything bounded by a
                              CMIC$ PARALLEL/CMIC$ END PARALLEL pair).

                           * For microtasking, the following CMIC$ directives
                             inhibit FPP action for the entire routine:
                             DOGLOBAL, MICRO, PROCESS, ALSOPROCESS, and
                             ENDPROCESS.

                           * Microtasking directives other than those
                             previously listed (including CONTINUE and
                             GUARD/ENDGUARD) are handled the same as for
                             Autotasking CMIC directives.

                           * FPP does not change DOGLOBAL directives to DOALL
                             directives.

                           * Codes can be autotasked from within macrotasked
                             areas.
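
                            The fragment below is a minimal sketch of the
                            first restriction (the directive spelling follows
                            the forms described in "Concepts and Directives,"
                            page 51, and should be treated as illustrative).
                            Because the first loop nest already carries a
                            CMIC$ directive, FPP leaves that nest alone; the
                            second, directive-free nest in the same routine
                            is still a candidate for FPP analysis:

                                 CMIC$ DO ALL SHARED(A, B, N) PRIVATE(I)
                                       DO 10 I = 1, N
                                          A(I) = B(I) + 1.0
                                    10 CONTINUE
                                 C     No CMIC$ directives; FPP may analyze this nest
                                       DO 20 I = 1, N
                                          B(I) = 2.0*B(I)
                                    20 CONTINUE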



     SG-3074 5.0               Cray Research, Inc.                          5


     Introduction                   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Goals of Autotasking
     1.3
                           Autotasking can be generally described as the
                           automatic distribution of loop iterations to
                           multiple processors.  To do this, Autotasking
                           takes a Fortran program as input and transforms it
                           so it can run on multiple processors concurrently,
                            and (assuming multiple processors are available)
                           it makes the program run faster (wall-clock time)
                           than it does without Autotasking.  Autotasking
                           builds on the experience gained from prior CRI
                           parallel processing products, macrotasking and
                           microtasking, and makes parallel processing easier
                           for CRI system users.

                           More specifically, the goals of Autotasking
                           include the following:

                           * Detect parallelism automatically in a program
                             and exploit the parallelism without user
                             intervention.

                           * Define a syntax by which parallelism is
                             expressed, allowing users to guide the
                             Autotasking system in code segments in which the
                             user can provide additional information to the
                             Autotasking system, or where the Autotasking
                             system cannot detect parallelism automatically.

                           * Define the scope of variables when transforming
                             a program to exploit parallelism.

                           * Provide a simple command line interface to
                             Autotasking.




     When to use
     Autotasking
     1.4
                           Autotasking, like microtasking and macrotasking,
                           can reduce the wall-clock run time of CPU-
                           intensive programs.  If a program is I/O-bound,
                           using Autotasking will probably make it more I/O-
                           bound.  Long-running programs, programs that use
                           so much memory that little else can run in the
                           machine, or programs that have hard deadlines for
                           completion are particularly good candidates for
                           Autotasking.  However, most running CRI computer
                           systems have available idle time that autotasked
                           programs can employ effectively.






     6                         Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide                   Introduction
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Generally, Autotasking works best on programs in
                           which most of the work is in nested DO loops that
                           do not contain CALL statements.  To run in
                           parallel, the iterations of a DO loop must use
                           independent elements of arrays that are being
                           changed.  This property is often hard to see for
                           complex loops.  Also, Autotasking is not limited
                           to running code in parallel on the outer loop; it
                           can make transformations that arrange code so that
                           it will run in parallel on loops other than the
                           outermost loop.  Programs that are heavily
                           vectorized tend to have high potential for
                           parallelism.
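
                            As a simple sketch of this property (not an
                            example taken from this manual), the first loop
                            below updates independent elements of A, so its
                            iterations could run in parallel; in the second
                            loop, each iteration uses the result of the
                            previous one, so its iterations cannot simply be
                            distributed across processors:

                                 C     Iterations are independent
                                       DO 10 I = 1, N
                                          A(I) = B(I) + C(I)
                                    10 CONTINUE
                                 C     Each iteration depends on the previous one
                                       DO 20 I = 2, N
                                          A(I) = A(I-1) + B(I)
                                    20 CONTINUE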




     Autotasking's effect
     on vectorization
     1.5
                           High performance for many codes is achieved when
                           the compiler detects code sequences that can be
                           vectorized, and it uses the vector registers to
                           run those sequences.  Generally, vectorized code
                           for a loop runs about 10 times faster than scalar
                           code for the same loop.  Because it costs a
                           program little to use vector registers, it is
                           almost always better to run in vector mode.

                           In determining how to optimize a program,
                           Autotasking favors vectorization over parallel
                           processing.  If dependence analysis allows it,
                           Autotasking vectorizes the innermost loop of a
                           nest of DO loops and runs the outermost loop on
                           multiple processors.  In some cases, Autotasking
                           will process a single vectorizable DO loop in
                           chunks, as if it were a nested pair of loops, with
                           a vector inner loop and a parallel outer loop.

                           Loops do not need to be vectorized for Autotasking
                           to detect that a nest of loops can be run in
                           parallel.  Some codes may have scalar inner loops
                           that can be run in parallel on an outer loop or a
                           set of adjacent scalar loops that are independent
                           and can be executed in parallel.
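
                            A minimal sketch of the nesting described above
                            (assuming, for illustration, that dependence
                            analysis finds no obstacles):  the inner loop on
                            I is a candidate for vectorization, and the
                            iterations of the outer loop on J are candidates
                            for distribution across processors.

                                       DO 20 J = 1, M
                                          DO 10 I = 1, N
                                             A(I,J) = B(I,J) + C(I,J)
                                    10    CONTINUE
                                    20 CONTINUE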

                           For more information about vectorization that can
                            be achieved with the compiling system, see the
                           CF77 Compiling System, Volume 3:  Vectorization
                           Guide, publication SG-3073.









     SG-3074 5.0               Cray Research, Inc.                          7


     Introduction                   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~





     Speedup expected from
     Autotasking
     1.6
                           Because there are so many contributing factors,
                           the degree of speedup is difficult to predict.
                           The first factor to consider is the program
                           itself.  How much parallelism does it contain?  If
                           you have previously microtasked or macrotasked the
                           program, you probably have a good idea how much
                           parallelism exists.  If the program has never used
                           parallel processing, some of the guidelines in
                           "When to use Autotasking," page 6, may give you an
                           idea of the parallelism you can expect.  From a
                           known or expected amount of parallelism, you can
                           calculate a speedup based on Amdahl's Law.  For
                           example, to get a speedup of 3 on an eight-
                            processor system, the code must be 80% parallel;
                            to get a speedup of 7 on an eight-processor
                            machine, the code must be 98% parallel.  Amdahl's
                           Law and its effect on autotasked programs are
                           explained in more detail in "Autotasking
                           Performance," page 201.  (The UNICOS amlaw(1)
                           command also provides a summary of Amdahl's Law.)
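
                            As a rough check of the second figure, using the
                            standard form of Amdahl's Law (the multitasking
                            form presented in section 9 may include
                            additional terms, so treat this as a sketch),
                            with parallel fraction p = 0.98 and n = 8
                            processors:

                                 speedup = 1 / ((1 - p) + p/n)
                                         = 1 / (0.02 + 0.98/8)
                                         = 1 / 0.1425
                                         = 7.0 (approximately)

                            With p = 0.80, the same formula gives a speedup
                            of about 3.3.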


                           Every code contains some amount of parallelism.
                           Autotasking detects some types of parallelism, but
                           not others.  Parallelism found in a code sequence
                           may not be of sufficient granularity to make the
                           program run faster; therefore, Autotasking may
                           choose to ignore the parallelism in that code
                           sequence.  Vectorization already exploits
                           parallelism in most codes.  Because of these
                           various factors, Autotasking may detect and
                           exploit only part or none of the parallelism that
                           exists in the code.

                           Briefly, Autotasking a large existing application
                           rarely results in a speedup linear with the number
                           of processors in the machine.  A fair amount of
                           user assistance will probably be required to
                           achieve that level of performance.  Smaller codes
                           or those that spend almost all of their execution
                           time in small kernels (matrix multiplication,
                           basic linear algebra, and so on) have a better
                           chance of achieving near-linear speedups without
                           user assistance.

                           Vectorization almost always results in codes
                           running faster.  Autotasking generally results in
                           speedups, but it has a higher risk for slowing
                           down some codes.  As with any optimization, you
                            should apply Autotasking carefully.


     8                         Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide                   Introduction
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




























































     SG-3074 5.0               Cray Research, Inc.                          9



                                                     CF77 User Interface  [2]
     ########################################################################







                           To understand the user interface to CF77, you must
                           first understand how the parts of CF77 fit
                           together.  This section explains the CF77
                           compiling system, then describes the user
                            interface and the environment variables that you
                            can use to customize your environment.




     CF77 compiling system
     2.1
                           Autotasking, which is a part of the CF77 compiling
                           system, is made up of three phases:  the
                           dependence analysis phase, FPP; the translation
                           phase, FMP; and the code generation phase, the
                           CFT77 compiler.  Figure 3, page 12, shows how the
                           phases fit together when invoked using valid
                           options of the cf77 command.  Although the phases
                           can be invoked independently, knowing how the
                           phases fit together may help you do a better job
                           of Autotasking your program.
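
                            As a brief illustration (the -Zp option spelling
                            shown here is illustrative; see the cf77 command
                            summary in Figure 7, page 21, for the
                            authoritative option forms), a single command
                            line such as the following runs all of the
                            phases, plus the loader, on a hypothetical source
                            file myprog.f:

                                 cf77 -Zp myprog.f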




























     SG-3074 5.0               Cray Research, Inc.                         11


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     See the printed manual for this figure; it doesn't display on-line.



                         Figure 3.  CF77 compiling system



     FPP
     2.1.1
                           The dependence analysis phase, FPP, parses the
                           original Fortran source program, looks for
                           parallelism within program units, and produces a
                           transformed Fortran source file as output.  Some
                           of the transformations FPP performs are summarized
                           in Figure 4, page 14, and are as follows:

                           * Adds Autotasking directives with private and
                             shared variable lists where parallel execution
                             is possible

                           * Adds CDIR@ IVDEP directives before loops that
                             can be vectorized

                           * Expands external procedures inline if requested
                             and if possible

                           * Restructures loop nests

                           * Replaces certain code patterns with calls to
                             highly optimized, multiprocessed library
                             routines

                           * Generates a run-time threshold test for
                             autotasked loops when the amount of work cannot
                             be computed at compile time

                           * Generates conditional vector or scalar code

                           * Converts IF loops into DO loops, where possible,
                             to enhance vectorization and parallelization

                           * Rewrites over-complicated subscript expressions
                             as linear functions of the loop index

                           * Splits partially vectorizable loops into
                             separate fully vectorizable and nonvectorizable
                             loops

                           * Reorders statements to remove data dependencies
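
                            As an illustration of the IF-loop conversion
                            listed above (a sketch, not an example from this
                            manual), FPP can rewrite a counting loop coded
                            with IF and GO TO statements, such as the
                            following, as an equivalent DO loop that later
                            analysis can then vectorize or autotask:

                                       I = 1
                                    10 IF (I .GT. N) GO TO 20
                                          A(I) = B(I) + C(I)
                                          I = I + 1
                                       GO TO 10
                                    20 CONTINUE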









     12                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           FPP recognizes when iterations of DO loops operate
                           on independent elements of arrays and it inserts
                           directives to exploit this independence.  Many
                           codes have parallelism of this type.  FPP also
                           recognizes adjacent blocks of code that can be
                           executed concurrently.  Some parallelism obvious
                           to users may be difficult for FPP to recognize.
                           For example, a CALL statement inside a DO loop
                           prevents FPP from transforming the code to run in
                           parallel, because the effects of the subroutine
                           being called are unknown by FPP.
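
                            For example (a sketch; SUB is a placeholder
                            subroutine name), FPP does not transform the
                            first loop below, because it cannot see what SUB
                            does with its arguments; writing the same work
                            inline, as in the second loop, leaves FPP free to
                            analyze and autotask it:

                                       DO 10 I = 1, N
                                          CALL SUB(A, B, I)
                                    10 CONTINUE

                                       DO 20 I = 1, N
                                          A(I) = B(I)**2
                                    20 CONTINUE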


















































     SG-3074 5.0               Cray Research, Inc.                         13


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                                       FPP

       The dependence analysis phase makes the following transformations:


       * Adds Autotasking directives where parallel execution is possible

       * Adds CDIR@ IVDEP directives before loops that can be vectorized

       * Expands external procedures inline if requested and if possible

       * Restructures loop nests

       * Replaces certain code patterns with calls to highly optimized,
         multiprocessed library routines

       * Generates a run-time threshold test for autotasked loops when the
         amount of work cannot be computed at compile time

       * Generates conditional vector or scalar code

       * Converts IF loops into DO loops, where possible

       * Rewrites over-complicated subscript expressions as linear
         functions of the loop index

       * Splits partially vectorizable loops into separate fully
         vectorizable and nonvectorizable loops

       * Reorders statements to remove data dependencies
     ________________________________________________________________________

                  Figure 4.  FPP - the dependence analysis phase

                           The input to the FPP phase is Fortran source code
                           and the output is (possibly restructured) Fortran
                           source code with Autotasking and compiler
                           directives added to express the parallelism.  This
                           output may be compiled directly by CFT77.  In this
                           case, only the vectorization enhancements are
                           obtained without multitasking enhancements.

                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                  Note

                           The CDIR@ IVDEP directives inserted by FPP are
                            reserved for use by FPP only.  If you want the
                           functionality of the CDIR@ IVDEP directive, use
                           the CDIR$ IVDEP form of the directive.
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
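
                            For example, if you know that the index array IX
                            never repeats a value within the loop below, you
                            can assert that with the user form of the
                            directive (a sketch; IX, A, and B are placeholder
                            names):

                                 CDIR$ IVDEP
                                       DO 10 I = 1, N
                                          A(IX(I)) = A(IX(I)) + B(I)
                                    10 CONTINUE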







     14                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     FMP
     2.1.2
                           The primary function of the translation phase,
                           FMP, is to transform a Fortran source file for
                           multitasking.  Autotasking directives are
                           translated into Autotasking intrinsic functions or
                           library calls, master and slave code, conditional
                           Autotasking threshold tests, and Autotasking
                           initialization code.

                           The output of the translator is Fortran source
                           code with calls to machine-dependent library
                           routines and compiler intrinsic functions embedded
                           in the source code to control parallel execution.
                           FMP maintains the proper scope of variables as it
                           makes these changes.  Special intrinsic statements
                           not used by normal Fortran programs are inserted
                           by the translator.  Directives that FMP recognizes
                           are described in "Concepts and Directives," page
                           51.



     CFT77
     2.1.3
                           The code generation phase is the CFT77 compiler,
                           which takes the output of the translator and
                           produces executable machine code.  The Autotasking
                           intrinsic functions are recognized by CFT77 and
                           cause inline code to be generated.

                           Each of these phases of the Autotasking system
                           contributes to the overall compilation time for
                           Autotasking.  Generally, compilation time is a
                           function of the number of lines of source code
                            processed in each phase.  The transformations
                            produced by FPP usually result in some increase
                            in the number of lines of source code; FMP,
                            however, may increase the size of the source
                            file substantially.  This can result in much
                            longer processing time by CFT77 than compiling
                            the original Fortran source file requires.  See
                            "Master and slave tasks," page xx.x 0, for the
                            reasons for the increase.

                           Successful completion of these first three phases
                           plus SEGLDR results in the creation of an absolute
                           binary file (a.out) that reflects the contents of
                           the source files and any referenced library
                           routines.  Figure 5, page 17, summarizes the role
                           of FMP and the CFT77 compiler in the CF77
                           compiling system.  See Figure 6, page 18, for an
                           overview of the entire CF77 compiling system
                           process.




     SG-3074 5.0               Cray Research, Inc.                         15


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Compiling system
     advantages
     2.1.4
                           The effort involved in these three phases may seem
                           redundant.  After all, each phase parses Fortran,
                            analyzes the scope of variables, and so on.  It
                            might seem simpler to perform these tasks once,
                           but CRI chose to implement Autotasking this way
                           for the following reasons.  First, separate phases
                           allow the most flexibility to change one phase
                           without affecting other phases.  Second, keeping
                           separate phases had the smallest impact on the
                           existing functions of the CFT77 compiler, which
                           has many functions besides parallel processing.

                           These three phases optionally create source output
                           files and listings that let you see what
                           Autotasking is doing to your program, and they let
                           you feed your insight back into Autotasking in the
                           form of directives.

                           You can look at the generated source output to see
                           how it differs from your original program, and why
                           the dependence analyzer may not have found
                           parallelism you think exists.  You may also want
                           to look at the FPP diagnostic listing (see the
                           following subsections for details on getting a
                           listing).  You can add directives of your own at
                           this point or continue with what the dependence
                           analyzer found.






























     16                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________


                                  FMP and CFT77




       FMP - The translation phase transforms a Fortran source file for
       multitasking.  Autotasking directives are translated into
       Autotasking intrinsic functions or library calls, master and slave
       code, conditional Autotasking threshold tests, and Autotasking
       initialization code.





       CFT77 - The code generation phase is the CFT77 compiler, which
       takes the output of the translator and produces executable machine
       code.  The Autotasking intrinsic functions are recognized by CFT77
       and cause inline code to be generated.
     ________________________________________________________________________

        Figure 5.  Summary of FMP and CFT77 roles in CF77 compiling system



































     SG-3074 5.0               Cray Research, Inc.                         17


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     See the printed manual for this figure; it doesn't display on-line.



                   Figure 6.  Phases of Autotasking using cf77

                           You can also invoke the dependence analysis phase
                           and the translation phase separately.  The output
                           of FPP provides a very detailed view of the way
                           Autotasking affects your code.  However, the
                           output of FMP has calls to library routines,
                           intrinsic statements that refer to machine-
                           specific hardware registers, and other code that
                           generally make the output difficult to read.

                           The following subsection explains the CF77 user
                           interface (the cf77 command) in more detail.  See
                           "Invoking FPP and FMP Directly," page 35 for
                           details of the commands for FPP and FMP.  The
                           UNICOS cf77 man page is included in "UNICOS
                           Command Pages," page xx.x 0.

                           You may also interact with the compiling system at
                           a lower level by using compiler directives in your
                           source code.  See "Concepts and Directives," page
                           51, for more information on the use of compiler
                           directives.




     UNICOS user interface
     2.2
                           Under UNICOS, you have several choices of how to
                           interact with the CF77 compiling system.  As
                           explained previously, you can use the cf77
                           command, which functions similarly to the cc
                           command found on UNICOS systems.  cf77 provides a
                           one-line command to analyze, translate, compile,
                           and load a Fortran program, letting you ignore the
                           details of invoking the compiler and loader.
                           Figure 6, page 18, provides an overview of the
                           CF77 system.

                           The cf77 command also lets you pass options to all
                           phases of the compiling system.  You can also
                           communicate directly with each of the phases of
                           the compiling system, as follows:

                           * You can direct the actions of the dependence
                             analyzer, FPP, by using the fpp command

                           * You can direct the actions of the translation
                             phase, FMP, by using the fmp command



     18                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           * You can direct the actions of the CFT77
                             compiler, the code generation phase, by using
                             the cft77 command

                           For a complete list of fpp and fmp options, see
                           "Invoking FPP and FMP Directly," page 35.  For
                           more information about using the cft77 command,
                           see CF77 Compiling System, Volume 1:  Fortran
                           Reference Manual, publication SR-3071.

                           The following subsections describe the use of the
                           cf77 command and some of the most commonly used
                           cf77 options.  A complete description of the  cf77
                           command is provided by the cf77 man page located
                           in "UNICOS Command Pages," page xx.x 0.



     Using the cf77
     command
     2.2.1
                           The cf77 command serves as an "overcompiler,"
                           which invokes the appropriate phases of
                            Autotasking (that is, FPP, FMP, the CFT77 compiler,
                           and the loader) to build an executable program
                           based on the defaults and options.  The cf77
                           command is summarized in Figure 7, page 21, and
                           discussed in the following subsections.

                            To use cf77 to compile a nonautotasked program,
                            creating an executable file named a.out, enter
                            the following:

                                cf77 abc.f


                           To autotask the same program, add the -Zp option
                           to the cf77 command line:

                                cf77 -Zp abc.f


                           This shows the simplest way to invoke Autotasking
                           on a program; it invokes the three compiling
                           system phases (FPP, FMP, and CFT77), loads object
                           files, and produces an executable binary file
                           (a.out).

                           A simplified "expanded" version of a cf77 command
                           is as follows.  The "expanded" version does not
                            include all options for the commands, but it shows
                            the order in which the phases are invoked.

                                cf77 -Zp file.f

                           Expanded version:




     SG-3074 5.0               Cray Research, Inc.                         19


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                fpp file.f > file.m
                                fmp file.m > file.j
                                cft77  -a stack  file.j
                                rm file.m
                                rm file.j
                                segldr file.j.o
                                rm file.j.o






















































     20                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________


                              cf77 Command Summary



       cf77 - A one-line command to analyze, translate, compile and load a
       Fortran program.


       FPP, FMP, CFT77, and SEGLDR can also be invoked with separate
       commands.


       A simplified "expanded" version of a cf77 command is as follows:

       cf77 -Zp file.f


       Expanded version:

       fpp file.f > file.m
       fmp file.m > file.j
       cft77  -a stack  file.j
       rm file.m
       rm file.j
       segldr file.j.o
       rm file.j.o

     ________________________________________________________________________

                         Figure 7.  cf77 command summary

                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                  Note

                           Autotasked codes must run in STACK allocation
                           mode.  If you are trying to autotask a code that
                           has been running in STATIC allocation mode, first
                           get the program running and debugged in STACK mode
                           before trying Autotasking.
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

















     SG-3074 5.0               Cray Research, Inc.                         21


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The cf77 command has options that let you exercise
                           more control over the phases of the Autotasking
                           system.  With the cf77 command-line options, you
                           have a high degree of flexibility; for example,
                           you can do the following:

                           * Bypass different phases of the compiling system

                           * Create only an intermediate source file

                           * Create only an object file

                           * Obtain command information

                           * Pass information directly to the CFT77 compiler,
                             FPP, FMP, and SEGLDR

                           Figure 8, page 23, summarizes these control
                           options; the following subsections explain them
                           (and other, miscellaneous options) and how to use
                           them.


     Compiling system
     control options (-Z)
     2.2.1.1
                            The cf77 command provides control over the CF77
                            compiling system through the following -Z
                            options:

                           Option   Description
                           ------   -----------

                            -Zp      Selects parallel dependence analysis and
                                     translation.  The full compiling system
                                    (all three phases) is invoked.
                                    Intermediate source files are deleted by
                                    the cf77 command.

                           -Zu      Invokes only FMP, the CFT77 compiler, and
                                    SEGLDR.  Intermediate source files are
                                    deleted by the cf77 command.  For
                                    example, if you had a program that
                                    contained microtasking directives, you
                                     might want to skip the dependence
                                     analysis phase (FPP) of the compiling
                                    system by using cf77 -Zu.













     22                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                          cf77 Command Control Options


       With the cf77 command-line options, you have a high degree of
       flexibility; for example, you can:


       * Bypass different phases of the compiling system



       * Create only an intermediate source file



       * Create only an object file



       * Obtain command information



       * Pass information directly to CFT77, FPP, FMP, and SEGLDR
     ________________________________________________________________________

                  Figure 8.  cf77 command control option summary

                                    Option   Description
                                    ------   -----------

                                     -Zv      Selects vector enhancements only
                                              and invokes FPP, the CFT77
                                             compiler, and SEGLDR.  FPP
                                             performs only vectorization
                                             enhancements for any input file
                                             of the form file.f; no
                                             Autotasking directives are
                                             inserted.  (Additional input
                                             files of the form file.o are
                                             passed to the loader.  If they
                                             contain Autotasking directives,
                                             they are recognized by the
                                             compiling system.)  Intermediate
                                             source files are deleted by the
                                             cf77 command.

                                    -Zc      Invokes the compiler and loader;
                                             this is the default control
                                             option.

                                    -Zm      Selects microtasking.  Invokes
                                             PREMULT as the translation phase
                                             (rather than FMP) and the
                                             compiler (CFT or CFT2) only.
                                             Intermediate source files are


     SG-3074 5.0               Cray Research, Inc.                         23


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                             deleted by the cf77 command.

                                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Note

                                             The premult(1) command, which
                                             invokes PREMULT, will no longer
                                             be supported in the CF77 6.0
                                             release.  The fmp(1) command
                                             provides equivalent
                                             functionality to premult; you
                                             are encouraged to switch from
                                             premult to fmp at your earliest
                                              convenience.
                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                            In addition, the -Zm option will continue to be
                            supported, but it will invoke FMP rather than
                            PREMULT.  Because the -Zm option will no longer
                            be supported in a future release, you are
                            encouraged to switch to the -Zu option at your
                            earliest convenience.

                            The -ZP, -ZU, and -ZV options provide the same
                            functionality as -Zp, -Zu, and -Zv, respectively.
                            However, the uppercase versions of these options
                            force the cf77 command to leave all intermediate
                            source files intact.
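
                            For example (using a hypothetical source file
                            xyz.f), a command line such as the following
                            autotasks the program but, because the uppercase
                            -ZP option is used, leaves the intermediate
                            source files produced by FPP and FMP intact for
                            inspection:

                                 cf77 -ZP xyz.f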



     Intermediate source
     file options
     2.2.1.2
                           If you want to create intermediate source files
                           and exit the compiling system without creating an
                           object file or an executable binary file, use the
                           following options:

                           Option   Description
                           ------   -----------

                           -M       Runs only FPP and produces an
                                    intermediate source file.  The
                                    intermediate source file is named using
                                    the original source file name suffixed
                                    with .m.  Option
                                    -Zv or -Zp must also be selected.

                           -J       Runs only FPP and FMP and produces an
                                    intermediate source file.  The
                                    intermediate source file is named using
                                    the original source file name suffixed
                                    with .j.  Option -Zu or -Zp must also be
                                    selected; otherwise, the compilation
                                    aborts and returns an error message.




     24                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The output from both of these options (file.m and
                           file.j) can be used as input to the cf77 command.
                           When the file is used as input, the compiling
                           system recognizes the file type and invokes the
                           correct processing phase.

                           Examples:

                                cf77  -Zp  file.m
                                cf77  -Zp  xxx.f  yyy.m  zzz.j  www.o
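
                            To create such intermediate files in the first
                            place, command lines like the following (shown
                            with a hypothetical source file abc.f) could be
                            used; the first produces abc.m, and the second
                            produces abc.j:

                                 cf77  -Zp  -M  abc.f
                                 cf77  -Zp  -J  abc.f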



     Object file creation
     2.2.1.3
                           The -c option is provided to create an object file
                           and then exit the compiling system before invoking
                           SEGLDR.

                           Option   Description
                           ------   -----------

                           -c       Forces object files to be produced.  The
                                    object file is named using the original
                                    source file name suffixed with .o.  If
                                    you specify -c, SEGLDR is not invoked.
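
                            For example, the following command line (with a
                            hypothetical source file abc.f) autotasks the
                            program and produces an object file abc.o without
                            invoking SEGLDR:

                                 cf77 -Zp -c abc.f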


     cf77 command
     information
     2.2.1.4
                           The cf77 command executes various commands.  You
                           can obtain a log of the commands issued by cf77 by
                           using one of the following verbose mode options:

                           Option   Description
                           ------   -----------

                           -v       Specifies verbose mode.  Writes output to
                                    stderr (normally your screen) indicating
                                    each phase of the compilation as it
                                    occurs, as well as all options and
                                    arguments being passed to each phase.

                           -T       Disables the entire compiling system but
                                    displays all options currently in effect
                                    for the individual commands corresponding
                                    to each system phase.  This information
                                    is the same as that given by option -v,
                                    but with no processing.  This output is
                                    written to stderr (normally your screen).
                                    In the following example, your input is
                                    shown in typewriter bold.



     SG-3074 5.0               Cray Research, Inc.                         25


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ------------------------------------------------------------------------
     cf77  -Zp  -T  file.f
     /bin/fpp: fpp file.f > /tmp/jtmp.000526a/ct2BAAa60788
     /bin/fmp: fmp -c /tmp/jtmp.000526a/ct4DAAa60788.s \
	/tmp/jtmp.000526a/ct2BAAa60788
     /tmp/jtmp.000526a/ct3CAAa60788.f
     /bin/cft77: /bin/cft77 -b file.o -a stack \
	/tmp/jtmp.000526a/ct3CAAa60788.f
     /bin/segldr: segldr file.o
     
     ------------------------------------------------------------------------


     Using the -W option
     2.2.1.5
                           The following -W options let you pass arguments to
                           individual components of the CF77 compiling
                           system.

                           Option       Description
                           ------       -----------

                           -Wf"optstring"
                                        Passes options contained in optstring
                                        to the CFT77 compiler

                           -Wd"optstring"
                                        Passes options contained in optstring
                                        to FPP

                           -Wu"optstring"
                                        Passes options contained in optstring
                                        to FMP

                           -Wl"optstring"
                                        Passes options contained in optstring
                                        to SEGLDR

                           You must separate multiple options with spaces.

                           For example, to obtain a load map from SEGLDR, use
                           the following command:

                                cf77 -Zp -Wl"-M,f" prog1.f


                           The following example passes options to FPP using
                           the cf77 -Wd mechanism and illustrates invoking
                           FPP analysis of inner loops for Autotasking:

                                cf77 -Zp -Wd"-ei" prog2.f


                           The following example disables CFT77 double
                           precision and defines integers to consist of 64
                           bits.

                                cf77 -Zp -Wf"-dp -i64" prog3.f



     26                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           You can specify more than one -W option on the
                           same cf77 command line.

                           The following example shows the command line if
                           you want to use atexpert:

                                cf77 -ZP -Wd"-ei -d0" -Wu"-p" prog4.f


                           The previous command line specifies use of the
                           entire compiling system, enables FPP inner-loop
                           analysis and disables threshold test generation,
                           and generates FMP output suitable for use with
                           atexpert.


     Additional cf77
     options
     2.2.1.6
                            The following list shows some additional cf77
                            options.  Although the functionality of each of
                            these options can be duplicated with one or more
                            -W options, doing so is not recommended.

                           Option    Description
                           ------    -----------

                            -l name   Identifies library files.  If name
                                      begins with . or /, it is assumed to be
                                      a path name, and SEGLDR uses it as is.
                                      Otherwise, SEGLDR checks first for file
                                      /lib/libname.a, then for file
                                      /usr/lib/libname.a, and uses the first
                                      one found.  See the -L option.

                           -L libdir Passes the directory name libdir to
                                     SEGLDR as the directory in which to find
                                     default libraries during the load phase.

                           -F        Enables the Flowtrace option for CFT77.

                           -g        Invokes CFT77 debugging options -ez and
                                     -o off, but inhibits Autotasking.  You
                                     can generate debug symbols when doing
                                     Autotasking by specifying cf77 options
                                     -Wf"-ez" or
                                     -Wf"-ez -ooff".  You can enable
                                     Autotasking debugging by using the -G
                                     cf77 option.

                           -G        Invokes FPP, FMP, and the CFT77
                                     debugging option -ez.  When this option
                                     is used with the
                                     -Zp option, Autotasking debugging is
                                     enabled.
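
                            For example, a hypothetical command line such as
                            the following autotasks prog.f with Autotasking
                            debugging enabled and directs SEGLDR to look for
                            default libraries in directory /usr/local/lib:

                                 cf77 -Zp -G -L /usr/local/lib prog.f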




     SG-3074 5.0               Cray Research, Inc.                         27


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           For a complete description of the options for
                           cf77, see the cf77 manual page in "UNICOS Command
                           Pages," page xx.x 0.




     UNICOS environment
     variables
     2.3
                           Environment variables are predefined shell
                           variables that determine some of the
                           characteristics of your shell.  These environment
                           variables are taken from the execution
                           environment, and they can affect your parallel
                           processing environment.  Environment variables
                           affecting parallel processing are summarized in
                           Figure 9, page 29, and discussed in the following
                           subsections.






































     28                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                              Environment Variables

       * All systems (general compiling system)

         - NCPUS

         - CAL

         - CFT77

         - FPP

         - FMP

         - SEGLDR

       * CX/CEA systems (multitasking)

         - MP_DEDICATED

         - MP_MAXCPU

         - MP_DBACTIVE

         - MP_DBRELEAS

         - MP_HOLDTIME

         - MP_SAMPLE

         - MP_PRIORITY

         - MP_SLVPRI

         - MP_STACKSZW

         - MP_STACKINW

         - MP_SLVSSZ

         - MP_SLVSIN

       * CRAY-2 systems (multitasking)

         - MICRO_NICE

         - MICRO_TIMEOUT
     ________________________________________________________________________

             Figure 9.  Summary of multitasking environment variables







     SG-3074 5.0               Cray Research, Inc.                         29


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Environment variables
     for all systems
     2.3.1
                           The first variable available on all systems is
                           NCPUS; it specifies the number of tasks available
                           to an autotasked program.  When debugging
                           autotasked code, set NCPUS=1, which allows only
                           the master task to execute (that is, no slave
                           processes are created).  See "Debugging Autotasked
                           Programs," page xx.x 0, for more specific
                           information.

                           Generally, the default value for NCPUS is the
                           number of physical processors in the system.  If
                           you specify NCPUS to be greater than the number of
                           physical processors available, unnecessary
                           overhead will be incurred.
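
                            For example, to run only the master task while
                            debugging an autotasked program (a hypothetical
                            executable a.out, using the standard shell), you
                            might enter the following:

                                 NCPUS=1
                                 export NCPUS
                                 a.out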

                            Other environment variables that you can use to
                            specify particular versions of software are as
                            follows:

                            Variable   Description
                            --------   -----------

                            CAL        File name of as(1), the CAL assembler
                                       (default is /bin/as).

                            CFT77      File name of the CFT77 compiler
                                       (default is /bin/cft77).

                            FPP        File name of the fpp(1) dependence
                                       analyzer (default is /bin/fpp).

                            FMP        File name of the fmp(1) translator
                                       (default is /bin/fmp).

                            PREMULT    File name of premult(1), the
                                       microtasking preprocessor (default is
                                       /bin/premult).

                            SEGLDR     File name of the loader (default is
                                       /bin/segldr).

                           Setting any of these variables lets you use
                           versions of the software that are not the default;
                           for example, you could test a new release level of
                           a compiler before it becomes the default.
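
                            For example, to test a hypothetical new CFT77
                            release installed as /usr/new/cft77 (an
                            illustrative path name only), you might enter the
                            following before invoking cf77:

                                 CFT77=/usr/new/cft77
                                 export CFT77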



     Additional
     environment variables
     for CX/CEA systems
     2.3.2
                           Additional environment variables available on
                           CX/CEA systems are described in the following
                           list.  Many of these environment variables control
                           TSKTUNE tuning keywords, which let you tune the
                           system for parallel processing without rebuilding
                           libraries or other system software.  See section 5
                           in the CRAY Y-MP, CRAY X-MP EA, and CRAY X-MP
                           Multitasking Programmer's Manual, publication
                           SR-0222, for more information about TSKTUNE.







     30                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Variable   Description
                           --------   -----------

                           MP_DEDICATED
                                      If set to 1, indicates you are running
                                      a multiprocessed application in a
                                       dedicated machine environment.  Slave
                                       processors wait in user space instead
                                       of returning to the operating system.
                                       If MP_DEDICATED is set to 0 or is not
                                       set at all, slave processors return
                                       to the operating system after waiting
                                       in user space for 50,000 clock
                                       periods.
                                       When MP_DEDICATED is set to anything
                                       other than 1 or 0, the behavior is
                                       undefined.  Setting MP_DEDICATED to 1
                                       in a nondedicated machine environment
                                       degrades your program's throughput and
                                       overall system throughput.

                           MP_MAXCPU  Maximum number of CPUs allowed for
                                      macrotasking; the default is 16.

                           MP_DBACTIVE
                                      Number of additional user tasks that
                                      can be readied for execution before an
                                      additional logical CPU is acquired;
                                      this is called the activation deadband
                                      value.  The value of MP_DBACTIVE can
                                      range from 0 to the largest integer
                                      value (the number of logical CPUs is
                                      equal to the number of user tasks
                                      limited by MAXCPUS).  The initial value
                                      is 0.

                           MP_DBRELEAS
                                      Number of logical CPUs retained by the
                                      job if there are more CPUs than tasks;
                                      this is called the release deadband
                                      value.  Any CPUs in excess of this
                                      number are released to the system.  The
                                      initial value is set to 1 less than the
                                      number of physical CPUs available on
                                      the system or to 1, whichever is
                                      greater.  Setting MP_DBRELEAS to less
                                      than this value may cause an excessive
                                       number of CPUs to be released and
                                      acquired, and a correspondingly long
                                      list of CPUs in the log file.  The
                                      value of MP_DBRELEAS can range from 0
                                      (representing immediate return) to the
                                      value of MAXCPUS.

                           MP_HOLDTIME
                                      Number of clock periods (CPs) to hold a
                                      processor before giving up the CPU when
                                      no parallel work is available.  The
                                      default is 50,000 CPs.


     SG-3074 5.0               Cray Research, Inc.                         31


     CF77 User Interface            CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Variable   Description
                           ________   ___________
                           MP_SAMPLE  Sample rate at which the ready mask is
                                      read when in the hold loop; the default
                                      is 150 CPs.  This means that a process
                                      checks for a ready task every 150 CPs
                                      while it is waiting for parallel work.

                           MP_PRIORITY
                                      Scheduling priority for macrotasks.
                                      Legal values are 0 to 63, 0 being the
                                      lowest priority.  The default is 31.
                                      When the library schedules queued
                                      tasks, higher priority tasks are
                                      scheduled first.

                           MP_SLVPRI  Scheduling priority for slave
                                      microtasks or autotasks.  Legal values
                                      are 0 to 63, 0 being the lowest
                                      priority.  The default is 0.

                           MP_STACKSZW
                                      Initial stack size for macrotasks.

                           MP_STACKINW
                                      Stack increment for macrotasks.

                           MP_SLVSSZ  Initial stack size for microtasking or
                                      Autotasking slaves.

                           MP_SLVSIN  Stack increment for microtasking or
                                      Autotasking slaves.
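
                            For example, for a run on a dedicated CX/CEA
                            system, you might keep slave processors waiting
                            in user space by setting the following (an
                            illustrative setting; see the MP_DEDICATED
                            description above):

                                 MP_DEDICATED=1
                                 export MP_DEDICATED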



     Additional
     environment variables
     for CRAY-2 systems
     2.3.3
                           Additional environment variables exist for use on
                           CRAY-2 systems, as follows:

                           Variable   Description
                           --------   -----------

                           MICRO_NICE Integer value used by the nice system
                                      call when the library starts the
                                      Autotasking slaves.  The default value
                                      is 4.  If you want to run the slaves at
                                      normal priority, MICRO_NICE should be
                                      set to 0.








     32                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide            CF77 User Interface
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Variable   Description
                           ________   ___________
                           MICRO_TIMEOUT
                                      Integer value that is the number of
                                      milliseconds that Autotasking slave
                                      tasks wait, looking for work, before
                                      they give up the CPU.  The default
                                      value is 4 ms.  If you are making
                                      dedicated runs, set MICRO_TIMEOUT to a
                                      large value (such as 10,000) to ensure
                                      that the CPUs are always connected to
                                      the job.
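
                            For example, for a dedicated run on a CRAY-2
                            system, you might run the slaves at normal
                            priority and keep them connected to the job with
                            illustrative settings such as the following:

                                 MICRO_NICE=0
                                 MICRO_TIMEOUT=10000
                                 export MICRO_NICE MICRO_TIMEOUT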















































     SG-3074 5.0               Cray Research, Inc.                         33



                                           Invoking FPP and FMP Directly  [3]
     ########################################################################







                           FPP, the dependence analysis phase of the CF77
                           compiling system, parses the original Fortran
                           source program, looks for parallelism within
                           program units, and produces a transformed Fortran
                           source file as output.  The primary function of
                           FMP, the translation phase, is to transform a
                           Fortran source file for multitasking.

                           Both of these phases can be invoked by using the
                           cf77 command, as discussed in section 2.  Both
                           phases can also be invoked directly.  This section
                           describes the commands to invoke FPP and FMP
                           directly.




     UNICOS fpp command
     3.1
                           The UNICOS fpp command invokes the dependence
                           analysis phase of Autotasking.  You can invoke FPP
                           either as a part of the CF77 compiling system, as
                           described in section 2, or separately, by using
                           the fpp command.

                           When fpp is executed as a separate command, it has
                           the following syntax:

     -----------------------------------------------------
     fpp [-C routine1,routine2,...] [-d optoff]
       [-D directive[:sub1,sub2,...]] [-e opton] [-F file]
       [-H directory] [-I routine1,routine2,...]
       [-l listingfile] [-M lines] [-N80] [-o outputfile]
       [-p liston] [-P pagelength] [-q listoff]
       [-Q tempspace] [-r formaton] [-n formatoff]
       [-S file1,file2,...] [-T threshold] [-V] file.f
     -----------------------------------------------------












     SG-3074 5.0               Cray Research, Inc.                         35


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Options to fpp are as follows:

                           Option     Description
                           ------     -----------

                           -C routine1,routine2,...
                                      Lists names of concurrently callable
                                      routines.

                           -d optoff
                           -e opton
                                      Enables (-e) or disables (-d)
                                      optimization option switches specified
                                      in optoff and opton.  The optimization
                                      switches are described in "fpp
                                      optimization switches," page 40, and
                                      also listed in Table 1, page 42.

                           -D directive[:sub1,sub2,...]
                                      Specifies a directive to be applied to
                                      certain routines, or to the whole input
                                      file if no routines are listed.

                           -F file    Specifies a file containing additional
                                      command-line options.  This option is
                                      useful when you have many command-line
                                      options for a program.  You cannot use
                                       tab characters in a -F file, nor can
                                       you use the -F option to specify
                                      the input file for fpp (it must appear
                                      on the command line).  A typical use
                                      for a -F file would be to specify -D
                                      options.  An example of a -F file
                                      follows the description of fpp options.

                           -H directory
                                      Specifies a directory in which to
                                      search for INCLUDE files.  This option
                                      can be repeated up to 10 times, but
                                      only one directory name is allowed per
                                      -H specification.  INCLUDE files are
                                      searched for first in the directory of
                                      the input source file, then in the
                                      directories named on -H options, in the
                                      order in which they were specified.

                           -I routine1,routine2,...
                                      Lists routines to be expanded inline.
                                      This option specifies only the names of
                                      routines to be expanded inline, and not
                                      their source location.  Source location
                                      can be specified by using the -S or -e8
                                      options, the SEARCH directive, or the
                                      default search method.  See "Where to
                                      find code for inline expansion, page
                                      180, for more information.  This option
                                      is different from the cf77 and cft77 -I
                                      options.
                                      You cannot specify file.f in your
                                      current working directory as an


     36                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                      argument to the -I option.

                           -l listingfile
                                      Directs the listing to file
                                      listingfile.  A listing is not produced
                                      unless this option is specified.

                           -M lines   Sets the maximum number of lines of
                                      code allowed for automatic inline
                                      expansion of any one routine; default
                                      is 50.

                           -N80       Specifies an 80-column Fortran input
                                      file rather than the default 72-column
                                      Fortran input file.

                           -o outputfile
                                      Directs the translated source to file
                                      outputfile rather than standard output.
                                      The output file is ready for processing
                                      by fmp(1) or cft77(1), the other
                                      components of the CF77 compiling
                                      system.

                           -p liston
                           -q listoff
                                      Enables (-p) or disables (-q) listing
                                      option switches specified in liston and
                                      listoff.  The listing switches are
                                      described in "fpp listing switches,"
                                      page 45, and listed in Table 2, page
                                      46.

                           -P pagelength
                                      Specifies the number of lines per page,
                                      for page-formatted listings.
                                      pagelength must be 9 or greater; the
                                      default is 66 lines.

                           -Q tempspace
                                      Specifies the size, in words, of space
                                      to be used for FPP-generated temporary
                                      arrays in any one program unit.  The
                                      default is 8191 words.  For more
                                      information, see "FPP Source Output,"
                                      page 191.

                           -r formaton
                           -n formatoff
                                      Enables (-r) or disables (-n)
                                      reformatting (TIDY) option switches
                                      specified in formaton and formatoff.
                                      The reformatting switches are described
                                      in "FPP TIDY Subprocessor," page xx.x
                                      0.








     SG-3074 5.0               Cray Research, Inc.                         37


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Option     Description
                           ______     ___________
                           -S file1,file2,...
                                      Specifies file names or complete path
                                      names of files in which routines to be
                                      expanded inline are located.  For
                                      example, any of the following
                                      specifications is acceptable:

                                        -S file.f
                                        -S /usr/fred/file.f
                                        -S /usr/fred/abc.f,xyz.f

                                      This option specifies only the source
                                      location of routines to be expanded
                                      inline.  To enable inlining, you must
                                      also specify the -I, -e6, or
                                      -e7 options, or use the AUTOEXPAND,
                                      EXPAND, or NEXPAND directives.

                           -T threshold
                                      Specifies the maximum Autotasking
                                      threshold value for comparison to the
                                      loop iteration count.  See "Threshold
                                      tests," page 149, for more information
                                      on default values and threshold
                                      testing.

                           -V         Displays current FPP version
                                      information to standard error (stderr)
                                      during execution.  If you specify the
                                      -V option on the fpp command line and
                                      do not specify an input file (for
                                      example, fpp -V), only version
                                      information is displayed.

                           By default, the translated Fortran source output
                           file is written to the standard output file
                           (usually the terminal); a listing file is not
                           produced.  If you invoke fpp without arguments, it
                           prints a short usage summary.



     fpp command examples
     3.1.1
                           This subsection contains example fpp command lines
                           and explanations for those command lines.

                           Example 1:

                           To run the Fortran source file crunch.f through
                           fpp, enter the following:

                           -------------------------------------------------
                           $ fpp crunch.f > crunch.m
                           -------------------------------------------------



     38                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The optimized output is sent to crunch.m.

                           Example 2:

                            To run crunch.f through fpp with inline expansion
                            enabled, saving the output in crunch.m; to run
                            that output (crunch.m) through fmp, saving the
                            output in file crunch.j; and then to compile the
                            translated code, enter the following:

                           -------------------------------------------------
                           $ fpp -e 78 crunch.f > crunch.m
                            $ fmp crunch.m > crunch.j
                           $ cft77 -a stack crunch.j
                           -------------------------------------------------

                           The output of the last command is crunch.o, which
                           can then be loaded.

                            Example 3:

                           The following is an example of a typical -F file
                           (fppopts):

                                -Dnoinner:sub1
                                -Dnexpand(sub2):sub1#/usr/psr
                                -Ffile2.com           (Nested command file)
                                -D relation(n.gt.32):sub2
                                -Dswitch,tdyoff=p,indal=5,renumb=1000:100


                           To run the source file prog.f through fpp using
                           the options in file fppopts and producing an fpp
                           listing (prog.l), enter the following:

                           -------------------------------------------------
                           $ fpp -F fppopts -l prog.l prog.f > prog.m
                           -------------------------------------------------

                           The output is sent to file prog.m.

                           Example 4:

                            A source file references INCLUDE files contained
                            in three directories:  the directory of the source
                            file, /usr/joe/inc, and /usr/jane/misc.  The
                            following fpp command line specifies the
                            directories that contain the INCLUDE files not
                            located in the same directory as the source file:

     ------------------------------------------------------------------------
     $ fpp -H /usr/joe/inc -H /usr/jane/misc -o myprog.m myprog.f
     ------------------------------------------------------------------------

                           File myprog.f is the main input file; file
                           myprog.m will be the output file.





     SG-3074 5.0               Cray Research, Inc.                         39


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Example 5:

                           The -S option can be used with either automatic or
                           explicit inlining.  File routines.f contains
                           subroutines that you want considered for automatic
                           inlining.  To enable automatic inlining, and to
                           specify the location of the files to be inlined,
                           enter the following:

                           -------------------------------------------------
                           $ fpp -e7 -S routines.f program.f > program.m
                           -------------------------------------------------

                           File program.f is the main input source file.
                            Specifying the -e7 option enables automatic
                            inlining, the -S option tells FPP where to look
                            for the files to be inlined, and the output goes
                            to file program.m.
                           If you have more than one file containing routines
                           to be inlined, they can be specified with the -S
                           option, separated by commas, as shown in the
                           following example command line:

     ------------------------------------------------------------------------
     $ fpp -S r1.f,r2.f -I solvx,solvy,solvj -o comp.m comp.f
     ------------------------------------------------------------------------

                           In this case, you invoked explicit inlining (-I)
                           for any calls to routines solvx, solvy, and solvj
                           that occur in input file comp.f.  The -S option
                           tells FPP to look for the source for solvx, solvy,
                           and solvj in files r1.f and r2.f.



     fpp optimization
     switches
     3.1.2
                            Switches, also called option-arguments, let you
                            control the optimizations that FPP performs.
                            These switches are called optimization switches.

                           You can pass optimization switches in any of the
                           following ways:

                           * Using the -d (disable) and -e (enable) options
                             of fpp

                           * Using the -Wd"-d" and -Wd"-e" options of cf77

                           * Using the SWITCH directive

                           For more information about the use of the SWITCH
                           directive, see "Concepts and Directives," page 51.
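
                            For example, each of the following command lines
                            disables the a switch (associative
                            transformations) for a hypothetical file prog.f;
                            the exact quoting shown for the cf77 form is
                            illustrative:

                            -------------------------------------------------
                            $ fpp -d a prog.f > prog.m
                            $ cf77 -Wd"-d a" prog.f
                            -------------------------------------------------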





     40                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Table 1, page 42, shows the optimization switches
                           that affect the transformation of the input
                           program.  For example, specifying fpp option -d el
                           means that EQUIVALENCE statements are not examined
                           for data dependency analysis, and IF loops are not
                           converted to DO loops.

                           Some switches duplicate or overlap the functions
                           of directives.  For example, the -d d switch is
                           equivalent to the NODEPCHK directive with file
                           scope (CFPP$ NODEPCHK F).
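
                            Thus the following command line and a CFPP$
                            NODEPCHK F directive placed at the top of the
                            source file (the file names shown are
                            illustrative) have the same effect:

                            -------------------------------------------------
                            $ fpp -d d prog.f > prog.m
                            -------------------------------------------------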

                           Switches that correspond to directives (a, c, d,
                           e, i, r, u, and v) may be toggled more than once
                           within a routine (using the SWITCH directive).
                           Switches that do not correspond to directives (b,
                           f, h, j, k, l, m, o, p, s, t, y, 0, 1, 4, 5, 6,
                           and 7) can have only one valid setting for any one
                           routine; if they are set more than once within a
                           routine, only the last setting is used.

                           The q, x, and 8 switches are valid only as
                           command-line option-arguments.






































     SG-3074 5.0               Cray Research, Inc.                         41


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

        Table 1.  Optimization switches enabled and disabled by -e and -d

     ________________________________________________________________________

     Switch        Description                                       Default
     ________________________________________________________________________

        a           Allows associative transformations.  -d a         ON
                    is equivalent to the NOASSOC directive
                    with file scope.

        b           Generates linear recursion library calls.        OFF

        c           Autotasks loops; all loops (inner and             ON
                    outer) that have enough work to justify
                    concurrent execution are analyzed for
                    Autotasking.  -d c is equivalent to the
                    NOCONCUR directive with file scope.

       d           Does not ignore potential data                    ON
                   dependencies.  -d d is equivalent to the
                   NODEPCHK directive with file scope.

       e           Examines EQUIVALENCE statements for data          ON
                   dependency.  -d e is equivalent to the
                   NOEQVCHK directive with file scope.

       f           Generates BTRNSFRM and ETRNSFRM markers           OFF
                   for debugging.

       h           Allows parallel case optimization.                ON
                   Ignored if the c switch (autotask) is off.

       i           Analyzes inner loops with variable                OFF
                   iteration counts at compile time to
                   determine whether they are candidates for
                   Autotasking.  By default, outer loops and
                   inner loops that obviously have enough
                   work are autotasked.  For inner loops with
                   high iteration counts and many statements,
                   enabling this option may improve
                   performance.  -e i is equivalent to the
                   INNER directive with file scope.  If the c
                   switch (autotask) is off, the i switch is
                   ignored.

       j           Translates nested loop idioms, such as            ON
                   matrix multiplication, matrix-vector
                   multiplication, and rank one update, to
                   library calls.

       k           Treats D in column 1 as a comment                 ON
                   character.  If this switch is off, a D in
                   column 1 is treated as a blank.  This
                   switch provides compatibility with a
                   debugging feature of some compilers.

       l           Transforms IF loops to DO loops.                  ON

       m           Generates alternative code for potential          ON
                   dependencies.  If this switch is off,
                   loops containing potential data
                   dependencies will not be optimized.
     ________________________________________________________________________



     42                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Table 1.  Optimization switches enabled and disabled by -e and -d
               (continued)


     ________________________________________________________________________
     Switch        Description                                       Default
     ________________________________________________________________________

       o           Specifies a minimum DO loop trip count of         OFF
                   1.  Provides compatibility with ANSI '66
                   Fortran compilers.

       p           Collapses loop nests into single loops            ON
                   when possible.

       q           Takes error exit if syntax or fatal errors        OFF
                   are found.  If this switch is on and fpp
                   detects a syntax or fatal error, it
                   returns an error code of 2.  If fpp was
                   invoked by cf77, cf77 would cease
                   processing at this point.

       r           Splits user subroutines and functions out         OFF
                    of loop nests where possible.  This
                   sometimes results in additional loops
                   being autotasked.  -e r is equivalent to
                   the SPLIT directive with file scope.

       s           Permits loop splitting to isolate                 ON
                   recursion, which permits partial
                   vectorization of loops.

       t           Specifies use of aggressive loop exchange         OFF
                   criteria.  Weights desirability of stride
                   one vectors and increased vector length
                   more heavily compared to retaining
                   original loop nest ordering.

       u           Generates final values for transformed            ON
                   scalars when appropriate.  -d u is
                   equivalent to the NOLSTVAL directive with
                   file scope.

       v           Enhances CFT77 vectorization.  -d v is            ON
                   equivalent to the NOVECTOR directive with
                   file scope.  If this switch is off, the b,
                   m, p, r, and s switches are not
                   meaningful.

       x           Creates optimized source file.  This              ON
                   switch may be turned off if only the
                   diagnostic listing is desired.  Turning
                   this switch off may speed compile time and
                   reduce disk space used.  The setting of
                   this switch does not affect the listing of
                   the transformed source in the listing
                   file.  This switch is valid only as a
                   command-line argument; it may not be
                   specified with the SWITCH directive.

       y           Reformats only restructured loops.  -d y          ON
                   causes the entire program unit to be
                   reformatted with the TIDY subprocessor.
     ________________________________________________________________________


     SG-3074 5.0               Cray Research, Inc.                         43


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Table 1.  Optimization switches enabled and disabled by -e and -d
               (continued)

     ________________________________________________________________________
     Switch        Description                                       Default
     ________________________________________________________________________

       0           Generates Autotasking threshold tests for         ON
                   comparison to loop iteration counts.

       1           Converts array syntax to DO loops.                ON

       4           Asserts that first values of private              OFF
                   arrays are not needed.  In some cases, it
                   allows more loops to be autotasked.

        5           Generates output for CRAY-2 systems.  If          ON
                    this option is enabled, strides are               (CRAY-2
                    heavily weighted as a negative factor in          systems)
                    choosing vector loops and CRAY-2 systems          OFF
                    threshold tests are generated.  See               (all
                    "Threshold tests," page 149, for more             other
                    information on threshold testing.                 systems)

       6           Automatically expands called routines             OFF
                   inline (always safe).  The subroutines and
                   functions must meet certain criteria.
                   Always produces safe code.
                   This option must be used in conjunction
                   with the -S or -e8 options to enable
                   inlining; used alone, it does not provide
                   information about the location of routines
                   to be inlined.  Information about routine
                   expansion or about why routines were not
                   expanded is sent to the file specified
                   with the -l option.

       7           Automatically expands called routines             OFF
                   inline (rarely unsafe).  The subroutines
                   and functions must meet certain criteria.
                   Usually exploits many more opportunities
                   for inline expansion than the 6 switch,
                   but in rare cases, creates incorrect code
                   because of adjustable array dimensioning
                   problems.  Using the -e 7 option is
                   equivalent to the AUTOEXPAND directive
                   with file scope.
                   This option must be used in conjunction
                   with the -S or -e8 options to enable
                   inlining; used alone, it does not provide
                   information about the location of routines
                   to be inlined.  Information about routine
                   expansion or about why routines were not
                   expanded is sent to the file specified
                   with the -l option.



     44                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

       8           Searches input file for expandable                OFF
                   routines.  This switch is valid only as a
                   command-line option; it may not be
                   specified with the SWITCH directive.
                   The only purpose of this switch is to set
                   the search path used by FPP when searching
                   for routines to be expanded inline.  To
                   enable inlining, you must also specify the -I,
                   -e 6, or -e 7 options; or you must insert
                   AUTOEXPAND, EXPAND, or NEXPAND directives
                   in your source code.
     ________________________________________________________________________




     fpp listing switches
     3.1.3
                           Switches (or option-arguments) also let you
                           control the contents of the listing file for FPP.
                           These switches are called listing switches.

                           You can pass listing switches in any of the
                           following ways:

                           * Using the -p (enable) and -q (disable) options
                             of fpp

                           * Using the -Wd"-p" and -Wd"-q" options of cf77

                           * Using the SWITCH directive

                           For more information about the use of the SWITCH
                           directive, see "Concepts and Directives," page 51.

                           Table 2, page 46, shows the switches that control
                           the format of the listing file.  For example, if
                           you want to get a 132-column printer listing
                            without warning messages or an event summary,
                            specify the -q twe option on the fpp command line.
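
                            For example, the following command line (the
                            file names are illustrative) writes such a
                            listing to file prog.l:

                            -------------------------------------------------
                            $ fpp -q twe -l prog.l prog.f > prog.m
                            -------------------------------------------------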

                           The TIDY subprocessor is another feature of FPP
                           that improves the readability of the output code,
                           either by using default standards, or according to
                           user-specified parameters.  By default, TIDY is
                           applied only to loops that require restructuring
                           in order to be vectorized and to be run in
                           parallel.  To apply TIDY to the entire program
                           unit, use the -dy option of fpp or the -Wd"-dy"
                           option of cf77.  See "FPP TIDY Subprocessor," page
                           xx.x 0, for more information on TIDY switches.
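
                            For example, either of the following command
                            lines (the file names are illustrative) applies
                            TIDY to the entire program unit:

                            -------------------------------------------------
                            $ fpp -dy prog.f > prog.m
                            $ cf77 -Wd"-dy" prog.f
                            -------------------------------------------------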






     SG-3074 5.0               Cray Research, Inc.                         45


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

           Table 2.  Listing switches enabled and disabled by -p and -q

     ________________________________________________________________________
     Switch        Description                                       Default
     ________________________________________________________________________

       b           Lists corresponding input line numbers in         ON
                   columns 73 through 80 of output listing of
                   the transformed source.  This switch is
                   valid only if the n switch is on.  This
                   listing feature is useful in relating
                   transformed source lines to original
                   source lines.

       c           Lists data dependency conflict messages.          ON

       d           Lists declarations added by FPP.  This            OFF
                   switch is valid only if the n switch is
                   on.

       e           Lists event summary at end of routine.            ON

       f           Lists fatal error messages.                       ON

       g           Lists translation diagnostics.                    ON

       h           Lists input source lines.                         ON

       i           Lists lines that come from INCLUDE files.         ON
                   When this switch is on, source lines
                   obtained from INCLUDE files are listed.
                   They are identified by a dash following
                   the line number.  This switch is valid
                   only if the h switch (list source lines) is on.

       l           Produces a listing.  -q l is equivalent to        ON
                   the NOLIST directive.

       n           Lists translated code.                            ON

       p           Lists loop summary at end of routine.             ON

       s           Lists only summary information.  If the s         OFF
                   switch is used, the c, d, e, f, g, h, i,
                   l, n, p, t, w, and y listing switches are
                   ignored.

       t           Formats FPP listing for a terminal (format        ON
                   output for 80 columns).  -q t results in a
                   wide-format listing file, with printer
                   control, pagination, and page headers,
                   suitable for a 132-column line printer.

       u           Shows extent and disposition of loops in          ON
                   source code.

       w           Lists warning messages.                           ON

       y           Lists syntax errors.                              ON
     ________________________________________________________________________





     46                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     UNICOS fmp command
     3.2
                           The UNICOS fmp command invokes the translation
                           phase of Autotasking to translate Autotasking and
                           microtasking directives.  You can invoke FMP
                           either as a part of the CF77 compiling system, as
                           described in "CF77 User Interface," page 11, or
                           separately, by using the fmp command.

                           When fmp is executed as a separate command, it has
                           the following syntax:

     ------------------------------------------------------------------------
     fmp [-c file] [-d optoff] [-e opton] [-f] [-g gvalue]
       [-i] [-I directory] [-l] [-N80] [-p] [-s file]
       [-S] [-V] [input_file] [output_file]
     ------------------------------------------------------------------------

                           Options to fmp are as follows:

                           Option     Description
                           ------     -----------

                           -c file    Specifies file for CAL source stub
                                      program; used for microtasking
                                      routines.  If you do not specify file,
                                      it defaults to file multc.s.

                           -d optoff
                           -e opton
                                      Enables (-e) or disables (-d)
                                      optimization switches specified in
                                      optoff and opton, as follows:

                                      f     Generates BTRNSFRM and ETRNSFRM
                                            markers for debugging.
                                            The default is -df, which means
                                            that no debugging markers are
                                            generated.
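
                                       For example, the following command
                                       line (the file names are
                                       illustrative) enables the debugging
                                       markers:

                                         fmp -e f prog.m prog.j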

                           -f         Selects CFT or CFT2 (depending on
                                      machine type), rather than CFT77, for
                                      Fortran output syntax.  The default is
                                      CFT77 output syntax.

                           -g gvalue  Specifies value to be used by the
                                      guided and vector scheduling
                                      algorithms.  The gvalue determines the


     SG-3074 5.0               Cray Research, Inc.                         47


     Invoking FPP and FMP Directly  CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                      number of iterations to be assigned to
                                      a processor each time it is
                                      redispatched, based on the number of
                                      processors available.  A gvalue of 0
                                      causes code to be generated that reads
                                      the current number of processors
                                      acquired by the program at run time.
                                      The default gvalue is the number of
                                      physical processors on the system.
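
                                       For example, the following command
                                       line (the file names are
                                       illustrative) generates code that
                                       reads the number of processors
                                       acquired at run time:

                                         fmp -g 0 prog.m prog.j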

                           -i         Disables generation of code that uses
                                      the CFT77 Autotasking inline intrinsic
                                      functions.  External calls are
                                      generated in place of the intrinsic
                                      function calls.  This is the default
                                      for CRAY-2 systems; the CFT77
                                      Autotasking inline intrinsic functions
                                      cannot be enabled for CRAY-2 systems.

                           -I directory
                                      Specifies a directory that contains
                                      INCLUDE files for FMP to expand.  The
                                      fmp command searches first in the
                                      directory of its input file and then in
                                      directories specified by the -I option.
                                      Multiple directories may be specified
                                      on the command line by using a -I
                                      option for each directory.  If input to
                                      FMP is from stdin, FMP looks in the
                                      current working directory for the
                                      INCLUDE file.

                                      If no INCLUDE file is found in the
                                      specified directory, the fmp command
                                      generates an error message and exits.
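
                                       For example, the following command
                                       line (the file and directory names
                                       are illustrative) adds one directory
                                       to the INCLUDE search path; repeat
                                       the -I option to add others:

                                         fmp -I /usr/joe/inc prog.m prog.j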

                           -l         Replaces the last character of an 8-
                                      character name with an S or M when
                                      generating names for microtasked
                                      routines.  By default, the preprocessor
                                      creates two subroutines for each
                                      microtasked routine and appends as much
                                      of MULT or SNGL as it can, within the
                                      8-character name limit.  (This means
                                      that by default, 8-character routine
                                      names produce an FMP abort because no
                                      characters from MULT or SNGL can be
                                      added and still stay within the 8-
                                      character limit.)  With this option,
                                      however, a routine named FUNCTION would
                                      become FUNCTIOM and FUNCTIOS.  The fmp
                                      command aborts if it finds an 8-
                                      character subroutine name for a
                                      microtasked routine that already ends
                                      in M or S.

                           -N80       Specifies an 80-column Fortran input
                                      file rather than the default 72-column
                                      Fortran input file.


     48                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide  Invoking FPP and FMP Directly
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Option     Description
                           ______     ___________
                           -p         Generates FMP output suitable for use
                                      by atexpert, the Autotasking Expert
                                      System.

                           -s file    Specifies file name where the
                                      uniprocessor versions of the code are
                                      to be placed.  This includes routines
                                      that do not contain microtasking or
                                      Autotasking directives, uniprocessor
                                      versions of microtasked routines, and
                                      the modified version of routines
                                      containing parallel regions and DOALLs.
                                      The default is to direct output to
                                      output_file.  If output_file is not
                                      specified, the default is to direct
                                      output to stdout.
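
                                       For example, the following command
                                       line (the file names are
                                       illustrative) places the
                                       uniprocessor code in serial.f:

                                         fmp -s serial.f prog.m prog.j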

                           -S         Prints symbol table to stdout.

                           -V         Displays current FMP version
                                      information to standard error (stderr)
                                      during execution.  If you specify the
                                      -V option on the fmp command line, you
                                      must specify an input file, as follows:

                                        fmp -V file.m file.j

                           input_file Specifies optional file for input;
                                      default is stdin.

                           output_file
                                      Specifies optional file for output;
                                      default is stdout.

                           The FMP translator can interpret both Autotasking
                           and microtasking directives in Fortran code.
                           Subroutines in a program may contain Autotasking
                           and microtasking directives, subject to certain
                           restrictions.  See "Using microtasking and
                           macrotasking with Autotasking," page 5, for these
                           restrictions.  For more information about
                           directives, see "Concepts and Directives," page
                           51.














     SG-3074 5.0               Cray Research, Inc.                         49



                                                 Concepts and Directives  [4]
     ########################################################################







                           To use the remaining portions of this guide, you
                           must understand some fundamental multitasking
                           concepts and be familiar with the terms that
                           describe these concepts.  This section introduces
                           basic multitasking and Autotasking concepts, gives
                           definitions of these concepts, discusses the
                           levels of intervention you, as a programmer, have
                           with Autotasking, and describes directives that
                           can be used with Autotasking.




     Concepts
     4.1
                           The following definitions describe basic
                           multitasking and Autotasking concepts that are
                           useful when dealing with any of the CRI parallel
                           processing software products.  These terms are
                           also summarized in Figure 10, page 53, and Figure
                           11, page 54.

                           Term          Definition
                           ----          ----------

                           Multitasking  One program makes use of multiple
                                         processors to execute portions of
                                         the program simultaneously.  Because
                                         multiple processes or tasks execute
                                         at the same time, the execution of
                                         the processes is not synchronous.
                                         There is no guarantee that
                                         concurrent processes will execute in
                                         any given order or sequence unless
                                         the program contains implicit
                                         synchronizations or the programmer
                                         has inserted explicit
                                         synchronization mechanisms.  (Also
                                         called parallel processing.)

                           Autotasking   Automatic distribution of loop
                                         iterations to multiple processors
                                         (or tasks) using the CF77 compiling
                                         system.  Autotasking can exploit
                                         parallelism at the DO loop level; it
                                         can be fully automatic, but you also
                                         can interact with the CF77 compiling
                                         system on several levels.  See
                                         "Levels of user intervention with


     SG-3074 5.0               Cray Research, Inc.                         51


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                         Autotasking," page 56, for more
                                         information.



























































     52                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                            Multitasking Terminology


       Multitasking or Parallel processing
                                   One program makes use of multiple
                                   processors to execute portions of the
                                   program simultaneously.


       Autotasking                 Automatic distribution of loop
                                   iterations to multiple processors (or
                                   tasks) using the CF77 compiling system.


       Parallel region             Section of code that is executed by
                                   multiple processors.  All code within a
                                   parallel region can be classified as
                                   partitioned or redundant.


       Single-threaded code        Section of code that is executed by
                                   only one processor at a time.


       Serial code                 Section of code that is executed by
                                   only one processor.


       Partitioned code            Code within a parallel region in which
                                   multiple processors share the work that
                                   needs to be done.  Each processor does
                                   a different portion of the work.

     ________________________________________________________________________

                       Figure 10.  Multitasking terminology






















     SG-3074 5.0               Cray Research, Inc.                         53


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                            Multitasking Terminology
                                   (continued)


       Redundant code              Code in a parallel region in which
                                   processors duplicate the work that
                                   needs to be available to all
                                   processors.


       Data dependency             When a computation in one iteration of
                                   a loop requires a value computed in
                                   another iteration of the loop.


       Synchronization             Process of coordinating the steps
                                   within concurrent/parallel regions.


       Master task                 Task that executes all of the serial
                                   code, initiates parallel processing,
                                   and waits until parallel processing is
                                   finished before leaving the Autotasking
                                   region.


       Slave task                  Task initiated by the master task.


       Directives                  Special lines of code beginning with
                                   CDIR$, CDIR@, CMIC$, CMIC@, or CFPP$
                                   that give the compiling system
                                   information about a program.
     ________________________________________________________________________

                 Figure 11.  Multitasking terminology (continued)

                           Term          Definition
                           ----          ----------

                           Parallel region
                                         Section of code that is executed by
                                         multiple processors.  All code
                                         within a parallel region can be
                                         classified as partitioned or
                                         redundant.

                           Single-threaded code
                                         Section of code that is executed by
                                         only one processor at a time.
                                         Another processor may enter this
                                         section of code as soon as the
                                         current processor is finished
                                         executing the code.

                           Serial code   Section of code that is executed by


     54                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                         only one processor.

                           Partitioned code
                                         Code within a parallel region in
                                         which multiple processors share the
                                         work that needs to be done.  Each
                                         processor does a different portion
                                         of the work.  (Also called control
                                         structure.)

                           Redundant code
                                         Code within a parallel region in
                                         which multiple processors can
                                         execute the same code and make the
                                         results available to all processors.

                            Data dependency
                                          When a computation in one iteration
                                          of a loop requires a value that was
                                          computed in another iteration of the
                                          loop.  (An illustrative example
                                          follows this list of terms.)

                           Synchronization
                                         The process of coordinating the
                                         steps within concurrent/parallel
                                         regions.

                           Master task   The task that executes all of the
                                         serial code, initiates parallel
                                         processing when an Autotasking
                                         region is entered, performs  all or
                                         part (or none) of the work in the
                                         Autotasking region, and waits until
                                         parallel processing is finished
                                         before leaving the Autotasking
                                         region.  The code executed by the
                                         master task is in the original
                                         calling routine of a program that is
                                         being autotasked and contains the
                                         initialization and termination code
                                         for parallel execution.

                           Slave task    The task initiated by the master
                                         task that contains the parallel
                                         region code to be executed by slave
                                         processors.

                           Directives    Special lines of code beginning with
                                         CDIR$, CDIR@, CMIC$, CMIC@, or CFPP$
                                         that give the compiling system
                                         information about a program.  FPP
                                         automatically inserts CMIC@ and
                                         CDIR@ directives; you can manually
                                         add CDIR$, CMIC$, or CFPP$
                                         directives to the program.
                                         Directives CDIR$ and CDIR@ differ in
                                         that when you disable interpretation
                                         of directives by CFT77, CDIR@
                                         directives are still interpreted by


     SG-3074 5.0               Cray Research, Inc.                         55


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                         the compiler to optimize your code.
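
                            As an illustration of the data dependency
                            concept defined above, the following sketch of
                            a Fortran loop (with hypothetical array names)
                            requires a value from a previous iteration:

                                       DO 10 I = 2, N
                                          A(I) = A(I-1) + B(I)
                                    10 CONTINUE

                            Each iteration reads A(I-1), which is computed
                            by the preceding iteration, so the iterations
                            cannot safely be distributed to multiple
                            processors without synchronization.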




     Levels of user
     intervention with
     Autotasking
     4.2
                           As explained previously, Autotasking represents
                           the third phase in the development of CRI
                           multitasking software.  You have several options
                           for interaction with Autotasking:

                           * No intervention (use it as an automatic system).

                           * Insert Autotasking directives to identify
                             parallelism not detected automatically.

                           * Process previously microtasked code through the
                             Autotasking system to detect and exploit
                             parallelism in nonmicrotasked routines.

                           * Process previously autotasked code to exploit
                             more parallelism.  FPP can recognize parallelism
                             outside a loop nest; that is, FPP examines all
                             code except a loop nest that already contains an
                             Autotasking directive.  (Even subroutines
                             containing Autotasking directives are analyzed.)

                           * Insert Autotasking directives yourself,
                             bypassing FPP.

                           Autotasking can be used as a fully automatic
                           system.  You can simply use an error-free Fortran
                           program as input to the Autotasking system and let
                           the software automatically detect and exploit
                           parallelism.  For many programs, using this
                           "automatic mode" gives a substantial speedup.

                           These levels of intervention are summarized in
                           Figure 12, page 57.
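
                            In the fully automatic case, for example, you
                            can process a program with the direct fpp, fmp,
                            and cft77 invocations described in "Invoking FPP
                            and FMP Directly" (the file names here are
                            illustrative), or you can let the cf77 command
                            drive all three phases for you:

                            -------------------------------------------------
                            $ fpp prog.f > prog.m
                            $ fmp prog.m prog.j
                            $ cft77 -a stack prog.j
                            -------------------------------------------------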

















     56                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                     Levels of Intervention with Autotasking



       * No intervention (use it as an automatic system).



       * Insert Autotasking directives to identify parallelism not
         detected automatically.



       * Process previously microtasked code through the Autotasking
         system to detect and exploit parallelism in nonmicrotasked
         routines.



       * Process previously autotasked code to exploit more parallelism.
         FPP can recognize parallelism outside a loop nest; that is, FPP
         examines all code except a loop nest that already contains an
         Autotasking directive.  (Even subroutines containing Autotasking
         directives are analyzed.)



       * Insert Autotasking directives yourself, bypassing FPP.
     ________________________________________________________________________

               Figure 12.  Levels of intervention with Autotasking

                           However, you may sometimes know information about
                           the structure of a program and about its data that
                           is unavailable to FPP through inspection of
                           individual program units.  For this reason,
                           directives supply a way for you to guide FPP.  All
                           directives are treated as comments by Fortran
                           compilers, thus preserving code transportability.
                           The following subsections explain directives that
                           you can use to pass information to the CF77
                           compiling system.




     Directives
     4.3
                           You can pass information to all phases of the
                           compiling system by inserting directives into your
                           source code.  FPP, FMP, and CFT77 each have their
                           own set of directives.  For example, you can
                           insert directives to instruct FPP where to perform
                           or not perform parallel and vector dependency


     SG-3074 5.0               Cray Research, Inc.                         57


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           analysis.  The following types of directives are
                           available for use with the Autotasking system:

                           * FPP directives

                           * FMP directives

                           * Compiler directives

                           These directives are summarized in Figure 13, page
                           59.

                           FPP directives are special lines of code beginning
                           with CFPP$ that are interpreted by FPP to provide
                           more information about the program.  You can
                           manually add them to the program.  (These
                           directives are also called user directives.)

                           FMP directives are special lines of code beginning
                           with CMIC$  or CMIC@ that are interpreted by FMP
                           to give it more information about the program.
                           FPP automatically inserts CMIC@ directives; you
                           can also manually add CMIC$ directives to the
                           program.  (These directives are also called
                           microtasking directives.)

                           Compiler directives are special lines of code
                           beginning with CDIR$ and CDIR@ that are
                           interpreted by the CFT77 compiler as information
                           about the program.  FPP automatically inserts
                           CDIR@ directives; you can also manually add CDIR$
                           directives to the program.  FPP also interprets
                           certain compiler directives associated with vector
                           processing; see "Compiler directives," page 99,
                           for more information.
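
                            For illustration, one directive line of each
                            type is shown below.  The CFPP$ line uses the
                            NOVECTOR directive, which is described later in
                            this section; the CMIC$ and CDIR$ lines show
                            representative directives of those types and
                            are included only as examples of the form:

                                 CFPP$ NOVECTOR R
                                 CMIC$ DO ALL
                                 CDIR$ IVDEP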


























     58                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                                   Directives


       FPP Directives


       * Special lines of code beginning with CFPP$

       * Interpreted by FPP

       * Can manually add to the program

       * Also called user directives

         FMP Directives

       * Special lines of code beginning with CMIC$

       * Interpreted by FMP

       * FPP automatically inserts

       * Can also manually add them to the program

       * Also called microtasking directives

         Compiler directives


       * Special lines of code beginning with CDIR$ and CDIR@

       * Interpreted by the compiler

       * FPP automatically inserts CDIR@ directives

       * Can manually add CDIR$ directives to the program
     ________________________________________________________________________

                      Figure 13.  Directive summary by type



     FPP directives
     4.3.1
                           FPP or user directives have the following syntax:

                           -------------------------------------------------
                           CFPP$ directive scope
                           -------------------------------------------------

                           The C in column 1 makes the directive a comment
                           for all other Fortran compilers.  The FPP$ flags
                           this line as a directive to FPP.  Following the
                           directive is an optional scope parameter, scope.
                           Table 3 shows allowable scope values.


     SG-3074 5.0               Cray Research, Inc.                         59


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Table 3.  Allowable scope parameters for CFPP$
                                     directives

                           _______________________________________________
                           Value     Meaning       Description
                           _______________________________________________
                              R       Routine       Directive applies
                                                    until the end of the
                                                    current routine.

                              L       Loop          Default scope.
                                                    Directive applies to
                                                    the next loop, and
                                                    only the next loop;
                                                    that is, any inner and
                                                    outer loops are
                                                    considered
                                                    independently.

                              F       File          Directive applies
                                                    until the end of the
                                                    input file.

                              I       Immediate     Directive applies
                                                    immediately at that
                                                    point in the source
                                                    code.

                                      Blank         Same as L; directive
                                                    applies to the next
                                                    loop encountered.
                           _______________________________________________


                           Some directives ignore the scope parameter.
                           Directives affecting IF loops must have R or F
                           scope; directives with L scope apply only to DO
                           loops.  The body of the directive begins after one
                           or more blanks.  Many directives can be preceded
                           by NO, thus effecting the reverse operation.

                           The following example tells FPP to ignore
                           potential data dependencies in the next loop:

                                CFPP$ NODEPCHK


                           The following example disables the vectorization
                           enhancement for the rest of this routine.

                                CFPP$ NOVECTOR R


                           The following example enables the listing for the
                           rest of the input file.

                                CFPP$ LIST F





     60                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The full set of directives is summarized in Table
                           4.  The scope entry is either L, indicating that
                           it applies to the next loop; R, indicating that it
                           applies to the whole routine; I, indicating that
                           it applies immediately; or LRF, which indicates
                           that any of the loop, routine, or file options can
                           be used to control the scope.  If the scope is not
                           specified for these directives, the default is L,
                           or loop.  A short description of each of these
                           directives follows the table, grouped by
                           functionality.


                            Table 4.  CFPP$ directives

     _______________________________________________________________________
     Directive      Function                                 Default   Scope
     _______________________________________________________________________

     NOVECTOR/     Disables/enables vectorization               VECTOR   LRF
     VECTOR        enhancement

     NOCONCUR/     Disables/enables Autotasking                 CONCUR   LRF
     CONCUR

     SKIP          Disables Autotasking and vectorization         None   LRF

     INNER/        Enables/disables Autotasking for            NOINNER   LRF
     NOINNER       inner loops

     CNCALL        Allows concurrent calls in loop                None   LRF

     NOALTCODE/    Disables/enables generation of alternate    ALTCODE   LRF
     ALTCODE       code blocks

     NOASSOC/      Disables/enables all associative              ASSOC   LRF
     ASSOC         transformations

     SPLIT/        Enables/disables cutting out subroutine     NOSPLIT   LRF
     NOSPLIT       and function calls from loop

     SELECT        Selects which loop in a nest of loops          None   L
                   to optimize

     NOLSTVAL/     Disables/enables saving last values of       LSTVAL   LRF
     LSTVAL        transformed scalars

     UNROLL/       Enables/disables automatic or explicit     NOUNROLL   LRF
     NOUNROLL      loop unrolling

     NODEPCHK/     Disables/enables data dependency check       DEPCHK   LRF
     DEPCHK

     NOSYNC/       Enables/disables analysis of potential         SYNC   LRF
     SYNC          overlap of array sections

     _______________________________________________________________________


     SG-3074 5.0               Cray Research, Inc.                         61


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                     Table 4.  CFPP$ directives  (continued)
     _______________________________________________________________________
     Directive      Function                                 Default   Scope
     _______________________________________________________________________

     NOEQVCHK/	    Disables/enables checking of EQUIVALENCE    EQVCHK   LRF
     EQVCHK	    statements to see whether they cause data 
		    dependencies

     PERMUTATION    Declares that listed integer arrays,          None   R
                    for use as subscripts in array section names, 
		    have no repeated values

     RELATION       Specifies relationship between two 		  None   LRF
                    simple variables

     NOLIST/	    Disables/enables listing of the input 	  LIST   I
     LIST	    source file

     SWITCH         Sets global switches                          None   I

     COUNT          Supplies iteration count for loop             None   LRF

     ITERATIONS     Supplies iteration counts for classes         None   R
                    of loops

     AUTOEXPAND/    Enables/disables automatic routine    NOAUTOEXPAND   LRF
     NOAUTOEXPAND   inlining

     EXPAND         Expands particular routines inline            None   RF

     NEXPAND        Expands particular nested routines inline     None   RF

     SEARCH         Supplies location for source of routines      None   RF
		    to be expanded

     PRIVATEARRAY   Asserts that private arrays can be 		  None   LRF
     		    autotasked
     _______________________________________________________________________




































     62                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                          FPP Transformation Directives


       * NOVECTOR/VECTOR disables/enables vectorization enhancement;
         VECTOR serves only to toggle back from NOVECTOR.


       * NOCONCUR/CONCUR disables/enables Autotasking; CONCUR serves only
         to toggle back from NOCONCUR.


       * SKIP disables Autotasking and vectorization.


       * INNER/NOINNER enables/disables Autotasking for inner loops.


       * CNCALL allows concurrent calls in loop.


       * NOALTCODE/ALTCODE disables/enables generation of alternate code
         blocks.


       * NOASSOC/ASSOC disables/enables all associative transformations.


       * SPLIT/NOSPLIT enables/disables cutting out subroutine and
         function calls from loop.


       * SELECT selects which loop in a nest of loops to optimize.


       * NOLSTVAL/LSTVAL disables/enables saving last values of
         transformed scalars.


       * UNROLL/NOUNROLL enables/disables automatic or explicit loop
         unrolling.
     ________________________________________________________________________

                 Figure 14.  FPP transformation directive summary


     Transformation
     directives
     4.3.1.1
                           Transformation directives change the way FPP
                           transforms a loop.  These directives are
                           summarized in Figure 14, page 63, and discussed in
                           the following subsections.






     NOVECTOR/VECTOR
     4.3.1.1.1
                           The NOVECTOR directive disables vectorization
                           enhancement.  Despite the best efforts of FPP to
                           make the right choices, occasionally a loop may be
                           less efficient after transformation.  NOVECTOR is
                           provided to disable vectorization enhancement in
                           such cases.  VECTOR serves only to toggle back
                           from NOVECTOR; it does not force vectorization.

                           The -d v option to fpp is equivalent to NOVECTOR
                           with file scope.
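
                            For example, the following sequence (the loop
                            shown is illustrative only, not one of the
                            manual's samples) disables vectorization
                            enhancement for a single short loop; because
                            the directive has the default loop scope, later
                            loops are unaffected:

                                 CFPP$ NOVECTOR
                                       DO 10 I = 1, 3
                                          A(I) = B(I) + C(I)
                                  10   CONTINUE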

     NOCONCUR/CONCUR
     4.3.1.1.2
                           The NOCONCUR directive disables conversion of
                           loops to autotasked form.  The CONCUR directive
                           serves only to toggle back from, or locally
                           override, a previous directive or command-line
                           option that disabled concurrency analysis; it does
                           not force conversion of a specific loop.  (See the
                           SELECT directive, page 66, for information about
                           selecting a loop for concurrency analysis.)

                           Specifying the NOCONCUR directive with loop scope
                           (the default) does not inhibit FPP from making a
                            loop part of a parallel case, nor does it
                           inhibit FPP from expanding a parallel region
                           outside of a nonparallel loop.  You can use the
                           NOCONCUR directive with routine or file scope, or
                           use fpp command-line options (-d h and -d c) to
                           inhibit these transformations.

                           The -d c option to fpp is equivalent to NOCONCUR
                           with file scope.
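
                            For example (the loops and variable names here
                            are illustrative only), a NOCONCUR directive
                            with routine scope can disable Autotasking for
                            a routine, and a CONCUR directive can then
                            locally override it for one loop that should
                            still be considered for concurrency analysis:

                                 CFPP$ NOCONCUR R
                                       DO 10 J = 1, M
                                          C(J) = C(J) + D(J)
                                  10   CONTINUE
                                 CFPP$ CONCUR
                                       DO 20 I = 1, N
                                          A(I) = A(I) + S*B(I)
                                  20   CONTINUE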

     SKIP
     4.3.1.1.3
                           The SKIP directive disables Autotasking and
                           vectorization; it acts like a combined NOCONCUR
                           and NOVECTOR.

     INNER/NOINNER
     4.3.1.1.4
                           The INNER directive enables Autotasking of inner
                           loops.  For more information on the use of the
                           INNER directive, see "INNER directive use," page
                           148.






     CNCALL
     4.3.1.1.5
                           The CNCALL directive asserts that any subroutines
                           called in a loop have no recursive side effects;
                           they can be called concurrently by separate
                           iterations of the loop.  See "CNCALL directive
                           use," page 155, for more information on the use of
                            the CNCALL directive.
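
                            For example (the subroutine and its arguments
                            are illustrative only), the following directive
                            asserts that CALC has no side effects that
                            prevent separate iterations from calling it
                            concurrently:

                                 CFPP$ CNCALL
                                       DO 30 I = 1, N
                                          CALL CALC ( A(1,I), B(1,I), M )
                                  30   CONTINUE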

     NOALTCODE/ALTCODE
     4.3.1.1.6
                           The NOALTCODE directive disables the generation of
                           alternate code blocks.

                           For potentially dependent vector loops, the
                           ALTCODE directive directs FPP to generate both
                           vector and nonvector versions of the loop,
                           together with a run-time test to choose between
                           them based on the value of array subscripts.

                           For autotasked loops, ALTCODE directs FPP to
                           supply a similar threshold test for the IF clause
                           of the DO ALL or DO PARALLEL.

                           The ALTCODE directive allows an optional
                           parameter.  If the parameter is an integer
                           constant, FPP generates a test comparing the
                           loop's iteration count to the constant.  If the
                           iteration count is larger than the constant, the
                           loop is vectorized; otherwise, it is not.  If the
                           parameter is not an integer constant, the
                           parameter is echoed verbatim for the IF test.  If
                           the result of the IF test is true, the loop is
                           vectorized; otherwise, it is not.

                           The ALTCODE directive is in force by default.  The
                           -d m option to fpp is equivalent to NOALTCODE with
                           file scope.
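
                            For example (illustrative only; the parameter
                            is written here in the parenthesized form used
                            by directives such as COUNT and UNROLL), the
                            following directive requests both versions of
                            the loop, with the vector version chosen at run
                            time only when the iteration count exceeds 64:

                                 CFPP$ ALTCODE ( 64 )
                                       DO 40 I = 1, N
                                          A(I) = A(I) + B(I)
                                  40   CONTINUE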

     NOASSOC/ASSOC
     4.3.1.1.7
                            By default, FPP transforms certain constructs
                            into vector or concurrent versions in which the
                            order of operations may differ from that of the
                            original; that is, the operations have been
                            associatively transformed, as the associative
                            property of real numbers would permit.  Because
                            of the way numbers are represented internally
                            in computers, however, floating-point
                            arithmetic is not truly associative, so this
                            reordering may produce answers that differ
                            slightly from those of the scalar original.
                            The NOASSOC directive disables all associative
                            transformations, including the following:



                           * Reductions - sum, dot product, and index of
                             minimum and maximum

                           * Operation reordering when minimizing dependent
                             regions

                           * Linear recursion translation

                           The -d a option to fpp is equivalent to NOASSOC
                           with file scope.
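
                            For example (illustrative only), the following
                            directive keeps the dot product below in its
                            original scalar order of evaluation, at the
                            cost of the reduction optimization:

                                 CFPP$ NOASSOC
                                       DO 50 I = 1, N
                                          SUM = SUM + A(I)*B(I)
                                  50   CONTINUE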

     SPLIT/NOSPLIT
     4.3.1.1.8
                           The SPLIT directive asserts that subroutine and
                           function calls do not cause feedback of results
                           from one loop pass to another, and thus may be
                           "split out" from an optimized loop into a separate
                           loop.  For more information on the use of the
                           SPLIT directive, see "SPLIT directive use," page
                           159.

     SELECT
     4.3.1.1.9
                           The SELECT directive advises FPP to choose the
                           next loop as the one to vectorize or autotask in a
                           nest of loops.  If FPP cannot analyze the loop or
                           finds a dependence, the SELECT directive is
                           ignored.

                            In choosing a single loop from a nest, FPP
                            applies a heuristic algorithm that weighs loop
                            iteration count, the presence of data
                            dependence, and the amount of work within the
                            loop.  Because not all
                           pertinent information is available at compile
                           time, FPP may not always be able to make the best
                           choice.  Therefore, the SELECT directive allows
                           you to dictate the optimization mode of a specific
                           loop.  Place the SELECT directive directly before
                           the DO statement of the loop to be optimized.  An
                           optional argument indicates the mode of
                           optimization, either VECTOR or CONCUR.  The
                           default is VECTOR.
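
                            For example (illustrative only; the spelling of
                            the optional argument shown here is an
                            assumption), placing the following directive
                            immediately before the outer DO statement
                            advises FPP to autotask the outer loop of the
                            nest:

                                 CFPP$ SELECT CONCUR
                                       DO 70 J = 1, M
                                          DO 60 I = 1, N
                                             A(I,J) = A(I,J) + B(I,J)
                                  60      CONTINUE
                                  70   CONTINUE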

     NOLSTVAL/LSTVAL
     4.3.1.1.10
                           The NOLSTVAL directive advises FPP not to save the
                           final values for transformed scalars; that is, the
                           final values for transformed scalars (within the
                           directive's scope) do not need to be identical to
                           those in the scalar version.  This directive is
                           useful when FPP cannot determine by inspecting the
                           current subprogram whether a variable is
                            subsequently used.  Such variables are typically
                           in common blocks.

                           Transformed scalars are array indexes and promoted
                           scalars.  See "Last value saving," page 119,
                           "Array indexing," page 117, and "Scalars in
                           loops," page 186, for more information.

                           The LSTVAL directive causes FPP to save the last
                           values of transformed scalars; this is the
                           default.

     UNROLL/NOUNROLL
     4.3.1.1.11
                           The UNROLL directive has two functions:  the first
                           function is to enable the automatic unrolling of
                           loops with small constant iteration counts; the
                           second function is to force explicit unrolling of
                           a particular loop, regardless of iteration count.
                           Eliminating an inner loop by unrolling may allow
                           another loop to vectorize.

                           The UNROLL directive has the following syntax:

                           -------------------------------------------------
                           CFPP$  UNROLL  [(number_of_times)]  [{L,R,F}]
                           -------------------------------------------------

                           When routine or file scope is specified (R or F),
                           automatic unrolling of loops is enabled or
                           disabled over that scope.  The optional parameter
                           number_of_times, which must be a constant,
                           specifies the threshold loop iteration count for
                           automatic unrolling.  Loops with an iteration
                           count greater than this value are not unrolled.
                           By default, the threshold is 3.

                           To force a loop to be explicitly unrolled, use the
                           UNROLL directive with local scope (L) immediately
                           preceding the loop.  In this case, the optional
                           parameter is taken as the number of times to
                           unroll the loop.  If a parameter is not supplied,
                           FPP uses an internally calculated function of the
                           loop length, loop complexity, and default
                           threshold to determine the number of times to
                           unroll the loop.
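
                            For example (illustrative only), the following
                            directive forces the next loop to be unrolled
                            four times, regardless of its iteration count:

                                 CFPP$ UNROLL ( 4 )
                                       DO 80 K = 1, N
                                          A(K) = B(K) + C(K)
                                  80   CONTINUE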

                           The NOUNROLL directive disables automatic loop
                            unrolling; this is the default.

     ________________________________________________________________________

                         FPP Data Dependency Directives



       * NODEPCHK/DEPCHK disables/enables data dependency checks.



       * NOSYNC/SYNC disables/enables analysis of potential overlap of
         array sections.  (NOSYNC is a generalization of NODEPCHK to
         concurrency.)



       * NOEQVCHK/EQVCHK disables/enables checking of EQUIVALENCE
         statements to see whether they cause data dependencies.



       * PERMUTATION declares that listed integer arrays, for use as
         subscripts in array section names, have no repeated values.



       * RELATION specifies relationship between two simple variables.



       * PRIVATEARRAY asserts that private arrays can be autotasked.
     ________________________________________________________________________

                Figure 15.  FPP data dependency directive summary


     Data dependency
     directives
     4.3.1.2
                           Data dependency directives are used to help FPP
                           decide whether data dependency conflicts actually
                           exist in a loop.  If you know that an operation is
                           not recursive, you can supply one of these
                           directives to inform FPP.  These directives are
                           summarized in Figure 15, page 68, and briefly
                           discussed in the following subsections.  They are
                           discussed further and examples are given in "Using
                           data dependency directives," page 110.

     NODEPCHK/DEPCHK
     4.3.1.2.1
                           When elements of an array are modified within a
                           loop, FPP must determine the exact storage
                            relationship of these elements to all other
                            references to the array in the loop.  This must
                            be done to ensure that the references do not
                            overlap and thus can be safely executed in
                            parallel.
                           When the relationships cannot be determined, FPP
                           issues a potential dependency diagnostic.  The
                           NODEPCHK directive asserts that all such
                           potentially recursive relationships are, in fact,
                           not recursive.  You should use this capability
                           only when you know no real recursion exists.  Use
                           of the directive does not, however, force the
                           optimization of operations that are unambiguously
                           recursive.  Use the DEPCHK directive to toggle
                           back to the default state.  The -d d option to fpp
                           is equivalent to NODEPCHK with file scope.  See
                           "NODEPCHK (declaring nonrecursion)," page 111, for
                           more information.
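
                            For example (illustrative only; the program is
                            assumed to guarantee that M is at least N, so
                            the two regions of A cannot overlap), the
                            following directive suppresses the potential
                            dependency diagnostic for the next loop:

                                 CFPP$ NODEPCHK
                                       DO 10 I = 1, N
                                          A(I+M) = A(I) + B(I)
                                  10   CONTINUE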

     NOSYNC/SYNC
     4.3.1.2.2
                           The NOSYNC directive is a generalization of the
                           NODEPCHK directive to concurrency.  FPP generates
                           the diagnostic MUST SYNCHRONIZE TO PRESERVE ORDER
                           OF ACCESSES when one processor might write over
                           elements of an array before another processor
                           reads those elements.  If there is no overlap, you
                           can use the NOSYNC directive to allow full
                           optimization.

                           You can use the SYNC directive to toggle back to
                           the default state.

     NOEQVCHK/EQVCHK
     4.3.1.2.3
                           The NOEQVCHK directive tells FPP to ignore
                           relationships between variables caused by
                           EQUIVALENCE statements, when examining the data
                           dependencies in a loop.  You can use the EQVCHK
                           directive to toggle back to the default state.
                           The -d e option to fpp is equivalent to a NOEQVCHK
                           directive with file scope.  See the example
                           "NOEQVCHK (declaring nonrecursion in
                           equivalences)," page 112.

     PERMUTATION
     4.3.1.2.4
                           The PERMUTATION directive declares that an integer
                           array does not have repeated values.  This is
                           useful when the integer array is used as a
                           subscript for another array ("indirect
                           addressing").  If it is known that the integer
                           array is used merely to permute the elements of
                           the subscripted array, it can often be determined
                            that feedback does not exist with that array
                           reference.  See "PERMUTATION (declaring safe
                           indirect addressing)," page 114, for directive
                           syntax and more information.
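
                            For example (illustrative only; the list form
                            of the directive shown here is an assumption,
                            and the exact syntax is given in the section
                            cited above), if every element of IX is
                            distinct, the following declaration lets FPP
                            optimize the indirectly addressed loop:

                                 CFPP$ PERMUTATION ( IX )
                                       DO 10 I = 1, N
                                          A(IX(I)) = A(IX(I)) + B(I)
                                  10   CONTINUE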

     RELATION
     4.3.1.2.5
                           The RELATION directive advises FPP that a
                           specified relationship exists between two integer
                           variables or between an integer variable and an
                           integer constant.  This information may be useful
                           to FPP in resolving otherwise ambiguous array
                           relationships.

                            RELATION directives are informative only; they do
                           not force any action.  They can be applied at the
                           loop, routine, or file level.  If conflicting
                           relations are given, the result is unpredictable.
                           You must ensure that the relations specified are
                           correct and consistent.  See "RELATION (specifying
                           relationship between variables)," page 113, for
                           the syntax of the directive and more information.

     PRIVATEARRAY
     4.3.1.2.6
                           The PRIVATEARRAY directive tells FPP that it is
                           safe to autotask private arrays, specifically,
                           that private arrays use only values generated
                           within the autotasked loop.  The PRIVATEARRAY
                            directive has no parameters and can have loop,
                            routine, or file scope.

                           The fpp -e 4 command-line option is equivalent to
                           the PRIVATEARRAY directive with file scope.

                           Example:

                                CFPP$ PRIVATEARRAY
                                      DO 500 J = 1, M
                                         DO 100 I = 1, N
                                            X(I) = A(I,J) + B(I,J)
                                100      CONTINUE
                                         DO 200 I = 1, NM1
                                            C(I,J) = X(I) + D(I,J)
                                200      CONTINUE
                                500   CONTINUE


                           Without the PRIVATEARRAY directive, FPP does not
                           autotask the 500 loop, because it cannot tell
                           whether all the values of X used in the 200 loop
                           are generated by the 100 loop.  (If they are not,
                           the values generated outside the 500 loop are
                           required, and X cannot efficiently be made
                           private.)






     Miscellaneous
     directives
     4.3.1.3
                           FPP directives also exist that cannot be
                           categorized in any of the preceding classes.
                           These directives are summarized in Figure 16, page
                           72, and discussed in the following subsections.

     Advisory directives:
     COUNT and ITERATIONS
     4.3.1.3.1
                           Advisory directives provide information for FPP
                           that may result in a better choice of loops to be
                           optimized.

                           If the iteration count of a loop (or class of
                           loops) is variable and cannot be determined from
                           the information in the routine until execution
                           time, but you know the approximate number of
                           iterations, you can use the COUNT or ITERATIONS
                           directive to supply this information.

                           The COUNT directive has the following syntax:

                           -------------------------------------------------
                           CFPP$  COUNT (val1) [{L,R,F}]
                           -------------------------------------------------

                           The ITERATIONS directive has the format:

                           -------------------------------------------------
                           CFPP$  ITERATIONS (var1=val1 [,var2=val2] ...)
                           -------------------------------------------------

                           var1, var2...    Specifies indices of loops.

                           val1, val2...    Specifies vector length values
                                             for the given loop indices.
                                             These values do not have to be
                                             exact because they are used only
                                             as guidelines.

                           The COUNT directive can be used at the file,
                           routine, or loop levels.  The ITERATIONS directive
                           can be used only at the routine level.  A CFPP$
                           COUNT(0) F or NOITERATIONS directive returns FPP
                            to its normal iteration count processing.

     ________________________________________________________________________


                          Miscellaneous FPP Directives




       * COUNT lets you provide approximate iteration counts for loops to
         FPP.




       * ITERATIONS lets you provide approximate iteration counts for
         classes of loops to FPP.




       * NOLIST/LIST disables or enables the listing of the input source
         file.




       * SWITCH lets you set (or change) FPP global switches.
     ________________________________________________________________________

                 Figure 16.  Miscellaneous FPP directive summary

                           Example:

            SUBROUTINE OPTIM6 ( A, B, N )
            REAL A(N), B(N)
      C
      CFPP$ COUNT ( 3 ) R
      C     (The following loop is not autotasked; the asserted
      C     iteration count is too small.)
            DO 606 I = 1,N
               A(I) = B(I)
        606 CONTINUE
      C
            END

                           Example:

                                      SUBROUTINE ITERS (A,B,C,D,M,MM1,N,NP1)
                                      REAL A(M,N), B(M,N), C(M,N), D(M,N)
                                C
                                CFPP$ ITERATIONS (I=15,J=20)
                                C
                                      DO 200 J = 1, M
                                         DO 100 I = 1, N
                                            A(I,J) = B(I,J) + C(I,J)
                                 100     CONTINUE
                                 200  CONTINUE
                                C
                                      CALL CALC1
                                C
                                      DO 400 J = 1, MM1
                                         DO 300 I = 2, NP1
                                            A(I,J) = A(I,J) + D(I,J)
                                            B(I,J) = B(I,J) + D(I,J)
                                            C(I,J) = C(I,J) + D(I,J)
                                 300     CONTINUE
                                 400  CONTINUE


                           Translation:

                                      DO 200 J = 1, M
                                CDIR@ IVDEP
                                         DO 100 I = 1, N
                                            A(I,J) = B(I,J) + C(I,J)
                                 100     CONTINUE
                                 200  CONTINUE
                                C
                                      CALL CALC1
                                C
                                CMIC$ DO ALL SHARED(MM1, NP1, M, N, A, D, B,
                                CMIC$1   C) PRIVATE(J, I)
                                      DO 400 J = 1, MM1
                                CDIR@ IVDEP
                                         DO 300 I = 2, NP1
                                            A(I,J) = A(I,J) + D(I,J)
                                            B(I,J) = B(I,J) + D(I,J)
                                            C(I,J) = C(I,J) + D(I,J)
                                 300     CONTINUE
                                 400  CONTINUE


                           Without the ITERATIONS directive, both loop nests
                           would have been conditionally autotasked,
                           resulting in lower performance for the DO 200 loop
                           and unnecessary compile-time and run-time overhead
                           for the DO 400 loop.

     Listing directives:
     NOLIST and LIST
     4.3.1.3.2
                            Listing directives change the appearance of the
                           FPP listing.  The following subsections discuss
                           the FPP listing directives.

                           You can selectively suppress listing of the input
                           source code with the NOLIST/LIST directive pair.
                           If the NOLIST directive (or the -q l option to
                           fpp) is in force when the END statement is
                           encountered, the rest of the listing (messages,
                           translated source, summaries) is suppressed,
                           unless specifically enabled (with the -p option
                           switches to fpp).
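
                            For example (the declarations shown are
                            illustrative only), the following pair
                            suppresses the source listing for a block of
                            declarations and then resumes it; both
                            directives take effect immediately:

                                 CFPP$ NOLIST
                                       REAL WORK(10000)
                                       INTEGER IWORK(10000)
                                 CFPP$ LIST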

     SWITCH
     4.3.1.3.3
                           The SWITCH directive enables you to set (or
                           change) global switches, including listing
                           switches.  You can also use the SWITCH directive
                           to set optimization and reformatting switches.
                           See Table 1, page 42, for a list of the
                           optimization switches.  See Table 2, page 46, for
                            a list of listing switches.  See "FPP TIDY
                            Subprocessor" for a list of the reformatting
                            switches.

                           The format of the SWITCH directive is as follows:

     ------------------------------------------------------------------------
      CFPP$ SWITCH,OPTON=string,OPTOFF=string,LSTON=string,
        LSTOFF=string,TDYON=string,TDYOFF=string,TIDY parameters
     ------------------------------------------------------------------------

                            Parameters OPTON, OPTOFF, LSTON, LSTOFF, TDYON,
                            and TDYOFF correspond to fpp options -e, -d,
                            -p, -q, -r, and -n, respectively.

                           Blanks are not significant, and keywords and
                           switches can be in either uppercase or lowercase.
                           See "FPP TIDY Subprocessor," page xx.x 0, for
                           information on TIDY parameters specific to the
                           SWITCH directive.
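
                            For example (illustrative only; substitute
                            switch letters from Table 1 and Table 2 for the
                            placeholder letters shown here), the following
                            directive turns on one optimization switch and
                            turns off one listing switch from this point in
                            the input on:

                                 CFPP$ SWITCH,OPTON=f,LSTOFF=s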

     Inline expansion
     directives
     4.3.1.3.4
                           Inline expansion directives provide information
                           for FPP that allows expansion of the bodies of
                           certain subroutines and functions into the loops
                           that call them.  The directives are as follows:

                           * AUTOEXPAND

                           * EXPAND




                           * NEXPAND

                           * SEARCH

                           See "Inline expansion," page 174, for more
                           information about these directives and examples of
                           their use.



     FMP directives
     4.3.2
                           The FMP translator interprets CMIC$ and CMIC@
                           directives in Fortran code.

                           Autotasking and microtasking directives can be
                           used in the same subprogram unit, with
                           restrictions.  See the following note and "Using
                           microtasking and macrotasking with Autotasking,"
                           page 5, for a complete discussion of these
                           restrictions.

                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                  Note

                           Autotasking CMIC$ directives inhibit FPP action on
                           any loop nest in which they appear.  Also, FPP
                           does not try to optimize anything inside a
                           parallel region (that is, anything bounded by a
                            CMIC$ PARALLEL and CMIC$ END PARALLEL pair).
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


                           The general form of FMP directives is as follows:

                           -------------------------------------------------
                           CMIC$ GENERIC_DIRECTIVE directive_parameters
                           CMIC$*directive_parameters continued
                           -------------------------------------------------

                           User-specified FMP directives begin with "CMIC$ "
                           in columns 1 through 6 and directive text in
                           columns 7 through 72.  Directives can be continued
                           by using CMIC$ in columns 1 through 5 and any
                           nonblank, nonzero character in column 6.
                           Parameters on directives (for example, PRIVATE)
                           may be repeated as needed, and need not be
                           ordered.  Uppercase and lowercase may be used
                           freely in the directive text.  In the
                           descriptions, brackets [ ] delimit optional
                           parameters to individual directives.
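
                            For example (illustrative only), the following
                            directive is continued onto a second line by
                            placing CMIC$ in columns 1 through 5 and a
                            nonblank, nonzero character in column 6:

                                 CMIC$ DO ALL SHARED(A, B, N)
                                 CMIC$* PRIVATE(I)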







                           Certain FMP directives are the same for both
                           Autotasking and microtasking.  Common directives
                           include the following:

                                CMIC$ CONTINUE
                                CMIC$ END GUARD
                                CMIC$ GETCPUS
                                CMIC$ GUARD
                                CMIC$ RELCPUS


                           There are also FMP directives for Autotasking and
                           microtasking that perform basically the same
                           functions.  Equivalent directives are shown in
                           Table 5.

                           Table 5.  Equivalent Autotasking and microtasking
                                     directives

                           __________________________________________________
                           Autotasking                 Microtasking
                           __________________________________________________
                           CMIC$ CASE                  CMIC$ PROCESS
                           CMIC$ END CASE              CMIC$ ALSO PROCESS
                                                       CMIC$ END PROCESS
                           CMIC$ DO PARALLEL           CMIC$ DO GLOBAL
                           CMIC$ SOFT EXIT             CMIC$ STOP ALL PROCESS
                           __________________________________________________


                           The data scope rules used for FMP directives are
                           discussed in "FMP data scope rules," page 95.


     FMP Autotasking
     directives
     4.3.2.1
                           FMP Autotasking directives provide a way for you
                           to specify loop-level parallelism in your
                           programs; you can start and end parallel
                           processing at any number of suitable points within
                           a subroutine.  Using these directives eliminates
                           the microtasking requirement that parallel
                           processing always start at the first executable
                           statement of a subroutine and always end at the
                           last executable statement of the subroutine.
                           Figure 17, page 77, shows the difference between
                           parallel processing with Autotasking and
                           microtasking.

                           These directives are also useful when FPP fails to
                           recognize parallelism that you know exists.  They
                           are summarized in Figure 18, page 78, and
                           discussed in the following subsections.





     See the printed manual for this figure; it doesn't display on-line.



                   Figure 17.  Autotasking versus microtasking

     ________________________________________________________________________

                           FMP Autotasking Directives



       * CMIC$ DO ALL



       * CMIC$ PARALLEL and CMIC$ END PARALLEL



       * CMIC$ DO PARALLEL and CMIC$ END DO



       * CMIC$ CASE and CMIC$ END CASE



       * CMIC$ GUARD and CMIC$ END GUARD



       * CMIC$ CONTINUE



       * CMIC$ SOFT EXIT



       * CMIC$ TASKCOMMON
     ________________________________________________________________________

                      Figure 18.  FMP Autotasking directives

     CMIC$ DO ALL
     4.3.2.1.1
                           The CMIC$ DO ALL directive has the following
                           options:

                           -------------------------------------------------
                           CMIC$ DO ALL [IF (expr)] [SHARED (var [,...])]
                           [PRIVATE (var [,...])] [AUTOSCOPE]
                           [CONTROL(var [,...])] [SAVELAST] [MAXCPUS (val)]
                           {[SINGLE] | [CHUNKSIZE (n)] | [NUMCHUNKS (m)]
                           | [GUIDED] | [VECTOR]}
                           -------------------------------------------------


                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                Caution

                            In this description and in the remaining
                            descriptions of FMP directives, the directive
                            is wrapped onto several lines so that all
                            possible options can be shown; the syntax is
                            therefore not correct exactly as shown.  In
                            actual code, directives are continued by using
                            CMIC$ in columns 1 through 5 and any nonblank,
                            nonzero character in column 6.  See "FMP
                            directives," page 75, for a complete
                            description of the correct syntax.
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


                           The DO ALL directive indicates that the DO loop
                           that begins on the next line may be executed in
                           parallel by multiple processors.  No directive is
                            needed to end a DO ALL loop (that is, the DO ALL
                           initiates a parallel region whose only code is a
                           DO loop with independent iterations).  The loop
                           index variable for a DO ALL is PRIVATE.  Optional
                           parameters are as follows:

                           Parameter   Description
                           ---------   -----------

                           IF(expr)    Performs a run-time test to choose
                                       between uniprocessing and
                                       multiprocessing.  When not specified,
                                       multiprocessing is chosen if the loop
                                       was not called from within a parallel
                                       region.  The logical expression (expr)
                                       determines (at run-time) whether
                                       multiprocessing will occur.  When expr
                                       is true, multiprocessing is enabled.

                           SHARED(var1,var2,...)
                                       Specifies that the variables listed
                                       will have shared scope; that is, they
                                       are accessible to both the original
                                       task and all helper tasks.  The SHARED
                                       clause identifies those variables that
                                       are shared between parallel processes.
                                        The data scope rules used in a
                                       partitioned loop are discussed in "FMP
                                       data scope rules," page 95.

                           PRIVATE(var1,var2,...)
                                       Specifies that the variables listed
                                       will have private scope; that is, each
                                       task (original or helper) will have
                                       its own private copy of these
                                       variables.  The PRIVATE clause
                                       identifies those variables that are
                                       not shared between parallel processes.
                                       The data scope rules used in a
                                       partitioned loop are discussed in "FMP
                                       data scope rules," page 95.

                            AUTOSCOPE   Specifies that all variables not
                                        explicitly scoped with a PRIVATE or
                                        SHARED declaration are scoped
                                        according to the default rules
                                       for scoping variables.  The data scope
                                       rules used in a partitioned loop are
                                       discussed in "FMP data scope rules,"
                                       page 95.

                           CONTROL(var1,var2,...)
                                       Specifies that the variables listed
                                       are considered control variables for
                                       the purpose of the AUTOSCOPE
                                        parameter.  An array indexed by a
                                       control variable has shared scope.

                           SAVELAST    When present, this parameter specifies
                                       that private variables' values (from
                                       the final iteration of a DO ALL) will
                                       persist in the original task after
                                       execution of the iterations of the DO
                                       ALL.  By default, private variables
                                       are not guaranteed to retain the last
                                       iteration values.  SAVELAST can be
                                       used only with DO ALL, and if the full
                                       iteration set is not completed (for
                                       example, due to a SOFT EXIT), the
                                       values of private variables are
                                       indeterminate.

                           MAXCPUS (val)
                                       Specifies the maximum number of CPUs
                                       that the parallel region can use
                                       effectively.  Specifying MAXCPUS(val)
                                        does not ensure that val processors
                                        will be assigned; it
                                       specifies the optimal maximum.
                                       Argument val can be either a constant
                                       or a variable; both of the following
                                       are valid specifications:

                                         MAXCPUS (2)

                                         MAXCPUS (val)



                           The rest of the parameters (SINGLE, CHUNKSIZE,
                           NUMCHUNKS, GUIDED, and VECTOR) specify the work
                           distribution policy for the iterations of the
                           parallel DO loop.  These parameters are summarized
                           in Figure 19, page 82, and discussed in more
                           detail in the following paragraphs.  By default,
                           the iterations are distributed one at a time
                           (SINGLE).  You can select only one of the
                           following work distribution algorithms for a given
                           DO loop:

                           Parameter   Description
                           ---------   -----------

                            SINGLE      Specifies that the iterations are
                                        distributed one at a time to
                                        available processors.

                           CHUNKSIZE(n)
                                       Specifies the number of iterations to
                                       distribute to an available processor.
                                       n is an expression (for best
                                       performance, n should be an integer
                                       constant).  (For example, given 100
                                       iterations and CHUNKSIZE(4), 4
                                       iterations at a time are distributed
                                       to each available processor until the
                                       100 iterations are complete.)
                                       CHUNKSIZE(64) is an analog of the
                                       microtasking LONGVECTOR directive.

                           NUMCHUNKS(m)
                                        Specifies that the iterations are
                                        divided into m chunks of equal size
                                        (with a possible smaller residual
                                        chunk) and distributed to available
                                        processors.  (For example, given 100
                                       iterations and NUMCHUNKS(4), 25
                                       iterations at a time are distributed
                                       to each available processor until the
                                        100 iterations are complete.)

     ________________________________________________________________________

                 Work Distribution Parameters for Parallel Loops




       SINGLE                 Hand out the iterations one at a time to
                               available processors.  (Default
                               distribution.)



       CHUNKSIZE(n)           Number (n) of iterations to distribute to an
                               available processor.



       NUMCHUNKS(m)           Divide the iterations into m chunks of equal
                               size (with a possible smaller residual
                               chunk), and distribute these chunks to
                               available processors.



        GUIDED                 Uses "Guided Self-scheduling" to partition
                                the iteration space.



        VECTOR                 Specifies a scheduling algorithm used only
                                when "stripmining" an innermost vectorized
                                loop.  (Default for inner-loop Autotasking.)
     ________________________________________________________________________

           Figure 19.  Work distribution parameters for parallel loops

                           Parameter   Description
                           ---------   -----------

                           GUIDED      Specifies the use of "Guided Self-
                                       scheduling" to distribute the
                                       iterations to available processors.
                                       This mechanism does a good job of
                                       minimizing synchronization overhead
                                       while providing acceptable dynamic
                                       load balancing.

                           VECTOR      Specifies the use of "Guided Self-
                                       scheduling" to distribute a minimum of
                                       64 iterations to available processors.
                                       Also specifies the use of a special
                                       scheduling algorithm when
                                       "stripmining" an innermost vectorized
                                       loop.
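
                            As an illustration (the loop and variable names
                            are not taken from the manual's examples), the
                            following DO ALL combines several of the
                            preceding parameters:  a run-time test that
                            multiprocesses the loop only when it is long
                            enough, explicit scoping, and chunked work
                            distribution:

                                 CMIC$ DO ALL IF(N.GT.1000) SHARED(A, B, N)
                                 CMIC$* PRIVATE(I) CHUNKSIZE(64)
                                       DO 10 I = 1, N
                                          A(I) = A(I) + B(I)
                                  10   CONTINUE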






     CMIC$ PARALLEL and
     CMIC$ END PARALLEL
     4.3.2.1.2
                           The CMIC$ PARALLEL/END PARALLEL directives have
                           the following options:

                           -------------------------------------------------
                           CMIC$ PARALLEL [IF (expr)] [SHARED (var [,...])]
                           [PRIVATE (var [,...])] [AUTOSCOPE]
                           [CONTROL (var [,...])] [MAXCPUS (val)]

                           CMIC$ END PARALLEL
                           -------------------------------------------------

                           The PARALLEL directive marks the start of a
                           parallel region.  The END PARALLEL directive marks
                           the end of a parallel region.  The scope of a
                           variable in a parallel region is either shared or
                           private.  Shared variables are used by all
                           processors; private variables are unique to a
                           processor.  Parallel regions are combinations of
                           redundant code blocks and partitioned code blocks.
                           The PARALLEL directive indicates where multiple
                           processors enter execution, which may be different
                           from where they demonstrate a direct benefit
                           (partitioned code block).  See the descriptions of
                           the optional parameters (IF, SHARED, PRIVATE,
                           AUTOSCOPE, CONTROL, and MAXCPUS) in the DO ALL
                           directive description, page 79.
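
                            The following sketch is for illustration only and
                            is not taken from FPP output; the array names,
                            loop bounds, and IF test are hypothetical.  It
                            shows a parallel region in which a redundant code
                            block precedes a partitioned code block:

                                 CMIC$ PARALLEL  IF (N.GT.100)
                                 CMIC$1SHARED(A,B,N)  PRIVATE(I,SCALE)
                                 C     Redundant code block: every processor
                                 C     entering the region computes its own
                                 C     private copy of SCALE.
                                       SCALE = 1.0/REAL(N)
                                 CMIC$ DO PARALLEL
                                       DO 10 I = 1,N
                                       A(I) = SCALE*B(I)
                                 10    CONTINUE
                                 CMIC$ END PARALLEL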

     CMIC$ DO PARALLEL and
     CMIC$ END DO
     4.3.2.1.3
                           The CMIC$ DO PARALLEL/END DO directives have the
                           following options:

                           -------------------------------------------------
                           CMIC$ DO PARALLEL {[SINGLE]|[CHUNKSIZE (n)]|
                           [NUMCHUNKS (m)]|[GUIDED]|[VECTOR]}

                           CMIC$ END DO
                           -------------------------------------------------

                           The DO PARALLEL directive indicates that the DO
                           loop that begins on the next line may be executed
                           in parallel by multiple processors.  A directive
                           is not needed to end a DO PARALLEL loop.  A
                           control structure can be extended beyond the loop
                           by using the END DO directive.  The END DO
                           directive marks the end of a partitioned code
                           block that contains a DO PARALLEL loop.  This
                           ability to define partitioned code blocks that
                           contain DO loops as well as other code lets
                           Autotasking exploit parallelism in loops


     SG-3074 5.0               Cray Research, Inc.                         83


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           containing some forms of reduction computations.
                           These directives can be used only within a
                           parallel region, which is bounded by PARALLEL/END
                           PARALLEL directives.

                           The DO PARALLEL directive is equivalent to a CMIC$
                           DO GLOBAL microtasking directive.

                           The rest of the parameters (SINGLE, CHUNKSIZE,
                           NUMCHUNKS, GUIDED, and VECTOR) specify the work
                           distribution policy for the iterations of the
                           parallel DO loop.  By default, the iterations are
                           distributed one at a time (SINGLE).  Only one of
                           the work distribution algorithms can be chosen for
                           a given DO loop.  See the descriptions of SINGLE,
                           CHUNKSIZE, NUMCHUNKS, GUIDED, and VECTOR in the DO
                           ALL directive description, page 81.
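
                            As a hypothetical illustration (the loop, array
                            names, and chunk size are not from the manual's
                            examples), a chunked distribution could be
                            requested as follows:

                                 CMIC$ PARALLEL SHARED(A,B,C,N) PRIVATE(I)
                                 C     Each processor receives 64 iterations
                                 C     of the loop at a time.
                                 CMIC$ DO PARALLEL CHUNKSIZE(64)
                                       DO 30 I = 1,N
                                       A(I) = B(I)+C(I)
                                 30    CONTINUE
                                 CMIC$ END PARALLEL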

                           In the following example, a parallel region
                           (PARALLEL/END PARALLEL) is defined that uses a DO
                           PARALLEL/END DO pair and GUARD/END GUARD pair to
                           implement a parallel reduction computation.  A
                           description of the GUARD/END GUARD directives
                           follows the example.

                           Example:

                                      SUM = 0.0
                                      BIG = -1.0
                                CMIC$ PARALLEL  PRIVATE(XSUM,XBIG)
                                CMIC$1SHARED(SUM,BIG,AA,BB,CC)
                                      XSUM = 0.0
                                      XBIG = -1.0
                                CMIC$ DO PARALLEL
                                      DO 200   I = 1,2000
                                      :
                                      XSUM = XSUM+(AA(I)*(BB(I)-CC(AA(I))))
                                      XBIG = MAX(ABS(AA(I)*BB(I)),XBIG)
                                      :
                                200   CONTINUE
                                CMIC$ GUARD
                                      SUM = SUM+XSUM
                                      BIG = MAX(XBIG,BIG)
                                CMIC$ END GUARD
                                CMIC$ END DO
                                CMIC$ END PARALLEL


                            In this example, the GUARD/END GUARD pair
                            protects the update of the shared variables (SUM
                            and BIG), and the DO PARALLEL/END DO pair ensures
                            that all contributions to SUM and BIG are
                            included.

     CMIC$ CASE and
     CMIC$ END CASE
     4.3.2.1.4
                           The CASE directive serves as a separator between


     84                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           adjacent code blocks that are concurrently
                           executable.  The END CASE directive serves as the
                           terminator for a group of one or more parallel
                           CASE directives.  These directives can appear only
                           in a parallel region.

                           The CASE directive is equivalent to the CMIC$
                           PROCESS/ALSO PROCESS microtasking directives.  The
                           END CASE directive is equivalent to a CMIC$ END
                           PROCESS microtasking directive.

                           In the following example, CASE directives have
                           been added.  Currently, FPP does not perform
                           interprocedural analysis and would not, therefore,
                           add the CASE directives automatically for this
                           example.  Because CASE directives have been added
                           in the following example, subroutines called
                           within the CASE/END CASE directives are
                           concurrently executable:

                                CMIC$ PARALLEL
                                CMIC$   CASE
                                        CALL ABC
                                CMIC$   CASE
                                        CALL DEF
                                CMIC$   CASE
                                        CALL GHI
                                CMIC$   END CASE
                                CMIC$ END PARALLEL


                           The work in the subroutine calls completes before
                           execution continues with the code below the END
                           CASE.  A special form of the CASE/END CASE
                           directive pair forces only one processor to
                           execute a code block in a parallel region, as
                           follows:

                                CMIC$ PARALLEL
                                CMIC$ CASE
                                      CALL XYZ
                                CMIC$ END CASE
                                       :
                                CMIC$ DO PARALLEL
                                      DO 200   I = 1,IMAX
                                       :
                                200   CONTINUE
                                CMIC$ END PARALLEL


                           In the preceding example, only one processor calls
                           XYZ.

     CMIC$ GUARD and
     CMIC$ END GUARD
     4.3.2.1.5
                           The CMIC$ GUARD/END GUARD directives have the


     SG-3074 5.0               Cray Research, Inc.                         85


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           following syntax:

                           -------------------------------------------------
                           CMIC$ GUARD [n]
                           CMIC$ END GUARD [n]
                           -------------------------------------------------

                           The GUARD/END GUARD directive pair delimits a
                           critical region, and it provides the necessary
                           synchronization to protect (or guard) the code
                           inside the critical region.  A critical region is
                           a code block that is to be executed by only one
                           processor at a time, although all processors in
                           the parallel region execute it.

                           The optional n parameter is an expression that
                           serves as a mutual exclusion flag (using the low-
                           order 6 bits of the value).  That is, GUARD 1 and
                           GUARD 2 can be active concurrently, but two GUARD
                           7 directives cannot.  For optimal performance, n
                           should be an integer constant, and the general
                           expression capability is provided only for the
                           unusual case that the critical region number must
                            be passed to a lower-level routine.  When n is
                            not provided, the critical region blocks only
                            other instances of itself, not other critical
                            regions.  Critical regions may appear anywhere in
                           a program; that is, they are not limited to
                           parallel regions.
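
                            The following hypothetical fragment (the variable
                            names are illustrative only) shows how numbered
                            critical regions let unrelated updates proceed
                            concurrently; a processor inside GUARD 1 does not
                            block a processor inside GUARD 2:

                                 CMIC$ GUARD 1
                                       SUM = SUM+XSUM
                                 CMIC$ END GUARD 1
                                        :
                                 CMIC$ GUARD 2
                                       BIG = MAX(XBIG,BIG)
                                 CMIC$ END GUARD 2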

     CMIC$ CONTINUE
     4.3.2.1.6
                           The CONTINUE directive indicates that the external
                           routine called on the next line is a microtasked
                           subroutine.  The Fortran dependence analyzer (FPP)
                           does not generate this directive, and it cannot
                           prepare the called subprogram for this special
                           form of processing.
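
                            A minimal sketch follows; MSUB is a hypothetical
                            external routine that has already been prepared
                            for microtasking by hand:

                                 CMIC$ CONTINUE
                                       CALL MSUB(A,B,N)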

     CMIC$ SOFT EXIT
     4.3.2.1.7
                           The SOFT EXIT directive indicates that the branch
                           statement on the next line jumps somewhere outside
                           of the current parallel region.  Use of a SOFT
                           EXIT directive, in effect, terminates the parallel
                           region, and is typically used in search loops,
                           where if a single process reaches a specified
                           condition, all processors should stop.  Jumps can
                           have different targets.  You can use multiple SOFT
                           EXIT directives within one parallel region.

                           The SOFT EXIT directive is equivalent to a CMIC$
                           STOP ALL PROCESS microtasking directive.




     86                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           Branch statements that jump around inside a
                           parallel region should not be preceded by a SOFT
                           EXIT directive.  Jumps to labels completely
                           outside of a parallel region must be preceded by a
                           SOFT EXIT directive.  Jumps from inside a DO
                           PARALLEL or a CASE structure to areas outside of
                           the structure, but still inside the parallel
                           region, are not allowed.  Jumps from one CASE into
                           another CASE of a multiple CASE/END CASE structure
                           are also not permitted.
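
                            The following sketch is hypothetical (the search
                            condition, labels, and names are illustrative
                            only).  It shows a search loop in which the first
                            processor to find the key value terminates the
                            parallel region for all processors:

                                 CMIC$ PARALLEL SHARED(A,KEY,N) PRIVATE(I)
                                 CMIC$ DO PARALLEL
                                       DO 10 I = 1,N
                                       IF (A(I) .EQ. KEY) THEN
                                 CMIC$ SOFT EXIT
                                       GO TO 20
                                       ENDIF
                                 10    CONTINUE
                                 CMIC$ END PARALLEL
                                 20    CONTINUE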

     CMIC$ TASKCOMMON
     4.3.2.1.8
                           A CMIC$ TASKCOMMON directive causes FMP to change
                           each occurrence of a specified common block into a
                           task common block (throughout all routines).
                           Converting a common block into a task common block
                            makes the contents of the block local to a task
                            but global within that task.  It also ensures
                            that processes get separate copies of the
                            contents of these blocks.

                           The CMIC$ TASKCOMMON directive has the following
                           syntax:

                           -------------------------------------------------
                           CMIC$ TASKCOMMON blocks
                           -------------------------------------------------

                           Argument blocks is a comma-separated list of
                           common blocks to be converted to task common
                            blocks.  You must specify this directive before
                            the first executable statement of a program unit
                            and before the COMMON statements that declare
                            the specified blocks.  Common
                           blocks to be converted with this directive should
                           contain only read-only variables and write-first
                           variables; otherwise, correct results cannot be
                           ensured.

                           Example:

                                CMIC$ TASKCOMMON DATA
                                      COMMON /DATA/ A(100), B(100)


                           This directive is equivalent to the following
                           code:

                                      TASK COMMON /DATA/ A(100), B(100)


                           You can also specify multiple blocks, as in the
                           following example:






     SG-3074 5.0               Cray Research, Inc.                         87


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                CMIC$ TASKCOMMON data1, data2, data3
                                      . . .
                                      COMMON /data1/ a,b,c
                                      COMMON /data2/ d,e,f
                                      COMMON /data3/ g,h,i


                           FMP also recognizes CDIR$ TASKCOMMON compiler
                           directives, which are placed before the first
                           executable statement of a program unit, and
                           specify which common blocks are to be converted to
                           task common blocks.
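
                            For example, the CDIR$ form of the first example
                            above might be written as follows (this sketch
                            reuses the block and array names from that
                            example):

                                 CDIR$ TASKCOMMON DATA
                                       COMMON /DATA/ A(100), B(100)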


     FMP microtasking
     directives
     4.3.2.2
                           In addition to the preceding Autotasking
                           directives, FMP recognizes the microtasking
                           directives.  These directives are summarized in
                           Figure 20, page 89, and described in the following
                           subsections.




































     88                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                           FMP Microtasking Directives



       * CMIC$ MICRO



       * CMIC$ DO GLOBAL



       * CMIC$ DO GLOBAL LONG VECTOR



       * CMIC$ DO GLOBAL BY expression



       * CMIC$ DO GLOBAL FOR expression



       * CMIC$ GETCPUS



       * CMIC$ PROCESS, CMIC$ ALSO PROCESS, and CMIC$ END PROCESS



       * CMIC$ RELCPUS



       * CMIC$ STOP ALL PROCESS
     ________________________________________________________________________

                  Figure 20.  FMP microtasking directive summary

     CMIC$ MICRO
     4.3.2.2.1
                           The CMIC$ MICRO directive designates a subroutine
                           to be microtasked and appears just before the
                           SUBROUTINE statement.  A subroutine introduced in
                           this way becomes a microtasked subroutine, or
                           fray.  Executing a RETURN or END statement signals
                           the end of multiprocessing work.  On exit, only
                           one processor returns to the calling routine.  A
                           function may not be microtasked, though it may, of
                           course, be rewritten as a subroutine and then
                           microtasked.  FMP microtasking directives
                           described in the following subsections provide a
                           way for you to specify subroutine-level


     SG-3074 5.0               Cray Research, Inc.                         89


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           parallelism in your programs.

                           The CMIC$ MICRO directive is an optional directive
                           and is not needed to use microtasking effectively.
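
                            A minimal sketch of a microtasked subroutine
                            follows; the routine name, arguments, and loop
                            are hypothetical:

                                 CMIC$ MICRO
                                       SUBROUTINE WORK(A,B,N)
                                       DIMENSION A(N), B(N)
                                 CMIC$ DO GLOBAL
                                       DO 10 I = 1,N
                                       A(I) = A(I)*B(I)
                                 10    CONTINUE
                                       RETURN
                                       END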

     CMIC$ DO GLOBAL
     4.3.2.2.2
                           The CMIC$ DO GLOBAL directive marks the beginning
                           of a control structure in which the iterations of
                           a DO loop comprise all of the processes.  DO
                           GLOBAL is probably the most commonly used control
                           structure.

                           The statement following the CMIC$ DO GLOBAL
                           directive is a DO statement.  The end of the
                           control structure is marked by the statement
                           containing the label referred to in the DO
                           statement; the DO GLOBAL control structure does
                           not require a preprocessor directive to close it.

                           DO GLOBAL directives may be used to create control
                           structures within a DO loop, but the path through
                           such control structures cannot be altered inside
                           the microtasked subroutine.  A DO GLOBAL statement
                           may be nested within a DO loop, but only one DO
                           GLOBAL can be executing at a time.

                           The loop variable for loops using DO GLOBAL must
                           be of type integer and the initial, final, and
                           step values must be integer expressions.

                           Example:

                                CMIC$ DO GLOBAL
                                      DO 20 J= 1, 1000
                                      DO 10 I= 1, 1000
                                      A(I,J)= X(I,J) * Y(I,J)
                                10    CONTINUE
                                20    CONTINUE


                           Three variants of the DO GLOBAL directive are
                           supplied to help you better balance microtasking
                           and vectorization.  These variants are described
                           next.

     CMIC$ DO GLOBAL LONG
     VECTOR
     4.3.2.2.3
                           The CMIC$ DO GLOBAL LONG VECTOR directive marks
                           the beginning of a control structure that permits
                           both vectorization and microtasking on an
                           innermost DO loop.  This structure divides a loop
                           into processes of 64 iterations each, microtasking
                           the "chunks" and vectorizing the iterations.  (One


     90                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           remainder chunk will have 64 or fewer iterations.)

                           To provide a speedup, the loop should be longer
                           than 64 iterations, and it should be vectorizable.
                           Two associated directives (DO GLOBAL BY and DO
                           GLOBAL FOR) let you change the iteration chunk
                           size, also known as the chunking factor.

                           Example:

                                CMIC$ DO GLOBAL LONG VECTOR
                                      DO 100 K = 1, 4096
                                      A(K) = B(K) * C(K)
                                 100  CONTINUE


                           This example divides the original loop into an
                           inner and outer loop, each consisting of 64
                           iterations.

     CMIC$ DO GLOBAL BY
     expression
     4.3.2.2.4
                           The CMIC$ DO GLOBAL BY expression directive is the
                           same as the DO GLOBAL LONG VECTOR directive except
                           that the iterations are divided into chunks of
                           size expression.  It divides a DO loop into an
                           inner loop, with expression iterations, and an
                           outer loop.  The number of iterations in the outer
                           loop is approximately the number of iterations in
                           the original DO loop divided by expression.  The
                           inner loop may be vectorized and the outer loop
                           microtasked.  Setting expression to a multiple of
                           64 maximizes the vectorization performance.

                           You must ensure that the Fortran expression
                           evaluates to an integer greater than 0.  The
                           expression is evaluated at run time and may change
                           each time the DO loop is executed, but it cannot
                           change during the execution of a DO GLOBAL.

                           Example:

                                CMIC$ DO GLOBAL BY 1024
                                      DO 100 K = 1, 4096
                                      A(K) = B(K) * C(K)
                                100   CONTINUE


                           In this example, the 4096 iterations of the DO
                           loop are divided into four pieces consisting of
                           1024 iterations each.






     SG-3074 5.0               Cray Research, Inc.                         91


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     CMIC$ DO GLOBAL FOR
     expression
     4.3.2.2.5
                           The CMIC$ DO GLOBAL FOR expression directive is
                           the same as the DO GLOBAL LONG VECTOR directive,
                           except that the iterations are divided into
                           expression number of chunks.

                           It divides a DO loop into an outer loop, with
                           expression iterations, and an inner loop.  The
                           number of iterations in the inner loop is
                           approximately the number of iterations in the
                           original DO loop divided by expression.  The inner
                           loop is then vectorized and the outer loop
                           microtasked.

                           Example:

                                CMIC$ DO GLOBAL FOR 4
                                      DO 100 K = 1, 4096
                                      A(K) = B(K) * C(K)
                                 100  CONTINUE


                           This example specifies the number of iterations
                           for the generated outer loop to be 4.  The number
                           of iterations for the inner loop is then 1024.
                           The effect is the same as for the DO GLOBAL BY
                           directive in the previous example.  The only
                           difference is whether you want to specify the
                           chunk size or the number of chunks.

     CMIC$ GETCPUS
     4.3.2.2.6
                           The CMIC$ GETCPUS directive has the following
                           syntax:

                           -------------------------------------------------
                           CMIC$ GETCPUS expression
                           -------------------------------------------------

                           This optional directive may appear anywhere in the
                           program outside a control structure.  It specifies
                           the maximum number of processors permitted to work
                           on a microtasked or autotasked program.
                           expression is an integer expression that defines
                           the number of physical CPUs that will be used for
                           the program.  The default value for expression is
                           the maximum number of physical CPUs available for
                           your program.

                           The NCPUS environment variable (if set) defines
                           the maximum number of physical CPUs available for
                           your program.  If NCPUS is not set, the default
                           number of CPUs used for the program is the number
                           of physical CPUs in the machine.
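
                            For example, the following hypothetical fragment
                            limits the program to at most four processors for
                            the microtasked work that follows:

                                 CMIC$ GETCPUS 4
                                        :
                                 CMIC$ DO GLOBAL
                                       DO 10 I = 1,N
                                       A(I) = B(I)*C(I)
                                 10    CONTINUE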


     92                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


     CMIC$ PROCESS,
     CMIC$ ALSO PROCESS,
     and CMIC$ END PROCESS
     4.3.2.2.7
                           The CMIC$ PROCESS directive marks the beginning of
                           a control structure and signals that the code
                           following it is a single process.

                           The CMIC$ ALSO PROCESS directive marks the
                           beginning of a process other than the first
                           process inside a control structure and the end of
                           the previous process.  A PROCESS directive
                           followed by any number of ALSO PROCESS directives
                           implements a classic fork-and-join multitasking
                           structure.

                           The CMIC$ END PROCESS directive marks the end of a
                           process and the end of a control structure.
                           PROCESS and END PROCESS directives can also be
                           used to ensure single-processor execution of a
                           portion of code.  The single-threaded section
                           contains a single CMIC$ PROCESS directive (that
                           is, the section does not contain an ALSO PROCESS
                           directive).

                           Example:

                                CMIC$ PROCESS
                                      DO 10 I= 1, 1000
                                      A(I) = X(I) * Y(I)
                                10    CONTINUE

                                CMIC$ ALSO PROCESS
                                      DO 20 I= 1, 1000
                                      B(I) = X(I) * Z(I)
                                20    CONTINUE

                                CMIC$ END PROCESS


                            In this example, two processors may execute the
                            DO 10 loop and the DO 20 loop simultaneously (one
                            loop each).  Both portions must be completed
                            before execution of the remainder of the program
                            continues.

     CMIC$ RELCPUS
     4.3.2.2.8
                           The CMIC$ RELCPUS directive specifies that the
                           processors acquired for microtasking should be
                           released back to the system.  It is the reverse of
                           the GETCPUS directive.  This directive should be
                           used when no microtasking is to be done for a long
                           period of time or when the program is preparing to


     SG-3074 5.0               Cray Research, Inc.                         93


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           terminate.

                           This directive is optional; if it is not used, all
                           processors acquired by the GETCPUS directive are
                           held until the program terminates.  When a STOP,
                           END, or CALL EXIT statement is encountered, the
                           microtasking slave processors are released
                           automatically before the job step is terminated.
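
                            A hypothetical sketch follows; the work between
                            the two directives is omitted:

                                 CMIC$ GETCPUS 4
                                        :
                                 C     Microtasked work is performed here.
                                        :
                                 CMIC$ RELCPUS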

     CMIC$ STOP ALL
     PROCESS
     4.3.2.2.9
                           The CMIC$ STOP ALL PROCESS directive provides a
                           way to exit from both PROCESS and DO GLOBAL
                           control structures without performing all of the
                           processes or iterations.  This directive forces
                           all processors to complete work in a process if
                           they are in one and then to accept no more work,
                           closing the control structure.

                           The CMIC$ STOP ALL PROCESS directive must be
                           followed by a branch statement.  Processors resume
                           work at the target of this branch statement.  For
                           example, you may want to end processing in a DO
                           loop when a certain solution is found.  If the
                           solution is never found, the loop is executed some
                           maximum number of iterations.  STOP ALL PROCESS
                           provides this graceful exit.  Typically, the
                           program will appear as in the following example.

                           Example:

                                CMIC$ DO GLOBAL
                                      DO 1 I = 1,10000
                                      . . .
                                      IF end-condition THEN
                                CMIC$ STOP ALL PROCESS
                                      GO TO 2
                                      ENDIF
                                      . . .
                                1     CONTINUE
                                2     CONTINUE


                           The previous section of code is portable.  You
                           must ensure that work is not done between the
                           statement that ends the DO loop and the statement
                           at which processing resumes.  You must also ensure
                           that the statement number to which the single-
                           processor version jumps and the one to which the
                           microtasked version jumps (as a result of the STOP
                           ALL PROCESS directive) are the same.  The
                           preprocessor does not check for any errors you
                           might make in using the STOP ALL PROCESS
                           directive.


      94                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     FMP data scope rules
     4.3.2.3
                           When FMP generates code to handle a CMIC$ DO ALL
                           or a CMIC$ PARALLEL statement, all the variables
                           and arrays in the region must be assigned either
                           shared or private status or the AUTOSCOPE
                           parameter must be specified.  A shared variable or
                           array is one that all the processors use.  A
                           private variable or array is one for which each of
                            the processors has its own copy.

                            If the AUTOSCOPE parameter is specified, FMP uses
                            the following rules to determine shared or
                            private status.

                            A variable or array is shared if any of the
                            following is true:

                            * It appears in a SHARED statement.

                            * It is a read-only variable or a read-only
                              array.

                            * It is an array indexed by the loop index.

                            * It is a read-then-write variable or a
                              read-then-write array.

                            A variable or array is private if any of the
                            following is true:

                            * It appears in a PRIVATE statement.

                            * It is a write-then-read variable or a
                              write-then-read array.

                           In both cases, the verbs SHARED and PRIVATE
                           override the default determination.

                           When you specify the AUTOSCOPE parameter, FMP can
                           scope data incorrectly.  Sometimes the scope of
                           data cannot be determined at compile time, because
                           of conditional blocks of code within a loop.  FMP
                           assumes the flow of all loops to be top to bottom.
                           Also, FMP cannot correctly determine the scope of
                           data passed as arguments to subroutines from
                           within a parallel region.  Therefore, if you
                           insert an Autotasking directive around a loop
                           containing subroutine calls, conditional branches,
                           or code blocks, do not use the AUTOSCOPE
                           parameter, but scope all data explicitly.






     SG-3074 5.0               Cray Research, Inc.                         95


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                            These data scoping problems apply only when you
                            use the AUTOSCOPE parameter on user-inserted
                            directives.  They do not occur with directives
                            that are inserted automatically by FPP.

     Read-only variables
     4.3.2.3.1
                           The following examples show read-only variables.

                                CMIC$ DOALL PRIVATE(I) SHARED(N1,N2,A)
                                      DO 10 I=N1, N2
                                      ...=A
                                  10  CONTINUE


                           A is a shared variable because it is a read-only
                           variable.  All processors share the same location
                           for A.

                           CMIC$ DOALL SHARED(N1,N2,M1,M2,V) PRIVATE(I,J)
                                 DO 10 I=N1, N2
                                 DO 10 J=M1, M2
                                 ... = V(J)
                             10  CONTINUE


                           V is shared because it is a read-only array.  M1
                           and M2 are also shared because they are read-only
                           variables.  I and J are written and then read, so
                           they are private variables.

     Array indexed by loop
     index
     4.3.2.3.2
                           The following example shows an array indexed by
                           the loop index:

                                CMIC$ DOALL SHARED(N1,N2,V,U,J) PRIVATE(I,T)
                                      DO 10 I=N1, N2
                                      T=V(I)
                                      U(I,J)=T
                                  10  CONTINUE


                           U and V are shared arrays because they are indexed
                           by the loop index.  All processors share the same
                           location for V and U.  T is written and then read,
                           so it is a private variable.  J is shared because
                           it is a read-only variable.

     Read-then-write
     variables
     4.3.2.3.3


     96                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           The following example shows read-then-write
                           variables:

                                      SUM=0.0
                                CMIC$ DOALL SHARED(N1,N2,V,SUM) PRIVATE(I,T)
                                      DO 10 I = N1, N2
                                      T = V(I)
                                CMIC$ GUARD
                                      SUM = SUM + T
                                CMIC$ END GUARD
                                   10 CONTINUE


                           SUM is a shared variable because it is read before
                           it is written.  Special care is needed in writing
                           into a shared variable.

     Write-then-read
     variables and arrays
     4.3.2.3.4
                           The following example shows write-then-read
                           variables and arrays:

                           CMIC$ DOALL SHARED(N1,N2,M1,M2) PRIVATE(I,J,V)
                                 DO 10 I = N1, N2
                                 DO 10 J = M1, M2
                                 V(J) = ...
                                 ... = V(J)
                              10 CONTINUE


                           V is written to and then read.  It must be a
                           private array.

     User-added scope
     required
     4.3.2.3.5
                            The automatic determination also misses some
                            cases.  The flow of the loop is assumed to be top
                            to bottom.  If the code is not in this order, FMP
                            can determine the scope incorrectly.  In all
                            cases, the SHARED/PRIVATE verbs override the
                            determination by FMP.

                           The examples that follow show situations that
                           require you to add scope verbs.

                           In the following example, the flow of control is
                           confused:








     SG-3074 5.0               Cray Research, Inc.                         97


     Concepts and Directives        CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                                C     This is wrong
                                CMIC$ DO ALL AUTOSCOPE
                                      DO 10 I = N1, N2
                                      GO TO 3
                                    2 V(I) = T
                                      GO TO 10
                                    3 T = V(I)
                                      GO TO 2
                                   10 CONTINUE


                           FMP determines that T is read before it is
                           written.  It determines incorrectly that T is
                           shared.  A correction to this code is as follows:

                                C     This is correct
                                CMIC$ DOALL PRIVATE(T) AUTOSCOPE
                                      DO 10 I = N1, N2
                                      GO TO 3
                                    2 V(I) = T
                                      GO TO 10
                                    3 T = V(I)
                                      GO TO 2
                                   10 CONTINUE


                           The private declaration overrides the read-
                           before-write rule.

                           The following example shows use of a subroutine
                           call:

                                CMIC$ DOALL AUTOSCOPE
                                      DO 10 I = N1, N2
                                      CALL MMP(A(I), B, C)
                                   10 CONTINUE


                           It is not possible to determine whether A(I), B,
                           or C is read or written.  FMP assigns A as shared
                           because it is indexed by the control variable.
                           FMP assumes that B and C are read; therefore, it
                           designates them as shared variables.  FMP prints a
                           message stating that such variables as A, B, and C
                           "require a private/shared declaration."  However,
                           FMP treats them as shared.  If you want the scope
                           of arguments to such subroutine calls to be
                           private, you must explicitly declare them as
                           private.  To be certain of correct scope, all
                           variables or arrays that occur in a function or
                           subroutine call should be specified as shared or
                           private.






     98                        Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide        Concepts and Directives
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     Compiler directives
     4.3.3
                           Besides the FPP and FMP directives already
                           described, FPP recognizes CFT77 directives, which
                           are briefly described in this subsection.  For
                           more information, see the CF77 Compiling System,
                           Volume 1:  Fortran Reference Manual, publication
                           SR-3071.

                           The FPP dependence analyzer accepts the following
                           CFT77 directives (preceded by CDIR$):  IVDEP,
                           NOVECTOR, VECTOR, SHORTLOOP, NEXTSCALAR, and
                           VFUNCTION.  Table 6 shows the correspondence
                           between CFT77 and FPP directives.

                                 Table 6.  CFT77 versus FPP directives

                           __________________________________________________
                           CFT77                            FPP treats as:
                           __________________________________________________
                           CDIR$ IVDEP                      CFPP$ NODEPCHK L
                           CDIR$ NOVECTOR                   CFPP$ NOVECTOR R
                           CDIR$ VECTOR                     CFPP$ VECTOR R
                           CDIR$ SHORTLOOP                  CFPP$ NOINNER L
                           CDIR$ NEXTSCALAR                 CFPP$ NOVECTOR L
                           CDIR$ VFUNCTION                  Recognized by FPP
                           __________________________________________________
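
                            For example, in the following hypothetical loop
                            (the array and index names are illustrative),
                            CDIR$ IVDEP asserts that the indirect addressing
                            hides no dependency; FPP honors the assertion as
                            if CFPP$ NODEPCHK L had been specified for the
                            loop:

                                 CDIR$ IVDEP
                                       DO 10 I = 1,N
                                       A(IX(I)) = A(IX(I))+B(I)
                                 10    CONTINUE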


                           Other Cray Fortran directives are treated as
                           comments by FPP.


























     SG-3074 5.0               Cray Research, Inc.                         99



                                            FPP Data Dependency Analysis  [5]
     ########################################################################







                           For certain loops, parallel or vector execution
                           would result (or could result) in incorrect
                           answers.  A loop in which results from one loop
                           pass feed back into a future pass of the same loop
                           is said to have a data dependency conflict and may
                           not be optimized completely.  (Such a loop is also
                           said to be recursive or to contain recurrences.)
                           In these cases, FPP detects the problem, reports
                           it to the user, and leaves the loop in its
                           original form.

                           You can assert that there is no recursion by using
                           a directive or switch, both of which are discussed
                           later in this section.  Indirect addressing of
                           arrays can create hidden dependency conflicts.
                           Unless you make such an assertion (that no
                           recursion exists), the following conditions apply
                           to loops containing indirectly addressed arrays:

                           * A gathered array must not also appear on the
                             left-hand side in the loop.

                           * A scattered array must not appear anywhere else
                             in the loop.
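
                            As a hypothetical illustration of these two
                            conditions (the array and index names are not
                            taken from the manual's examples), the first loop
                            below gathers through the index array IX and also
                            stores into the gathered array, so FPP must
                            assume a possible dependency; the second loop
                            gathers only, so it can be optimized:

                                 C     A is gathered and also appears on the
                                 C     left-hand side; the loop is rejected.
                                       DO 10 I = 1,N
                                       A(IX(I)) = A(IX(I))+B(I)
                                 10    CONTINUE

                                 C     A is gathered only; the loop meets the
                                 C     conditions and can be optimized.
                                       DO 20 I = 1,N
                                       C(I) = A(IX(I))+B(I)
                                 20    CONTINUE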

                           In certain cases, FPP can determine that the
                           problem is limited to a subset of the operations
                           in the loop, and it will cut the loop into
                           subloops that can be optimized and those that
                           cannot be optimized.  The "DP" field in the loop
                           summary measures how much of the loop is
                           dependent, that is, left unoptimized.  If FPP
                           determines that more than a certain percentage of
                           a loop is dependent, it will not optimize the loop
                           at all.  (The exact percentage depends on various
                           other factors.)   Partial optimization of this
                           kind is done only for vectorized loops, not for
                           autotasked loops.

                           FPP also examines EQUIVALENCE statements to see
                           whether they may be masking recursion, and it
                           suppresses any potentially unsafe transformations.

                           FPP data dependency analysis is summarized in
                           Figure 21, page 102.  The following subsections
                           expand the discussion of data dependency analysis
                           and provide examples of directives you can use to
                           communicate information to FPP.



     SG-3074 5.0               Cray Research, Inc.                        101


     FPP Data Dependency Analysis   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     ________________________________________________________________________

                            Data Dependency Analysis


       A loop in which results from one loop pass feed back into a future
       pass of the same loop is said to have a data dependency conflict
       and may not be completely optimized.  (Such a loop is also said to
       be recursive or to contain recurrences.)


       Indirect addressing of arrays can create hidden dependency
       conflicts.  Therefore, unless directives are inserted to assert
       otherwise, the following conditions must be true for optimization
       of loops containing indirectly addressed arrays:


       * A gathered array must not also appear on the left-hand side in
         the loop.


       * A scattered array must not appear anywhere else in the loop.


       In these cases, FPP detects the problem, reports it to you, and
       leaves the loop in its original form.  You can assert that there is
       no recursion by using a directive or switch.
     ________________________________________________________________________

                     Figure 21.  FPP data dependency analysis




     Data dependency
     examples
     5.1
                           Figure 22, page 104, demonstrates the concept of
                           data dependency.  Four similar loops are
                           displayed.  For each loop, the sequences of
                           instructions that would be executed in scalar mode
                           (one at a time) and in vector mode (whole arrays
                           at a time) are also shown.  Lowercase variables
                           (such as "a") stand for new values set in the
                           current loop, and uppercase variables (such as
                           "A") stand for old values that were set before the
                           loop started.











     102                       Cray Research, Inc.                SG-3074 5.0


     CF77 Volume 4:  Parallel Processing Guide   FPP Data Dependency Analysis
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                           It is easy to see that the scalar and vector
                           sequences for Figure 22, part 11.1A, are not the
                           same; the vector version uses only old values of
                           A, and the scalar version uses new ones.  FPP
                           detects that this loop is not safe to optimize,
                           puts out a data dependency conflict message, and
                           leaves the loop in its original form (the loop is
                           "rejected").  In contrast, the scalar and vector
                           sequences for Figure 22, part 11.1B, are
                           identical; no feedback of results from one loop
                           pass to another is occurring here.  FPP recognizes
                           that this loop is safe to optimize and does so.

                           The situation is less clear in Figure 22, part
                           11.1C; here the use of the variable "K" in A's
                           subscript makes the proper scalar sequence
                           impossible to determine at compile time (with some
                           exceptions; see "Ambiguous subscript resolution,"
                            page 108).  If K is 1, the loop functions like the
                           recursive loop in part 11.1A; if K is -1, the loop
                           is safe to optimize, as was 11.1B.  When it is not
                           possible for FPP to tell whether a loop is
                           recursive, the loop is said to have ambiguous
                           subscripting.  Often the user knows that a loop
                           (or perhaps all the loops in a routine or program)
                           is not recursive, even though FPP cannot tell, as
                           in 11.1C.  In these cases, you can direct FPP to
                           ignore potential recursions.  Use the -Wd"-d d"
                           option to cf77, the -d d option to fpp, or the
                           NODEPCHK directive.  Examples are in "NODEPCHK
                           (declaring nonrecursion)," page 111.
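
                            For example (the source file name is
                            hypothetical), either of the following command
                            lines directs FPP to ignore potential recursions
                            throughout the file:

                                 cf77 -Wd"-d d" prog.f
                                 fpp -d d prog.f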

                           Figure 22, part 11.1D, shows the same loop as in
                           part 11.1A, but with a DO increment of 2 rather
                           than 1.  FPP detects that the loop is now not
                            recursive, because no results feed back into the
                           calculation.  These four similar examples point
                           out the sensitivity of data dependency analysis to
                           offset and stride values of arrays that appear on
                           both sides of the equal sign within a loop.

                           Data dependency analysis extends to more than just
                           single-line loops, of course; in the following
                           loop, the reference to A at the top of the loop
                           conflicts with the store into A at the bottom.
                           FPP prints a message to this effect and inserts a
                           directive to inhibit vectorization explicitly.














     SG-3074 5.0               Cray Research, Inc.                        103


     FPP Data Dependency Analysis   CF77 Volume 4:  Parallel Processing Guide
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

              a: new value of A           A: old value of A
       ----------------------------11.1A---------------------------------
       ORIGINAL LOOP:                     VECTOR SEQUENCE CORRECT?
           DO 71 I = 2,N                  No. We are not using updated
        71 A(I+1) = A(I)*B(I)+C(I)        values of A.

       SCALAR SEQUENCE:                   VECTOR SEQUENCE:
           a(3) = A(2)*B(2)+C(2)          a(3) = A(2)*B(2)+C(2)
           a(4) = a(3)*B(3)+C(3)          a(4) = A(3)*B(3)+C(3)
           a(5) = a(4)*B(4)+C(4)          a(5) = A(4)*B(4)+C(4)
           a(6) = a(5)*B(5)+C(5)          a(6) = A(5)*B(5)+C(5)
                      :                              :
       ----------------------------11.1B---------------------------------
       ORIGINAL LOOP:                     VECTOR SEQUENCE CORRECT?
           DO 72 I = 2,N                  Yes. Sequence is identical.
        72 A(I-1) = A(I)*B(I)+C(I)

       SCALAR SEQUENCE:                   VECTOR SEQUENCE:
           a(1) = A(2)*B(2)+C(2)          a(1) = A(2)*B(2)+C(2)
           a(2) = A(3)*B(3)+C(3)          a(2) = A(3)*B(3)+C(3)
           a(3) = A(4)*B(4)+C(4)          a(3) = A(4)*B(4)+C(4)
           a(4) = A(5)*B(5)+C(5)          a(4) = A(5)*B(5)+C(5)
                      :                              :
       ----------------------------11.1C---------------------------------
       ORIGINAL LOOP:                     VECTOR SEQUENCE CORRECT?
           DO 73 I = 2,N                  ? Depends on K. Here, if 0