Global Arrays User Guide

  1. Global Arrays Programming Model
  2. Basic Operations
  3. Status of Current Implementation
  4. How Your Program Should Look
  5. Message-Passing in GA Programs
  6. C Language Interface

Global Arrays Programming Model

Globally addressable arrays have been developed to simplify writing portable scientific software for both shared and distributed memory computers. Programming convenience, code extensibility and maintainability are gained by adopting the shared memory programming model.

From the user perspective, a global array can be used as if it were stored in shared memory. Details of the data distribution, addressing, and communication are encapsulated in the global array objects. However, information on the actual data distribution and locality can be obtained and exploited whenever data locality is important.

Currently, support is limited to two-dimensional double precision, double complex, or integer arrays with a block distribution, with at most one block per array per processor.

GA memory consistency is guaranteed only for:

  1. Multiple read operations (as the data does not change)
  2. Multiple accumulate operations (as addition is commutative)
  3. Multiple disjoint put operations (as there is only one writer for each element)

Everything else is the application's responsibility, usually handled by appropriate insertion of ga_sync (barrier) calls.
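
For example, a minimal sketch in Fortran of separating a write phase from a read phase with ga_sync, assuming g_a is a valid handle to an n x n double precision array and each process owns the disjoint patch (ilo:ihi, jlo:jhi):

  ! every process writes only its own disjoint patch
  call ga_put(g_a, ilo, ihi, jlo, jhi, buf, ld)
  ! barrier: all writes complete before anyone reads another's data
  call ga_sync()
  call ga_get(g_a, 1, n, 1, n, full, n)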

Basic Operations

Operations that are globally collective

These operations must be invoked simultaneously by all processes, as if in SIMD mode.

Collective elementary operations

ga_initialize initialize global array internal structures
ga_initialize_ltd initialize global arrays and set memory usage limits
ga_create create an array
ga_create_irreg create an array with irregular distribution
ga_duplicate create an array following a reference array
ga_destroy destroy an array
ga_terminate destroy all existing arrays and delete internal data structures
ga_sync synchronize processes (a barrier)
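
For illustration, a hedged sketch of the array lifecycle in Fortran (MT_F_DBL is defined in macommon.h; chunk arguments of 0 are assumed to request the default even distribution):

  integer g_a
  logical status, ga_create, ga_destroy
  status = ga_create(MT_F_DBL, 1000, 1000, 'A', 0, 0, g_a)
  if (.not. status) call ga_error('ga_create failed', 0)
  call ga_zero(g_a)                   ! collective initialization
  ! ... use the array ...
  status = ga_destroy(g_a)            ! collective deallocation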

Linear algebra operations

ga_zero zero an array
ga_ddot dot product of two arrays (doubles only)
ga_zdot dot product of two arrays (double complex only)
ga_scale scale the elements in an array by a constant
ga_add scale and add two arrays to put result in a third (may overwrite one of the other two)
ga_copy copy one array into another
ga_dgemm BLAS-like matrix multiply
ga_ddot_patch dot product of two patches (doubles only)
ga_zdot_patch dot product of two patches (double complex only)
ga_scale_patch scale the elements in an array by a constant (patch version)
ga_add_patch scale and add two arrays to put result in a third (patch version)
ga_matmul_patch matrix multiply (patch version)
ga_diag real symmetric generalized eigensolver (sequential version also exists) *
ga_diag_reuse a version of ga_diag for repeated use *
ga_diag_std standard real symmetric eigensolver (sequential version also exists) *
ga_symmetrize symmetrize a matrix
ga_transpose transpose a matrix
ga_lu_solve solve system of linear equations based on LU factorization (sequential version also exists) **
ga_llt_solve solve system of linear equations with SPD coefficient matrix based on Cholesky factorization **
ga_solve solve system of linear equations trying first Cholesky and then LU factorization **
ga_spd_invert inverts an SPD matrix **
ga_cholesky performs Cholesky factorization of an SPD matrix **
ga_copy_patch copy data from a patch of one global array into another array
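
For illustration, a hedged sketch combining a few of these operations (g_a, g_b, g_c are assumed to be handles to conforming n x n double precision arrays; the argument order follows the BLAS-like Fortran interface):

  double precision alpha, beta, x, ga_ddot
  alpha = 1d0
  beta  = 0d0
  call ga_dgemm('N', 'N', n, n, n, alpha, g_a, g_b, beta, g_c)   ! c = a*b
  call ga_add(alpha, g_a, beta, g_b, g_c)                        ! c = 1*a + 0*b, overwrites c
  x = ga_ddot(g_a, g_c)                                          ! global dot product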

Other collective utility operations

ga_fill_patch fill a patch of an array with a value
ga_summarize print information about already allocated arrays
ga_print_patch print a patch of an array to the screen
ga_print print an entire array to the screen
ga_compare_distr compare distributions of two global arrays

Operations to support portability between implementations

ga_dgop reduce operation (double precision)
ga_igop reduce operation (integer)
ga_brdcst broadcast operation
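
These routines follow the TCGMSG argument conventions; a hedged sketch (the first argument is assumed to be a message-type tag as in TCGMSG, and the ga_brdcst buffer length is assumed to be in bytes):

  double precision x(100)
  integer n
  n = 100
  call ga_dgop(31, x, n, '+')        ! elementwise global sum of x over all processes
  call ga_brdcst(32, x, 8*n, 0)      ! broadcast 8*n bytes from process 0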

Non-Collective Operations

These operations may be invoked by any process in task-parallel MIMD style.

Elementary operations

ga_get read from a patch of an array
ga_put write to a patch of an array
ga_acc accumulate into a patch of an array (double precision or complex only)
ga_scatter scatter elements into an array
ga_gather gather elements from an array
ga_read_inc atomically read and increment the value of a single integer array element
ga_locate determine which process 'holds' an array element
ga_locate_region determine which process 'holds' an array section
ga_distribution find coordinates of the array patch that is 'held' by a processor
ga_error print error message and terminate the program
ga_init_fence initiate tracing of completion status for data movement operations
ga_fence blocks until the initiated communication completes
ga_nodeid find requesting compute process ID
ga_nnodes find number of compute processes
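
For example, ga_read_inc supports dynamic load balancing through a shared task counter; a hedged sketch (assumes g_counter is a 1x1 integer global array zeroed collectively beforehand, and g_a is a 100 x ntask double precision array):

  integer itask, ntask, ga_read_inc
  double precision buf(100)
  ntask = 10
  ! atomically fetch-and-increment the shared counter to claim a task
10 itask = ga_read_inc(g_counter, 1, 1, 1)
  if (itask .lt. ntask) then
     call ga_get(g_a, 1, 100, itask+1, itask+1, buf, 100)   ! fetch this task's column
     ! ... compute on buf ...
     call ga_put(g_a, 1, 100, itask+1, itask+1, buf, 100)   ! write results back
     goto 10
  endif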

Operations intended to support writing new functions

ga_access access 'local' elements of global array
ga_release relinquish access to 'local' data
ga_release_update relinquish access after data were updated
ga_check_handle verify that a GA handle is valid
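
For example, a hedged sketch of zeroing the locally held patch in place (assumes a double precision array; dbl_mb and the 1-based Fortran index convention come from the MA package):

  integer me, ilo, ihi, jlo, jhi, index, ld, i, j, ga_nodeid
  me = ga_nodeid()
  call ga_distribution(g_a, me, ilo, ihi, jlo, jhi)
  if (ilo.gt.0 .and. jlo.gt.0) then                  ! this process holds a patch
     call ga_access(g_a, ilo, ihi, jlo, jhi, index, ld)
     do j = 0, jhi - jlo
        do i = 0, ihi - ilo
           dbl_mb(index + j*ld + i) = 0d0            ! element (ilo+i, jlo+j)
        enddo
     enddo
     call ga_release_update(g_a, ilo, ihi, jlo, jhi) ! local data were changed
  endif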

Other utility operations

ga_inquire find the type and dimensions of the array
ga_inquire_name find the name of the array
ga_inquire_memory find the amount of memory in active arrays
ga_memory_avail find the amount of memory left for GA
ga_uses_ma determine whether memory in global arrays comes from MA (memory allocator)
ga_memory_limited determine whether limits were set for memory usage in global arrays
ga_proc_topology find block coordinates for the array section held by a processor
ga_list_nodeid return message-passing process IDs for GA processes
ga_print_stats print miscellaneous execution statistics to the screen
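
For instance, a hedged sketch of querying an array through these routines (Fortran interface assumed):

  integer type, dim1, dim2
  character*32 array_name
  call ga_inquire(g_a, type, dim1, dim2)     ! data type (e.g. MT_F_DBL) and dimensions
  call ga_inquire_name(g_a, array_name)      ! name given at creation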

Status of Current Implementation (version 2.2)

Supported Platforms

There are two classes of platforms that GA works on:

Network of shared-memory machines

The implementation uses a data-server/compute-node model in which each machine runs an additional process (a data server) that services requests for data. Therefore, of the processes started by the user on each machine, all but one are compute processes and the remaining one is the data server. Data servers use shared memory to access data on the local machine.

For example, the following TCGMSG .p file describes the process configuration for one 4-processor and one 8-processor workstation:


         d3h325 coho 4 /usr/people/d3h325/g/global/testing/test.x /tmp
         d3h325 bohr 8 /usr/people/d3h325/g/global/testing/test.x /tmp

defines 10 compute processes and 2 data-servers (one on coho and one on bohr).

Single (possibly multiprocessor) machines with shared or globally addressable memory, such as the KSR-2, workstations, and even the Cray T3D, are in fact a special case of 1). The System V shared memory implementation uses TCGMSG or MPI to fork all the processes and to implement ga_brdcst and ga_dgop. There are no data-server processes; therefore, all the specified processes/processors are used for computation.

Message-passing distributed-memory MPP architectures

The implementation uses interrupt-driven communication on the Intel iPSC/XXX, Delta, and Paragon with the NX message-passing library, and on the IBM SP with the MPL message-passing library. There are no data-server processes: all processes execute application code.

Selection of message-passing library

Starting with version 2.1, Global Arrays works with either the TCGMSG or the MPI message-passing library, so GA applications can use either one. Selection of the message-passing library takes place when GA is built. TCGMSG is included with the GA distribution package and is selected by default.

There are three possible configurations for running GA with the message-passing libraries:

  1. with TCGMSG
  2. with the TCGMSG emulation library TCGMSG-MPI, which implements the functionality of TCGMSG using MPI. In this mode, the message-passing library is initialized using the TCGMSG PBEGIN(F) call, which internally references MPI_Init. Please note that TCGMSG-MPI might 'steal' one process from the application to implement the TCGMSG NXTVAL operation. To enable this mode, define the environment variable USE_MPI.
  3. directly with MPI. In this mode, the GA program should contain MPI initialization calls instead of PBEGIN(F).

For the MPI versions, the optional environment variables MPI_LIB and MPI_INCLUDE are used to point to the MPI library and include directories if they are not in the standard system location(s). GA programs are started with whatever mechanism other MPI programs use on the given platform.

Interface to ScaLAPACK

Interface routines to ScaLAPACK are available only when GA is built with MPI and, of course, with ScaLAPACK. The user is required to define the environment variable USE_SCALAPACK and to specify the location of the ScaLAPACK and related libraries in the variable SCALAPACK.

Since there are certain interdependencies between blacs and blacsF77cinit, some systems might require specifying -lblacs twice on the link command to fix the unresolved external symbols from these libraries.

On machines that do not have a 'native' implementation of the TCGMSG-MPI nxtval operation (see the README in the tcgmsg-mpi directory), it is necessary to modify the BLACS file blacs_pinfo_.c to substitute MPI_COMM_WORLD with the communicator returned by the ga_mpi_communicator routine.

Compiler Flags

Please refer to the compiler flags in the file Makefile.h to make sure that the Fortran and C compiler flags are consistent with the flags used to compile your application. This may be critical when Fortran compiler flags are used to change the default length of the integer datatype.

How Your Program Should Look

When GA runs with TCGMSG or TCGMSG-MPI



  call pbeginf()                  ! start TCGMSG
  status = ma_init(..)            ! start memory allocator if required
  call ga_initialize()            ! start global arrays

  .... do work

  call ga_terminate()             ! tidy up global arrays
  call pend()                     ! tidy up tcgmsg
  stop                            ! exit program

When GA runs with MPI


  call MPI_Init(ierr)             ! start MPI
  status = ma_init(..)            ! start memory allocator if required
  call ga_initialize()            ! start global arrays

  .... do work

  call ga_terminate()             ! tidy up global arrays
  call MPI_Finalize(ierr)         ! tidy up MPI
  stop                            ! exit program


The ma_init call looks like:


  status = ma_init(type, stack_size, heap_size)

and it basically just obtains stack_size + heap_size elements of size type from the operating system. On distributed-memory platforms, if you allocate a global array of N elements, you need to ensure that the call to ma_init provides at least N/(number of GA processes) elements per process.
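
For example, a hedged sketch of sizing ma_init for a 1000x1000 double precision global array on 8 processes (this assumes GA memory is taken from the MA heap, cf. ga_uses_ma; the stack size and safety margin are arbitrary example values):

  integer n, nproc
  logical status, ma_init
  parameter (n = 1000, nproc = 8)
  ! heap covers this process's share of the array, n*n/nproc, plus a margin
  status = ma_init(MT_F_DBL, 10000, n*n/nproc + 10000)
  if (.not. status) call ga_error('ma_init failed', 0)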

Message-Passing in GA Programs

There are two issues to be aware of when using message-passing in GA programs:

  1. Process count and numbering. In the network environment, some message-passing processes are hidden from the application when GA is initialized. As a result, ga_nodeid() and ga_nnodes() may return values inconsistent with the process rank/count values obtained from the message-passing library. The GA routine ga_list_nodeid() can be used to map GA process IDs/ranks to message-passing process IDs/ranks. In addition, with the MPI library, the C-language routine ga_mpi_communicator provides a communicator for the GA processes. This operation is unavailable if the original TCGMSG library is used. ga_mpi_communicator cannot be called from Fortran because MPI provides no mechanism for passing handles between C and Fortran. Contact the MPI Forum about fixing this problem!
  2. Synchronization. To avoid deadlock, the current implementation of interrupt-driven receive on the IBM SP requires that a GA application synchronize (ga_sync()) before and after message-passing communication calls. This means that GA and message-passing communication must be separated into distinct phases (a code sketch follows the diagram below):

           GA communication phase

           ga_sync()

           message-passing phase

           ga_sync()

           GA communication phase
               ...
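
For example, a hedged Fortran sketch of this phase structure (brdcst is the TCGMSG broadcast; the message tag 7 is arbitrary, and the buffer length is in bytes per TCGMSG conventions):

  ! GA communication phase
  call ga_put(g_a, ilo, ihi, jlo, jhi, buf, ld)
  call ga_sync()
  ! message-passing phase
  call brdcst(7, x, 8*n, 0)          ! broadcast 8*n bytes from process 0
  call ga_sync()
  ! GA communication phase again
  call ga_get(g_a, ilo, ihi, jlo, jhi, buf, ld)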
        

C Language Interface

Subroutines and functions listed above can also be called from C programs. The GA C-language interface is rather minimalistic in style. There are two APIs (mixing is allowed):

  1. C routine names are derived by appending an underscore to the operation names and including the global.h header file. The GA routines that require a character (string) argument have both C (no underscore suffix) and Fortran (NOT TO BE USED BY C PROGRAMS) versions, for example ga_error, ga_check_handle, or ga_inquire_name. The reason for having two separate versions is to avoid the sometimes painful Fortran-to-C character string conversion.
  2. For convenience, starting with version 2.1, the global.h header file defines names for C routines with a GA_ prefix and no underscore suffix; for example, the C version of ga_error is GA_error. Please refer to the programs testc.c and ga-mpi.c included with the distribution package for actual code examples. For debugging purposes, the programmer should still be aware of the actual naming convention described in the first API above.

GA arguments are passed by pointer. For examples of how to use GAs in C programs, including ga_access, see the implementation of GA operations in the file global.alg.c, where some other GA routines are called.

For portability, it is recommended that your C programs use the DoublePrecision and Integer data types defined in the file types.f2c.h (included by global.h).

ga_access returns an index that is used to reference the data with respect to the MA base arrays (e.g., DBL_MB). This index follows the Fortran addressing convention, so to use it in C you need to decrement it by one; for example, DBL_MB[--index] provides the value of the first element.

GA routines operate on DoublePrecision and Integer data. Identifiers for these data types (MT_F_DBL and MT_F_INT) are defined in the file macommon.h, which is located in the MA directory.


Footnotes:

*  - interface to PEIGS
** - interface to ScaLAPACK