Globally addressable arrays have been developed to simplify writing portable scientific software for both shared and distributed memory computers. Programming convenience, code extensibility and maintainability are gained by adopting the shared memory programming model.
From the user's perspective, a global array can be used as if it were stored in shared memory. Details of the data distribution, addressing, and communication are encapsulated in the global array objects. However, information on the actual data distribution and locality can be obtained and taken advantage of whenever data locality is important.
Currently, support is limited to two-dimensional double precision, double complex, or integer arrays with block distribution, at most one block per array per processor.
GA memory consistency is guaranteed only for a limited set of operation combinations; the application has to take care of everything else, usually by inserting ga_sync (barrier) calls at appropriate points.
These operations must be invoked simultaneously by all processes, as if in SIMD mode:
ga_initialize | initialize global array internal structures |
ga_initialize_ltd | initialize global arrays and set memory usage limits |
ga_create | create an array |
ga_create_irreg | create an array with irregular distribution |
ga_duplicate | create an array following a reference array |
ga_destroy | destroy an array |
ga_terminate | destroy all existing arrays and delete internal data structures |
ga_sync | synchronize processes (a barrier) |
ga_zero | zero an array |
ga_ddot | dot product of two arrays (doubles only) |
ga_zdot | dot product of two arrays (double complex only) |
ga_scale | scale the elements in an array by a constant |
ga_add | scale and add two arrays, putting the result in a third (which may overwrite one of the other two) |
ga_copy | copy one array into another |
ga_dgemm | BLAS-like matrix multiply |
ga_ddot_patch | dot product of two patches (doubles only) |
ga_zdot_patch | dot product of two patches (double complex only) |
ga_scale_patch | scale the elements in an array by a constant (patch version) |
ga_add_patch | scale and add two arrays, putting the result in a third (patch version) |
ga_matmul_patch | matrix multiply (patch version) |
ga_diag | real symmetric generalized eigensolver (sequential version also exists) | * |
ga_diag_reuse | a version of ga_diag for repeated use | * |
ga_diag_std | standard real symmetric eigensolver (sequential version also exists) | * |
ga_symmetrize | symmetrize a matrix |
ga_transpose | transpose a matrix |
ga_lu_solve | solve a system of linear equations via LU factorization (sequential version also exists) | ** |
ga_llt_solve | solve a system of linear equations with an SPD coefficient matrix via Cholesky factorization | ** |
ga_solve | solve a system of linear equations, trying Cholesky first and then LU factorization | ** |
ga_spd_invert | invert an SPD matrix | ** |
ga_cholesky | perform Cholesky factorization of an SPD matrix | ** |
ga_copy_patch | copy data from a patch of one global array into another array |
ga_fill_patch | fill a patch of array with value |
ga_summarize | print information about already allocated arrays |
ga_print_patch | print a patch of an array to the screen |
ga_print | print an entire array to the screen |
ga_compare_distr | compare distributions of two global arrays |
ga_dgop | reduce operation (double precision) |
ga_igop | reduce operation (integer) |
ga_brdcst | broadcast operation |
These operations may be invoked by any process in task-parallel MIMD style:
ga_get | read from a patch of an array |
ga_put | write to a patch of an array |
ga_acc | accumulate into a patch of an array (double precision or complex only) |
ga_scatter | scatter elements into an array |
ga_gather | gather elements from an array |
ga_read_inc | atomically read and increment the value of a single integer array element |
ga_locate | determine which process 'holds' an array element |
ga_locate_region | determine which process 'holds' an array section |
ga_distribution | find coordinates of the array patch that is 'held' by a processor |
ga_error | print error message and terminate the program |
ga_init_fence | begin tracing the completion status of data movement operations |
ga_fence | blocks until the initiated communication completes |
ga_nodeid | find requesting compute process ID |
ga_nnodes | find number of compute processes |
ga_access | access 'local' elements of global array |
ga_release | relinquish access to 'local' data |
ga_release_update | relinquish access after data were updated |
ga_check_handle | verify that a GA handle is valid |
ga_inquire | find the type and dimensions of the array |
ga_inquire_name | find the name of the array |
ga_inquire_memory | find the amount of memory in active arrays |
ga_memory_avail | find the amount of memory left for GA |
ga_uses_ma | find whether memory in global arrays comes from MA (memory allocator) |
ga_memory_limited | find whether limits were set for memory usage in global arrays |
ga_proc_topology | find block coordinates for the array section held by a processor |
ga_list_nodeid | return the message-passing process id for GA processes |
ga_print_stats | print misc. execution statistics to the screen |
There are two classes of platforms that GA works on:
The implementation uses a data-server/compute-process model, in which each machine needs an additional process (the data server) to service requests for data. On each machine, one of the processes started by the user therefore acts as the data server and the remaining ones are compute processes. Data servers use shared memory to access the data on the local machine.
For example, the following TCGMSG .p file describes the process configuration for one 4-processor and one 8-processor workstation:
d3h325 coho 4 /usr/people/d3h325/g/global/testing/test.x /tmp
d3h325 bohr 8 /usr/people/d3h325/g/global/testing/test.x /tmp
This defines 10 compute processes and 2 data servers (one on coho and one on bohr).
Single (possibly multiprocessor) machines with shared or globally addressable memory, such as the KSR-2, workstations, and even the Cray T3D, are in fact a special case of 1). The System V shared memory implementation uses TCGMSG or MPI to fork all the processes and to implement ga_brdcst and ga_dgop. There are no data-server processes; all the specified processes/processors are used for computation.
The implementation uses interrupt-driven communication on the Intel IPSC/XXX, Delta, Paragon with the NX, and the IBM SP with the MPL message passing library. There are no data-server processes: all processes execute application code.
Starting with version 2.1, Global Arrays works with either TCGMSG or MPI message-passing libraries. That means that GA applications can use either of these libraries. Selection of the message-passing library takes place when GA is built. TCGMSG is included with the GA distribution package and selected by default.
There are three possible configurations for running GA with the message-passing libraries:
For MPI versions, the optional environment variables MPI_LIB and MPI_INCLUDE are used to point to the MPI library and include directories if they are not in the standard system location(s). GA programs are started with the same mechanism that any other MPI programs use on the given platform.
Interface routines to ScaLAPACK are available only with MPI and, of course, with ScaLAPACK. The user is required to define the environment variable USE_SCALAPACK and to specify the location of the ScaLAPACK and associated libraries in the variable SCALAPACK.
Because of interdependencies between blacs and blacsF77cinit, some systems might require -lblacs to be specified twice on the link command to resolve the external symbols from these libraries.
On machines that do not have a 'native' implementation of the TCGMSG-MPI nxtval operation (see the README in the tcgmsg-mpi directory), the BLACS file blacs_pinfo_.c must be modified to substitute MPI_COMM_WORLD with the communicator returned from the ga_mpi_communicator routine.
Please refer to the compiler flags in the file Makefile.h to make sure that the Fortran and C compiler flags are consistent with the flags used to compile your application. This may be critical when Fortran compiler flags are used to change the default length of the integer datatype.
      call pbeginf()        ! start TCGMSG
      status = ma_init(..)  ! start memory allocator if required
      call ga_initialize()  ! start global arrays
      ....                  ! do work
      call ga_terminate()   ! tidy up global arrays
      call pend()           ! tidy up TCGMSG
      stop                  ! exit program
      call MPI_Init(ierr)     ! start MPI
      status = ma_init(..)    ! start memory allocator if required
      call ga_initialize()    ! start global arrays
      ....                    ! do work
      call ga_terminate()     ! tidy up global arrays
      call MPI_Finalize(ierr) ! tidy up MPI
      stop                    ! exit program
The ma_init call looks like:
status = ma_init(type, stack_size, heap_size)
and it essentially just asks the OS for stack_size + heap_size elements of size type. On distributed-memory platforms, if you allocate a global array of N elements, you need to ensure that the call to ma_init provides at least N/(number of GA processes) elements per process.
There are two issues to be aware of when using message-passing in GA programs:
      GA communication phase
      ga_sync()
      message-passing phase
      ga_sync()
      GA communication phase
      ...
The subroutines and functions listed above can also be called from C programs. The GA C-language interface is rather minimalist in style. There are two APIs (mixing is allowed):
GA arguments are passed by pointer. For examples of how to use GAs in C programs, including ga_access, see the implementation of the GA operations in the file global.alg.c, where some other GA routines are called.
For portability, it is recommended to use in your C programs the DoublePrecision and Integer data types defined in the file types.f2c.h (included by the global.h file).
ga_access returns an index that is used to reference the data with respect to the MA base arrays. This index follows the Fortran addressing convention, so to use it in C you need to decrement it by one; for example, DBL_MB[index-1] gives the value of the first element.
GA routines operate on DoublePrecision and Integer data. Identifiers for these data types (MT_F_DBL and MT_F_INT) are defined in file macommon.h that is located in the MA directory.
Footnotes:
* - interface to PEIGS
** - interface to ScaLAPACK