Low Level HPF Compiler Benchmarks
-------------------------------------------------------------------------

Overview
---------

The benchmark suite comprises a number of simple, synthetic applications
that test various aspects of HPF compilation. The current version of the
suite addresses the basic features of HPF and is designed to measure the
performance of early compiler implementations. The kernels concentrate on
the parallel implementation of explicitly parallel statements, i.e., array
assignments, FORALL statements, INDEPENDENT DO loops, and intrinsic
functions, combined with different mapping directives. In addition, the
low level compiler benchmarks address the problem of passing distributed
arrays as arguments to subprograms. Language features not included in the
HPF subset are not addressed in this release of the suite. Future releases
will contain more kernels that address all features of HPF and are
sensitive to advanced compiler transformations. The codes included in this
suite are either adapted from existing benchmark suites (the NAS suite
\cite{NAS}, the Livermore Loops \cite{Liv}, and the Purdue Set
\cite{Rice}) or were developed at Syracuse University.

FORALL statement - kernel FL
----------------------------------------

The FORALL statement provides a convenient syntax for simultaneous
assignments to large groups of array elements. Such assignments lie at the
heart of the data parallel computations that HPF is designed to express.
FORALL was introduced in HPF to generalize Fortran 90 array assignment and
thus make parallelism easier to express. Kernel FL provides several
examples of FORALL statements that are difficult or inconvenient to write
using Fortran 90 syntax.

Explicit template - kernel TL
----------------------------------------

Parallel implementation of array assignments, including FORALL statements,
is a central issue for an early HPF compiler. Given a data distribution,
the compiler distributes the computation over the available processors. An
efficient compiler achieves an optimal load balance with minimal
interprocessor communication. The programmer may help the compiler
minimize interprocessor communication through suitable data mapping, in
particular by defining the relative alignment of different data objects.
This may be achieved by aligning the data objects with an explicitly
declared template. Kernel TL provides an example of this kind.

Communication detection in array assignments - kernels AA, SH, ST, and IR
-------------------------------------------------------------------------

Once the data and the iteration space are distributed, the next step that
strongly influences the efficiency of the resulting code is communication
detection and the generation of code to carry out the data movement. In
general, off-processor data elements must be gathered before an array
assignment is executed, and the results must be scattered to their
destination processors after the assignment completes. In other words,
some array assignments require a preprocessing phase that determines which
off-processor elements are needed and performs the gather; similarly, they
may require a postprocessing (scatter) phase. Many different techniques
may be used to optimize these operations, and to achieve high efficiency
it is particularly important that the compiler recognize structured
communication patterns such as shifts and multicasts.

Kernels AA, SH, and ST introduce different structured communication
patterns, while kernel IR is an example of an array assignment that
requires unstructured communication because of indirection.
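As a rough illustration (not code taken from the suite), the fragment
below collects the kinds of constructs the kernels described above
exercise: a FORALL assignment of the sort kernel FL targets, alignment of
several arrays with an explicit template as in kernel TL, a shift-style
array assignment, and an indirect assignment. All names, extents, and
mappings are placeholders.

  ! Illustrative fragment only; names, extents, and mappings are placeholders.
  PROGRAM SKETCH
     INTEGER, PARAMETER :: N = 1000
     INTEGER :: I
     REAL    :: A(N,N), B(N,N), X(N), Y(N)
     INTEGER :: IDX(N)
  !HPF$ TEMPLATE T(N,N)
  !HPF$ DISTRIBUTE T(BLOCK,BLOCK)
  !HPF$ ALIGN A(I,J) WITH T(I,J)
  !HPF$ ALIGN B(I,J) WITH T(I,J)
  !HPF$ ALIGN X(I)   WITH T(I,1)
  !HPF$ ALIGN Y(I)   WITH T(I,1)
  !HPF$ ALIGN IDX(I) WITH T(I,1)

     B   = 2.0
     X   = 1.0
     IDX = (/ (MOD(7*I, N) + 1, I = 1, N) /)
     A   = 0.0

     ! FORALL of the kind kernel FL exercises: a diagonal assignment is
     ! inconvenient to write with Fortran 90 array syntax alone.
     FORALL (I = 1:N) A(I,I) = 2.0 * X(I)

     ! Shift-style assignment (cf. kernels AA, SH, ST): nearest-neighbour
     ! references map onto structured (shift) communication.
     A(2:N-1,2:N-1) = 0.25 * (B(1:N-2,2:N-1) + B(3:N,2:N-1)  &
                            + B(2:N-1,1:N-2) + B(2:N-1,3:N))

     ! Indirect assignment (cf. kernel IR): the vector subscript makes the
     ! required communication unstructured.
     Y = X(IDX)

     PRINT *, SUM(A), SUM(Y)
  END PROGRAM SKETCH

Under the BLOCK,BLOCK distribution, the shift-style assignment should lead
to regular nearest-neighbour communication, whereas the gather through IDX
is the kind of assignment that needs the preprocessing phase described
above.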
Non-elemental intrinsic functions - kernel RD
-----------------------------------------------

Fortran 90 intrinsics and the HPF library functions offer yet another way
to express parallelism. Kernel RD tests the implementation of several
reduction functions.

Passing distributed arrays as subprogram arguments - kernels AS, IT, IM
------------------------------------------------------------------------

The last group of kernels demonstrates passing distributed arrays as
subprogram arguments. The kernels represent three typical cases (a small
illustrative sketch of the corresponding interface styles is given after
the Summary):

- the known mapping of the actual argument is to be preserved by the dummy
  argument (AS);

- the mapping of the dummy argument is inherited from the actual argument,
  so no remapping is necessary; the mapping is known at compile time (IT);

- the mapping of the dummy argument is identical to that of the actual
  argument, but the mapping is not known at compile time (IM).

Summary
-----------------------------

The synthetic compiler benchmark suite described here is an addition to
the PARKBENCH benchmark kernels and applications. It is not meant as a
tool to evaluate the overall performance of compiler-generated codes.
Rather, it has been introduced as an aid for compiler developers and
implementors in addressing selected aspects of the HPF compilation
process. In its current version, the suite does not comprise a
comprehensive sample of HPF codes; it addresses only the HPF subset. We
hope that in this way we will contribute to the establishment of a
systematic compiler benchmarking methodology, and we intend to continue
our effort to develop a complete, fully representative HPF benchmark
suite.
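To close, here is a minimal sketch of the three argument-passing cases
listed above (kernels AS, IT, and IM). The module, routine, and array
names are placeholders, and the code is illustrative rather than taken
from the suite: the prescriptive BLOCK mapping corresponds to the AS case,
while the INHERIT directive covers the IT and IM cases, the difference
being whether the mapping of the actual argument is known at compile time.

  MODULE KERNELS
  CONTAINS
     ! AS-style interface: the dummy prescribes a BLOCK mapping that is
     ! meant to match the known mapping of the actual argument.
     SUBROUTINE PRESCRIBED(X)
        REAL X(:)
  !HPF$ DISTRIBUTE X(BLOCK)
        X = X + 1.0
     END SUBROUTINE PRESCRIBED

     ! IT/IM-style interface: the dummy inherits whatever mapping the
     ! actual argument has; in the IM case that mapping is not known
     ! until run time.
     SUBROUTINE INHERITED(X)
        REAL X(:)
  !HPF$ INHERIT X
        X = X + 1.0
     END SUBROUTINE INHERITED
  END MODULE KERNELS

  PROGRAM ARGS
     USE KERNELS
     INTEGER, PARAMETER :: N = 1000
     REAL A(N)
  !HPF$ DISTRIBUTE A(BLOCK)
     A = 0.0
     CALL PRESCRIBED(A)
     CALL INHERITED(A)
     PRINT *, SUM(A)          ! expected value: 2*N
  END PROGRAM ARGS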