
Fortran 90D/HPF on Low-Latency Systems

Much of the work on Fortran 90D/HPF has focused on overcoming the limitations of high-latency, message-based systems. Many existing massively parallel processors (MPPs) have relatively high latency, especially when compared to conventional shared memory multiprocessors. Even machines that support a shared memory programming model can have latencies of up to 500 clock cycles. Clustered workstations, especially those built from off-the-shelf interconnect technology, may have latencies one to two orders of magnitude higher still.

For high-latency systems it is important to minimize accesses to remote data, whether that data is stored on another processor or in a remote section of shared memory. Techniques such as message vectorization, collective communication, and inspector/executor loops (as found in PARTI) serve this purpose on distributed memory machines, while prefetching and data vectorization can hide latencies on shared memory systems.
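As a minimal sketch of message vectorization, the fragment below contrasts element-by-element communication with a single buffered message; SEND/RECV-style routines, the neighbor name LEFT, and the buffer BUF are hypothetical placeholders, not calls from any particular library.

	! Unvectorized: one message per remote element of B, paying
	! the full communication latency on every iteration.
	DO I = 1, NLOCAL
	    CALL RECV(LEFT, BTMP, 1)
	    A(I) = A(I) + BTMP
	ENDDO

	! Vectorized: all NLOCAL remote elements arrive in a single
	! message before the loop runs, paying the latency only once.
	CALL RECV(LEFT, BUF, NLOCAL)
	DO I = 1, NLOCAL
	    A(I) = A(I) + BUF(I)
	ENDDO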

An important challenge for Fortran 90D/HPF is to exploit lower-latency systems. For example, the Meiko CS-2 supports a remote memory access paradigm: any processor can read or write any other processor's memory without intervention by the second processor, and the latency for a remote read or write is less than 500 clock cycles. True shared memory machines, like those available from Sun, SGI, and Digital, reduce these latencies even further.

Traditionally, shared memory machines are programmed using multi-threaded models. One thread is created for each processor; all but one of these threads wait on a shared memory location, or semaphore, until a parallel region is entered. When a parallel region is entered, each thread is handed a unit of work, often a loop iteration or a block of loop iterations. It executes until it finishes that work unit, then obtains another or returns to the idle state.
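A sketch of this self-scheduling discipline, as seen by one worker thread, might look like the following; FETCH_AND_ADD (an atomic increment that returns the counter's previous value), the shared counter NEXT, the chunk size CHUNK, and DO_ITERATION are hypothetical placeholders rather than a specific threads API.

	! Claim a block of iterations by atomically advancing the
	! shared counter NEXT (assumed initialized to 1 before the
	! parallel region); MYLO receives the counter's old value.
	CALL FETCH_AND_ADD(NEXT, CHUNK, MYLO)
	DO WHILE (MYLO .LE. N)
	    DO I = MYLO, MIN(MYLO + CHUNK - 1, N)
	        CALL DO_ITERATION(I)
	    ENDDO
	    ! Finished this work unit; try to obtain another.
	    CALL FETCH_AND_ADD(NEXT, CHUNK, MYLO)
	ENDDO
	! No iterations remain; the thread returns to the idle state.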

The multi-threaded model can be highly efficient in terms of work distribution and processor utilization; however, the model ignores a very important point: locality. Even a simple shared memory multiprocessor does not provide uniform access to memory. Cache hierarchies and bus contention conspire to slow memory access times. On distributed shared memory machines the problem is the same, but perhaps an order of magnitude worse. It is here that Fortran 90D/HPF is useful.

Simply put, Fortran 90D/HPF allows the programmer to describe data locality; on modern multi-processors data locality is the key to performance. One of the most useful features of HPF is the relaxation of Fortran's storage and sequence association rules. For example, consider a simple loop:


	DO I = 1,N 
	    A(I) = A(I) + B(I) 
	ENDDO

On a two-processor shared memory system, one strategy for this loop is to assign even iterations to one processor and odd iterations to the other. Assume that elements of A and B are 4 bytes long and that cache lines are 32 bytes wide. Each processor could then access a cache line owned by the other processor as many as 4 times, resulting in many unnecessary cache line transfers and overloading the memory system. This phenomenon is known as false sharing [69]. The loop can also be parallelized by allocating blocks of iterations to each processor. This eliminates some of the performance difficulties, but still permits false sharing at the block boundaries.
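The two assignments can be written in SPMD form roughly as follows, assuming MYID (0 or 1) identifies the executing processor and N is even:

	! Cyclic assignment: adjacent elements, and hence elements
	! sharing a cache line, belong to different processors.
	DO I = 1 + MYID, N, 2
	    A(I) = A(I) + B(I)
	ENDDO

	! Block assignment: each processor updates a contiguous half,
	! so false sharing arises only where the two halves meet.
	DO I = 1 + MYID*(N/2), (MYID+1)*(N/2)
	    A(I) = A(I) + B(I)
	ENDDO

In the cyclic version every cache line of A and B is touched by both processors; in the block version only the lines straddling the midpoint are.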

On the other hand, Fortran 90D/HPF allows the compiler to break the arrays A and B into blocks and to align those blocks on cache-line boundaries. This eliminates the false sharing problem without changing the semantics of the program, while still exploiting the shared memory system's fast memory access.
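In HPF this might be declared with distribution directives such as the following, where the processor arrangement name P is arbitrary and N is assumed declared elsewhere:

!HPF$ PROCESSORS P(2)
	REAL A(N), B(N)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
!HPF$ ALIGN B(:) WITH A(:)

With BLOCK distribution each processor owns one contiguous half of A and B, and because HPF relaxes the sequence association rules, the compiler is free to pad and place each block on a cache-line boundary.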

Our strategy for Fortran 90D/HPF compilation on low-latency systems is as follows:


