The concerns of the parallel programmer

The concerns of the parallel programmer are those of any programmer: algorithm design, convenience of expression, efficiency of execution, ease of debugging, component reuse, and lifecycle issues. In particular, the parallel programmer, like any programmer, requires: languages and/or application programming interfaces (APIs) that allow for the succinct expression of complex algorithms, hiding unimportant details while providing control over performance-critical issues; associated tools (e.g., performance profilers) that allow them to diagnose and correct errors and performance problems; and convenient formulations of efficient algorithms for solving key problems, ideally packaged so that they can easily be integrated into an application program.

However, despite these commonalities, the particular characteristics of parallel computers and of parallel computing introduce additional concerns that tend to complicate both parallel programming and the development of parallel programming tools. In particular, we must be concerned with the following three challenges

Concurrency and communication. Parallel programs may involve the creation, coordination, and management of potentially thousands of independent threads of control. Interactions between concurrent threads of control may result in nondeterminism. These issues introduce unique concerns that have profound implications for every aspect of the program development process.
Need for high performance. In sequential programming, ease of expression may be as important or even more important than program performance. In contrast, the motivation for using parallel computation is almost always a desire for high performance. This requirement places stringent constraints on the programming models and tools that can reasonably be used for parallel programming.
Diversity of architecture. The considerable diversity seen in parallel computer architectures makes the development of standard tools and portable programs more difficult than is the case in sequential computing, where we find remarkable uniformity in basic architecture.

The goal of parallel programming is thus to satisfy the requirements listed at the beginning of this section, while simultaneously addressing in some fashion the three challenges of concurrency and communication, performance demands, and architectural diversity. This is a difficult task, and so in practice we find a variety of approaches to programming, which we try to summarize in this chapter.

A parallel computer is a collection of processing and memory elements, plus a communication network used to route requests and information among these elements. The task of the parallel programmer is to coordinate the operation of these diverse elements so as to achieve efficient and correct execution on the problem of interest.

The performance of a parallel program is determined by how effectively it maximizes concurrency (the number of operations that can be performed simultaneously) while minimizing the amount of communication required to access ``nonlocal'' data, transfer intermediate results, and synchronize the operation of different threads of control. Communication costs are frequently sensitive to data distribution, the mapping of application data structures to memory elements: a good data distribution can reduce the number of memory accesses that require expensive communication operations. If work is not distributed evenly among processors, load imbalances may occur, reducing concurrency and performance.

When evaluating the correctness of a parallel program, the programmer may need to take into account the possibility of race conditions, which occur when the executions of two or more distinct threads of control are sufficiently unconstrained that the result of a computation can vary nondeterministically, depending simply on the speed at which different threads proceed.

The programmer, when faced with the task of writing an efficient and correct parallel program, can call upon a variety of parallel languages, compilers, and libraries, each of which implements a distinct programming model with different tradeoffs between ease of use, generality, and achievable performance.

In the rest of this chapter, we first review some of the principal programming models, then discuss some critical issues for achieving performance on parallel machine including parallel decomposition, memory hierarchy management, performance analysis and tuning, parallel debugging, and operating system issues.

The next chapter provides a detailed review of parallel computer architectures. For the purposes of this chapter we will provide a simple introduction to these topics that covers most of the important issues needed to understand parallel programming.

First, we observe that most of the modern parallel machines fall into two basic categories:

Shared memory machines, which have a single shared address space that can be accessed by any processor, and
Distributed memory machines, in which the system memory is packaged with individual processors and some form of communication is required to provide data from the memory of one processor to a different processor.

The organization of a shared memory machine is depicted in Figure , which shows a system with four processors, each with a private cache, interconnected to a global shared memory via a single system bus. This organization is used on modern multiprocessor workstations, such as those from Sun and Hewlett-Packard. Many simple desktop multiprocessors share this design.

Figure
A Uniform Access Shared Memory Architecture

In a shared memory system, each processor can access every location in global memory via standard load operations. The hardware ensures that the caches are ``coherent'' by watching the system bus and invalidating cached copies of any block that is written into. This mechnism is generally invisible to the user, except when different processors are simultaneously attempting to write into the same cache line, which can lead to ``thrashing''. To avoid this problem, the programmer and programming system must be careful with shared data structures and non-shared data structures that can be located on the same cache block, a situation known as ``false sharing.'' Synchronization of accesses to shared data structures is a major issue on shared memory systems-it is up to the programmer to ensure that operations by different processors on a shared data structure leave that data structure in a consistent state.

The main problem with the uniform access shared memory system is that it is not scalable. Most bus-based systems are limited to 32 processors. If the bus is replaced by a crossbar switch, systems might well scale to as many as 128 processors although the cost of the switch goes up as the square of the number of processors.

This has led designers to use distributed-memory organizations such as the one depicted in Figure . Here the global shared memory has been replaced by a smaller local memory attached to each processor. Communication among these configurations makes use of an interconnection network. Interconnection networks of this sort can employ a scalable design such as a hypercube.

The advantage of a distributed-memory design is that access to local data can be made quite fast. On the other hand, access to remote memories requires much more effort. On message-passing systems, the processor owning a needed datum must send it to the processor that needs it. These ``send-receive'' communication steps typically incur long start-up times, although the bandwidth after start-up can be high. Hence, it typically pays to send fewer, longer messages on message-passing systems.

Some distributed-memory machines allow a processor to directly access a datum in a remote memory. On these distributed-shared-memory (DSM) systems, the latency associated with a load varies with the distance to the remote memory. Cache coherency on DSM systems is a complex problem and is usually handled by a sophisticated network interface unit. As of this writing, the Silicon Graphics Origin and the HP/Convex Exemplar are the only commercial DSM systems.

The principal programming problem for distributed memory machines is to manage communication of data between processors. On message-passing systems this amounts to minimizing the number of communications and attempting to overlap communication with computation.

Figure
A Distributed-Memory Architecture

For very large parallel systems, a hybrid architecture called a cluster is becoming increasingly common. A cluster looks like a distributed-memory system in which each of the individual components is a shared-memory multiprocessor rather than a single processor node. This design permits high parallel efficiency within a multiprocessor node while permitting systems to scale to hundreds or even thousands of processors. Most manufacturers of high-end parallel systems are planning to offer some sort of cluster.

The design of memory hierarchies is an integral part of the design of parallel computer systems because the memory hierarchy determines the performance of the individual nodes in the processor array of the machine. A typical memory hierarchy is depicted in Figure . Here the processor and a level-1 cache memory are found on chip and a larger level-2 cache lies between the chip and the memory.

When a processor executes a load instruction, the level-1 cache is first interrogated to determine if the desired datum is available. If it is the datum can be delivered to the processor in 2 to 5 processor cycles. If the datum is not found in the L1 cache, the processor stalls while the L2 cache is interrogated. If the desired datum is found in L2, then the stall may last for only 10-20 cycles. If the datum is not found in either cache, a full cache miss is taken with a delay of possibly a hundred cycles or more.

Figure
A Standard Uniprocessor Memory Hierarchy

Generally the performance of the memory hierarchy is determined by two hardware parameters:

Latency, which is the time required to fetch a desired datum from memory, and
Bandwidth, which is the number of bytes per unit time that can be delivered from the memory at full speed.

Generally these two factors are complicated by the multilevel nature of memory hierarchies, because each level will have a different bandwidth and latency to the next level. For example, the SGI Origin can deliver about 4 bytes per machine cycle from L1 to the processor and 4 bytes per cycle from L2 to L1 but only about 0.8 bytes per cycle from memory to L1.

Another important parameter that affects memory performance on a uniprocessor is the length of the standard cache block (or cache line). Most cache systems will only transfer blocks of data between levels of the memory hierarchy. If all the data in a block that is actually transferred are used, then no bandwidth is wasted and the miss can be amortized over all the data in the block. If only one or two data items are used, then the average latency is much higher and the effective bandwidth much lower.

There are generally two kinds of strategies for overcoming latency problems. Latency hiding attempts to overlap the latency of a miss with computation. Prefetching of cache lines is a latency hiding strategy. Latency tolerance on the other hand attempts to restructure a computation to make make it less subject to performance problems due to long latencies. Cache blocking, which we shall discuss shortly, is one such latency tolerance technique.

Strategiest that improve reuse in cache also improve effective bandwidth utilization. Perhaps the most important way to ensure good bandwidth utilization is to organize data and computations to use all the items in a cache line whenever it is fetched from memory. Ensuring that computations access data arrays in strides of one is an example of how this might be done.

The memory hierarchies on parallel machines are more complicated because of the existence of multiple caches on shared-memory systems and the long latencies to remote memories on distributed-memory configurations. There may also be interference between data transfers between memories and from local memory to a processor.

------------------------ We first make some general comments concerning the programming models that underly the various languages and libraries that will be discussed subsequently.

Thirty years of research have led to the definition and exploration of a large number of parallel programming models []. Few of these models have survived, but much experience has been gained in what is useful in practical settings.

Parallel programs may be categorized according to whether they emphasize concurrent execution of the same task on different data elements (data parallelism) or the concurrent execution of different tasks on the same or different data (task parallelism). For example, a simulation of galaxy formation might require that essentially the same operation be performed on each of a large number of data items (stars); in this case, a data parallel algorithm is obtained naturally by performing this operation on multiple items simultaneously. In contrast, in a simulation of a complex physical system comprising multiple processes (e.g., a multidisciplinary optimization of an aircraft might couple airflow, structures, and engine simulations) the different components can be executed concurrently, hence obtaining task parallelism.

Most programs for scalable parallel computers are data parallel in nature, for the simple reason that the amount of concurrency that can obtained from data parallelism tends to be larger than can be achieved via task parallelism. Nevertheless, task parallelism can have an important role to play as a software engineering technique: it often makes sense to execute distinct components on disjoint sets of processors (or even on different computers) for modularity reasons. It is increasingly common for parallel programs to be structured as a task-parallel composition of data-parallel components.

Parallel programming systems can be categorized according to whether they support an explicitly or implicitly parallel programming model. An explicitly parallel system requires that the programmer specify directly the activities of the multiple concurrent ``threads of control'' that form a parallel computation. In contrast, an implicitly parallel system allows the programmer to provide a higher-level specification of program behavior in which parallelism is not represented directly. It is then the responsibility of the compiler or library to implement this parallelism efficiently and correctly.

Implicitly parallel systems can simplify programming by eliminating the need for the programmer to coordinate the execution of multiple processes. For example, in the implicitly parallel, primarily data-parallel language High Performance Fortran, the programmer writes what is essentially sequential Fortran 90 code, augmented with some directives. Race conditions cannot occur and the HPF program need not be rewritten to take advantage of different parallel architectures.

Explicitly parallel systems provide the programmer with more control over program behavior and hence can often be used to achieve higher performance. For example, an MPI implementation of an adaptive mesh refinement algorithm may incorporate sophisticated techniques for computing mesh distributions, for structuring communications among subdomains, and for redistributing data when load imbalances occur. These strategies are beyond the capabilities of today's HPF compilers.

A parallel programming style that is becoming increasingly popular is to encapsulate the complexities of parallel algorithm design within libraries (e.g., an adaptive mesh refinement library, as just discussed). An application program can then consist of just a sequence of calls to such library functions, as illustrated in Figure below. In this way, many of the advantages of an implicitly parallel approach can be obtained within an explicitly parallel framework.

The dichotomy of parallel machine designs has led to a corresponding dichotomy of programming models. Explicitly parallel programming strategies can be categorized according to whether they use a shared or distributed memory programming model. Note that these strategies are not completely determined by the target machine because a distributed-memory programming model can be supported on a shared-memory machine and vice versa. In a shared memory model, the programmer's task is to specify the activities of a set of processes that communicate by reading and writing shared memory. In a distributed memory model, processes only have local memory and must use some other mechanism (e.g., message passing or remote procedure call) to exchange information.

Shared memory models have the significant advantage that the programmer need not be concerned with data distribution issues. On the other hand, high-performance implementations may be difficult on computers that lack hardware support for shared memory, and race conditions tend to arise more easily.

Distributed memory models have the advantage that programmers have explicit control over data distribution and communication; this control facilitates high-performance programming on large distributed memory parallel computers.

In this section we discuss programming strategies that can help make optimal use of the memory hierarchy of a modern parallel computer system. We begin with the strategies that improve the performance of a uniprocessor node within the memory and then proceed to the issues that are complicated by parallelism.

The critical issue in getting good memory hierarchy performance on a uniprocessor is achieving high degrees of reuse of data in both registers and cache memory. Many programmers are surprised to find that proper organization of their programs can dramatically affect the performance that they achieve.

There are three principal strategies available to programmers for improving the performance of memory hierarchy:

Stride-one access. Most cache memories are organized into blocks of data much longer than a single data item. For example, the level-2 cache block on the SGI Origin can hold 16 double precision floating point numbers. On every machine, these numbers are at contiguous addresses in memory. If a program is arranged to iterate over successive items in memory, it can suffer at most one cache miss for every cache block. All successive data items in the cache block will be hits. Thus, programs in which the loops access contiguous data items are typically much more efficient than those that do not.
Blocking. Program performance can also be improved by ensuring that data remains in cache between subsequent accesses to the same memory location. As an example, consider the following code:
```
	        DO I = 1, N
	           DO J = 1, N
	              A(I) = A(I) + B(J)
	           ENDDO
	        ENDDO
     
```
Although this loop achieves a high degree of reuse for array A, missing only N/L times where L is the size of the cache block, it has a dismal performance on array B, on which it incurs N²/L misses. The problem is that, even though access to B is stride-one, B(J) cannot remain in cache until its next use on the next iteration of the outer loop. Therefore, all N/L misses will be incured on each iteration of the outer loop. If the loop on J is ``blocked'' to a size where all the elements of B that it touches can remain in cache, then the following loop results:
```
	        DO J = 1,N,S
	           DO I = 1, N
	              DO j = J, min(J+S,N)
	                 A(I) = A(I) + B(J)
	              ENDDO
	           ENDDO
	        ENDDO
     
```
where S is the maximum number of elements of B that can remain in cache between two iterations of the outer loop.
In this new organization, we suffer at most N/L misses for B, because each element of B is reused N times. On the other hand, we increase the number of misses on A to N²/(LS) because we must now miss for all the elements of A on each of the N/S iterations of the outer loop. In this example, we have strip-mined or blocked the inner loop to the size of the cache and moved the iterate-by-strip loop to the outermost loop position.
Data reorganization. A third strategy for improving the behavior of a uniprocessor memory hierarchy is to reorganize the data structures so that data items that are used together are stored together in memory. For example, many older Fortran programs use multidimensional arrays to store complex data structures. In these programs one often sees an array declaration such as:
```
         DOUBLE PRECISION PART(10000,5)
     
```
Here the second dimension is being used to store the fields of a data structure about one of the 10,000 particles in a simulation. If the fields describing a single particle are updated together, this is the wrong data organization because Fortran uses column-major order and the 5 fields are likely to appear on different cache lines. This organization can reduce the effective bandwidth in the program by up to a factor of 5. A much better organization for Fortran is to swap the data dimensions:
```
         DOUBLE PRECISION PART(5,10000)
     
```
However, this requires rewriting every access to the array PART in order to ensure correctness. Thus this task is often best left to a tool, although the programmer should be aware of this problem while writing the program initially. Data reorganization is also very effective on irregular problems, even though the reorganization must take place at run time. In irregular particle codes, the cost of misses due to a bad data organization far outweighs the cost of rearranging the data at run time.

Judicious use of these strategies can improve performance by integer factors. However, they are tedious to apply by hand, so many of these strategies have been built into modern Fortran compilers.

Multiprocessors add a number of complexities to the problem of managing accesses to memory and improving reuse. In this section, we will concentrate on three of the most significant problems.

Synchronization. In many parallel programs it is useful to have several different processors updating the same shared data structure. An example is a particle-in-cell code where the forces on a single processor are computed by a number of different processors, each of which must update the aggregate force acting on that particle. If two processors attempt two different updates simultaneously, incorrect results may occur. Thus it is essential to use some sort of locking mechanism, such as a semaphore or a critical region to ensure that when one processor is performing such an update on a given particle, all other updates for the same particle are locked out. Typically processors that are locked out execute some sort of busy-waiting loop until the lock is reset. Most machines are designed to ensure that these loops do not cause ping-ponging of the cache block containing the lock.
Elimination of False Sharing. False sharing is a problem that arises when two different processors are accessing data that resides on the same cache block. On a shared-memory machine, if both processors attempt to write into the same block, the block can ping-pong back and fort between those processor caches. This phenomenon is known as false sharing because it has the effect of repeated access to a shared datum even though there is no sharing. False sharing is typically avoided by ensuring that data used by different processors resides on different cache blocks. This can be acheived by the programmer or a compiler through the use of padding in data structures. Padding is the process of inserting empty bytes in a data structure to ensure that different elements are in different cache blocks.
Communication Minimization and Placement. Communication with a remote processor can have a number of negative effects on the performance of a computation node in a parallel machine. First, the communication itself can cause computation to wait. The typical strategy for addressing this problem is to move send and receive commands far enough apart so that the time spent on communication can be overlapped with compuation. Alternatively, a program reorganization may reduce the frequency of communication which not only reduces the number of start-up delays that must be incurred but also reduces the interference with local memory hierarchy management. A form of blocking can be useful in this context-if large data structures are being cycled through all the processors of a distributed-memory computer, it pays to block the data so that all computations involving that data structure by one processor can be carried out at the same time so that each block has to be communicated to a given processor exactly once.

Once again, many of the useful strategies can be automated in a compiler.

File translated from T_EX by T_TH, version 2.33.
On 25 Jan 2000, 12:22.