The concerns of the parallel programmer are those of any programmer: algorithm design, convenience of expression, efficiency of execution, ease of debugging, component reuse, and lifecycle issues. In particular, the parallel programmer, like any programmer, requires: languages and/or application programming interfaces (APIs) that allow for the succinct expression of complex algorithms, hiding unimportant details while providing control over performance-critical issues; associated tools (e.g., performance profilers) that allow them to diagnose and correct errors and performance problems; and convenient formulations of efficient algorithms for solving key problems, ideally packaged so that they can easily be integrated into an application program.

However, despite these commonalities, the particular characteristics of parallel computers and of parallel computing introduce additional concerns that tend to complicate both parallel programming and the development of parallel programming tools. In particular, we must be concerned with the following three challenges:

  1. Concurrency and communication. Parallel programs may involve the creation, coordination, and management of potentially thousands of independent threads of control. Interactions between concurrent threads of control may result in nondeterminism. These issues introduce unique concerns that have profound implications for every aspect of the program development process.

  2. Need for high performance. In sequential programming, ease of expression may be as important or even more important than program performance. In contrast, the motivation for using parallel computation is almost always a desire for high performance. This requirement places stringent constraints on the programming models and tools that can reasonably be used for parallel programming.

  3. Diversity of architecture. The considerable diversity seen in parallel computer architectures makes the development of standard tools and portable programs more difficult than is the case in sequential computing, where we find remarkable uniformity in basic architecture.

The goal of parallel programming is thus to satisfy the requirements listed at the beginning of this section, while simultaneously addressing in some fashion the three challenges of concurrency and communication, performance demands, and architectural diversity. This is a difficult task, and so in practice we find a variety of approaches to programming, which we try to summarize in this chapter.

A parallel computer is a collection of processing and memory elements, plus a communication network used to route requests and information among these elements. The task of the parallel programmer is to coordinate the operation of these diverse elements so as to achieve efficient and correct execution on the problem of interest.

The performance of a parallel program is determined by how effectively it maximizes concurrency (the number of operations that can be performed simultaneously) while minimizing the amount of communication required to access "nonlocal" data, transfer intermediate results, and synchronize the operation of different threads of control. Communication costs are frequently sensitive to data distribution, the mapping of application data structures to memory elements: a good data distribution can reduce the number of memory accesses that require expensive communication operations. If work is not distributed evenly among processors, load imbalances may occur, reducing concurrency and performance.
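
To make these notions concrete, the following sketch (in C with MPI; the array size N and the work function are illustrative placeholders) distributes a one-dimensional array across processes in contiguous blocks, so that most accesses are local and a single collective operation suffices to combine the partial results. Giving one rank a much larger block than the others would produce exactly the kind of load imbalance just described.

    /* Block distribution of N array elements over MPI processes: each rank
     * computes on a contiguous chunk, so most accesses are local and only a
     * final collective communication is required.  (Illustrative sketch; N
     * and work() are placeholders for real application data and computation.) */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    double work(int i) { return (double)i * 0.5; }   /* stand-in for real computation */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process owns elements [lo, hi); the remainder elements go to
         * the first N % size ranks, so no rank has more than one extra
         * element and the load stays (nearly) balanced. */
        int chunk = N / size, extra = N % size;
        int lo = rank * chunk + (rank < extra ? rank : extra);
        int hi = lo + chunk + (rank < extra ? 1 : 0);

        double local = 0.0;
        for (int i = lo; i < hi; i++)
            local += work(i);

        /* A single collective communication combines the partial results. */
        double total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }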

When evaluating the correctness of a parallel program, the programmer may need to take into account the possibility of race conditions, which occur when the executions of two or more distinct threads of control are sufficiently unconstrained that the result of a computation can vary nondeterministically, depending simply on the speed at which different threads proceed.
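
The following sketch, written with POSIX threads and intended only as an illustration, exhibits such a race condition: two threads increment a shared counter without synchronization, so the final value depends on how their read-modify-write sequences happen to interleave. Protecting the update with the mutex (shown in the comment) removes the nondeterminism.

    /* Two threads increment a shared counter without synchronization; the
     * final value varies from run to run because the increments interleave
     * (a race condition).  Guarding the update with the mutex makes the
     * result deterministic. */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 1000000

    long counter = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *increment(void *unused)
    {
        for (long i = 0; i < ITERS; i++) {
            /* Racy: the read-modify-write of 'counter' is not atomic. */
            counter++;
            /* Safe alternative:
             *   pthread_mutex_lock(&lock);
             *   counter++;
             *   pthread_mutex_unlock(&lock);
             */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, increment, NULL);
        pthread_create(&t2, NULL, increment, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Expected 2*ITERS, but the racy version usually prints less. */
        printf("counter = %ld (expected %d)\n", counter, 2 * ITERS);
        return 0;
    }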

The programmer, when faced with the task of writing an efficient and correct parallel program, can call upon a variety of parallel languages, compilers, and libraries, each of which implements a distinct programming model with different tradeoffs between ease of use, generality, and achievable performance.

In the rest of this chapter, we first review some of the principal programming models, then discuss some critical issues for achieving performance on parallel machines, including parallel decomposition, memory hierarchy management, performance analysis and tuning, parallel debugging, and operating system issues. We begin with some general comments concerning the programming models that underlie the various languages and libraries discussed subsequently.

Thirty years of research have led to the definition and exploration of a large number of parallel programming models []. Few of these models have survived, but much experience has been gained in what is useful in practical settings.

Parallel programs may be categorized according to whether they emphasize concurrent execution of the same task on different data elements (data parallelism) or the concurrent execution of different tasks on the same or different data (task parallelism). For example, a simulation of galaxy formation might require that essentially the same operation be performed on each of a large number of data items (stars); in this case, a data parallel algorithm is obtained naturally by performing this operation on multiple items simultaneously. In contrast, in a simulation of a complex physical system comprising multiple processes (e.g., a multidisciplinary optimization of an aircraft might couple airflow, structures, and engine simulations), the different components can be executed concurrently, thereby obtaining task parallelism.
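
As an illustration, the following sketch (in C with OpenMP; the function names are hypothetical placeholders for the computations just described) expresses data parallelism as a loop that applies the same operation to every star, and task parallelism as distinct components executed concurrently.

    /* Data parallelism vs. task parallelism, sketched with OpenMP.  The
     * functions update_star, simulate_airflow, simulate_structures, and
     * simulate_engine are hypothetical placeholders for the computations
     * described in the text. */
    #include <omp.h>

    #define NSTARS 100000

    void update_star(int i)        { (void)i; /* same operation on each star */ }
    void simulate_airflow(void)    { /* ... */ }
    void simulate_structures(void) { /* ... */ }
    void simulate_engine(void)     { /* ... */ }

    void step(void)
    {
        /* Data parallel: the same operation applied to many data items at once. */
        #pragma omp parallel for
        for (int i = 0; i < NSTARS; i++)
            update_star(i);

        /* Task parallel: different components executed concurrently. */
        #pragma omp parallel sections
        {
            #pragma omp section
            simulate_airflow();
            #pragma omp section
            simulate_structures();
            #pragma omp section
            simulate_engine();
        }
    }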

Most programs for scalable parallel computers are data parallel in nature, for the simple reason that the amount of concurrency that can be obtained from data parallelism tends to be larger than can be achieved via task parallelism. Nevertheless, task parallelism can have an important role to play as a software engineering technique: it often makes sense to execute distinct components on disjoint sets of processors (or even on different computers) for modularity reasons. It is increasingly common for parallel programs to be structured as a task-parallel composition of data-parallel components.
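
One way to realize such a composition is sketched below, under the assumption that the program uses MPI; the component solvers are hypothetical placeholders. MPI_Comm_split partitions the processes into disjoint groups, and each group runs its own data-parallel component on its own communicator.

    /* Task-parallel composition of data-parallel components, sketched with
     * MPI: MPI_Comm_split partitions the processes into disjoint groups,
     * and each group runs a different (hypothetical) data-parallel solver
     * on its own communicator. */
    #include <mpi.h>

    void run_airflow(MPI_Comm comm)    { (void)comm; /* data-parallel solver */ }
    void run_structures(MPI_Comm comm) { (void)comm; /* data-parallel solver */ }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Assign half the processes to each component. */
        int color = (rank % 2 == 0) ? 0 : 1;
        MPI_Comm comp;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &comp);

        if (color == 0)
            run_airflow(comp);      /* ranks 0, 2, 4, ... */
        else
            run_structures(comp);   /* ranks 1, 3, 5, ... */

        MPI_Comm_free(&comp);
        MPI_Finalize();
        return 0;
    }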

Parallel programming systems can be categorized according to whether they support an explicitly or implicitly parallel programming model. An explicitly parallel system requires that the programmer specify directly the activities of the multiple concurrent "threads of control" that form a parallel computation. In contrast, an implicitly parallel system allows the programmer to provide a higher-level specification of program behavior in which parallelism is not represented directly. It is then the responsibility of the compiler or library to implement this parallelism efficiently and correctly.

Implicitly parallel systems can simplify programming by eliminating the need for the programmer to coordinate the execution of multiple processes. For example, in the implicitly parallel, primarily data-parallel language High Performance Fortran, the programmer writes what is essentially sequential Fortran 90 code, augmented with some directives. Race conditions cannot occur, and the HPF program need not be rewritten to take advantage of different parallel architectures.

Explicitly parallel systems provide the programmer with more control over program behavior and hence can often be used to achieve higher performance. For example, an MPI implementation of an adaptive mesh refinement algorithm may incorporate sophisticated techniques for computing mesh distributions, for structuring communications among subdomains, and for redistributing data when load imbalances occur. These strategies are beyond the capabilities of today's HPF compilers.
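
The following sketch illustrates, in a much simpler setting than adaptive mesh refinement, the kind of communication structure that an explicitly parallel MPI program controls directly: each process exchanges the boundary ("ghost") values of its local subdomain with its left and right neighbors. The subdomain size is an illustrative placeholder.

    /* Explicit message passing with MPI: each process exchanges the boundary
     * values of its local subdomain with its left and right neighbors (a 1-D
     * "halo" exchange).  This sketches only the communication structure the
     * programmer controls directly; a real adaptive mesh refinement code
     * would add dynamic redistribution on top of such exchanges. */
    #include <mpi.h>

    #define LOCAL_N 1024

    int main(int argc, char **argv)
    {
        int rank, size;
        double u[LOCAL_N + 2] = {0};   /* local data plus two ghost cells */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* Send the rightmost owned value to the right neighbor and receive
         * the left ghost cell from the left neighbor, then the reverse. */
        MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 0,
                     &u[0],       1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  1,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... compute on u[1..LOCAL_N] using the updated ghost cells ... */

        MPI_Finalize();
        return 0;
    }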

A parallel programming style that is becoming increasingly popular is to encapsulate the complexities of parallel algorithm design within libraries (e.g., an adaptive mesh refinement library, as just discussed). An application program can then consist of just a sequence of calls to such library functions, as illustrated in the figure below. In this way, many of the advantages of an implicitly parallel approach can be obtained within an explicitly parallel framework.
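
The sketch below suggests what such a program might look like; the amr_* interface is entirely hypothetical and stands in for whatever functions a real adaptive mesh refinement library would provide.

    /* An application written as a sequence of calls to a parallel library.
     * The library interface shown here (amr_*) is hypothetical; the point is
     * that mesh distribution, communication, and load balancing are hidden
     * inside the library, while the calling program remains explicitly
     * parallel (it still runs under MPI). */
    #include <mpi.h>
    #include "amr.h"   /* hypothetical adaptive mesh refinement library */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        amr_mesh *mesh = amr_create_mesh(MPI_COMM_WORLD, "input.grid");
        for (int step = 0; step < 100; step++) {
            amr_advance(mesh, 0.01);   /* advance the solution one time step */
            amr_refine(mesh);          /* adapt the mesh where needed        */
            amr_rebalance(mesh);       /* redistribute work across processes */
        }
        amr_write(mesh, "output.dat");
        amr_destroy(mesh);

        MPI_Finalize();
        return 0;
    }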

Explicitly parallel programming systems can be categorized according to whether they support a shared or distributed memory programming model. In a shared memory model, the programmer's task is to specify the activities of a set of processes that communicate by reading and writing shared memory. In a distributed memory model, processes only have local memory and must use some other mechanism (e.g., message passing or remote procedure call) to exchange information.

Shared memory models have the significant advantage that the programmer need not be concerned with data distribution issues. On the other hand, high-performance implementations may be difficult on computers that lack hardware support for shared memory, and race conditions tend to arise more easily.

Distributed memory models have the advantage that programmers have explicit control over data distribution and communication; this control facilitates high-performance programming on large distributed memory parallel computers.
