Parallel processing has emerged as an enabling technology in modern computers, driven by the ever-increasing demand for higher performance, lower cost, and sustained productivity in real-life computer applications. Concurrency is exploited in today's high-performance computers through the common practices of multiprogramming, multiprocessing, and multicomputing.
Over the past five decades, electronic computers have gone through five generations of development. Fifth-generation computers are targeted to achieve teraflop performance by the end of this century. Massively parallel processing (MPP) is the mainstream of fifth-generation software and applications. Fifth-generation MPP systems are represented by projects at IBM (SP-2), Cray Research (T3D), Thinking Machines Corporation (CM-5), and Intel Supercomputer Systems (Paragon).
Before exploring methods for mapping a given formulation of an electromagnetic scattering problem onto parallel computers, we take a look at the computer architecture classification given by Flynn [64]. Conventional sequential computers are classified as SISD (single instruction stream over a single data stream) machines, shown in Figure 4.1(a); parallel computers fall into either the SIMD (single instruction stream over multiple data streams) class, shown in Figure 4.1(b), or the MIMD (multiple instruction streams over multiple data streams) class, shown in Figure 4.1(c). In Figure 4.1, CU denotes a control unit, PU a processing unit, MU a memory unit, IS an instruction stream, DS a data stream, PE a processing element, and LM a local memory.
The predominant type of computer until recently has been the SISD, or von Neumann, machine: the CPU progresses in a sequential manner from one instruction to the next and performs operations on data items one at a time. Many widely used computer languages, such as FORTRAN, were designed for SISD computers. The advantage of the SISD approach is its simplicity and its familiarity to many programmers.
The first important dichotomy in parallel systems is how the processors are controlled. In SIMD systems, all processors are under the control of a master processor, called the controller, and at any given time the individual processors all execute the same instruction (or do nothing). Thus, there is a single instruction stream operating on multiple data streams, one for each processor. The Illiac IV, the first large parallel system (completed in the early 1970s), was an SIMD machine. Thinking Machines Corporation's Connection Machine CM-2 is an SIMD machine with 64K simple 1-bit processors. Vector computers may also be included conceptually in the class of SIMD machines by considering the elements of a vector as being processed individually under the control of a vector hardware instruction.
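The pattern an SIMD or vector machine exploits can be illustrated with a simple elementwise loop (a modern sketch, not taken from the text; the function name and the OpenMP "simd" directive are illustrative assumptions):

    /* Illustrative SIMD-style kernel: the same operation is applied to   */
    /* every data item; on an SIMD or vector machine each iteration maps  */
    /* to a separate processing element or vector lane.                   */
    #include <stddef.h>

    void axpy(size_t n, double a, const double *x, double *y)
    {
        /* The OpenMP "simd" directive, if supported, asks the compiler   */
        /* to vectorize the loop; the logic is identical without it.      */
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* one operation, data item i */
    }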
SIMD machines perform efficiently on the types of problems for which they were designed, but there are limits to their applicability. Numerical analysts were disappointed to discover, soon after the first SIMD machines were programmed, that the projected speed improvements were not obtained. The cause was traced to program sections that are not vector operations. In the limit that vector operations take zero time, the maximum throughput is determined by the execution of the non-vector (scalar) instructions in SISD mode (see [65]).
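This observation is usually quantified by an Amdahl-type bound (a standard result, stated here for convenience rather than taken from the text): if a fraction f of the work can be vectorized and the vector hardware is N times faster on that fraction, the overall speedup is

    S = 1 / [ (1 - f) + f / N ]  <=  1 / (1 - f).

Even with infinitely fast vector hardware, a program that is 90% vectorizable (f = 0.9) can therefore run at most 10 times faster, because the remaining scalar 10% still executes in SISD mode.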
Most parallel computers built since the Illiac IV have been MIMD systems. Here, the individual processors run under the control of their own programs, which allows great flexibility in the tasks the processors can perform at any given time. In theory, this flexibility allows the application programmer to run a greater fraction of the problem in parallel than with vector processing alone. In practice, programming MIMD machines is difficult because of the inherent complexity of multiple processors doing different things simultaneously, as well as the general lack of automatic parallel-decomposition software tools. MIMD operation also introduces the problem of synchronization: in an SIMD system, synchronization of the individual processors is carried out by the controller, but in an MIMD system other mechanisms must be used to ensure that the processors perform their tasks in the correct order and with the correct data.
For many problems, the programs run by the individual processors of an MIMD system may be identical (or nearly so). All processors then carry out the same operations on different sets of data, much as an SIMD machine would. This gives rise to the Single-Program Multiple-Data (SPMD) model of computation.
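A minimal SPMD sketch, written with MPI purely as a convenient modern illustration (MPI itself is an assumption, not something the text prescribes): every processor runs the same executable, uses its rank to select its share of the data, and the collective reduction supplies the synchronization discussed above.

    /* SPMD sketch (illustrative): one program, many data partitions.    */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which processor am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many in total?    */

        /* Identical code everywhere; the rank selects the data items.   */
        double local = 0.0;
        for (int i = rank; i < 1000; i += size) /* hypothetical workload */
            local += (double)i;

        /* The collective reduction also synchronizes the processors.    */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %.0f\n", total);

        MPI_Finalize();
        return 0;
    }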
Another important dichotomy in parallel computers is shared versus distributed memory. An example of a shared-memory system with four processors is illustrated in Figure 4.2. Here, all the processors have access to a common memory. Each processor can also have its own local memory for program code and intermediate results. Interprocessor communication then consists simply of writing to and reading from the common data area, a fast operation because it occurs at memory-access rates. Shared-memory architectures also lend themselves to conventional time-sharing applications with multiple users running multiple jobs: since each user's program and data reside in a common memory area, jobs can be assigned to CPUs dynamically for load balancing.
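As a rough shared-memory sketch (OpenMP is assumed here only for illustration), every processor reads and writes the common array directly, with no explicit transfer of data:

    /* Shared-memory sketch (illustrative): all threads update a common  */
    /* array in place; communication is simply writing to and reading    */
    /* from the shared data area.                                        */
    void scale_in_place(double *a, int n, double factor)
    {
        /* Requires an OpenMP-capable compiler; each thread handles a    */
        /* slice of the index range, all writing the same array "a".     */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= factor;
    }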
A serious disadvantage is that different processors may wish to use the common memory simultaneously, in which case there is a delay until the memory is free. This delay, called contention time, can increase as the number of processors increases. Typically, shared memory has been used in systems with a small number of processors, such as Cray machines with up to 16 processors.
An alternative to the shared-memory system is the distributed-memory system, in which each processor can address only its own local memory. Communication between processors takes place by message passing, in which data or other information is transferred explicitly from one processor to another. The most practical massively parallel computers are distributed-memory MIMD systems; the Connection Machine CM-5, the Intel Paragon, and the IBM SP-1 are examples.
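A minimal message-passing sketch (again assuming MPI, purely for illustration): data held in one processor's local memory reaches another processor only through an explicit send and a matching receive.

    /* Message-passing sketch (illustrative): explicit transfer between  */
    /* the local memories of two processors.                             */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double value = 0.0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* sender: owns the data             */
            value = 3.14;
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {     /* receiver: must ask for the data   */
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %f\n", value);
        }

        MPI_Finalize();
        return 0;
    }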