Let us give some examples of the five problem architectures.
Synchronous: These are regular computations on regular data domains, exemplified by full matrix algorithms such as LU decomposition, finite difference algorithms, and convolutions such as the fast Fourier transform. A short sketch of such a regular data-parallel update is given after these three category descriptions.
Loosely Synchronous: These are typified by iterative calculations (or time evolutions) on geometrically irregular and perhaps heterogeneous data domains. Examples are irregular mesh finite element problems and inhomogeneous particle dynamics.
Asynchronous: These are characterized by a temporal irregularity which makes parallelization hard. An important example is event driven simulation where events, as in a battlefield simulation, occur in spatially distributed fashion but irregularly in time. Branch and bound and other pruned tree algorithms common in artificial intelligence, such as computer chess [9], fall in this category.
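As a concrete illustration of the synchronous class, the following is a minimal Python/NumPy sketch of a Jacobi-style finite difference update on a regular two-dimensional grid; the grid size, boundary values, and iteration count are arbitrary illustrative choices, not taken from any application above. The point is that the identical update is applied to every interior point of a regular domain, which is what makes the problem synchronous and data parallel.

\begin{verbatim}
import numpy as np

# Sketch of a synchronous computation: a Jacobi-style finite
# difference update on a regular 2-D grid.  The same averaging
# operation is applied to every interior grid point.
n = 64                       # illustrative grid size
u = np.zeros((n, n))
u[0, :] = 1.0                # a simple boundary condition for illustration

for _ in range(100):         # fixed iteration count, also illustrative
    u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
\end{verbatim}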
Synchronous and loosely synchronous problems parallelize naturally in a fashion that scales to large computers with many nodes. One only requires that the application be ``large enough'', a condition that can be quantified by a detailed performance analysis [10] and that was discussed quite accurately in my original COMPCON paper [1]. The speedup on a computer with $N$ nodes is
\[
S = \frac{N}{1 + f_C},
\]
where the overhead $f_C$ has a term due to communication of the form
\[
f_C \simeq \mbox{constant} \times \frac{t_{\rm comm}}{n^{1/d}\, t_{\rm calc}},
\]
where $t_{\rm comm}$ and $t_{\rm calc}$ are respectively the typical node-to-node communication and node (floating point) calculation times. Here $n$ is the application grain size and $d$ its dimension, which is defined precisely in [10]; in physical simulations $d$ is usually the geometric dimension. Good performance requires that $f_C$ be ``small'', with a value that depends on the critical machine parameter $t_{\rm comm}/t_{\rm calc}$. The grain size $n$ would be the number of grid points stored on each node in a finite difference problem, so that the complete problem has $Nn$ grid points. Implicit in the above discussion is that these problems are ``data parallel'' in the language of Hillis [11], [12]. This terminology is sometimes associated only with problems run on SIMD machines but, in fact, data parallelism is the general origin of massive parallelism on either SIMD or MIMD architectures. MIMD machines are needed for loosely synchronous data parallel problems where we do not have a homogeneous algorithm with the same update operation on each data element.
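As a rough numerical illustration of these formulas, the following Python sketch evaluates the overhead and the resulting speedup; the helper name, the constant of one, and all parameter values are illustrative assumptions rather than measurements from [10].

\begin{verbatim}
# Sketch: evaluate f_C and the speedup S = N / (1 + f_C) for
# illustrative (assumed, not measured) parameter values.
def speedup(N, n, d, t_comm_over_t_calc, const=1.0):
    # Speedup on N nodes for grain size n and dimension d.
    f_c = const * t_comm_over_t_calc / n ** (1.0 / d)
    return N / (1.0 + f_c)

# Example: 256 nodes, a 3-D problem with 16**3 = 4096 grid points
# per node, and an assumed machine ratio t_comm/t_calc of 10,
# giving f_C = 0.625 and a speedup of about 157.5.
print(speedup(N=256, n=16**3, d=3, t_comm_over_t_calc=10.0))
\end{verbatim}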
The above analysis does not apply to asynchronous problems, as this case has additional synchronization overhead. One can, in fact, use message passing to naturally synchronize synchronous or loosely synchronous problems. These typically divide into communication and calculation phases, given by the individual iterations or time steps of a simulation. These phases define an algorithmic synchronization common to the entire application. This is lacking in asynchronous problems, which require sophisticated parallel software support such as that given by the Time Warp system [13].
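This alternation of calculation and communication phases can be sketched as below. The sketch assumes the mpi4py message passing bindings; the local update and the ring-neighbor exchange are placeholders standing in for an application's real iteration or time step, not any particular code discussed here.

\begin{verbatim}
# Sketch of the loosely synchronous pattern: each iteration divides
# into a local calculation phase and a communication phase, and the
# message exchange itself supplies the algorithmic synchronization.
# Assumes mpi4py is installed; run with e.g. mpiexec -n 4 python step.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

value = float(rank)                    # placeholder local data
for step in range(10):                 # iterations or time steps
    value = 0.5 * value + 1.0          # calculation phase (placeholder update)
    left, right = (rank - 1) % size, (rank + 1) % size
    # Communication phase: exchange data with ring neighbors.
    received = comm.sendrecv(value, dest=right, source=left)
    value = 0.5 * (value + received)   # combine with received data
\end{verbatim}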
However, there is a very important class of asynchronous applications for which large scale parallelization is possible. These we call loosely synchronous complex, as they consist of an asynchronous collection of loosely synchronous (or synchronous) modules. A good example, shown in Figure 2, is the simulation of a satellite-based defense system. Viewed at the level of the satellites, we see an asynchronous application. However, the modules are not elemental events but rather large scale data parallel subsystems. In a simulation developed by JPL, these modules included sophisticated Kalman filters and target-weapon association [14]. This problem class shows functional parallelism at the module level and conventional data parallelism within the modules. The latter ensures that large problems of this class will parallelize on large machines. Image analysis, vision, and other large information processing or command and control problems fall in the loosely synchronous complex class.
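The two levels of parallelism can be sketched schematically as follows; the module names and their kernels are hypothetical placeholders and do not correspond to the actual components of the JPL simulation [14].

\begin{verbatim}
# Sketch of the loosely synchronous complex pattern: an asynchronous
# collection of modules (functional parallelism), each internally
# data parallel.  Module names and kernels are placeholders.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def tracking_module(data):
    return data * 0.9 + 1.0            # data-parallel placeholder kernel

def association_module(data):
    return np.sort(data)               # another placeholder kernel

if __name__ == "__main__":
    streams = {"tracking": np.random.rand(10_000),
               "association": np.random.rand(10_000)}
    # Functional parallelism: the modules run concurrently and
    # interact only through their inputs and outputs.
    with ThreadPoolExecutor() as ex:
        futures = {name: ex.submit(fn, streams[name])
                   for name, fn in [("tracking", tracking_module),
                                    ("association", association_module)]}
        results = {name: f.result() for name, f in futures.items()}
        print({name: r.shape for name, r in results.items()})
\end{verbatim}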
A final problem class of practical importance is termed ``embarrassingly parallel''. These problems consist of a set of independent calculations. This structure is seen in parts of many chemistry calculations, where one can independently compute the separate matrix elements of the Hamiltonian. Another example is the operation of the New York Stock Exchange, where the trading of 2000 stocks can be independently controlled by separate computers -- in practice the SIAC Corporation distributes the stocks over a few hundred personal computers or workstations, with each handling the independent trading of a few stocks.
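A minimal sketch of the embarrassingly parallel pattern, using Python's standard multiprocessing pool, is given below; the matrix_element function is a hypothetical stand-in for whatever independent calculation (a Hamiltonian matrix element, the trading of one stock) is being distributed.

\begin{verbatim}
# Sketch of an embarrassingly parallel computation: the tasks are
# completely independent, so they can simply be farmed out to workers.
from multiprocessing import Pool

def matrix_element(ij):
    # Hypothetical stand-in for an independent calculation,
    # e.g. one matrix element of a Hamiltonian.
    i, j = ij
    return (i, j, 1.0 / (1.0 + i + j))   # placeholder value

if __name__ == "__main__":
    tasks = [(i, j) for i in range(100) for j in range(100)]
    with Pool() as pool:
        results = pool.map(matrix_element, tasks)
    print(len(results), "independent elements computed")
\end{verbatim}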