Conventional algorithms that have been optimized for a single-processor or vector-pipelined computer are often taken for granted, since they are readily available as subroutine packages. However, these algorithms may perform poorly in the environment of a particular parallel computer. To achieve superior performance, new algorithms must be developed that conform better to the parallel architecture.
To illustrate parallelism and load balancing, we consider the problem of adding two $n$-vectors $x$ and $y$. The additions
\[
x_i + y_i, \qquad i = 1, \ldots, n,
\]
are all independent and can be done in parallel. Thus, the problem has perfect mathematical parallelism. On the other hand, it may not have perfect parallelism on a parallel computer, because it may not have perfect load balancing.
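As a concrete illustration, here is a minimal Python sketch in which the independent additions are grouped into chunks and the chunks are added in parallel. The chunk size, process count, and use of multiprocessing are illustrative assumptions, not anything prescribed by the discussion above.

    # Sketch: adding two n-vectors in parallel by splitting them into chunks.
    # The chunking scheme and process count are illustrative assumptions.
    from multiprocessing import Pool

    def add_chunk(args):
        """Add corresponding slices of x and y; each call is an independent task."""
        x_chunk, y_chunk = args
        return [a + b for a, b in zip(x_chunk, y_chunk)]

    def parallel_vector_add(x, y, num_procs=4, chunk=25):
        # Split the index range into contiguous chunks, one task per chunk.
        tasks = [(x[i:i + chunk], y[i:i + chunk]) for i in range(0, len(x), chunk)]
        with Pool(num_procs) as pool:
            partial_sums = pool.map(add_chunk, tasks)
        # Concatenate the partial results back into a single n-vector.
        return [z for part in partial_sums for z in part]

    if __name__ == "__main__":
        n = 98
        x = list(range(n))
        y = [2.0] * n
        print(parallel_vector_add(x, y)[:5])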
By load balancing we mean the assignment of tasks to the processors of the system so as to keep each processor doing useful work as much as possible. For example, suppose that there are $p = 32$ processors and $n = 98$ in the vector addition above. Then the processors can work in perfect parallelism on 96 of the additions, but only two processors will be busy during the remaining 2 additions. Thus, there is not perfect load balancing to match the perfect mathematical parallelism.
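The arithmetic behind this count can be sketched in a few lines of Python, using the figures $p = 32$ and $n = 98$ assumed above:

    # Sketch: distribute n = 98 additions over p = 32 processors (figures as above).
    n, p = 98, 32

    full_rounds = n // p          # rounds in which every processor is busy: 3
    leftover = n % p              # additions left for the final round: 2
    print(f"{full_rounds * p} additions done with all {p} processors busy")
    print(f"{leftover} additions remain, so only {leftover} of {p} processors work in the last round")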
In general, load balancing may be done either statically or dynamically. In static load balancing, tasks (and, perhaps, data in a distributed-memory system) are assigned to processors at the beginning of a computation. In dynamic load balancing, tasks (and data) are assigned to processors as the computation proceeds. A useful concept for dynamic load balancing is that of a pool of tasks, from which a processor obtains its next task when it is ready to do so.
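A minimal Python sketch of the pool-of-tasks idea, assuming worker threads that pull from a shared queue; the task contents, worker count, and queue-based implementation are illustrative choices:

    # Sketch: dynamic load balancing with a pool (queue) of tasks.
    # Each worker repeatedly takes the next available task until the pool is empty.
    import queue
    import threading

    def worker(tasks, results):
        while True:
            try:
                i, data = tasks.get_nowait()    # grab the next task when ready
            except queue.Empty:
                return                          # pool exhausted: worker stops
            results[i] = sum(data)              # the "task": here, a small reduction
            tasks.task_done()

    tasks = queue.Queue()
    chunks = [list(range(k * 10, (k + 1) * 10)) for k in range(12)]
    for i, chunk in enumerate(chunks):
        tasks.put((i, chunk))

    results = [None] * len(chunks)
    threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)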
Related to load balancing is the idea of granularity. Large-scale granularity means large tasks that can be performed independently in parallel. Small-scale granularity means small tasks that can be performed in parallel.
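For the vector-addition example, the two extremes can be sketched as follows; the block size below is an illustrative assumption:

    # Sketch: two decompositions of the same 98 additions into parallel tasks.
    n = 98

    # Small-scale granularity: one tiny task per element (98 tasks).
    fine_tasks = [(i, i + 1) for i in range(n)]

    # Large-scale granularity: a few big tasks, each covering a block of elements.
    block = 25
    coarse_tasks = [(start, min(start + block, n)) for start in range(0, n, block)]

    print(len(fine_tasks), "small-scale tasks vs", len(coarse_tasks), "large-scale tasks")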
In general, in a distributed-memory system, processors will need to exchange data at various times during an overall computation; to the extent that they are not doing useful computation during this communication, it constitutes an overhead.
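A minimal sketch of this overhead, assuming two processes that exchange data over a pipe; the payload size and the pipe-based exchange are illustrative and not a model of any particular machine:

    # Sketch: communication between two processes, timed separately from local
    # computation, to illustrate communication overhead (illustrative sizes).
    import time
    from multiprocessing import Process, Pipe

    def worker(conn):
        data = conn.recv()                 # wait for data from the other processor
        conn.send(sum(data))               # compute, then send the result back

    if __name__ == "__main__":
        parent_conn, child_conn = Pipe()
        p = Process(target=worker, args=(child_conn,))
        p.start()

        payload = list(range(1_000_000))
        t0 = time.perf_counter()
        parent_conn.send(payload)          # time spent exchanging data...
        result = parent_conn.recv()        # ...is time not spent computing here
        t1 = time.perf_counter()
        p.join()
        print(f"result = {result}, round trip (communication + remote work): {t1 - t0:.3f}s")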
Synchronization is necessary when certain parts of a computation must be completed before the overall computation can proceed. There are two aspects of synchronization that contribute to overhead. The first is the time to do the synchronization; usually this requires that all processors perform certain checks. The second aspect is that some, or even almost all, processors may become idle, waiting for clearance to proceed with the computation.
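One common synchronization construct is a barrier: no thread may proceed past it until all threads have arrived, so threads that finish their work early sit idle. A minimal Python sketch, with artificially unequal workloads as an illustrative assumption:

    # Sketch: synchronization overhead at a barrier with unequal work per thread.
    import threading
    import time

    barrier = threading.Barrier(4)

    def phase(worker_id, workload):
        time.sleep(workload)            # simulate unequal amounts of useful work
        t_wait = time.perf_counter()
        barrier.wait()                  # fast workers idle here until all arrive
        idle = time.perf_counter() - t_wait
        print(f"worker {worker_id}: idle {idle:.2f}s waiting at the barrier")

    threads = [threading.Thread(target=phase, args=(i, 0.1 * (i + 1))) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()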
The degree to which an algorithm can exploit a multiprocessor is often measured by either speed-up or efficiency. Ideally, we could solve a problem $p$ times as fast on $p$ processors as on a single processor. This ideal is rarely achieved; what is achieved is called the speed-up, defined by
\[
S_p = \frac{\text{execution time on a single processor}}{\text{execution time on } p \text{ processors}} .
\]
The efficiency can be defined as
\[
E_p = \frac{S_p}{p} .
\]
Since $S_p \le p$, we have $E_p \le 1$, and an efficiency of 1 corresponds to a perfect speed-up of $p$.
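As a worked example with assumed (not measured) timings: if an algorithm takes 10.0 s on one processor and 1.6 s on $p = 8$ processors, the definitions above give

    # Sketch: speed-up and efficiency from assumed (illustrative) timings.
    T_1 = 10.0    # execution time on a single processor, seconds
    T_p = 1.6     # execution time on p processors, seconds
    p = 8

    S_p = T_1 / T_p      # speed-up
    E_p = S_p / p        # efficiency, at most 1
    print(f"S_p = {S_p:.2f}, E_p = {E_p:.2f}")   # S_p = 6.25, E_p = 0.78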
The speed-up $S_p$ defined above is a measure of how a given algorithm compares with itself on 1 and $p$ processors. However, the parallel algorithm may not be the best algorithm on a single processor. Hence, a better measure of what is gained by parallel computation is given by the alternative definition
\[
S_p' = \frac{\text{execution time of the fastest serial algorithm on a single processor}}{\text{execution time of the parallel algorithm on } p \text{ processors}} .
\]
Both of the measures $S_p$ and $S_p'$ are useful, and the context of the discussion will determine which is to be used.
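Continuing the assumed timings above, if the fastest serial algorithm takes 8.0 s on one processor (less than the 10.0 s of the parallel algorithm run serially), the two measures differ:

    # Sketch: the two speed-up measures, with illustrative timings.
    T_parallel_alg_1 = 10.0   # parallel algorithm run on one processor
    T_best_serial = 8.0       # fastest known serial algorithm on one processor
    T_p = 1.6                 # parallel algorithm on p processors
    p = 8

    S_p = T_parallel_alg_1 / T_p       # compares the algorithm with itself: 6.25
    S_p_prime = T_best_serial / T_p    # compares against the best serial code: 5.0
    print(S_p, S_p_prime)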