Given by Geoffrey C. Fox at Delivered Lectures of CPS615 Basic Simulation Track for Computational Science on 10 September 1996. Foils prepared 27 October 1996
Summary of Material
This starts by filling in details of communication overhead in parallel processing for the case where the "range" of interaction is large |
We show two old examples from Caltech that illustrate the correctness of the analytic form |
We return to the discussion of computer architectures, describing
|
Geoffrey Fox |
NPAC |
Room 3-131 CST |
111 College Place |
Syracuse NY 13244-4100 |
The previous foil showed that increasing the stencil size made slight improvements! |
This foil shows that larger stencils have much lower overheads (and hence better parallel performance) than Laplace's equation with a simple 5-point stencil |
Showing linear overhead behavior for the communication overhead fc |
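The stencil dependence of the overhead can be sketched with a toy model. This is a minimal illustration of the standard analytic form for a 2-D grid decomposition, where communicated boundary data scales as the square root of the grain size n while computation scales as n; all parameter values (flops per point, the ratio of communication to floating point time) are illustrative placeholders, not measurements from the Caltech examples.

```python
# Hedged sketch: fractional communication overhead fc for a 2-D decomposition.
# Model: fc ~ (boundary points * t_comm) / (flops_per_point * n * t_float).
# Parameter values below are illustrative, not measured data.

def comm_overhead(n, flops_per_point, t_comm_over_t_float=10.0):
    """Fractional communication overhead for n grid points per processor
    in a 2-D decomposition; communicated edge data scales as sqrt(n)."""
    edge = 4 * n ** 0.5              # points on the subdomain boundary
    compute = flops_per_point * n    # floating point work inside the subdomain
    return edge * t_comm_over_t_float / compute

# A 5-point Laplace stencil does only a few flops per point; a larger stencil
# does more work per communicated point, so its overhead fraction is lower.
f_small = comm_overhead(n=10000, flops_per_point=4)
f_large = comm_overhead(n=10000, flops_per_point=16)
assert f_large < f_small
```

The 1/sqrt(n) factor is also why the overhead falls linearly when plotted against the reciprocal of the subdomain edge length.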
Sequential or von Neumann Architecture |
Vector (Super)computers |
Parallel Computers
|
Instructions and data are stored in the same memory, for which there is a single link (the von Neumann Bottleneck) to the CPU, which decodes and executes instructions |
The CPU can have multiple functional units |
The memory access can be enhanced by use of caches made from faster memory to allow greater bandwidth and lower latency |
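The benefit of a cache can be captured by the textbook average-access-time formula. This is a generic model, not data for any machine discussed in the lecture; the latencies and hit rate below are illustrative placeholders.

```python
# Hedged sketch: average memory access time with a cache (textbook model).
# All times and rates are illustrative, not measurements.

def avg_access_time(hit_rate, t_cache, t_memory):
    """Average access time: hits are served by the fast cache,
    misses pay the full memory latency."""
    return hit_rate * t_cache + (1.0 - hit_rate) * t_memory

# With a 10 ns cache and 100 ns memory, a 95% hit rate brings the
# average access time (14.5 ns) close to the cache speed.
t = avg_access_time(0.95, 10.0, 100.0)
```

A high hit rate is what lets the cache relieve the von Neumann bottleneck in practice.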
Fig 1.14 of Aspects of Computational Science |
Editor Aad van der Steen |
published by NCF |
This design enhances performance by noting that many applications calculate "vector-like" operations
|
This allows one to address two performance problems
|
They are typified by the Cray 1, XMP, YMP, C-90, the CDC-205 and ETA-10, and Japanese supercomputers from NEC, Fujitsu and Hitachi |
A pipeline for vector addition looks like:
|
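The throughput advantage of such a pipeline can be sketched with a simple cycle count. This is a minimal model, assuming an s-stage add pipeline that accepts one new element pair per clock once filled; the stage count is an illustrative placeholder, not that of any particular machine.

```python
# Hedged sketch: cycle counts for vector addition through an s-stage pipeline.
# Stage counts are illustrative, not those of a specific vector machine.

def pipelined_cycles(n, stages):
    """Cycles to add two length-n vectors: after the pipeline fills
    (stages cycles), one result emerges per cycle."""
    return stages + (n - 1)

def scalar_cycles(n, stages):
    """Unpipelined: every element pair pays the full latency."""
    return stages * n

# For long vectors the pipeline approaches one result per clock,
# a factor of ~stages faster than the scalar loop.
assert pipelined_cycles(1000, 4) == 1003
assert scalar_cycles(1000, 4) == 4000
```

This is also why vector machines need long vectors: for short ones the fill time dominates.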
Vector machines pipeline data through the CPU |
They are not as popular/relevant as in the past because
|
In fact the excellence of, say, the Cray C-90 is due to its very good memory architecture, which allows one to fetch enough operands to sustain the pipeline. |
Most workstation class machines have "good" CPUs but can never get enough data from memory to sustain good performance, except for a few cache-intensive applications |
Very High Speed Computing Systems, Proc. of the IEEE 54, 12, pp. 1901-1909 (1966) and |
Some Computer Organizations and Their Effectiveness, IEEE Trans. on Computers C-21, pp. 948-960 (1972) -- both papers by M.J. Flynn |
SISD -- Single Instruction stream, Single Data Stream -- i.e. von Neumann Architecture |
MISD -- Multiple Instruction stream, Single Data Stream -- Not interesting |
SIMD -- Single Instruction stream, Multiple Data Stream |
MIMD -- Multiple Instruction stream and Multiple Data Stream -- the dominant parallel system, with a roughly one-to-one match of instruction and data streams. |
Memory Structure of Parallel Machines
|
and Heterogeneous mixtures |
Shared (Global): There is a global memory space, accessible by all processors.
|
Distributed (Local, Message-Passing): All memory is associated with processors.
|
Memory can be accessed directly (analogous to a phone call) as in red lines below or indirectly by message passing (green line below) |
We show two processors in a MIMD machine for distributed (left) or shared(right) memory architectures |
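The two access styles in that figure can be contrasted in a small sketch. The names here (Processor, send, recv) are illustrative, not a real message-passing API; the point is only that shared memory is read directly while distributed memory requires an explicit exchange, the "phone call" of the analogy above.

```python
# Hedged sketch: direct access to a shared address space vs. explicit
# message passing between private local memories. Illustrative names only.
import queue

shared_memory = {"x": 42}           # shared: any processor reads it directly

class Processor:
    def __init__(self):
        self.local = {}              # distributed: memory private to its owner
        self.inbox = queue.Queue()   # incoming messages land here

    def send(self, other, key):
        # The "phone call": the owner must explicitly ship the value.
        other.inbox.put((key, self.local[key]))

    def recv(self):
        key, value = self.inbox.get()
        self.local[key] = value

p0, p1 = Processor(), Processor()
p0.local["x"] = shared_memory["x"]  # p0 owns x in its local memory
p0.send(p1, "x")                    # p1 can only see it via a message
p1.recv()
assert p1.local["x"] == 42
```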
Uniform: All processors take the same time to reach all memory locations. |
Nonuniform (NUMA): Memory access is not uniform, so that a given processor takes a different time to get data from each memory bank. This is natural for distributed memory machines but also true in most modern shared memory machines
|
Most NUMA machines these days have two memory access times
|
This simple two level memory access model gets more complicated in proposed 10 year out "petaflop" designs |
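The two-level model mentioned above amounts to a step function of the address. The sketch below is a minimal illustration of that model; the address range and the local/remote times are illustrative placeholders, not figures for any real machine.

```python
# Hedged sketch: the two-level NUMA access-time model.
# All times and ranges are illustrative placeholders.

def numa_access_time(address, local_range, t_local=1.0, t_remote=10.0):
    """Uniform fast access within the processor's own bank;
    a single, higher cost for every other bank."""
    lo, hi = local_range
    return t_local if lo <= address < hi else t_remote

# A processor owning addresses [0, 1024) pays t_local at home
# and t_remote everywhere else.
assert numa_access_time(100, (0, 1024)) == 1.0
assert numa_access_time(5000, (0, 1024)) == 10.0
```

Proposed "petaflop" designs would replace this step function with several levels, one per tier of the memory hierarchy.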
SIMD -- lockstep synchronization
|
MIMD - Each Processor executes independent instruction streams |
MIMD Synchronization can take several forms
|
MIMD Distributed Memory
|
MIMD with logically shared but usually physically distributed memory. This combination is sometimes called distributed shared memory.
|
A special case of this is a network of workstations (NOWs) or personal computers (a metacomputer) |
Issues include:
|