Given by Geoffrey C. Fox at Delivered Lectures of CPS615 Basic Simulation Track for Computational Science on 12 September 96. Foils prepared 11 November 1996
Geoffrey Fox |
NPAC |
Room 3-131 CST |
111 College Place |
Syracuse NY 13244-4100 |
This continues the computer architecture discussion with
|
SIMD -- lockstep synchronization
|
MIMD -- each processor executes an independent instruction stream |
MIMD Synchronization can take several forms
|
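As an informal illustration (a toy Python sketch, not drawn from any machine discussed in the lecture), the SIMD data-parallel style applies one instruction stream to many data elements in lockstep, while in the MIMD style each "processor" (modelled here by a thread) runs its own instruction stream and synchronizes explicitly, here by a barrier-like join:

    import threading
    import numpy as np

    # SIMD-style (data parallel): a single logical instruction is applied,
    # in lockstep, to every element of the data at once.
    a = np.arange(8)
    b = np.arange(8)
    c = a + b      # conceptually eight additions issued as one operation

    # MIMD-style: each "processor" (a thread here) executes its own,
    # different instruction stream on its own data.
    def worker(rank, results):
        if rank % 2 == 0:
            results[rank] = sum(range(1000))                         # even ranks do one task
        else:
            results[rank] = float(np.max(np.sin(np.arange(1000))))   # odd ranks do another

    results = {}
    threads = [threading.Thread(target=worker, args=(r, results)) for r in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # joining every thread acts like a barrier, one common MIMD synchronization form
    print(c, results)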
MIMD Distributed Memory
|
MIMD with logically shared memory but usually physically distributed. The latter is sometimes called distributed shared memory.
|
A special case of this is a network of workstations (NOWs) or personal computers (a metacomputer) |
Issues include:
|
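One such issue is that with distributed memory the processors can cooperate only by exchanging explicit messages. A minimal sketch, using Python's standard multiprocessing module rather than any machine-specific message-passing library: two independent processes, each with its own private data, send partial sums back over pipes.

    from multiprocessing import Process, Pipe

    def worker(rank, conn, data):
        # Each process holds its own private copy of `data`; nothing is shared.
        partial = sum(data)
        conn.send((rank, partial))   # explicit message passing, not shared memory
        conn.close()

    if __name__ == "__main__":
        parent0, child0 = Pipe()
        parent1, child1 = Pipe()
        p0 = Process(target=worker, args=(0, child0, list(range(0, 50))))
        p1 = Process(target=worker, args=(1, child1, list(range(50, 100))))
        p0.start(); p1.start()
        total = parent0.recv()[1] + parent1.recv()[1]   # combine the two partial sums
        p0.join(); p1.join()
        print("total =", total)   # 4950

The same pattern, with a library such as MPI in place of pipes, is how a network of workstations or a metacomputer is typically programmed.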
SIMD -- Single Instruction Multiple Data -- can have logically distributed or shared memory
|
CM-2 -- 64K processors with 1-bit arithmetic -- hypercube network, a broadcast network that can also combine, and a "global or" network |
MasPar, DECmpp -- 16K processors with 4-bit (MP-1) or 32-bit (MP-2) arithmetic; fast two-dimensional mesh and slower general switch for communication |
Also have heterogeneous compound architectures (metacomputers) obtained by arbitrary combinations of MIMD or SIMD, sequential or parallel machines. |
Metacomputers can vary from full collections of several hundred PCs/set-top boxes on the (future) World Wide Web to a CRAY C-90 connected to a CRAY T3D |
This is a critical future architecture which is intrinsically distributed memory, as multi-vendor heterogeneity implies that one cannot have special hardware-enhanced shared memory
|
Cluster of workstations or PCs |
Heterogeneous MetaComputer System |
One example is an associative memory -- SIMD or MIMD -- that is, content addressable memories |
This is an example of a special purpose "signal" processing machine which can in fact be built from "conventional" SIMD or MIMD architectures |
This type of machine is not so popular, as most applications are not dominated by computations for which good special purpose devices can be designed |
If only 10% of a problem is, say, "track-finding" or some other special purpose processing, then who cares if you reduce that 10% by a factor of 100?
|
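A back-of-the-envelope estimate in the spirit of Amdahl's law (not taken from the foil itself) makes the point quantitative:

    def overall_speedup(fraction, factor):
        # Only `fraction` of the run time is accelerated by `factor`;
        # the remaining (1 - fraction) is untouched.
        return 1.0 / ((1.0 - fraction) + fraction / factor)

    print(overall_speedup(0.10, 100))   # ~1.11: a 100x special purpose device buys only ~11% overall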
N body problems (e.g. Newton's laws for one million stars in a globular cluster) can have successful special purpose devices |
See the GRAPE (GRAvity PipE) machine (Sugimoto et al., Nature 345, page 90, 1990)
|
Note GRAPE uses EXACTLY the same parallel algorithm that one finds in the books (e.g. Solving Problems on Concurrent Processors) for N-body problems on classic distributed memory MIMD machines |
GRAPE will execute the classic O(N^2) (parallel) N body algorithm BUT this is not the algorithm used in most such computations |
Rather there is the O(N) or O(N log N) so-called "fast multipole" algorithm which uses a hierarchical approach
|
So special purpose devices cannot usually take advantage of new nifty algorithms! |
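For concreteness, here is a deliberately naive Python sketch of the direct O(N^2) force summation that GRAPE-style pipelines and the classic distributed-memory algorithm both compute; the softening length, units and particle count are illustrative assumptions, not GRAPE's actual design parameters.

    import numpy as np

    def direct_accelerations(pos, mass, G=1.0, eps=1.0e-3):
        # Direct O(N^2) summation: every particle interacts with every other.
        # pos is an (N, 3) array of positions, mass an (N,) array of masses;
        # eps is a softening length that avoids singularities in close encounters.
        n = len(mass)
        acc = np.zeros_like(pos)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                r = pos[j] - pos[i]
                r2 = np.dot(r, r) + eps * eps
                acc[i] += G * mass[j] * r / r2**1.5
        return acc

    rng = np.random.default_rng(0)
    acc = direct_accelerations(rng.standard_normal((100, 3)), np.ones(100))
    print(acc.shape)   # (100, 3)

Tree and fast multipole codes replace the inner loop over all j by a traversal of a hierarchy of cells, which is exactly the kind of algorithmic change a hard-wired pipeline cannot absorb.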
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors.
|
Medium-grain: Tens to few thousands of pieces, typically executed by microprocessors.
|
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines.
|
Note that a machine of given granularity can be used on algorithms of the same or finer granularity |
The last major architectural feature of a parallel machine is the network or design of hardware/software connecting processors and memories together. |
Bus: All processors (and memory) connected to a common bus or busses.
|
Switching Network: Processors (and memory) connected to routing switches, as in the telephone system.
|
Examples include the two-dimensional grid, binary tree, complete interconnect and 4D hypercube; see the sketch of their basic properties below. |
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are only partial |
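As a rough sketch (the formulas are the standard textbook ones; the processor count 64 is just an example), one can compare the node degree (pinout) and diameter (worst-case number of hops) of these example topologies:

    import math

    def topology_properties(p):
        # Maximum node degree and diameter for the example topologies above.
        # Assumes p is a perfect square (for the k x k grid) and a power of two
        # (for the hypercube); the binary tree figures are approximate when
        # p is not exactly 2**(h+1) - 1.
        k = math.isqrt(p)                 # side of the k x k grid
        d = int(math.log2(p))             # hypercube dimension
        h = int(math.log2(p + 1)) - 1     # approximate height of a complete binary tree
        return {
            "2D grid (k x k)":        (4,     2 * (k - 1)),
            "binary tree":            (3,     2 * h),
            "complete interconnect":  (p - 1, 1),
            "hypercube (d = log2 p)": (d,     d),
        }

    for name, (degree, diameter) in topology_properties(64).items():
        print(f"{name:24s} degree {degree:3d}   diameter {diameter:3d}")

Low-degree networks such as the grid and tree scale well in pinout and wire length but pay in diameter; the hypercube balances the two; the complete interconnect does not scale at all. This is the trade-off behind the Scalability item below.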
Useful terms include: |
Scalability: Can the network be extended to very large systems? Related to wire length (synchronization and driving problems) and degree (pinout) |
Fault Tolerance: How easily can system bypass faulty processor, memory, switch, or link? How much of system is lost by fault? |
Blocking: Some communication requests may not get through, due to conflicts caused by other requests. |
Nonblocking: All communication requests succeed. Sometimes this just applies as long as no two requests are for the same memory cell or processor. |
Latency (delay): Maximal time for nonblocked request to be transmitted. |
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on time needed for task. |
Wormhole Routing -- intermediate switch nodes do not wait for the full message but allow it to pass through in small packets |
From Aspects of Computational Science, Editor Aad van der Steen, published by NCF |
System                  Communication Speed (Mbytes/sec)   Computation Speed (Mflops)
IBM SP2                              40                              267
Intel iPSC860                         2.8                             60
Intel Paragon                       200                               75
Kendall Square KSR-1                 17.1                             40
Meiko CS-2                          100                              200
Parsytec GC                          20                               25
TMC CM-5                             20                              128
Cray T3D                            150                              300
tcomm = 4 or 8 / (Communication Speed in Mbytes/sec) -- the time to communicate one 32-bit or 64-bit word
|
tfloat = 1 / (Computation Speed in Mflops) -- the time for one floating point operation |
Thus tcomm / tfloat is just 4 X Computation Speed divided by Communication Speed (taking 4-byte words) |
tcomm / tfloat is 26.7, 85, 1.5, 9.35, 8, 5, 25.6, 8 for the machines SP2, iPSC860, Paragon, KSR-1, Meiko CS2, Parsytec GC, TMC CM5, and Cray T3D respectively |
Latency makes the situation worse for small messages, and the ratio doubles for the 64-bit arithmetic that is natural on large problems! |
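The ratio can be reproduced directly from the table above; a small sketch with the table's numbers typed in by hand (so any transcription slips are mine), assuming 4-byte words:

    # (communication speed in Mbytes/sec, computation speed in Mflops) from the table
    machines = {
        "IBM SP2":       (40.0, 267.0),
        "Intel iPSC860": (2.8,   60.0),
        "Intel Paragon": (200.0, 75.0),
        "KSR-1":         (17.1,  40.0),
        "Meiko CS-2":    (100.0, 200.0),
        "Parsytec GC":   (20.0,  25.0),
        "TMC CM-5":      (20.0, 128.0),
        "Cray T3D":      (150.0, 300.0),
    }

    word_bytes = 4   # 32-bit words; use 8 for 64-bit arithmetic
    for name, (comm, flops) in machines.items():
        tcomm = word_bytes / comm    # microseconds to communicate one word
        tfloat = 1.0 / flops         # microseconds per floating point operation
        print(f"{name:14s} tcomm/tfloat = {tcomm / tfloat:5.1f}")

This prints 26.7, 85.7, 1.5, 9.4, 8.0, 5.0, 25.6 and 8.0; the small differences from the figures quoted above are just rounding.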
Transmission Time for message of n bytes: |
T(n) = T0 + T1 n, where |
T0 is the latency; it contains a term proportional to the number of hops, plus a term representing the interrupt processing time at the start of the message and the time for the communication network and processor to synchronize |
T0 = Ts + Td * (Number of hops) |
T1 is the inverse bandwidth -- it can be made small if the pipe (channel) is of large size |
In practice Ts and T1 are most important and Td is unimportant |
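A minimal sketch of this model, with made-up parameter values chosen only to show the relative sizes of the terms (they are not measurements of any machine above):

    def transmission_time(n_bytes, ts=50.0, td=0.5, hops=4, t1=0.025):
        # T(n) = T0 + T1 * n, with T0 = Ts + Td * (number of hops).
        # ts : startup / interrupt-processing latency in microseconds (assumed value)
        # td : per-hop delay through intermediate switches in microseconds (assumed value)
        # t1 : inverse bandwidth in microseconds per byte; 0.025 = 40 Mbytes/sec (assumed value)
        t0 = ts + td * hops
        return t0 + t1 * n_bytes

    for n in (8, 1024, 1_000_000):
        print(f"{n:>9d} bytes: {transmission_time(n):10.1f} microseconds")

With numbers of this order, the Td term (2 microseconds here) is dwarfed by Ts (50 microseconds) for short messages and by T1 n for long ones, which is the sense in which Ts and T1 dominate.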
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
Square blocks indicate shared memory copy performance |