Given by Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science in Fall Semester 95. Foils prepared 18 Sept 1995
Outside Index
Summary of Material
This starts with a discussion of Parallel Computing using analogies from nature |
It uses foils and material from the CSEP chapter on Computer Architecture to discuss how and why to build a parallel computer, including synchronization, memory structure and network issues |
SIMD and MIMD Architectures, with a brief comparison of workstation networks with closely coupled systems |
A look to the future is based on results from the Petaflops workshop |
Outside Index Summary of Material
Geoffrey Fox |
NPAC |
Room 3-131 CST |
111 College Place |
Syracuse NY 13244-4100 |
Each node is a CPU and 6 memory chips -- the CPU chip integrates communication channels with floating-point, integer and logical CPU functions |
32-node CM-5 and, in the foreground, an old CM-2 disk vault |
Domain decomposition is simple, but general and extensible to many more nodes |
All successful concurrent machines have obtained parallelism from "Data Parallelism" or "Domain Decomposition" |
A problem is an algorithm applied to a data set |
The three architectures considered here differ as follows: |
MIMD Distributed Memory |
MIMD Shared Memory |
SIMD Distributed Memory |
2 Different types of Mappings in Physical Spaces |
Both are static |
Different types of Mappings -- A very dynamic case without any underlying Physical Space |
c) Computer Chess with dynamic game tree decomposed onto 4 nodes |
And the corresponding poor workload balance |
And excellent workload balance |
The fundamental principles behind the use of concurrent computers are identical to those used in society - in fact they are partly why society exists. |
If a problem is too large for one person, one does not hire a SUPERman, but rather puts together a team of ordinary people... |
cf. Construction of Hadrian's Wall |
Domain Decomposition is Key to Parallelism |
Need "Large" Subdomains: l >> l_overlap |
AMDAHL's LAW or |
Too many cooks spoil the broth |
Says that |
Speedup S is small if efficiency e is small |
or for Hadrian's wall |
equivalently S is small if the length l is small |
But this is irrelevant as we do not need parallel processing unless the problem is big! |
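As a small illustrative sketch (not from the original foils), the Hadrian's Wall picture can be put into numbers with a toy model in which each mason builds a segment of length l/N and pays an edge overhead proportional to l_overlap; the function name and constants below are assumptions for illustration only.

    # Toy speedup model for the Hadrian's Wall analogy (illustrative assumptions).
    def speedup(n_masons, wall_length, overlap=1.0):
        """S = N * efficiency, with efficiency = 1/(1 + fc) and the edge
        overhead fc taken as overlap / (segment length per mason)."""
        segment = wall_length / n_masons        # length each mason builds
        fc = overlap / segment                  # edge (overlap) overhead
        efficiency = 1.0 / (1.0 + fc)
        return n_masons * efficiency, efficiency

    for length in (64, 1024, 16384):            # total wall length in "bricks"
        s, e = speedup(n_masons=64, wall_length=length)
        print(f"wall length {length:6d}: efficiency {e:.2f}, speedup {s:5.1f}")

The output shows exactly the foil's point: the speedup S is small when the efficiency e is small, which only happens when the wall length l is small relative to the number of masons.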
"Pipelining" or decomposition by horizontal section is:
|
Hadrian's Wall is one dimensional |
Humans represent flexible processor nodes that can be arranged in different ways for different problems |
The lesson for computing is: |
Original MIMD machines used a hypercube topology. The hypercube embeds many other topologies, including all meshes. It is a flexible concurrent computer that can tackle a broad range of problems. Current machines use a different interconnect structure from the hypercube but preserve this capability. |
Comparing Computer and Hadrian's Wall Cases |
The case of Programming a Hypercube |
Each node runs software that is similar to sequential code |
e.g., FORTRAN with geometry and boundary value sections changed |
The geometry is irregular, but each brick takes about the same amount of time to lay. |
Decomposition of the wall for an irregular geometry involves equalizing the number of bricks per mason, not the length of wall per mason. |
Fundamental entities (bricks, gargoyles) are of different complexity |
The best decomposition is dynamic |
Inhomogeneous problems run on concurrent computers but require dynamic assignment of work to nodes and strategies to optimize this |
(we use neural networks, simulated annealing, spectral bisection etc.) |
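The foil names real methods (neural networks, simulated annealing, spectral bisection); as a far simpler stand-in, the sketch below uses a static greedy heuristic that equalizes the number of bricks per mason rather than the length of wall per mason. The section sizes and names are invented for illustration.

    import heapq

    def balance_bricks(bricks_per_section, n_masons):
        """Greedy load balancing: give each wall section to the currently
        least-loaded mason, equalizing bricks (work), not wall length.
        (A simple stand-in for the annealing/spectral methods on the foil.)"""
        loads = [(0, m) for m in range(n_masons)]          # (bricks assigned, mason id)
        heapq.heapify(loads)
        assignment = {}
        # Place the big sections first for a better balance.
        for section, bricks in sorted(enumerate(bricks_per_section),
                                      key=lambda kv: -kv[1]):
            load, mason = heapq.heappop(loads)
            assignment[section] = mason
            heapq.heappush(loads, (load + bricks, mason))
        return assignment, sorted(load for load, _ in loads)

    sections = [120, 80, 200, 40, 40, 160, 90, 70]         # bricks in each wall section
    assign, final_loads = balance_bricks(sections, n_masons=4)
    print(assign)
    print("bricks per mason:", final_loads)

A truly inhomogeneous problem would rerun such an assignment as the work changes, which is the dynamic case the foil refers to.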
Global Parallelism |
Local Parallelism |
Local and Global Parallelism should both be exploited |
Disk (input/output) technology is better matched to several modest-power processors than to a single sequential supercomputer |
Concurrent computers are natural in databases and transaction analysis |
At the finest resolution, the brain is a collection of neurons sending and receiving messages by axons and dendrites |
At a coarser resolution |
Society is a collection of brains sending and receiving messages by sight and sound |
Ant Hill is a collection of ants (smaller brains) sending and receiving messages by chemical signals |
Lesson: All Nature's Computers Use Message Passing |
With several different Architectures |
Problems are large -- use domain decomposition; overheads are edge effects |
The topology of the processor matches that of the domain -- a processor with a rich, flexible node/topology matches most domains |
Regular homogeneous problems are easiest, but irregular or inhomogeneous problems can also be handled |
Can use local and global parallelism |
Can handle concurrent calculation and I/O |
Nature always uses message passing as in parallel computers (at lowest level) |
Geoffrey Fox |
Syracuse University |
NPAC 3-131 CST, 111 College Place |
Syracuse NY 13244 |
Parallel Computer and Network Architecture |
Overview of Issues including synchronization, granularity and 3 classes of architectures |
More details on networks |
More details on system architectures |
MIMD Distributed Memory |
MIMD with logically shared memory but perhaps physically distributed. The latter is sometimes called distributed shared memory. |
SIMD with logically distributed or shared memory |
One interesting hardware architecture ......... |
Also have heterogeneous compound architectures (metacomputers) obtained by arbitrary combinations of the above. |
This list stems from 1990 |
Two critical Issues are: |
Memory Structure |
and Heterogeneous mixtures |
Control and synchronization |
SIMD -- lockstep synchronization |
MIMD -- synchronization can take several forms |
An implied constraint on the ordering of operations. |
Synchronous: Objects (processors, network, tasks) proceed together in lockstep fashion |
Asynchronous: Objects may proceed at somewhat different speeds, sometimes with occasional synchronization points (barriers) to allow slower objects to catch up. |
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors. Pieces and processors may be heterogeneous. Computation/communication ratio very high -- typical of Networked Metacomputing |
Medium-grain: Tens to a few thousand pieces, typically executed by microprocessors. Processors typically run the same code (SPMD style). Computation/communication ratio often hundreds or more. Typical of MIMD Parallel Systems such as SP2, CM5, Paragon, T3D |
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines. Processors typically have instructions broadcast to them. Computation/communication ratio often near unity. Typical of SIMD but seen in a few MIMD systems such as Dally's J Machine or the commercial Myrinet (Seitz) |
Note that a machine of one type can be used on algorithms of the same or finer granularity |
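As an illustration of how grain size drives these computation/communication ratios, the sketch below assumes a 2-D domain decomposition in which work scales with the area of a node's subdomain and message traffic with its perimeter; the flop and word counts per point are made-up parameters, not measurements.

    def comp_comm_ratio(points_per_node, flops_per_point=10.0, words_per_edge_point=1.0):
        """Rough computation/communication ratio for a square 2-D subdomain:
        work grows with the area, messages grow with the perimeter.
        The constants are illustrative assumptions only."""
        side = points_per_node ** 0.5
        computation = flops_per_point * points_per_node
        communication = words_per_edge_point * 4 * side     # four edges
        return computation / communication

    for n in (16, 4_096, 1_048_576):     # fine, medium, coarse grain sizes
        print(f"{n:9d} points/node -> ratio ~ {comp_comm_ratio(n):7.1f}")

Small subdomains give a ratio near unity (fine grain); large ones give ratios in the hundreds or thousands (medium to coarse grain), matching the classes above.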
Shared (Global): There is a global memory space, accessible by all processors. Processors may also have some local memory. Algorithms may use global data structures efficiently. However "distributed memory" algorithms may still be important as memory is NUMA (Nonuniform access times) |
Distributed (Local, Message-Passing): All memory is associated with processors. To retrieve information from another processor's memory, a message must be sent there. Algorithms should use distributed data structures. |
Uniform: All processors take the same time to reach all memory locations. |
Nonuniform (NUMA): Memory access is not uniform, so the time for a given processor to get data depends on which memory bank holds it. This is natural for distributed memory machines but is also true of most modern shared memory machines |
Most NUMA machines these days have just two memory access times |
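A tiny numeric illustration of why data locality still matters on such machines; the two access times and the 10x local/remote ratio are assumptions chosen only to make the point.

    def average_access_time(local_fraction, t_local=1.0, t_remote=10.0):
        """Average access time on a NUMA machine with just two access times
        (local and remote), in units of the local access time.
        The 10x remote penalty is an illustrative assumption."""
        return local_fraction * t_local + (1.0 - local_fraction) * t_remote

    for f in (0.5, 0.9, 0.99):
        print(f"{f:4.0%} local accesses -> average {average_access_time(f):.2f}x local time")

Even with 90% of references local, the average access is nearly twice the local time, which is why "distributed memory" algorithms remain important on NUMA shared memory machines.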
Useful terms include: |
Scalability: Can the network be extended to very large systems? Related to wire length (synchronization and driving problems) and degree (pinout) |
Fault Tolerance: How easily can system bypass faulty processor, memory, switch, or link? How much of system is lost by fault? |
Blocking: Some communication requests may not get through, due to conflicts caused by other requests. |
Nonblocking: All communication requests succeed. Sometimes just applies as long as no two requests are for same memory cell or processor. |
Latency (delay): Maximal time for nonblocked request to be transmitted. |
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on time needed for task. |
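As a sketch of the cutset argument, assume a global data exchange (such as a transpose) in which half of each node's data must cross the machine's bisection; the node count, data volume and link bandwidth below are invented for illustration.

    def bisection_lower_bound(n_nodes, bytes_per_node, link_mb_per_s, bisection_links):
        """Lower bound (seconds) on any operation that must move half of every
        node's data across the bisection; all parameter values are assumptions."""
        bytes_crossing = n_nodes * bytes_per_node / 2.0    # half the data changes sides
        bandwidth = bisection_links * link_mb_per_s * 1e6  # bytes/second across the cut
        return bytes_crossing / bandwidth

    # A 16 x 16 mesh of 256 nodes: the bisection cuts 16 links, assumed 100 MB/sec each.
    print(bisection_lower_bound(256, bytes_per_node=8e6,
                                link_mb_per_s=100, bisection_links=16), "seconds")

No algorithm or routing trick can beat this bound; it depends only on the cut, which is why cutsets give useful lower bounds on the time needed for a task.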
Switch |
Bus |
Classes of networks include: |
Bus: All processors (and memory) connected to a common bus or busses. |
Switching Network: Processors (and memory) connected to routing switches, as in the telephone system. |
Processors directly connected to only certain other processors, and must go multiple hops to get to additional processors. Also called store-and-forward. |
Usually distributed memory |
Examples: |
Usually NUMA, nonblocking, scalable, upgradable |
Processor connectivity is modeled by a graph in which nodes represent processors and edges represent the connections between them. Much theoretical work was done here, but it is now obsolete (irrelevant): even if this hardware is used, pipelining masks the transmission time, which ends up showing no term proportional to the distance travelled |
Two dimensional grid, Binary tree, complete interconnect and 4D Hypercube. |
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are partial |
The mesh seems to have a poor diameter, but Dally and Seitz emphasize that one can make up for this as it is easy to build physically (circuits are two dimensional). Thus it can have fewer links, but as these are physically short they can be "thick" (very high bandwidth) |
The hypercube is a compromise, so both degree and diameter grow modestly. High degree implies hard to build -- low diameter improves communication |
A tree has a bottleneck at the root (top). The fat tree addresses this, as in the CM-5, by increasing the bandwidth of the links as one goes up the tree |
A Hypercube is 2^d computers arranged on the corners of a cube in d dimensions, each connected to its d nearest neighbors |
Label each hypercube node e1 e2 e3 e4 ... ed |
where each ek takes the value zero or 1 depending on whether the vertex is at the "top" or "bottom" of that axis. |
Think of the hypercube as the unit cube in d dimensions with the origin at the bottom left-hand corner |
In the string e1 e2 e3 e4 ... ed, one can write d = d1 + d2 + d3 |
Any set of d1 binary indices can be mapped into a one dimensional line with periodic (wrap-around) connections |
the remaining d2 + d3 indices give the other two dimensions |
This decomposes the hypercube into a mesh of 2^d1 by 2^d2 by 2^d3 nodes |
So the Hypercube includes lots of meshes |
e.g. a d=6 hypercube contains 4 by 4 by 4, 8 by 8, 4 by 16, 64 by 1 meshes, etc. |
So can study one dimension without loss of generality! |
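The mapping of d1 binary indices onto a line with wrap-around connections is the standard reflected Gray code embedding of a ring into a hypercube. The short check below (a standard construction, written here only as an illustration) generates it and verifies that neighbouring positions, including the wrap-around, differ in exactly one bit, i.e. are directly connected hypercube nodes.

    def gray_code(d):
        """Ring position 0..2^d-1 -> hypercube node label, so that neighbouring
        positions differ in exactly one bit (one hypercube link)."""
        return [i ^ (i >> 1) for i in range(2 ** d)]

    d = 4
    ring = gray_code(d)
    for pos in range(2 ** d):
        a, b = ring[pos], ring[(pos + 1) % 2 ** d]   # includes the wrap-around link
        assert bin(a ^ b).count("1") == 1            # exactly one bit differs
    print([format(node, f"0{d}b") for node in ring])

Applying this independently to the d1, d2 and d3 index groups gives the two- and three-dimensional meshes described above.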
d= 1 -- 2 nodes |
d= 2 -- 4 nodes |
d= 3 -- 8 nodes |
So in Hadrian's wall we saw a one dimensional layout of processors (people) to solve a one dimensional problem |
However computers need to be laid out in two or three dimensional worlds and we can use same ideas. |
We consider one dimension, as the discussion in two or three dimensions is similar: each axis (dimension) corresponds to a distinct subset of the binary indices |
Hypercube: |
Mesh with equivalent wire length |
So can let long hypercube wires have "local" stops and hypercube becomes a "thick mesh" |
With good routing technologies, a thick mesh is better than a hypercube? |
Hardware may handle only a single hop, or multiple hops as in the routing chips on the Paragon. |
Software may mask hardware limitations so that one sees full connectivity even if it is physically limited. Note that multiple hops always lead to poorer latency, as this is the travel time per bit. However, we can keep the bandwidth high even with multiple hops by increasing the "size" of the channels, e.g. transmitting several bits simultaneously. Software can hide this. |
Latency related to graph diameter, among many other factors |
Graph may also represent calculations that need to be performed, and the information exchange required. |
Transmission Time for message of n bytes: |
T0 + T1 n where |
T0 is the latency; it contains a term proportional to the number of hops, plus a term representing the interrupt processing time at the start and the time for the communication network and processor to synchronize |
T0 = Ts + Td × (number of hops) |
T1 is the inverse bandwidth -- it can be made small if the pipe is of large size. |
In practice Ts and T1 are the most important and Td is unimportant |
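A small numerical illustration of the T0 + T1 n model; the startup, per-hop and bandwidth figures below are illustrative assumptions for a mid-1990s MPP, not the measured values in the Dongarra and Dunigan figures that follow.

    def transfer_time(n_bytes, t_startup_us=50.0, t_per_hop_us=0.5, hops=4,
                      bandwidth_mb_s=40.0):
        """Message time T = T0 + T1*n with T0 = Ts + Td*hops and T1 = 1/bandwidth.
        All constants are illustrative assumptions."""
        t0 = (t_startup_us + t_per_hop_us * hops) * 1e-6   # seconds
        t1 = 1.0 / (bandwidth_mb_s * 1e6)                  # seconds per byte
        return t0 + t1 * n_bytes

    for n in (8, 1_000, 1_000_000):
        print(f"{n:9d} bytes: {transfer_time(n) * 1e6:10.1f} microseconds")

With these numbers the per-hop term Td contributes only 2 of the roughly 50 microseconds of startup, which is the sense in which Ts and T1 matter and Td does not.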
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
Square blocks indicate shared memory copy performance |
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
Note that processor networks such as the torus or hypercube can be used to build switches when one puts "just" switch chips, rather than processors, on the internal nodes. |
Thus switches involve the same trade-offs -- store-and-forward versus circuit-switched -- as processor networks. |
Switch-based machines typically have the same delay (travel time) between any two processors |
In processor networks, some processors can be nearer to each other if fewer hops are needed |
BUT in all modern machines, the low-level hardware essentially makes all these architectures the SAME. There are only two times of importance, corresponding to whether or not there is DATA LOCALITY |
Rather than having messages travel a single hop at a time, sometimes a circuit is first established from sender to receiver, and the data is then transmitted along the circuit. |
Similar to establishing a phone connection |
Can result in significantly lower communication overhead, as the latency is deterministic once the circuit is established |
If the circuit is blocked, many options are possible, including: |
At high levels of message traffic, performance is sometimes severely degraded. |
Many issues both between architectures and internal to each architecture |
The situation is confused, as one needs to understand: |
is a particular implementation done right? |
Perhaps all that is important is what either the user or the high-level software (compiler/message passing system) sees. |
A significant problem is that there is essentially no application software and essentially no machines are sold |
Real World Architectural issue -- what problems will produce substantial sales -- should one optimize for these? |
System integration issues |
Why is MPP always behind the best workstations? |
Must reduce this delay by matching the parallel design as much as possible to |
Memory per node (SIMD and MIMD) |
A special case of this is a network of workstations (NOW's) or personal computers |
Issues include: |
Choice of Node is a dominant issue? |
Network Topology as described is not very important today? |
However we still have two major types of network: |
CM-2 -- 64K processors with 1-bit arithmetic -- hypercube network, a broadcast network that can also combine, "global or" network |
MasPar, DECmpp -- 16K processors with 4-bit (MP-1) or 32-bit (MP-2) arithmetic, two-dimensional mesh and general switch |
Execube -- 16-bit processors, 8 of them integrated into an IBM 4-Mbit memory chip; SIMD or MIMD or both |
512 processors on an IBM RS6000 with a three-dimensional mesh |
Node: 1-bit, 4-bit, 16-bit, or 32-bit? |
Interconnect issues similar to MIMD |
Critical point -- is SIMD more cost-effective than MIMD on a significant number of problems? |
The current consensus is the opposite of 1990 -- namely, MIMD now dominates. |
Some machines have a "true" shared memory, i.e. every location is "equally accessible" by each node: the original Ultracomputer, Burton Smith's Tera, some bus-based systems |
Others, such as the BBN Butterfly and Kendall Square KSR-2, have non-uniform access times but all memory locations are accessible |
Expected Architectures of the Future will be: |
Essentially all problems run efficiently on a distributed memory BUT |
Software is easier to develop on a shared memory machine |
Some Shared Memory Issues: |
Each C.P.U. -- two pipes; each pipe does one add and one multiply per clock period
Cycle Time 4 nanoseconds
Each C.P.U. has up to 4 functional units
Cycle Time 2.9 nanoseconds
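From the figures on this foil one can form a nominal peak rate per CPU; the little calculation below simply restates that arithmetic (no particular machine is assumed beyond the numbers quoted above).

    def peak_gflops_per_cpu(pipes, flops_per_pipe_per_clock, cycle_time_ns):
        """Nominal peak rate: floating point results per clock / clock period.
        Result is in Gflop/s since the clock is given in nanoseconds."""
        return pipes * flops_per_pipe_per_clock / cycle_time_ns

    # Two pipes, each doing one add and one multiply per clock, 4 ns cycle time.
    print(peak_gflops_per_cpu(pipes=2, flops_per_pipe_per_clock=2, cycle_time_ns=4.0),
          "Gflop/s per C.P.U.")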
Note MIMD parallelism is "larger" than SIMD, as the MIMD size reflects the number of grid points, particles etc. |
Only Superconducting Technology can possibly do better |
Need Optical Interconnects probably |
See Chapter 5 of Petaflops Report -- July 94 |
See Chapter 6 of Petaflops Report -- July 94 |
This shared memory design is the natural evolution of systems such as the Cray-2, Cray-3 or Cray C-90 |
See Chapter 6 of Petaflops Report -- July 94 |
This architecture generalizes current IBM SP-2 type systems and, unlike Category I, requires data locality for the up to 40,000 CPUs to be able to function efficiently with minimum communication overhead |
See Chapter 6 of Petaflops Report -- July 94 |
This design is an extrapolation of systems such as the J Machine (Dally), Execube (Loral) or Mosaic (Seitz). It features CPU and memory integrated on the same chip (PIM). |
Unlike such systems today, in the year 2015 such PIM designs would have substantial memory per processor |
This is the formal latency needed to ensure that one can reach the design goals of the previous table. 1 picosecond (Category I) cannot be attained; you solve this by having N concurrent streams -- then each of them only needs a latency of N picoseconds |
e.g. if the output rate of an arithmetic unit is 1 result per nanosecond, then each of the 400 teraflop CPUs in the Category I design must have 1000 arithmetic units running at full speed to reach the full system performance of 0.4 petaflops |
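A quick check of that arithmetic, using only the numbers quoted on this foil:

    unit_rate = 1e9            # one result per nanosecond per arithmetic unit (flop/s)
    system_goal = 0.4e15       # 0.4 petaflops full-system performance
    cpus = 400                 # Category I design
    print(system_goal / (cpus * unit_rate), "arithmetic units per CPU")   # -> 1000.0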
See Chapter 6 of Petaflops Report -- July 94 |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |