Given by Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science in Fall Semester 1995. Foils prepared 21 October 1995
Geoffrey Fox |
Syracuse University |
NPAC 3-131 CST, 111 College Place |
Syracuse NY 13244 |
Parallel Computer and Network Architecture |
Overview of Issues including synchronization, granularity and 3 classes of architectures |
More details on networks |
More details on system architectures |
MIMD Distributed Memory |
MIMD with logically shared memory but perhaps physically distributed. The latter is sometimes called distributed shared memory. |
SIMD with logically distributed or shared memory |
One interesting hardware architecture .........
|
Also have heterogeneous compound architecture (metacomputer) obtained by arbitrary combination of the above.
|
This list stems from 1990 |
Two critical Issues are: |
Memory Structure
|
and Heterogeneous mixtures |
Control and synchronization |
SIMD -- lockstep synchronization |
MIMD -- synchronization can take several forms
|
An implied constraint on the ordering of operations.
|
Synchronous: Objects (processors, network, tasks) proceed together in lockstep fashion |
Asynchronous: Objects may proceed at somewhat different speeds, sometimes with occasional synchronization points (barriers) to allow slower objects to catch up. |
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors. Pieces and processors may be heterogeneous. Computation/communication ratio very high -- typical of Networked Metacomputing |
Medium-grain: Tens to a few thousand pieces, typically executed by microprocessors. Processors typically run the same code (SPMD style). Computation/communication ratio often hundreds or more -- typical of MIMD Parallel Systems such as the SP2, CM-5, Paragon, T3D |
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines. Processors typically have instructions broadcast to them. Computation/communication ratio often near unity -- typical of SIMD but seen in a few MIMD systems such as Dally's J Machine or commercial Myrinet (Seitz) (see the sketch below) |
Note that a machine of one type can be used on algorithms of the same or finer granularity |
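As an illustration of how grain size controls the computation/communication ratio used in the classification above -- a rough model of a two-dimensional grid problem, with illustrative numbers, not from the foils:

```python
# Rough model (illustrative): computation/communication ratio for a 2D grid
# problem split into square n x n subdomains, one subdomain per processor.
def comp_comm_ratio(n, flops_per_point=5, words_per_edge_point=1):
    """Each processor updates n*n points (a 5-point stencil is assumed)
    and must exchange only the 4n boundary points with its neighbours."""
    computation = flops_per_point * n * n         # work done locally
    communication = words_per_edge_point * 4 * n  # data crossing the boundary
    return computation / communication

for n in (4, 32, 1000):   # fine, medium and coarse grains
    print(f"n = {n:5d}  ratio ~ {comp_comm_ratio(n):.0f}")
```

The ratio grows linearly with the grain size n, which matches the fine/medium/coarse categories above.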
Shared (Global): There is a global memory space, accessible by all processors. Processors may also have some local memory. Algorithms may use global data structures efficiently. However "distributed memory" algorithms may still be important as memory is NUMA (Nonuniform access times) |
Distributed (Local, Message-Passing): All memory is associated with processors. To retrieve information from another processor's memory a message must be sent there. Algorithms should use distributed data structures. |
Uniform: All processors take the same time to reach all memory locations. |
Nonuniform (NUMA): Memory access is not uniform so that it takes a different time to get data by a given processor from each memory bank. This is natural for distributed memory machines but also true in most modern shared memory machines
|
Most NUMA machines these days have two memory access times
|
Classes of networks include: |
Bus: All processors (and memory) connected to a common bus or busses.
|
Switching Network: Processors (and memory) connected to routing switches as in the telephone system.
|
Useful terms include: |
Scalability: Can network be extended to very large systems? Related to wire length (synchronization and driving problems), degree (pinout) |
Fault Tolerance: How easily can system bypass faulty processor, memory, switch, or link? How much of system is lost by fault? |
Blocking: Some communication requests may not get through, due to conflicts caused by other requests. |
Nonblocking: All communication requests succeed. Sometimes just applies as long as no two requests are for same memory cell or processor. |
Latency (delay): Maximal time for nonblocked request to be transmitted. |
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on time needed for task. |
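A small sketch of how a cutset gives such a lower bound (the data size and cut bandwidth here are invented for illustration): if D bytes must cross a cut of total bandwidth B, the task needs at least D/B seconds, whatever the algorithm.

```python
# Illustrative cutset lower bound (assumed numbers, not machine data):
# if D bytes must cross a cut whose links have total bandwidth B,
# then time >= D / B.
def cut_lower_bound(bytes_crossing_cut, cut_bandwidth_bytes_per_s):
    return bytes_crossing_cut / cut_bandwidth_bytes_per_s

# Example: transposing a 1 GByte matrix -- half the data must cross the
# bisection, whose total bandwidth is taken to be 400 MBytes/sec.
data_crossing = 0.5e9     # bytes forced through the cut
bisection_bw  = 400e6     # bytes/sec summed over the cut links (assumed)
print(f"lower bound = {cut_lower_bound(data_crossing, bisection_bw):.2f} sec")
```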
Switch |
Bus |
Processors directly connected to only certain other processors, and must go multiple hops to get to additional processors. Also called store-and-forward. |
Usually distributed memory |
Examples:
|
Usually NUMA, nonblocking, scalable, upgradable |
Processor connectivity is modeled by a graph in which nodes represent processors and edges represent the connections between them. Much theoretical work was done here, but it is now largely irrelevant: even if this hardware is used, pipelining masks the transmission time, which ends up showing no term proportional to the distance travelled |
Two dimensional grid, Binary tree, complete interconnect and 4D Hypercube. |
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are partial |
A mesh seems to have a poor diameter, but Dally and Seitz emphasize that this can be made up because a mesh is easy to build physically (circuits are two dimensional). Thus it can have fewer links, but as these are physically short they can be "thick" (very high bandwidth) |
The hypercube is a compromise in which both degree and diameter grow modestly. High degree makes the machine hard to build -- low diameter improves communication |
A tree has a bottleneck at the root (top). A fat tree addresses this, as in the CM-5, by increasing the bandwidth of the links as one goes up the tree |
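A minimal sketch of the standard degree and diameter figures behind this comparison, for P processors (my own summary; the mesh is assumed square without wrap-around, and the tree has its processors at the leaves as in the CM-5):

```python
import math

# Degree (links per node) and diameter (worst-case hops) for the networks
# discussed above, as functions of the number of processors P.
def ring(P):        return 2, P // 2
def mesh_2d(P):     s = int(round(math.sqrt(P))); return 4, 2 * (s - 1)
def hypercube(P):   d = int(round(math.log2(P))); return d, d
def binary_tree(P): h = int(round(math.log2(P))); return 3, 2 * h  # P leaves

P = 64
for name, f in [("ring", ring), ("2D mesh", mesh_2d),
                ("hypercube", hypercube), ("binary tree", binary_tree)]:
    degree, diameter = f(P)
    print(f"{name:11s} degree {degree:2d}  diameter {diameter:2d}")
```

For P = 64 this gives the familiar trade-off: the mesh has low degree but large diameter, the hypercube balances the two, and the tree has low degree but a root bottleneck.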
Hardware may handle only single hop, or multiple hops as in routing chips on Paragon. |
Software may mask hardware limitations so one sees full connectivity even if the physical connections are limited. Note that multiple hops always lead to poorer latency, as this is the travel time per bit. However, we can keep bandwidth high even with multiple hops by increasing the "size" of channels, e.g. transmitting several bits simultaneously. Software can hide this.
|
Latency related to graph diameter, among many other factors |
Graph may also represent calculations that need to be performed, and the information exchange required. |
Transmission Time for message of n bytes: |
T(n) = T0 + T1 * n, where |
T0 is the latency; it contains a term proportional to the number of hops, plus a term representing the interrupt processing time at each end and the time for the communication network and processor to synchronize |
T0 = Ts + Td * (number of hops) |
T1 is the inverse bandwidth -- it can be made small if the pipe (channel) is wide |
In practice Ts and T1 are most important and Td is unimportant |
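A minimal sketch of this timing model; the parameter values below are invented for illustration, not measurements of any machine:

```python
# Message transmission time model from the foil above:
#   T(n) = T0 + T1 * n,   with   T0 = Ts + Td * hops
# Ts = startup/interrupt overhead, Td = per-hop delay, T1 = inverse bandwidth.
def transmission_time(n_bytes, hops, Ts=50e-6, Td=0.5e-6, T1=1.0 / 40e6):
    """Illustrative parameters: 50 microsec startup, 0.5 microsec per hop,
    40 MBytes/sec bandwidth -- chosen only to show the shape of the model."""
    T0 = Ts + Td * hops
    return T0 + T1 * n_bytes

for n in (8, 1_000, 100_000):   # short, medium and long messages (bytes)
    print(f"{n:7d} bytes: {transmission_time(n, hops=5) * 1e6:9.1f} microsec")
```

For the short message the startup term Ts dominates, for the long one the bandwidth term T1*n dominates, and the per-hop contribution Td*hops stays negligible throughout -- which is the point made above.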
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
Square blocks indicate shared memory copy performance |
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
Note that processor networks such as a torus or hypercube can be used to build switches when one puts not processors but "just" switch chips on the internal nodes. |
Thus switches involve the same trade-offs -- store-and-forward versus circuit-switched -- as processor networks. |
Switch based machines have typically the same delay (travel time) between any two processors |
In processor networks, some nodes can be nearer to each other if fewer hops separate them |
BUT in all modern machines, low-level hardware essentially makes all these architectures the SAME. There are only two times of importance, corresponding to whether or not there is DATA LOCALITY
|
Rather than have messages travel a single hop at a time, sometimes a circuit is first established from sender to receiver, and then the data transmitted along the circuit. |
Similar to establishing a phone connection |
Can result in significantly lower communication overhead, as the latency is deterministic once the circuit is established |
If circuit blocked, many options possible, including
|
At high levels of message traffic, performance sometimes severely degraded. |
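A hedged sketch contrasting the two approaches (parameter values again invented for illustration): in store-and-forward the whole message pays the cost at every hop, while once a circuit is established only a small per-hop delay is added before the data is streamed.

```python
# Illustrative comparison (assumed parameters, not measurements).
Ts, Td, T1 = 50e-6, 0.5e-6, 1.0 / 40e6   # startup, per-hop delay, 1/bandwidth

def store_and_forward(n_bytes, hops):
    return hops * (Ts + T1 * n_bytes)     # full message retransmitted per hop

def circuit_switched(n_bytes, hops):
    return Ts + Td * hops + T1 * n_bytes  # circuit set up once, then streamed

n, hops = 10_000, 8
print(f"store-and-forward: {store_and_forward(n, hops) * 1e6:8.1f} microsec")
print(f"circuit-switched : {circuit_switched(n, hops) * 1e6:8.1f} microsec")
```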
Many issues both between architectures and internal to each architecture |
The situation is confused as one needs to understand:
|
Is a particular implementation done right?
|
Perhaps all that is important is what either the user or high-level software (compiler/message-passing system) sees. |
A significant problem is that there is essentially no application software and essentially no machines have been sold |
Real-world architectural issue -- what problems will produce substantial sales; should one optimize for these?
|
System integration issues
|
Why is MPP always behind the best workstations?
|
Must reduce delay n by matching parallel design as much as possible to
|
Memory per node (SIMD and MIMD)
|
A special case of this is a network of workstations (NOWs) or personal computers |
Issues include:
|
Choice of Node is a dominant issue?
|
Network Topology as described is not very important today?
|
However we still have two major types of network:
|
CM-2 -- 64K processors with 1-bit arithmetic -- hypercube network, a broadcast network that can also combine, and a "global or" network |
Maspar, DECmpp -- 16K processors with 4-bit (MP-1) or 32-bit (MP-2) arithmetic, two-dimensional mesh and general switch |
Execube -- 16-bit processors with 8 integrated into an IBM 4-Mbit memory chip, SIMD or MIMD or both |
512 processors on IBM RS6000 with three dimensional mesh |
Node: 1-bit, 4-bit, 16-bit, or 32-bit? |
Interconnect issues similar to MIMD |
Critical point -- is SIMD more cost effective than MIMD on a significant number of problems
|
The current consensus is the opposite of that of 1990 -- namely, MIMD now dominates.
|
Some machines have a "true" shared memory, i.e. every location is "equally accessible" by each node: the original Ultracomputer, Burton Smith's Tera, and some bus-based systems |
Others, such as the BBN Butterfly and Kendall Square KSR-2, have non-uniform access times but all memory locations are accessible |
Each C.P.U. has two pipes; each pipe does one add and one multiply per clock period; cycle time 4 nanoseconds |
Each C.P.U. has up to 4 functional units; cycle time 2.9 nanoseconds |
Note MIMD parallelism "larger" than SIMD as MIMD size reflects number of grid points, particles, etc. |
Expected Architectures of Future will be:
|
Essentially all problems run efficiently on a distributed memory BUT |
Software is easier to develop on a shared memory machine |
Some Shared Memory Issues:
|
Only Superconducting Technology can possibly do better |
Need Optical Interconnects probably |
See Chapter 5 of Petaflops Report -- July 94 |
See Chapter 5 of Petaflops Report -- July 94 |
See Chapter 6 of Petaflops Report -- July 94 |
See Chapter 6 of Petaflops Report -- July 94 |
This shared memory design is the natural evolution of systems such as the Cray-2, Cray-3, or Cray C-90 |
See Chapter 6 of Petaflops Report -- July 94 |
This architecture generalizes current IBM SP-2 type systems and, unlike Category I, requires data locality for the up to 40,000 CPU's to be able to function efficiently with minimum communication overhead |
See Chapter 6 of Petaflops Report -- July 94 |
This design is an extrapolation of systems such as the J Machine (Dally), Execube (Loral) or Mosaic (Seitz). It features CPU and memory integrated on the chip (PIM). |
Unlike such systems today, in the year 2015 such PIM designs would have substantial memory per processor |
This is the formal latency needed to reach the design goals of the previous table. 1 picosecond (Category I) cannot be attained; you solve this by having N concurrent streams -- then each of them needs a latency of only N picoseconds |
e.g. if the output rate of an arithmetic unit is 1 per nanosecond, then each of the 400 teraflop CPU's in the Category I design must have 1000 arithmetic units running at full speed to reach the full system performance of 0.4 petaflops |
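As a worked check of that arithmetic (simply restating the numbers above):

```latex
\underbrace{400}_{\text{CPUs}} \times
\underbrace{1000}_{\text{units per CPU}} \times
\underbrace{10^{9}\ \tfrac{\text{results}}{\text{sec}}}_{\text{1 per nanosecond}}
 \;=\; 4\times 10^{14}\ \text{flops} \;=\; 0.4\ \text{petaflops}
```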
See Chapter 6 of Petaflops Report -- July 94 |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |
A hypercube is 2^d computers arranged on the corners of a cube in d dimensions, each connected to its d nearest neighbors |
Label a hypercube node e1 e2 e3 e4 ... ed |
where each ek takes the value 0 or 1 depending on whether the vertex is at the "top" or "bottom" of that axis. |
Think of hypercube as unit cube in d dimensions with origin at bottom left hand corner |
In string e1 e2 e3 e4.......ed, one can write d=d1+d2+d3 |
Any set of d1 binary indices can be mapped into a one dimensional line with periodic (wrap-around) connections |
the remaining d2+d3 indices give the other two dimensions |
This decomposes hypercube into a mesh with:
|
So Hypercube includes lots of meshes |
e.g. a d=6 hypercube contains 4 by 4 by 4, 8 by 8, 4 by 16, 64 by 1 meshes, etc. |
So can study one dimension without loss of generality! |
d= 1 -- 2 nodes |
d= 2 -- 4 nodes |
d= 3 -- 8 nodes |
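A minimal sketch of the binary labelling just described: a node's d neighbors differ from it in exactly one bit, and the binary-reflected Gray code embeds a wrap-around one-dimensional ring of the kind discussed above (the function names are mine).

```python
# Hypercube node labels are d-bit strings e1 e2 ... ed; flipping one bit
# gives each of the d nearest neighbours.
def neighbours(node, d):
    return [node ^ (1 << k) for k in range(d)]

# Binary-reflected Gray code: consecutive ring positions differ in one bit,
# so a 2^d-node ring (with wrap-around) maps directly onto hypercube links.
def gray(i):
    return i ^ (i >> 1)

d = 3                                   # 2^3 = 8 nodes
print("neighbours of 000:", [format(n, f"0{d}b") for n in neighbours(0, d)])
ring = [gray(i) for i in range(2 ** d)]
print("ring embedding   :", [format(n, f"0{d}b") for n in ring])
# Check: each consecutive pair (including the wrap-around) is one bit apart.
assert all(bin(ring[i] ^ ring[(i + 1) % len(ring)]).count("1") == 1
           for i in range(len(ring)))
```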
So in Hadrian's wall we saw a one dimensional layout of processors (people) to solve a one dimensional problem |
However computers need to be laid out in two or three dimensional worlds and we can use same ideas. |
We consider one dimension, as the discussion in two or three dimensions is similar: each axis (dimension) corresponds to a distinct subset of the binary indices |
Hypercube: |
Mesh with equivalent wire length |
So can let long hypercube wires have "local" stops and hypercube becomes a "thick mesh" |
With good routing technologies, a thick mesh is better than a hypercube? |
In the diagram below we can choose the staging point X to be:
|
Conversely, the "processor" always routes messages in a hypercube. This routing element can be attached to the CPU, as in a hypercube, or be an uncoupled processor -- routing chip pair, as in a switch
|
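As a sketch of how such a routing element can decide where to send a message, here is dimension-ordered (e-cube) routing, the standard scheme for hypercubes; it is my illustration, not necessarily the exact rule in the diagram:

```python
# Dimension-ordered (e-cube) routing on a hypercube: at each step flip the
# lowest-numbered bit in which the current node differs from the destination.
def ecube_route(src, dst):
    path, node = [src], src
    while node != dst:
        diff = node ^ dst          # bits still to be corrected
        bit = diff & -diff         # lowest such bit
        node ^= bit                # move along that dimension (one hop)
        path.append(node)
    return path

# Example on a d=4 hypercube: route from node 0101 to node 1110.
print([format(n, "04b") for n in ecube_route(0b0101, 0b1110)])
```

Each step flips exactly one bit, so every hop follows a real hypercube link, and the number of hops equals the number of bits in which source and destination differ.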
Hierarchical Hypercube -- Spatial and Temporal Decomposition |
Homogeneous Hypercube -- Purely Spatial Decomposition |
Cannot be manipulated independently |
Consider "nodes" = printed circuit boards (Paragon) or bunch of printed circuit boards sharing a BUS (SGI Challenge) where the N nodes are connected as a distributed memory architecture |
Inside each node, may have n=2,4,8 .... processors sharing a single memory. |
Should one use this shared memory or just consider total machine as distributed with Nn nodes? |
This is seen today (1995) as a network of SGI Challenge systems (i.e. a metacomputer where a conventional HIPPI network joins a set of shared memory machines) or as a specialized torus network connecting Paragon nodes consisting of multiple i860's |
Now we consider an example of a real system - the Intel Delta Touchstone |
This was perhaps the first real parallel supercomputer used in serious work |
It is already obsolete and has been replaced by a large Intel Paragon, which has a similar architecture |
Computational Characteristics
|
System Characteristics
|
Physical Characteristics
|
Communications Characteristics
|
513 Compute Nodes; 30.8 GFlops (Double-precision Peak Performance) configured as 16x32 Compute Mesh +1 Compute Node |
8.2 GBytes Memory |
95 GBytes Disk Space |
513 compute nodes using i860 processors with 16 MBytes memory each |
9 service nodes using 80386 processors with 8MBytes memory |
32 disk I/O nodes using 80386 processors with 8 MBytes memory, each controlling two 1.5-GByte disks |
9 tape I/O nodes using 80386 processors with 8 MBytes memory
|
2 ethernet nodes using 80386 processors with 8 MBytes memory |
2 HIPPI interfaces using i860 processors with 32 MBytes memory |
RISC engine + 2 FPUs + graphics accelerator |
40 MHz clock |
Parallel integer and floating-point processing units |
Parallel multiplier and adder units within floating-point unit |
Pipelined floating-point processing units (3-stage pipe) |
80 MFlops (SP) Peak, 60 MFlops (DP) Peak, 33 MIPS (Integer) |
all transcendental functions done in software |
16 MBytes Main Memory (expandable to 64 MBytes) |
160 MBytes/sec peak DRAM access rate |
Very hard to build good compilers for i860! |
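As a check that the aggregate figures quoted earlier follow from the per-node numbers on these foils (my arithmetic):

```python
# Aggregate Delta figures recomputed from the per-node numbers quoted above.
nodes = 513
peak_dp_mflops_per_node = 60          # double-precision peak per i860 node
memory_mbytes_per_node  = 16
disk_nodes, disks_per_node, gbytes_per_disk = 32, 2, 1.5

print(f"peak   : {nodes * peak_dp_mflops_per_node / 1000:.1f} GFlops")  # ~30.8
print(f"memory : {nodes * memory_mbytes_per_node / 1000:.1f} GBytes")   # ~8.2
print(f"disk   : {disk_nodes * disks_per_node * gbytes_per_disk:.0f} GBytes")
```

The peak and memory totals reproduce the 30.8 GFlops and 8.2 GBytes quoted; the raw disk total comes out at 96 GBytes against the 95 quoted, presumably a rounding or formatted-capacity difference.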
Physically Independent bidirectional 2-Dimensional Mesh |
System-Wide Communications Fabric
|
Interconnect supports arbitrary expansion increments |
Automatically routes messages without interrupting intermediate compute nodes |
Nearly equal performance between any 2 nodes, independent of distance between them |
Programmer can ignore details of how messages move along the interconnect network |
Mesh Routing Chips have 5 bi-directional channels each
|
Interconnect supports 28 MBytes/sec node-node bandwidth |
Allows node fault isolation |
1) ILLIAC |
2) ICL DAP => AMT DAP |
3) Goodyear MPP => LORAL ASPRO used on E2C planes
|
4) Denelcor* HEP founded by Burton Smith => Tera |
5) Cray XMP/22 characterized by Shared Memory and Powerful nodes |
6) Ultracomputer (NYU) => IBM RP3 |
7) Cosmic Cube (Mark I) => Intel
|
8) Sequent, Encore* |
9) Alliant* => Cedar* (Illinois) |
10) Multiflow* wide word => superscalar
|
11) Intel iPSC/1, iPSC/2 -- First commercial MIMD
|
12) Transputers (open nCUBE) Embedded systems
|
13) CM-2 Real SIMD systems
|
14) nCUBE-2 Powerful Hypercubes
|
15) Intel Delta (528 nodes) MIMD but not Hypercube (mesh)
|
16) BBN* TC2000 (Butterfly) Series of elegant but slow shared memory |
17) Cray C-90 Very successful "conventional" supercomputer |
18) CM-5 New MIMD Distributed Memory
|
19) DASH Virtual Shared Memory
|