Given by Geoffrey C. Fox at CPS615 Introduction to Computational Science on Fall Semester 1998. Foils prepared 17 November 98
Outside Index
Summary of Material
This presentation came from material developed by David Culler and Jack Dongarra available on the Web |
See summary of Saleh Elmohamed and Ken Hawick at http://nhse.npac.syr.edu/hpccsurvey/ |
We discuss several examples in detail including T3E, Origin 2000, Sun E10000 and Tera MTA |
These are used to illustrate major architecture types |
We discuss key sequential architecture issues including cache structure |
We also discuss technologies from today's commodities through Petaflop ideas and Quantum Computing |
Fall Semester 1998 |
Geoffrey Fox |
Northeast Parallel Architectures Center |
Syracuse University |
111 College Place |
Syracuse NY |
gcf@npac.syr.edu |
CM5 |
nCUBE |
Intel iPSC2 |
Workstation Cluster of Digital alpha machines |
NCSA 1024 nodes |
NPAC |
Architecture translates technology's gifts to performance and capability |
Resolves the tradeoff between parallelism and locality |
Four generations of architectural history: tube, transistor, IC, VLSI |
Greatest delineation within VLSI generation has been in type of parallelism exploited |
Greatest trend in VLSI generation is increase in parallelism |
How good is instruction-level parallelism? |
Thread-level parallelism needed in future microprocessors to use available transistors? |
Threads need classic coarse grain data or functional parallelism |
Exponential Improvement is (an example of) Moore's Law |
Shared memory SMP's dominate server and enterprise market, moving down to desktop as cost (size) of a single processor decreased |
Faster processors began to saturate bus, then bus technology advanced |
Today, range of sizes for bus-based systems is desktop (2-8) to 64 on large servers |
Cray 6400 becomes the Sun E10000 (Starfire) |
Commodity microprocessors not only fast but CHEAP |
Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors |
Standardization by Intel makes small, bus-based SMPs commodity |
Desktop: few smaller processors versus one larger one? |
Pipelined High Performance Workstations/PC's |
Vector Supercomputer |
MIMD Distributed Memory machine; heterogeneous clusters; metacomputers |
SMP Symmetric Multiprocessors |
Distributed Shared Memory NUMA |
Special Purpose Machines |
SIMD |
MTA or Multithreaded architectures |
COMA or Cache Only Memories |
The Past? |
The Future? |
Pragmatic issues connected with cost of designing and building new systems -- use commodity hardware and software if possible |
Data locality and bandwidth to memory -- caches, vector registers -- sequential and parallel |
Programming model -- shared address or data parallel -- explicit or implicit |
Forms of parallelism -- data, control, functional; macroscopic, microscopic, instruction level |
Pipelining is familiar from everyday activities such as getting food in a cafeteria, where the line processes one person per "clock cycle" |
The clock cycle here is the maximum time anybody takes at a single "stage", where a stage is one component of the meal (salad, entrée etc.) |
Note any one person takes about 5 clock cycles to get through this pipeline, but the pipeline completes one person per clock cycle |
The pipeline has a problem if there is a "stall" -- here, one of the people wants an entrée which needs to be fetched from the kitchen. This delays everybody! |
In the computer case, a stall is typically caused by data not being ready for a particular instruction |
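As a rough worked example of the point above (the 5-stage figure is the one quoted in the analogy, not a measurement): with s = 5 stages and N people, the pipelined line finishes in about s + N - 1 clock cycles instead of s*N, so the speedup s*N/(s + N - 1) approaches s = 5 for large N -- for example, 100 people take about 104 cycles rather than 500.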
Taken from David Patterson CS252 Berkeley Fall 1996 |
3 Functional Units FP Adder, FP Multiplier, FP Divider |
8 Kinds of Stages in FP Units |
Stage Functional Unit Description |
A FP Adder Mantissa ADD stage |
D FP Divider Divide Pipeline stage |
E FP Multiplier Exception Test stage |
M FP Multiplier First stage of multiplier |
N FP Multiplier Second stage of multiplier |
R FP Adder Rounding stage |
S FP Adder Operand Shift stage |
U Unpack Floating point numbers |
Several different pipelines with different lengths! |
Add, Subtract - 4 clocks: U S+A A+R R+S |
Multiply - 8 clocks: U E+M M M M N+A R |
Divide - 38 clocks: U A R D(28) D+A D+R D+R D+A D+R A R |
Square Root - 110 clocks: U E (A+R)(108) A R |
Negate - 2 clocks: U S |
Absolute Value - 2 clocks: U S |
Floating Point Compare - 3 clocks: U A R |
SGI Workstations at NPAC 1995 |
Data locality implies CPU finds information it needs in cache which stores most recently accessed information |
This means one reuses a given memory reference in many nearby computations e.g. |
A1 = B*C |
A2 = B*D + B*B |
.... Reuses B |
[Figure: memory hierarchy -- CPU and its caches, L3 cache, main memory, disk; memory capacity increases and memory speed decreases as one moves away from the CPU (factor of 100 difference between processor and main memory speed)] |
As shown above, caches are familiar in the real world -- here to support movement of food from manufacturer to your larder. It would be inconvenient to drive to the store for every item needed -- it is more convenient to cache items in your larder |
Caches store instructions and data -- often in separate caches |
Caches have a total size but also a cache line size, which is the minimum unit transferred into cache -- because of spatial locality this is often quite big. |
They also have a "write-back" strategy to define when information is written back from cache into primary memory |
Factory |
Level 3 Middleman Warehouse |
Level 2 Middleman Warehouse |
Local Supermarket |
Your Larder |
CPU -- The Frying pan |
Finally caches have a mapping strategy which tells you where to write a given word into cache and when to overwrite it with another data value fetched from main memory |
Direct mapped caches hash each word of main memory into a unique location in cache |
Fully associative caches remove that word in cache which was unreferenced for longest time and then stores new value into this spot |
Set associative caches combine these ideas. They have 2 to 4 (to ..) locations for each hash value and replace the oldest reference in that group. |
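A minimal sketch of the mapping arithmetic just described. The cache size, line size, associativity and the example address are illustrative assumptions, not the parameters of any machine in these foils:

      PROGRAM CACHEMAP
C     Sketch: where a byte address lands under the mapping strategies
C     above. All sizes and the address are made-up examples.
      INTEGER CSIZE, LSIZE, ASSOC, NLINES, NSETS
      INTEGER ADDR, LINE, SLOT, SETIDX
      PARAMETER (CSIZE = 8192, LSIZE = 32, ASSOC = 4)
      NLINES = CSIZE / LSIZE
      NSETS  = NLINES / ASSOC
      ADDR   = 123456
C     Memory line holding this address
      LINE   = ADDR / LSIZE
C     Direct mapped: the line hashes to a unique cache slot
      SLOT   = MOD(LINE, NLINES)
C     4-way set associative: the line hashes to a set of 4 slots and
C     the oldest (least recently used) member of the set is replaced
      SETIDX = MOD(LINE, NSETS)
      PRINT *, 'memory line        ', LINE
      PRINT *, 'direct mapped slot ', SLOT
      PRINT *, '4-way set index    ', SETIDX
      END

A direct mapped cache is the special case ASSOC = 1, and a fully associative cache is the other extreme where the whole cache forms one set.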
In the classic loop |
      DO I = 2,N
      DO J = 2,N
      FI(I,J) = .25*(FI(I+1,J)+FI(I-1,J)+FI(I,J+1)+FI(I,J-1))
      END DO
      END DO
We see spatial locality -- if (I,J) is accessed, so are the neighboring points stored near (I,J) in memory |
Spatial locality is essential for distributed memory as it ensures that, after data decomposition, most of the data you need is stored in the same processor and so communication is modest (surface over volume) |
Temporal locality says that if you use FI(I,J) at one value of J, it is used again at the next J value, as FI(I,(J+1)-1) |
Temporal locality makes cache machines work well as it ensures that after a data value is stored into cache, it is used multiple times |
If the first access (from main memory) takes time T1 and each subsequent (i.e. cache) access takes time T2 with T2 << T1, and a data value is accessed l times while in cache, then the average access time is: T2 + T1/l |
Temporal locality ensures l is big |
Spatial locality helps here as one fetches all the data in a cache line (say 128 bytes) in the time T1. Thus one can effectively reduce T1 by a further factor equal to the number of words in a cache line (perhaps 16) |
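As a worked instance of the formula above, with purely illustrative numbers rather than measurements: take T1 = 100 cycles, T2 = 1 cycle and l = 16 reuses, giving an average access time of T2 + T1/l = 1 + 100/16, about 7 cycles; if in addition a 128-byte cache line brings in 16 8-byte words per main memory access, the effective T1 per word drops to about 100/16, and the average access time to roughly 1 + (100/16)/16, about 1.4 cycles.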
For both parallel and sequential computers, the cost is accessing remote memories with some form of "communication" |
Data locality addresses this in both cases |
The differences are the quantitative size of the effect, and what is done by the user versus what is done automatically |
[Figure: distributed memory architecture -- several processors, each with its own main memory, joined by an interconnection network; access to local main memory is slow, and access across the network can be very slow] |
There are 4 approaches to Cache Coherence -- the difficulty of maintaining correct caches in a parallel environment |
1) Ignore the problem in hardware -- let the user and/or software cope with this chore -- this is the approach followed in machines like the T3E and SP2 and all explicit parallel programming models |
2) Snoopy Buses. This is the approach used in most SMP's, where the caches (at a given level) share a special bus also connected to memory. When a request is made in a given cache, it is broadcast on the bus, so that caches with a more recent value can respond |
3) Scalable Coherent Interface (SCI). This differs from the snoopy bus by using a fast serial connection which pipes requests through all processors. It is a standard developed by the high energy physics community. |
4) Directory Schemes. These have a directory on each processor which keeps track of which cache line is where and which is up to date. The directories on each node are connected and communicate with each other when a memory location is accessed |
[Chart: Linpack Gflops over time, comparing Cray vector machines with microprocessor-based MPPs] |
Even vector Crays became parallel: X-MP (2-4) Y-MP (8), C-90 (16), T94 (32) |
Since 1993, Cray has produced MPPs too: the T3D and T3E |
     Machine                 Mflops   Place                 Country  Year  # Procs
 1   Intel ASCI Red          1338000  Sandia National Lab   USA      1997   9152
 2   SGI T3E1200              891500  Classified            USA      1998   1084
 3   SGI T3E900               815100  Classified            USA      1997   1324
 4   SGI ASCI Blue Mountain   690900  LANL                  USA      1998   6144
 5   SGI T3E900               552920  UK Met Office         UK       1997    876
 6   IBM ASCI Blue Pacific    547000  LLNL                  USA      1998   3904
 7   IBM ASCI Blue Pacific    547000  LLNL                  USA      1998   1952
 8   SGI T3E1200              509900  UK Centre for Science UK       1998    612
 9   SGI T3E900               449000  NAVOCEANO             USA      1997    700
10   SGI T3E                  448600  NASA/GSFC             USA      1998   1084
(Parallel Vector) |
Shared Address Space or Shared Memory |
Explicitly parallel or Data Parallel |
Any processor can directly reference any memory location |
Convenient: |
Naturally provided on wide range of platforms |
Popularly known as shared memory machines or model
|
Process: virtual address space plus one or more threads of control |
Portions of address spaces of processes are shared |
Writes to shared address visible to other threads (in other processes too) |
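A minimal sketch of this model in code, assuming OpenMP directives as one concrete way of expressing it (the foils describe the model, not any particular API):

      PROGRAM SHAREDX
C     Several threads update one shared array A; each write to the
C     shared address space is visible to the other threads.
      INTEGER I, N
      PARAMETER (N = 8)
      REAL A(N)
C$OMP PARALLEL DO SHARED(A) PRIVATE(I)
      DO I = 1, N
         A(I) = REAL(I)
      END DO
C$OMP END PARALLEL DO
      PRINT *, A
      END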
Natural extension of uniprocessor's model: |
OS uses shared memory to coordinate processes |
Memory capacity increased by adding modules, I/O by controllers |
Communication is natural extension of uniprocessor |
Already have processor, one or more memory modules and I/O controllers connected by hardware interconnect of some sort |
"Mainframe" approach
|
"Minicomputer" approach
|
Problem is interconnect: cost (crossbar) or bandwidth (bus)
|
Dance-hall: bandwidth still scalable, but lower cost than crossbar
|
Distributed memory or non-uniform memory access (NUMA)
|
Caching shared (particularly nonlocal) data? |
Complete computer as building block, including I/O |
Programming model: |
High-level block diagram similar to distributed-memory SAS (Shared Address Space) |
Programming model more removed from basic hardware operations (?) |
Send specifies buffer to be transmitted and receiving process |
Receive specifies sending process and application storage to receive into |
Memory to memory copy, but need to name processes |
Optional tag on send and matching rule on receive |
User process names local data and entities in process/tag space too |
In simplest form, the send/recv match achieves a pairwise synchronization event; collective communication is also provided |
Many overheads: copying, buffer management, protection |
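A minimal sketch of the send/receive operations just listed, written with MPI as the assumed message passing library (the model is older and more general than MPI); run it on two processes, e.g. mpirun -np 2:

      PROGRAM SENDRECV
C     Process 0 sends a buffer to process 1; the matching receive
C     names the sender, the tag and the storage to receive into.
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, STATUS(MPI_STATUS_SIZE)
      DOUBLE PRECISION BUF(100)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      IF (RANK .EQ. 0) THEN
         CALL MPI_SEND(BUF, 100, MPI_DOUBLE_PRECISION, 1, 99,
     &                 MPI_COMM_WORLD, IERR)
      ELSE IF (RANK .EQ. 1) THEN
         CALL MPI_RECV(BUF, 100, MPI_DOUBLE_PRECISION, 0, 99,
     &                 MPI_COMM_WORLD, STATUS, IERR)
      END IF
      CALL MPI_FINALIZE(IERR)
      END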
Early machines such as the Caltech Hypercube and the first commercial Intel and nCUBE designs used a FIFO on each node to store and forward messages |
Diminishing role of topology in modern machines |
64 node n=6 hypercube in Caltech Computer Science Dept. |
Hypercube Topology for 8 machines |
See next foil for basic chip -- these were connected in a hypercube |
This and related transputer designs were very innovative but failed as they could not exploit commodity microprocessor design economies |
Made out of essentially complete RS6000 workstations |
Network interface integrated in I/O bus |
Most successful MIMD Distributed Memory Machine |
[Figure: IBM SP-2 node -- Power 2 CPU with L2 cache, memory controller and 4-way interleaved DRAM; the NIC sits on the 64-bit, 50 MHz memory bus; nodes are joined by a general interconnection network formed from 8-port switches] |
[Figure: Intel Paragon node -- two i860 processors, each with an L1 cache, plus network interface (NI), DMA, driver, memory controller and 4-way interleaved DRAM; links are 8 bits wide, 175 MHz and bidirectional, forming a 2D grid network with a processing node attached to every switch. Photo: Sandia's Intel Paragon XP/S-based supercomputer] |
2D Grid Network with Processing Node Attached to Every Switch |
Powerful i860 node made this first "serious" MIMD distributed memory machine |
PCcube using serial ports on 80286 machines as REU Undergraduate Project 1986 |
Naegling at Caltech with Tom Sterling and John Salmon 1998 120 Pentium Pro Processors |
Beowulf at Goddard Space Flight Center |
NCSA measured some single processor results on HP Kayak PC with 300 MHz Intel Pentium II. The compiler is a Digital compiler with optimization level of 4. For comparison, also included are results from the Origin 2000. This is a CFD Application |
http://www.ncsa.uiuc.edu/SCD/Perf/Tuning/sp_perf/ |
In general the Origin is twice as fast. For the HP Kayak there is a sharp decline going from 64x64 to 128x128 matrices, while on the Origin the decline is more gradual and it usually becomes memory bound beyond 256x256. This is a result of the smaller cache on the Intel chip. |
CRAY-1 supercomputer |
Cray's first supercomputer. Introduced in 1976, this system had a peak performance of 133 megaflops. The first system was installed at Los Alamos National Laboratory. |
CRAY T90 systems are available in three models: the CRAY T94 system, offered in air- or liquid-cooled versions, which scales up to four processors; |
the CRAY T916 system, a liquid-cooled system that scales up to 16 processors; and the top-of-the-line CRAY T932 system, also liquid-cooled, |
with up to 32 processors and a peak performance of over 60 gigaflops |
System                Memory latency [ns]   Clock speed [ns]   Ratio   FP ops per clock   FP ops to cover memory latency
CDC 7600                      275                 27.5           10           1                    10
CRAY 1                        150                 12.5           12           2                    24
CRAY X-MP                     120                  8.5           14           2                    28
SGI Power Challenge          ~760                 13.3           57           4                   228
CRAY T3E-900                 ~280                  2.2          126           2                   252
This and following foils from Performance of the CRAY T3E Multiprocessor by Anderson, Brooks, Grassi and Scott at http://www.cray.com/products/systems/crayt3e/1200/performance.html |
Air cooled T3E |
T3E Torus Communication Network |
T3E Node with Digital Alpha Chip |
This is innovative as it supports "get" and "put", where the memory controller converts an off-processor memory reference into a message, with greater convenience of use |
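A sketch of what a one-sided "put" looks like in code, assuming Cray's SHMEM library as the interface to these get/put operations; the routine names and argument order here are written from memory and should be checked against the T3E documentation:

      PROGRAM PUTDEMO
C     Assumed interface: SHMEM_PUT64(dest, source, nelems, pe) copies
C     nelems 64-bit words from the local SOURCE into DEST on remote
C     processing element PE with no action by the remote processor.
      INCLUDE 'mpp/shmem.fh'
      INTEGER ME, I
      REAL*8 SRC(8), DEST(8)
C     DEST must be "symmetric" (same address on every PE); SAVE
C     arranges this here.
      SAVE DEST
      ME = MY_PE()
      DO I = 1, 8
         SRC(I) = ME + 0.1D0*I
      END DO
      IF (ME .EQ. 0) THEN
         CALL SHMEM_PUT64(DEST, SRC, 8, 1)
      END IF
      CALL SHMEM_BARRIER_ALL()
      END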
Each CRAY T3E processor contains an 8 KB direct-mapped primary data cache (Dcache), an 8 KB instruction cache, and a 96 KB 3-way associative secondary cache (Scache) which is used for both data and instructions. |
The Scache has a random replacement policy and is write-allocate and write-back, meaning that a cacheable store request to an address that is not in the cache causes that address to be loaded into the Scache, then modified and tagged as dirty for write-back later. |
Write-back of dirty cache lines occurs only when the line is removed from the Scache, either by the Scache controller to make room for a new cache line, or by the back-map to maintain coherence of the caches with the local memory and/or registers. |
Peak data transfer rates on the CRAY T3E-900 |
Type of access             Latency [CPU cycles]   Bandwidth [MB/s]
Dcache load                         2                   7200
Scache load                        8-10                 7200
Dcache or Scache store              --                  3600
These rates correspond to the maximum instruction issue rate of two loads per CPU cycle or one store per CPU cycle. |
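As a consistency check using only numbers quoted in these foils: the T3E-900 clock period is about 2.2 ns (roughly 450 MHz), so two 8-byte loads per cycle give about 16 bytes / 2.2 ns, i.e. roughly 7200 MB/s, and one 8-byte store per cycle gives about half that, roughly 3600 MB/s.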
From http://www.cray.com/products/systems/crayt3e/1200/performance.html |
REAL*8 AA(513,513), DD(513,513) |
REAL*8 X (513,513), Y (513,513) |
REAL*8 RX(513,513), RY(513,513) |
DO J = 2,N-1 |
DO I = 2,N-1 |
XX = X(I+1,J)-X(I-1,J) |
YX = Y(I+1,J)-Y(I-1,J) |
XY = X(I,J+1)-X(I,J-1) |
YY = Y(I,J+1)-Y(I,J-1) |
A = 0.25 * (XY*XY+YY*YY) |
B = 0.25* (XX*XX+YX*YX) |
C = 0.125 * (XX*XY+YX*YY) |
Continued on Next Page |
AA(I,J) = -B |
DD(I,J) = B+B+A*REL |
PXX = X(I+1,J)-2.*X(I,J)+X(I-1,J) |
QXX = Y(I+1,J)-2.*Y(I,J)+Y(I-1,J) |
PYY = X(I,J+1)-2.*X(I,J)+X(I,J-1) |
QYY = Y(I,J+1)-2.*Y(I,J)+Y(I,J-1) |
PXY = X(I+1,J+1)-X(I+1,J-1)-X(I-1,J+1)+X(I-1,J-1) |
QXY = Y(I+1,J+1)-Y(I+1,J-1)-Y(I-1,J+1)+Y(I-1,J-1) |
RX(I,J) = A*PXX+B*PYY-C*PXY |
RY(I,J) = A*QXX+B*QYY-C*QXY |
END DO |
END DO |
Continued from Previous Page |
The inner loop of this kernel has 47 floating point operations, 18 array reads and 4 array writes. |
9 point stencil |
Since all six arrays are the same size and are accessed at the same rate, we can ensure that they do not interfere with each other if they do not conflict initially. For this example, it is convenient to optimize for only one 4096-word set of the 3-way set associative Scache. |
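As a rough check on the arithmetic above: 47 floating point operations against 18 + 4 = 22 array references is only about 2.1 flops per reference, far below the 252 "FP ops to cover memory latency" quoted earlier for the T3E-900, which is one way to see why keeping all six arrays resident in the Scache matters so much for this kernel.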
The system comes in two versions, a deskside or a rack system. |
The deskside version has 1 to 4 node cards (up to 8 CPUs). |
The rack system has 1 to 64 node cards, for a total of 2 to 128 CPUs. |
Each node is based on the 64-bit MIPS RISC R10000 architecture. |
Also, each node has two primary caches (each 32KB two-way set-associative) and one secondary L2 cache (1 or 4MB two-way set associative) per CPU. |
Each node has hardware cache coherency using a directory system and a maximum bandwidth of 780MB/sec. |
The entire system (Cray Origin2000) has up to 512 such nodes, that is, up to 1024 processors. |
The SysAD (system address and data) bus of the previous figure, connecting the two processors, has a peak bandwidth of 780 MB/sec (the node bandwidth quoted above). |
The Hub's connections to the off-board net router chip and Xbow I/O interface are 1.56 GB/s each. |
RIEMANN is a general-purpose, higher-order accurate, Eulerian gas dynamics code based on Godunov schemes. Dinshaw Balsara |
Laplace: the solution of sparse linear systems resulting from Navier-Stokes - Laplace equations. Danesh Tafti |
QMC (Quantum Monte Carlo) Lubos Mitas |
PPM (Piece-wise Parabolic Method) |
MATVEC (Matrix-Vector Multiply) Dave McWilliams |
Cactus Numerical Relativity Ed Seidel |
Suppose two processors cache the same variable stored in memory of one of the processors |
One must ensure cache coherence so that when one cache value changes, all do! |
[Figure: two boards, each holding processors with caches, an L3 cache and main memory on a board-level interconnection network, joined by a system interconnection network; both boards hold a cached value of the same shared variable] |
Machines like the SGI Origin 2000 have a distributed shared memory with a so called directory implementation (pioneered in DASH project at Stanford) of cache coherence |
Machines like SGI Cray T3E are distributed memory but do have fast get and put so as to be able to access single variables stored on remote memories |
Origin 2000 approach does not scale as well as Cray T3E and large Origin 2000 systems must use message passing to link 128 node coherent memory subsystems |
Cray T3E offers a uniform (but not as good for small number of nodes) interface |
Pure Message passing / distributed memory is natural web model |
Commodity SMP |
All coherence and multiprocessing glue in processor module |
Highly integrated, targeted at high volume |
Low latency and bandwidth |
[Figure: Intel Pentium Pro SMP -- several P-Pro modules, each containing a CPU, bus interface and MIU with a 256-KB L2 cache, sharing the P-Pro bus (64-bit data, 36-bit address, 66 MHz); also on the bus are an interrupt controller, two PCI bridges leading to PCI buses with PCI I/O cards, and a memory controller driving 1-, 2-, or 4-way interleaved DRAM] |
It is perhaps the most successful SMP (Symmetric Multiprocessor), with applicability to both the commercial and scientific markets |
Its SMP characteristics are seen in its low, uniform latency |
One cannot build huge E10000 systems (only up to 64 processors, and each processor is slower than the Origin's) |
One should cluster multiple E10000's to build a supercomputer |
These are very successful in commercial server market for database and related applications |
E10000 acquired from Cray via SGI is also called Starfire |
TPCD Benchmarks 1000Gbyte(Terabyte) November 1998 |
These measure typical large scale database queries |
System               Database   Power Metric (QppD)   5 Year Total System Cost   $/perf ($/QphD)
Sun Starfire (SMP)   Oracle          27,024.6               $9,660,193                 $776
IBM RS/6000 (MPP)    DB2             19,137.5               $11,380,178                $797
Sun 4x6000 (Clus)    Informix        12,931.9               $11,766,932                $1,353
NCR 5150 (MPP)       Teradata        12,149.2               $14,495,886                $2,103
The Starfire server houses a group of system boards interconnected by a centerplane. |
On each system board there are up to four 336 MHz UltraSPARC microprocessor modules, with a supporting two-level, 4 Mbyte cache per module (64 per Starfire system). |
There can be four memory banks with a capacity of up to 4 Gbytes per system board (64 Gbytes per Starfire server). |
There are two SBuses per board, each with slots for up to two adapters for networking and I/O (32 SBuses or 64 slots per system). |
This is aimed at commercial customers where Origin 2000 is not a major player |
The Starfire scales to smaller sizes than machines like the IBM SP2 and Cray T3E but significantly larger sizes than competing SMP's |
Results from Samson Cheung of NAS (Ames) |
Origin 2000: mpirun -np 2 ./lbw -B -n 4000 -D |
Total transfer time: 57.281 sec. |
Transfer rate: 69.8 transfers/sec. |
Message size: 1000000 bytes |
Bandwidth: 139.664 10e+06 Bytes/sec |
Sun E10000: tmrun -np 2 ./lbw -B -n 4000 -D |
Total transfer time: 54.487 sec. |
Transfer rate: 73.4 transfers/sec. |
Message size: 1000000 bytes |
Bandwidth: 146.825 10e+06 Bytes/sec |
Origin 2000: mpirun -np 2 ./lbw -L -n 2500000 -D |
Total transfer time: 52.597 sec. |
Transfer rate: 47531.4 transfers/sec. |
Message size: 40 bytes |
Latency: 10.5 microsec. |
Sun E10000: tmrun -np 2 ./lbw -L -n 2500000 -D |
Total transfer time: 34.585 sec. |
Transfer rate: 72286.7 transfers/sec. |
Message size: 40 bytes |
Latency: 6.9 microsec. |
Note: Origin CPU about twice performance of Sun CPU |
This uses a clever idea developed over many years by Burton Smith who originally used it in the Denelcor system which was one of the first MIMD machines over 15 years ago |
MTA (multithreaded architectures) are designed to hide the mismatch between memory access time and CPU cycle time. |
First 2 processor system |
Tera computer system is a shared memory multiprocessor. |
The Tera is a multi-processor which potentially can accommodate up to 256 processors. |
The clock speed is nominally 333 MHz, giving each processor a data path bandwidth of one billion 64-bit results per second and a peak performance of one gigaflops. |
The Tera processors are multithreaded (each hardware thread is called a stream), and each processor switches context every cycle among as many as 128 hardware threads, thereby hiding up to 128 cycles (384 ns) of memory latency. |
Each processor executes a 21 stage pipeline and so can have 21 separate streams executing simultaneously |
Each stream can issue as many as eight memory references without waiting for earlier ones to finish, further augmenting the memory latency tolerance of the processor. |
A stream implements a load-store architecture with three addressing modes and 31 general-purpose 64-bit registers. |
The peak memory bandwidth is 2.67 gigabytes per second. |
From Bokhari (ICASE) |
From Bokhari (ICASE) |
The interconnection net is a sparsely populated 3-D packet-switched network containing p^(3/2) nodes, where p is the number of processors. |
These nodes are toroidally connected in three dimensions to form a p^(1/2)-ary three-cube, and processor and memory resources are attached to some of the nodes. |
The latency of a node is three cycles: a message spends two cycles in the node logic proper and one on the wire that connects the node to its neighbors. |
A p-processor system has worst-case one-way latency of 4.5p^(1/2) cycles. |
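As a worked example using numbers quoted elsewhere in these foils (up to 256 processors, 3 ns clock cycle): a 256-processor system has 256^(3/2) = 4096 network nodes, and the worst-case one-way latency is 4.5 x 256^(1/2) = 72 cycles, or about 216 ns.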
Messages are assigned random priorities and then routed in priority order. Under heavy load, some messages are derouted by this process. The randomization at each node ensures that each packet eventually reaches its destination. |
A node has four ports (five if a resource is attached). |
Each port simultaneously transmits and receives an entire 164-bit packet every 3 ns clock cycle. |
Of the 164 bits, 64 are data, so the data bandwidth per port is 2.67 GB/s in each direction. |
The network bisection bandwidth is 2.67p GB/s. (p is number of processors) |
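As a check on the per-port figure: 64 data bits is 8 bytes per 3 ns clock cycle, which is about 2.67 GB/s in each direction, consistent with the bandwidth quoted above.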
The network routing nodes contain no buffers other than those required for the pipeline. |
From Allan Snavely at SDSC |
The overall hardware configuration of the system: |
Processors 16 64 256 |
Peak Gflops 16 64 256 |
Memory, Gbytes 16-32 64-128 256-512 |
HIPPI channels 32 128 512 |
I/O, Gbytes/sec 6.2 25 102 |
Relative MTA and T90 Performance |
From Allan Snavely at SDSC |
[Chart: NAS Benchmark results at 255 MHz and 300 MHz] |
KSR-1 (Colorado) |
Cache Only Machines "only" have cache and are typified by the Kendall Square KSR-1,2 machines. Although this company is no longer in business, the basic architecture is interesting and could still be used in future important machines |
In this class of machine one has a NUMA architecture, with the memory attached to a given processor having a lower access cost |
In simplest COMA architectures, think of this memory as the cache and when needed migrate data to this cache |
However in conventional machines, all data has a natural "original home"; in (simple) COMA, the home of the data moves when it is accessed and one hopes that data is "attracted" through access to "correct" processor |
Maspar (DECmpp) SIMD Machine at NPAC 1995. We had two such machines, with 8K and 16K nodes respectively |
For a short time Digital resold the Maspar as the DECmpp |
ICL DAP 4096 Processors circa 1978 |
Proving ground and driver for innovative architecture and techniques |
Large-scale multiprocessors replace vector supercomputers |
Evolution and role of software have blurred boundary |
Hardware organization converging too |
Even clusters of workstations/SMPs are parallel systems |
Programming models distinct, but organizations converging |
Node: processor(s), memory system, plus communication assist |
Scalable network |
Convergence allows lots of innovation, now within framework |
A generic modern multiprocessor (Culler) |
Disk Vault |
GRAPE 4 1.08 Teraflops |
GRAPE 6 200 Teraflops |
Here the case for special purpose machines is less compelling than for GRAPE as QCD is "just" regular (easy to parallelize; lowish communication) and extremely floating point intensive. |
We illustrate with two machines which are classic MIMD distributed memory architectures with optimized nodes and communication networks |
BNL/Columbia QCDSP 400 Gigaflops |
Univ. of Tsukuba CP-PACS (general physics) 600 Gigaflops |
Full System by Artist and camera |
ASCI Red Interconnect |