Given by Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science on Fall Semester 98. Foils prepared 17 November 1998
This presentation came from material developed by David Culler and Jack Dongarra available on the Web |
See summary of Saleh Elmohamed and Ken Hawick at http://nhse.npac.syr.edu/hpccsurvey/ |
We discuss several examples in detail including T3E, Origin 2000, Sun E10000 and Tera MTA |
These are used to illustrate major architecture types |
We discuss key sequential architecture issues including cache structure |
We also discuss technologies from today's commodities through Petaflop ideas and Quantum Computing |
Fall Semester 1998 |
Geoffrey Fox |
Northeast Parallel Architectures Center |
Syracuse University |
111 College Place |
Syracuse NY |
gcf@npac.syr.edu |
CM5 |
nCUBE |
Intel iPSC2 |
Workstation Cluster of Digital alpha machines |
NCSA 1024 nodes |
NPAC |
We can choose technology and architecture separately in designing our high performance system |
Technology is like choosing ants, people, or tanks as the basic units in our society analogy
|
In HPCC arena, we can distinguish current technologies
|
Near term technology choices include
|
Further term technology choices include
|
The next round of industry investment in CMOS plants will cost some $40 billion, and this huge investment makes it hard for new technologies to "break in" |
Architecture is equivalent to organization or design in society analogy
|
We can distinguish formal and informal parallel computers |
Informal parallel computers are typically "metacomputers"
|
Metacomputers are a very important trend; they use similar software and algorithms to conventional "MPP's" but typically have less optimized parameters
|
Formal high performance computers are the classic (basic) object of study and are |
"closely coupled" specially designed collections of compute nodes which have (in principle) been carefully optimized and balanced in the areas of
|
In society, we see a rich set of technologies and architectures
|
With several different communication mechanisms with different trade-offs
|
Architecture translates technology's gifts to performance and capability |
Resolves the tradeoff between parallelism and locality
|
Four generations of architectural history: tube, transistor, IC, VLSI
|
Greatest delineation within VLSI generation has been in type of parallelism exploited |
Greatest trend in VLSI generation is increase in parallelism
|
How good is instruction-level parallelism? |
Thread-level parallelism needed in future microprocessors to use available transistors? |
Threads need classic coarse grain data or functional parallelism |
Exponential Improvement is (an example of) Moore's Law |
These foils come from summary by David Bailey |
First of all, there seems to be general agreement that Moore's Law will not stop anytime soon. |
The current state of the art in the semiconductor industry is 250 nm, as recently announced by Intel among others. Researchers are confident that the current approach can be used for at least another three generations. |
In the years ahead we may even see some manufacturers skip a generation, proceeding directly to significantly smaller feature sizes. This means that the 100 nm technology wall will be reached earlier than previously anticipated. |
Below about 100 nm feature sizes, progress will be more difficult, for various well known reasons, among them the lack of any substance that can be used as a lens for photolithography at such wavelengths. |
Nonetheless, there are a number of "tricks" in the works, including the usage of diffraction gratings, parallel exposures, and others. |
Groups working on these "ultimate" silicon techniques include Frances Houle's group at IBM Almaden and Jeff Baker's group at UC Berkeley. |
Also helping will be some improvements such as better materials and increased wafer sizes. The consensus is that these techniques will be good for another two generations, to about 50 nm. Thus Moore's Law should continue until about 2010 or 2012. |
Below about 50 nm feature sizes, it appears that a completely new device fabrication technology is needed. |
Progress in electron beams and X-rays, which were the leading candidates of a few years ago, has stalled.
|
The researchers I spoke with nonetheless agree that future devices will be electronic for the foreseeable future. |
As Stan Williams of HP points out, electrons are the ideal basis for device technology because they have basically zero size and mass, can travel at large fractions of the speed of light and interact strongly with matter. |
In contrast, DNA computing and the like are still curiosities with no clear path to practical high-speed computing. |
One exciting development, which has emerged just in the past year or two, is nanotubes, i.e. cylindrical buckyballs. |
It was recently discovered that a nanotube can be "tuned" from insulator to semiconductor to conductor just by changing the pitch of the helical structure of carbon atoms. |
Kinks introduced in a tube can be used to form a conductor-semiconductor junction.
|
Shared memory SMP's dominate server and enterprise market, moving down to desktop as cost (size) of a single processor decreased |
Faster processors began to saturate bus, then bus technology advanced |
Today, range of sizes for bus-based systems is desktop (2-8) to 64 on large servers |
Cray CS6400 becomes the Sun E10000 |
Commodity microprocessors not only fast but CHEAP
|
Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors |
Standardization by Intel makes small, bus-based SMPs commodity |
Desktop: few smaller processors versus one larger one?
|
Pipelined High Performance Workstations/PC's |
Vector Supercomputer |
MIMD Distributed Memory machine; heterogeneous clusters; metacomputers |
SMP Symmetric Multiprocessors |
Distributed Shared Memory NUMA |
Special Purpose Machines |
SIMD |
MTA or Multithreaded architectures |
COMA or Cache Only Memories |
The Past? |
The Future? |
Pragmatic issues connected with cost of designing and building new systems -- use commodity hardware and software if possible
|
Data locality and bandwidth to memory -- caches, vector registers -- sequential and parallel |
Programming model -- shared address or data parallel -- explicit or implicit |
Forms of parallelism -- data, control, functional; macroscopic, microscopic, instruction level |
Sequential or von Neumann Architecture |
Vector (Super)computers |
Parallel Computers
|
Very high speed computing systems, Proc. of the IEEE 54, 12, pp. 1901-1909 (1966) and |
Some Computer Organizations and their Effectiveness, IEEE Trans. on Computers C-21, 948-960 (1972) -- both papers by M.J. Flynn |
SISD -- Single Instruction stream, Single Data Stream -- i.e. von Neumann Architecture |
MISD -- Multiple Instruction stream, Single Data Stream -- Not interesting |
SIMD -- Single Instruction stream, Multiple Data Stream |
MIMD -- Multiple Instruction stream and Multiple Data Stream -- dominant parallel system with ~one to ~one match of instruction and data streams. |
Chart: Linpack Gflops over time for Cray vector machines and microprocessor-based MPPs |
Even vector Crays became parallel: X-MP (2-4), Y-MP (8), C-90 (16), T94 (32) |
Since 1993, Cray has produced MPPs too: T3D, T3E |
     Machine                  Mflops    Place/Country               Year   # PROCS
 1   Intel ASCI Red           1338000   Sandia National Lab, USA    1997   9152
 2   SGI T3E1200               891500   Classified, USA             1998   1084
 3   SGI T3E900                815100   Classified, USA             1997   1324
 4   SGI ASCI Blue Mountain    690900   LANL, USA                   1998   6144
 5   SGI T3E900                552920   UK Met Office, UK           1997    876
 6   IBM ASCI Blue Pacific     547000   LLNL, USA                   1998   3904
 7   IBM ASCI Blue Pacific     547000   LLNL, USA                   1998   1952
 8   SGI T3E1200               509900   UK Centre for Science, UK   1998    612
 9   SGI T3E900                449000   NAVOCEANO, USA              1997    700
10   SGI T3E                   448600   NASA/GSFC, USA              1998   1084
(Parallel Vector) |
Instructions and data are stored in the same memory, for which there is a single link (the von Neumann Bottleneck) to the CPU, which decodes and executes instructions |
The CPU can have multiple functional units |
The memory access can be enhanced by use of caches made from faster memory to allow greater bandwidth and lower latency |
Pipelining is familiar from such everyday activities as getting food in a cafeteria, where one processes one person per "clock cycle" |
The clock cycle here is the maximum time anybody takes at a single "stage", where a stage is a component of the meal (salad, entrée etc.) |
Note any one person takes about 5 clock cycles in this pipeline, but the pipeline processes one person per clock cycle |
The pipeline has a problem if there is a "stall" -- here that one of the people wants an entrée which needs to be fetched from the kitchen. This delays everybody! |
In the computer case, a stall is typically caused by data not being ready for a particular instruction |
Three Instructions are shown overlapped -- each starting one clock cycle after last |
Taken from David Patterson CS252 Berkeley Fall 1996 |
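To make the pipeline arithmetic above concrete, here is a small Fortran sketch (not from the original foils; the 5-stage depth and instruction count are illustrative assumptions) comparing unpipelined and pipelined execution time:

  program pipeline_model
    ! Illustrative model: a k-stage pipeline issues one instruction per
    ! clock once full, so n instructions take k + (n-1) cycles instead
    ! of k*n cycles without pipelining (assuming no stalls).
    implicit none
    integer :: k, n
    real :: t_seq, t_pipe, speedup

    k = 5          ! assumed pipeline depth (like the cafeteria stages)
    n = 1000       ! assumed number of instructions (people served)

    t_seq  = real(k * n)        ! cycles with no overlap
    t_pipe = real(k + (n - 1))  ! cycles with full overlap, no stalls
    speedup = t_seq / t_pipe    ! approaches k for large n

    print *, 'Unpipelined cycles:', t_seq
    print *, 'Pipelined cycles:  ', t_pipe
    print *, 'Speedup:           ', speedup
  end program pipeline_model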
3 Functional Units FP Adder, FP Multiplier, FP Divider |
8 Kinds of Stages in FP Units |
Stage Functional Unit Description |
A FP Adder Mantissa ADD stage |
D FP Divider Divide Pipeline stage |
E FP Multiplier Exception Test stage |
M FP Multiplier First stage of multiplier |
N FP Multiplier Second stage of multiplier |
R FP Adder Rounding stage |
S FP Adder Operand Shift stage |
U Unpack Floating point numbers |
Several different pipelines with different lengths! |
Add,Subtract- 4 clocks:U S+A A+R R+S |
Multiply - 8 clocks: U E+M M M M N+A R |
Divide- 38 clocks: U A R D(28) D+A D+R D+R D+A D+R A R |
Square Root- 110 clocks: U E (A+R)(108) A R |
Negate- 2 clocks: U S |
Absolute Value- 2 clocks: U S |
Floating Point Compare- 3 clocks:U A R |
SGI Workstations at NPAC 1995 |
Fig 1.14 of Aspects of Computational Science, editor Aad van der Steen, published by NCF |
Data locality implies CPU finds information it needs in cache which stores most recently accessed information |
This means one reuses a given memory reference in many nearby computations e.g. |
A1 = B*C |
A2 = B*D + B*B |
.... Reuses B |
Figure: memory hierarchy -- L3 cache, main memory, disk -- with increasing memory capacity but decreasing memory speed (a factor of 100 difference between processor and main memory speed)
As shown above, caches are familiar in the real world -- here to support movement of food from manufacturer to your larder. It would be inconvenient to drive to the store for every item needed -- it is more convenient to cache items in your larder |
Caches store instructions and data -- often in separate caches |
Caches have a total size but also a cache line size, which is the minimum unit transferred into the cache -- because of spatial locality the line size is often quite big. |
They also have a "write-back" strategy to define when information is written back from cache into primary memory |
Factory |
Level 3 Middleman Warehouse |
Level 2 Middleman Warehouse |
Local Supermarket |
Your Larder |
CPU -- The Frying pan |
Finally, caches have a mapping strategy which tells you where to write a given word into cache and when to overwrite it with another data value fetched from main memory |
Direct mapped caches hash each word of main memory into a unique location in cache
|
Fully associative caches remove the word in cache which has been unreferenced for the longest time and then store the new value into this spot |
Set associative caches combine these ideas. They have 2 to 4 (to ..) locations for each hash value and replace the oldest reference in that group (see the sketch below).
|
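As an illustration of these mapping strategies, here is a small Fortran sketch (not from the original foils; the 4096-word cache, 16-word line and 4-way associativity are assumed for concreteness) showing how an address is mapped:

  program cache_mapping
    ! Sketch of how a memory word address is mapped to a cache location.
    ! Sizes are assumed: 4096-word cache, 16-word cache lines.
    implicit none
    integer, parameter :: cache_words = 4096, line_words = 16
    integer, parameter :: nlines = cache_words / line_words   ! 256 lines
    integer, parameter :: nways  = 4                          ! 4-way set associative
    integer, parameter :: nsets  = nlines / nways             ! 64 sets
    integer :: addr, line_no, direct_slot, set_no

    addr = 123456                      ! an arbitrary word address
    line_no = addr / line_words        ! which memory line it lives in

    ! Direct mapped: each memory line hashes to exactly one slot
    direct_slot = mod(line_no, nlines)

    ! Set associative: the line hashes to a set; any of the nways slots
    ! in that set may hold it (the oldest reference in the set is evicted)
    set_no = mod(line_no, nsets)

    print *, 'Direct-mapped slot:   ', direct_slot
    print *, '4-way associative set:', set_no
  end program cache_mapping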
In the classic loop: |
  DO I = 2,N
    DO J = 2,N
      FI(I,J) = .25*(FI(I+1,J)+FI(I-1,J)+FI(I,J+1)+FI(I,J-1))
    END DO
  END DO
We see spatial locality -- if (I,J) accessed so are neighboring points stored near (I,J) in memory |
Spatial locality is essential for distributed memory as it ensures that, after data decomposition, most data you need is stored in the same processor and so communication is modest (surface over volume)
|
Temporal locality says that if you use FI(I,J) at one J value, it is used again at the next J index value as FI(I,(J+1)-1) |
Temporal locality makes cache machines work well as it ensures that after a data value is stored into cache, it is used multiple times |
If the first (main memory) access takes time T1 and each subsequent (i.e. cache) access takes time T2 with T2 << T1, and a data value is accessed l times while in cache, then the average access time is: T2 + T1/l |
Temporal locality ensures l big |
Spatial locality helps here as one fetches all the data in a cache line (say 128 bytes) in the time T1. Thus one can effectively reduce T1 by a further factor equal to the number of words in a cache line (perhaps 16) |
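A small worked sketch of this formula (the values of T1, T2, the reuse count and the line length are illustrative assumptions, not taken from the foils):

  program access_time
    ! Average access time = T2 + T1/l when a value fetched from main
    ! memory (time T1) is reused l times from cache (time T2 each).
    ! Spatial locality further divides T1 by the words per cache line.
    implicit none
    real :: t1, t2, tavg, tavg_spatial
    integer :: l, words_per_line

    t1 = 100.0          ! assumed main memory access time (ns)
    t2 = 2.0            ! assumed cache access time (ns)
    l  = 10             ! assumed number of uses while the value is cached
    words_per_line = 16 ! assumed words fetched per cache line

    tavg = t2 + t1 / real(l)
    tavg_spatial = t2 + (t1 / real(words_per_line)) / real(l)

    print *, 'Average access time (temporal locality only): ', tavg
    print *, 'With spatial locality over the cache line:    ', tavg_spatial
  end program access_time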
System                Memory latency [ns]   Clock speed [ns]   Ratio   FP ops per clock period   FP ops to cover memory latency
CDC 7600              275                   27.5               10      1                         10
CRAY 1                150                   12.5               12      2                         24
CRAY X-MP             120                   8.5                14      2                         28
SGI Power Challenge   ~760                  13.3               57      4                         228
CRAY T3E-900          ~280                  2.2                126     2                         252
This and following foils from Performance of the CRAY T3E Multiprocessor by Anderson, Brooks, Grassi and Scott at http://www.cray.com/products/systems/crayt3e/1200/performance.html |
Air cooled T3E |
T3E Torus Communication Network |
T3E Node with Digital Alpha Chip |
This is innovative as it supports "get" and "put", where the memory controller converts an off-processor memory reference into a message, with greater convenience of use |
Each CRAY T3E processor contains an 8 KB direct-mapped primary data cache (Dcache), an 8 KB instruction cache, and a 96 KB 3-way associative secondary cache (Scache) which is used for both data and instructions. |
The Scache has a random replacement policy and is write-allocate and write-back, meaning that a cacheable store request to an address that is not in the cache causes that address to be loaded into the Scache, then modified and tagged as dirty for write-back later. |
Write-back of dirty cache lines occurs only when the line is removed from the Scache, either by the Scache controller to make room for a new cache line, or by the back-map to maintain coherence of the caches with the local memory and/or registers. |
Peak data transfer rates on the CRAY T3E-900 |
Type of access            Latency [CPU cycles]   Bandwidth [MB/s]
Dcache load               2                      7200
Scache load               8-10                   7200
Dcache or Scache store    --                     3600
These rates correspond to the maximum instruction issue rate of two loads per CPU cycle or one store per CPU cycle. |
From http://www.cray.com/products/systems/crayt3e/1200/performance.html |
REAL*8 AA(513,513), DD(513,513) |
REAL*8 X (513,513), Y (513,513) |
REAL*8 RX(513,513), RY(513,513) |
      DO J = 2,N-1
        DO I = 2,N-1
          XX = X(I+1,J)-X(I-1,J)
          YX = Y(I+1,J)-Y(I-1,J)
          XY = X(I,J+1)-X(I,J-1)
          YY = Y(I,J+1)-Y(I,J-1)
          A = 0.25 * (XY*XY+YY*YY)
          B = 0.25 * (XX*XX+YX*YX)
          C = 0.125 * (XX*XY+YX*YY)
          AA(I,J) = -B
          DD(I,J) = B+B+A*REL
          PXX = X(I+1,J)-2.*X(I,J)+X(I-1,J)
          QXX = Y(I+1,J)-2.*Y(I,J)+Y(I-1,J)
          PYY = X(I,J+1)-2.*X(I,J)+X(I,J-1)
          QYY = Y(I,J+1)-2.*Y(I,J)+Y(I,J-1)
          PXY = X(I+1,J+1)-X(I+1,J-1)-X(I-1,J+1)+X(I-1,J-1)
          QXY = Y(I+1,J+1)-Y(I+1,J-1)-Y(I-1,J+1)+Y(I-1,J-1)
          RX(I,J) = A*PXX+B*PYY-C*PXY
          RY(I,J) = A*QXX+B*QYY-C*QXY
        END DO
      END DO
The inner loop of this kernel has 47 floating point operations, 18 array reads and 4 array writes.
|
9 point stencil |
Since all six arrays are the same size and are accessed at the same rate, we can ensure that they do not interfere with each other if they do not conflict initially. For this example, it is convenient to optimize for only one 4096-word set of the 3-way set associative Scache (see the sketch below).
|
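The leading dimension 513 (rather than 512) in the declarations above is related to this cache behavior. The following Fortran sketch is a simplified, illustrative model (the real Scache is 3-way set associative as described earlier; here a single 4096-word direct-mapped set is assumed) of why an odd leading dimension helps:

  program padding_sketch
    ! Illustrates why a leading dimension of 513 words (not 512) helps:
    ! with a 4096-word set, column starts that are a power of two apart
    ! keep landing on the same few cache locations, while 513 spreads them.
    implicit none
    integer, parameter :: set_words = 4096
    integer :: j, offset512, offset513

    print *, 'column   set offset (lead dim 512)   set offset (lead dim 513)'
    do j = 0, 4
      offset512 = mod(j * 512, set_words)   ! cycles through only 8 distinct values
      offset513 = mod(j * 513, set_words)   ! spreads across the whole set
      print *, j, offset512, offset513
    end do
  end program padding_sketch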
CRAY-1 supercomputer |
Cray's first supercomputer. Introduced in 1976, this system had a peak performance of 133 megaflops. The first system was installed at Los Alamos National Laboratory. |
CRAY T90 systems are available in three models: the CRAY T94 system, offered in air- or liquid-cooled systems, that scales up to four processors; |
the CRAY T916 system, a liquid-cooled system that scales up to 16 processors; and the top-of-the-line CRAY T932 system, also liquid-cooled |
with up to 32 processors and has a peak performance of over 60 gigaflops |
This design enhances performance by noting that many applications calculate "vector-like" operations
|
This allows one to address two performance problems
|
They are typified by the Cray 1, XMP, YMP, C-90, CDC-205, ETA-10 and Japanese supercomputers from NEC, Fujitsu and Hitachi |
A pipeline for vector addition looks like:
|
Vector machines pipeline data through the CPU |
They are not so popular/relevant as in the past as
|
In fact the excellence of, say, the Cray C-90 is due to its very good memory architecture, allowing one to get enough operands to sustain the pipeline. |
Most workstation class machines have "good" CPU's but can never get enough data from memory to sustain good performance except for a few cache intensive applications |
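A minimal sketch of the kind of loop these machines pipeline well (an illustrative DAXPY-style kernel, not taken from the foils): every iteration performs the same independent operation, so operands can stream through the vector pipeline provided memory can feed it.

  program vector_loop
    ! y = a*x + y : a classic vector operation.  On a vector machine the
    ! multiply-add pipelines, producing one result per clock once full,
    ! provided memory can deliver x(i) and y(i) fast enough.
    implicit none
    integer, parameter :: n = 10000
    real*8 :: a, x(n), y(n)
    integer :: i

    a = 2.0d0
    x = 1.0d0
    y = 3.0d0

    do i = 1, n
      y(i) = a * x(i) + y(i)
    end do

    print *, 'y(1) =', y(1)
  end program vector_loop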
Memory Structure of Parallel Machines
|
and Heterogeneous mixtures |
Shared (Global): There is a global memory space, accessible by all processors.
|
Distributed (Local, Message-Passing): All memory is associated with processors.
|
Memory can be accessed directly (analogous to a phone call) as in red lines below or indirectly by message passing (green line below) |
We show two processors in a MIMD machine for distributed (left) or shared(right) memory architectures |
Uniform: All processors take the same time to reach all memory locations. |
Nonuniform (NUMA): Memory access is not uniform so that it takes a different time to get data by a given processor from each memory bank. This is natural for distributed memory machines but also true in most modern shared memory machines
|
Most NUMA machines these days have two memory access times
|
This simple two level memory access model gets more complicated in proposed 10 year out "petaflop" designs |
SIMD -lockstep synchronization
|
MIMD - Each Processor executes independent instruction streams |
MIMD Synchronization can take several forms
|
Hypercube Topology for 8 machines |
See next foil for basic chip -- these were connected in a hypercube |
This and related transputer designs were very innovative but failed as they could not exploit commodity microprocessor design economies |
Made out of essentially complete RS6000 workstations |
Network interface integrated in I/O bus |
Most successful MIMD Distributed Memory Machine |
Figure: IBM SP2 node -- Power2 CPU with L2 cache, memory controller and 4-way interleaved DRAM on a 64-bit, 50 MHz memory bus, plus a NIC; the general interconnection network is formed from 8-port switches
Figure: Intel Paragon node -- two i860 CPUs each with L1 cache, NI, DMA, driver, memory controller and 4-way interleaved DRAM; links are 8 bits wide, 175 MHz, bidirectional; 2D grid network with a processing node attached to every switch (Sandia's Intel Paragon XP/S-based supercomputer)
2D Grid Network with Processing Node Attached to Every Switch |
Powerful i860 node made this first "serious" MIMD distributed memory machine |
Full System by Artist and camera |
ASCI Red Interconnect |
For both parallel and sequential computers, the cost is accessing remote memories with some form of "communication" |
Data locality addresses this issue in both cases |
The differences are the quantitative size of the effect and what is done by the user versus what is done automatically |
Figure: processors, each with its own main memory ("slow" to access), linked by an interconnection network (access across the network "can be very slow")
Suppose two processors cache the same variable stored in memory of one of the processors |
One must ensure cache coherence so that when one cache value changes, all do! |
Figure: two boards, each with processors, L3 cache and main memory on a board-level interconnection network, joined by a system interconnection network; both boards hold a cached value of the same shared variable
There are 4 approaches to Cache Coherence -- the difficulty of maintaining correct caches in a parallel environment |
1) Ignore the problem in hardware -- let the user and/or software cope with this chore -- this is the approach followed in machines like the T3E and SP2 and in all explicit parallel programming models |
2) Snoopy Buses. This is the approach used in most SMP's, where the caches (at a given level) share a special bus also connected to memory. When a request is made in a given cache, it is broadcast on the bus, so that caches with a more recent value can respond |
3) Scalable Coherent Interface (SCI). This differs from the snoopy bus by using a fast serial connection which pipes requests through all processors. This standard was developed by the high energy physics community. |
4) Directory Schemes. These have a directory on each processor which keeps track of which cache line is where and which is up to date. The directories on each node are connected and communicate with each other when a memory location is accessed |
The system comes in two versions, deskside or a rack system. |
The deskside version has 1 to 4 node cards (2 to 8 CPUs). |
The rack system has 1 to 64 node cards, for a total of 2 to 128 CPUs. |
Each node is based on the 64-bit MIPS RISC R10000 architecture. |
Also, each node has two primary caches (each 32KB two-way set-associative) and one secondary L2 cache (1 or 4MB two-way set associative) per CPU. |
Each node has hardware cache coherency using a directory system and a maximum bandwidth of 780MB/sec. |
The entire system (Cray Origin2000) has up to 512 such nodes, that is, up to 1024 processors.
|
The SysAD (system address and data) bus of the previous figure connecting the two processors has a
|
The Hub's connections to the off-board net router chip and Xbow I/O interface are 1.56 GB/s each. |
RIEMANN is a general-purpose, higher-order accurate Eulerian gas dynamics code based on Godunov schemes. Dinshaw Balsara |
Laplace : the solution of sparse linear systems resulting from Navier-Stokes - Laplace equations. Danesh Tafti |
QMC (Quantum Monte Carlo) Lubos Mitas |
PPM (Piece-wise Parabolic Method) |
MATVEC (Matrix-Vector Multiply) Dave McWilliams |
Cactus Numerical Relativity Ed Seidel |
Machines like the SGI Origin 2000 have a distributed shared memory with a so called directory implementation (pioneered in DASH project at Stanford) of cache coherence |
Machines like SGI Cray T3E are distributed memory but do have fast get and put so as to be able to access single variables stored on remote memories
|
Origin 2000 approach does not scale as well as Cray T3E and large Origin 2000 systems must use message passing to link 128 node coherent memory subsystems |
Cray T3E offers a uniform (but not as good for small number of nodes) interface |
Pure Message passing / distributed memory is natural web model |
SIMD -- Single Instruction Multiple Data -- can have logically distributed or shared memory
|
CM-2 -- 64K processors with 1-bit arithmetic -- hypercube network, a broadcast network that can also combine, and a "global or" network |
Maspar, DECmpp - 16 K processors with 4 bit (MP-1), 32 bit (MP-2) arithmetic, fast two-dimensional mesh and slower general switch for communication |
Maspar (DECmpp) SIMD Machine at NPAC 1995 We had two such machines with 8K and 16K nodes respectively |
For a short time Digital resold the Maspar as the DECmpp |
ICL DAP 4096 Processors circa 1978 |
Disk Vault |
Also have heterogeneous compound architecture (metacomputer) gotten by arbitrary combination of MIMD or SIMD, Sequential or Parallel machines. |
Metacomputers can vary from full collections of several hundred PC's/Settop boxes on the (future) World Wide Web to a CRAY C-90 connected to a CRAY T3D |
This is a critical future architecture which is intrinsically distributed memory, as multi-vendor heterogeneity implies that one cannot have special hardware-enhanced shared memory
|
Cluster of workstations or PC's |
Heterogeneous MetaComputer System |
PCcube using serial ports on 80286 machines as REU Undergraduate Project 1986 |
Naegling at Caltech with Tom Sterling and John Salmon 1998 120 Pentium Pro Processors |
Beowulf at Goddard Space Flight Center |
NCSA measured some single processor results on HP Kayak PC with 300 MHz Intel Pentium II. The compiler is a Digital compiler with optimization level of 4. For comparison, also included are results from the Origin 2000. This is a CFD Application |
http://www.ncsa.uiuc.edu/SCD/Perf/Tuning/sp_perf/ |
In general the Origin is twice as fast. For the HP Kayak there is a sharp decline going from 64x64 to 128x128 matrices, while on the Origin the decline is more gradual, usually becoming memory bound beyond 256x256. This is a result of the smaller cache on the Intel chip. |
One example is an Associative memory - SIMD or MIMD or content addressable memories |
This is an example of a special purpose "signal" processing machine which can in fact be built from "conventional" SIMD or "MIMD" architectures |
This type of machine is not so popular as most applications are not dominated by computations for which good special purpose devices can be designed |
If only 10% of a problem is, say, "track-finding" or some special purpose processing, then who cares if you reduce that 10% by a factor of 100 (the sketch below quantifies this)
|
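A sketch quantifying this point using Amdahl's law (the 10% fraction and factor of 100 come from the bullet above):

  program special_purpose_limit
    ! If only a fraction f of the work is accelerated by a factor s,
    ! the overall speedup is 1 / ((1-f) + f/s)  (Amdahl's law).
    implicit none
    real :: f, s, speedup

    f = 0.10     ! fraction of the problem that is "track-finding"
    s = 100.0    ! factor by which the special purpose device speeds it up

    speedup = 1.0 / ((1.0 - f) + f / s)
    print *, 'Overall speedup:', speedup   ! about 1.11 -- hardly worth it
  end program special_purpose_limit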
N body problems (e.g. Newton's laws for one million stars in a globular cluster) can have successful special purpose devices |
See GRAPE (GRAvity PipE) machine (Sugimoto et al. Nature 345 page 90,1990)
|
Note GRAPE uses EXACTLY same parallel algorithm that one finds in the books (e.g. Solving Problems on Concurrent Processors) for N-body problems on classic distributed memory MIMD machines |
GRAPE will execute the classic O(N^2) (parallel) N body algorithm BUT this is not the algorithm used in most such computations |
Rather there is the O(N) or O(N log N) so-called "fast-multipole" algorithm which uses a hierarchical approach (compare the operation counts in the sketch below)
|
So special purpose devices cannot usually take advantage of new nifty algorithms! |
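To see why this matters, a sketch comparing the operation counts of the two algorithms (the particle count is illustrative and constant factors are ignored):

  program nbody_counts
    ! Compares the O(N^2) direct sum against an O(N log N) tree /
    ! fast-multipole style count for an N-body problem.
    implicit none
    real*8 :: n, direct, tree

    n = 1.0d6                          ! assumed one million stars
    direct = n * n                     ! ~1.0e12 pair interactions
    tree   = n * log(n) / log(2.0d0)   ! ~2.0e7, ignoring constant factors

    print *, 'Direct O(N^2) interactions:    ', direct
    print *, 'Tree O(N log N) interactions:  ', tree
    print *, 'Ratio (advantage of algorithm):', direct / tree
  end program nbody_counts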
GRAPE 4 1.08 Teraflops |
GRAPE 6 200 Teraflops |
Here the case for special purpose machines is less compelling than for GRAPE as QCD is "just" regular (easy to parallelize; lowish communication) and extremely floating point intensive. |
We illustrate with two machines which are classic MIMD distributed memory architecture with optimized nodes/communication networks |
BNL/Columbia QCDSP 400 Gigaflops |
Univ. of Tsukuba CP-PACS (general physics) 600 Gigaflops |
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors.
|
Medium-grain: Tens to few thousands of pieces, typically executed by microprocessors.
|
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines.
|
Note that a machine of given granularity can be used on algorithms of the same or finer granularity |
The last major architectural feature of a parallel machine is the network or design of hardware/software connecting processors and memories together. |
Bus: All processors (and memory) connected to a common bus or busses.
|
Switching Network: Processors (and memory) connected to routing switches like in telephone system.
|
Switch |
Bus |
Two dimensional grid, Binary tree, complete interconnect and 4D Hypercube. |
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are partial |
Useful terms include: |
Scalability: Can network be extended to very large systems? Related to wire length (synchronization and driving problems), degree (pinout) |
Fault Tolerance: How easily can system bypass faulty processor, memory, switch, or link? How much of system is lost by fault? |
Blocking: Some communication requests may not get through, due to conflicts caused by other requests. |
Nonblocking: All communication requests succeed. Sometimes just applies as long as no two requests are for same memory cell or processor. |
Latency (delay): Maximal time for nonblocked request to be transmitted. |
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on time needed for task. |
Worm Hole Routing -- Intermediate switch nodes do not wait for the full message but allow it to pass through in small packets |
Transmission Time for message of n bytes: |
T(n) = T0 + T1 × n, where |
T0 is the latency; it contains a term proportional to the number of hops, plus a term representing the interrupt processing time at each end and the time for the communication network and processor to synchronize |
T0 = TS + Td × (number of hops) |
T1 is the inverse bandwidth -- it can be made small if the pipe is of large size. |
In practice TS and T1 are most important and Td is unimportant (see the sketch below) |
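A sketch of this model (the values of TS, Td, the hop count and the bandwidth are illustrative assumptions; the half-performance message length n_1/2 = T0/T1 is a standard derived quantity, not from the foil):

  program message_time
    ! Transmission time for an n-byte message: T(n) = T0 + T1*n, with
    ! T0 = TS + Td * hops (startup/latency) and T1 the inverse bandwidth.
    implicit none
    real*8 :: ts, td, t1, t0, n, hops, time, n_half

    ts   = 20.0d-6           ! assumed software startup time (s)
    td   = 0.1d-6            ! assumed per-hop delay (s)
    hops = 4.0d0             ! assumed number of hops
    t1   = 1.0d0 / 100.0d6   ! assumed inverse bandwidth: 100 MB/s

    t0 = ts + td * hops
    n  = 1.0d6               ! a one-megabyte message
    time = t0 + t1 * n

    n_half = t0 / t1         ! message size at which half of peak bandwidth is reached
    print *, 'Time for 1 MB message (s):      ', time
    print *, 'Half-performance length (bytes):', n_half
  end program message_time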
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
Square blocks indicate shared memory copy performance |
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
From Aspects of Computational Science, Editor Aad van der Steen, published by NCF |
System                 Communication Speed (MB/s)   Computation Speed (Mflops)
IBM SP2                40                            267
Intel iPSC860          2.8                           60
Intel Paragon          200                           75
Kendall Square KSR-1   17.1                          40
Meiko CS-2             100                           200
Parsytec GC            20                            25
TMC CM-5               20                            128
Cray T3D               150                           300
tcomm = 4 or 8 bytes / (communication speed in Mbytes/sec)
|
tfloat = 1 / (computation speed in Mflops) |
Thus tcomm / tfloat is just 4 × the computation speed divided by the communication speed |
tcomm / tfloat is 26.7, 85, 1.5, 9.35, 8, 5, 25.6, 8 for the machines SP2, iPSC860, Paragon, KSR-1, Meiko CS2, Parsytec GC, TMC CM5, and Cray T3D respectively |
Latency makes the situation worse for small messages, and the ratio doubles for the 64-bit arithmetic natural on large problems! |
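A sketch reproducing (to rounding) the ratios quoted above from the table, using 4 bytes per word as stated:

  program comm_to_float_ratio
    ! tcomm/tfloat = 4 * (computation speed in Mflops) / (comm speed in MB/s)
    implicit none
    integer, parameter :: nm = 8
    character(len=12) :: name(nm)
    real :: comm(nm), flops(nm), ratio
    integer :: i

    name  = (/ 'SP2         ', 'iPSC860     ', 'Paragon     ', 'KSR-1       ', &
               'Meiko CS-2  ', 'Parsytec GC ', 'TMC CM-5    ', 'Cray T3D    ' /)
    comm  = (/ 40.0, 2.8, 200.0, 17.1, 100.0, 20.0, 20.0, 150.0 /)
    flops = (/ 267.0, 60.0, 75.0, 40.0, 200.0, 25.0, 128.0, 300.0 /)

    do i = 1, nm
      ratio = 4.0 * flops(i) / comm(i)
      print *, name(i), ' tcomm/tfloat =', ratio
    end do
  end program comm_to_float_ratio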
Results from Samson Cheung of NAS (Ames) |
Origin 2000: mpirun -np 2 ./lbw -B -n 4000 -D |
Total transfer time: 57.281 sec. |
Transfer rate: 69.8 transfers/sec. |
Message size: 1000000 bytes |
Bandwidth: 139.664 10e+06 Bytes/sec |
Sun E10000: tmrun -np 2 ./lbw -B -n 4000 -D |
Total transfer time: 54.487 sec. |
Transfer rate: 73.4 transfers/sec. |
Message size: 1000000 bytes |
Bandwidth: 146.825 10e+06 Bytes/sec |
Origin 2000: mpirun -np 2 ./lbw -L -n 2500000 -D |
Total transfer time: 52.597 sec. |
Transfer rate: 47531.4 transfers/sec. |
Message size: 40 bytes |
Latency: 10.5 microsec. |
Sun E10000: tmrun -np 2 ./lbw -L -n 2500000 -D |
Total transfer time: 34.585 sec. |
Transfer rate: 72286.7 transfers/sec. |
Message size: 40 bytes |
Latency: 6.9 microsec. |
Note: Origin CPU about twice performance of Sun CPU |
Shared Address Space or Shared Memory
|
Explicitly parallel or Data Parallel
|
Any processor can directly reference any memory location
|
Convenient:
|
Naturally provided on wide range of platforms
|
Popularly known as shared memory machines or model
|
Process: virtual address space plus one or more threads of control |
Portions of address spaces of processes are shared |
Writes to shared address visible to other threads (in other processes too) |
Natural extension of uniprocessor's model:
|
OS uses shared memory to coordinate processes |
Memory capacity increased by adding modules, I/O by controllers
|
Communication is natural extension of uniprocessor |
Already have processor, one or more memory modules and I/O controllers connected by hardware interconnect of some sort |
"Mainframe" approach
|
"Minicomputer" approach
|
Problem is interconnect: cost (crossbar) or bandwidth (bus)
|
Dance-hall: bandwidth still scalable, but lower cost than crossbar
|
Distributed memory or non-uniform memory access (NUMA)
|
Caching shared (particularly nonlocal) data? |
Complete computer as building block, including I/O
|
Programming model:
|
High-level block diagram similar to distributed-memory SAS (Shared Address Space)
|
Programming model more removed from basic hardware operations (?)
|
Send specifies buffer to be transmitted and receiving process |
Receive specifies sending process and application storage to receive into |
Memory to memory copy, but need to name processes |
Optional tag on send and matching rule on receive |
User process names local data and entities in process/tag space too |
In its simplest form, the send/recv match achieves a pairwise synchronization event; collective communication is also provided |
Many overheads: copying, buffer management, protection |
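A minimal sketch of this send/receive naming using the standard MPI-1 Fortran bindings (MPI is the message passing library used in the benchmarks later in these foils; the buffer size and tag here are arbitrary assumptions):

  program sendrecv_sketch
    ! Minimal two-process exchange: rank 0 sends a buffer to rank 1.
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 1000
    real*8 :: buf(n)
    integer :: rank, ierr, status(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

    if (rank == 0) then
      buf = 1.0d0
      ! send names: buffer, count, type, destination process, tag, communicator
      call MPI_SEND(buf, n, MPI_DOUBLE_PRECISION, 1, 99, MPI_COMM_WORLD, ierr)
    else if (rank == 1) then
      ! receive names the sending process and a matching tag
      call MPI_RECV(buf, n, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, status, ierr)
    end if

    call MPI_FINALIZE(ierr)
  end program sendrecv_sketch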
Early machines such as Caltech Hypercube and first commercial Intel and nCUBE designs used a FIFO on each node to store and forward messages
|
Diminishing role of topology in modern machines
|
64 node n=6 hypercube in Caltech Computer Science Dept. |
Commodity SMP |
All coherence and multiprocessing glue in processor module |
Highly integrated, targeted at high volume |
Low latency and bandwidth |
Figure: Intel Pentium Pro quad SMP -- P-Pro modules (each with CPU, bus interface, MIU and 256-KB L2 cache) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), together with an interrupt controller, a memory controller driving 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges connecting PCI buses with PCI I/O cards
It is perhaps the most successful SMP (Symmetric Multiprocessor), with applicability to both the commercial and scientific markets |
Its SMP characteristics are seen in its low uniform latency |
One cannot build huge E10000 systems (only up to 64 nodes and each node is slower than Origin) |
One should cluster multiple E10000's to build a supercomputer |
These are very successful in commercial server market for database and related applications |
E10000 acquired from Cray via SGI is also called Starfire |
TPC-D Benchmarks, 1000 Gbyte (Terabyte), November 1998 |
These measure typical large scale database queries |
System               Database   Power Metric (QppD)   5 Year Total System Cost   $/perf ($/QphD)
Sun Starfire (SMP)   Oracle     27,024.6              $9,660,193                 $776
IBM RS/6000 (MPP)    DB2        19,137.5              $11,380,178                $797
Sun 4x6000 (Clus)    Informix   12,931.9              $11,766,932                $1,353
NCR 5150 (MPP)       Teradata   12,149.2              $14,495,886                $2,103
The Starfire server houses a group of system boards interconnected by a centerplane.
|
On each system board there are up to four 336 MHz UltraSPARC microprocessor modules, each with a supporting two-level, 4 Mbyte cache (64 modules per Starfire system). |
There can be four memory banks with a capacity of up to 4 Gbytes per system board (64 Gbytes per Starfire server). |
There are two SBuses per board, each with slots for up to two adapters for networking and I/O (32 SBuses or 64 slots per system).
|
This is aimed at commercial customers where Origin 2000 is not a major player |
The Starfire scales to smaller sizes than machines like the IBM SP2 and Cray T3E but significantly larger sizes than competing SMP's |
Proving ground and driver for innovative architecture and techniques
|
Large-scale multiprocessors replace vector supercomputers
|
Evolution and role of software have blurred boundary
|
Hardware organization converging too
|
Even clusters of workstations/SMPs are parallel systems
|
Programming models distinct, but organizations converging
|
Node: processor(s), memory system, plus communication assist
|
Scalable network |
Convergence allows lots of innovation, now within framework
|
A generic modern multiprocessor (Culler) |
This uses a clever idea developed over many years by Burton Smith, who originally used it in the Denelcor HEP, one of the first MIMD machines, over 15 years ago |
MTA(multithreaded architectures) are designed to hide the different access times of memory and CPU cycle time.
|
First 2 processor system |
Tera computer system is a shared memory multiprocessor.
|
The Tera is a multi-processor which potentially can accommodate up to 256 processors.
|
The clock speed is nominally 333 Mhz, giving each processor a data path bandwidth of one billion 64-bit results per second and a peak performance of one gigaflops. |
The Tera processors are multithreaded (each hardware thread is called a stream), and each processor switches context every cycle among as many as 128 streams, thereby hiding up to 128 cycles (384 ns) of memory latency. |
Each processor executes a 21 stage pipeline and so can have 21 separate streams executing simultaneously |
Each stream can issue as many as eight memory references without waiting for earlier ones to finish, further augmenting the memory latency tolerance of the processor. |
A stream implements a load-store architecture with three addressing modes and 31 general-purpose 64-bit registers.
|
The peak memory bandwidth is 2.67 gigabytes per second. |
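A sketch of the latency-hiding arithmetic (the one-instruction-per-cycle issue model is a simplification; the 128-cycle latency figure is the one quoted above):

  program mta_latency_hiding
    ! If a memory reference takes L cycles and each stream issues one
    ! instruction per cycle when it runs, roughly L ready streams are
    ! needed to keep the processor busy while any one stream waits.
    implicit none
    integer :: latency_cycles, streams_available
    real :: utilization

    latency_cycles    = 128   ! up to 128 cycles (384 ns) as quoted above
    streams_available = 128   ! hardware streams per Tera processor

    utilization = min(1.0, real(streams_available) / real(latency_cycles))
    print *, 'Streams needed to hide latency: ', latency_cycles
    print *, 'Estimated processor utilization:', utilization
  end program mta_latency_hiding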
From Bokhari (ICASE) |
From Bokhari (ICASE) |
The interconnection network is a sparsely populated, packet-switched 3-D network containing p^(3/2) nodes, where p is the number of processors. |
These nodes are toroidally connected in three dimensions to form a p^(1/2)-ary three-cube, and processor and memory resources are attached to some of the nodes. |
The latency of a node is three cycles: a message spends two cycles in the node logic proper and one on the wire that connects the node to its neighbors. |
A p-processor system has worst-case one-way latency of 4.5p^(1/2) cycles. |
Messages are assigned random priorities and then routed in priority order. Under heavy load, some messages are derouted by this process. The randomization at each node ensures that each packet eventually reaches its destination.
|
A node has four ports (five if a resource is attached). |
Each port simultaneously transmits and receives an entire 164-bit packet every 3 ns clock cycle. |
Of the 164 bits, 64 are data, so the data bandwidth per port is 2.67 GB/s in each direction. |
The network bisection bandwidth is 2.67p GB/s. (p is number of processors) |
The network routing nodes contain no buffers other than those required for the pipeline.
|
From Allan Snavely at SDSC |
The overall hardware configuration of the system: |
Processors 16 64 256 |
Peak Gflops 16 64 256 |
Memory, Gbytes 16-32 64-128 256-512 |
HIPPI channels 32 128 512 |
I/O, Gbytes/sec 6.2 25 102 |
Relative MTA and T90 Performance |
From Allan Snavely at SDSC |
NAS Benchmarks (at 255 MHz and 300 MHz) |
KSR-1 (Colorado) |
Cache Only Machines "only" have cache and are typified by the Kendall Square KSR-1,2 machines. Although this company is no longer in business, the basic architecture is interesting and could still be used in future important machines |
In this class of machine one has a NUMA architecture with memory attached to a given processor being lower access cost |
In simplest COMA architectures, think of this memory as the cache and when needed migrate data to this cache |
However in conventional machines, all data has a natural "original home"; in (simple) COMA, the home of the data moves when it is accessed and one hopes that data is "attracted" through access to "correct" processor |
Applications that require petaFLOPS can already be identified
|
The need for ever greater computing power will remain. |
PetaFLOPS systems are right step for the next decade |
Nuclear Weapons Stewardship (ASCI) |
Cryptology and Digital Signal Processing |
Satellite Data Analysis |
Climate and Environmental Modeling |
3-D Protein Molecule Reconstruction |
Real-Time Medical Imaging |
Severe Storm Forecasting |
Design of Advanced Aircraft |
DNA Sequence Matching |
Molecular Simulations for nanotechnology |
Large Scale Economic Modelling |
Intelligent Planetary Spacecraft |
Why does one need a petaflop (10^15 operations per second) computer? |
These are problems where quite viscous (oil, pollutants) liquids percolate through the ground |
Very sensitive to details of material |
Most important problems are already solved at some level, but most solutions are insufficient and need improvement in various respects:
|
Oil Reservoir Simulation |
Geological variation occurs down to the pore size of the rock - almost 10^-6 metres - model this (statistically) |
Want to calculate flow between wells which are about 400 metres apart |
10^3 x 10^3 x 10^2 = 10^8 grid elements |
30 species |
10^4 time steps |
300 separate cases need to be considered |
3x10^9 words of memory per case |
10^12 words total if all cases considered in parallel |
10^19 floating point operations |
3 hours on a petaflop computer |
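A sketch checking this arithmetic (all numbers are taken from the bullets above; the flop total is the quoted 10^19):

  program reservoir_estimate
    ! Reproduces the memory and run-time estimates from the foil.
    implicit none
    real*8 :: grid, species, cases, words_per_case, words_total
    real*8 :: total_flops, pflops, seconds, hours

    grid    = 1.0d8     ! 10^3 x 10^3 x 10^2 grid elements
    species = 30.0d0
    cases   = 300.0d0

    words_per_case = grid * species            ! 3 x 10^9 words per case
    words_total    = words_per_case * cases    ! ~10^12 words for all cases

    total_flops = 1.0d19                       ! quoted total operation count
    pflops      = 1.0d15                       ! one petaflop machine
    seconds     = total_flops / pflops
    hours       = seconds / 3600.0d0           ! roughly 3 hours

    print *, 'Words per case:  ', words_per_case
    print *, 'Words, all cases:', words_total
    print *, 'Hours at 1 Pflop:', hours
  end program reservoir_estimate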
Conventional (Distributed Shared Memory) Silicon
|
Note Memory per Flop is much less than one to one |
Natural scaling says time steps decrease at the same rate as spatial intervals, and so the memory needed goes like (FLOPS in Gigaflops)**0.75 (see the worked sketch below)
|
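A worked sketch of this scaling rule (the constant of proportionality is assumed to be 1 Gbyte at 1 Gflops, as the rule is usually quoted; this is an illustrative assumption):

  program petaflop_memory
    ! Memory needed ~ (performance in Gflops)**0.75, with the constant
    ! of proportionality assumed to be 1 Gbyte at 1 Gflops.
    implicit none
    real*8 :: gflops, gbytes_memory

    gflops = 1.0d6                     ! one petaflop = 10^6 Gflops
    gbytes_memory = gflops**0.75d0     ! ~3.2 x 10^4 Gbytes, i.e. ~30 Tbytes

    print *, 'Estimated memory (Gbytes):', gbytes_memory
    print *, 'Estimated memory (Tbytes):', gbytes_memory / 1.0d3
  end program petaflop_memory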
Superconducting Technology is promising but can it compete with silicon juggernaut? |
Should be able to build a simple 200 GHz superconducting CPU with modest superconducting caches (around 32 Kilobytes) |
Must use same DRAM technology as for silicon CPU ? |
So tremendous challenge to build latency tolerant algorithms (as over a factor of 100 difference in CPU and memory speed) but advantage of factor 30-100 less parallelism needed |
Processor in Memory (PIM) Architecture is follow on to J machine (MIT) Execube (IBM -- Peter Kogge) Mosaic (Seitz)
|
One could take in year 2007 each two gigabyte memory chip and alternatively build as a mosaic of
|
12000 chips (Same amount of Silicon as in first design but perhaps more power) gives:
|
Performance data from uP vendors |
Transistor count excludes on-chip caches |
Performance normalized by clock rate |
Conclusion: Simplest is best! (250K Transistor CPU) |
Charts: normalized SPECINT and SPECFP performance plotted against millions of transistors (CPU)
Fixing 10-20 Terabytes of Memory, we can get |
16000-way parallel natural evolution of today's machines, with various architectures from distributed shared memory to clustered hierarchy
|
5000 way parallel Superconducting system with 1 Petaflop performance but terrible imbalance between CPU and memory speeds |
12 million way parallel PIM system with 12 petaflop performance and "distributed memory architecture", as off-chip access will have serious penalties |
There are many hybrid and intermediate choices -- these are extreme examples of "pure" architectures |
Chip            EXECUBE                 AD SHARC                TI MVP              MIT MAP               Terasys PIM
First Silicon   1993                    1994                    1994                1996                  1993
Storage         0.5 MB                  0.5 MB                  0.05 MB             0.128 MB              0.016 MB
Peak            50 Mips                 120 Mflops              2000 Mops           800 Mflops            625 M bit ops
MB/Perf.        0.01 MB/Mip             0.005 MB/MF             0.000025 MB/Mop     0.00016 MB/MF         0.000026 MB/bit op
Organization    16 bit SIMD/MIMD CMOS   Single CPU and Memory   1 CPU, 4 DSP's      4 Superscalar CPU's   1024 16-bit ALU's
Charts: MB per cm2, MF per cm2, and MB/MF ratios
Quantum-Mechanical Computers by Seth Lloyd, Scientific American, Oct 95 |
Chapter 6 of The Feynman Lectures on Computation edited by Tony Hey and Robin Allen, Addison-Wesley, 1996 |
Quantum Computing: Dream or Nightmare? Haroche and Raimond, Physics Today, August 96 page 51 |
Basically any physical system can "compute" as one "just" needs a system that gives answers that depend on inputs and all physical systems have this property |
Thus one can build "superconducting", "DNA" or "Quantum" computers exploiting respectively superconducting, molecular or quantum mechanical rules |
For a "new technology" computer to be useful, one needs to be able to
|
Conventional computers are built around bit ( taking values 0 or 1) manipulation |
One can build arbitrarily complex arithmetic if one has some way of implementing NOT and AND |
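For instance, a one-bit half adder can be built from NOT and AND alone (OR follows from De Morgan's law); a small Fortran sketch using logicals (illustrative, not from the foils):

  program half_adder
    ! Builds OR from NOT and AND (De Morgan), then a one-bit half adder:
    ! carry = a AND b, sum = (a AND NOT b) OR (NOT a AND b).
    implicit none
    logical :: a, b, carry, s

    a = .true.
    b = .true.

    carry = a .and. b
    s = .not. ((.not. (a .and. .not. b)) .and. (.not. ((.not. a) .and. b)))

    print *, 'a, b       =', a, b
    print *, 'sum, carry =', s, carry   ! 1 + 1 = 10 in binary
  end program half_adder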
Quantum Systems naturally represent bits
|
Interactions between quantum systems can cause "spin-flips" or state transitions and so implement arithmetic |
Incident photons can "read" state of system and so give I/O capabilities |
Quantum "bits" called qubits have another property as one has not only
|
Lloyd describes how such coherent states provide new types of computing capabilities
|
Superconductors produce wonderful "wires" which transmit picosecond (10^-12 seconds) pulses at near speed of light
|
Niobium used in constructing such superconducting circuits can be processed by similar fabrication techniques to CMOS |
Josephson Junctions allow picosecond performance switches |
BUT IBM (1969-1983) and Japan (MITI 1981-90) terminated major efforts in this area |
New ideas have resurrected this concept using RSFQ -- Rapid Single Flux Quantum -- approach |
This naturally gives a bit which is 0 or 1 (or in fact n units!) |
This gives interesting circuits of similar structure to CMOS systems but with a clock speed of order 100-300GHz -- factor of 100 better than CMOS which will asymptote at around 1 GHz (= one nanosecond cycle time) |
At least two major problems: |
The semiconductor industry will invest some $40B in CMOS "plants" and infrastructure
|
Cannot build memory to match CPU speed and current designs have superconducting CPU's (with perhaps 256 Kbytes superconducting memory per processor) but conventional CMOS memory
|
Superconducting technology also has a bad "name" due to IBM termination! |