Given by Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science on Fall Semester 97. Foils prepared 19 September 1997
Outside Index
Summary of Material
We describe the simple nature of parallel machines like the nCUBE2 |
Technologies include conventional, superconducting and Quantum systems but the former dominates |
Sequential architectures are dominated by memory (cache) issues |
Vector, SIMD MIMD are classic architectures |
There is a growing interest in distributed metacomputers |
Special purpose machines are typically not so successful |
The details of networks is not as important as it used to be! |
Outside Index Summary of Material
http://www.npac.syr.edu/users/gcf/cps615arch97 |
Geoffrey Fox |
Syracuse University NPAC |
111 College Place Syracuse NY 13244 4100 |
3154432163 |
Each node is a CPU and 6 memory chips -- the CPU chip integrates communication channels with floating point, integer and logical CPU functions |
We can choose technology and architecture separately in designing our high performance system |
Technology is like choosing ants, people or tanks as the basic units in our society analogy
|
In the HPCC arena, we can distinguish current technologies
|
Near term technology choices include
|
Further term technology choices include
|
It will cost $40 Billion for the next round of industry investment in CMOS plants, and this huge investment makes it hard for new technologies to "break in" |
Architecture is equivalent to organization or design in society analogy
|
We can distinguish formal and informal parallel computers |
Informal parallel computers are typically "metacomputers"
|
Metacomputers are a very important trend; they use similar software and algorithms to conventional "MPP's" but typically have less optimized parameters
|
Formal high performance computers are the classic (basic) object of study and are |
"closely coupled" specially designed collections of compute nodes which have (in principle) been carefully optimized and balanced in the areas of
|
In society, we see a rich set of technologies and architectures
|
With several different communication mechanisms, each with different trade-offs
|
Quantum-Mechanical Computers by Seth Lloyd, Scientific American, Oct 95 |
Chapter 6 of The Feynman Lectures on Computation edited by Tony Hey and Robin Allen, Addison-Wesley, 1996 |
Quantum Computing: Dream or Nightmare? Haroche and Raimond, Physics Today, August 96 page 51 |
Basically any physical system can "compute" as one "just" needs a system that gives answers that depend on inputs and all physical systems have this property |
Thus one can build "superconducting", "DNA" or "Quantum" computers exploiting respectively superconducting, molecular or quantum mechanical rules |
For a "new technology" computer to be useful, one needs to be able to
|
Conventional computers are built around bit (taking values 0 or 1) manipulation |
One can build arbitrarily complex arithmetic if one has some way of implementing NOT and AND |
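As a minimal illustration (a sketch added here, not from the foils), OR, XOR and hence a one-bit half adder can be composed from NOT and AND alone; in C: |

/* Sketch: composing OR, XOR and a one-bit half adder from NOT and AND only */
#include <stdio.h>

static int NOT(int a)        { return !a; }
static int AND(int a, int b) { return a && b; }
/* De Morgan: OR(a,b) = NOT(AND(NOT a, NOT b)) */
static int OR(int a, int b)  { return NOT(AND(NOT(a), NOT(b))); }
static int XOR(int a, int b) { return AND(OR(a, b), NOT(AND(a, b))); }

int main(void)
{
    int a, b;
    for (a = 0; a <= 1; a++)
        for (b = 0; b <= 1; b++)   /* half adder: sum = XOR(a,b), carry = AND(a,b) */
            printf("%d + %d -> sum %d, carry %d\n", a, b, XOR(a, b), AND(a, b));
    return 0;
}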
Quantum Systems naturally represent bits
|
Interactions between quantum systems can cause "spin-flips" or state transitions and so implement arithmetic |
Incident photons can "read" state of system and so give I/O capabilities |
Quantum "bits" called qubits have another property as one has not only
|
Lloyd describes how such coherent states provide new types of computing capabilities
|
Superconductors produce wonderful "wires" which transmit picosecond (10^-12 seconds) pulses at near the speed of light
|
Niobium used in constructing such superconducting circuits can be processed by similar fabrication techniques to CMOS |
Josephson Junctions allow picosecond performance switches |
BUT IBM (1969-1983) and Japan (MITI 1981-90) terminated major efforts in this area |
New ideas have resurrected this concept using RSFQ -- Rapid Single Flux Quantum -- approach |
This naturally gives a bit which is 0 or 1 (or in fact n units!) |
This gives interesting circuits of similar structure to CMOS systems but with a clock speed of order 100-300GHz -- factor of 100 better than CMOS which will asymptote at around 1 GHz (= one nanosecond cycle time) |
At least two major problems: |
Semiconductor industry will invest some $40B in CMOS "plants" and infrastructure
|
Cannot build memory to match CPU speed and current designs have superconducting CPU's (with perhaps 256 Kbytes superconducting memory per processor) but conventional CMOS memory
|
Superconducting technology also has a bad "name" due to IBM termination! |
These foils come from a summary by David Bailey |
First of all, there seems to be general agreement that Moore's Law will not stop anytime soon. |
The current state of the art in the semiconductor industry is 250 nm, as recently announced by Intel among others. Researchers are confident that the current approach can be used for at least another three generations. |
In the years ahead we may even see some manufacturers skip a generation, proceeding directly to significantly smaller feature sizes. This means that the 100 nm technology wall will be reached earlier than previously anticipated. |
Below about 100 nm feature sizes, progress will be more difficult, for various well known reasons, among them the lack of any substance that can be used as a lens for photolithography at such wavelengths. |
Nonetheless, there are a number of "tricks" in the works, including the usage of diffraction gratings, parallel exposures, and others. |
Groups working on these "ultimate" silicon techniques include Frances Houle's group at IBM Almaden and Jeff Baker's group at UC Berkeley. |
Also helping will be some improvements such as better materials and increased wafer sizes. The consensus is that these techniques will be good for another two generations, to about 50 nm. Thus Moore's Law should continue until about 2010 or 2012. |
Below about 50 nm feature sizes, it appears that a completely new device fabrication technology is needed. |
Progress in electron beams and X-rays, which were the leading candidates of a few years ago, has stalled.
|
The researchers I spoke with nonetheless agree that future devices will be electronic for the foreseeable future. |
As Stan Williams of HP points out, electrons are the ideal basis for device technology because they have basically zero size and mass, can travel at large fractions of the speed of light and interact strongly with matter. |
In contrast, DNA computing and the like are still curiosities with no clear path to practical high-speed computing. |
One exciting development, which has emerged just in the past year or two, is nanotubes, i.e. cylindrical buckyballs. |
It was recently discovered that a nanotube can be "tuned" from insulator to semiconductor to conductor just by changing the pitch of the helical structure of carbon atoms. |
Kinks introduced in a tube can be used to form a conductor-semiconductor junction.
|
More than one person with whom Bailey talked emphasized that future research should not focus exclusively on fabrication technology. |
An equally important question is what is worth fabricating at these small feature sizes. This is because conventional transistor-based logic appears increasingly troublesome below about 50 nm. |
Thus a new paradigm in logic design is as much needed as a new paradigm in fabrication technology. |
Several at the meeting were aware of Likharev's rapid single flux quantum (RSFQ) superconducting logic. They agree that it is a very interesting development, and further that it is fairly certain to work well. |
However, they feel that the requirement for liquid helium temperatures is a severe impediment to widespread utilization, and without commodity production it is unlikely to be economic.
|
According to Stan Williams, Likharev recently came to Hewlett Packard. |
He offered HP all of his patents, only asking that they develop the technology. |
But HP management still declined. Williams quipped that if some company decided to produce RSFQ technology, somehow assembling the $100 million or so required in start-up capital, the day after it produced its first devices it would be pronounced a great success, but on the second day the company would fold financially. |
One interesting possibility for a future logic system is "quantum dot cellular automata", abbreviated QCA. |
Wolfgang Porod of the University of Notre Dame gave a talk on these devices at Ames workshop. His group has proposed configurations of cells, each of which is a collection of five electron sites arranged in an X, with two free electrons within the cell. |
According to their studies, configurations of these cells can propagate signals, invert signals, form majority vote logic, and thus can be formed to produce practical devices. |
The devices are semi-classical, semi-quantum, in that interactions within a cell can involve quantum tunneling, but interactions between cells involve only classical electromagnetic effects. |
Like RSFQ, initial demonstration devices of QCA logic will need to be super-cooled, assuming fabrication feature sizes in the tens of nanometers. |
But unlike RSFQ (from what we know of RSFQ), the smaller the fabrication dimension, the higher the tolerable temperature, according to thermodynamic analyses. |
Indeed, if one could eventually fabricate QCA devices in the realm of 2-5 nanometers, then they could operate at room temperature! |
Data locality implies the CPU finds the information it needs in the cache, which stores the most recently accessed information |
This means one reuses a given memory reference in many nearby computations e.g. |
A1 = B*C |
A2 = B*D + B*B |
.... Reuses B |
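A sketch (added here, not from the foils) of how such reuse is exploited in practice: blocked (tiled) matrix multiplication in C, where each BSIZE x BSIZE tile is reused many times while it sits in cache; N and BSIZE are illustrative values |

#define N 512        /* illustrative matrix size */
#define BSIZE 64     /* illustrative tile size chosen so three tiles fit in cache */

/* C is assumed to be initialized to zero by the caller */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
{
    int ii, jj, kk, i, j, k;
    for (ii = 0; ii < N; ii += BSIZE)
        for (jj = 0; jj < N; jj += BSIZE)
            for (kk = 0; kk < N; kk += BSIZE)
                /* each element of the A and B tiles is reused BSIZE times */
                for (i = ii; i < ii + BSIZE; i++)
                    for (j = jj; j < jj + BSIZE; j++)
                        for (k = kk; k < kk + BSIZE; k++)
                            C[i][j] += A[i][k] * B[k][j];
}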
Figure: the memory hierarchy -- L3 cache, main memory, disk -- with memory capacity increasing and memory speed decreasing as one moves away from the processor (factor of 100 difference between processor and main memory speed)
Sequential or von Neumann Architecture |
Vector (Super)computers |
Parallel Computers
|
Instructions and data are stored in the same memory, for which there is a single link (the von Neumann Bottleneck) to the CPU, which decodes and executes instructions |
The CPU can have multiple functional units |
The memory access can be enhanced by use of caches made from faster memory to allow greater bandwidth and lower latency |
Fig 1.14 of Aspects of Computational Science, Editor Aad van der Steen, published by NCF |
This design enhances performance by noting that many applications calculate "vector-like" operations
|
This allows one to address two performance problems
|
They are typified by the Cray 1, XMP, YMP, C-90, the CDC-205 and ETA-10, and Japanese supercomputers from NEC, Fujitsu and Hitachi |
A pipeline for vector addition looks like:
|
Vector machines pipeline data through the CPU |
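A typical "vector-like" operation that such pipelines target is the DAXPY loop y = a*x + y; a minimal C version (illustrative, added here): |

/* DAXPY: y <- a*x + y, the archetypal vector operation; a vector unit streams
   x[i] and y[i] through its pipelined multiplier and adder, delivering one
   result per clock once the pipeline is full */
void daxpy(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}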
They are not so popular/relevant as in the past as
|
In fact the excellence of, say, the Cray C-90 is due to its very good memory architecture, which allows one to fetch enough operands to sustain the pipeline.
Most workstation class machines have "good" CPU's but can never get enough data from memory to sustain good performance except for a few cache intensive applications |
Pipelining is familiar from such everyday activities as getting food in a cafeteria, where one processes one person per "clock cycle" |
The clock cycle here is the maximum time anybody takes at a single "stage", where a stage is a component of the meal (salad, entrée etc.) |
Note any one person takes about 5 clock cycles in this pipeline but the pipeline processes one person per clock cycle |
The pipeline has a problem if there is a "stall" -- here, that one of the people wants an entrée which needs to be fetched from the kitchen. This delays everybody! |
In the computer case, a stall is typically caused by data not being ready for a particular instruction |
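As a rough worked example (added here, not on the original foil): an s-stage pipeline processing n independent items takes about |
(s + n - 1) clock cycles, compared with s x n cycles without overlap |
so the 5-stage cafeteria serving 100 people needs about 104 cycles instead of 500, and each stall adds cycles during which nothing emerges from the pipeline |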
Three Instructions are shown overlapped -- each starting one clock cycle after last |
Taken from David Patterson CS252 Berkeley Fall 1996 |
3 Functional Units FP Adder, FP Multiplier, FP Divider |
8 Kinds of Stages in FP Units |
Stage Functional Unit Description |
A FP Adder Mantissa ADD stage |
D FP Divider Divide Pipeline stage |
E FP Multiplier Exception Test stage |
M FP Multiplier First stage of multiplier |
N FP Multiplier Second stage of multiplier |
R FP Adder Rounding stage |
S FP Adder Operand Shift stage |
U Unpack Floating point numbers |
Several different pipelines with different lengths! |
Add, Subtract -- 4 clocks: U S+A A+R R+S |
Multiply -- 8 clocks: U E+M M M M N+A R |
Divide -- 38 clocks: U A R D(28) D+A D+R D+R D+A D+R A R |
Square Root -- 110 clocks: U E (A+R)(108) A R |
Negate -- 2 clocks: U S |
Absolute Value -- 2 clocks: U S |
Floating Point Compare -- 3 clocks: U A R |
Very High Speed Computing Systems, Proc. of the IEEE 54, 12, pp. 1901-1909 (1966), and |
Some Computer Organizations and their Effectiveness, IEEE Trans. on Computers C-21, pp. 948-960 (1972) -- both papers by M. J. Flynn |
SISD -- Single Instruction stream, Single Data Stream -- i.e. von Neumann Architecture |
MISD -- Multiple Instruction stream, Single Data Stream -- Not interesting |
SIMD -- Single Instruction stream, Multiple Data Stream |
MIMD -- Multiple Instruction stream and Multiple Data Stream -- dominant parallel system with ~one to ~one match of instruction and data streams. |
Memory Structure of Parallel Machines
|
and Heterogeneous mixtures |
Shared (Global): There is a global memory space, accessible by all processors.
|
Distributed (Local, Message-Passing): All memory is associated with processors.
|
Memory can be accessed directly (analogous to a phone call) as in red lines below or indirectly by message passing (green line below) |
We show two processors in a MIMD machine for distributed (left) or shared(right) memory architectures |
Uniform: All processors take the same time to reach all memory locations. |
Nonuniform (NUMA): Memory access is not uniform so that it takes a different time to get data by a given processor from each memory bank. This is natural for distributed memory machines but also true in most modern shared memory machines
|
Most NUMA machines these days have two memory access times
|
This simple two level memory access model gets more complicated in proposed 10 year out "petaflop" designs |
SIMD -lockstep synchronization
|
MIMD - Each Processor executes independent instruction streams |
MIMD Synchronization can take several forms
|
MIMD Distributed Memory
|
MIMD with logically shared memory but usually physically distributed. The latter is sometimes called distributed shared memory.
|
A special case of this is a network of workstations (NOW's) or personal computers (metacomputer) |
Issues include:
|
Suppose two processors cache the same variable stored in memory of one of the processors |
One must ensure cache coherence so that when one cache value changes, all do! |
Figure: processors connected through an interconnection network, two of them holding cached values of the same shared variable
Machines like the SGI Origin 2000 have a distributed shared memory with a so called directory implementation (pioneered in DASH project at Stanford) of cache coherence |
Machines like SGI Cray T3E are distributed memory but do have fast get and put so as to be able to access single variables stored on remote memories
|
Origin 2000 approach does not scale as well as Cray T3E and large Origin 2000 systems must use message passing to link 128 node coherent memory subsystems |
Cray T3E offers a uniform (but not as good for small number of nodes) interface |
Message passing / distributed memory is natural web model |
SIMD -- Single Instruction Multiple Data -- can have logically distributed or shared memory
|
CM2 -- 64K processors with 1-bit arithmetic -- hypercube network, a broadcast network that can also combine, and a "global or" network |
Maspar, DECmpp - 16 K processors with 4 bit (MP-1), 32 bit (MP-2) arithmetic, fast two-dimensional mesh and slower general switch for communication |
One can also have a heterogeneous compound architecture (metacomputer) obtained by an arbitrary combination of MIMD or SIMD, Sequential or Parallel machines. |
Metacomputers can vary from full collections of several hundred PC's/Settop boxes on the (future) World Wide Web to a CRAY C-90 connected to a CRAY T3D |
This is a critical future architecture which is intrinsically distributed memory, as multi-vendor heterogeneity implies that one cannot have special hardware-enhanced shared memory
|
Cluster of workstations or PC's |
Heterogeneous MetaComputer System |
One example is an Associative memory - SIMD or MIMD or content addressable memories |
This is an example of a special purpose "signal" processing machine which can in fact be built from "conventional" SIMD or "MIMD" architectures |
This type of machine is not so popular as most applications are not dominated by computations for which good special purpose devices can be designed |
If only 10% of a problem is say "track-finding" or some special purpose processing, then who cares if you reduce that 10% by a factor of 100
|
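To make this concrete (arithmetic added here, not on the original foil): if the 10% special purpose fraction is sped up by a factor of 100, the overall speedup is only |
1 / (0.9 + 0.1/100) = 1 / 0.901 ≈ 1.11 |
i.e. about an 11% improvement -- the usual Amdahl's Law argument against special purpose hardware for a small fraction of the work |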
N body problems (e.g. Newton's laws for one million stars in a globular cluster) can have successful special purpose devices |
See GRAPE (GRAvity PipE) machine (Sugimoto et al. Nature 345 page 90,1990)
|
Note GRAPE uses EXACTLY same parallel algorithm that one finds in the books (e.g. Solving Problems on Concurrent Processors) for N-body problems on classic distributed memory MIMD machines |
GRAPE will execute the classic O(N^2) (parallel) N body algorithm BUT this is not the algorithm used in most such computations |
Rather there is the O(N) or O(N log N) so-called "fast multipole" algorithm, which uses a hierarchical approach
|
So special purpose devices cannot usually take advantage of new nifty algorithms! |
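For the one-million-star example the gap is dramatic (arithmetic added here, not on the original foil): |
O(N^2): N^2 = (10^6)^2 = 10^12 interactions per time step |
O(N log N): N log2 N ≈ 10^6 x 20 = 2x10^7 interactions per time step |
a factor of roughly 50,000 -- far more than the speedup a special purpose pipeline typically provides |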
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors.
|
Medium-grain: Tens to few thousands of pieces, typically executed by microprocessors.
|
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines.
|
Note that a machine of given granularity can be used on algorithms of the same or finer granularity |
Applications that require petaFLOPS can already be identified
|
The need for ever greater computing power will remain. |
PetaFLOPS systems are right step for the next decade |
Nuclear Weapons Stewardship (ASCI) |
Cryptology and Digital Signal Processing |
Satellite Data Analysis |
Climate and Environmental Modeling |
3-D Protein Molecule Reconstruction |
Real-Time Medical Imaging |
Severe Storm Forecasting |
Design of Advanced Aircraft |
DNA Sequence Matching |
Molecular Simulations for nanotechnology |
Large Scale Economic Modelling |
Intelligent Planetary Spacecraft |
Why does one need a petaflop (10^15 operations per second) computer? |
These are problems where quite viscous (oil, pollutants) liquids percolate through the ground |
Very sensitive to details of material |
Most important problems are already solved at some level, but most solutions are insufficient and need improvement in various respects:
|
Oil Reservoir Simulation |
Geological variation occurs down to the pore size of rock -- almost 10^-6 metres -- model this (statistically) |
Want to calculate flow between wells which are about 400 metres apart |
10^3 x 10^3 x 10^2 = 10^8 grid elements |
30 species |
10^4 time steps |
300 separate cases need to be considered |
3x10^9 words of memory per case |
10^12 words total if all cases considered in parallel |
10^19 floating point operations |
3 hours on a petaflop computer |
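These numbers are consistent (arithmetic added here as a check): |
10^19 floating point operations / 10^15 operations per second = 10^4 seconds ≈ 3 hours |
3x10^9 words per case x 300 cases ≈ 10^12 words if all cases are run in parallel |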
Conventional (Distributed Shared Memory) Silicon
|
Note Memory per Flop is much less than one to one |
Natural scaling says time steps decrease at the same rate as spatial intervals, and so the memory needed goes like (FLOPS in Gigaflops)^0.75
|
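A sketch of the reasoning behind the 0.75 exponent (added here, not spelled out on the foil): for a 3D grid of linear size N, memory grows like the N^3 grid points while the operation count grows like N^3 points x N time steps = N^4, so |
Memory ~ N^3 = (N^4)^(3/4) ~ (floating point operations)^0.75 |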
Superconducting Technology is promising but can it compete with silicon juggernaut? |
Should be able to build a simple 200 GHz Superconducting CPU with modest superconducting caches (around 32 Kilobytes) |
Must use same DRAM technology as for silicon CPU ? |
So tremendous challenge to build latency tolerant algorithms (as over a factor of 100 difference in CPU and memory speed) but advantage of factor 30-100 less parallelism needed |
Processor in Memory (PIM) Architecture is a follow-on to the J-Machine (MIT), Execube (IBM -- Peter Kogge) and Mosaic (Seitz)
|
In the year 2007 one could take each two-gigabyte memory chip and alternatively build it as a mosaic of
|
12000 chips (Same amount of Silicon as in first design but perhaps more power) gives:
|
Performance data from uP vendors |
Transistor count excludes on-chip caches |
Performance normalized by clock rate |
Conclusion: Simplest is best! (250K Transistor CPU) |
Figure: Normalized SPECINTs and SPECFLTs plotted against millions of transistors in the CPU
Fixing 10-20 Terabytes of Memory, we can get |
16000-way parallel natural evolution of today's machines with various architectures from distributed shared memory to clustered hierarchy
|
5000 way parallel Superconducting system with 1 Petaflop performance but terrible imbalance between CPU and memory speeds |
12 million way parallel PIM system with 12 petaflop performance and "distributed memory architecture", as off-chip access will have serious penalties |
There are many hybrid and intermediate choices -- these are extreme examples of "pure" architectures |
Chip          First Silicon   Peak            Storage    MB/Perf.              Organization
EXECUBE       1993            50 Mips         0.5 MB     0.01 MB/Mip           16-bit SIMD/MIMD CMOS
AD SHARC      1994            120 Mflops      0.5 MB     0.005 MB/MF           Single CPU and Memory
TI MVP        1994            2000 Mops       0.05 MB    0.000025 MB/Mop       1 CPU, 4 DSP's
MIT MAP       1996            800 Mflops      0.128 MB   0.00016 MB/MF         4 Superscalar CPU's
Terasys PIM   1993            625 M bit ops   0.016 MB   0.000026 MB/bit op    1024 16-bit ALU's
Figure: MB per cm^2, MF per cm^2 and MB/MF ratios for these chips
The last major architectural feature of a parallel machine is the network or design of hardware/software connecting processors and memories together. |
Bus: All processors (and memory) connected to a common bus or busses.
|
Switching Network: Processors (and memory) connected to routing switches like in telephone system.
|
Figure: processors and memories connected through a switch and through a shared bus
Two dimensional grid, Binary tree, complete interconnect and 4D Hypercube. |
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are only partial |
Useful terms include: |
Scalability: Can network be extended to very large systems? Related to wire length (synchronization and driving problems), degree (pinout) |
Fault Tolerance: How easily can system bypass faulty processor, memory, switch, or link? How much of system is lost by fault? |
Blocking: Some communication requests may not get through, due to conflicts caused by other requests. |
Nonblocking: All communication requests succeed. Sometimes just applies as long as no two requests are for same memory cell or processor. |
Latency (delay): Maximal time for nonblocked request to be transmitted. |
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on time needed for task. |
Wormhole Routing -- Intermediate switch nodes do not wait for the full message but allow it to pass through in small packets |
Transmission Time for message of n bytes: |
T0 + T1 n where |
T0 is the latency; it contains a term proportional to the number of hops. It also has a term representing the interrupt processing time at the beginning and the time for the communication network and processor to synchronize |
T0 = TS + Td x (Number of hops) |
T1 is the inverse bandwidth -- it can be made small if pipe is large size. |
In practice TS and T1 are most important and Td is unimportant |
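A small sketch in C of this model (added here; the numbers in the closing comment are purely illustrative, not measurements from these foils): |

/* Transmission time for an n-byte message: T(n) = T0 + T1*n, with T0 = TS + Td*hops
   TS = startup term (interrupt processing and synchronization at each end)
   Td = per-hop delay through intermediate switches
   T1 = inverse bandwidth in seconds per byte */
double message_time(double TS, double Td, int hops, double T1, double nbytes)
{
    double T0 = TS + Td * hops;
    return T0 + T1 * nbytes;
}
/* For example, TS = 50e-6 s, Td = 0.5e-6 s, 10 hops and T1 = 1/(100e6) s/byte
   give about 55 microseconds of latency, after which the bandwidth term T1*nbytes
   dominates for large messages */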
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
Square blocks indicate shared memory copy performance |
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
From Aspects of Computational Science, Editor Aad van der Steen, published by NCF |
System                  Communication Speed (Mbytes/sec)   Computation Speed (Mflops) |
IBM SP2                        40                                 267 |
Intel iPSC860                   2.8                                60 |
Intel Paragon                 200                                  75 |
Kendall Square KSR-1           17.1                                40 |
Meiko CS-2                    100                                 200 |
Parsytec GC                    20                                  25 |
TMC CM-5                       20                                 128 |
Cray T3D                      150                                 300 |
tcomm = 4 or 8 / (Communication Speed in Mbytes per sec) -- 4 bytes per 32-bit word, 8 bytes per 64-bit word
|
tfloat = 1 / (Computation Speed in Mflops) |
Thus tcomm / tfloat is just 4 X Computation Speed divided by Communication speed |
tcomm / tfloat is 26.7, 85, 1.5, 9.35, 8, 5, 25.6, 8 for the machines SP2, iPSC860, Paragon, KSR-1, Meiko CS2, Parsytec GC, TMC CM5, and Cray T3D respectively |
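A short C sketch (added here) that reproduces these ratios, up to rounding, from the table above using tcomm/tfloat = 4 x Computation Speed / Communication Speed: |

#include <stdio.h>

int main(void)
{
    const char  *name[] = { "SP2", "iPSC860", "Paragon", "KSR-1",
                            "Meiko CS2", "Parsytec GC", "TMC CM5", "Cray T3D" };
    const double comm[] = {  40, 2.8, 200, 17.1, 100, 20,  20, 150 };  /* Mbytes/sec */
    const double flop[] = { 267,  60,  75,   40, 200, 25, 128, 300 };  /* Mflops     */
    int i;
    for (i = 0; i < 8; i++)   /* 4 bytes communicated per 32-bit word */
        printf("%-12s  tcomm/tfloat = %.1f\n", name[i], 4.0 * flop[i] / comm[i]);
    return 0;
}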
Latency makes the situation worse for small messages, and the ratios double for the 64-bit arithmetic that is natural on large problems! |
See Original Foil |
See Original Foil |
See Original Foil |