Full HTML for Scripted foilset CPS615 - Overview of Computer Architectures

Given by Geoffrey C. Fox in the CPS615 Basic Simulation Track for Computational Science, Fall Semester 97. Foils prepared 19 September 1997
Outside Index Summary of Material


We describe the simple nature of parallel machines like the nCUBE2
Technologies include conventional, superconducting and quantum systems, but the former dominates
Sequential architectures are dominated by memory (cache) issues
Vector, SIMD and MIMD are the classic architectures
There is a growing interest in distributed metacomputers
Special purpose machines are typically not so successful
The details of networks are not as important as they used to be!

Table of Contents for full HTML of CPS615 - Overview of Computer Architectures

1 Introduction to Computer Architectures CPS615 Fall Semester 97
2 Abstract of CPS615 Architecture Discussion
3 Single nCUBE2 CPU Chip
4 64 Node nCUBE Board
5 Technologies for High Performance Computers
6 Architectures for High Performance Computers - I
7 Architectures for High Performance Computers - II
8 There is no Best Machine!
9 Quantum Computing - I
10 Quantum Computing - II
11 Quantum Computing - III
12 Superconducting Technology -- Past
13 Superconducting Technology -- Present
14 Superconducting Technology -- Problems
15 Ames Summer 97 Workshop on Device Technology -- Moore's Law - I
16 Ames Summer 97 Workshop on Device Technology -- Moore's Law - II
17 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies I
18 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies II
19 Ames Summer 97 Workshop on Device Technology -- New Logic Concepts
20 Ames Summer 97 Workshop on Device Technology -- RSFQ
21 Ames Summer 97 Workshop on Device Technology -- QCA -I
22 Ames Summer 97 Workshop on Device Technology -- QCA -II
23 Sequential Memory Structure
24 Architecture Classes of High Performance Computers
25 von Neumann Architecture in a Nutshell
26 Illustration of Importance of Cache
27 Vector Supercomputers in a Nutshell - I
28 Vector Supercomputing in a picture
29 Vector Supercomputers in a Nutshell - II
30 What is a Pipeline -- Cafeteria Analogy?
31 Instruction Flow in A Simple Machine Pipeline
32 Example of MIPS R4000 Floating Point
33 MIPS R4000 Floating Point Stages
34 Flynn's Classification of HPC Systems
35 Parallel Computer Architecture Memory Structure
36 Comparison of Memory Access Strategies
37 Types of Parallel Memory Architectures -- Physical Characteristics
38 Diagrams of Shared and Distributed Memories
39 Parallel Computer Architecture Control Structure
40 Some Major Hardware Architectures - MIMD
41 MIMD Distributed Memory Architecture
42 Cache Coherent or Not?
43 Choices in Cache Coherence
44 Some Major Hardware Architectures - SIMD
45 SIMD (Single Instruction Multiple Data) Architecture
46 Some Major Hardware Architectures - Mixed
47 Some MetaComputer Systems
48 Comments on Special Purpose Devices
49 The GRAPE N-Body Machine
50 Why isn't GRAPE a Perfect Solution?
51 Granularity of Parallel Components - I
52 Granularity of Parallel Components - II
53 III. Key drivers: The Need for PetaFLOPS Computing
54 10 Possible PetaFlop Applications
55 Petaflop Performance for Flow in Porous Media?
56 Target Flow in Porous Media Problem (Glimm - Petaflop Workshop)
57 NASA's Projection of Memory and Computational Requirements up to Petaflops for Aerospace Applications
58 Supercomputer Architectures in Years 2005-2010 -- I
59 Supercomputer Architectures in Years 2005-2010 -- II
60 Supercomputer Architectures in Years 2005-2010 -- III
61 Performance Per Transistor
62 Comparison of Supercomputer Architectures
63 Current PIM Chips
64 New "Strawman" PIM Processing Node Macro
65 "Strawman" Chip Floorplan
66 SIA-Based PIM Chip Projections
67 Classes of Communication Networks
68 Switch and Bus based Architectures
69 Examples of Interconnection Topologies
70 Useful Concepts in Communication Systems
71 Latency and Bandwidth of a Network
72 Transfer Time in Microseconds for both Shared Memory Operations and Explicit Message Passing
73 Latency/Bandwidth Space for 0-byte message(Latency) and 1 MB message(bandwidth).
74 Communication Performance of Some MPP's
75 Implication of Hardware Performance
76 Netlib Benchweb Benchmarks
77 Linpack Benchmarks
78 Java Linpack Benchmarks




HTML version of Scripted Foils prepared 19 September 1997

Foil 1 Introduction to Computer Architectures CPS615 Fall Semester 97

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
http://www.npac.syr.edu/users/gcf/cps615arch97
Geoffrey Fox
Syracuse University NPAC
111 College Place Syracuse NY 13244 4100
(315) 443-2163

HTML version of Scripted Foils prepared 19 September 1997

Foil 2 Abstract of CPS615 Architecture Discussion

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
We describe the simple nature of parallel machines like the nCUBE2
Technologies include conventional, superconducting and quantum systems, but the former dominates
Sequential architectures are dominated by memory (cache) issues
Vector, SIMD and MIMD are the classic architectures
There is a growing interest in distributed metacomputers
Special purpose machines are typically not so successful
The details of networks are not as important as they used to be!

HTML version of Scripted Foils prepared 19 September 1997

Foil 3 Single nCUBE2 CPU Chip

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index Secs 18

HTML version of Scripted Foils prepared 19 September 1997

Foil 4 64 Node nCUBE Board

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index Secs 34
Each node is a CPU plus 6 memory chips -- the CPU chip integrates communication channels with the floating point, integer and logical CPU functions

HTML version of Scripted Foils prepared 19 September 1997

Foil 5 Technologies for High Performance Computers

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
We can choose technology and architecture separately in designing our high performance system
Technology is like choosing ants, people or tanks as the basic units in our society analogy
  • or less frivolously neurons or brains
In HPCC arena, we can distinguish current technologies
  • COTS (Consumer off the shelf) Microprocessors
  • Custom node computer architectures
  • More generally these are all CMOS technologies
Near term technology choices include
  • Gallium Arsenide or Superconducting materials as opposed to Silicon
  • These are faster by a factor of 2 (GaAs) to 300 (Superconducting)
Further term technology choices include
  • DNA (Chemical) or Quantum technologies
The next round of industry investment in CMOS plants will cost some $40 billion, and this huge investment makes it hard for new technologies to "break in"

HTML version of Scripted Foils prepared 19 September 1997

Foil 6 Architectures for High Performance Computers - I

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Architecture is equivalent to organization or design in the society analogy
  • Different models for society (Capitalism etc.) or different types of groupings in a given society
  • Businesses or Armies are more precisely controlled/organized than a crowd at the State Fair
  • We will generalize this to formal (army) and informal (crowds) organizations
We can distinguish formal and informal parallel computers
Informal parallel computers are typically "metacomputers"
  • i.e. a bunch of computers sitting on a department network

HTML version of Scripted Foils prepared 19 September 1997

Foil 7 Architectures for High Performance Computers - II

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Metacomputers are a very important trend; they use similar software and algorithms to conventional "MPP's" but typically have less well optimized parameters
  • In particular network latency is higher and bandwidth is lower for an informal HPC
  • Latency is time for zero length communication -- start up time
Formal high performance computers are the classic (basic) object of study and are
"closely coupled" specially designed collections of compute nodes which have (in principle) been carefully optimized and balanced in the areas of
  • Processor (computer) nodes
  • Communication (internal) Network
  • Linkage of Memory and Processors
  • I/O (external network) capabilities
  • Overall Control or Synchronization Structure

HTML version of Scripted Foils prepared 19 September 1997

Foil 8 There is no Best Machine!

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
In society, we see a rich set of technologies and architectures
  • Ant Hills
  • Brains as bunch of neurons
  • Cities as informal bunch of people
  • Armies as formal collections of people
With several different communication mechanisms with different trade-offs
  • One can walk -- low latency, low bandwidth
  • Go by car -- high latency (especially if can't park), reasonable bandwidth
  • Go by air -- higher latency and bandwidth than car
  • Phone -- High speed at long distance but can only communicate modest material (low capacity)

HTML version of Scripted Foils prepared 19 September 1997

Foil 9 Quantum Computing - I

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Quantum-Mechanical Computers by Seth Lloyd, Scientific American, Oct 95
Chapter 6 of The Feynman Lectures on Computation edited by Tony Hey and Robin Allen, Addison-Wesley, 1996
Quantum Computing: Dream or Nightmare? Haroche and Raimond, Physics Today, August 96 page 51
Basically any physical system can "compute", as one "just" needs a system whose answers depend on its inputs, and all physical systems have this property
Thus one can build "superconducting", "DNA" or "quantum" computers exploiting respectively superconducting, molecular or quantum mechanical rules

HTML version of Scripted Foils prepared 19 September 1997

Foil 10 Quantum Computing - II

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
For a "new technology" computer to be useful, one needs to be able to
  • conveniently prepare inputs,
  • conveniently program,
  • reliably produce answer (quicker than other techniques), and
  • conveniently read out answer
Conventional computers are built around the manipulation of bits (taking values 0 or 1)
One can build arbitrarily complex arithmetic if one has some way of implementing NOT and AND (illustrated by the sketch at the end of this foil)
Quantum Systems naturally represent bits
  • A spin (of say an electron or proton) is either up or down
  • A hydrogen atom is either in lowest or (first) excited state etc.
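To make the NOT/AND point concrete, here is a minimal sketch in C (purely illustrative, not from the original foils): OR and XOR are derived from NOT and AND via De Morgan's law, a one-bit full adder is built from those gates, and a ripple-carry loop then adds two small integers.

#include <stdio.h>

/* NOT and AND on single bits (0 or 1) -- the only primitives we allow */
static int NOT(int a)        { return a ? 0 : 1; }
static int AND(int a, int b) { return (a && b) ? 1 : 0; }

/* OR and XOR derived from NOT and AND (De Morgan's law) */
static int OR (int a, int b) { return NOT(AND(NOT(a), NOT(b))); }
static int XOR(int a, int b) { return AND(OR(a, b), NOT(AND(a, b))); }

/* One-bit full adder: sum and carry-out from a, b and carry-in */
static void full_adder(int a, int b, int cin, int *sum, int *cout)
{
    *sum  = XOR(XOR(a, b), cin);
    *cout = OR(AND(a, b), AND(cin, XOR(a, b)));
}

int main(void)
{
    /* Ripple-carry addition of two 4-bit numbers, e.g. 6 + 7 = 13 */
    int a = 6, b = 7, carry = 0, result = 0;
    for (int i = 0; i < 4; i++) {
        int s;
        full_adder((a >> i) & 1, (b >> i) & 1, carry, &s, &carry);
        result |= s << i;
    }
    printf("%d + %d = %d (carry %d)\n", a, b, result, carry);
    return 0;
}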

HTML version of Scripted Foils prepared 19 September 1997

Foil 11 Quantum Computing - III

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Interactions between quantum systems can cause "spin-flips" or state transitions and so implement arithmetic
Incident photons can "read" state of system and so give I/O capabilities
Quantum "bits" called qubits have another property as one has not only
  • State |0> and state |1> but also
  • Coherent states such as 0.7071*(|0> + |1>), which are an equal superposition of the two states
Lloyd describes how such coherent states provide new types of computing capabilities
  • A natural random number source, as measuring the state of a qubit gives the answer 0 or 1 randomly with equal probability (see the toy sketch at the end of this foil)
  • As Feynman suggests, qubit based computers are natural for large scale simulation of quantum physical systems -- this is "just" analog computing
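A toy sketch in C of the "natural random number" point (this is ordinary pseudo-random simulation, not a real quantum computation): measuring the coherent state (|0> + |1>)/sqrt(2) yields 0 or 1 with equal probability, so repeatedly reading out such a qubit behaves like a random bit source.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    srand((unsigned)time(NULL));
    int counts[2] = {0, 0};
    for (int i = 0; i < 100000; i++) {
        /* Born rule: probability |amplitude|^2 = 0.5 for each outcome */
        int outcome = (rand() < RAND_MAX / 2) ? 0 : 1;
        counts[outcome]++;
    }
    printf("measured |0>: %d times, |1>: %d times\n", counts[0], counts[1]);
    return 0;
}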

HTML version of Scripted Foils prepared 19 September 1997

Foil 12 Superconducting Technology -- Past

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Superconductors produce wonderful "wires" which transmit picosecond (10^-12 seconds) pulses at near speed of light
  • Superconducting transmission is lower power and faster than diffusive electron transmission in CMOS
  • At about 0.35 micron chip feature size, CMOS transmission time changes from being dominated by transmission (distance) effects to resistive (diffusive) effects
Niobium used in constructing such superconducting circuits can be processed by similar fabrication techniques to CMOS
Josephson Junctions allow picosecond performance switches
BUT IBM (1969-1983) and Japan (MITI 1981-90) terminated major efforts in this area

HTML version of Scripted Foils prepared 19 September 1997

Foil 13 Superconducting Technology -- Present

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
New ideas have resurrected this concept using RSFQ -- Rapid Single Flux Quantum -- approach
This naturally gives a bit which is 0 or 1 (or in fact n units!)
This gives interesting circuits of similar structure to CMOS systems but with a clock speed of order 100-300GHz -- factor of 100 better than CMOS which will asymptote at around 1 GHz (= one nanosecond cycle time)

HTML version of Scripted Foils prepared 19 September 1997

Foil 14 Superconducting Technology -- Problems

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
At least two major problems:
Semiconductor industry will invest some $40B in CMOS "plants" and infrastructure
  • Currently perhaps $100M a year going into superconducting circuit area!
  • How do we "bootstrap" superconducting industry?
Cannot build memory to match CPU speed and current designs have superconducting CPU's (with perhaps 256 Kbytes superconducting memory per processor) but conventional CMOS memory
  • So compared with current computers one would have a thousand times faster CPU, a cache (running at CPU speed) that is a factor of four smaller, and basic memory of the same speed as now
  • Can such machines perform well -- need new algorithms?
  • Can one design new superconducting memories?
Superconducting technology also has a bad "name" due to IBM termination!

HTML version of Scripted Foils prepared 19 September 1997

Foil 15 Ames Summer 97 Workshop on Device Technology -- Moore's Law - I

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
These foils come from summary by David Bailey
First of all, there seems to be general agreement that Moore's Law will not stop anytime soon.
The current state of the art in the semiconductor industry is 250 nm, as recently announced by Intel among others. Researchers are confident that current approaches can be used for at least another three generations.
In the years ahead we may even see some manufacturers skip a generation, proceeding directly to significantly smaller feature sizes. This means that the 100 nm technology wall will be reached earlier than previously anticipated.

HTML version of Scripted Foils prepared 19 September 1997

Foil 16 Ames Summer 97 Workshop on Device Technology -- Moore's Law - II

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Below about 100 nm feature sizes, progress will be more difficult, for various well known reasons, among them the lack of any substance that can be used as a lens for photolithography at such wavelengths.
Nonetheless, there are a number of "tricks" in the works, including the usage of diffraction gratings, parallel exposures, and others.
Groups working on these "ultimate" silicon techniques include Frances Houle's group at IBM Almaden and Jeff Baker's group at UC Berkeley.
Also helping will be some improvements such as better materials and increased wafer sizes. The consensus is that these techniques will be good for another two generations, to about 50 nm. Thus Moore's Law should continue until about 2010 or 2012.

HTML version of Scripted Foils prepared 19 September 1997

Foil 17 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies I

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Below about 50 nm feature sizes, it appears that a completely new device fabrication technology is needed.
Progress in electron beams and X-rays, which were the leading candidates of a few years ago, has stalled.
  • No one yet has any good idea how to scale e-beam lithography to economical production, and X-rays seem to destroy everything in their path, including the masks and devices. If something does eventually materialize along these lines, it won't be easy.
The researchers Bailey spoke with nonetheless agree that devices will remain electronic for the foreseeable future.
As Stan Williams of HP points out, electrons are the ideal basis for device technology because they have basically zero size and mass, can travel at large fractions of the speed of light and interact strongly with matter.

HTML version of Scripted Foils prepared 19 September 1997

Foil 18 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies II

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
In contrast, DNA computing and the like are still curiosities with no clear path to practical high-speed computing.
One exciting development, which has emerged just in the past year or two, is nanotubes, i.e. cylindrical buckyballs.
It was recently discovered that a nanotube can be "tuned" from insulator to semiconductor to conductor just by changing the pitch of the helical structure of carbon atoms.
Kinks introduced in a tube can be used to form a conductor-semiconductor junction.
  • Molecular modeling is being used to further explore some of the possibilities.

HTML version of Scripted Foils prepared 19 September 1997

Foil 19 Ames Summer 97 Workshop on Device Technology -- New Logic Concepts

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
More than one person with whom Bailey talked emphasized that future research should not focus exclusively on fabrication technology.
An equally important question is what is worth fabricating at these small feature sizes. This is because conventional transistor-based logic appears increasingly troublesome below about 50 nm.
Thus a new paradigm in logic design is as much needed as a new paradigm in fabrication technology.

HTML version of Scripted Foils prepared 19 September 1997

Foil 20 Ames Summer 97 Workshop on Device Technology -- RSFQ

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Several at the meeting were aware of Likharev's rapid single flux quantum (RSFQ) superconducting logic. They agree that it is a very interesting development, and further that it is fairly certain to work well.
However, they feel that the requirement for liquid helium temperatures is a severe impediment to widespread utilization, and without commodity production it is unlikely to be economic.
  • See later for details and other difficulties!
According to Stan Williams, Likharev recently came to Hewlett Packard.
He offered HP all of his patents, asking only that they develop the technology.
But HP management still declined. Williams quipped that if some company decided to produce RSFQ technology, somehow assembling the $100 million or so required in start-up capital, the day after they produced their first devices it would be pronounced a great success, but on the second day the company would fold financially.

HTML version of Scripted Foils prepared 19 September 1997

Foil 21 Ames Summer 97 Workshop on Device Technology -- QCA -I

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
One interesting possibility for a future logic system is "quantum dot cellular automata", abbreviated QCA.
Wolfgang Porod of the University of Notre Dame gave a talk on these devices at Ames workshop. His group has proposed configurations of cells, each of which is a collection of five electron sites arranged in an X, with two free electrons within the cell.
According to their studies, configurations of these cells can propagate signals, invert signals, form majority vote logic, and thus can be formed to produce practical devices.
The devices are semi-classical, semi-quantum, in that interactions within a cell can involve quantum tunneling, but interactions between cells involve only classical electromagnetic effects.

HTML version of Scripted Foils prepared 19 September 1997

Foil 22 Ames Summer 97 Workshop on Device Technology -- QCA -II

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Like RSFQ, initial demonstration devices of QCA logic will need to be super-cooled, assuming fabrication feature sizes in the tens of nanometers.
But unlike RSFQ (from what we know of RSFQ), the smaller the fabrication dimension, the higher the tolerable temperature, according to thermodynamic analyses.
Indeed, if one could eventually fabricate QCA devices in the realm of 2-5 nanometers, then they could operate at room temperature!

HTML version of Scripted Foils prepared 19 September 1997

Foil 23 Sequential Memory Structure

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Data locality implies the CPU finds the information it needs in the cache, which stores the most recently accessed information
This means one reuses a given memory reference in many nearby computations, e.g.
A1 = B*C
A2 = B*D + B*B
both statements reuse B
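A hedged illustration of the locality point (a sketch in C, not from the original foils): summing a two-dimensional array in row order reuses each fetched cache line, while summing it in column order does not, and the second loop is typically several times slower.

#include <stdio.h>
#include <time.h>

#define N 2000
static double a[N][N];

int main(void)
{
    double sum = 0.0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)          /* row order: consecutive addresses, good locality */
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    clock_t t1 = clock();
    for (int j = 0; j < N; j++)          /* column order: each access jumps N*8 bytes, poor locality */
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    clock_t t2 = clock();
    printf("row order %.3fs, column order %.3fs (sum %.1f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    return 0;
}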
[Figure: the memory hierarchy -- registers and caches, L3 cache, main memory, disk -- with memory capacity increasing and memory speed decreasing as one moves away from the CPU (roughly a factor of 100 difference between processor and main memory speed)]

HTML version of Scripted Foils prepared 19 September 1997

Foil 24 Architecture Classes of High Performance Computers

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Sequential or von Neumann Architecture
Vector (Super)computers
Parallel Computers
  • with various architectures classified by Flynn's methodology (this is incomplete as it only discusses the control or synchronization structure)
  • SISD
  • MISD
  • MIMD
  • SIMD
  • Metacomputers

HTML version of Scripted Foils prepared 19 September 1997

Foil 25 von Neumann Architecture in a Nutshell

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Instructions and data are stored in the same memory, for which there is a single link (the von Neumann Bottleneck) to the CPU which decodes and executes instructions
The CPU can have multiple functional units
The memory access can be enhanced by use of caches made from faster memory to allow greater bandwidth and lower latency

HTML version of Scripted Foils prepared 19 September 1997

Foil 26 Illustration of Importance of Cache

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Fig 1.14 of Aspects of Computational Science, edited by Aad van der Steen, published by NCF

HTML version of Scripted Foils prepared 19 September 1997

Foil 27 Vector Supercomputers in a Nutshell - I

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
This design enhances performance by noting that many applications calculate "vector-like" operations
  • Such as c(i)=a(i)+b(i) for i=1...N and N quite large
This allows one to address two performance problems
  • Latency in accessing memory (e.g. it could take 10-20 clock cycles between requesting a particular memory location and delivery of the result to the CPU)
  • A complex operation, e.g. a floating point operation, can take a few machine cycles to complete
They are typified by the Cray 1, XMP, YMP and C-90, the CDC-205 and ETA-10, and Japanese supercomputers from NEC, Fujitsu and Hitachi

HTML version of Scripted Foils prepared 19 September 1997

Foil 28 Vector Supercomputing in a picture

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
A pipeline for vector addition looks like:
  • From Aspects of Computational Science -- Editor Aad van der Steen published by NCF

HTML version of Scripted Foils prepared 19 September 1997

Foil 29 Vector Supercomputers in a Nutshell - II

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Vector machines pipeline data through the CPU
They are not so popular/relevant as in the past as
  • Improved CPU architecture needs fewer cycles than before for each (complex) operation (e.g. 4 now, not ~100 as in the past)
  • The 8 MHz 8087 of the Cosmic Cube took 160 to 400 clock cycles to do a full floating point operation in 1983
  • Applications need more flexible pipelines which allow different operations to be executed on consecutive operands as they stream through the CPU
  • Modern RISC processors (superscalar) can support such complex pipelines as they have far more logic than CPUs of the past
In fact the excellence of, say, the Cray C-90 is due to its very good memory architecture, which allows one to fetch enough operands to sustain the pipeline.
Most workstation class machines have "good" CPU's but can never get enough data from memory to sustain good performance except for a few cache intensive applications

HTML version of Scripted Foils prepared 19 September 1997

Foil 30 What is a Pipeline -- Cafeteria Analogy?

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Familiar from such everyday activities as getting food in a cafeteria, where one processes one person per "clock cycle", where
the clock cycle here is the maximum time anybody takes at a single "stage", a stage being one component of the meal (salad, entrée etc.)
Note any one person takes about 5 clock cycles in this pipeline, but the pipeline processes one person per clock cycle
The pipeline has a problem if there is a "stall" -- here, that one of the people wants an entrée which needs to be fetched from the kitchen. This delays everybody!
In the computer case, a stall is typically caused by data not being ready for a particular instruction
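A toy timing model of this analogy (a sketch in C, with made-up numbers): an s-stage pipeline completes n items in roughly s + (n - 1) cycles plus one cycle per stall, instead of the s*n cycles needed without pipelining.

#include <stdio.h>

/* s stages, one new item can enter per clock cycle,
   and every stall delays the whole line by one cycle. */
static long pipelined_cycles(long n_items, int stages, long stalls)
{
    return stages + (n_items - 1) + stalls;   /* fill + one per item + stalls */
}

int main(void)
{
    long n = 1000;
    int  s = 5;
    printf("unpipelined: %ld cycles\n", n * s);
    printf("pipelined, no stalls: %ld cycles\n", pipelined_cycles(n, s, 0));
    printf("pipelined, 100 stalls: %ld cycles\n", pipelined_cycles(n, s, 100));
    return 0;
}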

HTML version of Scripted Foils prepared 19 September 1997

Foil 31 Instruction Flow in A Simple Machine Pipeline

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Three Instructions are shown overlapped -- each starting one clock cycle after last

HTML version of Scripted Foils prepared 19 September 1997

Foil 32 Example of MIPS R4000 Floating Point

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Taken from David Patterson CS252 Berkeley Fall 1996
3 Functional Units FP Adder, FP Multiplier, FP Divider
8 Kinds of Stages in FP Units
Stage   Functional Unit   Description
A       FP Adder          Mantissa ADD stage
D       FP Divider        Divide pipeline stage
E       FP Multiplier     Exception test stage
M       FP Multiplier     First stage of multiplier
N       FP Multiplier     Second stage of multiplier
R       FP Adder          Rounding stage
S       FP Adder          Operand shift stage
U       --                Unpack floating point numbers

HTML version of Scripted Foils prepared 19 September 1997

Foil 33 MIPS R4000 Floating Point Stages

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Several different pipelines with different lengths!
Add, Subtract -- 4 clocks: U S+A A+R R+S
Multiply -- 8 clocks: U E+M M M M N+A R
Divide -- 38 clocks: U A R D(28) D+A D+R D+R D+A D+R A R
Square Root -- 110 clocks: U E (A+R)(108) A R
Negate -- 2 clocks: U S
Absolute Value -- 2 clocks: U S
Floating Point Compare -- 3 clocks: U A R

HTML version of Scripted Foils prepared 19 September 1997

Foil 34 Flynn's Classification of HPC Systems

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
"Very high speed computing systems", Proc. of the IEEE 54, 12, p. 1901-1909 (1966) and
"Some Computer Organizations and their Effectiveness", IEEE Trans. on Computers C-21, 948-960 (1972) -- both papers by M.J. Flynn
SISD -- Single Instruction stream, Single Data Stream -- i.e. von Neumann Architecture
MISD -- Multiple Instruction stream, Single Data Stream -- Not interesting
SIMD -- Single Instruction stream, Multiple Data Stream
MIMD -- Multiple Instruction stream and Multiple Data Stream -- dominant parallel system with ~one to ~one match of instruction and data streams.

HTML version of Scripted Foils prepared 19 September 1997

Foil 35 Parallel Computer Architecture Memory Structure

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Memory Structure of Parallel Machines
  • Distributed
  • Shared
  • Cached
and Heterogeneous mixtures
Shared (Global): There is a global memory space, accessible by all processors.
  • Processors may also have some local memory.
  • Algorithms may use global data structures efficiently.
  • However "distributed memory" algorithms may still be important as memory is NUMA (Nonuniform access times)
Distributed (Local, Message-Passing): All memory is associated with processors.
  • To retrieve information from another processor's memory a message must be sent there.
  • Algorithms should use distributed data structures.

HTML version of Scripted Foils prepared 19 September 1997

Foil 36 Comparison of Memory Access Strategies

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Memory can be accessed directly (analogous to a phone call) as in red lines below or indirectly by message passing (green line below)
We show two processors in a MIMD machine for distributed (left) or shared(right) memory architectures

HTML version of Scripted Foils prepared 19 September 1997

Foil 37 Types of Parallel Memory Architectures -- Physical Characteristics

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Uniform: All processors take the same time to reach all memory locations.
Nonuniform (NUMA): Memory access is not uniform, so that a given processor takes a different time to get data from each memory bank. This is natural for distributed memory machines but is also true in most modern shared memory machines
  • DASH (Hennessy at Stanford) is the best known example of such a virtual shared memory machine, which is logically shared but physically distributed.
  • ALEWIFE from MIT is a similar project
  • TERA (from Burton Smith) is a uniform memory access, logically shared memory machine
Most NUMA machines these days have two memory access times
  • Local memory (divided into registers, caches etc.) and
  • Nonlocal memory, with little or no difference in access time between different nonlocal memories
This simple two level memory access model gets more complicated in proposed 10 year out "petaflop" designs

HTML version of Scripted Foils prepared 19 September 1997

Foil 38 Diagrams of Shared and Distributed Memories

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index

HTML version of Scripted Foils prepared 19 September 1997

Foil 39 Parallel Computer Architecture Control Structure

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
SIMD -- lockstep synchronization
  • Each processor executes the same instruction stream
MIMD -- each processor executes an independent instruction stream
MIMD synchronization can take several forms
  • Simplest: program controlled message passing
  • "Flags" (barriers, semaphores) in memory -- the typical shared memory construct, as in the locks seen in Java Threads (see the sketch at the end of this foil)
  • Special hardware -- as in cache and its coherency (coordination between nodes)
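As a hedged illustration of the "flags in memory" style of MIMD synchronization (a sketch in C with POSIX threads standing in for processors; not part of the original foils), several independent instruction streams update one shared counter under a lock:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* the shared-memory "flag" */
        counter++;                    /* critical section         */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * 100000);
    return 0;
}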

HTML version of Scripted Foils prepared 19 September 1997

Foil 40 Some Major Hardware Architectures - MIMD

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
MIMD Distributed Memory
  • This is now best illustrated by a collection of computers on a network (i.e. a metacomputer)
MIMD with logically shared memory but usually physically distributed. The latter is sometimes called distributed shared memory.
  • In near future, ALL formal (closely coupled) MPP's will be distributed shared memory
  • Note all computers (e.g. current MIMD distributed memory IBM SP2) allow any node to get at any memory but this is done indirectly -- you send a message
  • In future "closely-coupled" machines, there will be built in hardware supporting the function that any node can directly address all memory of the system
  • This distributed shared memory architecture is currently of great interest to (a major challenge for) parallel compilers

HTML version of Scripted Foils prepared 19 September 1997

Foil 41 MIMD Distributed Memory Architecture

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
A special case of this is a network of workstations (NOW's) or personal computers (metacomputer)
Issues include:
  • Node - CPU, Memory
  • Network - Bandwidth, Memory
  • Hardware Enhanced Access to distributed Memory

HTML version of Scripted Foils prepared 19 September 1997

Foil 42 Cache Coherent or Not?

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Suppose two processors cache the same variable stored in memory of one of the processors
One must ensure cache coherence so that when one cache value changes, all do!
[Figure: processors linked by an interconnection network, two of which hold cached values of the same shared variable]

HTML version of Scripted Foils prepared 19 September 1997

Foil 43 Choices in Cache Coherence

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Machines like the SGI Origin 2000 have a distributed shared memory with a so called directory implementation (pioneered in DASH project at Stanford) of cache coherence
Machines like SGI Cray T3E are distributed memory but do have fast get and put so as to be able to access single variables stored on remote memories
  • This gives the performance advantage of shared memory without the programming advantage, but also without the complex hardware of the Origin 2000
The Origin 2000 approach does not scale as well as the Cray T3E, and large Origin 2000 systems must use message passing to link 128-node coherent memory subsystems
The Cray T3E offers a uniform interface (though not as good a one for small numbers of nodes)
Message passing / distributed memory is the natural Web model

HTML version of Scripted Foils prepared 19 September 1997

Foil 44 Some Major Hardware Architectures - SIMD

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
SIMD -- Single Instruction Multiple Data -- can have logically distributed or shared memory
  • Examples are CM-1,2 from Thinking Machines
  • and AMT DAP and Maspar which are currently focussed entirely on accelerating parts of database indexing
  • This architecture is of decreasing interest as it has reduced functionality without a significant cost advantage compared to MIMD machines
  • The cost of synchronization in MIMD machines is not high!
  • The main interest of SIMD is flexible bit arithmetic, as the processors are "small"; but as transistor densities get higher this also becomes less interesting, since full function 64 bit CPUs use only a small fraction of the silicon of a modern chip

HTML version of Scripted Foils prepared 19 September 1997

Foil 45 SIMD (Single Instruction Multiple Data) Architecture

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
CM-2 -- 64K processors with 1 bit arithmetic -- hypercube network, a broadcast network that can also combine, and a "global or" network
Maspar, DECmpp -- 16K processors with 4 bit (MP-1) or 32 bit (MP-2) arithmetic, a fast two-dimensional mesh and a slower general switch for communication

HTML version of Scripted Foils prepared 19 September 1997

Foil 46 Some Major Hardware Architectures - Mixed

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
There are also heterogeneous compound architectures (metacomputers) obtained by arbitrary combinations of MIMD or SIMD, sequential or parallel machines.
Metacomputers can vary from full collections of several hundred PCs/settop boxes on the (future) World Wide Web to a CRAY C-90 connected to a CRAY T3D
This is a critical future architecture which is intrinsically distributed memory, as multi-vendor heterogeneity implies that one cannot have special hardware enhanced shared memory
  • Note that this can be a MIMD collection of SIMD machines if one has a set of Maspars on a network
  • One can think of the human brain as a SIMD machine, and then a group of people is such a MIMD collection of SIMD processors

HTML version of Scripted Foils prepared 19 September 1997

Foil 47 Some MetaComputer Systems

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Cluster of workstations or PC's
Heterogeneous MetaComputer System

HTML version of Scripted Foils prepared 19 September 1997

Foil 48 Comments on Special Purpose Devices

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
One example is an Associative memory - SIMD or MIMD or content addressable memories
This is an example of a special purpose "signal" processing machine which can in fact be built from "conventional" SIMD or MIMD architectures
This type of machine is not so popular, as most applications are not dominated by computations for which good special purpose devices can be designed
If only 10% of a problem is, say, "track-finding" or some other special purpose processing, then who cares if you reduce that 10% by a factor of 100
  • You have only sped up the system by a factor of 1.1, not by 100! (see the check at the end of this foil)
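This is just Amdahl's law; a minimal check in C (illustrative only):

#include <stdio.h>

/* If only a fraction f of the work runs on the special purpose device and
   that part is sped up by a factor s, the overall speedup is
   1 / ((1 - f) + f / s)  (Amdahl's law). */
int main(void)
{
    double f = 0.10;   /* 10% of the problem is, say, "track-finding" */
    double s = 100.0;  /* the special device does that part 100x faster */
    double speedup = 1.0 / ((1.0 - f) + f / s);
    printf("overall speedup = %.2f (not %.0f!)\n", speedup, s);
    return 0;
}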

HTML version of Scripted Foils prepared 19 September 1997

Foil 49 The GRAPE N-Body Machine

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
N body problems (e.g. Newton's laws for one million stars in a globular cluster) can have successful special purpose devices
See the GRAPE (GRAvity PipE) machine (Sugimoto et al., Nature 345, page 90, 1990)
  • Essential reason is that such problems need much less memory per floating point unit than most problems
  • Globular Cluster: 10^6 computations per datum stored
  • Finite Element Iteration: A few computations per datum stored
  • The rule of thumb is that one needs one gigabyte of memory per gigaflop of computation in general problems, and such a general design puts most of the cost into memory, not into the CPU.
Note GRAPE uses EXACTLY the same parallel algorithm that one finds in the books (e.g. Solving Problems on Concurrent Processors) for N-body problems on classic distributed memory MIMD machines

HTML version of Scripted Foils prepared 19 September 1997

Foil 50 Why isn't GRAPE a Perfect Solution?

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
GRAPE will execute the classic O(N^2) (parallel) N body algorithm BUT this is not the algorithm used in most such computations
Rather there is the O(N) or O(N log N) so-called "fast multipole" algorithm, which uses a hierarchical approach
  • On one million stars, fast multipole is a factor of 100-1000 faster than the GRAPE algorithm
  • Fast multipole works in most but not all N-body problems (in globular clusters, extreme heterogeneity makes the direct O(N^2) method most attractive)
So special purpose devices cannot usually take advantage of new nifty algorithms!
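For reference, the classic O(N^2) direct force sum that GRAPE pipelines in hardware looks roughly like the following (a sketch in C with unit masses, G = 1 and a softening parameter; illustrative only, and tiny N where GRAPE-class runs use N ~ 10^6):

#include <stdio.h>
#include <math.h>

#define N 4

int main(void)
{
    double x[N] = {0, 1, 2, 3}, y[N] = {0, 0, 1, 1}, z[N] = {0, 1, 0, 1};
    double fx[N] = {0}, fy[N] = {0}, fz[N] = {0}, eps = 1e-3;

    /* All N^2 pair interactions: the inner loop is what GRAPE puts in silicon */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            if (i == j) continue;
            double dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            double r2 = dx*dx + dy*dy + dz*dz + eps*eps;   /* softened distance^2 */
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            fx[i] += dx * inv_r3;
            fy[i] += dy * inv_r3;
            fz[i] += dz * inv_r3;
        }
    for (int i = 0; i < N; i++)
        printf("force on body %d: (%g, %g, %g)\n", i, fx[i], fy[i], fz[i]);
    return 0;
}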

HTML version of Scripted Foils prepared 19 September 1997

Foil 51 Granularity of Parallel Components - I

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors.
  • Pieces and processors may be heterogeneous. Computation/Communication ratio very high -- typical of networked metacomputing
Medium-grain: Tens to few thousands of pieces, typically executed by microprocessors.
  • Processors typically run the same code.(SPMD Style)
  • Computation/communication ratio often hundreds or more.
  • Typical of MIMD Parallel Systems such as SP2, CM5, Paragon, T3D

HTML version of Scripted Foils prepared 19 September 1997

Foil 52 Granularity of Parallel Components - II

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines.
  • Processors often have instructions broadcast to them.
  • Computation/ Communication ratio often near unity.
  • Typical of SIMD but seen in a few MIMD systems such as Kogge's Execube, Dally's J Machine or the commercial Myrinet (Seitz)
  • This is going to be very important in future petaflop architectures, as the dense chips of year 2003 onwards favor this Processor in Memory architecture
  • So many "transistors" in future chips that the "small processors" of the "future" will be similar to today's high end microprocessors
  • As chips get denser, not realistic to put processors and memories on separate chips as granularities become too big
Note that a machine of given granularity can be used on algorithms of the same or finer granularity

HTML version of Scripted Foils prepared 19 September 1997

Foil 53 III. Key drivers: The Need for PetaFLOPS Computing

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Applications that require petaFLOPS can already be identified
  • (DOE) Nuclear weapons stewardship
  • (NSA) Cryptology and digital signal processing
  • (NASA and NSF) Satellite data assimilation and climate modeling
The need for ever greater computing power will remain.
PetaFLOPS systems are the right step for the next decade

HTML version of Scripted Foils prepared 19 September 1997

Foil 54 10 Possible PetaFlop Applications

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Nuclear Weapons Stewardship (ASCI)
Cryptology and Digital Signal Processing
Satellite Data Analysis
Climate and Environmental Modeling
3-D Protein Molecule Reconstruction
Real-Time Medical Imaging
Severe Storm Forecasting
Design of Advanced Aircraft
DNA Sequence Matching
Molecular Simulations for nanotechnology
Large Scale Economic Modelling
Intelligent Planetary Spacecraft

HTML version of Scripted Foils prepared 19 September 1997

Foil 55 Petaflop Performance for Flow in Porous Media?

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Why does one need a petaflop (10^15 operations per second) computer?
These are problems where quite viscous (oil, pollutants) liquids percolate through the ground
Very sensitive to details of material
Most important problems are already solved at some level, but most solutions are insufficient and need improvement in various respects:
  • under resolution of solution details, averaging of local variations and under representation of physical details
  • rapid solutions to allow efficient exploration of system parameters
  • robust and automated solution, to allow integration of results in high level decision, design and control functions
  • inverse problems (history match) to reconstruct missing data require multiple solutions of the direct problem

HTML version of Scripted Foils prepared 19 September 1997

Foil 56 Target Flow in Porous Media Problem (Glimm - Petaflop Workshop)

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Oil Resevoir Simulation
Geological variation occurs down to the pore size of the rock -- almost 10^-6 metres -- model this (statistically)
Want to calculate flow between wells which are about 400 metres apart
10^3 x 10^3 x 10^2 = 10^8 grid elements
30 species
10^4 time steps
300 separate cases need to be considered
3x10^9 words of memory per case
10^12 words total if all cases considered in parallel
10^19 floating point operations
3 hours on a petaflop computer
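A back-of-envelope check of these numbers (a sketch in C; the 10^19 total operations quoted on the foil is taken as given rather than rederived):

#include <stdio.h>

int main(void)
{
    double grid    = 1e3 * 1e3 * 1e2;      /* 10^8 grid elements            */
    double species = 30.0;
    double cases   = 300.0;
    double words_per_case = grid * species;             /* ~3e9 words       */
    double words_total    = words_per_case * cases;     /* ~1e12 words      */
    double flops_total    = 1e19;                       /* quoted on foil   */
    double petaflop       = 1e15;                       /* ops per second   */

    printf("memory per case              : %.1e words\n", words_per_case);
    printf("memory, all cases in parallel: %.1e words\n", words_total);
    printf("time on a petaflop machine   : %.1f hours\n",
           flops_total / petaflop / 3600.0);
    return 0;
}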

HTML version of Scripted Foils prepared 19 September 1997

Foil 57 NASA's Projection of Memory and Computational Requirements up to Petaflops for Aerospace Applications

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index

HTML version of Scripted Foils prepared 19 September 1997

Foil 58 Supercomputer Architectures in Years 2005-2010 -- I

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Conventional (Distributed Shared Memory) Silicon
  • Clock Speed 1GHz
  • 4 eight way parallel Complex RISC nodes per chip
  • 4000 Processing chips gives over 100 tera(fl)ops
  • 8000 2-Gigabyte DRAMs give 16 Terabytes of Memory
Note Memory per Flop is much less than one to one
Natural scaling says time steps decrease at the same rate as spatial intervals, and so the memory needed goes like (FLOPS in Gigaflops)**0.75
  • If One Gigaflop requires One Gigabyte of memory (Or is it one Teraflop that needs one Terabyte?)
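One reading of that scaling argument, as a small sketch in C (the 1 GByte per Gflop normalization is the foil's own rule of thumb; the 3D reasoning in the comment is an assumption):

#include <stdio.h>
#include <math.h>

/* In a 3D time-dependent simulation with mesh spacing h, memory ~ h^-3 points
   while flops ~ h^-4 (points x time steps), so memory ~ flops^(3/4).
   Normalize so that 1 Gflop needs 1 GByte. */
int main(void)
{
    double gflops = 1.0e5;                      /* ~100 teraflops machine */
    double gbytes = pow(gflops, 0.75);          /* ~5.6 terabytes         */
    printf("%.0f Gflops -> about %.0f GBytes under the 0.75 scaling\n",
           gflops, gbytes);
    return 0;
}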

HTML version of Scripted Foils prepared 19 September 1997

Foil 59 Supercomputer Architectures in Years 2005-2010 -- II

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Superconducting Technology is promising but can it compete with silicon juggernaut?
Should be able to build a simple 200 GHz Superconducting CPU with modest superconducting caches (around 32 Kilobytes)
Must use the same DRAM technology as for silicon CPUs?
So there is a tremendous challenge to build latency tolerant algorithms (as there is over a factor of 100 difference between CPU and memory speed), but the advantage is that a factor of 30-100 less parallelism is needed

HTML version of Scripted Foils prepared 19 September 1997

Foil 60 Supercomputer Architectures in Years 2005-2010 -- III

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Processor in Memory (PIM) Architecture is follow on to J machine (MIT) Execube (IBM -- Peter Kogge) Mosaic (Seitz)
  • More interesting in 2007, as processors will be "real" and have a nontrivial amount of memory
  • Naturally fetch a complete row (column) of memory at each access -- perhaps 1024 bits
In the year 2007 one could take each two gigabyte memory chip and alternatively build it as a mosaic of
  • One Gigabyte of Memory
  • 1000 250,000 transistor simple CPU's running at 1 Gigaflop each and each with one megabyte of on chip memory
12000 chips (Same amount of Silicon as in first design but perhaps more power) gives:
  • 12 Terabytes of Memory
  • 12 Petaflops performance
  • This design "extrapolates" specialized DSP's , the GRAPE (specialized teraflop N body machine) etc to a "somewhat specialized" system with a general CPU but a special memory poor architecture with particular 2/3D layout

HTML version of Scripted Foils prepared 19 September 1997

Foil 61 Performance Per Transistor

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Performance data from microprocessor vendors
Transistor count excludes on-chip caches
Performance normalized by clock rate
Conclusion: Simplest is best! (250K Transistor CPU)
[Figure: normalized SPECINT and SPECFP performance plotted against millions of transistors per CPU]

HTML version of Scripted Foils prepared 19 September 1997

Foil 62 Comparison of Supercomputer Architectures

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Fixing 10-20 Terabytes of Memory, we can get
A 16000 way parallel natural evolution of today's machines, with various architectures from distributed shared memory to clustered hierarchy
  • Peak performance is 150 Teraflops with memory systems like today's, but worse with more levels of cache
A 5000 way parallel superconducting system with 1 Petaflop performance but a terrible imbalance between CPU and memory speeds
A 12 million way parallel PIM system with 12 petaflops performance and a "distributed memory architecture", as off chip access will have serious penalties
There are many hybrid and intermediate choices -- these are extreme examples of "pure" architectures

HTML version of Scripted Foils prepared 19 September 1997

Foil 63 Current PIM Chips

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Chip          First Silicon   Peak            Storage    MB/Perf.             Organization
EXECUBE       1993            50 Mips         0.5 MB     0.01 MB/Mip          16 bit SIMD/MIMD CMOS
AD SHARC      1994            120 Mflops      0.5 MB     0.005 MB/MF          Single CPU and Memory
TI MVP        1994            2000 Mops       0.05 MB    0.000025 MB/Mop      1 CPU, 4 DSP's
MIT MAP       1996            800 Mflops      0.128 MB   0.00016 MB/MF        4 Superscalar CPU's
Terasys PIM   1993            625 M bit ops   0.016 MB   0.000026 MB/bit op   1024 16-bit ALU's

HTML version of Scripted Foils prepared 19 September 1997

Foil 64 New "Strawman" PIM Processing Node Macro

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index

HTML version of Scripted Foils prepared 19 September 1997

Foil 65 "Strawman" Chip Floorplan

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index

HTML version of Scripted Foils prepared 19 September 1997

Foil 66 SIA-Based PIM Chip Projections

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
[Figure: SIA-based PIM chip projections of MB per cm^2, MF per cm^2, and the resulting MB/MF ratios]

HTML version of Scripted Foils prepared 19 September 1997

Foil 67 Classes of Communication Networks

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
The last major architectural feature of a parallel machine is the network or design of hardware/software connecting processors and memories together.
Bus: All processors (and memory) connected to a common bus or busses.
  • Memory access fairly uniform, but not very scalable due to contention
  • Bus machines can be NUMA if memory consists of directly accessed local memory as well as memory banks accessed by Bus. The Bus accessed memories can be local memories on other processors
Switching Network: Processors (and memory) connected to routing switches like in telephone system.
  • Switches might have queues and "combining logic", which improve functionality but increase latency.
  • Switch settings may be determined by message headers or preset by controller.
  • Connections can be packet-switched (messages no longer than some fixed size) or circuit-switched (connection remains as long as needed)
  • Usually NUMA, blocking, often scalable and upgradable

HTML version of Scripted Foils prepared 19 September 1997

Foil 68 Switch and Bus based Architectures

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
[Figure: a switch-based architecture and a bus-based architecture]

HTML version of Scripted Foils prepared 19 September 1997

Foil 69 Examples of Interconnection Topologies

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Two dimensional grid, Binary tree, complete interconnect and 4D Hypercube.
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are only partial

HTML version of Scripted Foils prepared 19 September 1997

Foil 70 Useful Concepts in Communication Systems

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Useful terms include:
Scalability: Can network be extended to very large systems? Related to wire length (synchronization and driving problems), degree (pinout)
Fault Tolerance: How easily can system bypass faulty processor, memory, switch, or link? How much of system is lost by fault?
Blocking: Some communication requests may not get through, due to conflicts caused by other requests.
Nonblocking: All communication requests succeed. Sometimes just applies as long as no two requests are for same memory cell or processor.
Latency (delay): Maximal time for nonblocked request to be transmitted.
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on time needed for task.
Wormhole Routing -- intermediate switch nodes do not wait for the full message but allow it to pass through in small packets

HTML version of Scripted Foils prepared 19 September 1997

Foil 71 Latency and Bandwidth of a Network

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Transmission time for a message of n bytes:
T(n) = T0 + T1 * n, where
T0 is the latency; it contains a term proportional to the number of hops, plus a term representing the interrupt processing time at the start and end of the communication while network and processor synchronize
T0 = Ts + Td * (Number of hops)
T1 is the inverse bandwidth -- it can be made small if the pipe is of large size.
In practice Ts and T1 are most important and Td is unimportant
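A sketch of this model in C (the numerical values of Ts, Td and T1 below are assumptions, typical-looking 1990s MPP figures, purely to show that latency dominates short messages and bandwidth long ones):

#include <stdio.h>

/* T(n) = T0 + T1 * n,  with  T0 = Ts + Td * hops */
static double transfer_usec(double n_bytes, double Ts, double Td, int hops,
                            double T1)
{
    double T0 = Ts + Td * hops;      /* startup (latency) term      */
    return T0 + T1 * n_bytes;        /* plus inverse-bandwidth term */
}

int main(void)
{
    double Ts = 50.0;      /* software startup, microseconds (assumed) */
    double Td = 0.5;       /* per-hop delay, microseconds (assumed)    */
    double T1 = 0.025;     /* usec per byte, i.e. 40 MBytes/sec        */
    int hops  = 4;

    printf("   8 bytes: %10.2f usec\n", transfer_usec(8,     Ts, Td, hops, T1));
    printf("  1 Kbyte : %10.2f usec\n", transfer_usec(1024,  Ts, Td, hops, T1));
    printf("  1 Mbyte : %10.2f usec\n", transfer_usec(1.0e6, Ts, Td, hops, T1));
    return 0;
}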

HTML version of Scripted Foils prepared 19 September 1997

Foil 72 Transfer Time in Microseconds for both Shared Memory Operations and Explicit Message Passing

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995

HTML version of Scripted Foils prepared 19 September 1997

Foil 73 Latency/Bandwidth Space for 0-byte message(Latency) and 1 MB message(bandwidth).

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
Square blocks indicate shared memory copy performance
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995

HTML version of Scripted Foils prepared 19 September 1997

Foil 74 Communication Performance of Some MPP's

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
From Aspects of Computational Science, Editor Aad van der Steen, published by NCF
System                 Communication Speed       Computation Speed
                       Mbytes/sec (per link)     Mflops/sec (per node)
IBM SP2                        40                        267
Intel iPSC860                   2.8                       60
Intel Paragon                 200                         75
Kendall Square KSR-1           17.1                       40
Meiko CS-2                    100                        200
Parsytec GC                    20                         25
TMC CM-5                       20                        128
Cray T3D                      150                        300

HTML version of Scripted Foils prepared 19 September 1997

Foil 75 Implication of Hardware Performance

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
tcomm = 4 or 8 / (Speed in Mbytes/sec)
  • as there are 4 or 8 bytes in a floating point word
tfloat = 1 / (Speed in Mflops/sec)
Thus tcomm / tfloat is just 4 x Computation Speed divided by Communication Speed
tcomm / tfloat is 26.7, 85, 1.5, 9.35, 8, 5, 25.6, 8 for the machines SP2, iPSC860, Paragon, KSR-1, Meiko CS-2, Parsytec GC, TMC CM-5, and Cray T3D respectively (see the sketch at the end of this foil)
Latency makes the situation worse for small messages, and the ratio doubles for the 64 bit arithmetic natural on large problems!
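The ratios above follow directly from the previous foil's table; a small check in C (assuming 4-byte words):

#include <stdio.h>

struct machine { const char *name; double mbytes_sec, mflops_sec; };

int main(void)
{
    struct machine m[] = {
        {"IBM SP2",        40.0, 267.0}, {"Intel iPSC860",   2.8,  60.0},
        {"Intel Paragon", 200.0,  75.0}, {"KSR-1",          17.1,  40.0},
        {"Meiko CS-2",    100.0, 200.0}, {"Parsytec GC",    20.0,  25.0},
        {"TMC CM-5",       20.0, 128.0}, {"Cray T3D",      150.0, 300.0},
    };
    for (int i = 0; i < 8; i++) {
        double tcomm  = 4.0 / m[i].mbytes_sec;   /* usec per 4-byte word */
        double tfloat = 1.0 / m[i].mflops_sec;   /* usec per flop        */
        printf("%-14s tcomm/tfloat = %5.1f\n", m[i].name, tcomm / tfloat);
    }
    return 0;
}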

HTML version of Scripted Foils prepared 19 September 1997

Foil 76 Netlib Benchweb Benchmarks

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
See Original Foil

HTML version of Scripted Foils prepared 19 September 1997

Foil 77 Linpack Benchmarks

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
See Original Foil

HTML version of Scripted Foils prepared 19 September 1997

Foil 78 Java Linpack Benchmarks

From Master Set of Foils for 1997 Session of CPS615 CPS615 Basic Simulation Track for Computational Science -- Fall Semester 97. *
Full HTML Index
See Original Foil

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu
