Full HTML for Scripted Foilset CPS615 - Overview of Computer Architectures

Given by Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science on Fall Semester 98. Foils prepared 17 November 1998
Outside Index Summary of Material Various HPCC Resource Lists for Foil 2


This presentation came from material developed by David Culler and Jack Dongarra available on the Web
See summary of Saleh Elmohamed and Ken Hawick at http://nhse.npac.syr.edu/hpccsurvey/
We discuss several examples in detail including T3E, Origin 2000, Sun E10000 and Tera MTA
These are used to illustrate major architecture types
We discuss key sequential architecture issues including cache structure
We also discuss technologies from today's commodities through Petaflop ideas and Quantum Computing

Table of Contents for full HTML of CPS615 - Overview of Computer Architectures

Denote Foils where Image Critical
Denote Foils where Image has important information
Denote Foils where HTML is sufficient
Denote Foils where Image is not available
denotes presence of Additional linked information, which is shown in light purple if missing

1 Various HPCC Resource Lists for Foil 1 Computer Architecture for Computational Science
2 Various HPCC Resource Lists for Foil 2 Abstract of Computer Architecture Overview
3 Some NPAC Parallel Machines
4 Technologies for High Performance Computers
5 Architectures for High Performance Computers - I
6 Architectures for High Performance Computers - II
7 There is no Best Machine!
8 Architectural Trends I
9 Architectural Trends
10 3 Classes of VLSI Design?
11 Ames Summer 97 Workshop on Device Technology -- Moore's Law - I
12 Ames Summer 97 Workshop on Device Technology -- Moore's Law - II
13 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies I
14 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies II
15 Architectural Trends: Bus-based SMPs
16 Bus Bandwidth
17 Economics
18 Important High Performance Computing Architectures
19 Some General Issues Addressed by High Performance Architectures
20 Architecture Classes of High Performance Computers
21 Flynn's Classification of HPC Systems
22 Description of Linpack as used by Top500 List Raw Uniprocessor Performance: Cray v. Microprocessor LINPACK n by n Matrix Solves
23 Description of Linpack as used by Top500 List Raw Parallel Performance: LINPACK
24 Description of Linpack as used by Top500 List Linear Linpack HPC Performance versus Time
25 Top 500 Supercomputer List from which you can get customized sublists Top 10 Supercomputers November 1998
26 Top 500 and Jack Dongarra Links for Foil 26 Distribution of 500 Fastest Computers
27 Top 500 and Jack Dongarra Links for Foil 27 CPU Technology used in Top 500 versus Time
28 Top 500 and Jack Dongarra Links for Foil 28 Geographical Distribution of Top 500 Supercomputers versus time
29 Top 500 and Jack Dongarra Links for Foil 29 Node Technology used in Top 500 Supercomputers versus Time
30 Top 500 and Jack Dongarra Links for Foil 30 Total Performance in Top 500 Supercomputers versus Time and Manufacturer
31 Top 500 and Jack Dongarra Links for Foil 31 Number of Top 500 Systems as a function of time and Manufacturer
32 Top 500 and Jack Dongarra Links for Foil 32 Total Number of Top 500 Systems Installed June 98 versus Manufacturer
33 Netlib Benchweb Benchmarks
34 Linpack Benchmarks
35 Java Linpack Benchmarks
36 Java Numerics
37 von Neumann Architecture in a Nutshell
38 What is a Pipeline -- Cafeteria Analogy?
39 Instruction Flow in A Simple Machine Pipeline
40 Example of MIPS R4000 Floating Point
41 MIPS R4000 Floating Point Stages
42 Illustration of Importance of Cache
43 Sequential Memory Structure
44 Cache Issues I
45 Cache Issues II
46 Spatial versus Temporal Locality I
47 Spatial versus Temporal Locality II
48 Interesting Article from SC97 proceedings on T3E performance Cray/SGI memory latencies
49 NPAC Summary of T3E Architecture Architecture of Cray T3E
50 NPAC Summary of T3E Architecture T3E Messaging System
51 Interesting Article from SC97 proceedings on T3E performance Cray T3E Cache Structure
52 Interesting Article from SC97 proceedings on T3E performance Cray T3E Cache Performance
53 Interesting Article from SC97 proceedings on T3E performance Finite Difference Example for T3E Cache Use I
54 Interesting Article from SC97 proceedings on T3E performance Finite Difference Example for T3E Cache Use II
55 Interesting Article from SC97 proceedings on T3E performance How to use Cache in Example I
56 Interesting Article from SC97 proceedings on T3E performance How to use Cache in Example II
57 NPAC Survey of Cray Systems Cray Vector Supercomputers
58 Vector Supercomputers in a Nutshell - I
59 Vector Supercomputing in a picture
60 Vector Supercomputers in a Nutshell - II
61 Parallel Computer Architecture Memory Structure
62 Comparison of Memory Access Strategies
63 Types of Parallel Memory Architectures -- Physical Characteristics
64 Diagrams of Shared and Distributed Memories
65 Parallel Computer Architecture Control Structure
66 Mark2 Hypercube built by JPL(1985) Cosmic Cube (1983) built by Caltech (Chuck Seitz)
67 64 Ncube Processors (each with 6 memory chips) on a large board
68 ncube1 Chip -- integrated CPU and communication channels
69 More Details on IBM SP2 Example of Message Passing System: IBM SP-2
70 Example of Message Passing System: Intel Paragon
71 ASCI Red Intel Supercomputer at Sandia ASCI Red -- Intel Supercomputer at Sandia
72 Parallel Computer Memory Structure
73 Cache Coherent or Not?
74 Cache Coherence
75 NPAC Summary of Origin 2000 Architecture SGI Origin 2000 I
76 NPAC Summary of Origin 2000 Architecture SGI Origin II
77 NPAC Summary of Origin 2000 Architecture SGI Origin Block Diagram
78 NPAC Summary of Origin 2000 Architecture SGI Origin III
79 NPAC Summary of Origin 2000 Architecture SGI Origin 2 Processor Node Board
80 NCSA Performance Measurements Performance of NCSA 128 node SGI Origin 2000
81 Summary of Cache Coherence Approaches
82 Some Major Hardware Architectures - SIMD
83 SIMD (Single Instruction Multiple Data) Architecture
84 Examples of Some SIMD machines
85 Computer Museum Entry for Connection Machine SIMD CM 2 from Thinking Machines
86 Computer Museum Entry for Connection Machine Official Thinking Machines Specification of CM2
87 Some Major Hardware Architectures - Mixed
88 Some MetaComputer Systems
89 Clusters of PC's 1986-1998
90 NCSA Performance Measurements HP Kayak PC (300 MHz Intel Pentium II) vs Origin 2000
91 Comments on Special Purpose Devices
92 Description of GRAPE 4 5 and 6 Machines:1 to 200 Teraflops The GRAPE N-Body Machine
93 Description of GRAPE 4 5 and 6 Machines:1 to 200 Teraflops Why isn't GRAPE a Perfect Solution?
94 Grape Special Purpose Machine GRAPE Special Purpose Machines
95 Special Purpose Physics Machines for Foil 100 Quantum ChromoDynamics (QCD) Special Purpose Machines
96 Granularity of Parallel Components - I
97 Granularity of Parallel Components - II
98 Classes of Communication Networks
99 Switch and Bus based Architectures
100 Examples of Interconnection Topologies
101 Useful Concepts in Communication Systems
102 Latency and Bandwidth of a Network
103 Transfer Time in Microseconds for both Shared Memory Operations and Explicit Message Passing
104 Latency/Bandwidth Space for 0-byte message(Latency) and 1 MB message(bandwidth).
105 Communication Performance of Some MPP's
106 Implication of Hardware Performance
107 NASA Ames Comparison of MPI on Origin 2000 and Sun E10000 MPI Bandwidth on SGI Origin and Sun Shared Memory Machines
108 NASA Ames Comparison of MPI on Origin 2000 and Sun E10000 Latency Measurements on Origin and Sun for MPI
109 Two Basic Programming Models
110 Shared Address Space Architectures
111 Shared Address Space Model
112 Communication Hardware
113 History -- Mainframe
114 History -- Minicomputer
115 Scalable Interconnects
116 Message Passing Architectures
117 Message-Passing Abstraction e.g. MPI
118 First Message-Passing Machines
119 SMP Example: Intel Pentium Pro Quad
120 Descriptions of Sun HPC Systems for Foil 69 Sun E10000 in a Nutshell
121 Descriptions of Sun HPC Systems for Foil 70 Sun Enterprise Systems E6000/10000
122 Descriptions of Sun HPC Systems for Foil 71 Starfire E10000 Architecture I
123 Descriptions of Sun HPC Systems for Foil 72 Starfire E10000 Architecture II
124 Descriptions of Sun HPC Systems for Foil 73 Sun Enterprise E6000/6500 Architecture
125 Descriptions of Sun HPC Systems for Foil 74 Sun's Evaluation of E10000 Characteristics I
126 Descriptions of Sun HPC Systems for Foil 75 Sun's Evaluation of E10000 Characteristics II
127 Descriptions of Sun HPC Systems for Foil 76 Scalability of E10000
128 Consider Scientific Supercomputing
129 Toward Architectural Convergence
130 Convergence: Generic Parallel Architecture
131 Tera Architecture and System Links for Foil 79 Tera Multithreaded Supercomputer
132 Tera Architecture and System Links for Foil 80 Tera Computer at San Diego Supercomputer Center
133 Tera Architecture and System Links for Foil 81 Overview of the Tera MTA I
134 Tera Architecture and System Links for Foil 82 Overview of the Tera MTA II
135 Tera Architecture and System Links for Foil 83 Tera 1 Processor Architecture from H. Bokhari (ICASE)
136 Tera Architecture and System Links for Foil 84 Tera Processor Characteristics
137 Tera Architecture and System Links for Foil 85 Tera System Diagram
138 Tera Architecture and System Links for Foil 86 Interconnect / Communications System of Tera I
139 Tera Architecture and System Links for Foil 87 Interconnect / Communications System of Tera II
140 Tera Architecture and System Links for Foil 88 T90/Tera MTA Hardware Comparison
141 Tera Architecture and System Links for Foil 89 Tera Configurations / Performance
142 Tera Architecture and System Links for Foil 90 Performance of MTA wrt T90 and in parallel
143 Tera Architecture and System Links for Foil 91 Tera MTA Performance on NAS Benchmarks Compared to T90
144 Cache Only COMA Machines
145 III. Key drivers: The Need for PetaFLOPS Computing
146 10 Possible PetaFlop Applications
147 Petaflop Performance for Flow in Porous Media?
148 Target Flow in Porous Media Problem (Glimm - Petaflop Workshop)
149 NASA's Projection of Memory and Computational Requirements upto Petaflops for Aerospace Applications
150 Supercomputer Architectures in Years 2005-2010 -- I
151 Supercomputer Architectures in Years 2005-2010 -- II
152 Supercomputer Architectures in Years 2005-2010 -- III
153 Performance Per Transistor
154 Comparison of Supercomputer Architectures
155 Current PIM Chips
156 New "Strawman" PIM Processing Node Macro
157 "Strawman" Chip Floorplan
158 SIA-Based PIM Chip Projections
159 Quantum Computing - I
160 Quantum Computing - II
161 Quantum Computing - III
162 Superconducting Technology -- Past
163 Superconducting Technology -- Present
164 Superconducting Technology -- Problems

Outside Index Summary of Material



HTML version of Scripted Foils prepared 17 November 1998

Foil 1 Computer Architecture for Computational Science

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Various HPCC Resource Lists for Foil 1
Fall Semester 1998
Geoffrey Fox
Northeast Parallel Architectures Center
Syracuse University
111 College Place
Syracuse NY
gcf@npac.syr.edu

HTML version of Scripted Foils prepared 17 November 1998

Foil 2 Abstract of Computer Architecture Overview

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Various HPCC Resource Lists for Foil 2
This presentation came from material developed by David Culler and Jack Dongarra available on the Web
See summary of Saleh Elmohamed and Ken Hawick at http://nhse.npac.syr.edu/hpccsurvey/
We discuss several examples in detail including T3E, Origin 2000, Sun E10000 and Tera MTA
These are used to illustrate major architecture types
We discuss key sequential architecture issues including cache structure
We also discuss technologies from today's commodities through Petaflop ideas and Quantum Computing

HTML version of Scripted Foils prepared 17 November 1998

Foil 3 Some NPAC Parallel Machines

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
CM5
nCUBE
Intel iPSC2
Workstation Cluster of Digital alpha machines
NCSA 1024 nodes
NPAC

HTML version of Scripted Foils prepared 17 November 1998

Foil 4 Technologies for High Performance Computers

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
We can choose technology and architecture separately in designing our high performance system
Technology is like choosing ants, people or tanks as basic units in our society analogy
  • or less frivolously neurons or brains
In HPCC arena, we can distinguish current technologies
  • COTS (Commercial off-the-shelf) Microprocessors
  • Custom node computer architectures
  • More generally these are all CMOS technologies
Near term technology choices include
  • Gallium Arsenide or Superconducting materials as opposed to Silicon
  • These are faster by a factor of 2 (GaAs) to 300 (Superconducting)
Further term technology choices include
  • DNA (Chemical) or Quantum technologies
It will cost $40 billion for the next round of industry investment in CMOS plants, and this huge investment makes it hard for new technologies to "break in"

HTML version of Scripted Foils prepared 17 November 1998

Foil 5 Architectures for High Performance Computers - I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Architecture is equivalent to organization or design in society analogy
  • Different models for society (Capitalism etc.) or different types of groupings in a given society
  • Businesses or Armies are more precisely controlled/organized than a crowd at the State Fair
  • We will generalize this to formal (army) and informal (crowds) organizations
We can distinguish formal and informal parallel computers
Informal parallel computers are typically "metacomputers"
  • i.e. a bunch of computers sitting on a department network

HTML version of Scripted Foils prepared 17 November 1998

Foil 6 Architectures for High Performance Computers - II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Metacomputers are a very important trend; they use similar software and algorithms to conventional "MPP's" but typically have less optimized parameters
  • In particular network latency is higher and bandwidth is lower for an informal HPC
  • Latency is time for zero length communication -- start up time
Formal high performance computers are the classic (basic) object of study and are
"closely coupled" specially designed collections of compute nodes which have (in principle) been carefully optimized and balanced in the areas of
  • Processor (computer) nodes
  • Communication (internal) Network
  • Linkage of Memory and Processors
  • I/O (external network) capabilities
  • Overall Control or Synchronization Structure

HTML version of Scripted Foils prepared 17 November 1998

Foil 7 There is no Best Machine!

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
In society, we see a rich set of technologies and architectures
  • Ant Hills
  • Brains as bunch of neurons
  • Cities as informal bunch of people
  • Armies as formal collections of people
With several different communication mechanisms with different trade-offs
  • One can walk -- low latency, low bandwidth
  • Go by car -- high latency (especially if can't park), reasonable bandwidth
  • Go by air -- higher latency and bandwidth than car
  • Phone -- High speed at long distance but can only communicate modest material (low capacity)

HTML version of Scripted Foils prepared 17 November 1998

Foil 8 Architectural Trends I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Architecture translates technology's gifts to performance and capability
Resolves the tradeoff between parallelism and locality
  • Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  • Tradeoffs may change with scale and technology advances
Four generations of architectural history: tube, transistor, IC, VLSI
  • Here focus only on VLSI generation
  • Future generations COULD be Quantum and Superconducting technology
Greatest delineation within VLSI generation has been in type of parallelism exploited

HTML version of Scripted Foils prepared 17 November 1998

Foil 9 Architectural Trends

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Greatest trend in VLSI generation is increase in parallelism
  • Up to 1985: bit level parallelism: 4-bit -> 8-bit -> 16-bit
    • slows after 32 bit
    • adoption of 64-bit now under way, 128-bit far off (not a performance issue)
    • important inflection point when 32-bit microprocessor and cache fit on a chip
  • Mid 80s to mid 90s: instruction level parallelism
    • pipelining and simple instruction sets, + compiler advances (RISC)
    • on-chip caches and functional units => superscalar execution
    • greater sophistication: out of order execution, speculation, prediction
      • to deal with control transfer and latency problems
    • Next step: thread level parallelism

HTML version of Scripted Foils prepared 17 November 1998

Foil 10 3 Classes of VLSI Design?

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
How good is instruction-level parallelism?
Thread-level parallelism needed in future microprocessors to use available transistors?
Threads need classic coarse grain data or functional parallelism
Exponential Improvement is (an example of) Moore's Law

HTML version of Scripted Foils prepared 17 November 1998

Foil 11 Ames Summer 97 Workshop on Device Technology -- Moore's Law - I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
These foils come from summary by David Bailey
First of all, there seems to be general agreement that Moore's Law will not stop anytime soon.
The current state of the art in the semiconductor industry is 250 nm, as recently announced by Intel among others. Researchers are confident that current approaches can be used for at least another three generations.
In the years ahead we may even see some manufacturers skip a generation, proceeding directly to significantly smaller feature sizes. This means that the 100 nm technology wall will be reached earlier than previously anticipated.

HTML version of Scripted Foils prepared 17 November 1998

Foil 12 Ames Summer 97 Workshop on Device Technology -- Moore's Law - II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Below about 100 nm feature sizes, progress will be more difficult, for various well known reasons, among them the lack of any substance that can be used as a lens for photolithography at such wavelengths.
Nonetheless, there are a number of "tricks" in the works, including the usage of diffraction gratings, parallel exposures, and others.
Groups working on these "ultimate" silicon techniques include Frances Houle's group at IBM Almaden and Jeff Baker's group at UC Berkeley.
Also helping will be some improvements such as better materials and increased wafer sizes. The consensus is that these techniques will be good for another two generations, to about 50 nm. Thus Moore's Law should continue until about 2010 or 2012.

HTML version of Scripted Foils prepared 17 November 1998

Foil 13 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Below about 50 nm feature sizes, it appears that a completely new device fabrication technology is needed.
Progress in electron beams and X-rays, which were the leading candidates of a few years ago, has stalled.
  • No one yet has any good idea how to scale e-beam lithography to economical production, and X-rays seem to destroy everything in their path, including the masks and devices. If something does eventually materialize along these lines, it won't be easy.
The researchers I spoke with nonetheless agree that future devices will be electronic for the foreseeable future.
As Stan Williams of HP points out, electrons are the ideal basis for device technology because they have basically zero size and mass, can travel at large fractions of the speed of light and interact strongly with matter.

HTML version of Scripted Foils prepared 17 November 1998

Foil 14 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
In contrast, DNA computing and the like are still curiosities with no clear path to practical high-speed computing.
One exciting development, which has emerged just in the past year or two, is nanotubes, i.e. cylindrical buckyballs.
It was recently discovered that a nanotube can be "tuned" from insulator to semiconductor to conductor just by changing the pitch of the helical structure of carbon atoms.
Kinks introduced in a tube can be used to form a conductor-semiconductor junction.
  • Molecular modeling is being used to further explore some of the possibilities.

HTML version of Scripted Foils prepared 17 November 1998

Foil 15 Architectural Trends: Bus-based SMPs

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Shared memory SMP's dominate server and enterprise market, moving down to desktop as cost (size) of a single processor decreased
Faster processors began to saturate bus, then bus technology advanced
Today, range of sizes for bus-based systems is desktop (2-8) to 64 on large servers
Cray CS6400 became the Sun E10000

HTML version of Scripted Foils prepared 17 November 1998

Foil 16 Bus Bandwidth

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index

HTML version of Scripted Foils prepared 17 November 1998

Foil 17 Economics

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Commodity microprocessors not only fast but CHEAP
  • Development cost is tens of millions of dollars ($5-100M typical)
  • BUT, many more are sold compared to supercomputers
  • Crucial to take advantage of the investment, and use the commodity building block
  • Exotic parallel architectures now survive mainly as special-purpose machines
Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors
Standardization by Intel makes small, bus-based SMPs commodity
Desktop: few smaller processors versus one larger one?
  • Multiprocessor on a chip

HTML version of Scripted Foils prepared 17 November 1998

Foil 18 Important High Performance Computing Architectures

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Pipelined High Performance Workstations/PC's
Vector Supercomputer
MIMD Distributed Memory machine; heterogeneous clusters; metacomputers
SMP Symmetric Multiprocessors
Distributed Shared Memory NUMA
Special Purpose Machines
SIMD
MTA or Multithreaded architectures
COMA or Cache Only Memories
The Past?
The Future?

HTML version of Scripted Foils prepared 17 November 1998

Foil 19 Some General Issues Addressed by High Performance Architectures

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Pragmatic issues connected with cost of designing and building new systems -- use commodity hardware and software if possible
  • Supports use of PC clusters, metacomputing
Data locality and bandwidth to memory -- caches, vector registers -- sequential and parallel
Programming model -- shared address or data parallel -- explicit or implicit
Forms of parallelism -- data, control, functional; macroscopic, microscopic, instruction level

HTML version of Scripted Foils prepared 17 November 1998

Foil 20 Architecture Classes of High Performance Computers

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Sequential or von Neumann Architecture
Vector (Super)computers
Parallel Computers
  • with various architectures classified by Flynn's methodology (this is incomplete as it only discusses the control or synchronization structure)
  • SISD
  • MISD
  • MIMD
  • SIMD
  • Metacomputers

HTML version of Scripted Foils prepared 17 November 1998

Foil 21 Flynn's Classification of HPC Systems

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Very High Speed Computing Systems, Proc. IEEE 54(12), 1901-1909 (1966), and
Some Computer Organizations and Their Effectiveness, IEEE Trans. on Computers C-21, 948-960 (1972) -- both papers by M. J. Flynn
SISD -- Single Instruction stream, Single Data Stream -- i.e. von Neumann Architecture
MISD -- Multiple Instruction stream, Single Data Stream -- Not interesting
SIMD -- Single Instruction stream, Multiple Data Stream
MIMD -- Multiple Instruction stream, Multiple Data Stream -- the dominant parallel system, with a roughly one-to-one match of instruction and data streams.

HTML version of Scripted Foils prepared 17 November 1998

Foil 22 Raw Uniprocessor Performance: Cray v. Microprocessor LINPACK n by n Matrix Solves

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Description of Linpack as used by Top500 List

HTML version of Scripted Foils prepared 17 November 1998

Foil 23 Raw Parallel Performance: LINPACK

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Description of Linpack as used by Top500 List
(Chart: Linpack Gflops versus time for Cray vector machines and microprocessor MPPs)
Even vector Crays became parallel: X-MP (2-4 processors), Y-MP (8), C-90 (16), T932 (32)
Since 1993, Cray has also produced MPPs: the T3D and T3E

HTML version of Scripted Foils prepared 17 November 1998

Foil 24 Linear Linpack HPC Performance versus Time

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Description of Linpack as used by Top500 List

HTML version of Scripted Foils prepared 17 November 1998

Foil 25 Top 10 Supercomputers November 1998

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Top 500 Supercomputer List from which you can get customized sublists
Rank  Machine                 Mflops    Place                    Country  Year  # Procs
  1   Intel ASCI Red          1338000   Sandia National Lab      USA      1997   9152
  2   SGI T3E1200              891500   Classified               USA      1998   1084
  3   SGI T3E900               815100   Classified               USA      1997   1324
  4   SGI ASCI Blue Mountain   690900   LANL                     USA      1998   6144
  5   SGI T3E900               552920   UK Met Office            UK       1997    876
  6   IBM ASCI Blue Pacific    547000   LLNL                     USA      1998   3904
  7   IBM ASCI Blue Pacific    547000   LLNL                     USA      1998   1952
  8   SGI T3E1200              509900   UK Centre for Science    UK       1998    612
  9   SGI T3E900               449000   NAVOCEANO                USA      1997    700
 10   SGI T3E                  448600   NASA/GSFC                USA      1998   1084

HTML version of Scripted Foils prepared 17 November 1998

Foil 26 Distribution of 500 Fastest Computers

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Top 500 and Jack Dongarra Links for Foil 26
(Parallel Vector)

HTML version of Scripted Foils prepared 17 November 1998

Foil 27 CPU Technology used in Top 500 versus Time

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Top 500 and Jack Dongarra Links for Foil 27

HTML version of Scripted Foils prepared 17 November 1998

Foil 28 Geographical Distribution of Top 500 Supercomputers versus time

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Top 500 and Jack Dongarra Links for Foil 28

HTML version of Scripted Foils prepared 17 November 1998

Foil 29 Node Technology used in Top 500 Supercomputers versus Time

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Top 500 and Jack Dongarra Links for Foil 29

HTML version of Scripted Foils prepared 17 November 1998

Foil 30 Total Performance in Top 500 Supercomputers versus Time and Manufacturer

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Top 500 and Jack Dongarra Links for Foil 30

HTML version of Scripted Foils prepared 17 November 1998

Foil 31 Number of Top 500 Systems as a function of time and Manufacturer

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Top 500 and Jack Dongarra Links for Foil 31

HTML version of Scripted Foils prepared 17 November 1998

Foil 32 Total Number of Top 500 Systems Installed June 98 versus Manufacturer

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Top 500 and Jack Dongarra Links for Foil 32

HTML version of Scripted Foils prepared 17 November 1998

Foil 33 Netlib Benchweb Benchmarks

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
See Original Foil

HTML version of Scripted Foils prepared 17 November 1998

Foil 34 Linpack Benchmarks

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
See Original Foil

HTML version of Scripted Foils prepared 17 November 1998

Foil 35 Java Linpack Benchmarks

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
See Original Foil

HTML version of Scripted Foils prepared 17 November 1998

Foil 36 Java Numerics

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
See Original Foil

HTML version of Scripted Foils prepared 17 November 1998

Foil 37 von Neumann Architecture in a Nutshell

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Instructions and data are stored in the same memory, for which there is a single link (the von Neumann Bottleneck) to the CPU, which decodes and executes instructions
The CPU can have multiple functional units
The memory access can be enhanced by use of caches made from faster memory to allow greater bandwidth and lower latency

HTML version of Scripted Foils prepared 17 November 1998

Foil 38 What is a Pipeline -- Cafeteria Analogy?

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Pipelines are familiar from such everyday activities as getting food in a cafeteria, where one person is processed per "clock cycle";
the clock cycle here is the maximum time anybody takes at a single "stage", where a stage is one component of the meal (salad, entrée etc.)
Note any one person takes about 5 clock cycles in this pipeline but the pipeline processes one person per clock cycle
Pipeline has problem if there is a "stall" -- here that one of the people wants an entrée which needs to be fetched from kitchen. This delays everybody!
In computer case, stall is caused typically by data not being ready for a particular instruction
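The cafeteria picture translates into a simple count (assuming each of s stages takes one clock cycle; these figures are illustrative, not from the foil):
  time for N items through an s-stage pipeline = (s + N - 1) clock cycles
  time without pipelining                      = s * N clock cycles
  e.g. s = 5, N = 100: 104 cycles instead of 500, a speedup approaching s for large N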

HTML version of Scripted Foils prepared 17 November 1998

Foil 39 Instruction Flow in A Simple Machine Pipeline

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Three Instructions are shown overlapped -- each starting one clock cycle after last

HTML version of Scripted Foils prepared 17 November 1998

Foil 40 Example of MIPS R4000 Floating Point

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Taken from David Patterson CS252 Berkeley Fall 1996
3 Functional Units FP Adder, FP Multiplier, FP Divider
8 Kinds of Stages in FP Units
Stage Functional Unit Description
A FP Adder Mantissa ADD stage
D FP Divider Divide Pipeline stage
E FP Multiplier Exception Test stage
M FP Multiplier First stage of multiplier
N FP Multiplier Second stage of multiplier
R FP Adder Rounding stage
S FP Adder Operand Shift stage
U Unpack Floating point numbers

HTML version of Scripted Foils prepared 17 November 1998

Foil 41 MIPS R4000 Floating Point Stages

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Several different pipelines with different lengths!
Add,Subtract- 4 clocks:U S+A A+R R+S
Multiply - 8 clocks: U E+M M M M N+A R
Divide- 38 clocks: U A R D(28) D+A D+R D+R D+A D+R A R
Square Root- 110 clocks: U E (A+R)(108) A R
Negate- 2 clocks: U S
Absolute Value- 2 clocks: U S
Floating Point Compare- 3 clocks:U A R
SGI Workstations at NPAC 1995

HTML version of Scripted Foils prepared 17 November 1998

Foil 42 Illustration of Importance of Cache

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Fig 1.14 of Aspects of Computational Science
Editor Aad van der Steen
published by NCF

HTML version of Scripted Foils prepared 17 November 1998

Foil 43 Sequential Memory Structure

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Data locality implies CPU finds information it needs in cache which stores most recently accessed information
This means one reuses a given memory reference in many nearby computations e.g.
A1 = B*C
A2 = B*D + B*B
.... Reuses B
(Figure: memory hierarchy -- registers and caches, L3 cache, main memory, disk -- with memory capacity increasing and memory speed decreasing down the hierarchy; there is roughly a factor of 100 difference between processor and main memory speed)

HTML version of Scripted Foils prepared 17 November 1998

Foil 44 Cache Issues I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
As shown above, caches are familiar in the real world -- here to support movement of food from manufacturer to your larder. It would be inconvenient to drive to the store for every item needed -- it is more convenient to cache items in your larder
Caches store instructions and data -- often in separate caches
Caches have a total size but also a cache line size, which is the minimum unit transferred into the cache -- due to spatial locality this line is often quite big.
They also have a "write-back" strategy to define when information is written back from cache into primary memory
(Figure: food-supply analogy for the memory hierarchy -- Factory, Level 3 Middleman Warehouse, Level 2 Middleman Warehouse, Local Supermarket, Your Larder, and the CPU as the frying pan)

HTML version of Scripted Foils prepared 17 November 1998

Foil 45 Cache Issues II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Finally caches have a mapping strategy which tells you where to write a given word into the cache and when to overwrite it with another data value fetched from main memory
Direct mapped caches hash each word of main memory into a unique location in cache (see the sketch after this list)
  • e.g. If the cache has size N bytes, then one could hash memory location m to m mod N
Fully associative caches remove the word in cache which has been unreferenced for the longest time and store the new value in its spot
Set associative caches combine these ideas. They have 2 to 4 (or more) locations for each hash value and replace the oldest reference in that group.
  • This avoids problems when two data values must both be in cache but hash to the same value
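The arithmetic behind direct mapping can be made concrete with a small sketch. The cache and line sizes below are illustrative assumptions, not parameters from the foils; the point is simply that the line index is the byte address divided by the line size, taken modulo the number of lines, so two addresses that differ by exactly the cache size collide.
! Illustrative sketch only: where a byte address lands in a direct-mapped cache
PROGRAM CACHEMAP
  IMPLICIT NONE
  INTEGER, PARAMETER :: LINESIZE = 64, NLINES = 128   ! 8 KB cache with 64-byte lines (assumed sizes)
  INTEGER :: ADDR, LINE
  ADDR = 123456                                       ! an arbitrary byte address
  LINE = MOD(ADDR / LINESIZE, NLINES)
  PRINT *, 'address', ADDR, 'maps to cache line', LINE
  PRINT *, 'address', ADDR + LINESIZE*NLINES, 'maps to the same line and would evict it'
END PROGRAM CACHEMAP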

HTML version of Scripted Foils prepared 17 November 1998

Foil 46 Spatial versus Temporal Locality I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
In the classic loop
  DO I = 2,N
    DO J = 2,N
      FI(I,J) = 0.25*(FI(I+1,J)+FI(I-1,J)+FI(I,J+1)+FI(I,J-1))
    END DO
  END DO
We see spatial locality -- if (I,J) accessed so are neighboring points stored near (I,J) in memory
Spatial locality is essential for distributed memory as ensures that after data decomposition, most data you need is stored in same processor and so communication is modest (surface over volume)
  • Spatial locality also exploited by vector machines
Temporal locality says that if you use FI(I,J) in one iteration of the J loop, it is used again at the next J index value as FI(I,(J+1)-1)
Temporal locality makes cache machines work well as it ensures that after a data value is stored into cache, it is used multiple times
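A small illustration of the spatial locality point (this code is illustrative and not from the foils): Fortran stores arrays column by column, so making I the innermost loop walks memory with stride one and uses every word of each cache line fetched, while reversing the loop order touches memory with stride N.
! Illustrative sketch only: loop order and spatial locality in Fortran
PROGRAM STRIDE
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 512
  REAL*8 :: FI(N,N), S
  INTEGER :: I, J
  FI = 1.0D0
  S = 0.0D0
  DO J = 1, N              ! stride-1 access: I runs fastest down each column
     DO I = 1, N
        S = S + FI(I,J)
     END DO
  END DO
  ! swapping the two loops gives stride-N access and wastes most of each cache line
  PRINT *, S
END PROGRAM STRIDE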

HTML version of Scripted Foils prepared 17 November 1998

Foil 47 Spatial versus Temporal Locality II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
If first (main memory access) takes time T1 and each subsequent (i.e. cache ) access takes time T2 with T2 << T1, and a data value is accessed l times while in cache, then average access time is: T2 + T1/l
Temporal locality ensures l big
Spatial locality helps here as fetch all data in a cache line (say 128 bytes) in the time T1. Thus one can effectively reduce T1 by a further factor equal to number of words in a cache line (perhaps 16)
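Plugging in some purely illustrative numbers makes this concrete (the values are assumptions, not measurements from the foils):
  T1 = 100 cycles (main memory), T2 = 2 cycles (cache), l = 10 reuses:
      average access time = T2 + T1/l = 2 + 100/10 = 12 cycles
  with a 16-word cache line fetched in the one T1 access, the effective T1 per word is about 100/16,
      so the average drops to roughly 2 + 100/(16*10), i.e. about 2.6 cycles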

HTML version of Scripted Foils prepared 17 November 1998

Foil 48 Cray/SGI memory latencies

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Interesting Article from SC97 proceedings on T3E performance
System               Memory latency  Clock speed  Ratio   FP ops per    FP ops to cover
                     [ns]            [ns]                 clock period  memory latency
CDC 7600             275             27.5          10     1              10
CRAY 1               150             12.5          12     2              24
CRAY X-MP            120              8.5          14     2              28
SGI Power Challenge  ~760            13.3          57     4             228
CRAY T3E-900         ~280             2.2         126     2             252
This and following foils from Performance of the CRAY T3E Multiprocessor by Anderson, Brooks, Grassi and Scott at http://www.cray.com/products/systems/crayt3e/1200/performance.html

HTML version of Scripted Foils prepared 17 November 1998

Foil 49 Architecture of Cray T3E

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NPAC Summary of T3E Architecture
Air cooled T3E
T3E Torus Communication Network
T3E Node with Digital Alpha Chip

HTML version of Scripted Foils prepared 17 November 1998

Foil 50 T3E Messaging System

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NPAC Summary of T3E Architecture
This is innovative as it supports "get" and "put", where the memory controller converts an off-processor memory reference into a message, with greater convenience of use than explicit message passing
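A rough sketch of what this one-sided style looks like to the programmer, using Cray's SHMEM library: the routine names and arguments here (START_PES, MY_PE, SHMEM_GET64, SHMEM_BARRIER_ALL) are quoted from memory, and the symmetric-array convention is an assumption of this sketch rather than something stated in the foils.
! Rough sketch only: one-sided get on a T3E-style system (assumes at least 2 PEs)
PROGRAM GETDEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  REAL*8 :: LOCAL(N), REMOTE(N)
  COMMON /SYMM/ REMOTE            ! symmetric data: same address on every PE
  INTEGER :: ME, MY_PE
  CALL START_PES(0)
  ME = MY_PE()
  REMOTE = DBLE(ME)               ! each PE fills its own copy of REMOTE
  CALL SHMEM_BARRIER_ALL()
  IF (ME == 0) THEN
     ! PE 0 reads PE 1's REMOTE array directly; PE 1 executes no matching receive
     CALL SHMEM_GET64(LOCAL, REMOTE, N, 1)
     PRINT *, 'PE 0 fetched', LOCAL(1), 'from PE 1'
  END IF
  CALL SHMEM_BARRIER_ALL()
END PROGRAM GETDEMO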

HTML version of Scripted Foils prepared 17 November 1998

Foil 51 Cray T3E Cache Structure

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Interesting Article from SC97 proceedings on T3E performance
Each CRAY T3E processor contains an 8 KB direct-mapped primary data cache (Dcache), an 8 KB instruction cache, and a 96 KB 3-way associative secondary cache (Scache) which is used for both data and instructions.
The Scache has a random replacement policy and is write-allocate and write-back, meaning that a cacheable store request to an address that is not in the cache causes that address to be loaded into the Scache, then modified and tagged as dirty for write-back later.
Write-back of dirty cache lines occurs only when the line is removed from the Scache, either by the Scache controller to make room for a new cache line, or by the back-map to maintain coherence of the caches with the local memory and/or registers.

HTML version of Scripted Foils prepared 17 November 1998

Foil 52 Cray T3E Cache Performance

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Interesting Article from SC97 proceedings on T3E performance
Peak data transfer rates on the CRAY T3E-900
Type of access Latency Bandwidth
in CPU cycles [MB/s]
Dcache load 2 7200
Scache load 8-10 7200
Dcache or
Scache store -- 3600
These rates correspond to the maximum instruction issue rate of two loads per CPU cycle or one store per CPU cycle.
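These bandwidths follow directly from the roughly 2.2 ns clock period quoted in the latency table two foils back (a quick consistency check, not an additional measurement):
  2 loads per cycle x 8 bytes / 2.2 ns  ~ 7200 MB/s
  1 store per cycle x 8 bytes / 2.2 ns  ~ 3600 MB/s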

HTML version of Scripted Foils prepared 17 November 1998

Foil 53 Finite Difference Example for T3E Cache Use I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Interesting Article from SC97 proceedings on T3E performance
From http://www.cray.com/products/systems/crayt3e/1200/performance.html
REAL*8 AA(513,513), DD(513,513)
REAL*8 X (513,513), Y (513,513)
REAL*8 RX(513,513), RY(513,513)
DO J = 2,N-1
  DO I = 2,N-1
    XX = X(I+1,J)-X(I-1,J)
    YX = Y(I+1,J)-Y(I-1,J)
    XY = X(I,J+1)-X(I,J-1)
    YY = Y(I,J+1)-Y(I,J-1)
    A = 0.25 * (XY*XY+YY*YY)
    B = 0.25 * (XX*XX+YX*YX)
    C = 0.125 * (XX*XY+YX*YY)
Continued on Next Page

HTML version of Scripted Foils prepared 17 November 1998

Foil 54 Finite Difference Example for T3E Cache Use II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Interesting Article from SC97 proceedings on T3E performance
    AA(I,J) = -B
    DD(I,J) = B+B+A*REL
    PXX = X(I+1,J)-2.*X(I,J)+X(I-1,J)
    QXX = Y(I+1,J)-2.*Y(I,J)+Y(I-1,J)
    PYY = X(I,J+1)-2.*X(I,J)+X(I,J-1)
    QYY = Y(I,J+1)-2.*Y(I,J)+Y(I,J-1)
    PXY = X(I+1,J+1)-X(I+1,J-1)-X(I-1,J+1)+X(I-1,J-1)
    QXY = Y(I+1,J+1)-Y(I+1,J-1)-Y(I-1,J+1)+Y(I-1,J-1)
    RX(I,J) = A*PXX+B*PYY-C*PXY
    RY(I,J) = A*QXX+B*QYY-C*QXY
  END DO
END DO
Continued from Previous Page

HTML version of Scripted Foils prepared 17 November 1998

Foil 55 How to use Cache in Example I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Interesting Article from SC97 proceedings on T3E performance
The inner loop of this kernel has 47 floating point operations, 18 array reads and 4 array writes.
  • The reads are of two 9-point stencils (2d neighbors including diagonals) centered at X(I,J) and Y(I,J), and the writes consist of unit-stride stores to the independent arrays AA, DD, RX and RY.
  • The two 9-point stencil array references should exhibit good temporal locality provided we can hold three contiguous columns of X and Y simultaneously in the Scache.
  • In addition, we need to make sure that the writes to AA, DD, RX, and RY do not interfere with X and Y in the Scache.
9 point stencil

HTML version of Scripted Foils prepared 17 November 1998

Foil 56 How to use Cache in Example II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Interesting Article from SC97 proceedings on T3E performance
Since all six arrays are the same size and are accessed at the same rate, we can ensure that they do not interfere with each other if they do not conflict initially. For this example, it is convenient to optimize for only one 4096-word set of the 3-way set associative Scache.
  • One possible alignment of the arrays is shown in figure.
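One way to picture the alignment idea is sketched below. This is not the layout from the Cray report (whose figure is not reproduced here) but an illustration under stated assumptions: each 513x513 REAL*8 array occupies 513*513 = 263169 words, which is 1025 words modulo the 4096-word set, so padding each array by 3754 words makes successive base addresses fall about 4096/6 ~ 683 words apart within the set, keeping the six arrays from conflicting.
! Illustrative padding only -- not the alignment actually used in the report
REAL*8 AA(513,513), PAD1(3754)
REAL*8 DD(513,513), PAD2(3754)
REAL*8 X (513,513), PAD3(3754)
REAL*8 Y (513,513), PAD4(3754)
REAL*8 RX(513,513), PAD5(3754)
REAL*8 RY(513,513)
COMMON /WORK/ AA, PAD1, DD, PAD2, X, PAD3, Y, PAD4, RX, PAD5, RY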

HTML version of Scripted Foils prepared 17 November 1998

Foil 57 Cray Vector Supercomputers

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NPAC Survey of Cray Systems
CRAY-1 supercomputer
Cray's first supercomputer. Introduced in 1976, this system had a peak performance of 133 megaflops. The first system was installed at Los Alamos National Laboratory.
CRAY T90 systems are available in three models: the CRAY T94 system, offered in air- or liquid-cooled versions, which scales up to four processors;
the CRAY T916 system, a liquid-cooled system that scales up to 16 processors; and the top-of-the-line CRAY T932 system, also liquid-cooled,
with up to 32 processors and a peak performance of over 60 gigaflops

HTML version of Scripted Foils prepared 17 November 1998

Foil 58 Vector Supercomputers in a Nutshell - I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
This design enhances performance by noting that many applications calculate "vector-like" operations
  • Such as c(i)=a(i)+b(i) for i=1...N and N quite large
This allows one to address two performance problems
  • Latency in accessing memory (e.g. could take 10-20 clock cycles between requesting a particular memory location and delivery of result to CPU)
  • A complex operation , e.g. a floating point operation, can take a few machine cycles to complete
They are typified by the Cray 1, X-MP, Y-MP, C-90, the CDC-205 and ETA-10, and Japanese supercomputers from NEC, Fujitsu and Hitachi
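A minimal example of the kind of loop these machines were built for (illustrative code, not from the foils): every iteration is independent, so once the floating point pipeline is full one result can emerge per clock, and vector loads hide memory latency by fetching operands in blocks.
! Illustrative vector-style loop (a DAXPY-like operation)
PROGRAM VECOP
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000000
  REAL*8 :: A(N), B(N), C(N), ALPHA
  INTEGER :: I
  B = 1.0D0
  C = 2.0D0
  ALPHA = 3.0D0
  DO I = 1, N
     A(I) = B(I) + ALPHA*C(I)     ! no dependence between iterations -- fully vectorizable
  END DO
  PRINT *, A(1), A(N)
END PROGRAM VECOP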

HTML version of Scripted Foils prepared 17 November 1998

Foil 59 Vector Supercomputing in a picture

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
A pipeline for vector addition looks like:
  • From Aspects of Computational Science -- Editor Aad van der Steen published by NCF

HTML version of Scripted Foils prepared 17 November 1998

Foil 60 Vector Supercomputers in a Nutshell - II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Vector machines pipeline data through the CPU
They are not as popular/relevant as in the past because
  • Improved CPU architecture needs fewer cycles than before for each (complex) operation (e.g. 4 now, not ~100 as in the past)
  • The 8 MHz 8087 of the Cosmic Cube took 160 to 400 clock cycles to do a full floating point operation in 1983
  • Applications need more flexible pipelines which allow different operations to be executed on consecutive operands as they stream through the CPU
  • Modern RISC processors (superscalar) can support such complex pipelines as they have far more logic than CPUs of the past
In fact the excellence of, say, the Cray C-90 is due to its very good memory architecture, which allows one to fetch enough operands to sustain the pipeline.
Most workstation class machines have "good" CPUs but can never get enough data from memory to sustain good performance, except for a few cache intensive applications

HTML version of Scripted Foils prepared 17 November 1998

Foil 61 Parallel Computer Architecture Memory Structure

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Memory Structure of Parallel Machines
  • Distributed
  • Shared
  • Cached
and Heterogeneous mixtures
Shared (Global): There is a global memory space, accessible by all processors.
  • Processors may also have some local memory.
  • Algorithms may use global data structures efficiently.
  • However "distributed memory" algorithms may still be important as memory is NUMA (Nonuniform access times)
Distributed (Local, Message-Passing): All memory is associated with processors.
  • To retrieve information from another processor's memory, a message must be sent there.
  • Algorithms should use distributed data structures.
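A minimal sketch of what the distributed-memory (message-passing) model means in practice; this MPI example is illustrative and is not taken from the foils. Reading a value that lives in another processor's memory requires a matching send and receive (the sketch assumes it runs on at least two processes).
! Illustrative message-passing sketch using MPI (assumes >= 2 processes)
PROGRAM MSGDEMO
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER :: IERR, RANK, STATUS(MPI_STATUS_SIZE)
  REAL*8 :: X
  CALL MPI_INIT(IERR)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
  IF (RANK == 1) THEN
     X = 3.14D0
     CALL MPI_SEND(X, 1, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, IERR)         ! owner sends
  ELSE IF (RANK == 0) THEN
     CALL MPI_RECV(X, 1, MPI_DOUBLE_PRECISION, 1, 99, MPI_COMM_WORLD, STATUS, IERR) ! requester receives
     PRINT *, 'process 0 received', X, 'from process 1'
  END IF
  CALL MPI_FINALIZE(IERR)
END PROGRAM MSGDEMO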

HTML version of Scripted Foils prepared 17 November 1998

Foil 62 Comparison of Memory Access Strategies

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Memory can be accessed directly (analogous to a phone call) as in red lines below or indirectly by message passing (green line below)
We show two processors in a MIMD machine for distributed (left) or shared(right) memory architectures

HTML version of Scripted Foils prepared 17 November 1998

Foil 63 Types of Parallel Memory Architectures -- Physical Characteristics

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Uniform: All processors take the same time to reach all memory locations.
Nonuniform (NUMA): Memory access is not uniform so that it takes a different time to get data by a given processor from each memory bank. This is natural for distributed memory machines but also true in most modern shared memory machines
  • DASH (Hennessy at Stanford) is the best known example of such a virtual shared memory machine, which is logically shared but physically distributed.
  • ALEWIFE from MIT is a similar project
  • TERA (from Burton Smith) is a uniform memory access, logically shared memory machine
Most NUMA machines these days have two memory access times
  • Local memory (divided in registers caches etc) and
  • Nonlocal memory with little or no difference in access time for different nonlocal memories
This simple two level memory access model gets more complicated in proposed 10 year out "petaflop" designs
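The two-level model can be summarized by a simple average (the numbers below are purely illustrative assumptions, not measurements from any machine in these foils):
  t_average = f_local * t_local + (1 - f_local) * t_remote
  e.g. t_local = 100 ns, t_remote = 1000 ns, f_local = 0.9:
       t_average = 0.9*100 + 0.1*1000 = 190 ns
so keeping the fraction of local references high matters far more than small changes in the remote access time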

HTML version of Scripted Foils prepared 17 November 1998

Foil 64 Diagrams of Shared and Distributed Memories

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index

HTML version of Scripted Foils prepared 17 November 1998

Foil 65 Parallel Computer Architecture Control Structure

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
SIMD -lockstep synchronization
  • Each processor executes same instruction stream
MIMD - Each Processor executes independent instruction streams
MIMD Synchronization can take several forms
  • Simplest: program controlled message passing
  • "Flags" (barriers,semaphores) in memory - typical shared memory construct as in locks seen in Java Threads
  • Special hardware - as in cache and its coherency (coordination between nodes)
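A small sketch of the shared-memory flavour of MIMD synchronization, written with OpenMP directives (OpenMP is an assumption of this sketch; the foil mentions only flags, locks and barriers in general): each thread executes its own instruction stream, a critical section acts as a lock around the shared update, and a barrier makes all threads wait for each other.
! Illustrative shared-memory synchronization sketch (OpenMP)
PROGRAM SYNCDEMO
  IMPLICIT NONE
  INTEGER :: NDONE
  NDONE = 0
!$OMP PARALLEL SHARED(NDONE)
!$OMP CRITICAL
  NDONE = NDONE + 1            ! lock-protected update of shared data
!$OMP END CRITICAL
!$OMP BARRIER
  ! past the barrier every thread sees the final value of NDONE
!$OMP END PARALLEL
  PRINT *, 'threads that passed through the critical section:', NDONE
END PROGRAM SYNCDEMO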

HTML version of Scripted Foils prepared 17 November 1998

Foil 66 Mark2 Hypercube built by JPL(1985) Cosmic Cube (1983) built by Caltech (Chuck Seitz)

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Hypercube Topology for 8 machines

HTML version of Scripted Foils prepared 17 November 1998

Foil 67 64 Ncube Processors (each with 6 memory chips) on a large board

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
See next foil for basic chip -- these were connected in a hypercube

HTML version of Scripted Foils prepared 17 November 1998

Foil 68 ncube1 Chip -- integrated CPU and communication channels

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
This and related transputer designs were very innovative but failed as they could not exploit commodity microprocessor design economies

HTML version of Scripted Foils prepared 17 November 1998

Foil 69 Example of Message Passing System: IBM SP-2

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index More Details on IBM SP2
Made out of essentially complete RS6000 workstations
Network interface integrated in I/O bus
Most successful MIMD Distributed Memory Machine
(Diagram: an SP-2 node -- Power 2 CPU with L2 cache, memory controller and 4-way interleaved DRAM, plus a NIC on the I/O bus; nodes are joined by a general interconnection network formed from 8-port switches)

HTML version of Scripted Foils prepared 17 November 1998

Foil 70 Example of Message Passing System: Intel Paragon

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
(Diagram: an Intel Paragon node -- two i860 processors with L1 caches on a 64-bit, 50 MHz memory bus, with DMA, network interface, memory controller and 4-way interleaved DRAM; nodes are attached to every switch of a 2D grid network with 8-bit, 175 MHz, bidirectional links. Also pictured: Sandia's Intel Paragon XP/S-based supercomputer.)
The powerful i860 node made this the first "serious" MIMD distributed memory machine

HTML version of Scripted Foils prepared 17 November 1998

Foil 71 ASCI Red -- Intel Supercomputer at Sandia

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index ASCI Red Intel Supercomputer at Sandia
Full System by Artist and camera
ASCI Red Interconnect

HTML version of Scripted Foils prepared 17 November 1998

Foil 72 Parallel Computer Memory Structure

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
For both parallel and sequential computers, the cost is in accessing remote memories, which requires some form of "communication"
Data locality addresses this in both cases
The differences are the quantitative size of the effect and how much is done by the user versus automatically
(Figure: several processors, each with its own main memory, joined by an interconnection network -- access to local memory is slow, access across the network can be very slow)

HTML version of Scripted Foils prepared 17 November 1998

Foil 73 Cache Coherent or Not?

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Suppose two processors cache the same variable stored in memory of one of the processors
One must ensure cache coherence so that when one cache value changes, all do!
(Figure: two SMP boards, each with processors, L3 cache and main memory on a board-level interconnection network, joined by a system interconnection network -- both boards hold a cached value of the same shared variable)

HTML version of Scripted Foils prepared 17 November 1998

Foil 74 Cache Coherence

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
There are 4 approaches to Cache Coherence -- the difficulty of maintaining correct caches in a parallel environment
1) Ignore problem in hardware -- let user and/or software cope with this chore -- this is approach followed in machines like T3E,SP2 and all explicit parallel programming models
2)Snoopy Buses. This is the approach used in most SMP's where caches (at a given level) share a special bus also connected to memory. When a request is made in a given cache, it is broadcast on the bus, so that caches with a more recent value can respond
3)Scalable Coherent Interface (SCI). This differs from a snoopy bus by using a fast serial connection which pipes requests through all processors. This is a standard developed by the high energy physics community.
4)Directory Schemes. These have a directory on each processor which keeps track of which cache line is where and which is up to date. The directories on the nodes are connected and communicate with each other when a memory location is accessed
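A small illustration of why coherence matters for performance (this example and the 128-byte spacing are assumptions for illustration, not taken from the foils): when two threads repeatedly update values that happen to share one cache line, the coherence protocol moves that line back and forth between their caches ("false sharing"); spacing the data so each thread's value sits on its own line removes the traffic.
! Illustrative false-sharing sketch (OpenMP); PAD spaces the counters a full line apart
PROGRAM FALSESHARE
  IMPLICIT NONE
  INTEGER, PARAMETER :: NITER = 1000000, PAD = 16     ! 16 REAL*8 = 128 bytes (assumed line size)
  REAL*8 :: NEAR(2), SPACED(PAD,2)
  INTEGER :: I, T
  NEAR = 0.0D0
  SPACED = 0.0D0
!$OMP PARALLEL DO PRIVATE(I)
  DO T = 1, 2
     DO I = 1, NITER
        NEAR(T)     = NEAR(T)     + 1.0D0    ! NEAR(1) and NEAR(2) share one cache line
        SPACED(1,T) = SPACED(1,T) + 1.0D0    ! SPACED(1,1) and SPACED(1,2) are 128 bytes apart
     END DO
  END DO
!$OMP END PARALLEL DO
  PRINT *, NEAR(1), SPACED(1,2)
END PROGRAM FALSESHARE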

HTML version of Scripted Foils prepared 17 November 1998

Foil 75 SGI Origin 2000 I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NPAC Summary of Origin 2000 Architecture
The system comes in two versions, deskside or rack.
The deskside version has 1 to 4 node cards (up to 8 CPUs).
The rack system has 1 to 64 node cards, for a total of 2 to 128 CPUs.
Each node is based on the 64-bit MIPS RISC R10000 architecture.

HTML version of Scripted Foils prepared 17 November 1998

Foil 76 SGI Origin II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NPAC Summary of Origin 2000 Architecture
Also, each CPU has two primary caches (each 32 KB, two-way set-associative) and one secondary L2 cache (1 or 4 MB, two-way set-associative).
Each node has hardware cache coherency using a directory system and a maximum bandwidth of 780MB/sec.
The entire system (Cray Origin2000) has up to 512 such nodes, that is, up to 1024 processors.
  • For a 195-MHz R10000 processor, peak performance per processor is 390 MFLOPS or 780 MIPS (4 instructions per cycle), giving an aggregate peak of about 400 GFLOPS for a maximal 1024-processor machine.

HTML version of Scripted Foils prepared 17 November 1998

Foil 77 SGI Origin Block Diagram

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NPAC Summary of Origin 2000 Architecture

HTML version of Scripted Foils prepared 17 November 1998

Foil 78 SGI Origin III

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NPAC Summary of Origin 2000 Architecture
The SysAD (system address and data) bus of the previous figure, connecting the two processors, has a
  • peak bandwidth of 780 MB/s.
  • The same holds for the Hub's connection to memory.
  • Memory bandwidth for data is about 670 MB/s.
The Hub's connections to the off-board net router chip and Xbow I/O interface are 1.56 GB/s each.

HTML version of Scripted Foils prepared 17 November 1998

Foil 79 SGI Origin 2 Processor Node Board

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NPAC Summary of Origin 2000 Architecture

HTML version of Scripted Foils prepared 17 November 1998

Foil 80 Performance of NCSA 128 node SGI Origin 2000

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NCSA Performance Measurements
RIEMANN is a general-purpose, higher-order accurate, Eulerian gas dynamics code based on Godunov schemes. Dinshaw Balsara
Laplace : the solution of sparse linear systems resulting from Navier-Stokes - Laplace equations. Danesh Tafti
QMC (Quantum Monte Carlo) Lubos Mitas
PPM (Piece-wise Parabolic Method)
MATVEC (Matrix-Vector Multiply) Dave McWilliams
Cactus Numerical Relativity Ed Seidel

HTML version of Scripted Foils prepared 17 November 1998

Foil 81 Summary of Cache Coherence Approaches

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Machines like the SGI Origin 2000 have a distributed shared memory with a so-called directory implementation of cache coherence (pioneered in the DASH project at Stanford)
Machines like the SGI Cray T3E are distributed memory but do have fast get and put, so single variables stored on remote memories can be accessed directly
  • This gives the performance advantage of shared memory without its programming advantage, but also without the complex hardware of the Origin 2000
The Origin 2000 approach does not scale as well as the Cray T3E, and large Origin 2000 systems must use message passing to link 128-node coherent memory subsystems
The Cray T3E offers a uniform interface (though not as good for small numbers of nodes)
Pure message passing / distributed memory is the natural Web model

HTML version of Scripted Foils prepared 17 November 1998

Foil 82 Some Major Hardware Architectures - SIMD

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
SIMD -- Single Instruction Multiple Data -- can have logically distributed or shared memory
  • Examples are the CM-1,2 from Thinking Machines
  • and the AMT DAP and Maspar, which are currently focussed entirely on accelerating parts of database indexing
  • This architecture is of decreasing interest, as it has reduced functionality without a significant cost advantage compared to MIMD machines
  • The cost of synchronization in MIMD machines is not high!
  • The main interest of SIMD is flexible bit arithmetic, since the processors are "small"; but as transistor densities get higher this also becomes less interesting, as a full-function 64-bit CPU uses only a small fraction of the silicon of a modern computer

HTML version of Scripted Foils prepared 17 November 1998

Foil 83 SIMD (Single Instruction Multiple Data) Architecture

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
CM2 - 64K processors with 1-bit arithmetic - hypercube network, a broadcast network that can also combine, and a "global or" network
Maspar, DECmpp - 16K processors with 4-bit (MP-1) or 32-bit (MP-2) arithmetic, a fast two-dimensional mesh and a slower general switch for communication

HTML version of Scripted Foils prepared 17 November 1998

Foil 84 Examples of Some SIMD machines

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Maspar (DECmpp) SIMD machine at NPAC, 1995. We had two such machines, with 8K and 16K nodes respectively
For a short time Digital resold the Maspar as the DECmpp
ICL DAP 4096 Processors circa 1978

HTML version of Scripted Foils prepared 17 November 1998

Foil 85 SIMD CM 2 from Thinking Machines

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Computer Museum Entry for Connection Machine
Disk Vault

HTML version of Scripted Foils prepared 17 November 1998

Foil 86 Official Thinking Machines Specification of CM2

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Computer Museum Entry for Connection Machine

HTML version of Scripted Foils prepared 17 November 1998

Foil 87 Some Major Hardware Architectures - Mixed

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
One also has the heterogeneous compound architecture (metacomputer) obtained by an arbitrary combination of MIMD or SIMD, sequential or parallel machines.
Metacomputers can vary from full collections of several hundred PC's/settop boxes on the (future) World Wide Web to a CRAY C-90 connected to a CRAY T3D
This is a critical future architecture which is intrinsically distributed memory, as multi-vendor heterogeneity implies that one cannot have special hardware-enhanced shared memory
  • note that this can be a MIMD collection of SIMD machines if one has a set of Maspars on a network
  • One can think of the human brain as a SIMD machine, and then a group of people is such a MIMD collection of SIMD processors

HTML version of Scripted Foils prepared 17 November 1998

Foil 88 Some MetaComputer Systems

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Cluster of workstations or PC's
Heterogeneous MetaComputer System

HTML version of Scripted Foils prepared 17 November 1998

Foil 89 Clusters of PC's 1986-1998

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
PCcube using serial ports on 80286 machines, an REU undergraduate project, 1986
Naegling at Caltech (Tom Sterling and John Salmon), 1998: 120 Pentium Pro processors
Beowulf at Goddard Space Flight Center

HTML version of Scripted Foils prepared 17 November 1998

Foil 90 HP Kayak PC (300 MHz Intel Pentium II) vs Origin 2000

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NCSA Performance Measurements
NCSA measured some single processor results on HP Kayak PC with 300 MHz Intel Pentium II. The compiler is a Digital compiler with optimization level of 4. For comparison, also included are results from the Origin 2000. This is a CFD Application
http://www.ncsa.uiuc.edu/SCD/Perf/Tuning/sp_perf/
In general the Origin is about twice as fast. For the HP Kayak there is a sharp decline in performance going from 64x64 to 128x128 matrices, while on the Origin the decline is more gradual and the code usually becomes memory bound beyond 256x256. This is a result of the smaller cache on the Intel chip.

HTML version of Scripted Foils prepared 17 November 1998

Foil 91 Comments on Special Purpose Devices

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
One example is an associative memory - SIMD or MIMD or content-addressable memories
This is an example of a special purpose "signal" processing machine which can in fact be built from "conventional" SIMD or "MIMD" architectures
This type of machine is not so popular, as most applications are not dominated by computations for which good special purpose devices can be designed
If only 10% of a problem is, say, "track-finding" or some other special purpose processing, then who cares if you reduce that 10% by a factor of 100?
  • You have only sped up the system by a factor 1.1 not by 100!
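The "factor 1.1" claim is just Amdahl's law; a short check in C of the arithmetic, with the speedup of the special-purpose fraction as a parameter:

/* Check of the speedup claim above: if a fraction f of the work is
   accelerated by a factor s, overall speedup = 1 / ((1 - f) + f / s). */
#include <stdio.h>

double speedup(double f, double s) { return 1.0 / ((1.0 - f) + f / s); }

int main(void) {
    printf("f = 0.10, s = 100  ->  speedup = %.3f\n", speedup(0.10, 100.0)); /* ~1.11 */
    printf("f = 0.10, s = 1e9  ->  speedup = %.3f\n", speedup(0.10, 1e9));   /* limit 1/0.9 */
    return 0;
}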

HTML version of Scripted Foils prepared 17 November 1998

Foil 92 The GRAPE N-Body Machine

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Description of GRAPE 4 5 and 6 Machines:1 to 200 Teraflops
N-body problems (e.g. Newton's laws for one million stars in a globular cluster) can have successful special purpose devices
See the GRAPE (GRAvity PipE) machine (Sugimoto et al., Nature 345, page 90, 1990)
  • Essential reason is that such problems need much less memory per floating point unit than most problems
  • Globular Cluster: 10^6 computations per datum stored
  • Finite Element Iteration: A few computations per datum stored
  • Rule of thumb is that one needs one gigabyte of memory per gigaflop of computation in general problems and this general design puts most cost into memory not into CPU.
Note GRAPE uses EXACTLY same parallel algorithm that one finds in the books (e.g. Solving Problems on Concurrent Processors) for N-body problems on classic distributed memory MIMD machines
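A rough check in C of the "computations per datum stored" argument; the flops-per-interaction and words-per-body constants below are illustrative assumptions, not numbers from the foil:

/* Rough check of the "computations per datum stored" argument above.
   Assumptions (illustrative): ~30 flops per pairwise force evaluation
   and ~6 words stored per body (position + velocity).                  */
#include <stdio.h>

int main(void) {
    double N              = 1.0e6;  /* stars in a globular cluster      */
    double flops_per_pair = 30.0;   /* assumed cost of one interaction  */
    double words_per_body = 6.0;    /* assumed storage per body         */

    double flops_per_step = N * (N - 1.0) * flops_per_pair;  /* direct O(N^2) sum */
    double words_stored   = N * words_per_body;

    printf("flops per time step    : %.3g\n", flops_per_step);
    printf("words stored           : %.3g\n", words_stored);
    printf("computations per datum : %.3g\n", flops_per_step / words_stored);
    /* ~5e6 -- the same order as the 10^6 quoted above, and far above the
       "few per datum" typical of finite element iterations.              */
    return 0;
}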

HTML version of Scripted Foils prepared 17 November 1998

Foil 93 Why isn't GRAPE a Perfect Solution?

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Description of GRAPE 4 5 and 6 Machines:1 to 200 Teraflops
GRAPE will execute the classic O(N^2) (parallel) N-body algorithm BUT this is not the algorithm used in most such computations
Rather there is the O(N) or O(N logN) so-called "fast multipole" algorithm, which uses a hierarchical approach
  • On one million stars, fast multipole is a factor of 100-1000 faster than the GRAPE algorithm
  • Fast multipole works in most but not all N-body problems (in globular clusters, extreme heterogeneity makes the direct O(N^2) method most attractive)
So special purpose devices cannot usually take advantage of new nifty algorithms!

HTML version of Scripted Foils prepared 17 November 1998

Foil 94 GRAPE Special Purpose Machines

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Grape Special Purpose Machine
GRAPE 4 1.08 Teraflops
GRAPE 6 200 Teraflops

HTML version of Scripted Foils prepared 17 November 1998

Foil 95 Quantum ChromoDynamics (QCD) Special Purpose Machines

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Special Purpose Physics Machines for Foil 100
Here the case for special purpose machines is less compelling than for GRAPE as QCD is "just" regular (easy to parallelize; lowish communication) and extremely floating point intensive.
We illustrate with two machines which are classic MIMD distributed memory architecture with optimized nodes/communication networks
BNL/Columbia QCDSP 400 Gigaflops
Univ. of Tsukuba CP-PACS (general physics) 600 Gigaflops

HTML version of Scripted Foils prepared 17 November 1998

Foil 96 Granularity of Parallel Components - I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors.
  • Pieces and processors may be heterogeneous.
  • Computation/communication ratio very high -- typical of networked metacomputing
Medium-grain: Tens to few thousands of pieces, typically executed by microprocessors.
  • Processors typically run the same code.(SPMD Style)
  • Computation/communication ratio often hundreds or more.
  • Typical of MIMD Parallel Systems such as SP2, CM5, Paragon, T3D

HTML version of Scripted Foils prepared 17 November 1998

Foil 97 Granularity of Parallel Components - II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines.
  • Processors often have instructions broadcast to them.
  • Computation/communication ratio often near unity.
  • Typical of SIMD but seen in a few MIMD systems such as Kogge's Execube, Dally's J Machine or the commercial Myrinet (Seitz)
  • This is going to be very important in future petaflop architectures, as the dense chips of year 2003 onwards favor this Processor-in-Memory architecture
  • There will be so many transistors in future chips that the "small processors" of the "future" will be similar to today's high-end microprocessors
  • As chips get denser, not realistic to put processors and memories on separate chips as granularities become too big
Note that a machine of given granularity can be used on algorithms of the same or finer granularity

HTML version of Scripted Foils prepared 17 November 1998

Foil 98 Classes of Communication Networks

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
The last major architectural feature of a parallel machine is the network or design of hardware/software connecting processors and memories together.
Bus: All processors (and memory) connected to a common bus or busses.
  • Memory access fairly uniform, but not very scalable due to contention
  • Bus machines can be NUMA if memory consists of directly accessed local memory as well as memory banks accessed by Bus. The Bus accessed memories can be local memories on other processors
Switching Network: Processors (and memory) connected to routing switches like in telephone system.
  • Switches might have queues and "combining logic", which improve functionality but increase latency.
  • Switch settings may be determined by message headers or preset by controller.
  • Connections can be packet-switched (messages no longer than some fixed size) or circuit-switched (connection remains as long as needed)
  • Usually NUMA, blocking, often scalable and upgradable

HTML version of Scripted Foils prepared 17 November 1998

Foil 99 Switch and Bus based Architectures

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Switch
Bus

HTML version of Scripted Foils prepared 17 November 1998

Foil 100 Examples of Interconnection Topologies

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Two-dimensional grid, binary tree, complete interconnect and 4D hypercube.
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are only partial
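For reference, the standard textbook diameter and bisection width of the topologies named above, evaluated here in C for a hypothetical 16-node machine (our choice of size):

/* Standard textbook diameter and bisection width for the topologies named
   above, evaluated for P = 16 nodes (a 4x4 grid, a 4-dimensional hypercube). */
#include <stdio.h>
#include <math.h>

int main(void) {
    int P    = 16;
    int side = (int)round(sqrt((double)P));   /* 2D grid is side x side */
    int d    = (int)round(log2((double)P));   /* hypercube dimension    */

    printf("2D grid (%dx%d) : diameter %d, bisection %d links\n",
           side, side, 2 * (side - 1), side);
    printf("binary tree     : diameter ~%d, bisection 1 link\n", 2 * d);
    printf("hypercube (d=%d): diameter %d, bisection %d links\n", d, d, P / 2);
    printf("complete graph  : diameter 1, bisection %d links\n", (P / 2) * (P / 2));
    return 0;
}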

HTML version of Scripted Foils prepared 17 November 1998

Foil 101 Useful Concepts in Communication Systems

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Useful terms include:
Scalability: Can network be extended to very large systems? Related to wire length (synchronization and driving problems), degree (pinout)
Fault Tolerance: How easily can system bypass faulty processor, memory, switch, or link? How much of system is lost by fault?
Blocking: Some communication requests may not get through, due to conflicts caused by other requests.
Nonblocking: All communication requests succeed. Sometimes just applies as long as no two requests are for same memory cell or processor.
Latency (delay): Maximal time for nonblocked request to be transmitted.
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on time needed for task.
Wormhole routing -- intermediate switch nodes do not wait for the full message but allow it to pass through in small packets

HTML version of Scripted Foils prepared 17 November 1998

Foil 102 Latency and Bandwidth of a Network

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Transmission time for a message of n bytes:
T0 + T1 n where
T0 is the latency; it contains a term proportional to the number of hops, plus a term representing the interrupt processing time at the beginning and end, during which the communication network and processor synchronize
T0 = TS + Td x (number of hops)
T1 is the inverse bandwidth -- it can be made small if the pipe (link) is wide.
In practice TS and T1 are most important and Td is unimportant
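A small C rendering of this model, with made-up illustrative values of TS, Td, the hop count and the bandwidth (not measurements of any machine); it also prints the message size at which startup and transfer costs are equal:

/* The transmission-time model of this foil: T(n) = T0 + T1*n with
   T0 = TS + Td * hops.  All numbers below are illustrative only.        */
#include <stdio.h>

double transfer_time_us(double n_bytes, double Ts_us, double Td_us,
                        int hops, double bandwidth_MBs) {
    double T0 = Ts_us + Td_us * hops;   /* startup latency in microseconds */
    double T1 = 1.0 / bandwidth_MBs;    /* microseconds per byte at B MB/s */
    return T0 + T1 * n_bytes;
}

int main(void) {
    double Ts = 20.0, Td = 0.5;         /* microseconds, assumed values    */
    double bw = 100.0;                  /* MB/s, so T1 = 0.01 us per byte  */
    int    hops = 4;

    for (double n = 8; n <= 1.0e6; n *= 10)
        printf("n = %8.0f bytes : %10.1f us\n",
               n, transfer_time_us(n, Ts, Td, hops, bw));

    /* message size at which startup and transfer costs are equal */
    printf("break-even size = %.0f bytes\n", (Ts + Td * hops) * bw);
    return 0;
}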

HTML version of Scripted Foils prepared 17 November 1998

Foil 103 Transfer Time in Microseconds for both Shared Memory Operations and Explicit Message Passing

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995

HTML version of Scripted Foils prepared 17 November 1998

Foil 104 Latency/Bandwidth Space for 0-byte message(Latency) and 1 MB message(bandwidth).

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Square blocks indicate shared memory copy performance
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995

HTML version of Scripted Foils prepared 17 November 1998

Foil 105 Communication Performance of Some MPP's

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
From Aspects of Computational Science, Editor Aad van der Steen, published by NCF
System                 Communication Speed     Computation Speed
                       Mbytes/sec (per link)   Mflops/sec (per node)
IBM SP2                       40                     267
Intel iPSC860                  2.8                    60
Intel Paragon                200                      75
Kendall Square KSR-1          17.1                    40
Meiko CS-2                   100                     200
Parsytec GC                   20                      25
TMC CM-5                      20                     128
Cray T3D                     150                     300

HTML version of Scripted Foils prepared 17 November 1998

Foil 106 Implication of Hardware Performance

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
tcomm = 4 or 8 / (communication speed in Mbytes/sec)
  • as there are 4 or 8 bytes in a floating point word
tfloat = 1 / (computation speed in Mflops/sec)
Thus tcomm / tfloat is just 4 x computation speed divided by communication speed
tcomm / tfloat is 26.7, 85, 1.5, 9.35, 8, 5, 25.6, 8 for the machines SP2, iPSC860, Paragon, KSR-1, Meiko CS2, Parsytec GC, TMC CM5, and Cray T3D respectively
Latency makes the situation worse for small messages, and the ratio doubles for the 64-bit arithmetic natural on large problems!
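The quoted ratios can be reproduced directly from the table on the previous foil; a short C check assuming 4-byte words:

/* Recomputes the tcomm/tfloat ratios quoted above from the table on the
   previous foil, assuming 4-byte words: ratio = 4 * Mflops / (MB/s).     */
#include <stdio.h>

int main(void) {
    const char  *name[] = {"IBM SP2", "Intel iPSC860", "Intel Paragon", "KSR-1",
                           "Meiko CS-2", "Parsytec GC", "TMC CM-5", "Cray T3D"};
    const double comm[] = {40, 2.8, 200, 17.1, 100, 20, 20, 150};   /* MB/s per link   */
    const double flop[] = {267, 60, 75, 40, 200, 25, 128, 300};     /* Mflops per node */

    for (int i = 0; i < 8; i++) {
        double tcomm  = 4.0 / comm[i];    /* us to send one 4-byte word */
        double tfloat = 1.0 / flop[i];    /* us per floating point op   */
        printf("%-14s tcomm/tfloat = %5.1f\n", name[i], tcomm / tfloat);
    }
    return 0;
}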

HTML version of Scripted Foils prepared 17 November 1998

Foil 107 MPI Bandwidth on SGI Origin and Sun Shared Memory Machines

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NASA Ames Comparison of MPI on Origin 2000 and Sun E10000
Results from Samson Cheung of NAS (Ames)
Origin 2000: mpirun -np 2 ./lbw -B -n 4000 -D
Total transfer time: 57.281 sec.
Transfer rate: 69.8 transfers/sec.
Message size: 1000000 bytes
Bandwidth: 139.664 10e+06 Bytes/sec
Sun E10000: tmrun -np 2 ./lbw -B -n 4000 -D
Total transfer time: 54.487 sec.
Transfer rate: 73.4 transfers/sec.
Message size: 1000000 bytes
Bandwidth: 146.825 10e+06 Bytes/sec

HTML version of Scripted Foils prepared 17 November 1998

Foil 108 Latency Measurements on Origin and Sun for MPI

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index NASA Ames Comparison of MPI on Origin 2000 and Sun E10000
Origin 2000: mpirun -np 2 ./lbw -L -n 2500000 -D
Total transfer time: 52.597 sec.
Transfer rate: 47531.4 transfers/sec.
Message size: 40 bytes
Latency: 10.5 microsec.
Sun E10000: tmrun -np 2 ./lbw -L -n 2500000 -D
Total transfer time: 34.585 sec.
Transfer rate: 72286.7 transfers/sec.
Message size: 40 bytes
Latency: 6.9 microsec.
Note: Origin CPU about twice performance of Sun CPU
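The reported bandwidth and latency follow from the raw totals if one assumes each "transfer" counted by lbw is a round trip (ping-pong); that assumption is ours, not stated on the foil. A quick consistency check in C for the Origin 2000 numbers:

/* Consistency check of the Origin 2000 numbers above, under the assumption
   (ours, not stated on the foil) that each "transfer" is a round trip.     */
#include <stdio.h>

int main(void) {
    /* bandwidth run: 4000 transfers of 1,000,000-byte messages in 57.281 s */
    double bw = 2.0 * 4000 * 1.0e6 / 57.281;           /* bytes moved / time */
    printf("bandwidth ~ %.1f MB/s (foil: 139.664)\n", bw / 1.0e6);

    /* latency run: 2,500,000 transfers of 40-byte messages in 52.597 s     */
    double lat = 52.597 / 2.5e6 / 2.0 * 1.0e6;         /* one-way, microsec  */
    printf("latency   ~ %.1f us (foil: 10.5)\n", lat);
    return 0;
}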

HTML version of Scripted Foils prepared 17 November 1998

Foil 109 Two Basic Programming Models

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Shared Address Space or Shared Memory
  • Natural as extends uniprocessor programming models
Explicitly parallel or Data Parallel
  • Either user or compiler explicitly inserts message passing
  • Requires modification of existing codes
  • Natural for metacomputing or a bunch of PC's on the web or in your machine

HTML version of Scripted Foils prepared 17 November 1998

Foil 110 Shared Address Space Architectures

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Any processor can directly reference any memory location
  • Communication occurs implicitly as result of loads and stores
Convenient:
  • existing programs run out of the box ....
  • Similar programming model to time-sharing on uniprocessors
    • Except processes run on different processors
    • Good throughput on multiprogrammed workloads
Naturally provided on wide range of platforms
  • History dates at least to precursors of mainframes in early 60s
  • Wide range of scale: few to hundreds of processors
Popularly known as shared memory machines or model
  • This term ambiguous: memory may be physically distributed among processors

HTML version of Scripted Foils prepared 17 November 1998

Foil 111 Shared Address Space Model

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Process: virtual address space plus one or more threads of control
Portions of address spaces of processes are shared
Writes to shared address visible to other threads (in other processes too)
Natural extension of uniprocessor's model:
  • conventional memory operations for communication;
  • special atomic operations for synchronization
OS uses shared memory to coordinate processes
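A minimal C sketch of this model using POSIX threads: two threads in one process communicate through an ordinary shared variable and synchronize with a mutex (standing in for the "special atomic operations" above). This illustrates the programming model, not any particular machine:

/* Minimal illustration of the shared address space model. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                    /* shared: visible to all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);          /* atomic operation for synchronization   */
        counter++;                          /* ordinary load/store for communication  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);     /* 200000: both threads see the same memory */
    return 0;
}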

HTML version of Scripted Foils prepared 17 November 1998

Foil 112 Communication Hardware

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Memory capacity increased by adding modules, I/O by controllers
  • Add processors for processing!
  • For higher-throughput multiprogramming, or parallel programs
Communication is natural extension of uniprocessor
Already have processor, one or more memory modules and I/O controllers connected by hardware interconnect of some sort

HTML version of Scripted Foils prepared 17 November 1998

Foil 113 History -- Mainframe

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
"Mainframe" approach
  • Motivated by multiprogramming
  • Extends crossbar used for mem bw and I/O
  • Originally processor cost limited to small sizes
    • later, cost of crossbar limited
  • Bandwidth scales with Number of processors p
  • High incremental cost; use multistage instead

HTML version of Scripted Foils prepared 17 November 1998

Foil 114 History -- Minicomputer

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
"Minicomputer" approach
  • Almost all microprocessor systems have a bus
  • Motivated by multiprogramming, TP
  • Used heavily for parallel computing
  • Called symmetric multiprocessor (SMP)
  • Latency larger than for uniprocessor
  • Bus is bandwidth bottleneck
    • caching is key: coherence problem
  • Low incremental cost

HTML version of Scripted Foils prepared 17 November 1998

Foil 115 Scalable Interconnects

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Problem is interconnect: cost (crossbar) or bandwidth (bus)
  • Hypercube was an attempt at compromise between these two
Dance-hall: bandwidth still scalable, but lower cost than crossbar
  • latencies to memory uniform, but uniformly large
Distributed memory or non-uniform memory access (NUMA)
  • Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
Caching shared (particularly nonlocal) data?

HTML version of Scripted Foils prepared 17 November 1998

Foil 116 Message Passing Architectures

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Complete computer as building block, including I/O
  • Communication via explicit I/O operations
  • Natural for web computing ....
Programming model:
  • Directly access only private address space (local memory)
  • Off processor access via explicit messages (send/receive)
High-level block diagram similar to distributed-memory SAS (Shared Address Space)
  • But off-processor communication is integrated at the I/O level; it needn't be integrated into the memory system
  • Like networks of workstations (clusters), but tighter integration with specialized I/O
  • Easier to build than scalable SAS
Programming model more removed from basic hardware operations (?)
  • But natural for some problems as exposes parallelism rather than writing a parallel algorithm in a sequential language and letting some bunch of ignorant threads uncover what you knew anyway .... (SAS model?)

HTML version of Scripted Foils prepared 17 November 1998

Foil 117 Message-Passing Abstraction e.g. MPI

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Send specifies buffer to be transmitted and receiving process
Receive specifies sending process and application storage to receive into
Memory to memory copy, but need to name processes
Optional tag on send and matching rule on receive
User process names local data and entities in process/tag space too
In its simplest form, the send/recv match achieves a pairwise synchronization event; collective communication is also provided
Many overheads: copying, buffer management, protection
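A minimal MPI program in C illustrating the abstraction just described: rank 0 sends a tagged buffer, rank 1 names the source and tag and receives into its own storage. Run with something like mpirun -np 2:

/* Minimal MPI send/receive example. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);   /* dest = 1, tag = 99 */
    } else if (rank == 1) {
        int recv[4];
        MPI_Recv(recv, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n", recv[0], recv[1], recv[2], recv[3]);
    }
    MPI_Finalize();
    return 0;
}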

HTML version of Scripted Foils prepared 17 November 1998

Foil 118 First Message-Passing Machines

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Early machines such as Caltech Hypercube and first commercial Intel and nCUBE designs used a FIFO on each node to store and forward messages
  • computing and message processing equivalenced
  • Hardware close to programming Model; synchronous message passing
  • Later Replaced by DMA, enabling non-blocking communication
    • Buffered by system at destination until needed
Diminishing role of topology in modern machines
  • Store and forward routing: topology important
  • Introduction of pipelined (worm hole) routing made topology largely irrelevant
  • Cost is in node-network interface
  • Simplifies programming
64 node n=6 hypercube in Caltech Computer Science Dept.

HTML version of Scripted Foils prepared 17 November 1998

Foil 119 SMP Example: Intel Pentium Pro Quad

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Commodity SMP
All coherence and multiprocessing glue in processor module
Highly integrated, targeted at high volume
Low latency and bandwidth
(Figure: Intel Pentium Pro Quad. Four P-Pro modules, each containing a CPU, bus interface, MIU and 256-KB L2 cache, share the P-Pro bus (64-bit data, 36-bit address, 66 MHz) with an interrupt controller, a memory controller driving 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges leading to PCI buses with PCI I/O cards.)

HTML version of Scripted Foils prepared 17 November 1998

Foil 120 Sun E10000 in a Nutshell

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Descriptions of Sun HPC Systems for Foil 69
It is perhaps the most successful SMP (Symmetric Multiprocessor), with applicability to both the commercial and scientific markets
Its SMP characteristics are seen in its low, uniform latency
One cannot build huge E10000 systems (only up to 64 processors, and each processor is slower than the Origin's)
One should cluster multiple E10000's to build a supercomputer

HTML version of Scripted Foils prepared 17 November 1998

Foil 121 Sun Enterprise Systems E6000/10000

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Descriptions of Sun HPC Systems for Foil 70
These are very successful in commercial server market for database and related applications
E10000 acquired from Cray via SGI is also called Starfire
TPCD Benchmarks 1000Gbyte(Terabyte) November 1998
These measure typical large scale database queries
System               Database   Power Metric (QppD)   5-Year Total System Cost   $/perf ($/QphD)
Sun Starfire (SMP)   Oracle          27,024.6               $9,660,193                 $776
IBM RS/6000 (MPP)    DB2             19,137.5               $11,380,178                $797
Sun 4x6000 (Clus)    Informix        12,931.9               $11,766,932              $1,353
NCR 5150 (MPP)       Teradata        12,149.2               $14,495,886              $2,103

HTML version of Scripted Foils prepared 17 November 1998

Foil 122 Starfire E10000 Architecture I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Descriptions of Sun HPC Systems for Foil 71
The Starfire server houses a group of system boards interconnected by a centerplane.
  • A single cabinet holds up to 16 of these system boards, each of which can be independently configured with processors, memory and I/O channels, as described on following page.

HTML version of Scripted Foils prepared 17 November 1998

Foil 123 Starfire E10000 Architecture II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Descriptions of Sun HPC Systems for Foil 72
On each system board there are up to four 336 MHz UltraSPARC microprocessor modules, each with a supporting two-level, 4 Mbyte cache (64 modules per Starfire system).
There can be four memory banks with a capacity of up to 4 Gbytes per system board (64 Gbytes per Starfire server).
There are two SBuses per board, each with slots for up to two adapters for networking and I/O (32 SBuses or 64 slots per system).
  • Or two PCI busses per board - each accommodating one adapter. Starfire can have a mix of SBus and PCI adapters.

HTML version of Scripted Foils prepared 17 November 1998

Foil 124 Sun Enterprise E6000/6500 Architecture

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Descriptions of Sun HPC Systems for Foil 73

HTML version of Scripted Foils prepared 17 November 1998

Foil 125 Sun's Evaluation of E10000 Characteristics I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Descriptions of Sun HPC Systems for Foil 74
This is aimed at commercial customers where Origin 2000 is not a major player

HTML version of Scripted Foils prepared 17 November 1998

Foil 126 Sun's Evaluation of E10000 Characteristics II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Descriptions of Sun HPC Systems for Foil 75

HTML version of Scripted Foils prepared 17 November 1998

Foil 127 Scalability of E1000

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Descriptions of Sun HPC Systems for Foil 76
The Starfire scales to smaller sizes than machines like the IBM SP2 and Cray T3E but significantly larger sizes than competing SMP's

HTML version of Scripted Foils prepared 17 November 1998

Foil 128 Consider Scientific Supercomputing

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Proving ground and driver for innovative architecture and techniques
  • Market smaller (about 1%) relative to commercial as multiprocessors become mainstream
  • Dominated by vector machines starting in 70s
  • Microprocessors have made huge gains in floating-point performance
    • high clock rates
    • pipelined floating point units (e.g., multiply-add every cycle)
    • instruction-level parallelism
    • effective use of caches (e.g., automatic blocking)
  • Economics of commodity microprocessors
Large-scale multiprocessors replace vector supercomputers
  • Well under way already

HTML version of Scripted Foils prepared 17 November 1998

Foil 129 Toward Architectural Convergence

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Evolution and role of software have blurred boundary
  • Send/receive supported on SAS machines via buffers
  • Can construct global address space on MP using hashing
  • Page-based (or finer-grained) shared virtual memory
Hardware organization converging too
  • Tighter NI (Network Interface) integration even for MP (low-latency, high-bandwidth)
  • At lower level, even hardware SAS passes hardware messages
Even clusters of workstations/SMPs are parallel systems
  • Emergence of fast system area networks (SAN)
Programming models distinct, but organizations converging
  • Nodes connected by general network and communication assists
  • Implementations also converging, at least in high-end machines

HTML version of Scripted Foils prepared 17 November 1998

Foil 130 Convergence: Generic Parallel Architecture

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Node: processor(s), memory system, plus communication assist
  • Network interface and communication controller
Scalable network
Convergence allows lots of innovation, now within framework
  • Integration of assist with node, what operations, how efficiently...
A generic modern multiprocessor (Culler)

HTML version of Scripted Foils prepared 17 November 1998

Foil 131 Tera Multithreaded Supercomputer

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 79
This uses a clever idea developed over many years by Burton Smith, who originally used it in the Denelcor system, one of the first MIMD machines, over 15 years ago
MTA (multithreaded architecture) machines are designed to hide the gap between memory access time and CPU cycle time.
  • Conventional architectures use caches for this
  • The MTA uses a strategy typically applied to coarser-grain functional parallelism -- namely, it switches to another task while the current one is waiting for memory
  • Burton emphasizes that hiding memory latency always implicitly requires parallelism equal to the ratio of memory access time to CPU operation time
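A back-of-the-envelope version of Burton Smith's point, with an assumed (illustrative) memory access time:

/* Latency hiding: to keep issuing one operation per cycle while a memory
   reference takes L cycles, a processor needs about L independent threads.
   The memory access time below is an assumption, not a measured value.     */
#include <stdio.h>

int main(void) {
    double cycle_ns  = 3.0;     /* ~333 MHz Tera processor cycle           */
    double memory_ns = 300.0;   /* assumed DRAM access time (illustrative) */
    printf("threads needed to hide latency ~ %.0f\n", memory_ns / cycle_ns);
    /* The Tera MTA provides up to 128 hardware streams per processor,
       i.e. enough to hide up to 128 cycles (384 ns) of latency.           */
    return 0;
}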

HTML version of Scripted Foils prepared 17 November 1998

Foil 132 Tera Computer at San Diego Supercomputer Center

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 80
First 2 processor system

HTML version of Scripted Foils prepared 17 November 1998

Foil 133 Overview of the Tera MTA I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 81
Tera computer system is a shared memory multiprocessor.
  • From its specification, it also implements the true shared memory programming model for which the performance of the system does not depend on the placement of data in memory.
  • i.e. a true uniform memory system (UMA) whereas Sun E10000 is almost UMA and Origin 2000 NUMA
The Tera is a multi-processor which potentially can accommodate up to 256 processors.
  • The system runs stand-alone and requires no front end.
    • Network connection to workstations and other computer systems is accomplished via 32- or 64-bit HIPPI channels.
    • All data path widths are 64 bits, including the processor-network interface.
The clock speed is nominally 333 MHz, giving each processor a data path bandwidth of one billion 64-bit results per second and a peak performance of one gigaflops.

HTML version of Scripted Foils prepared 17 November 1998

Foil 134 Overview of the Tera MTA II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 82
The Tera Processors are multithreaded (called a stream) and each processor switches context every cycle among as many as 128 hardware threads, thereby hiding up to 128 cycles (384 ns) of memory latency.
Each processor executes a 21 stage pipeline and so can have 21 separate streams executing simultaneously
Each stream can issue as many as eight memory references without waiting for earlier ones to finish, further augmenting the memory latency tolerance of the processor.
A stream implements a load-store architecture with three addressing modes and 31 general-purpose 64-bit registers.
  • Switching between such streams ("threads") is fully supported by hardware
The peak memory bandwidth is 2.67 gigabytes per second.

HTML version of Scripted Foils prepared 17 November 1998

Foil 135 Tera 1 Processor Architecture from H. Bokhari (ICASE)

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 83

HTML version of Scripted Foils prepared 17 November 1998

Foil 136 Tera Processor Characteristics

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 84
From Bokhari (ICASE)

HTML version of Scripted Foils prepared 17 November 1998

Foil 137 Tera System Diagram

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 85
From Bokhari (ICASE)

HTML version of Scripted Foils prepared 17 November 1998

Foil 138 Interconnect / Communications System of Tera I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 86
The interconnection network is a sparsely populated 3-D packet-switched network containing p^(3/2) nodes, where p is the number of processors.
These nodes are toroidally connected in three dimensions to form a p^(1/2)-ary three-cube, and processor and memory resources are attached to some of the nodes.
The latency of a node is three cycles: a message spends two cycles in the node logic proper and one on the wire that connects the node to its neighbors.
A p-processor system has worst-case one-way latency of 4.5p^(1/2) cycles.
Messages are assigned random priorities and then routed in priority order. Under heavy load, some messages are derouted by this process. The randomization at each node ensures that each packet eventually reaches its destination.
  • Randomization ensures good "worst-case" performance. This is a strategy well supported by theory (e.g. Les Valiant, Harvard)

HTML version of Scripted Foils prepared 17 November 1998

Foil 139 Interconnect / Communications System of Tera II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 87
A node has four ports (five if a resource is attached).
Each port simultaneously transmits and receives an entire 164-bit packet every 3 ns clock cycle.
Of the 164 bits, 64 are data, so the data bandwidth per port is 2.67 GB/s in each direction.
The network bisection bandwidth is 2.67p GB/s. (p is number of processors)
The network routing nodes contain no buffers other than those required for the pipeline.
  • Instead, all messages are immediately routed to an output port.

HTML version of Scripted Foils prepared 17 November 1998

Foil 140 T90/Tera MTA Hardware Comparison

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 88
From Allan Snavely at SDSC

HTML version of Scripted Foils prepared 17 November 1998

Foil 141 Tera Configurations / Performance

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 89
The overall hardware configuration of the system:
Processors 16 64 256
Peak Gflops 16 64 256
Memory, Gbytes 16-32 64-128 256-512
HIPPI channels 32 128 512
I/O, Gbytes/sec 6.2 25 102

HTML version of Scripted Foils prepared 17 November 1998

Foil 142 Performance of MTA wrt T90 and in parallel

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 90
Relative MTA and T90 Performance

HTML version of Scripted Foils prepared 17 November 1998

Foil 143 Tera MTA Performance on NAS Benchmarks Compared to T90

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Tera Architecture and System Links for Foil 91
From Allan Snavely at SDSC
(Chart: NAS benchmark results for the Tera MTA at 255 MHz and 300 MHz compared to the T90.)

HTML version of Scripted Foils prepared 17 November 1998

Foil 144 Cache Only COMA Machines

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
KSR-1 (Colorado)
Cache Only Machines "only" have cache and are typified by the Kendall Square KSR-1,2 machines. Although this company is no longer in business, the basic architecture is interesting and could still be used in future important machines
In this class of machine one has a NUMA architecture with memory attached to a given processor being lower access cost
In simplest COMA architectures, think of this memory as the cache and when needed migrate data to this cache
However in conventional machines, all data has a natural "original home"; in (simple) COMA, the home of the data moves when it is accessed and one hopes that data is "attracted" through access to "correct" processor

HTML version of Scripted Foils prepared 17 November 1998

Foil 145 III. Key drivers: The Need for PetaFLOPS Computing

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Applications that require petaFLOPS can already be identified
  • (DOE) Nuclear weapons stewardship
  • (NSA) Cryptology and digital signal processing
  • (NASA and NSF) Satellite data assimilation and climate modeling
The need for ever greater computing power will remain.
PetaFLOPS systems are the right step for the next decade

HTML version of Scripted Foils prepared 17 November 1998

Foil 146 10 Possible PetaFlop Applications

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Nuclear Weapons Stewardship (ASCI)
Cryptology and Digital Signal Processing
Satellite Data Analysis
Climate and Environmental Modeling
3-D Protein Molecule Reconstruction
Real-Time Medical Imaging
Severe Storm Forecasting
Design of Advanced Aircraft
DNA Sequence Matching
Molecular Simulations for nanotechnology
Large Scale Economic Modelling
Intelligent Planetary Spacecraft

HTML version of Scripted Foils prepared 17 November 1998

Foil 147 Petaflop Performance for Flow in Porous Media?

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Why does one need a petaflop (10^15 operations per second) computer?
These are problems where quite viscous (oil, pollutants) liquids percolate through the ground
Very sensitive to details of material
Most important problems are already solved at some level, but most solutions are insufficient and need improvement in various respects:
  • under resolution of solution details, averaging of local variations and under representation of physical details
  • rapid solutions to allow efficient exploration of system parameters
  • robust and automated solution, to allow integration of results in high level decision, design and control functions
  • inverse problems (history match) to reconstruct missing data require multiple solutions of the direct problem

HTML version of Scripted Foils prepared 17 November 1998

Foil 148 Target Flow in Porous Media Problem (Glimm - Petaflop Workshop)

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Oil Reservoir Simulation
Geological variation occurs down to the pore size of the rock - almost 10^-6 metres - model this (statistically)
Want to calculate flow between wells which are about 400 metres apart
10^3 x 10^3 x 10^2 = 10^8 grid elements
30 species
10^4 time steps
300 separate cases need to be considered
3x10^9 words of memory per case
10^12 words total if all cases considered in parallel
10^19 floating point operations
3 hours on a petaflop computer
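The arithmetic behind this estimate, checked in C; the operations count per grid element, species and time step (about 10^3) is our assumption, chosen to reproduce the quoted 10^19 flops:

/* Arithmetic behind the porous-media estimate above. */
#include <stdio.h>

int main(void) {
    double grid    = 1e3 * 1e3 * 1e2;   /* 10^8 grid elements                      */
    double species = 30;
    double steps   = 1e4;
    double cases   = 300;
    double ops_per = 1.0e3;             /* assumed flops per element-species-step  */

    double words_per_case = grid * species;
    double flops          = grid * species * steps * cases * ops_per;

    printf("memory per case ~ %.2g words, all cases ~ %.2g words\n",
           words_per_case, words_per_case * cases);          /* ~3e9 and ~10^12    */
    printf("total flops ~ %.2g\n", flops);                   /* ~10^19             */
    printf("time on a 1 petaflop machine ~ %.1f hours\n",
           flops / 1.0e15 / 3600.0);                         /* ~2.5-3 hours       */
    return 0;
}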

HTML version of Scripted Foils prepared 17 November 1998

Foil 149 NASA's Projection of Memory and Computational Requirements up to Petaflops for Aerospace Applications

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index

HTML version of Scripted Foils prepared 17 November 1998

Foil 150 Supercomputer Architectures in Years 2005-2010 -- I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Conventional (Distributed Shared Memory) Silicon
  • Clock Speed 1GHz
  • 4 eight way parallel Complex RISC nodes per chip
  • 4000 Processing chips gives over 100 tera(fl)ops
  • 8000 2 Gigabyte DRAM gives 16 Terabytes Memory
Note Memory per Flop is much less than one to one
Natural scaling says time steps decrease at the same rate as spatial intervals, and so the memory needed goes like (FLOPS in Gigaflops)^0.75
  • If One Gigaflop requires One Gigabyte of memory (Or is it one Teraflop that needs one Terabyte?)
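A quick evaluation of this scaling rule in C, normalized to the foil's "one gigabyte per gigaflop" rule of thumb:

/* The scaling rule above: if 1 Gflops needs 1 GB and memory grows like
   (performance)^0.75, then a machine of R Gflops needs R^0.75 GB.       */
#include <stdio.h>
#include <math.h>

int main(void) {
    double rates_gflops[] = {1, 1e3, 1e5, 1e6};   /* 1 Gflops ... 1 Pflops */
    for (int i = 0; i < 4; i++) {
        double mem_GB = pow(rates_gflops[i], 0.75);
        printf("%10.0f Gflops -> %10.0f GB (%.1f TB)\n",
               rates_gflops[i], mem_GB, mem_GB / 1024.0);
    }
    /* By this rule the >100 Tflops design above needs roughly 5-6 TB,
       comfortably under its 16 TB of DRAM.                               */
    return 0;
}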

HTML version of Scripted Foils prepared 17 November 1998

Foil 151 Supercomputer Architectures in Years 2005-2010 -- II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Superconducting Technology is promising but can it compete with silicon juggernaut?
Should be able to build a simple 200 GHz superconducting CPU with modest superconducting caches (around 32 Kilobytes)
Must use same DRAM technology as for silicon CPU ?
So tremendous challenge to build latency tolerant algorithms (as over a factor of 100 difference in CPU and memory speed) but advantage of factor 30-100 less parallelism needed

HTML version of Scripted Foils prepared 17 November 1998

Foil 152 Supercomputer Architectures in Years 2005-2010 -- III

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Processor in Memory (PIM) Architecture is follow on to J machine (MIT) Execube (IBM -- Peter Kogge) Mosaic (Seitz)
  • More interesting in 2007, as the processors will be "real" and have a nontrivial amount of memory
  • Naturally fetch a complete row (column) of memory at each access - perhaps 1024 bits
One could take in year 2007 each two gigabyte memory chip and alternatively build as a mosaic of
  • One Gigabyte of Memory
  • 1000 250,000 transistor simple CPU's running at 1 Gigaflop each and each with one megabyte of on chip memory
12000 chips (Same amount of Silicon as in first design but perhaps more power) gives:
  • 12 Terabytes of Memory
  • 12 Petaflops performance
  • This design "extrapolates" specialized DSP's , the GRAPE (specialized teraflop N body machine) etc to a "somewhat specialized" system with a general CPU but a special memory poor architecture with particular 2/3D layout

HTML version of Scripted Foils prepared 17 November 1998

Foil 153 Performance Per Transistor

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Performance data from uP vendors
Transistor count excludes on-chip caches
Performance normalized by clock rate
Conclusion: Simplest is best! (250K Transistor CPU)
(Charts: normalized SPECint and SPECfp performance plotted against millions of CPU transistors.)

HTML version of Scripted Foils prepared 17 November 1998

Foil 154 Comparison of Supercomputer Architectures

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Fixing 10-20 Terabytes of Memory, we can get
16000-way parallel natural evolution of today's machines with various architectures from distributed shared memory to clustered hierarchy
  • Peak Performance is 150 Teraflops with memory systems like today but worse with more levels of cache
5000 way parallel Superconducting system with 1 Petaflop performance but terrible imbalance between CPU and memory speeds
12 million-way parallel PIM system with 12 petaflop performance and a "distributed memory architecture", as off-chip access will have serious penalties
There are many hybrid and intermediate choices -- these are extreme examples of "pure" architectures

HTML version of Scripted Foils prepared 17 November 1998

Foil 155 Current PIM Chips

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Chip          Storage    First Silicon   Peak            MB/Perf.             Organization
EXECUBE       0.5 MB     1993            50 Mips         0.01 MB/Mip          16-bit SIMD/MIMD CMOS
AD SHARC      0.5 MB     1994            120 Mflops      0.005 MB/MF          Single CPU and memory
TI MVP        0.05 MB    1994            2000 Mops       0.000025 MB/Mop      1 CPU, 4 DSP's
MIT MAP       0.128 MB   1996            800 Mflops      0.00016 MB/MF        4 superscalar CPU's
Terasys PIM   0.016 MB   1993            625 M bit ops   0.000026 MB/bit op   1024 16-bit ALU's

HTML version of Scripted Foils prepared 17 November 1998

Foil 156 New "Strawman" PIM Processing Node Macro

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index

HTML version of Scripted Foils prepared 17 November 1998

Foil 157 "Strawman" Chip Floorplan

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index

HTML version of Scripted Foils prepared 17 November 1998

Foil 158 SIA-Based PIM Chip Projections

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
(Charts: projected MB per cm2 and MF per cm2 for PIM chips, and the resulting MB/MF ratios, based on SIA roadmap projections.)

HTML version of Scripted Foils prepared 17 November 1998

Foil 159 Quantum Computing - I

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Quantum-Mechanical Computers by Seth Lloyd, Scientific American, Oct 95
Chapter 6 of The Feynman Lectures on Computation edited by Tony Hey and Robin Allen, Addison-Wesley, 1996
Quantum Computing: Dream or Nightmare? Haroche and Raimond, Physics Today, August 96 page 51
Basically any physical system can "compute" as one "just" needs a system that gives answers that depend on inputs and all physical systems have this property
Thus one can build "superconducting", "DNA", or "quantum" computers, exploiting respectively superconducting, molecular, or quantum mechanical rules

HTML version of Scripted Foils prepared 17 November 1998

Foil 160 Quantum Computing - II

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
For a "new technology" computer to be useful, one needs to be able to
  • conveniently prepare inputs,
  • conveniently program,
  • reliably produce answer (quicker than other techniques), and
  • conveniently read out answer
Conventional computers are built around bit ( taking values 0 or 1) manipulation
One can build arbitrarily complex arithmetic if one has some way of implementing NOT and AND
Quantum Systems naturally represent bits
  • A spin (of say an electron or proton) is either up or down
  • A hydrogen atom is either in lowest or (first) excited state etc.

HTML version of Scripted Foils prepared 17 November 1998

Foil 161 Quantum Computing - III

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Interactions between quantum systems can cause "spin-flips" or state transitions and so implement arithmetic
Incident photons can "read" state of system and so give I/O capabilities
Quantum "bits" called qubits have another property as one has not only
  • State |0> and state |1> but also
  • Coherent states such as .7071*(|0> + |1>) which are equally in either state
Lloyd describes how such coherent states provide new types of computing capabilities
  • Natural random number as measuring state of qubit gives answer 0 or 1 randomly with equal probability
  • As Feynman suggests, qubit based computers are natural for large scale simulation of quantum physical systems -- this is "just" analog computing
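As a small worked detail (standard quantum mechanics, not specific to any proposed machine), the coherent state written above as .7071*(|0> + |1>) is, in LaTeX notation,

\[
  |\psi\rangle = \tfrac{1}{\sqrt{2}}\left(|0\rangle + |1\rangle\right),
  \qquad \tfrac{1}{\sqrt{2}} \approx 0.7071,
\]
\[
  P(0) = \left|\tfrac{1}{\sqrt{2}}\right|^{2} = \tfrac{1}{2},
  \qquad
  P(1) = \tfrac{1}{2},
\]

so a measurement returns 0 or 1 with equal probability -- the "natural random number" mentioned above.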

HTML version of Scripted Foils prepared 17 November 1998

Foil 162 Superconducting Technology -- Past

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Superconductors produce wonderful "wires" which transmit picosecond (10^-12 seconds) pulses at near speed of light
  • Superconducting transmission is lower power and faster than diffusive electron transmission in CMOS
  • At about 0.35 micron chip feature size, CMOS transmission time changes from being dominated by transmission (distance) effects to resistive (diffusive) effects
Niobium used in constructing such superconducting circuits can be processed by similar fabrication techniques to CMOS
Josephson Junctions allow picosecond performance switches
BUT IBM (1969-1983) and Japan (MITI, 1981-90) terminated major efforts in this area

HTML version of Scripted Foils prepared 17 November 1998

Foil 163 Superconducting Technology -- Present

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
New ideas have resurrected this concept using RSFQ -- Rapid Single Flux Quantum -- approach
This naturally gives a bit which is 0 or 1 (or in fact n units!)
This gives interesting circuits of similar structure to CMOS systems but with a clock speed of order 100-300GHz -- factor of 100 better than CMOS which will asymptote at around 1 GHz (= one nanosecond cycle time)

HTML version of Scripted Foils prepared 17 November 1998

Foil 164 Superconducting Technology -- Problems

From Master Foilset for HPC Achitecture Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
At least two major problems:
The semiconductor industry will invest some $40B in CMOS "plants" and infrastructure
  • Currently perhaps $100M a year going into superconducting circuit area!
  • How do we "bootstrap" superconducting industry?
Cannot build memory to match CPU speed and current designs have superconducting CPU's (with perhaps 256 Kbytes superconducting memory per processor) but conventional CMOS memory
  • So compared with current computers one would have a thousand times faster CPU, a factor of four smaller cache running at CPU speed, and basic memory of the same speed as now
  • Can such machines perform well -- need new algorithms?
  • Can one design new superconducting memories?
Superconducting technology also has a bad "name" due to IBM termination!

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu
