Given by Geoffrey C. Fox at CPS615 Introduction to Computational Science on Fall Semester 1998. Foils prepared 17 November 98
Outside Index
Summary of Material
This presentation came from material developed by David Culler and Jack Dongarra available on the Web |
See summary of Saleh Elmohamed and Ken Hawick at http://nhse.npac.syr.edu/hpccsurvey/ |
We discuss several examples in detail including T3E, Origin 2000, Sun E10000 and Tera MTA |
These are used to illustrate major architecture types |
We discuss key sequential architecture issues including cache structure |
We also discuss technologies from today's commodities through Petaflop ideas and Quantum Computing |
Fall Semester 1998 |
Geoffrey Fox |
Northeast Parallel Architectures Center |
Syracuse University |
111 College Place |
Syracuse NY |
gcf@npac.syr.edu |
CM5 |
nCUBE |
Intel iPSC2 |
Workstation Cluster of Digital alpha machines |
NCSA 1024 nodes |
NPAC |
Architecture translates technology's gifts to performance and capability |
Resolves the tradeoff between parallelism and locality |
Four generations of architectural history: tube, transistor, IC, VLSI |
Greatest delineation within VLSI generation has been in type of parallelism exploited |
Greatest trend in VLSI generation is increase in parallelism |
How good is instruction-level parallelism? |
Thread-level parallelism needed in future microprocessors to use available transistors? |
Threads need classic coarse grain data or functional parallelism |
Exponential Improvement is (an example of) Moore's Law |
Shared memory SMP's dominate server and enterprise market, moving down to desktop as cost (size) of a single processor decreased |
Faster processors began to saturate bus, then bus technology advanced |
Today, range of sizes for bus-based systems is desktop (2-8) to 64 on large servers |
Cray 6400 becomes the Sun E10000 (Starfire) |
Commodity microprocessors not only fast but CHEAP |
Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors |
Standardization by Intel makes small, bus-based SMPs commodity |
Desktop: few smaller processors versus one larger one? |
Pipelined High Performance Workstations/PC's |
Vector Supercomputer |
MIMD Distributed Memory machine; heterogeneous clusters; metacomputers |
SMP Symmetric Multiprocessors |
Distributed Shared Memory NUMA |
Special Purpose Machines |
SIMD |
MTA or Multithreaded architectures |
COMA or Cache Only Memories |
The Past? |
The Future? |
Pragmatic issues connected with cost of designing and building new systems -- use commodity hardware and software if possible |
Data locality and bandwidth to memory -- caches, vector registers -- sequential and parallel |
Programming model -- shared address or data parallel -- explicit or implicit |
Forms of parallelism -- data, control, functional; macroscopic, microscopic, instruction level |
Pipelining is familiar from everyday activities such as getting food in a cafeteria, where the line processes one person per "clock cycle" |
The clock cycle here is the maximum time anybody takes at a single "stage", where a stage is one component of the meal (salad, entrée etc.) |
Note any one person takes about 5 clock cycles to get through this pipeline, but the pipeline completes one person per clock cycle |
The pipeline has a problem if there is a "stall" -- here, one of the people wants an entrée which needs to be fetched from the kitchen. This delays everybody! |
In the computer case, a stall is typically caused by data not being ready for a particular instruction |
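As a rough worked example of the point above (the 5-stage figure is the one quoted in the analogy, not a measurement): with s = 5 stages and N people, the pipelined line finishes in about s + N - 1 clock cycles instead of s*N, so the speedup s*N/(s + N - 1) approaches s = 5 for large N -- for example, 100 people take about 104 cycles rather than 500.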
Taken from David Patterson CS252 Berkeley Fall 1996 |
3 Functional Units FP Adder, FP Multiplier, FP Divider |
8 Kinds of Stages in FP Units |
Stage Functional Unit Description |
A FP Adder Mantissa ADD stage |
D FP Divider Divide Pipeline stage |
E FP Multiplier Exception Test stage |
M FP Multiplier First stage of multiplier |
N FP Multiplier Second stage of multiplier |
R FP Adder Rounding stage |
S FP Adder Operand Shift stage |
U Unpack Floating point numbers |
Several different pipelines with different lengths! |
Add, Subtract - 4 clocks: U S+A A+R R+S |
Multiply - 8 clocks: U E+M M M M N+A R |
Divide - 38 clocks: U A R D(28) D+A D+R D+R D+A D+R A R |
Square Root - 110 clocks: U E (A+R)(108) A R |
Negate - 2 clocks: U S |
Absolute Value - 2 clocks: U S |
Floating Point Compare - 3 clocks: U A R |
SGI Workstations at NPAC 1995 |
Data locality implies CPU finds information it needs in cache which stores most recently accessed information |
This means one reuses a given memory reference in many nearby computations e.g. |
A1 = B*C |
A2 = B*D + B*B |
.... Reuses B |
[Figure: memory hierarchy -- CPU and its caches, L3 cache, main memory, disk; memory capacity increases and memory speed decreases as one moves away from the CPU (factor of 100 difference between processor and main memory speed)] |
As shown above, caches are familiar in the real world -- here to support movement of food from manufacturer to your larder. It would be inconvenient to drive to the store for every item needed -- it is more convenient to cache items in your larder |
Caches store instructions and data -- often in separate caches |
Caches have a total size but also a cache line size, which is the minimum unit transferred into cache -- because of spatial locality this is often quite big. |
They also have a "write-back" strategy to define when information is written back from cache into primary memory |
Factory |
Level 3 Middleman Warehouse |
Level 2 Middleman Warehouse |
Local Supermarket |
Your Larder |
CPU -- The Frying pan |
Finally caches have a mapping strategy which tells you where to write a given word into cache and when to overwrite it with another data value fetched from main memory |
Direct mapped caches hash each word of main memory into a unique location in cache |
Fully associative caches remove that word in cache which was unreferenced for longest time and then stores new value into this spot |
Set associative caches combine these ideas. They have 2 to 4 (to ..) locations for each hash value and replace the oldest reference in that group. |
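A minimal sketch of the mapping arithmetic just described. The cache size, line size, associativity and the example address are illustrative assumptions, not the parameters of any machine in these foils:

      PROGRAM CACHEMAP
C     Sketch: where a byte address lands under the mapping strategies
C     above. All sizes and the address are made-up examples.
      INTEGER CSIZE, LSIZE, ASSOC, NLINES, NSETS
      INTEGER ADDR, LINE, SLOT, SETIDX
      PARAMETER (CSIZE = 8192, LSIZE = 32, ASSOC = 4)
      NLINES = CSIZE / LSIZE
      NSETS  = NLINES / ASSOC
      ADDR   = 123456
C     Memory line holding this address
      LINE   = ADDR / LSIZE
C     Direct mapped: the line hashes to a unique cache slot
      SLOT   = MOD(LINE, NLINES)
C     4-way set associative: the line hashes to a set of 4 slots and
C     the oldest (least recently used) member of the set is replaced
      SETIDX = MOD(LINE, NSETS)
      PRINT *, 'memory line        ', LINE
      PRINT *, 'direct mapped slot ', SLOT
      PRINT *, '4-way set index    ', SETIDX
      END

A direct mapped cache is the special case ASSOC = 1, and a fully associative cache is the other extreme where the whole cache forms one set.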
In the classic loop |
      DO I = 2,N
      DO J = 2,N
      FI(I,J) = .25*(FI(I+1,J)+FI(I-1,J)+FI(I,J+1)+FI(I,J-1))
      END DO
      END DO
We see spatial locality -- if (I,J) is accessed, so are the neighboring points stored near (I,J) in memory |
Spatial locality is essential for distributed memory as it ensures that, after data decomposition, most of the data you need is stored in the same processor and so communication is modest (surface over volume) |
Temporal locality says that if you use FI(I,J) at one value of J, it is used again at the next J value, as FI(I,(J+1)-1) |
Temporal locality makes cache machines work well as it ensures that after a data value is stored into cache, it is used multiple times |
If the first access (from main memory) takes time T1 and each subsequent (i.e. cache) access takes time T2 with T2 << T1, and a data value is accessed l times while in cache, then the average access time is: T2 + T1/l |
Temporal locality ensures l is big |
Spatial locality helps here as one fetches all the data in a cache line (say 128 bytes) in the time T1. Thus one can effectively reduce T1 by a further factor equal to the number of words in a cache line (perhaps 16) |
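As a worked instance of the formula above, with purely illustrative numbers rather than measurements: take T1 = 100 cycles, T2 = 1 cycle and l = 16 reuses, giving an average access time of T2 + T1/l = 1 + 100/16, about 7 cycles; if in addition a 128-byte cache line brings in 16 8-byte words per main memory access, the effective T1 per word drops to about 100/16, and the average access time to roughly 1 + (100/16)/16, about 1.4 cycles.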
For both parallel and sequential computers, the cost is accessing remote memories with some form of "communication" |
Data locality addresses this in both cases |
The differences are the quantitative size of the effect, and what is done by the user versus what is done automatically |
[Figure: distributed memory architecture -- several processors, each with its own main memory, joined by an interconnection network; access to local main memory is slow, and access across the network can be very slow] |
There are 4 approaches to Cache Coherence -- the difficulty of maintaining correct caches in a parallel environment |
1) Ignore the problem in hardware -- let the user and/or software cope with this chore -- this is the approach followed in machines like the T3E and SP2 and all explicit parallel programming models |
2) Snoopy Buses. This is the approach used in most SMP's, where the caches (at a given level) share a special bus also connected to memory. When a request is made in a given cache, it is broadcast on the bus, so that caches with a more recent value can respond |
3) Scalable Coherent Interface (SCI). This differs from the snoopy bus by using a fast serial connection which pipes requests through all processors. It is a standard developed by the high energy physics community. |
4) Directory Schemes. These have a directory on each processor which keeps track of which cache line is where and which is up to date. The directories on each node are connected and communicate with each other when a memory location is accessed |
[Chart: Linpack Gflops over time, comparing Cray vector machines with microprocessor-based MPPs] |
Even vector Crays became parallel: X-MP (2-4) Y-MP (8), C-90 (16), T94 (32) |
Since 1993, Cray has produced MPPs too: the T3D and T3E |
     Machine                 Mflops   Place                 Country  Year  # Procs
 1   Intel ASCI Red          1338000  Sandia National Lab   USA      1997   9152
 2   SGI T3E1200              891500  Classified            USA      1998   1084
 3   SGI T3E900               815100  Classified            USA      1997   1324
 4   SGI ASCI Blue Mountain   690900  LANL                  USA      1998   6144
 5   SGI T3E900               552920  UK Met Office         UK       1997    876
 6   IBM ASCI Blue Pacific    547000  LLNL                  USA      1998   3904
 7   IBM ASCI Blue Pacific    547000  LLNL                  USA      1998   1952
 8   SGI T3E1200              509900  UK Centre for Science UK       1998    612
 9   SGI T3E900               449000  NAVOCEANO             USA      1997    700
10   SGI T3E                  448600  NASA/GSFC             USA      1998   1084
(Parallel Vector) |
Shared Address Space or Shared Memory |
Explicitly parallel or Data Parallel |
Any processor can directly reference any memory location |
Convenient: |
Naturally provided on wide range of platforms |
Popularly known as shared memory machines or model
|
Process: virtual address space plus one or more threads of control |
Portions of address spaces of processes are shared |
Writes to shared address visible to other threads (in other processes too) |
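A minimal sketch of this model in code, assuming OpenMP directives as one concrete way of expressing it (the foils describe the model, not any particular API):

      PROGRAM SHAREDX
C     Several threads update one shared array A; each write to the
C     shared address space is visible to the other threads.
      INTEGER I, N
      PARAMETER (N = 8)
      REAL A(N)
C$OMP PARALLEL DO SHARED(A) PRIVATE(I)
      DO I = 1, N
         A(I) = REAL(I)
      END DO
C$OMP END PARALLEL DO
      PRINT *, A
      END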
Natural extension of uniprocessor's model: |
OS uses shared memory to coordinate processes |
Memory capacity increased by adding modules, I/O by controllers |
Communication is natural extension of uniprocessor |
Already have processor, one or more memory modules and I/O controllers connected by hardware interconnect of some sort |
"Mainframe" approach
|
"Minicomputer" approach
|
Problem is interconnect: cost (crossbar) or bandwidth (bus)
|
Dance-hall: bandwidth still scalable, but lower cost than crossbar
|
Distributed memory or non-uniform memory access (NUMA)
|
Caching shared (particularly nonlocal) data? |
Complete computer as building block, including I/O |
Programming model: |
High-level block diagram similar to distributed-memory SAS (Shared Address Space) |
Programming model more removed from basic hardware operations (?) |
Send specifies buffer to be transmitted and receiving process |
Receive specifies sending process and application storage to receive into |
Memory to memory copy, but need to name processes |
Optional tag on send and matching rule on receive |
User process names local data and entities in process/tag space too |
In simplest form, the send/recv match achieves a pairwise synchronization event; collective communication is also provided |
Many overheads: copying, buffer management, protection |
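A minimal sketch of the send/receive operations just listed, written with MPI as the assumed message passing library (the model is older and more general than MPI); run it on two processes, e.g. mpirun -np 2:

      PROGRAM SENDRECV
C     Process 0 sends a buffer to process 1; the matching receive
C     names the sender, the tag and the storage to receive into.
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, STATUS(MPI_STATUS_SIZE)
      DOUBLE PRECISION BUF(100)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      IF (RANK .EQ. 0) THEN
         CALL MPI_SEND(BUF, 100, MPI_DOUBLE_PRECISION, 1, 99,
     &                 MPI_COMM_WORLD, IERR)
      ELSE IF (RANK .EQ. 1) THEN
         CALL MPI_RECV(BUF, 100, MPI_DOUBLE_PRECISION, 0, 99,
     &                 MPI_COMM_WORLD, STATUS, IERR)
      END IF
      CALL MPI_FINALIZE(IERR)
      END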
Early machines such as the Caltech Hypercube and the first commercial Intel and nCUBE designs used a FIFO on each node to store and forward messages |
Diminishing role of topology in modern machines |
64 node n=6 hypercube in Caltech Computer Science Dept. |
Hypercube Topology for 8 machines |
See next foil for basic chip -- these were connected in a hypercube |
This and related transputer designs were very innovative but failed as they could not exploit commodity microprocessor design economies |
Made out of essentially complete RS6000 workstations |
Network interface integrated in I/O bus |
Most successful MIMD Distributed Memory Machine |
[Figure: IBM SP-2 node -- Power 2 CPU with L2 cache, memory controller and 4-way interleaved DRAM; the NIC sits on the 64-bit, 50 MHz memory bus; nodes are joined by a general interconnection network formed from 8-port switches] |
[Figure: Intel Paragon node -- two i860 processors, each with an L1 cache, plus network interface (NI), DMA, driver, memory controller and 4-way interleaved DRAM; links are 8 bits wide, 175 MHz and bidirectional, forming a 2D grid network with a processing node attached to every switch. Photo: Sandia's Intel Paragon XP/S-based supercomputer] |
2D Grid Network with Processing Node Attached to Every Switch |
Powerful i860 node made this first "serious" MIMD distributed memory machine |
PCcube using serial ports on 80286 machines as REU Undergraduate Project 1986 |
Naegling at Caltech with Tom Sterling and John Salmon 1998 120 Pentium Pro Processors |
Beowulf at Goddard Space Flight Center |
NCSA measured some single processor results on HP Kayak PC with 300 MHz Intel Pentium II. The compiler is a Digital compiler with optimization level of 4. For comparison, also included are results from the Origin 2000. This is a CFD Application |
http://www.ncsa.uiuc.edu/SCD/Perf/Tuning/sp_perf/ |
In general the Origin is twice as fast. For the HP Kayak there is a sharp decline going from 64x64 to 128x128 matrices, while on the Origin the decline is more gradual and it usually becomes memory bound beyond 256x256. This is a result of the smaller cache on the Intel chip. |
CRAY-1 supercomputer |
Cray's first supercomputer. Introduced in 1976, this system had a peak performance of 133 megaflops. The first system was installed at Los Alamos National Laboratory. |
CRAY T90 systems are available in three models: the CRAY T94 system, offered in air- or liquid-cooled versions, which scales up to four processors; |
the CRAY T916 system, a liquid-cooled system that scales up to 16 processors; and the top-of-the-line CRAY T932 system, also liquid-cooled, |
with up to 32 processors and a peak performance of over 60 gigaflops |
System                Memory latency [ns]   Clock speed [ns]   Ratio   FP ops per clock   FP ops to cover memory latency
CDC 7600                      275                 27.5           10           1                    10
CRAY 1                        150                 12.5           12           2                    24
CRAY X-MP                     120                  8.5           14           2                    28
SGI Power Challenge          ~760                 13.3           57           4                   228
CRAY T3E-900                 ~280                  2.2          126           2                   252
This and following foils from Performance of the CRAY T3E Multiprocessor by Anderson, Brooks, Grassi and Scott at http://www.cray.com/products/systems/crayt3e/1200/performance.html |
Air cooled T3E |
T3E Torus Communication Network |
T3E Node with Digital Alpha Chip |
This is innovative as it supports "get" and "put", where the memory controller converts an off-processor memory reference into a message, with greater convenience of use |
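A sketch of what a one-sided "put" looks like in code, assuming Cray's SHMEM library as the interface to these get/put operations; the routine names and argument order here are written from memory and should be checked against the T3E documentation:

      PROGRAM PUTDEMO
C     Assumed interface: SHMEM_PUT64(dest, source, nelems, pe) copies
C     nelems 64-bit words from the local SOURCE into DEST on remote
C     processing element PE with no action by the remote processor.
      INCLUDE 'mpp/shmem.fh'
      INTEGER ME, I
      REAL*8 SRC(8), DEST(8)
C     DEST must be "symmetric" (same address on every PE); SAVE
C     arranges this here.
      SAVE DEST
      ME = MY_PE()
      DO I = 1, 8
         SRC(I) = ME + 0.1D0*I
      END DO
      IF (ME .EQ. 0) THEN
         CALL SHMEM_PUT64(DEST, SRC, 8, 1)
      END IF
      CALL SHMEM_BARRIER_ALL()
      END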
Each CRAY T3E processor contains an 8 KB direct-mapped primary data cache (Dcache), an 8 KB instruction cache, and a 96 KB 3-way associative secondary cache (Scache) which is used for both data and instructions. |
The Scache has a random replacement policy and is write-allocate and write-back, meaning that a cacheable store request to an address that is not in the cache causes that address to be loaded into the Scache, then modified and tagged as dirty for write-back later. |
Write-back of dirty cache lines occurs only when the line is removed from the Scache, either by the Scache controller to make room for a new cache line, or by the back-map to maintain coherence of the caches with the local memory and/or registers. |
Peak data transfer rates on the CRAY T3E-900 |
Type of access             Latency [CPU cycles]   Bandwidth [MB/s]
Dcache load                         2                   7200
Scache load                        8-10                 7200
Dcache or Scache store              --                  3600
These rates correspond to the maximum instruction issue rate of two loads per CPU cycle or one store per CPU cycle. |
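As a consistency check using only numbers quoted in these foils: the T3E-900 clock period is about 2.2 ns (roughly 450 MHz), so two 8-byte loads per cycle give about 16 bytes / 2.2 ns, i.e. roughly 7200 MB/s, and one 8-byte store per cycle gives about half that, roughly 3600 MB/s.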
From http://www.cray.com/products/systems/crayt3e/1200/performance.html |
REAL*8 AA(513,513), DD(513,513) |
REAL*8 X (513,513), Y (513,513) |
REAL*8 RX(513,513), RY(513,513) |
DO J = 2,N-1 |
DO I = 2,N-1 |
XX = X(I+1,J)-X(I-1,J) |
YX = Y(I+1,J)-Y(I-1,J) |
XY = X(I,J+1)-X(I,J-1) |
YY = Y(I,J+1)-Y(I,J-1) |
A = 0.25 * (XY*XY+YY*YY) |
B = 0.25* (XX*XX+YX*YX) |
C = 0.125 * (XX*XY+YX*YY) |
Continued on Next Page |
AA(I,J) = -B |
DD(I,J) = B+B+A*REL |
PXX = X(I+1,J)-2.*X(I,J)+X(I-1,J) |
QXX = Y(I+1,J)-2.*Y(I,J)+Y(I-1,J) |
PYY = X(I,J+1)-2.*X(I,J)+X(I,J-1) |
QYY = Y(I,J+1)-2.*Y(I,J)+Y(I,J-1) |
PXY = X(I+1,J+1)-X(I+1,J-1)-X(I-1,J+1)+X(I-1,J-1) |
QXY = Y(I+1,J+1)-Y(I+1,J-1)-Y(I-1,J+1)+Y(I-1,J-1) |
RX(I,J) = A*PXX+B*PYY-C*PXY |
RY(I,J) = A*QXX+B*QYY-C*QXY |
END DO |
END DO |
Continued from Previous Page |
The inner loop of this kernel has 47 floating point operations, 18 array reads and 4 array writes. |
9 point stencil |
Since all six arrays are the same size and are accessed at the same rate, we can ensure that they do not interfere with each other if they do not conflict initially. For this example, it is convenient to optimize for only one 4096-word set of the 3-way set associative Scache. |
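As a rough check on the arithmetic above: 47 floating point operations against 18 + 4 = 22 array references is only about 2.1 flops per reference, far below the 252 "FP ops to cover memory latency" quoted earlier for the T3E-900, which is one way to see why keeping all six arrays resident in the Scache matters so much for this kernel.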
The system comes in two versions, a deskside or a rack system. |
The deskside version has 1 to 4 node cards (up to 8 CPUs). |
The rack system has 1 to 64 node cards, for a total of 2 to 128 CPUs. |
Each node is based on the 64-bit MIPS RISC R10000 architecture. |
Also, each node has two primary caches (each 32KB two-way set-associative) and one secondary L2 cache (1 or 4MB two-way set associative) per CPU. |
Each node has hardware cache coherency using a directory system and a maximum bandwidth of 780MB/sec. |
The entire system (Cray Origin2000) has up to 512 such nodes, that is, up to 1024 processors. |
The SysAD (system address and data) bus of the previous figure, connecting the two processors, has a peak bandwidth of 780 MB/sec (the node bandwidth quoted above). |
The Hub's connections to the off-board net router chip and Xbow I/O interface are 1.56 GB/s each. |
RIEMANN is a general-purpose, higher-order accurate, Eulerian gas dynamics code based on Godunov schemes. Dinshaw Balsara |
Laplace: the solution of sparse linear systems resulting from Navier-Stokes - Laplace equations. Danesh Tafti |
QMC (Quantum Monte Carlo) Lubos Mitas |
PPM (Piece-wise Parabolic Method) |
MATVEC (Matrix-Vector Multiply) Dave McWilliams |
Cactus Numerical Relativity Ed Seidel |
Suppose two processors cache the same variable stored in memory of one of the processors |
One must ensure cache coherence so that when one cache value changes, all do! |
[Figure: two boards, each holding processors with caches, an L3 cache and main memory on a board-level interconnection network, joined by a system interconnection network; both boards hold a cached value of the same shared variable] |
Machines like the SGI Origin 2000 have a distributed shared memory with a so called directory implementation (pioneered in DASH project at Stanford) of cache coherence |
Machines like SGI Cray T3E are distributed memory but do have fast get and put so as to be able to access single variables stored on remote memories |
Origin 2000 approach does not scale as well as Cray T3E and large Origin 2000 systems must use message passing to link 128 node coherent memory subsystems |
Cray T3E offers a uniform (but not as good for small number of nodes) interface |
Pure Message passing / distributed memory is natural web model |
Commodity SMP |
All coherence and multiprocessing glue in processor module |
Highly integrated, targeted at high volume |
Low latency and bandwidth |
[Figure: Intel Pentium Pro SMP -- several P-Pro modules, each containing a CPU, bus interface and MIU with a 256-KB L2 cache, sharing the P-Pro bus (64-bit data, 36-bit address, 66 MHz); also on the bus are an interrupt controller, two PCI bridges leading to PCI buses with PCI I/O cards, and a memory controller driving 1-, 2-, or 4-way interleaved DRAM] |
It is perhaps the most successful SMP (Symmetric Multiprocessor), with applicability to both the commercial and scientific markets |
Its SMP characteristics are seen in its low, uniform latency |
One cannot build huge E10000 systems (only up to 64 processors, and each processor is slower than the Origin's) |
One should cluster multiple E10000's to build a supercomputer |
These are very successful in commercial server market for database and related applications |
E10000 acquired from Cray via SGI is also called Starfire |
TPCD Benchmarks 1000Gbyte(Terabyte) November 1998 |
These measure typical large scale database queries |
System               Database   Power Metric (QppD)   5 Year Total System Cost   $/perf ($/QphD)
Sun Starfire (SMP)   Oracle          27,024.6               $9,660,193                 $776
IBM RS/6000 (MPP)    DB2             19,137.5               $11,380,178                $797
Sun 4x6000 (Clus)    Informix        12,931.9               $11,766,932                $1,353
NCR 5150 (MPP)       Teradata        12,149.2               $14,495,886                $2,103
The Starfire server houses a group of system boards interconnected by a centerplane. |
On each system board there are up to four 336 MHz UltraSPARC microprocessor modules, with a supporting two-level, 4 Mbyte cache per module (64 per Starfire system). |
There can be four memory banks with a capacity of up to 4 Gbytes per system board (64 Gbytes per Starfire server). |
There are two SBuses per board, each with slots for up to two adapters for networking and I/O (32 SBuses or 64 slots per system). |
This is aimed at commercial customers where Origin 2000 is not a major player |
The Starfire scales to smaller sizes than machines like the IBM SP2 and Cray T3E but significantly larger sizes than competing SMP's |
Results from Samson Cheung of NAS (Ames) |
Origin 2000: mpirun -np 2 ./lbw -B -n 4000 -D |
Total transfer time: 57.281 sec. |
Transfer rate: 69.8 transfers/sec. |
Message size: 1000000 bytes |
Bandwidth: 139.664 10e+06 Bytes/sec |
Sun E10000: tmrun -np 2 ./lbw -B -n 4000 -D |
Total transfer time: 54.487 sec. |
Transfer rate: 73.4 transfers/sec. |
Message size: 1000000 bytes |
Bandwidth: 146.825 10e+06 Bytes/sec |
Origin 2000: mpirun -np 2 ./lbw -L -n 2500000 -D |
Total transfer time: 52.597 sec. |
Transfer rate: 47531.4 transfers/sec. |
Message size: 40 bytes |
Latency: 10.5 microsec. |
Sun E10000: tmrun -np 2 ./lbw -L -n 2500000 -D |
Total transfer time: 34.585 sec. |
Transfer rate: 72286.7 transfers/sec. |
Message size: 40 bytes |
Latency: 6.9 microsec. |
Note: Origin CPU about twice performance of Sun CPU |
This uses a clever idea developed over many years by Burton Smith who originally used it in the Denelcor system which was one of the first MIMD machines over 15 years ago |
MTA (multithreaded architectures) are designed to hide the mismatch between memory access time and CPU cycle time. |
First 2 processor system |
Tera computer system is a shared memory multiprocessor. |
The Tera is a multi-processor which potentially can accommodate up to 256 processors. |
The clock speed is nominally 333 MHz, giving each processor a data path bandwidth of one billion 64-bit results per second and a peak performance of one gigaflops. |
The Tera processors are multithreaded (each hardware thread is called a stream), and each processor switches context every cycle among as many as 128 hardware threads, thereby hiding up to 128 cycles (384 ns) of memory latency. |
Each processor executes a 21 stage pipeline and so can have 21 separate streams executing simultaneously |
Each stream can issue as many as eight memory references without waiting for earlier ones to finish, further augmenting the memory latency tolerance of the processor. |
A stream implements a load-store architecture with three addressing modes and 31 general-purpose 64-bit registers. |
The peak memory bandwidth is 2.67 gigabytes per second. |
From Bokhari (ICASE) |
From Bokhari (ICASE) |
The interconnection net is a sparsely populated 3-D packet-switched network containing p^(3/2) nodes, where p is the number of processors. |
These nodes are toroidally connected in three dimensions to form a p^(1/2)-ary three-cube, and processor and memory resources are attached to some of the nodes. |
The latency of a node is three cycles: a message spends two cycles in the node logic proper and one on the wire that connects the node to its neighbors. |
A p-processor system has worst-case one-way latency of 4.5p^(1/2) cycles. |
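As a worked example using numbers quoted elsewhere in these foils (up to 256 processors, 3 ns clock cycle): a 256-processor system has 256^(3/2) = 4096 network nodes, and the worst-case one-way latency is 4.5 x 256^(1/2) = 72 cycles, or about 216 ns.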
Messages are assigned random priorities and then routed in priority order. Under heavy load, some messages are derouted by this process. The randomization at each node ensures that each packet eventually reaches its destination. |
A node has four ports (five if a resource is attached). |
Each port simultaneously transmits and receives an entire 164-bit packet every 3 ns clock cycle. |
Of the 164 bits, 64 are data, so the data bandwidth per port is 2.67 GB/s in each direction. |
The network bisection bandwidth is 2.67p GB/s. (p is number of processors) |
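As a check on the per-port figure: 64 data bits is 8 bytes per 3 ns clock cycle, which is about 2.67 GB/s in each direction, consistent with the bandwidth quoted above.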
The network routing nodes contain no buffers other than those required for the pipeline. |
From Allan Snavely at SDSC |
The overall hardware configuration of the system: |
Processors 16 64 256 |
Peak Gflops 16 64 256 |
Memory, Gbytes 16-32 64-128 256-512 |
HIPPI channels 32 128 512 |
I/O, Gbytes/sec 6.2 25 102 |
Relative MTA and T90 Performance |
From Allan Snavely at SDSC |
[Chart: NAS Benchmark results at 255 MHz and 300 MHz] |
KSR-1 (Colorado) |
Cache Only Machines "only" have cache and are typified by the Kendall Square KSR-1,2 machines. Although this company is no longer in business, the basic architecture is interesting and could still be used in future important machines |
In this class of machine one has a NUMA architecture, with the memory attached to a given processor having a lower access cost |
In simplest COMA architectures, think of this memory as the cache and when needed migrate data to this cache |
However in conventional machines, all data has a natural "original home"; in (simple) COMA, the home of the data moves when it is accessed and one hopes that data is "attracted" through access to "correct" processor |
Maspar (DECmpp) SIMD Machine at NPAC 1995. We had two such machines, with 8K and 16K nodes respectively |
For a short time Digital resold the Maspar as the DECmpp |
ICL DAP 4096 Processors circa 1978 |
Proving ground and driver for innovative architecture and techniques |
Large-scale multiprocessors replace vector supercomputers |
Evolution and role of software have blurred boundary |
Hardware organization converging too |
Even clusters of workstations/SMPs are parallel systems |
Programming models distinct, but organizations converging |
Node: processor(s), memory system, plus communication assist |
Scalable network |
Convergence allows lots of innovation, now within framework |
A generic modern multiprocessor (Culler) |
Disk Vault |
GRAPE 4 1.08 Teraflops |
GRAPE 6 200 Teraflops |
Here the case for special purpose machines is less compelling than for GRAPE as QCD is "just" regular (easy to parallelize; lowish communication) and extremely floating point intensive. |
We illustrate with two machines which are classic MIMD distributed memory architectures with optimized nodes and communication networks |
BNL/Columbia QCDSP 400 Gigaflops |
Univ. of Tsukuba CP-PACS (general physics) 600 Gigaflops |
Full System by Artist and camera |
ASCI Red Interconnect |