Given by Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science in Fall Semester 1995. Foils prepared 21 October 1995
Geoffrey Fox |
Syracuse University |
NPAC 3-131 CST, 111 College Place |
Syracuse NY 13244 |
Parallel Computer and Network Architecture |
Overview of Issues including synchronization, granularity and 3 classes of architectures |
More details on networks |
More details on system architectures |
MIMD Distributed Memory |
MIMD with logically shared memory but perhaps physically distributed. The latter is sometimes called distributed shared memory. |
SIMD with logically distributed or shared memory |
One interesting hardware architecture .........
|
Also have heterogeneous compound architecture (metacomputer) obtained by arbitrary combination of the above.
|
This list stems from 1990 |
Two critical Issues are: |
Memory Structure
|
and Heterogeneous mixtures |
Control and synchronization |
SIMD -- lockstep synchronization |
MIMD -- synchronization can take several forms
|
An implied constraint on the ordering of operations.
|
Synchronous: Objects (processors, network, tasks) proceed together in lockstep fashion |
Asynchronous: Objects may proceed at somewhat different speeds, sometimes with occasional synchronization points (barriers) to allow slower objects to catch up. |
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors. Pieces and processors may be heterogeneous. Computation/communication ratio very high -- typical of Networked Metacomputing |
Medium-grain: Tens to a few thousand pieces, typically executed by microprocessors. Processors typically run the same code (SPMD style). Computation/communication ratio often hundreds or more -- typical of MIMD Parallel Systems such as the SP2, CM-5, Paragon, T3D |
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines. Processors typically have instructions broadcast to them. Computation/communication ratio often near unity -- typical of SIMD but seen in a few MIMD systems such as Dally's J Machine or commercial Myrinet (Seitz) (see the sketch below) |
Note that a machine of one type can be used on algorithms of the same or finer granularity |
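As an illustration of how grain size controls the computation/communication ratio used in the classification above -- a rough model of a two-dimensional grid problem, with illustrative numbers, not from the foils:

```python
# Rough model (illustrative): computation/communication ratio for a 2D grid
# problem split into square n x n subdomains, one subdomain per processor.
def comp_comm_ratio(n, flops_per_point=5, words_per_edge_point=1):
    """Each processor updates n*n points (a 5-point stencil is assumed)
    and must exchange only the 4n boundary points with its neighbours."""
    computation = flops_per_point * n * n         # work done locally
    communication = words_per_edge_point * 4 * n  # data crossing the boundary
    return computation / communication

for n in (4, 32, 1000):   # fine, medium and coarse grains
    print(f"n = {n:5d}  ratio ~ {comp_comm_ratio(n):.0f}")
```

The ratio grows linearly with the grain size n, which matches the fine/medium/coarse categories above.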
Shared (Global): There is a global memory space, accessible by all processors. Processors may also have some local memory. Algorithms may use global data structures efficiently. However "distributed memory" algorithms may still be important as memory is NUMA (Nonuniform access times) |
Distributed (Local, Message-Passing): All memory is associated with processors. To retrieve information from another processor's memory a message must be sent there. Algorithms should use distributed data structures. |
Uniform: All processors take the same time to reach all memory locations. |
Nonuniform (NUMA): Memory access is not uniform so that it takes a different time to get data by a given processor from each memory bank. This is natural for distributed memory machines but also true in most modern shared memory machines
|
Most NUMA machines these days have two memory access times
|
Classes of networks include: |
Bus: All processors (and memory) connected to a common bus or busses.
|
Switching Network: Processors (and memory) connected to routing switches as in the telephone system.
|
Useful terms include: |
Scalability: Can network be extended to very large systems? Related to wire length (synchronization and driving problems), degree (pinout) |
Fault Tolerance: How easily can system bypass faulty processor, memory, switch, or link? How much of system is lost by fault? |
Blocking: Some communication requests may not get through, due to conflicts caused by other requests. |
Nonblocking: All communication requests succeed. Sometimes just applies as long as no two requests are for same memory cell or processor. |
Latency (delay): Maximal time for nonblocked request to be transmitted. |
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on time needed for task. |
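A small sketch of how a cutset gives such a lower bound (the data size and cut bandwidth here are invented for illustration): if D bytes must cross a cut of total bandwidth B, the task needs at least D/B seconds, whatever the algorithm.

```python
# Illustrative cutset lower bound (assumed numbers, not machine data):
# if D bytes must cross a cut whose links have total bandwidth B,
# then time >= D / B.
def cut_lower_bound(bytes_crossing_cut, cut_bandwidth_bytes_per_s):
    return bytes_crossing_cut / cut_bandwidth_bytes_per_s

# Example: transposing a 1 GByte matrix -- half the data must cross the
# bisection, whose total bandwidth is taken to be 400 MBytes/sec.
data_crossing = 0.5e9     # bytes forced through the cut
bisection_bw  = 400e6     # bytes/sec summed over the cut links (assumed)
print(f"lower bound = {cut_lower_bound(data_crossing, bisection_bw):.2f} sec")
```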
Switch |
Bus |
Processors directly connected to only certain other processors, and must go multiple hops to get to additional processors. Also called store-and-forward. |
Usually distributed memory |
Examples:
|
Usually NUMA, nonblocking, scalable, upgradable |
Processor connectivity is modeled by a graph in which nodes represent processors and edges represent the connections between them. Much theoretical work was done here, but it is now largely irrelevant: even if this hardware is used, pipelining masks the transmission time, which ends up showing no term proportional to the distance travelled |
Two dimensional grid, Binary tree, complete interconnect and 4D Hypercube. |
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are partial |
A mesh seems to have a poor diameter, but Dally and Seitz emphasize that this can be made up because a mesh is easy to build physically (circuits are two dimensional). Thus it can have fewer links, but as these are physically short they can be "thick" (very high bandwidth) |
The hypercube is a compromise in which both degree and diameter grow modestly. High degree makes the machine hard to build -- low diameter improves communication |
A tree has a bottleneck at the root (top). A fat tree addresses this, as in the CM-5, by increasing the bandwidth of the links as one goes up the tree |
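A minimal sketch of the standard degree and diameter figures behind this comparison, for P processors (my own summary; the mesh is assumed square without wrap-around, and the tree has its processors at the leaves as in the CM-5):

```python
import math

# Degree (links per node) and diameter (worst-case hops) for the networks
# discussed above, as functions of the number of processors P.
def ring(P):        return 2, P // 2
def mesh_2d(P):     s = int(round(math.sqrt(P))); return 4, 2 * (s - 1)
def hypercube(P):   d = int(round(math.log2(P))); return d, d
def binary_tree(P): h = int(round(math.log2(P))); return 3, 2 * h  # P leaves

P = 64
for name, f in [("ring", ring), ("2D mesh", mesh_2d),
                ("hypercube", hypercube), ("binary tree", binary_tree)]:
    degree, diameter = f(P)
    print(f"{name:11s} degree {degree:2d}  diameter {diameter:2d}")
```

For P = 64 this gives the familiar trade-off: the mesh has low degree but large diameter, the hypercube balances the two, and the tree has low degree but a root bottleneck.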
Hardware may handle only single hop, or multiple hops as in routing chips on Paragon. |
Software may mask hardware limitations so one sees full connectivity even if the physical connections are limited. Note that multiple hops always lead to poorer latency, as this is the travel time per bit. However, we can keep bandwidth high even with multiple hops by increasing the "size" of channels, e.g. transmitting several bits simultaneously. Software can hide this.
|
Latency related to graph diameter, among many other factors |
Graph may also represent calculations that need to be performed, and the information exchange required. |
Transmission Time for message of n bytes: |
T(n) = T0 + T1 * n, where |
T0 is the latency; it contains a term proportional to the number of hops, plus a term representing the interrupt processing time at each end and the time for the communication network and processor to synchronize |
T0 = Ts + Td * (number of hops) |
T1 is the inverse bandwidth -- it can be made small if the pipe (channel) is wide |
In practice Ts and T1 are most important and Td is unimportant |
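A minimal sketch of this timing model; the parameter values below are invented for illustration, not measurements of any machine:

```python
# Message transmission time model from the foil above:
#   T(n) = T0 + T1 * n,   with   T0 = Ts + Td * hops
# Ts = startup/interrupt overhead, Td = per-hop delay, T1 = inverse bandwidth.
def transmission_time(n_bytes, hops, Ts=50e-6, Td=0.5e-6, T1=1.0 / 40e6):
    """Illustrative parameters: 50 microsec startup, 0.5 microsec per hop,
    40 MBytes/sec bandwidth -- chosen only to show the shape of the model."""
    T0 = Ts + Td * hops
    return T0 + T1 * n_bytes

for n in (8, 1_000, 100_000):   # short, medium and long messages (bytes)
    print(f"{n:7d} bytes: {transmission_time(n, hops=5) * 1e6:9.1f} microsec")
```

For the short message the startup term Ts dominates, for the long one the bandwidth term T1*n dominates, and the per-hop contribution Td*hops stays negligible throughout -- which is the point made above.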
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
Square blocks indicate shared memory copy performance |
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
Note that processor networks such as a torus or hypercube can be used to build switches when one puts not processors but "just" switch chips on the internal nodes. |
Thus switches involve the same trade-offs -- store-and-forward versus circuit-switched -- as processor networks. |
Switch based machines have typically the same delay (travel time) between any two processors |
In processor networks, some nodes can be nearer to each other if fewer hops separate them |
BUT in all modern machines, low-level hardware essentially makes all these architectures the SAME. There are only two times of importance, corresponding to whether or not there is DATA LOCALITY
|
Rather than have messages travel a single hop at a time, sometimes a circuit is first established from sender to receiver, and then the data transmitted along the circuit. |
Similar to establishing a phone connection |
Can result in significantly lower communication overhead, as the latency is deterministic once the circuit is established |
If circuit blocked, many options possible, including
|
At high levels of message traffic, performance sometimes severely degraded. |
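A hedged sketch contrasting the two approaches (parameter values again invented for illustration): in store-and-forward the whole message pays the cost at every hop, while once a circuit is established only a small per-hop delay is added before the data is streamed.

```python
# Illustrative comparison (assumed parameters, not measurements).
Ts, Td, T1 = 50e-6, 0.5e-6, 1.0 / 40e6   # startup, per-hop delay, 1/bandwidth

def store_and_forward(n_bytes, hops):
    return hops * (Ts + T1 * n_bytes)     # full message retransmitted per hop

def circuit_switched(n_bytes, hops):
    return Ts + Td * hops + T1 * n_bytes  # circuit set up once, then streamed

n, hops = 10_000, 8
print(f"store-and-forward: {store_and_forward(n, hops) * 1e6:8.1f} microsec")
print(f"circuit-switched : {circuit_switched(n, hops) * 1e6:8.1f} microsec")
```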
Many issues both between architectures and internal to each architecture |
The situation is confused as one needs to understand:
|
Is a particular implementation done right?
|
Perhaps all that is important is what either the user or high-level software (compiler/message-passing system) sees. |
A significant problem is that there is essentially no application software and essentially no machines have been sold |
Real-world architectural issue -- what problems will produce substantial sales; should one optimize for these?
|
System integration issues
|
Why is MPP always behind the best workstations?
|
Must reduce delay n by matching parallel design as much as possible to
|
Memory per node (SIMD and MIMD)
|
A special case of this is a network of workstations (NOWs) or personal computers |
Issues include:
|
Choice of Node is a dominant issue?
|
Network Topology as described is not very important today?
|
However we still have two major types of network:
|
CM-2 -- 64K processors with 1-bit arithmetic -- hypercube network, a broadcast network that can also combine, and a "global or" network |
Maspar, DECmpp -- 16K processors with 4-bit (MP-1) or 32-bit (MP-2) arithmetic, two-dimensional mesh and general switch |
Execube -- 16-bit processors with 8 integrated into an IBM 4-Mbit memory chip, SIMD or MIMD or both |
512 processors on IBM RS6000 with three dimensional mesh |
Node: 1-bit, 4-bit, 16-bit, or 32-bit? |
Interconnect issues similar to MIMD |
Critical point -- is SIMD more cost effective than MIMD on a significant number of problems
|
The current consensus is the opposite of that of 1990 -- namely, MIMD now dominates.
|
Some machines have a "true" shared memory, i.e. every location is "equally accessible" by each node: the original Ultracomputer, Burton Smith's Tera, and some bus-based systems |
Others, such as the BBN Butterfly and Kendall Square KSR-2, have non-uniform access times but all memory locations are accessible |
Each C.P.U. has two pipes; each pipe does one add and one multiply per clock period; cycle time 4 nanoseconds |
Each C.P.U. has up to 4 functional units; cycle time 2.9 nanoseconds |
Note MIMD parallelism "larger" than SIMD as MIMD size reflects number of grid points, particles, etc. |
Expected Architectures of Future will be:
|
Essentially all problems run efficiently on a distributed memory BUT |
Software is easier to develop on a shared memory machine |
Some Shared Memory Issues:
|
Only Superconducting Technology can possibly do better |
Need Optical Interconnects probably |
See Chapter 5 of Petaflops Report -- July 94 |
See Chapter 5 of Petaflops Report -- July 94 |
See Chapter 6 of Petaflops Report -- July 94 |
See Chapter 6 of Petaflops Report -- July 94 |
This shared memory design is the natural evolution of systems such as the Cray-2, Cray-3, or Cray C-90 |
See Chapter 6 of Petaflops Report -- July 94 |
This architecture generalizes current IBM SP-2 type systems and, unlike Category I, requires data locality for the up to 40,000 CPU's to be able to function efficiently with minimum communication overhead |
See Chapter 6 of Petaflops Report -- July 94 |
This design is an extrapolation of systems such as the J Machine (Dally), Execube (Loral) or Mosaic (Seitz). It features CPU and memory integrated on the chip (PIM). |
Unlike such systems today, in the year 2015 such PIM designs would have substantial memory per processor |
This is the formal latency needed to reach the design goals of the previous table. 1 picosecond (Category I) cannot be attained; you solve this by having N concurrent streams -- then each of them needs a latency of only N picoseconds |
e.g. if the output rate of an arithmetic unit is 1 per nanosecond, then each of the 400 teraflop CPU's in the Category I design must have 1000 arithmetic units running at full speed to reach the full system performance of 0.4 petaflops |
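As a worked check of that arithmetic (simply restating the numbers above):

```latex
\underbrace{400}_{\text{CPUs}} \times
\underbrace{1000}_{\text{units per CPU}} \times
\underbrace{10^{9}\ \tfrac{\text{results}}{\text{sec}}}_{\text{1 per nanosecond}}
 \;=\; 4\times 10^{14}\ \text{flops} \;=\; 0.4\ \text{petaflops}
```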
See Chapter 6 of Petaflops Report -- July 94 |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |
A hypercube is 2^d computers arranged on the corners of a cube in d dimensions, each connected to its d nearest neighbors |
Label a hypercube node e1 e2 e3 e4 ... ed |
where each ek takes the value 0 or 1 depending on whether the vertex is at the "top" or "bottom" of that axis. |
Think of hypercube as unit cube in d dimensions with origin at bottom left hand corner |
In string e1 e2 e3 e4.......ed, one can write d=d1+d2+d3 |
Any set of d1 binary indices can be mapped into a one dimensional line with periodic (wrap-around) connections |
the remaining d2+d3 indices give the other two dimensions |
This decomposes hypercube into a mesh with:
|
So Hypercube includes lots of meshes |
e.g. a d=6 hypercube contains 4 by 4 by 4, 8 by 8, 4 by 16, 64 by 1 meshes, etc. |
So can study one dimension without loss of generality! |
d= 1 -- 2 nodes |
d= 2 -- 4 nodes |
d= 3 -- 8 nodes |
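A minimal sketch of the binary labelling just described: a node's d neighbors differ from it in exactly one bit, and the binary-reflected Gray code embeds a wrap-around one-dimensional ring of the kind discussed above (the function names are mine).

```python
# Hypercube node labels are d-bit strings e1 e2 ... ed; flipping one bit
# gives each of the d nearest neighbours.
def neighbours(node, d):
    return [node ^ (1 << k) for k in range(d)]

# Binary-reflected Gray code: consecutive ring positions differ in one bit,
# so a 2^d-node ring (with wrap-around) maps directly onto hypercube links.
def gray(i):
    return i ^ (i >> 1)

d = 3                                   # 2^3 = 8 nodes
print("neighbours of 000:", [format(n, f"0{d}b") for n in neighbours(0, d)])
ring = [gray(i) for i in range(2 ** d)]
print("ring embedding   :", [format(n, f"0{d}b") for n in ring])
# Check: each consecutive pair (including the wrap-around) is one bit apart.
assert all(bin(ring[i] ^ ring[(i + 1) % len(ring)]).count("1") == 1
           for i in range(len(ring)))
```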
So in Hadrian's wall we saw a one dimensional layout of processors (people) to solve a one dimensional problem |
However computers need to be laid out in two or three dimensional worlds and we can use same ideas. |
We consider one dimension, as the discussion in two or three dimensions is similar: each axis (dimension) corresponds to a distinct subset of the binary indices |
Hypercube: |
Mesh with equivalent wire length |
So can let long hypercube wires have "local" stops and hypercube becomes a "thick mesh" |
With good routing technologies, a thick mesh is better than a hypercube? |
In the diagram below we can choose the staging point X to be:
|
Conversely, the "processor" always routes messages in a hypercube. This routing element can be attached to the CPU, as in a hypercube, or be an uncoupled processor -- routing chip pair, as in a switch
|
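As a sketch of how such a routing element can decide where to send a message, here is dimension-ordered (e-cube) routing, the standard scheme for hypercubes; it is my illustration, not necessarily the exact rule in the diagram:

```python
# Dimension-ordered (e-cube) routing on a hypercube: at each step flip the
# lowest-numbered bit in which the current node differs from the destination.
def ecube_route(src, dst):
    path, node = [src], src
    while node != dst:
        diff = node ^ dst          # bits still to be corrected
        bit = diff & -diff         # lowest such bit
        node ^= bit                # move along that dimension (one hop)
        path.append(node)
    return path

# Example on a d=4 hypercube: route from node 0101 to node 1110.
print([format(n, "04b") for n in ecube_route(0b0101, 0b1110)])
```

Each step flips exactly one bit, so every hop follows a real hypercube link, and the number of hops equals the number of bits in which source and destination differ.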
Hierarchical Hypercube -- Spatial and Temporal Decomposition |
Homogeneous Hypercube -- Purely Spatial Decomposition |
Cannot be manipulated independently |
Consider "nodes" = printed circuit boards (Paragon) or bunch of printed circuit boards sharing a BUS (SGI Challenge) where the N nodes are connected as a distributed memory architecture |
Inside each node, may have n=2,4,8 .... processors sharing a single memory. |
Should one use this shared memory or just consider total machine as distributed with Nn nodes? |
This is seen today (1995) as a network of SGI Challenge systems (i.e. a metacomputer where a conventional HIPPI network joins a set of shared memory machines) or as a specialized torus network connecting Paragon nodes consisting of multiple i860's |
Now we consider an example of a real system - the Intel Delta Touchstone |
This was perhaps the first real parallel supercomputer used in serious work |
It is already obsolete and has been replaced by a large Intel Paragon, which has a similar architecture |
Computational Characteristics
|
System Characteristics
|
Physical Characteristics
|
Communications Characteristics
|
513 Compute Nodes; 30.8 GFlops (Double-precision Peak Performance) configured as 16x32 Compute Mesh +1 Compute Node |
8.2 GBytes Memory |
95 GBytes Disk Space |
513 compute nodes using i860 processors with 16 MBytes memory each |
9 service nodes using 80386 processors with 8MBytes memory |
32 disk I/O nodes using 80386 processors with 8 MBytes memory, each controlling two 1.5-GByte disks |
9 tape I/O nodes using 80386 processors with 8 MBytes memory
|
2 ethernet nodes using 80386 processors with 8 MBytes memory |
2 HIPPI interfaces using i860 processors with 32 MBytes memory |
RISC engine + 2 FPUs + graphics accelerator |
40 MHz clock |
Parallel integer and floating-point processing units |
Parallel multiplier and adder units within floating-point unit |
Pipelined floating-point processing units (3-stage pipe) |
80 MFlops (SP) Peak, 60 MFlops (DP) Peak, 33 MIPS (Integer) |
all transcendental functions done in software |
16 MBytes Main Memory (expandable to 64 MBytes) |
160 MBytes/sec peak DRAM access rate |
Very hard to build good compilers for i860! |
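As a check that the aggregate figures quoted earlier follow from the per-node numbers on these foils (my arithmetic):

```python
# Aggregate Delta figures recomputed from the per-node numbers quoted above.
nodes = 513
peak_dp_mflops_per_node = 60          # double-precision peak per i860 node
memory_mbytes_per_node  = 16
disk_nodes, disks_per_node, gbytes_per_disk = 32, 2, 1.5

print(f"peak   : {nodes * peak_dp_mflops_per_node / 1000:.1f} GFlops")  # ~30.8
print(f"memory : {nodes * memory_mbytes_per_node / 1000:.1f} GBytes")   # ~8.2
print(f"disk   : {disk_nodes * disks_per_node * gbytes_per_disk:.0f} GBytes")
```

The peak and memory totals reproduce the 30.8 GFlops and 8.2 GBytes quoted; the raw disk total comes out at 96 GBytes against the 95 quoted, presumably a rounding or formatted-capacity difference.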
Physically Independent bidirectional 2-Dimensional Mesh |
System-Wide Communications Fabric
|
Interconnect supports arbitrary expansion increments |
Automatically routes messages without interrupting intermediate compute nodes |
Nearly equal performance between any 2 nodes, independent of distance between them |
Programmer can ignore details of how messages move along the interconnect network |
Mesh Routing Chips have 5 bi-directional channels each
|
Interconnect supports 28 MBytes/sec node-node bandwidth |
Allows node fault isolation |
1) ILLIAC |
2) ICL DAP => AMT DAP |
3) Goodyear MPP => LORAL ASPRO used on E2C planes
|
4) Denelcor* HEP founded by Burton Smith => Tera |
5) Cray XMP/22 characterized by Shared Memory and Powerful nodes |
6) Ultracomputer (NYU) => IBM RP3 |
7) Cosmic Cube (Mark I) => Intel
|
8) Sequent, Encore* |
9) Alliant* => Cedar* (Illinois) |
10) Multiflow* wide word => superscalar
|
11) Intel iPSC/1, iPSC/2 -- First commercial MIMD
|
12) Transputers (open nCUBE) Embedded systems
|
13) CM-2 Real SIMD systems
|
14) nCUBE-2 Powerful Hypercubes
|
15) Intel Delta (528 nodes) MIMD but not Hypercube (mesh)
|
16) BBN* TC2000 (Butterfly) Series of elegant but slow shared memory |
17) Cray C-90 Very successful "conventional" supercomputer |
18) CM-5 New MIMD Distributed Memory
|
19) DASH Virtual Shared Memory
|