Full HTML for Scripted Foilset: Second set of lectures on CPS615 Parallel Computing Overview

Given by Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science on Fall Semester 95. Foils prepared 18 Sept 1995
Outside Index Summary of Material


This starts with a discussion of parallel computing using analogies from nature
It uses foils and material from the CSEP chapter on Computer Architecture to discuss how and why to build a parallel computer, including synchronization, memory structure and network issues
SIMD and MIMD architectures are covered, with a brief comparison of workstation networks with closely coupled systems
A look to the future is based on results from the Petaflops workshop

Table of Contents for full HTML of Second set of lectures on CPS615 Parallel Computing Overview

1 CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1995 --
Lecture Stream 2

2 Abstract of Lecture Stream 2 of CPS615
3 Elementary Discussion of
Parallel Computing

4 Single nCUBE2 CPU Chip
5 64 Node nCUBE Board
6 CM-5 in NPAC Machine Room
7 Basic METHODOLOGY of Parallel Computing
8 Concurrent Computation as a Mapping Problem -I
9 Concurrent Computation as a Mapping Problem - II
10 Concurrent Computation as a Mapping Problem - III
11 Finite Element Mesh From Nastran
(mesh only shown in upper half)

12 A Simple Equal Area Decomposition
13 Decomposition After Annealing
(one particularly good but nonoptimal decomposition)

14 Parallel Processing and Society
15 Concurrent Construction of a Wall
Using N = 8 Bricklayers
Decomposition by Vertical Sections

16 Quantitative Speed-Up Analysis for Construction of Hadrian's Wall
17 Amdahl's law for Real World Parallel Processing
18 Pipelining --Another Parallel Processing Strategy for Hadrian's Wall
19 Hadrian's Wall Illustrates that the Topology of Processor Must Include Topology of Problem
20 General Speed Up Analysis
21 Comparison of The Complete Problem to the subproblems formed in domain decomposition
22 Hadrian's Wall Illustrating an
Irregular but Homogeneous Problem

23 Some Problems are Inhomogeneous Illustrated by:
An Inhomogeneous Hadrian Wall with Decoration

24 Global and Local Parallelism Illustrated by Hadrian's Wall
25 Parallel I/O Illustrated by
Concurrent Brick Delivery for Hadrian's Wall
Bandwidth of Trucks and Roads
Matches that of Masons

26 Nature's Concurrent Computers
27 Comparison of Concurrent Processing in Society and Computing
28 Computational Science CPS615
Simulation Track Overview
Foilsets B 1995

29 Abstract of CPS615 Foilsets B 1995
30 Overview of
Parallel Hardware Architecture

31 3 Major Basic Hardware Architectures
32 Examples of the Three Current Concurrent Supercomputer Architectures
33 Parallel Computer Architecture Issues
34 General Types of Synchronization
35 Granularity of Parallel Components
36 Types of Parallel Memory Architectures
-- Logical Structure

37 Types of Parallel Memory Architectures -- Physical Characteristics
38 Diagrams of Shared and Distributed Memories
39 Survey of Issues in Communication Networks
40 Glossary of Useful Concepts in Communication Systems
41 Switch and Bus based Architectures
42 Classes of Communication Network include ...
43 Point to Point Networks (Store and Forward) -- I
44 Examples of Interconnection Topologies
45 Degree and Diameter of Ring and Mesh(Torus) Architectures
46 Degree and Diameter of Hypercube and Tree Architectures
47 Rules for Making Hypercube Network Topologies
48 Mapping of Hypercubes into Three Dimensional Meshes
49 Mapping of Hypercubes into One Dimensional Systems
50 The One dimensional Mapping can be thought of as for one dimensional problem solving or one dimensional layout of chips forming hypercube
51 Hypercube Versus Mesh Topologies
52 Point to Point Networks (Store and Forward) -- II
53 Latency and Bandwidth of a Network
54 Transfer Time in Microseconds for both Shared Memory Operations and Explicit Message Passing
55 Latency/Bandwidth Space for 0-byte message(Latency) and 1 MB message(bandwidth).
56 Switches versus Processor Networks
57 Circuit Switched Networks
58 Let's Return to General Parallel Architectures in more detail
59 Overview of Computer Architecture Issues
60 Some Global Computer Architecture Issues
61 Two General Real World Architectural Issues
62 MIMD Distributed Memory Architecture
63 Some MIMD Architecture Issues
64 SIMD (Single Instruction Multiple Data) Architecture
65 SIMD Architecture Issues
66 Shared Memory Architecture
67 Shared versus Distributed Memory
68 The General Structure of a full sized CRAY C-90
69 The General Structure of a NEC SX-3
Classic Vector Supercomputer

70 Comparison of MIMD and SIMD Parallelism seen on Classic Vector Supercomputers
71 What will happen in the year 2015 with .05 micron feature size and Petaflop Supercomputers using CMOS
72 CMOS Technology and Parallel Processor Chip Projections
73 Processor Chip Requirements for a Petaflop Machine Using 0.05 Micron Technology
74 Three Designs for a Year 2015 Petaflops machine with 0.05 micron technology
75 The Global Shared Memory Category I Petaflop Architecture
76 Category II Petaflop Architecture -- Network of microprocessors
77 Category III Petaflop Design -- Processor in Memory (PIM)
78 Necessary Latency to Support Three Categories
79 Chip Density Projections to year 2013
80 DRAM Chip count for Construction of Petaflop computer in year 2013 using 64 Gbit memory parts
81 Memory Chip Bandwidth in Gigabytes/sec
82 Power and I/O Bandwidth (I/O Connections) per Chip through the year 2013
83 Clock Speed and I/O Speed in megabytes/sec per pin through year 2013




HTML version of Scripted Foils prepared 18 Sept 1995

Foil 1 CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1995 --
Lecture Stream 2

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Geoffrey Fox
NPAC
Room 3-131 CST
111 College Place
Syracuse NY 13244-4100

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 2 Abstract of Lecture Stream 2 of CPS615

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
This starts with a discussion of parallel computing using analogies from nature
It uses foils and material from the CSEP chapter on Computer Architecture to discuss how and why to build a parallel computer, including synchronization, memory structure and network issues
SIMD and MIMD architectures are covered, with a brief comparison of workstation networks with closely coupled systems
A look to the future is based on results from the Petaflops workshop

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 3 Elementary Discussion of
Parallel Computing

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index Secs 30

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 4 Single nCUBE2 CPU Chip

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index Secs 18

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 5 64 Node nCUBE Board

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index Secs 34
Each node is a CPU and 6 memory chips -- the CPU chip integrates communication channels with floating-point, integer and logical CPU functions

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 6 CM-5 in NPAC Machine Room

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index Secs 47
A 32-node CM-5 and, in the foreground, the old CM-2 DataVault disk system

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 7 Basic METHODOLOGY of Parallel Computing

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Domain decomposition is simple, but general and extensible to many more nodes
All successful concurrent machines with
  • Many nodes
  • High performance (this excludes Dataflow)
Have obtained parallelism from "Data Parallelism" or "Domain Decomposition"
A problem is an algorithm applied to a data set
  • and obtains concurrency by acting on the data concurrently.
The three architectures considered here differ as follows:
MIMD Distributed Memory
  • Processing and Data Distributed
MIMD Shared Memory
  • Processing Distributed but data shared
SIMD Distributed Memory
  • Synchronous Processing on Distributed Data

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 8 Concurrent Computation as a Mapping Problem -I

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
2 Different types of Mappings in Physical Spaces
Both are static
  • a) Seismic Migration with domain decomposition on 4 nodes
  • b) Universe simulation with irregular data but a static 16-node decomposition
  • but this problem would be best with dynamic irregular decomposition

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 9 Concurrent Computation as a Mapping Problem - II

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Different types of Mappings -- A very dynamic case without any underlying Physical Space
c) Computer Chess with a dynamic game tree decomposed onto 4 nodes

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 10 Concurrent Computation as a Mapping Problem - III

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 11 Finite Element Mesh From Nastran
(mesh only shown in upper half)

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 12 A Simple Equal Area Decomposition

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
And the corresponding poor workload balance

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 13 Decomposition After Annealing
(one particularly good but nonoptimal decomposition)

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
And excellent workload balance

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 14 Parallel Processing and Society

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
The fundamental principles behind the use of concurrent computers are identical to those used in society - in fact they are partly why society exists.
If a problem is too large for one person, one does not hire a SUPERman, but rather puts together a team of ordinary people...
cf. Construction of Hadrian's Wall

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 15 Concurrent Construction of a Wall
Using N = 8 Bricklayers
Decomposition by Vertical Sections

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Domain Decomposition is Key to Parallelism
Need "Large" Subdomains l >> l overlap

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 16 Quantitative Speed-Up Analysis for Construction of Hadrian's Wall

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 17 Amdahl's law for Real World Parallel Processing

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
AMDAHL"s LAW or
Too many cooks spoil the broth
Says that
Speedup S is small if efficiency e small
or for Hadrian's wall
equivalently S is small if length l small
But this is irrelevant as we do not need parallel processing unless problem big!
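
For reference, the standard quantitative form of Amdahl's law, illustrated with a short Python sketch (the numbers are made up):

def amdahl_speedup(f_sequential, n_proc):
    # Amdahl's law: S = 1 / (f + (1 - f)/N) for sequential fraction f on N processors.
    return 1.0 / (f_sequential + (1.0 - f_sequential) / n_proc)

# Even a 5% sequential part caps the speedup near 20, however many nodes are added:
for n in (8, 64, 1024):
    print(n, "processors: speedup", round(amdahl_speedup(0.05, n), 1))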

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 18 Pipelining --Another Parallel Processing Strategy for Hadrian's Wall

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
"Pipelining" or decomposition by horizontal section is:
  • In general less effective
  • and leads to less parallelism
  • (N = Number of bricklayers must be < number of layers of bricks)

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 19 Hadrian's Wall Illustrates that the Topology of Processor Must Include Topology of Problem

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Hadrian's Wall is one dimensional
Humans represent a flexible processor node that can be arranged in different ways for different problems
The lesson for computing is:
Original MIMD machines used a hypercube topology. The hypercube includes several topologies, including all meshes. It is a flexible concurrent computer that can tackle a broad range of problems. Current machines use a different interconnect structure from the hypercube but preserve this capability.

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 20 General Speed Up Analysis

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Comparing Computer and Hadrian's Wall Cases

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 21 Comparison of The Complete Problem to the subproblems formed in domain decomposition

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
The case of Programming a Hypercube
Each node runs software that is similar to sequential code
e.g., FORTRAN with geometry and boundary value sections changed

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 22 Hadrian's Wall Illustrating an
Irregular but Homogeneous Problem

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Geometry irregular but each brick takes about the same amount of time to lay.
Decomposition of wall for an irregular geometry involves equalizing number of bricks per mason, not length of wall per mason.

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 23 Some Problems are Inhomogeneous Illustrated by:
An Inhomogeneous Hadrian Wall with Decoration

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Fundamental entities (bricks, gargoyles) are of different complexity
Best decomposition dynamic
Inhomogeneous problems run on concurrent computers but require dynamic assignment of work to nodes and strategies to optimize this
(we use neural networks, simulated annealing, spectral bisection etc.)

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 24 Global and Local Parallelism Illustrated by Hadrian's Wall

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Global Parallelism
  • Break up domain
  • Amount of Parallelism proportional to size of problem (and is usually large)
  • Unit is Bricklayer or Computer node
Local Parallelism
  • Do in parallel local operations in the processing of basic entities
    • e.g. for Hadrian's problem, use two hands, one for brick and one for mortar while ...
    • for computer case, do addition at same time as multiplication
  • Local Parallelism is limited but useful
Local and Global Parallelism
Should both be Exploited

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 25 Parallel I/O Illustrated by
Concurrent Brick Delivery for Hadrian's Wall
Bandwidth of Trucks and Roads
Matches that of Masons

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Disk (input/output) Technology is better matched to several modest power processors than to a single sequential supercomputer
Concurrent Computers natural in databases, transaction analysis

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 26 Nature's Concurrent Computers

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
At the finest resolution, collection of neurons sending and receiving messages by axons and dendrites
At a coarser resolution
Society is a collection of brains sending and receiving messages by sight and sound
Ant Hill is a collection of ants (smaller brains) sending and receiving messages by chemical signals
Lesson: All Nature's Computers Use Message Passing
With several different Architectures

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 27 Comparison of Concurrent Processing in Society and Computing

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Problems are large -- use domain decomposition; overheads are edge effects
Topology of processor matches that of domain -- a processor with a rich, flexible node/topology matches most domains
Regular homogeneous problems are easiest, but irregular or inhomogeneous problems can also be handled
Can use local and global parallelism
Can handle concurrent calculation and I/O
Nature always uses message passing as in parallel computers (at lowest level)

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 28 Computational Science CPS615
Simulation Track Overview
Foilsets B 1995

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index Secs 30
Geoffrey Fox
Syracuse University
NPAC 3-131 CST, 111 College Place
Syracuse NY 13244

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 29 Abstract of CPS615 Foilsets B 1995

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index Secs 59
Parallel Computer and Network Architecture
Overview of Issues including synchronization, granularity and 3 classes of architectures
More details on networks
More details on system architectures

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 30 Overview of
Parallel Hardware Architecture

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 31 3 Major Basic Hardware Architectures

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
MIMD Distributed Memory
MIMD with logically shared memory but perhaps physically distributed. The latter is sometimes called distributed shared memory.
SIMD with logically distributed or shared memory
One interesting hardware architecture .........
  • Associative memory - SIMD or MIMD or content addressable memories ..... whose importance deserves further study and integration with mainstream HPCCI
Also have heterogeneous compound architecture (metacomputer) gotten by arbitrary combination of the above.
  • Metacomputers can vary from full collections of several hundred PCs/set-top boxes on the (future) World Wide Web to a CRAY C-90 connected to a CRAY T3D

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 32 Examples of the Three Current Concurrent Supercomputer Architectures

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
This list stems from 1990

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 33 Parallel Computer Architecture Issues

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Two critical Issues are:
Memory Structure
  • Distributed
  • Shared
  • Cached
and Heterogeneous mixtures
Control and synchronization
SIMD -- lockstep synchronization
MIMD -- synchronization can take several forms
  • Simplest: program controlled message passing
  • "Flags" in memory - typical shared memory construct
  • Special hardware - as in cache and its coherency (coordination between nodes)

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 34 General Types of Synchronization

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
An implied constraint on the ordering of operations.
  • In distributed memory, messages often used to synchronize.
  • In shared memory, software semaphores, hardware locks, or other schemes needed. [Denning, P. 1989]
Synchronous: Objects (processors, network, tasks) proceed together in lockstep fashion
Asynchronous: Objects may proceed at somewhat different speeds, sometimes with occasional synchronization points (barriers) to allow slower objects to catch up.

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 35 Granularity of Parallel Components

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors. Pieces and processors may be heterogeneous. Computation/communication ratio very high -- typical of networked metacomputing
Medium-grain: Tens to a few thousand pieces, typically executed by microprocessors. Processors typically run the same code (SPMD style). Computation/communication ratio often hundreds or more. Typical of MIMD parallel systems such as SP2, CM5, Paragon, T3D
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines. Processors typically have instructions broadcast to them. Computation/communication ratio often near unity. Typical of SIMD but seen in a few MIMD systems such as Dally's J Machine or commercial Myrinet (Seitz)
Note that a machine of one type can be used on algorithms of the same or finer granularity

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 36 Types of Parallel Memory Architectures
-- Logical Structure

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Shared (Global): There is a global memory space, accessible by all processors. Processors may also have some local memory. Algorithms may use global data structures efficiently. However "distributed memory" algorithms may still be important, as memory is NUMA (nonuniform access times)
Distributed (Local, Message-Passing): All memory is associated with processors. To retrieve information from another processor's memory a message must be sent there. Algorithms should use distributed data structures.

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 37 Types of Parallel Memory Architectures -- Physical Characteristics

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Uniform: All processors take the same time to reach all memory locations.
Nonuniform (NUMA): Memory access is not uniform, so that it takes a different time for a given processor to get data from each memory bank. This is natural for distributed memory machines but also true in most modern shared memory machines
  • DASH (Hennessy at Stanford) is the best-known example of such a virtual shared memory machine, which is logically shared but physically distributed.
  • ALEWIFE from MIT is a similar project
  • TERA (from Burton Smith) is a uniform-memory-access, logically shared memory machine
Most NUMA machines these days have two memory access times
  • Local memory (divided into registers, caches, etc.) and
  • Nonlocal memory, with little or no difference in access time for different nonlocal memories

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 38 Diagrams of Shared and Distributed Memories

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 39 Survey of Issues in Communication Networks

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 40 Glossary of Useful Concepts in Communication Systems

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Useful terms include:
Scalability: Can network be extended to very large systems? Related to wire length (synchronization and driving problems), degree (pinout)
Fault Tolerance: How easily can system bypass faulty processor, memory, switch, or link? How much of system is lost by fault?
Blocking: Some communication requests may not get through, due to conflicts caused by other requests.
Nonblocking: All communication requests succeed. Sometimes just applies as long as no two requests are for same memory cell or processor.
Latency (delay): Maximal time for nonblocked request to be transmitted.
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on time needed for task.

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 41 Switch and Bus based Architectures

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Switch
Bus

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 42 Classes of Communication Network include ...

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Classes of networks include:
Bus: All processors (and memory) connected to a common bus or busses.
  • Memory access fairly uniform, but not very scalable due to contention
  • Bus machines can be NUMA if memory consists of directly accessed local memory as well as memory banks accessed by BUS. The BUS accessed memories can be local memories on other processors
Switching Network: Processors (and memory) connected to routing switches like in telephone system.
  • Switches might have queues, "combining logic", which improve functionality but increase latency.
  • Switch settings may be determined by message headers or preset by controller.
  • Connections can be packet-switched (messages no longer than some fixed size) or circuit-switched (connection remains as long as needed)
  • Usually NUMA, blocking, often scalable and upgradable

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 43 Point to Point Networks (Store and Forward) -- I

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Processors directly connected to only certain other processors, and must go multiple hops to get to additional processors. Also called store-and-forward.
Usually distributed memory
Examples:
  • ring
  • mesh, torus
  • hypercube -- as in original Caltech/JPL machines
  • binary tree
Usually NUMA, nonblocking, scalable, upgradable
Processor connectivity is modeled by a graph in which nodes represent processors and edges represent the connections between them. Much theoretical work was done here, but it is now obsolete (irrelevant): even if this hardware is used, pipelining masks the transmission time, which ends up showing no term proportional to the distance travelled

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 44 Examples of Interconnection Topologies

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Two dimensional grid, binary tree, complete interconnect and 4D hypercube.
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are partial

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 45 Degree and Diameter of Ring and Mesh(Torus) Architectures

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
The mesh seems to have a poor diameter, but Dally and Seitz emphasize that it makes up for this by being easy to build physically (circuits are two dimensional). Thus it can have fewer links, but as these are physically short they can be "thick" (very high bandwidth)

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 46 Degree and Diameter of Hypercube and Tree Architectures

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
The hypercube is a compromise, so that both degree and diameter grow modestly. High degree implies it is hard to build -- low diameter improves communication
A tree has a bottleneck at the root (top). A fat tree, as in the CM-5, addresses this by increasing the bandwidth on links as one goes up the tree
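
To make the comparison concrete, here is a short Python sketch of the standard textbook degree/diameter formulas for these topologies with P = 2^d nodes (the "binary tree" row assumes a complete tree of 2^d - 1 nodes; this illustration is not taken from the foils):

import math

def topology_table(d):
    # Standard degree/diameter formulas for P = 2**d nodes.
    p = 2 ** d
    side = int(round(math.sqrt(p)))          # assume P is a perfect square for the torus
    return {
        "ring":        (2, p // 2),
        "2-D torus":   (4, 2 * (side // 2)),
        "hypercube":   (d, d),
        "binary tree": (3, 2 * (d - 1)),     # complete tree of depth d-1 (2**d - 1 nodes)
    }

for name, (degree, diameter) in topology_table(6).items():   # 64 nodes
    print(f"{name:12s} degree={degree:2d}  diameter={diameter:2d}")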

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 47 Rules for Making Hypercube Network Topologies

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
A hypercube is 2^d computers arranged on the corners of a cube in d dimensions, each connected to its d nearest neighbors
Label each hypercube node e1 e2 e3 e4 ... ed
where each ek takes the value 0 or 1 depending on whether the vertex is at the "top" or "bottom" of that axis.
Think of the hypercube as the unit cube in d dimensions with the origin at the bottom left hand corner
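
As a small illustration of this labeling rule (a Python sketch): a node's d neighbors are obtained by flipping one of its d bits.

def hypercube_neighbors(label, d):
    # Neighbors of a node differ from it in exactly one of the d bits.
    return [label ^ (1 << k) for k in range(d)]

d = 3
for node in range(2 ** d):
    neighbors = hypercube_neighbors(node, d)
    print(format(node, f"0{d}b"), "->", [format(n, f"0{d}b") for n in neighbors])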

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 48 Mapping of Hypercubes into Three Dimensional Meshes

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
In the string e1 e2 e3 e4 ... ed, one can write d = d1 + d2 + d3
Any set of d1 binary indices can be mapped into a one dimensional line with periodic (wrap-around) connections
the remaining d2 + d3 indices give the other two dimensions
This decomposes the hypercube into a mesh with:
    • 2^d1 x 2^d2 x 2^d3 nodes
So the hypercube includes lots of meshes
e.g. a d=6 hypercube contains 4 by 4 by 4, 8 by 8, 4 by 16, 64 by 1 meshes, etc.
So can study one dimension without loss of generality!

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 49 Mapping of Hypercubes into One Dimensional Systems

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
d= 1 -- 2 nodes
d= 2 -- 4 nodes
d= 3 -- 8 nodes
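
One standard construction for these one-dimensional mappings (a hedged sketch -- the foils show pictures rather than a formula) is the binary-reflected Gray code, which orders the 2^d labels so that consecutive positions on the line are hypercube neighbors:

def gray_code(d):
    # Binary-reflected Gray code: consecutive entries differ in exactly one bit.
    return [i ^ (i >> 1) for i in range(2 ** d)]

for d in (1, 2, 3):
    print(d, [format(g, f"0{d}b") for g in gray_code(d)])
# d = 3 gives 000 001 011 010 110 111 101 100 -- each step flips a single bit.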

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 50 The One dimensional Mapping can be thought of as for one dimensional problem solving or one dimensional layout of chips forming hypercube

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
So in Hadrian's wall we saw a one dimensional layout of processors (people) to solve a one dimensional problem
However computers need to be laid out in two or three dimensional worlds and we can use same ideas.
We consider one dimension only, as the discussion in two or three dimensions is similar: each axis (dimension) corresponds to a distinct subset of the binary indices

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 51 Hypercube Versus Mesh Topologies

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Hypercube:
Mesh with equivalent wire length
So can let long hypercube wires have "local" stops and hypercube becomes a "thick mesh"
With good routing technologies, a thick mesh is better than a hypercube?

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 52 Point to Point Networks (Store and Forward) -- II

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Hardware may handle only a single hop, or multiple hops as in the routing chips on the Paragon.
Software may mask hardware limitations so one sees full connectivity even if it is physically limited. Note that multiple hops always lead to poorer latency, as this is travel time per bit. However, we can keep bandwidth high even with multiple hops by increasing the "size" of channels, e.g. transmitting several bits simultaneously. Software can hide
  • Latency by pipelining -- doing other things while bits are in transit. This is circuit switching if done at a low level.
  • Partial connectivity by supplying a software layer that handles routing -- this is familiar on the Internet
Latency related to graph diameter, among many other factors
Graph may also represent calculations that need to be performed, and the information exchange required.

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 53 Latency and Bandwidth of a Network

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Transmission time for a message of n bytes:
T(n) = T0 + T1 * n, where
T0 is the latency. It contains a term proportional to the number of hops, plus a term Ts representing the interrupt processing time at the start and the time for the communication network and processor to synchronize:
T0 = Ts + Td * (number of hops)
T1 is the inverse bandwidth -- it can be made small if the pipe is of large size.
In practice Ts and T1 are most important and Td is unimportant
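
A short Python sketch of this model (the parameter values below are placeholders, not measurements from the Dongarra and Dunigan data on the following foils):

def transfer_time_us(n_bytes, ts_us=50.0, td_us=0.5, hops=4, bandwidth_mb_s=40.0):
    # T(n) = T0 + T1*n with T0 = Ts + Td*hops.
    # 1 MB/s is 1 byte per microsecond, so T1 in us/byte = 1 / (bandwidth in MB/s).
    t0 = ts_us + td_us * hops
    t1 = 1.0 / bandwidth_mb_s
    return t0 + t1 * n_bytes

for n in (0, 1000, 1000000):
    print(n, "bytes:", round(transfer_time_us(n), 1), "microseconds")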

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 54 Transfer Time in Microseconds for both Shared Memory Operations and Explicit Message Passing

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 55 Latency/Bandwidth Space for 0-byte message(Latency) and 1 MB message(bandwidth).

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Square blocks indicate shared memory copy performance
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 56 Switches versus Processor Networks

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Note that the processor networks such as torus or hypercube can be used to build switches when one does not put processors on internal nodes but rather "just" switch chips.
Thus switches are discussed with same trade-offs -- store and forward, circuit-switched as processor networks.
Switch-based machines typically have the same delay (travel time) between any two processors
In Processor networks, some machines can be nearer to each other if fewer hops
BUT in all modern machines, low level hardware essentially makes all these architectures the SAME. There are only two times of importance corresponding to DATA LOCALITY or not
  • Time to access memory on processor
    • This further divides into time to get to main DRAM and time to get to cache
  • Time to access memory off processor
  • Here time covers both latency and bandwidth.

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 57 Circuit Switched Networks

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Rather than have messages travel single hop at a time, sometimes circuit is first established from sender to receiver, and then data transmitted along circuit.
Similar to establishing phone connection
Can result in significantly lower communication overhead as latency deterministic once circuit established
If circuit blocked, many options possible, including
  • Retry, perhaps with alternative circuit or random delay
  • Wormhole routing: the message travels like a wagon train along the path; it stops when blocked but stays in the circuit.
  • Virtual cut-through: the message dumps into memory where forward progress is blocked, then retries. A blend of pure circuit switching and store-and-forward.
At high levels of message traffic, performance sometimes severely degraded.

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 58 Let's Return to General Parallel Architectures in more detail

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 59 Overview of Computer Architecture Issues

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Many issues both between architectures and internal to each architecture
Situation confused as need to understand
  • "theoretical" arguments e.g. fat-tree v. torus
  • technology trade-offs now e.g. optical v. copper links
  • technology as a function of time
is a particular implementation done right
  • in basic hardware
  • as a system with software
All that perhaps is important is what either user or high level software (compiler/message passing system) sees.

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 60 Some Global Computer Architecture Issues

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
A significant problem is that there is essentially no application software and essentially no machines sold
Real World Architectural issue - what problems will produce substantial sales - should optimize for these?
  • So maybe simulation machines will "ride on" commercial database and multimedia server (Web server) markets
  • Historically this created problems, as in the IBM 3090, which was a business machine adapted for science and very poor at science
  • Currently the SP2 is good for science and doing well in business applications
System integration issues
  • Does software allow hardware to realize potential ?
  • Hardware, software reliability
  • (Parallel) I/O for graphics and file access
  • I/O interfaces - HIPPI, FCS, ATM ... ATM most important in real world but HIPPI scientific standard?

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 61 Two General Real World Architectural Issues

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Why is MPP always behind the best workstations?
  • It takes n extra years to build a new architecture
  • Therefore MPP uses sequential technologies which are n years behind
  • n=2 implies ~ factor of 3 in cost/performance
Must reduce delay n by matching parallel design as much as possible to
  • Commercial processor chips (use workstation not custom nodes)
  • Commercial communication -- (use ATM not custom networks?)
  • Commercial Software -- (Use Web not custom MPP software)
Memory per node (SIMD and MIMD)
  • Some say that current machines waste transistors - activity only occurs in 1% of transistors at a time?
  • Others say need large memory to run UNIX, increase grain size as problems complicated and further communication needs decrease as grain size increases

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 62 MIMD Distributed Memory Architecture

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
A special case of this is a network of workstations (NOW's) or personal computers
Issues include:
  • Node - CPU, Memory
  • Network
  • Bandwidth
  • Latency
  • Hardware
  • actual (software)

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 63 Some MIMD Architecture Issues

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Choice of Node is a dominant issue?
  • RISC nodes -- SGI Challenge, IBM SP-2, Convex, Cray T3D with MIPS, IBM, HP and Digital workstation chips respectively
  • Special purpose nodes -- CM-5, Paragon, Meiko CS-2 -- out of favor as non competitive
  • "old" nodes -- nCUBE-2
  • small nodes -- Caltech Mosaic (which is the basis of Myrinet), the J Machine (Dally at MIT), Execube (Kogge -- Loral, ex IBM)
Network Topology as described is not very important today?
  • Theoretical issues obscured by technology and implementation
  • Details of network can be and are hidden from user and compiler by simple elegant (message passing) software including collective communication primitives (broadcast, reduction, etc.)
However we still have two major types of network:
  • Distributed Shared Memory -- physically distributed memory but hardware support for shared memory as in SGI and Convex machines
  • Pure Distributed Memory as in IBM SP-2 or network of workstations

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 64 SIMD (Single Instruction Multiple Data) Architecture

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
CM-2 -- 64K processors with 1-bit arithmetic -- hypercube network, a broadcast network that can also combine, and a "global or" network
Maspar, DECmpp -- 16K processors with 4-bit (MP-1) or 32-bit (MP-2) arithmetic, two-dimensional mesh and general switch
Execube -- 16-bit processors with 8 integrated into an IBM 4 Mbit memory chip, SIMD or MIMD or both,
512 processors on an IBM RS6000 with a three dimensional mesh

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 65 SIMD Architecture Issues

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Node: 1-bit, 4-bit, 16-bit, or 32-bit?
Interconnect issues similar to MIMD
Critical point -- is SIMD more cost effective than MIMD on a significant number of problems
  • What types of problems run well on SIMD?
  • Take a problem - e.g. SOR for PDE solution - that runs well on SIMD - is the MP-2 more cost effective than CM-5, Paragon, SP-1?
  • Need to compare SIMD and MIMD machines at "same technology" stage
Current Consensus is opposite of 1990 -- namely now MIMD dominates.
  • SIMD machines (AMT DAP and Maspar) are aimed at market niches
  • Defense signal processing
  • Business index sorting

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 66 Shared Memory Architecture

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Some machines have a "true" shared memory, i.e. every location is "equally accessible" by each node: the original Ultracomputer, Burton Smith's Tera, some bus-based systems
Others, such as the BBN Butterfly and Kendall Square KSR-2, have non-uniform access times but all memory locations are accessible

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 67 Shared versus Distributed Memory

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Expected Architectures of Future will be:
  • Physically distributed but hardware support of shared memory for tightly coupled MPP's such as future IBM SP-X, Convex Exemplar, SGI (combined with Cray)
  • Physically distributed but without hardware support -- NOW's and COW's -- The World Wide Web as a Metacomputer
Essentially all problems run efficiently on a distributed memory BUT
Software is easier to develop on a shared memory machine
Some Shared Memory Issues:
  • Cost - Performance : additional hardware (functionality, network bandwidth) to support shared memory
  • Scaling. Can you build very big shared memory machines?
    • Yes for NUMA distributed shared memory
  • Compiler challenges for distributed shared memory are difficult and major focus of academic and commercial work
  • This is not practically important now as 32 node KSR-2 (from past) or SGI Power Challenge (cost ~< $2m) is already at high end of important commercial market

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 68 The General Structure of a full sized CRAY C-90

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Each CPU has two pipes; each pipe does one add and one multiply per clock period
Cycle time: 4 nanoseconds

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 69 The General Structure of a NEC SX-3
Classic Vector Supercomputer

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Each CPU has up to 4 functional units
Cycle time: 2.9 nanoseconds

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 70 Comparison of MIMD and SIMD Parallelism seen on Classic Vector Supercomputers

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Note that the MIMD parallelism is "larger" than the SIMD parallelism, as the MIMD size reflects the number of grid points, particles, etc.

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 71 What will happen in the year 2015 with .05 micron feature size and Petaflop Supercomputers using CMOS

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Only Superconducting Technology can possibly do better
Need Optical Interconnects probably

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 72 CMOS Technology and Parallel Processor Chip Projections

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
See Chapter 5 of Petaflops Report -- July 94

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 73 Processor Chip Requirements for a Petaflop Machine Using 0.05 Micron Technology

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
See Chapter 5 of Petaflops Report -- July 94

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 74 Three Designs for a Year 2015 Petaflops machine with 0.05 micron technology

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
See Chapter 6 of Petaflops Report -- July 94

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 75 The Global Shared Memory Category I Petaflop Architecture

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
See Chapter 6 of Petaflops Report -- July 94
This shared memory design is the natural evolution of systems such as the Cray-2, Cray-3 or Cray C-90

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 76 Category II Petaflop Architecture -- Network of microprocessors

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
See Chapter 6 of Petaflops Report -- July 94
This architecture generalizes the current IBM SP-2 type of system and, unlike Category I, requires data locality for the up to 40,000 CPUs to be able to function efficiently with minimum communication overhead

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 77 Category III Petaflop Design -- Processor in Memory (PIM)

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
See Chapter 6 of Petaflops Report -- July 94
This design is an extrapolation of systems such as the J Machine (Dally), Execube (Loral) or Mosaic (Seitz). It features CPU and memory integrated on the chip (PIM).
Unlike such systems today, in the year 2015 such PIM designs would have substantial memory per processor

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 78 Necessary Latency to Support Three Categories

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
This is the formal latency needed to ensure that one can reach the design goals of the previous table. 1 picosecond (Category I) cannot be attained. You solve this by having N concurrent streams -- then each of them needs a latency of only N picoseconds
e.g. if the output rate of an arithmetic unit is 1 result per nanosecond, then each of the 400 one-teraflop CPUs in the Category I design must have 1000 arithmetic units running at full speed to reach the full system performance of 0.4 petaflops
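
A back-of-envelope check of that arithmetic (the numbers are those assumed in the sentence above):

target_flops  = 0.4e15      # 0.4 petaflops for the full system
n_cpus        = 400
unit_rate     = 1e9         # one result per nanosecond per arithmetic unit

flops_per_cpu = target_flops / n_cpus       # 1e12 -- one teraflop per CPU
units_per_cpu = flops_per_cpu / unit_rate   # 1000 concurrent arithmetic units
print(flops_per_cpu, units_per_cpu)         # 1000000000000.0 1000.0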
See Chapter 6 of Petaflops Report -- July 94

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 79 Chip Density Projections to year 2013

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 80 DRAM Chip count for Construction of Petaflop computer in year 2013 using 64 Gbit memory parts

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 81 Memory Chip Bandwidth in Gigabytes/sec

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 82 Power and I/O Bandwidth (I/O Connections) per Chip through the year 2013

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

HTML version of Scripted Foils prepared 18 Sept 1995

Foil 83 Clock Speed and I/O Speed in megabytes/sec per pin through year 2013

From Second set of lectures on CPS615 Parallel Computing Overview CPS615 Basic Simulation Track for Computational Science -- Fall Semester 95. *
Full HTML Index
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu
