Full HTML for Scripted Foilset CPS615-Lecture on Performance(end) and Computer Technologies(start)

Given by Geoffrey C. Fox in the Delivered Lectures of CPS615, Basic Simulation Track for Computational Science, on 10 September 1996. Foils prepared 27 October 1996
Outside Index Summary of Material Secs 50.4


This starts by filling in the details of communication overhead in parallel processing for the case where the "range" of interaction is large
We show two old examples from Caltech illustrating the correctness of the analytic form
We return to the discussion of computer architectures, describing:
  • Vector Supercomputers
  • General Relevance of data locality and pipelining
  • Flynn's classification (MIMD,SIMD etc.)
  • Memory Structures
  • Initial issues in MIMD and SIMD discussion

Table of Contents for full HTML of CPS615-Lecture on Performance(end) and Computer Technologies(start)

1 Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1996 --
Lecture of September 10 - 1996

2 Abstract of Sept 10 1996 CPS615 Lecture
3 Communication to Calculation Ratio as a function of template
4 Performance for Increasing Stencil
5 Matrix Multiplication on the Hypercube
6 Efficiency of QCD Physics Simulation on JPL MarkIIIfp Hypercube
7 Architecture Classes of High Performance Computers
8 von Neumann Architecture in a Nutshell
9 Illustration of Importance of Cache
10 Vector Supercomputers in a Nutshell - I
11 Vector Supercomputing in a picture
12 Vector Supercomputers in a Nutshell - II
13 Flynn's Classification of HPC Systems
14 Parallel Computer Architecture Memory Structure
15 Comparison of Memory Access Strategies
16 Types of Parallel Memory Architectures -- Physical Characteristics
17 Diagrams of Shared and Distributed Memories
18 Parallel Computer Architecture Control Structure
19 Some Major Hardware Architectures - MIMD
20 MIMD Distributed Memory Architecture

Outside Index Summary of Material



HTML version of Scripted Foils prepared 27 October 1996

Foil 1 Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1996 --
Lecture of September 10 - 1996

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 24.4
Geoffrey Fox
NPAC
Room 3-131 CST
111 College Place
Syracuse NY 13244-4100

HTML version of Scripted Foils prepared 27 October 1996

Foil 2 Abstract of Sept 10 1996 CPS615 Lecture

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 50.4
This starts by filling in the details of communication overhead in parallel processing for the case where the "range" of interaction is large
We show two old examples from Caltech illustrating the correctness of the analytic form
We return to the discussion of computer architectures, describing:
  • Vector Supercomputers
  • General Relevance of data locality and pipelining
  • Flynn's classification (MIMD,SIMD etc.)
  • Memory Structures
  • Initial issues in MIMD and SIMD discussion

HTML version of Scripted Foils prepared 27 October 1996

Foil 3 Communication to Calculation Ratio as a function of template

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 93.6

HTML version of Scripted Foils prepared 27 October 1996

Foil 4 Performance for Increasing Stencil

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 277.9
The previous foil showed that increasing the stencil made slight improvements!
This foil shows that larger stencils have much lower overheads (and hence better parallel performance) than the simple Laplace's equation with its 5 point stencil
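As a rough guide to why this happens (a sketch of the standard grain size argument, not text from the foil): for a d-dimensional grid with n points stored on each processor, the fractional communication overhead behaves roughly as

    f_C \approx \mathrm{const} \times \frac{t_{comm}}{n^{1/d}\, t_{calc}}

where t_comm is the time to communicate one word and t_calc the time for one floating point calculation. A larger stencil does more arithmetic per boundary point communicated, which lowers the constant and hence the overhead curves.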

HTML version of Scripted Foils prepared 27 October 1996

Foil 5 Matrix Multiplication on the Hypercube

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 109.4
Showing the linear overhead behavior for fC
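For orientation (an assumption about what the plot on this foil showed, based on the standard block decomposition analysis): with n matrix elements per node, each step of a blocked hypercube matrix multiply communicates O(n) words but performs O(n^{3/2}) arithmetic operations, giving

    f_C \approx \mathrm{const} \times \frac{t_{comm}}{\sqrt{n}\; t_{calc}}

so the overhead is linear when plotted against 1/\sqrt{n}.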

HTML version of Scripted Foils prepared 27 October 1996

Foil 6 Efficiency of QCD Physics Simulation on JPL MarkIIIfp Hypercube

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 159.8

HTML version of Scripted Foils prepared 27 October 1996

Foil 7 Architecture Classes of High Performance Computers

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 41.7
Sequential or von Neumann Architecture
Vector (Super)computers
Parallel Computers
  • with various architectures classified by Flynn's methodology (this is incomplete as it only discusses the control or synchronization structure)
  • SISD
  • MISD
  • MIMD
  • SIMD
  • Metacomputers

HTML version of Scripted Foils prepared 27 October 1996

Foil 8 von Neumann Architecture in a Nutshell

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 48.9
Instructions and data are stored in the same memory, for which there is a single link (the von Neumann Bottleneck) to the CPU, which decodes and executes instructions
The CPU can have multiple functional units
The memory access can be enhanced by use of caches made from faster memory to allow greater bandwidth and lower latency
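A minimal sketch in C of why the cache matters (the matrix size below is an arbitrary illustrative choice, not taken from the lecture): both loops do the same arithmetic, but the column-order loop touches memory in a cache-unfriendly order and typically runs several times slower on a cached machine.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 4096                      /* 4096 x 4096 doubles = 128 MB, far larger than any cache */

    int main(void) {
        double *a = malloc((size_t)N * N * sizeof(double));
        double sum = 0.0;
        clock_t t0, t1, t2;
        int i, j;

        if (a == NULL) return 1;
        for (i = 0; i < N * N; i++) a[i] = 1.0;

        t0 = clock();
        for (i = 0; i < N; i++)          /* row order: consecutive addresses, cache lines reused */
            for (j = 0; j < N; j++)
                sum += a[(size_t)i * N + j];
        t1 = clock();
        for (j = 0; j < N; j++)          /* column order: same arithmetic, poor locality */
            for (i = 0; i < N; i++)
                sum += a[(size_t)i * N + j];
        t2 = clock();

        printf("sum=%g  row-order %.2fs  column-order %.2fs\n", sum,
               (double)(t1 - t0) / CLOCKS_PER_SEC, (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(a);
        return 0;
    }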

HTML version of Scripted Foils prepared 27 October 1996

Foil 9 Illustration of Importance of Cache

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 250.5
Fig 1.14 of Aspects of Computational Science
Editor Aad van der Steen
published by NCF

HTML version of Scripted Foils prepared 27 October 1996

Foil 10 Vector Supercomputers in a Nutshell - I

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 154
This design enhances performance by noting that many applications calculate "vector-like" operations
  • Such as c(i)=a(i)+b(i) for i=1...N and N quite large
This allows one to address two performance problems
  • Latency in accessing memory (e.g. could take 10-20 clock cycles between requesting a particular memory location and delivery of result to CPU)
  • A complex operation, e.g. a floating point operation, can take a few machine cycles to complete
They are typified by the Cray 1, XMP, YMP, C-90, CDC-205, ETA-10 and Japanese supercomputers from NEC, Fujitsu and Hitachi
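An illustrative C version of such a vector-like operation (my code, not from the foil); on a vector machine the compiler turns this loop into vector loads, a pipelined vector add and a vector store rather than N separate scalar operations:

    /* c(i) = a(i) + b(i) for i = 1...N -- the canonical vectorizable loop */
    void vadd(int n, const double *a, const double *b, double *c)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];    /* independent iterations: ideal for pipelining */
    }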

HTML version of Scripted Foils prepared 27 October 1996

Foil 11 Vector Supercomputing in a picture

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 120.9
A pipeline for vector addition looks like:
  • From Aspects of Computational Science -- Editor Aad van der Steen published by NCF
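In the usual textbook timing model (stated here as an assumption, since the original picture is not reproduced): if the addition pipeline has s stages and delivers one result per clock period tau once full, a vector of length N completes in roughly

    T(N) \approx (s + N - 1)\,\tau

so the rate approaches one result per clock for large N, and half of that peak rate is already reached at a vector length of about s.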

HTML version of Scripted Foils prepared 27 October 1996

Foil 12 Vector Supercomputers in a Nutshell - II

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 406
Vector machines pipeline data through the CPU
They are not as popular/relevant as in the past because:
  • Improved CPU architecture needs fewer cycles than before for each (complex) operation (e.g. 4 now, not ~100 as in the past)
  • The 8 MHz 8087 of the Cosmic Cube took 160 to 400 clock cycles to do a full floating point operation in 1983
  • Applications need more flexible pipelines which allow different operations to be executed on consecutive operands as they stream through the CPU
  • Modern RISC processors (superscalar) can support such complex pipelines as they have far more logic than CPUs of the past
In fact the excellence of, say, the Cray C-90 is due to its very good memory architecture, allowing it to fetch enough operands to sustain the pipeline.
Most workstation class machines have "good" CPUs but can never get enough data from memory to sustain good performance except for a few cache-intensive applications
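A rough worked example of the memory bandwidth point (illustrative numbers of my own, not from the lecture): the loop c(i)=a(i)+b(i) moves two 8-byte loads and one 8-byte store for every floating point add, about 24 bytes per flop, so sustaining even 100 Mflop/s requires roughly 2.4 GB/s of memory bandwidth -- far beyond what a mid-1990s workstation memory system delivered, which is why only cache-resident problems ran anywhere near peak.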

HTML version of Scripted Foils prepared 27 October 1996

Foil 13 Flynn's Classification of HPC Systems

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 406
Very High Speed Computing Systems, Proc. of the IEEE 54, 12, pp. 1901-1909 (1966), and
Some Computer Organizations and Their Effectiveness, IEEE Trans. on Computers C-21, pp. 948-960 (1972) -- both papers by M. J. Flynn
SISD -- Single Instruction stream, Single Data Stream -- i.e. von Neumann Architecture
MISD -- Multiple Instruction stream, Single Data Stream -- Not interesting
SIMD -- Single Instruction stream, Multiple Data Stream
MIMD -- Multiple Instruction stream, Multiple Data Stream -- the dominant parallel system, with an approximately one-to-one match of instruction and data streams.

HTML version of Scripted Foils prepared 27 October 1996

Foil 14 Parallel Computer Architecture Memory Structure

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 205.9
Memory Structure of Parallel Machines
  • Distributed
  • Shared
  • Cached
and Heterogeneous mixtures
Shared (Global): There is a global memory space, accessible by all processors.
  • Processors may also have some local memory.
  • Algorithms may use global data structures efficiently.
  • However "distributed memory" algorithms may still be important as memory is NUMA (Nonuniform access times)
Distributed (Local, Message-Passing): All memory is associated with processors.
  • To retrieve information from another processor's memory, a message must be sent there (see the sketch below).
  • Algorithms should use distributed data structures.
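A minimal sketch of that message passing pattern, assuming MPI (which this foil does not name): rank 0 cannot address rank 1's memory directly, so rank 1 must send the data in a message.

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 1 owns a value; rank 0 obtains it by message passing, since it
       cannot address rank 1's memory directly.
       Run with at least two processes, e.g.  mpirun -np 2 ./a.out        */
    int main(int argc, char **argv) {
        int rank;
        double x = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            x = 3.14;                                         /* data lives in rank 1's local memory */
            MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 received %g from rank 1\n", x);
        }
        MPI_Finalize();
        return 0;
    }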

HTML version of Scripted Foils prepared 27 October 1996

Foil 15 Comparison of Memory Access Strategies

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 335.5
Memory can be accessed directly (analogous to a phone call) as in red lines below or indirectly by message passing (green line below)
We show two processors in a MIMD machine for distributed (left) or shared(right) memory architectures

HTML version of Scripted Foils prepared 27 October 1996

Foil 16 Types of Parallel Memory Architectures -- Physical Characteristics

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 486.7
Uniform: All processors take the same time to reach all memory locations.
Nonuniform (NUMA): Memory access is not uniform, so a given processor takes a different time to get data from each memory bank. This is natural for distributed memory machines but is also true in most modern shared memory machines
  • DASH (Hennessy at Stanford) is the best known example of such a virtual shared memory machine, which is logically shared but physically distributed.
  • ALEWIFE from MIT is a similar project
  • TERA (from Burton Smith) is a uniform memory access, logically shared memory machine
Most NUMA machines these days have two memory access times
  • Local memory (divided into registers, caches, etc.) and
  • Nonlocal memory, with little or no difference in access time for different nonlocal memories
This simple two-level memory access model gets more complicated in proposed 10-year-out "petaflop" designs
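One hedged way to quantify the two-level model (my notation, not the foil's): if a fraction h of a program's references are satisfied from local memory with access time t_local and the remainder go to nonlocal memory with time t_remote, the average access time is roughly

    t_{avg} \approx h\, t_{local} + (1 - h)\, t_{remote}

so algorithms that keep their working data local (raising h) win even on a logically shared NUMA machine.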

HTML version of Scripted Foils prepared 27 October 1996

Foil 17 Diagrams of Shared and Distributed Memories

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 64.8

HTML version of Scripted Foils prepared 27 October 1996

Foil 18 Parallel Computer Architecture Control Structure

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 381.6
SIMD -- lockstep synchronization
  • Each processor executes the same instruction stream
MIMD -- each processor executes independent instruction streams
MIMD Synchronization can take several forms
  • Simplest: program controlled message passing
  • "Flags" (barriers,semaphores) in memory - typical shared memory construct as in locks seen in Java Threads
  • Special hardware - as in cache and its coherency (coordination between nodes)
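A small sketch of the "flags in memory" style of synchronization, written with POSIX threads in C rather than the Java Threads locks mentioned above (an illustrative substitution, not the lecture's example): a mutex protects a shared counter and a barrier makes all threads wait for each other.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t barrier;
    static long counter = 0;

    static void *worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);        /* "flag" in shared memory: one thread at a time */
        counter++;
        pthread_mutex_unlock(&lock);

        pthread_barrier_wait(&barrier);   /* all threads synchronize here before continuing */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* always NTHREADS because of the lock */
        pthread_barrier_destroy(&barrier);
        return 0;
    }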

HTML version of Scripted Foils prepared 27 October 1996

Foil 19 Some Major Hardware Architectures - MIMD

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 156.9
MIMD Distributed Memory
  • This is now best illustrated by a collection of computers on a network (i.e. a metacomputer)
MIMD with logically shared memory but usually physically distributed. The latter is sometimes called distributed shared memory.
  • In the near future, ALL formal (closely coupled) MPPs will be distributed shared memory
  • Note that all computers (e.g. the current MIMD distributed memory IBM SP2) allow any node to get at any memory, but this is done indirectly -- you send a message
  • In future "closely-coupled" machines, there will be built-in hardware supporting the function that any node can directly address all memory of the system
  • This distributed shared memory architecture is currently of great interest to (a major challenge for) parallel compilers

HTML version of Scripted Foils prepared 27 October 1996

Foil 20 MIMD Distributed Memory Architecture

From CPS615-Lecture on Performance(end) and Computer Technologies(start) Delivered Lectures of CPS615 Basic Simulation Track for Computational Science -- 10 September 96. *
Full HTML Index Secs 614.8
A special case of this is a network of workstations (NOWs) or personal computers (a metacomputer)
Issues include:
  • Node - CPU, Memory
  • Network - Bandwidth, Memory
  • Hardware Enhanced Access to distributed Memory

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu
