Given by Geoffrey C. Fox in the Basic Simulation Track for Computational Science (CPS615), Fall Semester 1996. Foils prepared 10 September 1996
Summary of Material
Overview of Course Itself! -- and then introductory material on basic curricula |
Overview of National Program -- The Grand Challenges |
Overview of Technology Trends leading to petaflop performance in year 2007 (hopefully) |
Overview of Syracuse and National programs in computational science |
Parallel Computing in Society |
Why Parallel Computing works |
Simple Overview of Computer Architectures |
General Discussion of Message Passing and Data Parallel Programming Paradigms and a comparison of languages |
Geoffrey Fox |
NPAC |
Room 3-131 CST |
111 College Place |
Syracuse NY 13244-4100 |
Instructor: Geoffrey Fox -- gcf@npac.syr.edu, 315-443-2163, Room 3-131 CST |
Backup: Nancy McCracken -- njm@npac.syr.edu, 315-443-4687, Room 3-234 CST |
NPAC Administrative support: Nora Downey-Easter -- nora@npac.syr.edu, 315-443-1722, Room 3-206 CST |
The CPS615 powers that be can be reached at cps615ad@npac.syr.edu |
CPS615 Students can be reached by mailing cps615@npac.syr.edu |
Homepage will be: |
http://www.npac.syr.edu/projects/cps615fall96 |
See my paper SCCS 736 as an overview of HPCC status |
Grading is based on approximately 8 homeworks, each due the Thursday of the week following the day (Tuesday or Thursday) it is given out |
Plus one modest-sized project at the end of the class -- it must involve "real" running parallel code! |
No finals or written exams |
All material will be placed on World Wide Web(WWW) |
Preference given to work returned on the Web |
Overview of National Scene -- Why is High Performance Computing Important |
What is Computational Science -- The Program at Syracuse |
Basic Technology Situation -- Increasing density of transistors on a chip |
Elementary Discussion of Parallel Computing including use in society |
Computer Architecture -- Parallel and Sequential |
Simple base example -- Laplace's Equation |
This is followed by two sections -- software technologies and applications which are interspersed with each other and "algorithm" modules |
Programming Models -- Message Passing and Data Parallel Computing -- MPI and HPF (Fortran 90) |
Some real applications analysed in detail |
This introduction is followed by a set of "vignettes" discussing problem classes which illustrate parallel programming and parallel algorithms |
Ordinary Differential Equations |
Numerical Integration including adaptive methods |
Floating Point Arithmetic |
Monte Carlo Methods including Random Numbers |
Full Matrix Algebra as in |
Partial Differential Equations implemented as sparse matrix problems (as in Computational Fluid Dynamics) |
Overall Roadmap Technology Characteristics from SIA (Semiconductor Industry Association) Report 1994 |
L=Logic, D=DRAM, A=ASIC, mP = microprocessor |
We can choose technology and architecture separately in designing our high performance system |
Technology is like choosing ants, people, or tanks as the basic units in our society analogy |
In the HPCC arena, we can distinguish current technologies |
Near term technology choices include |
Further term technology choices include |
It will cost $40 Billion for next industry investment in CMOS plants and this huge investment makes it hard for new technologies to "break in" |
Architecture is equivalent to organization or design in the society analogy |
We can distinguish formal and informal parallel computers |
Informal parallel computers are typically "metacomputers" |
Metacomputers are a very important trend; they use similar software and algorithms to conventional MPPs but typically have less optimized parameters |
Formal high performance computers are the classic (basic) object of study and are |
"closely coupled" specially designed collections of compute nodes which have (in principle) been carefully optimized and balanced in the areas of
|
In society, we see a rich set of technologies and architectures |
with several different communication mechanisms offering different trade-offs |
Quantum-Mechanical Computers by Seth Lloyd, Scientific American, Oct 95 |
Chapter 6 of The Feynman Lectures on Computation edited by Tony Hey and Robin Allen, Addison-Wesley, 1996 |
Quantum Computing: Dream or Nightmare? Haroche and Raimond, Physics Today, August 96 page 51 |
Basically any physical system can "compute", as one "just" needs a system whose answers depend on its inputs, and all physical systems have this property |
Thus one can build "superconducting", "DNA", or "quantum" computers exploiting, respectively, superconducting, molecular, or quantum mechanical rules |
For a "new technology" computer to be useful, one needs to be able to
|
Conventional computers are built around bit ( taking values 0 or 1) manipulation |
One can build arbitrarily complex arithmetic if one has some way of implementing NOT and AND |
Quantum Systems naturally represent bits |
Interactions between quantum systems can cause "spin-flips" or state transitions and so implement arithmetic |
Incident photons can "read" state of system and so give I/O capabilities |
Quantum "bits" called qubits have another property as one has not only
|
Lloyd describes how such coherent states provide new types of computing capabilities
|
Superconductors produce wonderful "wires" which transmit picosecond (10^-12 second) pulses at near the speed of light |
Niobium used in constructing such superconducting circuits can be processed by similar fabrication techniques to CMOS |
Josephson Junctions allow picosecond performance switches |
BUT IBM (1969-1983) and Japan (MITI, 1981-90) terminated major efforts in this area |
New ideas have resurrected this concept using RSFQ -- Rapid Single Flux Quantum -- approach |
This naturally gives a bit which is 0 or 1 (or in fact n units!) |
This gives interesting circuits of similar structure to CMOS systems but with a clock speed of order 100-300GHz -- factor of 100 better than CMOS which will asymptote at around 1 GHz (= one nanosecond cycle time) |
At least two major problems: |
The semiconductor industry will invest some $40B in CMOS "plants" and infrastructure |
One cannot build memory to match the CPU speed; current designs have superconducting CPUs (with perhaps 256 Kbytes of superconducting memory per processor) but conventional CMOS main memory |
Superconducting technology also has a bad "name" due to IBM's termination! |
Sequential or von Neumann Architecture |
Vector (Super)computers |
Parallel Computers |
Instructions and data are stored in the same memory, for which there is a single link (the von Neumann Bottleneck) to the CPU, which decodes and executes instructions |
The CPU can have multiple functional units |
The memory access can be enhanced by use of caches made from faster memory to allow greater bandwidth and lower latency |
Fig 1.14 of Aspects of Computational Science, editor Aad van der Steen, published by NCF |
This design enhances performance by noting that many applications calculate "vector-like" operations |
This allows one to address two performance problems |
They are typified by the Cray 1, XMP, YMP and C-90, the CDC-205, the ETA-10, and Japanese supercomputers from NEC, Fujitsu and Hitachi |
A pipeline for vector addition looks like: |
Vector machines pipeline data through the CPU (see the small Fortran sketch at the end of this discussion) |
They are not so popular/relevant as in the past as |
In fact the excellence of, say, the Cray C-90 is due to its very good memory architecture, which allows one to fetch enough operands to sustain the pipeline |
Most workstation class machines have "good" CPUs but can never get enough data from memory to sustain good performance, except for a few cache intensive applications |
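To make the "vector-like" operations above concrete, here is the kind of loop a vector machine pipelines well. This is a generic Fortran 90 illustration with arbitrary array size and scalar, given input vectors B and C; it is not a transcription of any foil.

      REAL :: A(100000), B(100000), C(100000), S
      INTEGER :: I
      S = 2.0
      ! Each iteration is independent, so once the add and multiply pipelines are full
      ! the machine can retire roughly one result per clock -- provided memory can
      ! deliver two operands and accept one result per cycle (the C-90 point above).
      DO I = 1, 100000
         A(I) = B(I) + S * C(I)
      END DO
      ! In Fortran 90 array syntax the same operation is simply:  A = B + S * C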
Very high speed computing systems, Proc. of the IEEE 54, 12, pp. 1901-1909 (1966), and |
Some Computer Organizations and their Effectiveness, IEEE Trans. on Computers C-21, 948-960 (1972) -- both papers by M.J. Flynn |
SISD -- Single Instruction stream, Single Data Stream -- i.e. von Neumann Architecture |
MISD -- Multiple Instruction stream, Single Data Stream -- Not interesting |
SIMD -- Single Instruction stream, Multiple Data Stream |
MIMD -- Multiple Instruction stream and Multiple Data Stream -- dominant parallel system with ~one to ~one match of instruction and data streams. |
Memory Structure of Parallel Machines |
and Heterogeneous mixtures |
Shared (Global): There is a global memory space, accessible by all processors. |
Distributed (Local, Message-Passing): All memory is associated with processors. |
Memory can be accessed directly (analogous to a phone call) as in red lines below or indirectly by message passing (green line below) |
We show two processors in a MIMD machine for distributed (left) or shared(right) memory architectures |
Uniform: All processors take the same time to reach all memory locations. |
Nonuniform (NUMA): Memory access is not uniform, so it takes a different time for a given processor to get data from each memory bank. This is natural for distributed memory machines but also true in most modern shared memory machines |
Most NUMA machines these days have two memory access times |
This simple two level memory access model gets more complicated in proposed 10 year out "petaflop" designs |
SIMD -- lockstep synchronization |
MIMD - Each Processor executes independent instruction streams |
MIMD Synchronization can take several forms |
MIMD Distributed Memory |
MIMD with logically shared memory but usually physically distributed. The latter is sometimes called distributed shared memory. |
A special case of this is a network of workstations (NOW's) or personal computers (metacomputer) |
Issues include: |
SIMD -- Single Instruction Multiple Data -- can have logically distributed or shared memory |
CM-2 -- 64K processors with 1 bit arithmetic -- hypercube network, a broadcast network that can also combine, and a "global or" network |
Maspar, DECmpp -- 16K processors with 4 bit (MP-1) or 32 bit (MP-2) arithmetic, a fast two-dimensional mesh and a slower general switch for communication |
We also have the heterogeneous compound architecture (metacomputer), obtained by an arbitrary combination of MIMD or SIMD, sequential or parallel machines. |
Metacomputers can vary from full collections of several hundred PC's/Settop boxes on the (future) World Wide Web to a CRAY C-90 connected to a CRAY T3D |
This is a critical future architecture which is intrinsically distributed memory, as multi-vendor heterogeneity implies that one cannot have special hardware enhanced shared memory |
Cluster of workstations or PC's |
Heterogeneous MetaComputer System |
One example is an Associative memory - SIMD or MIMD or content addressable memories |
This is an example of a special purpose "signal" processing machine which can in fact be built from "conventional" SIMD or MIMD architectures |
This type of machine is not so popular as most applications are not dominated by computations for which good special purpose devices can be designed |
If only 10% of a problem is, say, "track-finding" or some other special purpose processing, then who cares if you reduce that 10% by a factor of 100? |
N body problems (e.g. Newton's laws for one million stars in a globular cluster) can have successful special purpose devices |
See the GRAPE (GRAvity PipE) machine (Sugimoto et al., Nature 345, page 90, 1990) |
Note GRAPE uses EXACTLY the same parallel algorithm that one finds in the books (e.g. Solving Problems on Concurrent Processors) for N-body problems on classic distributed memory MIMD machines |
GRAPE will execute the classic O(N^2) (parallel) N body algorithm BUT this is not the algorithm used in most such computations |
Rather there is the O(N) or O(N log N) so-called "fast multipole" algorithm, which uses a hierarchical approach |
So special purpose devices cannot usually take advantage of new nifty algorithms! |
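To make the classic O(N^2) algorithm mentioned above concrete, here is a minimal Fortran 90 sketch of the force loop, given positions and masses. The units (G = 1), the softening parameter EPS2, and the data layout are illustrative assumptions for this sketch; it is not GRAPE's hardware pipeline or the course's code.

      INTEGER, PARAMETER :: N = 1000
      REAL :: X(N), Y(N), Z(N), M(N)              ! given positions and masses
      REAL :: FX(N), FY(N), FZ(N)                 ! accumulated force components
      REAL :: DX, DY, DZ, R2, F
      REAL, PARAMETER :: EPS2 = 1.0E-6            ! small softening to avoid division by zero
      INTEGER :: I, J
      FX = 0.0;  FY = 0.0;  FZ = 0.0
      DO I = 1, N                                 ! O(N^2): every particle sees every other one
         DO J = 1, N
            IF (I == J) CYCLE
            DX = X(J) - X(I);  DY = Y(J) - Y(I);  DZ = Z(J) - Z(I)
            R2 = DX*DX + DY*DY + DZ*DZ + EPS2
            F  = M(I) * M(J) / (R2 * SQRT(R2))    ! with G = 1; F*DX etc. are the components
            FX(I) = FX(I) + F*DX
            FY(I) = FY(I) + F*DY
            FZ(I) = FZ(I) + F*DZ
         END DO
      END DO

The fast multipole and tree methods replace the inner loop over all J by a hierarchy of cells, which is exactly the kind of restructuring a hard-wired pipeline cannot absorb.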
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors. |
Medium-grain: Tens to few thousands of pieces, typically executed by microprocessors. |
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines. |
Note that a machine of given granularity can be used on algorithms of the same or finer granularity |
The last major architectural feature of a parallel machine is the network or design of hardware/software connecting processors and memories together. |
Bus: All processors (and memory) connected to a common bus or busses. |
Switching Network: Processors (and memory) connected to routing switches, as in a telephone system. |
Two dimensional grid, Binary tree, complete interconnect and 4D Hypercube. |
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are partial |
Useful terms include: |
Scalability: Can network be extended to very large systems? Related to wire length (synchronization and driving problems), degree (pinout) |
Fault Tolerance: How easily can system bypass faulty processor, memory, switch, or link? How much of system is lost by fault? |
Blocking: Some communication requests may not get through, due to conflicts caused by other requests. |
Nonblocking: All communication requests succeed. Sometimes just applies as long as no two requests are for same memory cell or processor. |
Latency (delay): Maximal time for nonblocked request to be transmitted. |
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on time needed for task. |
Wormhole Routing -- Intermediate switch nodes do not wait for the full message but allow it to pass through in small packets |
From Aspects of Computational Science, Editor Aad van der Steen, published by NCF |
System                  Communication Speed (Mbytes/sec)   Computation Speed (Mflops) |
IBM SP2                            40                           267 |
Intel iPSC860                       2.8                          60 |
Intel Paragon                     200                            75 |
Kendall Square KSR-1               17.1                          40 |
Meiko CS-2                        100                           200 |
Parsytec GC                        20                            25 |
TMC CM-5                           20                           128 |
Cray T3D                          150                           300 |
tcomm = 4 or 8 / (Communication Speed in Mbytes/sec) -- the time in microseconds to communicate one 32 bit or 64 bit word |
tfloat = 1 / (Computation Speed in Mflops) -- the time in microseconds for one floating point operation |
Thus tcomm / tfloat is just 4 X Computation Speed divided by Communication Speed (for 32 bit words) |
tcomm / tfloat is 26.7, 85, 1.5, 9.35, 8, 5, 25.6, 8 for the machines SP2, iPSC860, Paragon, KSR-1, Meiko CS2, Parsytec GC, TMC CM5, and Cray T3D respectively |
Latency makes the situation worse for small messages, and the ratio doubles for the 64 bit arithmetic natural on large problems! |
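As a check on the formula above, applied to the IBM SP2 row of the table (32 bit words, so 4 bytes per word):

$$ \frac{t_{comm}}{t_{float}} = \frac{4/\mathrm{(Communication\ Speed)}}{1/\mathrm{(Computation\ Speed)}} = \frac{4 \times 267}{40} \approx 26.7 $$

which reproduces the first entry in the list above.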
This starts by considering the analytic form for communication overhead and illustrates its stencil dependence in simple local cases -- stressing the relevance of grain size |
The implications for scaling and for generalizing from the Laplace example are covered |
We noted some useful material was missing and this was continued in the next lecture (Sept 10, 96) |
The lecture starts coverage of computer architecture, covering base technologies, with the CMOS discussed in an earlier lecture contrasted to Quantum and Superconducting technology |
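The analytic expression itself is not reproduced in these notes; as a point of reference, the standard form such an analysis takes (stated here as an assumption, not a transcription of the foil) is, for a local stencil in d dimensions with grain size n (points stored per processor),

$$ f_C \;\approx\; \frac{\mathrm{constant}}{n^{1/d}} \cdot \frac{t_{comm}}{t_{float}} $$

so the fractional communication overhead falls as the grain size grows, which is the relevance of grain size stressed above.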
This starts by filling in details of communication overhead in parallel processing for the case where the "range" of interaction is large |
We show two old examples from Caltech that illustrate the correctness of the analytic form |
We return to the discussion of computer architectures, describing |
This continues the computer architecture discussion with |
This continues the discussion of Fortran 90 with a set of overview remarks on each of the key new capabilities of this language |
We also comment on the value of Fortran90/HPF in a world that will switch to Java |
We digress to discuss a general theory of problem architectures, as this explains such things as tradeoffs |
And the types of applications each software model is designed to address |
(Note: server errors at the start confuse the audio) |
This quickly completes the discussion of problem architecture, but rather than continuing the qualitative discussion of HPF applications in the notes |
we jumped to a discussion of the HPF language, describing |
the basic approach to parallelism with the "owner-computes" rule |
and the types of new constructs, with |
TEMPLATE, ALIGN and PROCESSORS described |
The lecture started with a description of the Web based Programming Laboratory developed by Kivanc Dincer |
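As a minimal sketch of the constructs named above; the processor arrangement P, template T, array names and sizes are arbitrary choices for illustration, and DISTRIBUTE is included to complete the picture even though distribution is treated in a later lecture:

      REAL :: A(1000), B(1000)
!HPF$ PROCESSORS P(4)                 ! an abstract arrangement of 4 processors
!HPF$ TEMPLATE T(1000)                ! an abstract index space to line data up against
!HPF$ ALIGN A(I) WITH T(I)            ! A(I) and B(I) always live on the same processor
!HPF$ ALIGN B(I) WITH T(I)
!HPF$ DISTRIBUTE T(BLOCK) ONTO P      ! T, and hence A and B, split into 4 contiguous blocks
      A = A + B                       ! owner-computes: each processor updates only the block it owns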
Loosely Synchronous, Synchronous or Asynchronous classify problems by their "control" or "synchronization" structure |
However there is an important class of problems where this does not matter, as the synchronization overhead -- even in the difficult asynchronous case -- is irrelevant |
This is when the overhead is small or zero. These are "embarrassingly parallel" problems, where each component of the decomposed problem is essentially independent |
Examples are: |
We go through the 2D Laplace's Equation with both HPF and MPI for Simple Jacobi Iteration |
HPF and Fortran90 are reviewed followed by MPI |
We also discuss the structure of problems as these determine why and when certain software approaches are appropriate |
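A minimal Fortran 90 / HPF sketch of the Jacobi iteration for the 2D Laplace equation in the data parallel style; the 256 x 256 interior, the fixed 1000 sweeps and the absence of a convergence test are illustrative simplifications, not the course's actual code:

      REAL, DIMENSION(0:257,0:257) :: U, UNEW     ! interior 1:256 plus fixed boundary values
!HPF$ DISTRIBUTE U(BLOCK,BLOCK)
!HPF$ ALIGN UNEW(I,J) WITH U(I,J)
      INTEGER :: ITER
      DO ITER = 1, 1000
         ! Jacobi sweep: each interior point becomes the average of its four neighbours
         UNEW(1:256,1:256) = 0.25 * ( U(0:255,1:256) + U(2:257,1:256)    &
                                    + U(1:256,0:255) + U(1:256,2:257) )
         U(1:256,1:256) = UNEW(1:256,1:256)
      END DO

In the message passing version the same update is written with each processor owning a block of U and explicitly exchanging the one-deep "ghost" boundary rows and columns with its neighbours before every sweep.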
This continues the discussion of HPF in the area of distribution and ALIGN statements. |
The discussion of ALIGN should be improved, as the audio makes dubious statements about "broadcasting" information. |
The distribution discussion includes a reasonable description of BLOCK and CYCLIC and when you should use them. |
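A small illustration of the two distributions; the array names and sizes are made up for this sketch, with 4 processors assumed:

      REAL :: XB(16), XC(16)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE XB(BLOCK)  ONTO P    ! contiguous pieces: P(1) owns XB(1:4), P(2) owns XB(5:8), ...
!HPF$ DISTRIBUTE XC(CYCLIC) ONTO P    ! round robin: P(1) owns XC(1), XC(5), XC(9), XC(13), etc.

BLOCK keeps neighbouring elements together and so minimizes communication for local (stencil) operations like Jacobi; CYCLIC spreads work evenly when the active part of an array shrinks or varies, as in LU factorization.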
This covers two topics: |
Monte Carlo Integration for large scale Problems using Experimental and Theoretical high energy physics as an example |
This includes accept-reject methods, uniform weighting and parallel algorithms |
Then we complete the HPF discussion with the embarrassingly parallel DO INDEPENDENT, discussed in the Monte Carlo case |
And HPF2 Changes |
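A hedged sketch of the kind of embarrassingly parallel loop DO INDEPENDENT is meant for; the integrand (whose average over (0,1) is pi), the sample count, and the data layout are assumptions for illustration only:

      INTEGER, PARAMETER :: N = 1000000
      INTEGER :: I
      REAL :: X(N), F(N), ESTIMATE
!HPF$ DISTRIBUTE X(BLOCK)
!HPF$ ALIGN F(I) WITH X(I)
      CALL RANDOM_NUMBER(X)                  ! N independent samples in (0,1)
!HPF$ INDEPENDENT
      DO I = 1, N                            ! iterations touch disjoint data: embarrassingly parallel
         F(I) = 4.0 / (1.0 + X(I)*X(I))
      END DO
      ESTIMATE = SUM(F) / N                  ! the Monte Carlo estimate (here, of pi)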
This starts by finishing the simple overview of statistics |
Covering Gaussian Random Numbers, Numerical Generation of Random Numbers both sequentially and in parallel |
Then we describe the central limit theorem which underlies the Monte Carlo method |
Then it returns to Numerical Integration with the first part of the discussion of Monte Carlo Integration |
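As background (a standard result rather than a transcription of the foils): for N independent uniform samples $x_i$ of an integrand $f$ on the unit volume,

$$ I \;\approx\; \frac{1}{N}\sum_{i=1}^{N} f(x_i), \qquad \mathrm{error} \;\sim\; \frac{\sigma_f}{\sqrt{N}} $$

where $\sigma_f$ is the standard deviation of $f$. The $1/\sqrt{N}$ falloff is exactly what the central limit theorem guarantees, and it is independent of dimension, which is why Monte Carlo is attractive for the high dimensional integrals of particle physics.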
This finishes the last part of the N body and ODE discussion, focusing on the pipeline data parallel algorithm |
Note several foils were changed after the presentation, so the discussion is a little disconnected from the foils at times |
We start Numerical Integration with a basic discussion of Newton-Cotes formulae (including Trapezoidal and Simpson's rule) |
We illustrate them pictorially |
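For reference, the two Newton-Cotes formulae named above in their standard composite form, for N equal intervals of width $h = (b-a)/N$ on $[a,b]$ with $f_i = f(a + ih)$:

$$ \mathrm{Trapezoidal:}\quad \int_a^b f\,dx \;\approx\; h\left[\tfrac{1}{2}f_0 + f_1 + \dots + f_{N-1} + \tfrac{1}{2}f_N\right] $$
$$ \mathrm{Simpson\ (N\ even):}\quad \int_a^b f\,dx \;\approx\; \frac{h}{3}\left[f_0 + 4f_1 + 2f_2 + 4f_3 + \dots + 2f_{N-2} + 4f_{N-1} + f_N\right] $$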
This completes the MPI general discussion with the basic message passing, collective communication and some advanced features |
It then returns to the Laplace Example foilset to show how MPI can be used here |
This starts the basic module on Partial Differential Equation solvers with |
Introduction to Classification of Equations |
Basic Discretization |
Derivation of Sparse Matrix Formulation |
Analogies of Iteration with Artificial Time |
Start of Explicit Matrix Formulation for Simple Cases |
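As a reminder of the basic discretization step listed above (the standard 5 point form for the 2D Laplace equation on a uniform grid of spacing h):

$$ \nabla^2 u \approx \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j}}{h^2} = 0
   \;\;\Longrightarrow\;\; u_{i,j} = \tfrac{1}{4}\left(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}\right) $$

Writing this relation for every interior grid point is what produces the sparse matrix formulation, and repeatedly applying the right hand side (Jacobi iteration) is the "artificial time" view of the solution process.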
Tom Haupt on Fortran 90 / HPF |
This discusses the solution of systems of ordinary differential equations (ODEs) in the context of O(N^2) particle dynamics problems |
We start with motivation, with a brief discussion of relevant problems |
We go through the basic theory of ODEs including Euler, Runge-Kutta, Predictor-Corrector and Multistep Methods |
We begin the discussion of solving N body problems using the classic Connection Machine algorithm -- elegant but inefficient |
Note -- some foils were expanded to two after the talk and the second one is given without audio in the script |
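For $dy/dt = f(t,y)$ with step h, the two simplest schemes named above take the standard forms (background, not a transcript of the foils):

$$ \mathrm{Euler:}\quad y_{n+1} = y_n + h\,f(t_n, y_n) $$
$$ \mathrm{RK4:}\quad k_1 = f(t_n, y_n),\;\; k_2 = f(t_n + \tfrac{h}{2}, y_n + \tfrac{h}{2}k_1),\;\; k_3 = f(t_n + \tfrac{h}{2}, y_n + \tfrac{h}{2}k_2),\;\; k_4 = f(t_n + h, y_n + h k_3) $$
$$ y_{n+1} = y_n + \tfrac{h}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right) $$

Euler is first order and cheap; fourth order Runge-Kutta costs four evaluations of f per step but is far more accurate, which matters when each evaluation is the O(N^2) force calculation of the N body problem.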
This started with a description of the current Web set-up of CPS615 and other foilsets |
Then we started the foilset describing Physical Simulations and the various approaches -- Continuum Physics, Monte Carlo, Quantum Dynamics, and Computational Fluid Dynamics |
For CFD we do enough to discuss why viscosity and High Reynolds numbers are critical in air and similar media |
We discuss computation and communication needs of CFD compared to Laplace equation |
This covers essentially all of the finite element method and its solution using the conjugate gradient method |
Only the simple 2D Laplace equation using triangular nodes is discussed |
We stress the variational method as an optimization method and use this analogy to motivate conjugate gradient as an improved steepest descent approach |
We discuss parallel computing issues for both finite element and conjugate gradient |
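For reference, one iteration of conjugate gradient for $Ax = b$ in its standard form (the lecture's notation may differ), starting from $r_0 = b - Ax_0$, $p_0 = r_0$:

$$ \alpha_k = \frac{r_k^T r_k}{p_k^T A p_k},\quad x_{k+1} = x_k + \alpha_k p_k,\quad r_{k+1} = r_k - \alpha_k A p_k,\quad
   \beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k},\quad p_{k+1} = r_{k+1} + \beta_k p_k $$

The parallel work per iteration is one sparse matrix-vector product, two dot products (global reductions), and a few vector updates, which is why the finite element and Laplace parallelization strategies carry over directly.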
This lecture covers two distinct areas. |
Firstly, a short discussion of Linear Programming -- what types of problems it is used for, what the equations look like, and basic issues in the difficult use of parallel processing |
Then we give an abbreviated discussion of Full Matrix algorithms covering |