Full HTML for

Scripted foilset CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures

Given by Geoffrey C. Fox, (Some Culler, Koelbel material) at CPS615 Basic Simulation Track for Computational Science on Fall Semester 98. Foils prepared 23 August 1998
Outside Index Summary of Material


We Introduce Computational Science and Driving Forces
  • Technology Advances and Commodity Trends
  • Inevitability of Parallelism
  • Integration of Distributed and Parallel Computing
  • Comparison with Internetics
We give a simple overview of parallel architectures today with distributed, shared or distributed shared memory
We describe data, functional and pleasing parallelism
We describe principles of parallel programming using atmospheric simulation as an example
We describe the growing importance of Java
We explain pragmatic choices
  • MPI with Fortran and C today
  • Java Grande is future?

Table of Contents for full HTML of CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures

Denote Foils where Image Critical
Denote Foils where Image has important information
Denote Foils where HTML is sufficient
Denote Foils where Image is not available
Indicates Available audio which is lightpurpleed out if missing
1 Overview of Computational Science
2 Abstract of Computational Science Presentation
3 Some Notes on Lecture Technology
4 What is Computational Science ?
5 Synergy of Parallel Computing and Web Internetics as Unifying Principle
6 Basic CPS615 Contact Points
7 Course Organization
8 Material Covered in this Course
9 Structure of CPS615 - II
10 Basic Structure of Complete CPS615 Base Course on Computational Science Simulation Track -- III
11 What are Parallel and Distributed Computing?
12 Why Parallel Computing?
13 Parallel Computing Technology Rationale
14 Motivating Applications
15 Performance of High End Machines Years 1940-2000
16 Performance of High End Machines Years 1980-2000
17 Peak Supercomputer Performance
18 Some Comments on Simulation and HPCC
19 The Multicomputer: an Idealized Parallel Computer
20 Multicomputer Architecture
21 Multicomputer Cost Model
22 Sequential Memory Structure
23 Parallel Computer Memory Structure
24 Real Parallel Computers Architectures
25 Parallel Computers -- Classic Overview
26 Distributed Memory MIMD Multiprocessor
27 Distributed Memory Machines
28 Distributed Memory Machines -- Notes
29 Shared Memory MIMD Multiprocessor
30 Shared-Memory Machines
31 Shared-Memory Machines -- Notes
32 Distributed Shared Memory (DSM)
33 Distributed Shared Memory Machines
34 Workstation Clusters
35 Parallel Algorithms
36 Data Parallelism in Algorithms
37 Some Illustrative Examples of Parallel Applications!
38 Concurrent Computation as a Mapping Problem -I
39 Concurrent Computation as a Mapping Problem - II
40 Concurrent Computation as a Mapping Problem - III
41 Finite Element Mesh From Nastran
(mesh only shown in upper half)

42 A Simple Equal Area Decomposition
43 Decomposition After Annealing
(one particularly good but nonoptimal decomposition)

44 Functional Parallelism in Algorithms
45 Structure(Architecture) of Applications - I
46 Structure(Architecture) of Applications - II
47 Multi Server Model for metaproblems
48 Multi-Server Gateway Tier
49 Pleasingly Parallel Algorithms
50 Parallel Languages
51 Data-Parallel Languages
52 Message-Passing Systems
53 A Simple Parallel Programming Model
54 Properties of Programming Model
55 Some Steps in Parallel Programming
56 Partitioning
57 Communication
58 Agglomeration
59 Mapping
60 What is Parallel Architecture?
61 Why Study Parallel Architecture as a computer scientist?
62 Why Study Architecture Today?
63 Inevitability of Parallel Computing
64 Application Trends
65 TPC-C (database transaction processing)
66 Summary of Application Trends
67 Technology Trends -- CPU's
68 General Technology Trends
69 Technology: A Closer Look
70 Clock Frequency Growth Rate
71 Transistor Count Growth Rate
72 Similar Story for Storage
73 Parallel Processing and Society
74 Concurrent Construction of a Wall
Using N = 8 Bricklayers
Decomposition by Vertical Sections

75 Quantitative Speed-Up Analysis for Construction of Hadrian's Wall
76 Amdahl's law for Real World Parallel Processing
77 Pipelining --Another Parallel Processing Strategy for Hadrian's Wall
78 Hadrian's Wall Illustrates that the Topology of Processor Must Include Topology of Problem
79 General Speed Up Analysis
80 Nature's Concurrent Computers
81 Comparison of Concurrent Processing in Society and Computing
82 Comparison of The Complete Problem to the subproblems formed in domain decomposition
83 Hadrian's Wall Illustrating an
Irregular but Homogeneous Problem

84 Some Problems are Inhomogeneous Illustrated by:
An Inhomogeneous Hadrian Wall with Decoration

85 Global and Local Parallelism Illustrated by Hadrian's Wall
86 Parallel I/O Illustrated by
Concurrent Brick Delivery for Hadrian's Wall
Bandwidth of Trucks and Roads
Matches that of Masons

87 Example: Atmosphere Model
88 Atmosphere Model: Numerical Methods
89 Atmosphere Model: Partition
90 Atmosphere Model: Communication
91 Atmosphere Model: Agglomeration
92 Atmosphere Model: Mapping
93 The HPCC Dilemma and its Solution
94 What is Commodity Software
95 The Computing Pyramid
96 Implications of the Computing Pyramid
97 The 3 Roles of Java
98 Why is Java Worth Looking at?
99 What is Java Grande?
100 Java and Parallelism?
101 "Pure" Java Model For Parallelism
102 Java for Scientific Computing Resource
103 Pragmatic Computational Science August 1998

Outside Index Summary of Material



HTML version of Scripted Foils prepared 23 August 1998

Foil 1 Overview of Computational Science

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Fall Semester 98 1998
Geoffrey Fox
http://www.npac.syr.edu/projects/jsufall98
Northeast Parallel Architectures Center
Syracuse University
111 College Place
Syracuse NY
gcf@npac.syr.edu

HTML version of Scripted Foils prepared 23 August 1998

Foil 2 Abstract of Computational Science Presentation

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
We Introduce Computational Science and Driving Forces
  • Technology Advances and Commodity Trends
  • Inevitability of Parallelism
  • Integration of Distributed and Parallel Computing
  • Comparison with Internetics
We give a simple overview of parallel architectures today with distributed, shared or distributed shared memory
We describe data, functional and pleasing parallelism
We describe principles of parallel programming using atmospheric simulation as an example
We describe the growing importance of Java
We explain pragmatic choices
  • MPI with Fortran and C today
  • Java Grande is future?

HTML version of Scripted Foils prepared 23 August 1998

Foil 3 Some Notes on Lecture Technology

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
There is a conventional website http://www.npac.syr.edu/projects/jsufall98
This will hold links to all presentations given in the class and to other resources at both NPAC and elsewhere
This can be used to review material before and after classes -- in the lingo, it is an asynchronous learning resource
We teach the class from Syracuse twice a week for 70 minutes at 5.30pm Syracuse, 4.30 pm Jackson time.
This teaching uses Tango to deliver electronic material remotely
Tango supports inter alia chat rooms, whiteboards, and the sharing of web pages. These are web pages available at "conventional website" above. This is synchronous learning but it uses same material you would use offline.

HTML version of Scripted Foils prepared 23 August 1998

Foil 4 What is Computational Science ?

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Practical use of leading edge computer science technologies to address "real" applications
Two tracks at Syracuse
CPS615/713: Simulation Track
CPS606/616/640/714: Information Track
Large Scale Parallel Computing Low latency Closely Coupled Components
World Wide Distributed Computing Loosely Coupled Components
Parallel Algorithms Fortran90
HPF MPI Interconnection Networks
Transactions
Security
Compression
PERL JavaScript Multimedia
Wide Area Networks
Java
VRML Collaboration Integration (middleware) Metacomputing CORBA Databases

HTML version of Scripted Foils prepared 23 August 1998

Foil 5 Synergy of Parallel Computing and Web Internetics as Unifying Principle

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
The two forms of Large Scale Computing Scale Computer for Scale Users in Proportion Power User to number of computers
Parallel Commodity Distributed Computers Information Systems Technology <--------------- Internetics Technologies --------------->
Parallel Computer Distributed Computer
HPF MPI HPJava HTML VRML

HTML version of Scripted Foils prepared 23 August 1998

Foil 6 Basic CPS615 Contact Points

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Instructor: Geoffrey Fox -- gcf@npac.syr.edu, 3154432163 and Room 3-131 CST
Backup: Nancy McCracken -- njm@npac.syr.edu, 3154434687 and Room 3-234 CST
Grader: Saleh Elmohamed -- saleh@npac.syr.edu, 3154431073
NPAC administrative support: Nicole Caza -- ncaza@npac.syr.edu, 3154431722 and room 3-206 CST
The above can be reached at cps615ad@npac.syr.edu
Students will be on alias cps615@npac.syr.edu
Homepage is: http://www.npac.syr.edu/projects/jsufall98

HTML version of Scripted Foils prepared 23 August 1998

Foil 7 Course Organization

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Graded on the basis of approximately 8 Homework sets which will be due Thursday of the week following day (Tuesday or Thursday given out)
There will be one project -- which will start after message passing (MPI) discussed
Total grade is 70% homework, 30% project
Languages will Fortran or C and Java -- we will do a survey early on to clarify this
All homework will be handled through the web and indeed all computer access will be through the VPL or Virtual Programming Laboratory which gives access to compilers, Java visualization etc. through the web

HTML version of Scripted Foils prepared 23 August 1998

Foil 8 Material Covered in this Course

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Status of High Performance Computing and Computation HPCC nationally
What is Computational Science Nationally (and at Syracuse)
Technology driving forces
Moore's law and exponentially increasing transistors
Elementary discussion of Parallel Computing in Society and why it must obviously work in simulation!
Sequential and Parallel Computer Architectures
Comparison of Architecture of World Wide Web and Parallel Systems

HTML version of Scripted Foils prepared 23 August 1998

Foil 9 Structure of CPS615 - II

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Simple base Example -- Laplace's Equation
  • Illustrate parallel computing
  • Illustrate use of Web
Then we discuss software and algorithms (mini-applications) intermixed
Programming Models -- SPMD and Message Passing (MPI) with Fortran, C and Java
Data Parallel (HPF, Fortran90) languages will be discussed but NOT used
Visualization -- Java Applets
Other tools such as Collaboration , Performance Measurement will be mentioned

HTML version of Scripted Foils prepared 23 August 1998

Foil 10 Basic Structure of Complete CPS615 Base Course on Computational Science Simulation Track -- III

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Secs 96
This introduction is followed by a set of "vignettes" discussing problem classes which illustrate parallel programming and parallel algorithms
Ordinary Differential Equations
  • N body Problem by both O(N^2) and "fast multipole" O(N) method
Numerical Integration including adaptive methods
Floating Point Arithmetic
Monte Carlo Methods including Random Numbers
Full Matrix Algebra as in
  • Computational Electromagnetism
  • Computational Chemistry
Partial Differential Equations implemented as sparse matrix problems (as in Computational Fluid Dynamics)
  • Iterative Algorithms from Gauss Seidel to Conjugate Gradient
  • Finite Element Methods

HTML version of Scripted Foils prepared 23 August 1998

Foil 11 What are Parallel and Distributed Computing?

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
It is multiple processes running on multiple computers which are coordinated to solve a single problem!
Distributed Computing can be defined in same way
Parallel computing has tight coordination (implying low latency and high bandwidth communication)
Distributed computing has looser constraints between component processes

HTML version of Scripted Foils prepared 23 August 1998

Foil 12 Why Parallel Computing?

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Continuing demands for higher performance
Physical limits on single processor performance
Increasing number of transistors in a single chip or on a single board implies parallel architectures
High costs of internal concurrency
Result is rise of multiprocessor architectures
  • And number of processors will continue to increase
Increasing importance of DISTRIBUTED computing through the Web and Internet
  • or Intranets which is the Internet internal to organizations
  • Distributed computing is a much larger field

HTML version of Scripted Foils prepared 23 August 1998

Foil 13 Parallel Computing Technology Rationale

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Transistors are getting cheaper and cheaper and it only takes some 0.5 million transistors to make a very high quality CPU
  • Essentially impossible to increase clock speed and so must exploit increasing transistor density in figure of merit (1/f)2-4
Already we build chips with some factor of ten more transistors than this and this is used for "automatic" instruction level parallelism.
  • This corresponds to parallelism in "innermost loops"
However getting much more speedup than this requires use of "outer loop" or data parallelism.
Actually memory bandwidth is an essential problem in any computer as doing more computations per second requires accessing more memory cells per second!
  • Harder for sequential than parallel computers
  • Data locality is unifying concept!

HTML version of Scripted Foils prepared 23 August 1998

Foil 14 Motivating Applications

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Large Scale Simulations in Engineering
  • Model airflow around an aircraft
  • Study environmental issues -- flow of contaminants
  • Forecast weather
  • Oil Industry: Reservoir Simulation and analysis of Seismic data
Large Scale Academic Simulations (Physics, Chemistry, Biology)
  • Study of Evolution of Universe
  • Study of fundamental particles: Quarks and Gluons
  • Study of protein folding
  • Study of catalysts
"Other Critical Real World Applications"
  • Run optimization and classification algorithms in datamining of Enterprise Information Systems
  • Model Financial Instruments

HTML version of Scripted Foils prepared 23 August 1998

Foil 15 Performance of High End Machines Years 1940-2000

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Secs 72

HTML version of Scripted Foils prepared 23 August 1998

Foil 16 Performance of High End Machines Years 1980-2000

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index Secs 33

HTML version of Scripted Foils prepared 23 August 1998

Foil 17 Peak Supercomputer Performance

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
For "Convential" MPP/Distributed Shared Memory Architecture
Now(1996) Peak is 0.1 to 0.2 Teraflops in Production Centers
  • Note both SGI and IBM are changing architectures:
  • IBM Distributed Memory to Distributed Shared Memory
  • SGI Shared Memory to Distributed Shared Memory
In 1999, one will see production 1 Teraflop systems
In 2003, one will see production 10 Teraflop Systems
In 2007, one will see production 50-100 Teraflop Systems
Memory is Roughly 0.25 to 1 Terabyte per 1 Teraflop
If you are lucky/work hard: Realized performance is 30% of Peak

HTML version of Scripted Foils prepared 23 August 1998

Foil 18 Some Comments on Simulation and HPCC

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
HPCC is a maturing field with many organizations installing large scale systems
These include NSF (academic computations) with PACI activity, DoE (Dept of Energy) with ASCI and DoD (Defense) with Modernization
There are new applications with new algorithmic challenges
  • Our work on Black Holes has novel adaptive meshes
  • On earthquake simulation, new "fast multipole" approaches to a problem not tackled this way before
  • On financial modeling, new Monte Carlo methods for complex options
However these are not "qualitatively" new concepts
Software ideas are "sound" but note simulation is only a few percent of total computer market and software immature
  • Use where possible software from other sources (such as web)which can spend more money on each feature!

HTML version of Scripted Foils prepared 23 August 1998

Foil 19 The Multicomputer: an Idealized Parallel Computer

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
P
Memory
P
P
Node
Node
Node
Node
Typically a Bus on a board

HTML version of Scripted Foils prepared 23 August 1998

Foil 20 Multicomputer Architecture

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Multicomputer = nodes + network
Node = processor(s) + local memory
Access to local memory is cheap
  • 10s of cycles; does not involve network
  • Conventional memory reference
Access to remote memory is expensive
  • 100s or 1000s of cycles; uses network
  • Use I/O-like mechanisms (e.g., send/receive)

HTML version of Scripted Foils prepared 23 August 1998

Foil 21 Multicomputer Cost Model

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Cost of remote memory access/communication (including synchronization)
  • T = ts + N tw
  • ts = per-message cost ("latency")
  • tw = per-word cost
  • N = message size in words
Hence data locality is an important property of good parallel algorithms
N
T
ts
tw

HTML version of Scripted Foils prepared 23 August 1998

Foil 22 Sequential Memory Structure

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Data locality implies CPU finds information it needs in cache which stores most recently accessed information
This means one reuses a given memory reference in many nearby computations e.g.
A1 = B*C
A2 = B*D + B*B
.... Reuses B
Processor
Cache
L2 Cache
L3 Cache
Main
Memory
Disk
Increasing Memory
Capacity Decreasing
Memory Speed (factor of 100 difference between processor
and main memory
speed)

HTML version of Scripted Foils prepared 23 August 1998

Foil 23 Parallel Computer Memory Structure

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
For both parallel and sequential computers, cost is accessing remote memories with some form of "communication"
Data locality addresses in both cases
Differences are quantitative size of effect and what is done by user and what automatically
Processor
Cache
L2 Cache
L3 Cache
Main
Memory
Processor
Cache
L2 Cache
Processor
Cache
L2 Cache
Board Level Interconnection Networks
....
....
System Level Interconnection Network
L3 Cache
Main
Memory
L3 Cache
Main
Memory
Slow
Very Slow

HTML version of Scripted Foils prepared 23 August 1998

Foil 24 Real Parallel Computers Architectures

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Major architecture types
  • Distributed memory MIMD
  • Shared memory MIMD
  • Distributed shared memory (DSM)
  • Workstation clusters and "metacomputers"
  • An older design SIMD where processors run in lockstep has fallen out of favor
Idealized Multicomputer Model fits current architectures pretty well

HTML version of Scripted Foils prepared 23 August 1998

Foil 25 Parallel Computers -- Classic Overview

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Parallel computers allow several CPUs to contribute to a computation simultaneously and different architectures take different steps to enable this coordinated action.
For our purposes, a parallel computer has three types of parts:
  • Processors
  • Memory modules
  • Communication / synchronization network
Some Key points: Easier in Some Architectures/Applications than others!
  • All processors must be busy for peak speed.
  • Local memory is directly connected to each processor.
  • Accessing local memory is much faster than other memory.
  • Synchronization is expensive, but necessary for correctness.
Colors Used in Following pictures

HTML version of Scripted Foils prepared 23 August 1998

Foil 26 Distributed Memory MIMD Multiprocessor

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Multiple Instruction/Multiple Data
Processors with local memory connected by high-speed interconnection network
Typically high bandwidth, medium latency
Hardware support for remote memory access
Model breaks down when topology matters
Examples: Cray T3E, IBM SP

HTML version of Scripted Foils prepared 23 August 1998

Foil 27 Distributed Memory Machines

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Every processor has a memory others can't access.
Advantages:
  • Relatively easy to design and build
  • Predictable (if modest) behavior
  • Can be scalable
  • Can hide latency of communication
Disadvantages:
  • Hard to program
  • Program and O/S (and sometimes data) must be replicated

HTML version of Scripted Foils prepared 23 August 1998

Foil 28 Distributed Memory Machines -- Notes

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Conceptually, the nCUBE CM-5 Paragon SP-2 Beowulf PC cluster are quite similar.
  • Bandwidth and latency of interconnects different
  • The network topology is a two-dimensional torus for Paragon, fat tree for CM-5, hypercube for nCUBE and Switch for SP-2
To program these machines:
  • Divide the problem to minimize number of messages while retaining parallelism
  • Convert all references to global structures into references to local pieces (explicit messages convert distant to local variables)
  • Optimization: Pack messages together to reduce fixed overhead (almost always needed)
  • Optimization: Carefully schedule messages (usually done by library)

HTML version of Scripted Foils prepared 23 August 1998

Foil 29 Shared Memory MIMD Multiprocessor

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Processors access shared memory via bus
Low latency, high bandwidth
Bus contention limits scalability
Search for scalability introduces locality
  • Cache (a form of local memory)
  • Multistage architectures (some memory closer)
Examples: Cray T90, SGI Power Challenge, Sun

HTML version of Scripted Foils prepared 23 August 1998

Foil 30 Shared-Memory Machines

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
All processors access the same memory.
Advantages:
  • Retain sequential programming languages such as Java or Fortran
  • Easy to program (correctly)
  • Can share code and data among processors
Disadvantages:
  • Hard to program (optimally)
  • Not scalable due to bandwidth limitations in bus

HTML version of Scripted Foils prepared 23 August 1998

Foil 31 Shared-Memory Machines -- Notes

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Interconnection network shown here is actually for the BBN Butterfly, but C-90 is in the same spirit.
These machines share data by direct access.
  • Potentially conflicting accesses must be protected by synchronization.
  • Simultaneous access to the same memory bank will cause contention, degrading performance.
  • Some access patterns will collide in the network (or bus), causing contention.
  • Many machines have caches at the processors.
  • All these features make it profitable to have each processor concentrate on one area of memory that others access infrequently.

HTML version of Scripted Foils prepared 23 August 1998

Foil 32 Distributed Shared Memory (DSM)

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
A hybrid of distributed and shared memory
Small groups of processors share memory; others access across a scalable network
Low to moderate latency, high bandwidth
Model simplifies the multilevel hierarchy
Examples: SGI Origin, HP Exemplar

HTML version of Scripted Foils prepared 23 August 1998

Foil 33 Distributed Shared Memory Machines

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Combining the (dis)advantages of shared and distributed memory
Lots of hierarchical designs are appearing.
  • Typically, "shared memory nodes" with 4 to 32 processors
  • Each processor has a local cache
  • Processors within a node access shared memory
  • Nodes can get data from or put data to other nodes' memories

HTML version of Scripted Foils prepared 23 August 1998

Foil 34 Workstation Clusters

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Workstations connected by network
Cost effective
High latency, low to moderate bandwidth
Often lack integrated software environment
Multicomputer Model breaks down if connectivity limited
Example Connectivities: Ethernet, ATM crossbar, Myrinet
"Ordinary" Network
Ordinary PC's or
Workstations

HTML version of Scripted Foils prepared 23 August 1998

Foil 35 Parallel Algorithms

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
A parallel algorithm is a collection of tasks and a partial ordering between them.
Design goals: We will return to this later on
  • Match tasks to the available processors (exploit parallelism).
  • Minimize ordering (avoid unnecessary synchronization points).
  • Recognize ways parallelism can be helped by changing ordering
Sources of parallelism: We will discuss this now
  • Data parallelism: updating array elements simultaneously.
  • Functional parallelism: conceptually different tasks which combine to solve the problem. This happens at fine and coarse grain size
    • fine is "internal" such as I/O and computation; coarse is "external" such as separate modules linked together

HTML version of Scripted Foils prepared 23 August 1998

Foil 36 Data Parallelism in Algorithms

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Data-parallel algorithms exploit the parallelism inherent in many large data structures.
  • A problem is an (identical) algorithm applied to multiple points in data "array"
  • Usually iterate over such "updates"
  • Example: Red-Black Relaxation
    • All "red" points can be updated in parallel; then all the "black"
Analysis:
  • Scalable parallelism -- can usually get million or more way parallelism
  • Hard to express when "geometry" irregular or dynamic

HTML version of Scripted Foils prepared 23 August 1998

Foil 37 Some Illustrative Examples of Parallel Applications!

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Now we go through a set of semi-realistic examples of parallel computing and show they use various forms of data parallelism
Seismic Wave Simulation: Regular mesh of grid points
Cosmology (Universe) Simulation: Irregular collection of stars or galaxies
Computer Chess: "Data" is now parts of a game tree with major complication of pruning algorithms
  • "Deep Blue" uses this global parallelism combined with optimal locally parallel evaluation of "score" of a position using special hardware
Defending the Nation: Tracking multiple missiles achieves parallelism from set of missiles
Finite Element (NASTRAN) structures problem: Irregular collection of nodal (grid) points clustered near a crack

HTML version of Scripted Foils prepared 23 August 1998

Foil 38 Concurrent Computation as a Mapping Problem -I

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
2 Different types of Mappings in Physical Spaces
Both are static
  • a) Seismic Migration with domain decomposition on 4 nodes
  • b)Universe simulation with irregular data but static 16 node decomposition
  • but this problem would be best with dynamic irregular decomposition

HTML version of Scripted Foils prepared 23 August 1998

Foil 39 Concurrent Computation as a Mapping Problem - II

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Different types of Mappings -- A very dynamic case without any underlying Physical Space
c)Computer Chess with dynamic game tree decomposed onto 4 nodes

HTML version of Scripted Foils prepared 23 August 1998

Foil 40 Concurrent Computation as a Mapping Problem - III

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index

HTML version of Scripted Foils prepared 23 August 1998

Foil 41 Finite Element Mesh From Nastran
(mesh only shown in upper half)

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index

HTML version of Scripted Foils prepared 23 August 1998

Foil 42 A Simple Equal Area Decomposition

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
And the corresponding poor workload balance

HTML version of Scripted Foils prepared 23 August 1998

Foil 43 Decomposition After Annealing
(one particularly good but nonoptimal decomposition)

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
And excellent workload balance

HTML version of Scripted Foils prepared 23 August 1998

Foil 44 Functional Parallelism in Algorithms

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Functional parallelism exploits the parallelism between the parts of many systems.
  • Many pieces to work on ? many independent operations
  • Example: Coarse grain Aeroelasticity (aircraft design)
    • CFD(fluids) and CSM(structures) and others (acoustics, electromagnetics etc.) can be evaluated in parallel
Analysis:
  • Parallelism limited in size -- tens not millions
  • Synchronization probably good as parallelism natural from problem and usual way of writing software
  • Web exploits functional parallelism NOT data parallelism

HTML version of Scripted Foils prepared 23 August 1998

Foil 45 Structure(Architecture) of Applications - I

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Applications are metaproblems with a mix of module (aka coarse grain functional) and data parallelism
Modules are decomposed into parts (data parallelism) and composed hierarchically into full applications.They can be the
  • "10,000" separate programs (e.g. structures,CFD ..) used in design of aircraft
  • the various filters used in Khoros based image processing system
  • the ocean-atmosphere components in integrated climate simulation
  • The data-base or file system access of a data-intensive application
  • the objects in a distributed Forces Modeling Event Driven Simulation

HTML version of Scripted Foils prepared 23 August 1998

Foil 46 Structure(Architecture) of Applications - II

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Modules are "natural" message-parallel components of problem and tend to have less stringent latency and bandwidth requirements than those needed to link data-parallel components
  • modules are what HPF needs task parallelism for
  • Often modules are naturally distributed whereas parts of data parallel decomposition may need to be kept on tightly coupled MPP
Assume that primary goal of metacomputing system is to add to existing parallel computing environments, a higher level supporting module parallelism
  • Now if one takes a large CFD problem and divides into a few components, those "coarse grain data-parallel components" will be supported by computational grid technology
Use Java/Distributed Object Technology for modules -- note Java to growing extent used to write servers for CORBA and COM object systems

HTML version of Scripted Foils prepared 23 August 1998

Foil 47 Multi Server Model for metaproblems

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
We have multiple supercomputers in the backend -- one doing CFD simulation of airflow; another structural analysis while in more detail you have linear algebra servers (Netsolve); Optimization servers (NEOS); image processing filters(Khoros);databases (NCSA Biology workbench); visualization systems(AVS, CAVEs)
  • One runs 10,000 separate programs to design a modern aircraft which must be scheduled and linked .....
All linked to collaborative information systems in a sea of middle tier servers(as on previous page) to support design, crisis management, multi-disciplinary research

HTML version of Scripted Foils prepared 23 August 1998

Foil 48 Multi-Server Gateway Tier

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Database
Matrix Solver
Optimization Service
MPP
MPP
Parallel DB Proxy
NEOS Control Optimization
Origin 2000 Proxy
NetSolve Linear Alg. Server
IBM SP2 Proxy
Gateway Control
Agent-based Choice of Compute Engine
Multidisciplinary Control (WebFlow)
Data Analysis Server

HTML version of Scripted Foils prepared 23 August 1998

Foil 49 Pleasingly Parallel Algorithms

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Many applications are what is called (essentially) embarrassingly or more kindly pleasingly parallel
These are made up of independent concurrent components
  • Each client independently accesses a Web Server
  • Each roll of a Monte Carlo dice (random number) is an independent sample
  • Each stock can be priced separately in a financial portfolio
  • Each transaction in a database is almost independent (a given account is locked but usually different accounts are accessed at same time)
  • Different parts of Seismic data can be processed independently
In contrast points in a finite difference grid (from a differential equation) canNOT be updated independently
Such problems are often formally data-parallel but can be handled much more easily -- like functional parallelism

HTML version of Scripted Foils prepared 23 August 1998

Foil 50 Parallel Languages

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
A parallel language provides an executable notation for implementing a parallel algorithm.
Design criteria:
  • How are parallel operations defined?
    • static tasks vs. dynamic tasks vs. implicit operations
  • How is data shared between tasks?
    • explicit communication/synchronization vs. shared memory
  • How is the language implemented?
    • low-overhead runtime systems vs. optimizing compilers
Usually a language reflects a particular style of expressing parallelism.
Data parallel expresses concept of identical algorithm on different parts of array
Message parallel expresses fact that at low level parallelism implis information is passed between different concurrently executing program parts

HTML version of Scripted Foils prepared 23 August 1998

Foil 51 Data-Parallel Languages

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Data-parallel languages provide an abstract, machine-independent model of parallelism.
  • Fine-grain parallel operations, such as element-wise operations on arrays
  • Shared data in large, global arrays with mapping "hints"
  • Implicit synchronization between operations
  • Partially explicit communication from operation definitions
Advantages:
  • Global operations conceptually simple
  • Easy to program (particularly for certain scientific applications)
Disadvantages:
  • Unproven compilers
  • As express "problem" can be inflexible if new algorithm which language didn't express well
Examples: HPF, C*, HPC++
Originated on SIMD machines where parallel operations are in lock-step but generalized (not so successfully as compilers too hard) to MIMD

HTML version of Scripted Foils prepared 23 August 1998

Foil 52 Message-Passing Systems

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Program is based on typically coarse-grain tasks
Separate address space and a processor number for each task
Data shared by explicit messages
  • Point-to-point and collective communications patterns
Examples: MPI, PVM, Occam for parallel computing
Universal model for distributed computing to link naturally decomposed parts e.g. HTTP, RMI, IIOP etc. are all message passing
  • distributed object technology (COM, CORBA) built on functionally concurrent objects sending and receiving messages
Advantages:
  • Close to hardware ALWAYS
  • Can be close to problem as in distributed objects or functional parallelism
Disadvantages:
  • Many low-level details when NOT close to problem.

HTML version of Scripted Foils prepared 23 August 1998

Foil 53 A Simple Parallel Programming Model

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
A parallel computation is a set of tasks
Each task has local data, can be connected to other tasks by channels
A task can
  • Compute using local data
  • Send to/receive from other tasks
  • Create new tasks, or terminate itself
A receiving task blocks until data available
Note ALL coordination is message passing

HTML version of Scripted Foils prepared 23 August 1998

Foil 54 Properties of Programming Model

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Concurrency is enhanced by creating multiple tasks
Scalability: More tasks than processor nodes
Locality: Access local data when possible -- minimize message traffic
A task (with local data and subtasks) is a unit for modular design -- it is an object!
Mapping to nodes affects performance only

HTML version of Scripted Foils prepared 23 August 1998

Foil 55 Some Steps in Parallel Programming

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Partition
  • Define tasks
Communication
  • Identify requirements
Agglomeration
  • Enhance locality
Mapping
  • Place tasks

HTML version of Scripted Foils prepared 23 August 1998

Foil 56 Partitioning

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Goal: identify opportunities for concurrent execution (define tasks: computation+data)
Focus on data operated on by algorithm ...
  • Then distribute computation appropriately
  • Domain decomposition or Data Parallelism
... or on the operations performed
  • Then distribute data appropriately
  • Functional decomposition

HTML version of Scripted Foils prepared 23 August 1998

Foil 57 Communication

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Identify communication requirements
  • If computation in one task requires data located in another, communication is needed
Example: finite difference computation
Must communicate with each neighbor
Xi = (Xi-1 + 2*Xi + Xi+1)/4
Partition creates
one task per point
I=2 above

HTML version of Scripted Foils prepared 23 August 1998

Foil 58 Agglomeration

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Once tasks + communication determined, "agglomerate" small tasks into larger tasks
Motivations
  • To reduce communication costs
  • If tasks cannot execute concurrently
  • To reduce software engineering costs
Caveats
  • May involve replicating computation or data

HTML version of Scripted Foils prepared 23 August 1998

Foil 59 Mapping

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Place tasks on processors, to
  • Maximize concurrency } note potential
  • Minimize communication } conflict
Techniques:
  • Regular problems: agglomerate to P tasks
  • Irregular problems: use static load balancing
  • If irregular in time: dynamic load balancing

HTML version of Scripted Foils prepared 23 August 1998

Foil 60 What is Parallel Architecture?

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
A parallel computer is any old collection of processing elements that cooperate to solve large problems fast
  • from a pile of PC's to a shared memory multiprocessor
Some broad issues:
  • Resource Allocation:
    • how large a collection?
    • how powerful are the elements?
    • how much memory?
  • Data access, Communication and Synchronization
    • how do the elements cooperate and communicate?
    • how are data transmitted between processors?
    • what are the abstractions and primitives for cooperation?
  • Performance and Scalability
    • how does it all translate into performance?
    • how does it scale?

HTML version of Scripted Foils prepared 23 August 1998

Foil 61 Why Study Parallel Architecture as a computer scientist?

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Role of a computer architect:
  • To design and engineer the various levels of a computer system to maximize performance and programmability within limits of technology and cost.
Parallelism:
  • Provides alternative to faster clock for performance
  • Applies at all levels of system design
  • Is a fascinating perspective from which to view architecture
  • Is increasingly central in information processing

HTML version of Scripted Foils prepared 23 August 1998

Foil 62 Why Study Architecture Today?

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
History: diverse and innovative organizational structures, often tied to novel programming models
Rapidly maturing under strong technological constraints
  • The "killer microprocessor" is ubiquitous
  • Laptops and supercomputers are fundamentally similar!
  • Technological trends cause diverse approaches to converge
Technological trends make parallel computing inevitable
  • In the mainstream
Need to understand fundamental principles and design tradeoffs, not just taxonomies
  • According to Culler fundamental are: Naming, Ordering, Replication, Communication performance
  • According to Fox: match computer hardware software and problem

HTML version of Scripted Foils prepared 23 August 1998

Foil 63 Inevitability of Parallel Computing

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Application demands: Our insatiable need for computing cycles
  • Scientific computing: CFD, Biology, Chemistry, Physics, ...
  • General-purpose computing: Video, Graphics, CAD, Databases, TP...
Technology Trends
  • Number of transistors on chip growing rapidly
  • Clock rates expected to go up only slowly
Architecture Trends
  • Instruction-level parallelism(ILP) valuable but limited
  • Coarser-level parallelism, as in Multiprocessors, the most viable approach
Economics
Current trends:
  • Today's microprocessors have multiprocessor(MP) support
  • Servers and workstations becoming MP: Sun, SGI, DEC, COMPAQ!...
  • Tomorrow's microprocessors are multiprocessors

HTML version of Scripted Foils prepared 23 August 1998

Foil 64 Application Trends

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Demand for cycles fuels advances in hardware, and vice-versa
  • Cycle drives exponential increase in microprocessor performance (this is largely from low-end)
  • Drives parallel architecture harder: most demanding applications (this is high end driver)
Range of performance demands
  • Need range of system performance with progressively increasing cost
  • Platform pyramid
Goal of applications in using parallel machines: Speedup
  • Speedup (p processors) =
For a fixed problem size (input data set), performance = 1/time
  • Speedup fixed problem (p processors) =

HTML version of Scripted Foils prepared 23 August 1998

Foil 65 TPC-C (database transaction processing)

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Results from March 96
Parallelism is pervasive
Small to moderate scale parallelism very important

HTML version of Scripted Foils prepared 23 August 1998

Foil 66 Summary of Application Trends

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Transition to parallel computing has occurred for scientific and engineering computing but this is 1-2% of computer market
In rapid progress in commercial computing
  • Database and transactions as well as financial modeling/oil reservoir simulation
  • Web servers including multi-media and search growing importance
  • Typically functional or pleasingly parallel
  • Usually smaller-scale, but large-scale systems also used
Desktop also uses multithreaded programs, which are a lot like parallel programs (again functional not data parallel)
Demand for improving throughput on sequential workloads
  • Greatest use of small-scale multiprocessors
Solid application demand exists and will increase

HTML version of Scripted Foils prepared 23 August 1998

Foil 67 Technology Trends -- CPU's

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
The natural building block for multiprocessors is now also about the fastest!

HTML version of Scripted Foils prepared 23 August 1998

Foil 68 General Technology Trends

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Microprocessor performance increases 50% - 100% per year
Transistor count doubles every 3 years
DRAM size quadruples every 3 years
Huge investment per generation is carried by huge commodity market
Note that single-processor performance is plateauing, but that parallelism is a natural way to improve it.
Sequential or small multiprocessors -- today 2 processor PC's

HTML version of Scripted Foils prepared 23 August 1998

Foil 69 Technology: A Closer Look

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Basic advance is decreasing feature size ( ??)
  • Circuits become either faster or lower in power
Die size is growing too
  • Clock rate improves roughly proportional to improvement in ?
  • Number of transistors improves like ????(or faster)
Performance > 100x per decade; clock rate 10x, rest of increase is due to transistor count
How to use more transistors?
  • Parallelism in processing
    • multiple operations per cycle reduces CPI
  • Locality in data access
    • avoids latency and reduces CPI
    • also improves processor utilization
  • Both need resources, so tradeoff
Fundamental issue is resource distribution, as in uniprocessors
CPI = Clock Cycles per Instruction

HTML version of Scripted Foils prepared 23 August 1998

Foil 70 Clock Frequency Growth Rate

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
30% per year

HTML version of Scripted Foils prepared 23 August 1998

Foil 71 Transistor Count Growth Rate

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
100 million transistors on chip by early 2000's A.D.
Transistor count grows much faster than clock rate
- 40% per year, order of magnitude more contribution in 2 decades

HTML version of Scripted Foils prepared 23 August 1998

Foil 72 Similar Story for Storage

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Divergence between memory capacity and speed more pronounced
  • Capacity increased by 1000x from 1980-95, speed only 2x
  • Gigabit DRAM by c. 2000, but gap with processor speed much greater
Larger memories are slower, while processors get faster
  • Need to transfer more data in parallel
  • Need deeper cache hierarchies
  • How to organize caches?
Parallelism increases effective size of each level of hierarchy, without increasing access time
Parallelism and locality within memory systems too
  • New designs fetch many bits within memory chip; follow with fast pipelined transfer across narrower interface
  • Buffer caches most recently accessed data
Disks too: Parallel disks plus caching

HTML version of Scripted Foils prepared 23 August 1998

Foil 73 Parallel Processing and Society

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
The fundamental principles behind the use of concurrent computers are identical to those used in society - in fact they are partly why society exists.
If a problem is too large for one person, one does not hire a SUPERman, but rather puts together a team of ordinary people...
cf. Construction of Hadrians Wall

HTML version of Scripted Foils prepared 23 August 1998

Foil 74 Concurrent Construction of a Wall
Using N = 8 Bricklayers
Decomposition by Vertical Sections

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Domain Decomposition is Key to Parallelism
Need "Large" Subdomains l >> l overlap

HTML version of Scripted Foils prepared 23 August 1998

Foil 75 Quantitative Speed-Up Analysis for Construction of Hadrian's Wall

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index

HTML version of Scripted Foils prepared 23 August 1998

Foil 76 Amdahl's law for Real World Parallel Processing

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
AMDAHL"s LAW or
Too many cooks spoil the broth
Says that
Speedup S is small if efficiency e small
or for Hadrian's wall
equivalently S is small if length l small
But this is irrelevant as we do not need parallel processing unless problem big!

HTML version of Scripted Foils prepared 23 August 1998

Foil 77 Pipelining --Another Parallel Processing Strategy for Hadrian's Wall

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
"Pipelining" or decomposition by horizontal section is:
  • In general less effective
  • and leads to less parallelism
  • (N = Number of bricklayers must be < number of layers of bricks)

HTML version of Scripted Foils prepared 23 August 1998

Foil 78 Hadrian's Wall Illustrates that the Topology of Processor Must Include Topology of Problem

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Hadrian's Wall is one dimensional
Humans represent a flexible processor node that can be arranged in different ways for different problems
The lesson for computing is:
Original MIMD machines used a hypercube topology. The hypercube includes several topologies including all meshes. It is a flexible concurrent computer that can tackle a broad range of problems. Current machines use different interconnect structure from hypercube but preserve this capability.

HTML version of Scripted Foils prepared 23 August 1998

Foil 79 General Speed Up Analysis

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Comparing Computer and Hadrian's Wall Cases

HTML version of Scripted Foils prepared 23 August 1998

Foil 80 Nature's Concurrent Computers

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
At the finest resolution, collection of neurons sending and receiving messages by axons and dendrites
At a coarser resolution
Society is a collection of brains sending and receiving messages by sight and sound
Ant Hill is a collection of ants (smaller brains) sending and receiving messages by chemical signals
Lesson: All Nature's Computers Use Message Passing
With several different Architectures

HTML version of Scripted Foils prepared 23 August 1998

Foil 81 Comparison of Concurrent Processing in Society and Computing

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Problems are large - use domain decomposition Overheads are edge effects
Topology of processor matches that of domain - processor with rich flexible node/topology matches most domains
Regular homogeneous problems easiest but
irregular or
Inhomogeneous
Can use local and global parallelism
Can handle concurrent calculation and I/O
Nature always uses message passing as in parallel computers (at lowest level)

HTML version of Scripted Foils prepared 23 August 1998

Foil 82 Comparison of The Complete Problem to the subproblems formed in domain decomposition

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
The case of Programming a Hypercube
Each node runs software that is similar to sequential code
e.g., FORTRAN with geometry and boundary value sections changed

HTML version of Scripted Foils prepared 23 August 1998

Foil 83 Hadrian's Wall Illustrating an
Irregular but Homogeneous Problem

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Geometry irregular but each brick takes about the same amount of time to lay.
Decomposition of wall for an irregular geometry involves equalizing number of bricks per mason, not length of wall per mason.

HTML version of Scripted Foils prepared 23 August 1998

Foil 84 Some Problems are Inhomogeneous Illustrated by:
An Inhomogeneous Hadrian Wall with Decoration

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Fundamental entities (bricks, gargoyles) are of different complexity
Best decomposition dynamic
Inhomogeneous problems run on concurrent computers but require dynamic assignment of work to nodes and strategies to optimize this
(we use neural networks, simulated annealing, spectral bisection etc.)

HTML version of Scripted Foils prepared 23 August 1998

Foil 85 Global and Local Parallelism Illustrated by Hadrian's Wall

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Global Parallelism
  • Break up domain
  • Amount of Parallelism proportional to size of problem (and is usually large)
  • Unit is Bricklayer or Computer node
Local Parallelism
  • Do in parallel local operations in the processing of basic entities
    • e.g. for Hadrian's problem, use two hands, one for brick and one for mortar while ...
    • for computer case, do addition at same time as multiplication
  • Local Parallelism is limited but useful
Local and Global Parallelism
Should both be Exploited

HTML version of Scripted Foils prepared 23 August 1998

Foil 86 Parallel I/O Illustrated by
Concurrent Brick Delivery for Hadrian's Wall
Bandwidth of Trucks and Roads
Matches that of Masons

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Disk (input/output) Technology is better matched to several modest power processors than to a single sequential supercomputer
Concurrent Computers natural in databases, transaction analysis

HTML version of Scripted Foils prepared 23 August 1998

Foil 87 Example: Atmosphere Model

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Simulate atmospheric processes
  • Conservation of momentum, mass, energy
  • Ideal gas law, hydrostatic approximation
Represent atmosphere state by 3-D grid
  • Periodic in two horizontal dimensions
  • Nx.Ny.Nz: e.g., Ny=50-500, Nx=2Ny, Nz=15-30
Computation includes
  • Atmospheric dynamics: finite difference
  • "Physics" (radiation etc.) in vertical only

HTML version of Scripted Foils prepared 23 August 1998

Foil 88 Atmosphere Model: Numerical Methods

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Discretize the (continuous) domain by a regular Nx??Ny ??Nz grid
  • Store p, u, v, T, ? at every grid point
Approximate derivatives by finite differences
  • Leads to stencils in vertical and horizontal

HTML version of Scripted Foils prepared 23 August 1998

Foil 89 Atmosphere Model: Partition

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Use domain decomposition
  • Because model operates on large, regular grid
  • Can decompose in 1, 2, or 3 dimensions
  • 3-D decomposition offers greatest flexibility

HTML version of Scripted Foils prepared 23 August 1998

Foil 90 Atmosphere Model: Communication

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Finite difference stencil horizontally
  • Local, regular, structured
Radiation calculations vertically
  • Global, regular, structured
Diagnostic sums
  • Global, regular, structured

HTML version of Scripted Foils prepared 23 August 1998

Foil 91 Atmosphere Model: Agglomeration

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
In horizontal
  • Clump so that 4 points per task
  • Efficiency: communicate with 4 neighbors only
In vertical, clump all points in column
  • Performance: avoid communication
  • Modularity: Reuse physics modules
Resulting algorithm "reasonably scalable"
  • (Nx.Ny)/4 : at least 1250 tasks

HTML version of Scripted Foils prepared 23 August 1998

Foil 92 Atmosphere Model: Mapping

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Technique depends on load distribution
1) Agglomerate to one task per processor
  • Appropriate if little load imbalance
2) Extend (1) to incorporate cyclic mapping
  • Works well for diurnal cycle imbalance
3) Use dynamic, local load balancing
  • Works well for unpredictable, local imbalances

HTML version of Scripted Foils prepared 23 August 1998

Foil 93 The HPCC Dilemma and its Solution

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
HPCC has developed good research ideas but cannot implement them as solving computing's hardest problem with 1 percent of the funding
  • HPCC applications are very complex and use essentially all computer capabilities and also have synchronization and performance constraints from HPCC
We have learnt to use commodity hardware either
  • partially as in Origin 2000/SP2 with consumer CPU's but custom network or
  • fully as in PC cluster with fast ethernet/ATM
Let us do the same with software and design systems with maximum possible commodity software basis

HTML version of Scripted Foils prepared 23 August 1998

Foil 94 What is Commodity Software

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
The world is building a wonderful distributed computing (information processing) environment using Web (dissemination) and distributed object (CORBA COM) technologies
This includes Java, Web-linked databases and the essential standards such as HTML(documents), VRML(3D objects), JDBC (Java database connectivity).
  • The standard interfaces are essential in that they allow modular (component based) software
We will "just" add high performance to this commodity distributed infrastructure
  • Respecting architecture of the object web, should allow us to naturally use improved software as it produced
The alternative strategy starts with HPCC technologies (such as MPI,HPF) and adds links to commodity world. This approach does not easily track evolution of commodity systems and so has large maintenance costs

HTML version of Scripted Foils prepared 23 August 1998

Foil 95 The Computing Pyramid

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Bottom of Pyramid has 100 times dollar value and 1000 times compute power of best supercomputer

HTML version of Scripted Foils prepared 23 August 1998

Foil 96 Implications of the Computing Pyramid

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Web Software MUST be cheaper and better than MPP software as factor of 100 more money invested!
Therefore natural strategy is to get parallel computing environment by adding synchronization of parallel algorithms to loosely coupled Web distributed computing model

HTML version of Scripted Foils prepared 23 August 1998

Foil 97 The 3 Roles of Java

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index

HTML version of Scripted Foils prepared 23 August 1998

Foil 98 Why is Java Worth Looking at?

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
The Java Language has several good design features
  • secure, safe (wrt bugs), object-oriented, familiar (to C C++ and even Fortran programmers)
Java has a very good set of libraries covering everything from commerce, multimedia, images to math functions (under development at http://math.nist.gov/javanumerics)
Java has best available electronic and paper training and support resources
Java is rapidly getting best integrated program development environments
Java naturally integrated with network and universal machine supports powerful "write once-run anywhere" model
There is a large and growing trained labor force
Can we exploit this in computational science?

HTML version of Scripted Foils prepared 23 August 1998

Foil 99 What is Java Grande?

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Use of Java for:
High Performance Network Computing
Scientific and Engineering Computation
(Distributed) Modeling and Simulation
Parallel and Distributed Computing
Data Intensive Computing
Communication and Computing Intensive Commercial and Academic Applications
HPCC Computational Grids ........
Very difficult to find a "conventional name" that doesn't get misunderstood by some community!

HTML version of Scripted Foils prepared 23 August 1998

Foil 100 Java and Parallelism?

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
The Web integration of Java gives it excellent "network" classes and support for message passing.
Thus "Java plus message passing" form of parallel computing is actually somewhat easier than in Fortran or C.
Coarse grain parallelism very natural in Java
"Data Parallel" languages features are NOT in Java and have to be added (as a translator) of NPAC's HPJava to Java+Messaging just as HPF translates to Fortran plus message passing
Java has built in "threads" and a given Java Program can run multiple threads at a time
  • In Web use, allows one to process Image in one thread, HTML page in another etc.
Can be used to do more general parallel computing but only on shared memory computers
  • JavaVM (standard Java Runtime) does not support distributed memory systems

HTML version of Scripted Foils prepared 23 August 1998

Foil 101 "Pure" Java Model For Parallelism

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
Combine threads on a shared memory machine with message passing between distinct distributed memories
"Distributed" or "Virtual" Shared memory does support the JavaVM as hardware gives illusion of shared memory to JavaVM
Message Passing
Message Passing

HTML version of Scripted Foils prepared 23 August 1998

Foil 102 Java for Scientific Computing Resource

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
See Original Foil

HTML version of Scripted Foils prepared 23 August 1998

Foil 103 Pragmatic Computational Science August 1998

From CPS615-Introduction-Course,Driving Technology and HPCC Current Status and Futures CPS615 Basic Simulation Track for Computational Science -- Fall Semester 98. *
Full HTML Index
So here is a recipe for developing HPCC (parallel) applications as of August 1998
Use MPI for data parallel programs as alternatives are HPF, HPC++ or parallelizing compilers today
  • Neither HPF or HPC++ has clear long term future for implementations -- ideas are sound
  • MPI will run on PC clusters as well as customized parallel machines -- parallelizing compilers will not work on distributed memory machines
Today Fortran and C are good production languages
Today Java can be used for client side applets and in systems middleware but too slow for production scientific code
  • This should change over next year with better Java compilers -- including "native" compilers which do not translate to Java Virtual Machine but go directly to native machine language
Use metacomputers for pleasingly parallel and metaproblems -- not for closely knit problems with tight synchronization between parts
Use where possible web and distributed object technology for "coordination"
pleasingly parallel problems can use MPI or web/metacomputing technology

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu

If you have any comments about this server, send e-mail to webmaster@npac.syr.edu.

Page produced by wwwfoil on Sat Aug 29 1998