Given by Geoffrey C. Fox at the CPS615 Computational Science class in the Spring Semester of 2000. Foils prepared 20 February 00
Spring Semester 2000 |
Geoffrey Fox |
Northeast Parallel Architectures Center |
Syracuse University |
111 College Place |
Syracuse NY |
gcf@npac.syr.edu |
gcf@cs.fsu.edu |
Course Logistics |
Course Overview arranged around 3 Exemplar applications with simple and complex versions |
Status of High Performance Computing and Computation HPCC nationally |
Application Driving Forces |
Computational Science and Information Technology |
Technology and Commodity Driving Forces |
Instructor: Geoffrey Fox -- gcf@npac.syr.edu, 3154432163 and Room 3-131 CST |
Backup: Nancy McCracken -- njm@npac.syr.edu, 3154434687 and Room 3-234 CST |
Grader: Qiang Zheng -- zhengq@npac.syr.edu, 314439209 |
NPAC administrative support: Nicole Caza -- ncaza@npac.syr.edu, 3154431722 and room 3-206 CST |
The class alias cps615spring00@npac.syr.edu |
Powers that be alias cps615adm@npac.syr.edu |
Home Page is: http://www.npac.syr.edu/projects/cps615spring00 |
Homework etc.: http://class-server.npac.syr.edu:3768 |
Graded on the basis of approximately 8 homework sets, which will be due the Thursday of the week following the day (Monday or Wednesday) on which they are given out |
There will be one project -- which will start after message passing (MPI) is discussed |
Total grade is 70% homework, 30% project |
Languages will be Fortran or C and Java -- we will do a survey early on to clarify this |
All homework will be handled through the web and indeed all computer access will be through the VPL or Virtual Programming Laboratory which gives access to compilers, Java visualization etc. through the web |
Peter S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann, 1997 |
Gropp, Lusk and Skjellum, Using MPI, Second Edition, MIT Press, 1999 |
Please register for CPS615 at the student course records database at |
Status of High Performance Computing and Computation HPCC nationally |
Application driving forces |
What is Computational Science Nationally and how does it relate to Information Technology |
Technology driving forces |
Basic Principles of High Performance Systems |
Overview of Sequential and Parallel Computer Architectures |
Elementary discussion of Parallel Computing in Society and why it must obviously work for computers! |
What Features of Applications matter |
Issues of Scaling |
What sort of software exists and Programming Paradigms |
Three Exemplars: Partial Differential Equations (PDE), Particle Dynamics, Matrix Problems |
Simple base version of first Example -- Laplace's Equation |
General Discussion of Programming Models -- SPMD and Message Passing (MPI) with Fortran, C and Java (a small illustrative MPI sketch follows this outline) |
Data Parallel (HPF, Fortran90) languages will be discussed later but NOT used |
Visualization is important but will not be discussed |
Return to First Example: Computational Fluid Dynamics and other PDE based Applications |
Return to Parallel Architectures -- Real Systems in more detail, Trends, Petaflops |
Second Exemplar: Particle Dynamics |
Third Exemplar: Matrix Problems |
Advanced Software Systems |
Application Wrap-up: Integration of Ideas and the Future |
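To make the SPMD message-passing model in the outline above concrete, here is a minimal illustrative sketch in C -- it is not the course's assignment code; the grid size, sweep count and 1-D row decomposition are assumptions chosen for brevity. Every MPI process runs the same program on its own block of rows of the Laplace grid and exchanges ghost rows with its neighbours each Jacobi sweep |

/* Minimal SPMD sketch: Jacobi iteration for Laplace's equation on an N x N grid,
 * rows split across MPI processes, ghost rows exchanged each sweep.
 * Illustrative only: N and NSTEPS are arbitrary; assumes size divides N. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N      64      /* global grid is N x N; boundary fixed at 0 except top row = 1 */
#define NSTEPS 200     /* number of Jacobi sweeps (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = N / size;                    /* rows owned by this process */
    /* local+2 rows: row 0 and row local+1 hold ghost copies of neighbours' edge rows */
    double (*u)[N]    = malloc((local + 2) * sizeof *u);
    double (*unew)[N] = malloc((local + 2) * sizeof *unew);

    for (int i = 0; i < local + 2; i++)
        for (int j = 0; j < N; j++)
            u[i][j] = unew[i][j] = 0.0;
    if (rank == 0)                           /* top physical boundary held at 1.0 */
        for (int j = 0; j < N; j++) u[1][j] = 1.0;

    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < NSTEPS; step++) {
        /* exchange ghost rows with neighbours (every process runs the same code: SPMD) */
        MPI_Sendrecv(u[1],         N, MPI_DOUBLE, up,   0,
                     u[local + 1], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(u[local],     N, MPI_DOUBLE, down, 1,
                     u[0],         N, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Jacobi update on interior points; physical top/bottom boundary rows stay fixed */
        int ilo = (rank == 0)        ? 2 : 1;
        int ihi = (rank == size - 1) ? local - 1 : local;
        for (int i = ilo; i <= ihi; i++)
            for (int j = 1; j < N - 1; j++)
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
        for (int i = ilo; i <= ihi; i++)
            for (int j = 1; j < N - 1; j++)
                u[i][j] = unew[i][j];
    }

    if (rank == 0) printf("done: %d sweeps on %d processes\n", NSTEPS, size);
    free(u); free(unew);
    MPI_Finalize();
    return 0;
}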
David Bailey and Bob Lucas CS267 Applications of Parallel Computers |
Jim Demmel's Parallel Applications Course: http://www.cs.berkeley.edu/~demmel/cs267_Spr99/ |
Dave Culler's Parallel Architecture course: http://www.cs.berkeley.edu/~culler/cs258-s99/ |
David Culler and Horst Simon 1997 Parallel Applications: http://now.CS.Berkeley.edu/cs267/ |
Jack Dongarra High Performance Computing: http://www.cs.utk.edu/~dongarra/WEB-PAGES/index.html |
Michael Heath Parallel Numerical Algorithms: http://www.cse.uiuc.edu/cse412/index.html |
Willi Schonauer book (hand written): http://www.uni-karlsruhe.de/Uni/RZ/Personen/rz03/book/index.html |
Parallel computing at CMU: http://www.cs.cmu.edu/~scandal/research/parallel.html |
HPCC is a maturing field with many organizations installing large scale systems |
These include NSF (academic computations) with the PACI activity, DoE (Dept of Energy) with ASCI, and DoD (Defense) with its Modernization program |
There are new applications with new algorithmic challenges |
These have many new issues, but typically the issue is integrating new systems and not some new mathematical idea |
One counterexample (SIAM News, December 99) is Akamai Technologies, founded by MIT Computer Science theorist Tom Leighton and others |
Hardware trends reflect both commodity technology and commodity systems |
Note Sun systems are "pure" shared memory and Inktomi systems are "pure" distributed memory, showing architectural focus in two distinct areas, with distributed memory mainly supporting specialized services |
Software ideas are "sound" but not very easy to use, as so far nobody has found a precise way of expressing parallelism in a way that combines: |
Integration of Simulation and Data is of growing importance |
Problem Solving Environments help bring all components of a problem area into a single interface and |
5 Exemplars |
Large Scale Simulations in Engineering |
Large Scale Academic Simulations (Physics, Chemistry, Biology) |
"Other Critical Real World Applications" |
From Jim Demmel we need to define: |
Kobe 1995 Earthquake caused $200 Billion in damage and was quite unexpected -- the big one(s) in California are expected to be worse |
Field involves integration of simulation (of earth dynamics) with sensor data (e.g. new GPS satellite measurements of strains, http://www.scign.org) and with information gotten from pick and shovel at the fault line |
Northridge Quake |
Technologies include data-mining (is dog barking really correlated with earthquakes?) as well as PDE solvers, where both finite element and fast multipole methods (for Green's function problems) are important |
Multidisciplinary links of ground motion to building response simulations |
Applications include real-time estimates of after-shocks used by scientists and perhaps crisis management groups |
http://www.npac.syr.edu/projects/gem |
Note Web Search, like transaction analysis, has "obvious" parallelization (over both users and data) with modest synchronization issues |
Critical issues are: fault-tolerance (.9999 to .99999 reliability); bounded search time (a fraction of a second); scalability (to the world); fast system upgrade times (days) |
TPC-C Benchmark Results from March 96 |
Parallelism is pervasive (more natural in SQL than Fortran) |
Small to moderate scale parallelism very important |
As with all physical simulations, realistic 3D computations require "Teraflop" (10^12 operations per second) performance |
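A rough back-of-the-envelope calculation makes the Teraflop claim concrete; the mesh size, operations per point and step count below are illustrative assumptions, not figures from the course |

/* Rough illustrative estimate (assumed numbers) of why realistic 3D simulations
 * push toward Teraflop (10^12 operations per second) machines. */
#include <stdio.h>

int main(void)
{
    double points_per_dim = 1.0e3;   /* assume a 1000^3 mesh                        */
    double grid_points    = points_per_dim * points_per_dim * points_per_dim;  /* 1e9 */
    double flops_per_pt   = 1.0e3;   /* assume ~1000 operations per point per step  */
    double steps          = 1.0e4;   /* assume 10,000 time steps                    */
    double total_flops    = grid_points * flops_per_pt * steps;   /* ~1e16 operations */

    double teraflop = 1.0e12;        /* 10^12 operations per second                 */
    printf("total operations ~ %.1e\n", total_flops);
    printf("time at 1 Teraflop ~ %.1f hours\n", total_flops / teraflop / 3600.0);
    printf("time at 1 Gigaflop ~ %.0f days\n",  total_flops / 1.0e9 / 86400.0);
    return 0;
}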
Numerical Relativity just solves the "trivial" Einstein equations Gμν = 8πTμν with indices running over 4 dimensions |
Apply to collision of two black holes which are expected to be a major source of gravitational waves for which US and Europe are building major detectors |
Unique features include the freedom to choose coordinate systems (gauge freedom) in ways that change the nature of the equations |
A Black Hole has the amazing boundary condition that no information can escape from it |
At infinity, one has a "simple" (but numerically difficult) wave equation; near the black hole one finds a very nonlinear system |
Fortran90 (array syntax) very attractive to handle equations which are naturally written in Tensor (multi-dimensional) form |
12 independent field values defined on a mesh with black holes excised -- non trivial dynamic irregularity as holes rotate and spiral into each other in interesting domain |
Irregular dynamic mesh is not so natural in (parallel) Fortran 90 and one needs technology (including distributed data structures like DAGH) to support adaptive finite difference codes. |
Separate holes are simulated until merger (figure: snapshots at times 0.7, 6.9 and 13.2) |
SMOP is Space Mission Operations Portal or Portal to Space Internet |
Ground stations are placed internationally (linked to the "Space Internet") and so networking has international scope |
Real-time control for monitoring |
Dataflow computing model (Khoros) for customized filtering of data |
Direct connection to hand-held devices for mission or processing status changes |
Illustrates integration of special needs (Space) with more general infrastructure and the integration of computing with data |
Current NASA Contract (CSOC) for this is $3.5 Billion over 10 years for a team led by Lockheed Martin |
(diagram: Satellite + Sensor(s), Relay Station, alternative (Remote) Ground Stations, Compute Engines (Filter, Monitor, Plan ..), an XML Computing Portals Interface and an XML Grid Forum Interface) |
There is a dynamic interplay between application needing more hardware and hardware allowing new/more applications |
Transition to parallel computing has occurred for scientific and engineering computing but this is 1-2% of the computer market |
Rapid progress in commercial computing |
SMOP illustrates a Computing Portal or PSE |
From John Rice at http://www.cs.purdue.edu/research/cse/pses |
"A PSE is a computer system that provides all the computational facilities needed to solve a target class of problems. |
These features include advanced solution methods, automatic and semiautomatic selection of solution methods, and ways to easily incorporate novel solution methods. |
Moreover, PSEs use the language of the target class of problems, so users can run them without specialized knowledge of the underlying computer hardware or software. |
By exploiting modern technologies such as interactive color graphics, powerful processors, and networks of specialized services, PSEs can track extended problem solving tasks and allow users to review them easily. |
Overall, PSEs create a framework that is all things to all people: they solve simple or complex problems, support rapid prototyping or detailed analysis, and can be used in introductory education or at the frontiers of science." |
PSEs can be traced to the 1963 proposal of Culler and Fried for an "Online Computer Center for Scientific Problems". |
Important examples of PSEs are: |
The current set of PSEs is built around the observation that Yahoo is a "PSE for the World's Information" |
Improved computer performance gives one the opportunity to integrate together multiple capabilities |
Thus Object Web technology allows one to implement a PSE as a Web Portal to Computational Science |
Other essentially equivalent terms to PSEs are Scientific Workbenches or Toolkits. |
http://www.npac.syr.edu/users/gcf/internetics2/ http://www.npac.syr.edu/users/gcf/internetics/ |
Together cover practical use of leading edge computer science technologies to address "real" applications |
Two tracks at Syracuse |
CPS615/713: Simulation Track |
CPS606/616/640/714: Information Track |
(diagram: Large Scale Parallel Computing with low latency, closely coupled components on one side and World Wide Distributed Computing with loosely coupled components on the other; the technologies spread between them include Parallel Algorithms, Performance, Visualization, Fortran90, HPF, MPI, Interconnection Networks, (Parallel) I/O, Transactions, Security, Compression, PERL, JavaScript, Multimedia, e-commerce, Wide Area Networks, Java, VRML, Collaboration, Integration (middleware), Metacomputing / PSE's, CORBA and Databases) |
(diagram: the two forms of Large Scale Computing -- the Parallel Computer (scale computer for power user; HPF, MPI, PDE's, HPJava) and the Distributed Computer (scale users in proportion to number of computers; HTML, CORBA) -- with Internetics Technologies spanning from Parallel Computing to Commodity Distributed Information Systems Technology) |
There can be no doubt that topics in Computational Science are useful and those in CSIT (Computational Science and Information Technology) are even more useful |
The CSIT technologies are also difficult and involve fundamental ideas |
The area is of interest to both those in computer science and those in application fields |
Probably most jobs going to Computer Science graduates really need CSIT education but unfortunately |
So most current implementations make CSIT a set of courses within existing disciplines |
The commodity Stranglehold |
http://www.netlib.org/utk/people/JackDongarra/SLIDES/top500-11-99.htm |
Here are Top 10 for 1999 |
First, Tenth, 100th, 500th, SUM of all 500 versus Time |
First, Tenth, 100th, 500th, SUM of all 500 Projected in Time |
Earth Simulator from Japan |
http://geofem.tokyo.rist.or.jp/ |
From Jack Dongarra http://www.netlib.org/utk/people/JackDongarra |
Shared Memory (designed distributed memory) |
From Jack Dongarra http://www.netlib.org/utk/people/JackDongarra |
Bottom of Pyramid has 100 times dollar value and 1000 times compute power of best supercomputer |
The natural building block for multiprocessors is now also about the fastest! |
Microprocessor performance increases 50% - 100% per year |
Transistor count doubles every 3 years |
DRAM size quadruples every 3 years |
Huge $ investment per generation is carried by commodity PC market |
Note that "simple" single-processor performance is plateauing, but that parallelism is a natural way to improve it. But many different forms of parallelism "Data or Thread Parallel" or "Automatic Instruction Parallel" |
Linpack Performance |
Linear Equation Solver Benchmark |
From David Culler again |
Basic driver of performance advances is decreasing feature size (λ) |
Die size is growing too |
Performance increases > 100x per decade; clock rate improvement accounts for part of this, and the rest of the increase is due to transistor count |
Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect |
But of course you could change this natural "sweet spot" |
(diagram: the chip area divided among CPU, Cache and Interconnect, as in the 1/3 split above; clock rate grows about 30% per year) |
100 million transistors on chip by early 2000's A.D. |
Transistor count grows faster than clock rate -- 40% per year, an order of magnitude more contribution in 2 decades |
Divergence between memory capacity and speed more pronounced |
Larger memories are slower, while processors get faster |
Parallelism increases effective size of each level of hierarchy, without increasing access time |
Parallelism and locality within memory systems too |
Disks too: Parallel disks plus caching |
Data locality implies CPU finds information it needs in cache which stores most recently accessed information |
This means one reuses a given memory reference in many nearby computations e.g. |
A1 = B*C |
A2 = B*D + B*B |
.... Reuses B |
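A standard way to exploit this kind of reuse in practice is cache blocking; the sketch below (matrix size and block size are illustrative assumptions, not course code) keeps a small tile of B in cache and reuses it many times before fetching the next tile |

/* Sketch of cache blocking: each BS x BS tile of B is brought into cache once and
 * reused for a whole block of C updates instead of being re-fetched from main memory.
 * N and BS are illustrative; BS is chosen so a few tiles fit comfortably in cache. */
#include <stdlib.h>

#define N  512
#define BS 64

void matmul_blocked(const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* work on one tile: B[kk..kk+BS)[jj..jj+BS) stays in cache and is reused */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i * N + k];   /* A element reused across the j loop */
                        for (int j = jj; j < jj + BS; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}

int main(void)
{
    double *A = calloc(N * N, sizeof *A);
    double *B = calloc(N * N, sizeof *B);
    double *C = calloc(N * N, sizeof *C);
    matmul_blocked(A, B, C);
    free(A); free(B); free(C);
    return 0;
}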
(diagram: memory hierarchy continuing through L3 Cache, Main Memory and Disk -- increasing memory capacity and decreasing memory speed, with roughly a factor of 100 difference between processor and main memory speed) |
Parallelism in processing |
Cache to give locality in data access |
Both need (transistor) resources, so there is a tradeoff |
ILP (Instruction Level Parallelism) drove performance gains of sequential microprocessors |
ILP success was not expected by aficionados of parallel computing and this "delayed" the relevance of scaling "outer-loop" parallelism, as users just purchased faster "sequential machines" |
CPI = Clock Cycles per Instruction |
Hardware allowed many instructions per cycle using transistor budget for ILP parallelism |
Limited Speed up (average 2.75 below) and inefficient (50% or worse) |
However TOTALLY automatic (compiler generated) |
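Since the argument rests on the CPI relation, here is a tiny worked sketch (instruction count, clock rate and CPI values are purely illustrative assumptions): execution time = instructions x CPI / clock rate, so ILP speedup appears as a lower effective CPI |

/* Illustrative numbers only: execution time = instructions * CPI / clock rate,
 * so completing several instructions per cycle (ILP) lowers the effective CPI. */
#include <stdio.h>

int main(void)
{
    double instructions = 1.0e9;      /* assumed program size: 10^9 instructions            */
    double clock_hz     = 500.0e6;    /* assumed 500 MHz clock (circa 2000)                 */
    double cpi_serial   = 1.0;        /* one instruction completed per cycle                */
    double cpi_ilp      = 1.0 / 2.75; /* ~2.75 instructions per cycle, matching the average
                                         ILP speedup quoted above                           */

    double t_serial = instructions * cpi_serial / clock_hz;
    double t_ilp    = instructions * cpi_ilp    / clock_hz;

    printf("serial: %.2f s, with ILP: %.2f s, speedup %.2fx\n",
           t_serial, t_ilp, t_serial / t_ilp);
    return 0;
}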
Thread level parallelism is the on chip version of dominant scaling (data) parallelism |
Transistors are still getting cheaper and cheaper and it only takes some 0.5 million transistors to make a very high quality CPU |
This chip would have little ILP (or parallelism in "innermost loops") |
Thus the next generation of processor chips more or less have to have multiple CPU's, as the gain from ILP is limited |
However getting much more speedup than this requires use of "outer loop" or data parallelism |
The March of Parallelism: Multiple boards --> Multiple chips on a board --> Multiple CPU's on a chip |
Implies that "outer loop" Parallel Computing gets more and more important in dominant commodity market |
Use of "Outer Loop" parallelism can not (yet) be automated |
Actually memory bandwidth is an essential problem in any computer as doing more computations per second requires accessing more memory cells per second! |
Key limit is that memory gets slower as it gets larger and one tries to keep information as near to CPU as possible (in necessarily small size storage) |
This data locality is the unifying concept behind caches (on sequential machines) and the multiple memory accesses of parallel computers |
Problem seen in extreme case for Superconducting CPU's which can be 100X current CPU's but seem to need to use conventional memory |
From Jim Demmel -- remaining from David Culler |
This implies need for complex memory systems to hide memory latency |
(chart: "Moore's Law" -- processor (µProc) performance grows about 60%/yr while DRAM performance grows about 7%/yr over 1980-2000, plotted on a log scale from 1 to 1000, so the Processor-Memory Performance Gap grows about 50% / year) |
For both parallel and sequential computers, cost is accessing remote memories with some form of "communication" |
Data locality addresses this cost in both cases |
The differences are the quantitative size of the effect and what is done by the user versus what is done automatically |
(diagram: several nodes, each with an L3 Cache and Main Memory, joined by Board Level Interconnection Networks and a System Level Interconnection Network; accesses across these networks are Slow and Very Slow respectively) |