Given by Geoffrey C. Fox (with some material from Culler and Koelbel) at CPS615 Basic Simulation Track for Computational Science in Fall Semester 1998. Foils prepared 23 August 1998
Fall Semester 1998 |
Geoffrey Fox |
http://www.npac.syr.edu/projects/jsufall98 |
Northeast Parallel Architectures Center |
Syracuse University |
111 College Place |
Syracuse NY |
gcf@npac.syr.edu |
We Introduce Computational Science and Driving Forces
|
We give a simple overview of parallel architectures today with distributed, shared or distributed shared memory |
We describe data, functional and pleasingly parallel computation |
We describe principles of parallel programming using atmospheric simulation as an example |
We describe the growing importance of Java |
We explain pragmatic choices
|
There is a conventional website http://www.npac.syr.edu/projects/jsufall98 |
This will hold links to all presentations given in the class and to other resources at both NPAC and elsewhere |
This can be used to review material before and after classes -- in the lingo, it is an asynchronous learning resource |
We teach the class from Syracuse twice a week for 70 minutes, at 5:30pm Syracuse time (4:30pm Jackson time). |
This teaching uses Tango to deliver electronic material remotely |
Tango supports, inter alia, chat rooms, whiteboards, and the sharing of web pages -- the same web pages available at the conventional website above. This is synchronous learning, but it uses the same material you would use offline. |
Practical use of leading edge computer science technologies to address "real" applications |
Two tracks at Syracuse |
CPS615/713: Simulation Track |
CPS606/616/640/714: Information Track |
Simulation Track: Large Scale Parallel Computing -- low latency, closely coupled components; topics include Parallel Algorithms, Fortran90, HPF, MPI, Interconnection Networks |
Information Track: World Wide Distributed Computing -- loosely coupled components; topics include Transactions, Security, Compression, PERL, JavaScript, Multimedia, Wide Area Networks, Java, VRML, Collaboration, Integration (middleware), Metacomputing, CORBA, Databases |
The two forms of Large Scale Computing: scale the computer for a single power user (parallel computing), or scale the number of users in proportion to the number of computers (distributed computing) |
Internetics technologies span the range from the parallel computer (HPF, MPI, HPJava) to commodity distributed computers and information systems (HTML, VRML) |
Instructor: Geoffrey Fox -- gcf@npac.syr.edu, 315-443-2163, Room 3-131 CST |
Backup: Nancy McCracken -- njm@npac.syr.edu, 315-443-4687, Room 3-234 CST |
Grader: Saleh Elmohamed -- saleh@npac.syr.edu, 315-443-1073 |
NPAC administrative support: Nicole Caza -- ncaza@npac.syr.edu, 315-443-1722, Room 3-206 CST |
The above can be reached at cps615ad@npac.syr.edu |
Students will be on alias cps615@npac.syr.edu |
Homepage is: http://www.npac.syr.edu/projects/jsufall98 |
Graded on the basis of approximately 8 homework sets, each due the Thursday of the week after the day (a Tuesday or Thursday) on which it is given out |
There will be one project -- which will start after message passing (MPI) discussed |
Total grade is 70% homework, 30% project |
Languages will be Fortran or C, plus Java -- we will do a survey early on to clarify this |
All homework will be handled through the web and indeed all computer access will be through the VPL or Virtual Programming Laboratory which gives access to compilers, Java visualization etc. through the web |
Status of High Performance Computing and Communications (HPCC) nationally |
What is Computational Science Nationally (and at Syracuse) |
Technology driving forces |
Moore's law and exponentially increasing transistors |
Elementary discussion of Parallel Computing in Society and why it must obviously work in simulation! |
Sequential and Parallel Computer Architectures |
Comparison of Architecture of World Wide Web and Parallel Systems |
Simple base Example -- Laplace's Equation
|
Then we discuss software and algorithms (mini-applications) intermixed |
Programming Models -- SPMD and Message Passing (MPI) with Fortran, C and Java |
Data Parallel (HPF, Fortran90) languages will be discussed but NOT used |
Visualization -- Java Applets |
Other tools such as Collaboration , Performance Measurement will be mentioned |
This introduction is followed by a set of "vignettes" discussing problem classes which illustrate parallel programming and parallel algorithms |
Ordinary Differential Equations
|
Numerical Integration including adaptive methods |
Floating Point Arithmetic |
Monte Carlo Methods including Random Numbers |
Full Matrix Algebra as in
|
Partial Differential Equations implemented as sparse matrix problems (as in Computational Fluid Dynamics)
|
Parallel computing is multiple processes running on multiple computers, coordinated to solve a single problem! |
Distributed Computing can be defined in same way |
Parallel computing has tight coordination (implying low latency and high bandwidth communication) |
Distributed computing has looser constraints between component processes |
Continuing demands for higher performance |
Physical limits on single processor performance |
Increasing number of transistors in a single chip or on a single board implies parallel architectures |
High costs of internal concurrency |
Result is rise of multiprocessor architectures
|
Increasing importance of DISTRIBUTED computing through the Web and Internet
|
Transistors are getting cheaper and cheaper and it only takes some 0.5 million transistors to make a very high quality CPU
|
Already we build chips with roughly a factor of ten more transistors than this, and the extra transistors are used for "automatic" instruction-level parallelism.
|
However getting much more speedup than this requires use of "outer loop" or data parallelism. |
Actually memory bandwidth is an essential problem in any computer as doing more computations per second requires accessing more memory cells per second!
|
Large Scale Simulations in Engineering
|
Large Scale Academic Simulations (Physics, Chemistry, Biology)
|
"Other Critical Real World Applications"
|
For "Conventional" MPP/Distributed Shared Memory Architectures |
Now (1996), peak is 0.1 to 0.2 Teraflops in production centers
|
In 1999, one will see production 1 Teraflop systems |
In 2003, one will see production 10 Teraflop Systems |
In 2007, one will see production 50-100 Teraflop Systems |
Memory is Roughly 0.25 to 1 Terabyte per 1 Teraflop |
If you are lucky/work hard: Realized performance is 30% of Peak |
HPCC is a maturing field with many organizations installing large scale systems |
These include NSF (academic computations) with PACI activity, DoE (Dept of Energy) with ASCI and DoD (Defense) with Modernization |
There are new applications with new algorithmic challenges
|
However these are not "qualitatively" new concepts |
Software ideas are "sound" but note simulation is only a few percent of total computer market and software immature
|
[Figure: idealized multicomputer -- each node holds one or more processors (P) and local memory, typically connected by a bus on a board; the nodes are linked by an interconnection network]
Multicomputer = nodes + network |
Node = processor(s) + local memory |
Access to local memory is cheap
|
Access to remote memory is expensive
|
Cost of remote memory access/communication (including synchronization)
|
Hence data locality is an important property of good parallel algorithms |
[Figure: communication cost model -- the time T to send a message of N words is roughly T = ts + tw*N, where ts is the message startup (latency) and tw the time per word]
Data locality implies CPU finds information it needs in cache which stores most recently accessed information |
This means one reuses a given memory reference in many nearby computations e.g. |
A1 = B*C |
A2 = B*D + B*B |
.... Reuses B |
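As a further illustration of data locality (an assumed Java sketch, not from the foils): traversing a two-dimensional array row by row reuses each cache line many times, while traversing it column by column does not:
  public class Locality {
      public static void main(String[] args) {
          int n = 2000;
          double[][] a = new double[n][n];
          // Good locality: the inner loop walks consecutive elements of one row,
          // so each cache line fetched from a[i] is reused many times
          for (int i = 0; i < n; i++)
              for (int j = 0; j < n; j++)
                  a[i][j] = a[i][j] * 2.0 + 1.0;
          // Poor locality: the inner loop jumps to a different row on every access,
          // so almost every reference misses the cache
          for (int j = 0; j < n; j++)
              for (int i = 0; i < n; i++)
                  a[i][j] = a[i][j] * 2.0 + 1.0;
      }
  }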
[Figure: memory hierarchy -- Processor, Cache, L2 Cache, L3 Cache, Main Memory, Disk; capacity increases and speed decreases as one moves away from the processor, with roughly a factor of 100 between processor and main memory speed]
For both parallel and sequential computers, cost is accessing remote memories with some form of "communication" |
Data locality addresses this cost in both cases |
The differences are quantitative: the size of the effect, and what is done by the user versus what is done automatically |
[Figure: a parallel machine as a collection of sequential memory hierarchies -- each processor has its own cache, L2 cache, L3 cache and main memory; processors on one board are linked by board-level interconnection networks (slow), and boards are linked by a system-level interconnection network (very slow)]
Major architecture types
|
Idealized Multicomputer Model fits current architectures pretty well |
Parallel computers allow several CPUs to contribute to a computation simultaneously and different architectures take different steps to enable this coordinated action. |
For our purposes, a parallel computer has three types of parts:
|
Some Key points: Easier in Some Architectures/Applications than others!
|
Multiple Instruction/Multiple Data |
Processors with local memory connected by high-speed interconnection network |
Typically high bandwidth, medium latency |
Hardware support for remote memory access |
Model breaks down when topology matters |
Examples: Cray T3E, IBM SP |
Every processor has a memory others can't access. |
Advantages:
|
Disadvantages:
|
Conceptually, the nCUBE, CM-5, Paragon, SP-2 and Beowulf PC clusters are quite similar.
|
To program these machines:
|
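For example, a minimal SPMD sketch of programming such a machine (an assumed illustration; the mpi.* class and method names follow an mpiJava-style Java binding of MPI and should be read as illustrative, not as a definitive API):
  import mpi.*;   // assumes an mpiJava-style binding; treat names as illustrative

  public class SpmdHello {
      public static void main(String[] args) throws MPIException {
          MPI.Init(args);
          int rank = MPI.COMM_WORLD.Rank();   // which process am I?
          int size = MPI.COMM_WORLD.Size();   // how many processes in total?
          double[] buf = new double[1];
          if (rank != 0) {
              buf[0] = rank * rank;            // each worker computes on its own local memory ...
              MPI.COMM_WORLD.Send(buf, 0, 1, MPI.DOUBLE, 0, 99);   // ... and ships the result to rank 0
          } else {
              for (int src = 1; src < size; src++) {
                  MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.DOUBLE, src, 99);
                  System.out.println("from " + src + ": " + buf[0]);
              }
          }
          MPI.Finalize();
      }
  }
The same program text runs on every node; behaviour differs only through the process rank, and all data sharing is by explicit messages.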
Processors access shared memory via bus |
Low latency, high bandwidth |
Bus contention limits scalability |
Search for scalability introduces locality
|
Examples: Cray T90, SGI Power Challenge, Sun |
All processors access the same memory. |
Advantages:
|
Disadvantages:
|
Interconnection network shown here is actually for the BBN Butterfly, but C-90 is in the same spirit. |
These machines share data by direct access.
|
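A minimal Java sketch of this direct access to shared data (an assumed illustration, not from the foils); because every thread can touch the same memory, the update must be synchronized:
  public class SharedCounter {
      private long count = 0;
      // synchronized: only one thread at a time may update the shared datum
      public synchronized void add(long x) { count += x; }
      public synchronized long get() { return count; }

      public static void main(String[] args) throws InterruptedException {
          final SharedCounter c = new SharedCounter();
          Thread[] workers = new Thread[4];
          for (int t = 0; t < workers.length; t++) {
              workers[t] = new Thread(new Runnable() {
                  public void run() {
                      for (int i = 0; i < 100000; i++) c.add(1);   // direct access to shared memory
                  }
              });
              workers[t].start();
          }
          for (int t = 0; t < workers.length; t++) workers[t].join();
          System.out.println("total = " + c.get());   // 4 threads x 100000 = 400000
      }
  }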
A hybrid of distributed and shared memory |
Small groups of processors share memory; others access across a scalable network |
Low to moderate latency, high bandwidth |
Model simplifies the multilevel hierarchy |
Examples: SGI Origin, HP Exemplar |
Combining the (dis)advantages of shared and distributed memory |
Lots of hierarchical designs are appearing.
|
Workstations connected by network |
Cost effective |
High latency, low to moderate bandwidth |
Often lack integrated software environment |
Multicomputer Model breaks down if connectivity limited |
Example Connectivities: Ethernet, ATM crossbar, Myrinet |
[Figure: a cluster -- ordinary PCs or workstations joined by an "ordinary" network]
A parallel algorithm is a collection of tasks and a partial ordering between them. |
Design goals: We will return to this later on
|
Sources of parallelism: We will discuss this now
|
Data-parallel algorithms exploit the parallelism inherent in many large data structures.
|
Analysis:
|
Now we go through a set of semi-realistic examples of parallel computing and show they use various forms of data parallelism |
Seismic Wave Simulation: Regular mesh of grid points |
Cosmology (Universe) Simulation: Irregular collection of stars or galaxies |
Computer Chess: "Data" is now parts of a game tree with major complication of pruning algorithms
|
Defending the Nation: Tracking multiple missiles achieves parallelism from set of missiles |
Finite Element (NASTRAN) structures problem: Irregular collection of nodal (grid) points clustered near a crack |
2 Different types of Mappings in Physical Spaces |
Both are static
|
Different types of Mappings -- A very dynamic case without any underlying Physical Space |
c) Computer Chess with a dynamic game tree decomposed onto 4 nodes |
And the corresponding poor workload balance |
And excellent workload balance |
Functional parallelism exploits the parallelism between the parts of many systems.
|
Analysis:
|
Applications are metaproblems with a mix of module (aka coarse grain functional) and data parallelism |
Modules are decomposed into parts (data parallelism) and composed hierarchically into full applications. They can be the
|
Modules are "natural" message-parallel components of problem and tend to have less stringent latency and bandwidth requirements than those needed to link data-parallel components
|
Assume that the primary goal of a metacomputing system is to add to existing parallel computing environments a higher level supporting module parallelism
|
Use Java/Distributed Object Technology for modules -- note Java to growing extent used to write servers for CORBA and COM object systems |
We have multiple supercomputers in the backend -- one doing CFD simulation of airflow; another structural analysis while in more detail you have linear algebra servers (Netsolve); Optimization servers (NEOS); image processing filters(Khoros);databases (NCSA Biology workbench); visualization systems(AVS, CAVEs)
|
All linked to collaborative information systems in a sea of middle tier servers(as on previous page) to support design, crisis management, multi-disciplinary research |
[Figure: a metacomputing gateway -- a middle tier of servers (Gateway Control, agent-based choice of compute engine, multidisciplinary control via WebFlow, parallel database proxy, NEOS optimization control, Origin 2000 proxy, NetSolve linear algebra server, IBM SP2 proxy) linking clients to back-end MPPs, databases, matrix solvers, optimization services and data analysis servers]
Many applications are what is called (essentially) "embarrassingly parallel" or, more kindly, "pleasingly parallel" |
These are made up of independent concurrent components
|
In contrast points in a finite difference grid (from a differential equation) canNOT be updated independently |
Such problems are often formally data-parallel but can be handled much more easily -- like functional parallelism |
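A minimal Java sketch of a pleasingly parallel computation (an assumed illustration; expensiveIndependentCalculation is a hypothetical stand-in for any independent piece of work):
  public class PleasinglyParallel {
      public static void main(String[] args) throws InterruptedException {
          final double[] result = new double[8];
          Thread[] tasks = new Thread[result.length];
          for (int t = 0; t < tasks.length; t++) {
              final int id = t;
              tasks[t] = new Thread(new Runnable() {
                  public void run() {
                      // Each component is completely independent: no messages,
                      // no shared intermediate state, just its own slot of the output
                      result[id] = expensiveIndependentCalculation(id);
                  }
              });
              tasks[t].start();
          }
          for (int t = 0; t < tasks.length; t++) tasks[t].join();   // results collected only at the end
      }
      // Hypothetical placeholder for an independent piece of work
      static double expensiveIndependentCalculation(int id) {
          double x = 0; for (int i = 1; i <= 1000000; i++) x += 1.0 / (i + id); return x;
      }
  }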
A parallel language provides an executable notation for implementing a parallel algorithm. |
Design criteria:
|
Usually a language reflects a particular style of expressing parallelism. |
Data parallel expresses concept of identical algorithm on different parts of array |
Message parallel expresses the fact that, at a low level, parallelism implies information is passed between different concurrently executing program parts |
Data-parallel languages provide an abstract, machine-independent model of parallelism.
|
Advantages:
|
Disadvantages:
|
Examples: HPF, C*, HPC++ |
Originated on SIMD machines, where parallel operations are in lock-step; this was generalized to MIMD, though not so successfully, as the required compilers are too hard |
Program is based on typically coarse-grain tasks |
Separate address space and a processor number for each task |
Data shared by explicit messages
|
Examples: MPI, PVM, Occam for parallel computing |
Universal model for distributed computing to link naturally decomposed parts e.g. HTTP, RMI, IIOP etc. are all message passing
|
Advantages:
|
Disadvantages:
|
A parallel computation is a set of tasks |
Each task has local data, can be connected to other tasks by channels |
A task can
|
A receiving task blocks until data available |
Note ALL coordination is message passing |
Concurrency is enhanced by creating multiple tasks |
Scalability: More tasks than processor nodes |
Locality: Access local data when possible -- minimize message traffic |
A task (with local data and subtasks) is a unit for modular design -- it is an object! |
Mapping to nodes affects performance only |
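To make the task/channel picture concrete, here is a minimal Java sketch (an assumed illustration) in which a channel is modelled by a blocking queue, so the receiving task blocks until data is available:
  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;

  public class TaskChannel {
      public static void main(String[] args) throws InterruptedException {
          // The channel connecting two tasks; take() blocks until a message arrives
          final BlockingQueue<Double> channel = new ArrayBlockingQueue<Double>(16);

          Thread producer = new Thread(new Runnable() {
              public void run() {
                  try {
                      for (int i = 0; i < 10; i++) channel.put(Math.sqrt(i));   // send on the channel
                  } catch (InterruptedException e) { }
              }
          });
          Thread consumer = new Thread(new Runnable() {
              public void run() {
                  try {
                      for (int i = 0; i < 10; i++) {
                          double x = channel.take();            // blocking receive
                          System.out.println("received " + x);
                      }
                  } catch (InterruptedException e) { }
              }
          });
          producer.start(); consumer.start();
          producer.join(); consumer.join();
      }
  }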
Partition
|
Communication
|
Agglomeration
|
Mapping
|
Goal: identify opportunities for concurrent execution (define tasks: computation+data) |
Focus on data operated on by algorithm ...
|
... or on the operations performed
|
Identify communication requirements
|
Example: finite difference computation |
Must communicate with each neighbor |
X(i) = ( X(i-1) + 2*X(i) + X(i+1) ) / 4 |
Partition creates one task per point (the figure shows the stencil for the point i=2) |
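A minimal Java sketch (an assumed illustration, not from the foils) of this update once neighbouring points have been agglomerated into one task per block: the task owns a contiguous block of points plus two "ghost" values that must be obtained from its left and right neighbours before each sweep:
  public class StencilSweep {
      // newX[i] = (oldX[i-1] + 2*oldX[i] + oldX[i+1]) / 4 over the locally owned block.
      // oldX[0] and oldX[n+1] are ghost points: before each sweep they are filled
      // with boundary values from the left and right neighbour tasks (by message
      // passing on a multicomputer, or by direct reads on a shared-memory machine).
      static void sweep(double[] oldX, double[] newX) {
          int n = oldX.length - 2;              // number of locally owned points
          for (int i = 1; i <= n; i++)
              newX[i] = (oldX[i - 1] + 2.0 * oldX[i] + oldX[i + 1]) / 4.0;
      }
  }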
Once tasks + communication determined, "agglomerate" small tasks into larger tasks |
Motivations
|
Caveats
|
Place tasks on processors, to
|
Techniques:
|
A parallel computer is any old collection of processing elements that cooperate to solve large problems fast
|
Some broad issues:
|
Role of a computer architect:
|
Parallelism:
|
History: diverse and innovative organizational structures, often tied to novel programming models |
Rapidly maturing under strong technological constraints
|
Technological trends make parallel computing inevitable
|
Need to understand fundamental principles and design tradeoffs, not just taxonomies
|
Application demands: Our insatiable need for computing cycles
|
Technology Trends
|
Architecture Trends
|
Economics |
Current trends:
|
Demand for cycles fuels advances in hardware, and vice-versa
|
Range of performance demands
|
Goal of applications in using parallel machines: Speedup
|
For a fixed problem size (input data set), performance = 1/time
|
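In the usual notation (a standard definition, stated here for completeness rather than taken from the foil):
  \mathrm{Speedup}(p) \;=\; \frac{\mathrm{Time}(1\ \mathrm{processor})}{\mathrm{Time}(p\ \mathrm{processors})} \;=\; \frac{\mathrm{Performance}(p)}{\mathrm{Performance}(1)}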
[Figure: benchmark performance results from March 1996]
Parallelism is pervasive |
Small to moderate scale parallelism very important |
Transition to parallel computing has occurred for scientific and engineering computing but this is 1-2% of computer market |
In rapid progress in commercial computing
|
Desktop also uses multithreaded programs, which are a lot like parallel programs (again functional not data parallel) |
Demand for improving throughput on sequential workloads
|
Solid application demand exists and will increase |
The natural building block for multiprocessors is now also about the fastest! |
Microprocessor performance increases 50% - 100% per year |
Transistor count doubles every 3 years |
DRAM size quadruples every 3 years |
Huge investment per generation is carried by huge commodity market |
Note that single-processor performance is plateauing, but that parallelism is a natural way to improve it. |
Sequential or small multiprocessors -- today 2 processor PC's |
Basic advance is decreasing feature size
|
Die size is growing too
|
Performance > 100x per decade; clock rate 10x, rest of increase is due to transistor count |
How to use more transistors?
|
Fundamental issue is resource distribution, as in uniprocessors |
CPI = Clock Cycles per Instruction |
Clock rate increases roughly 30% per year |
100 million transistors on a chip by the early 2000's |
Transistor count grows much faster than clock rate: roughly 40% per year, an order of magnitude more contribution over two decades |
Divergence between memory capacity and speed more pronounced
|
Larger memories are slower, while processors get faster
|
Parallelism increases effective size of each level of hierarchy, without increasing access time |
Parallelism and locality within memory systems too
|
Disks too: Parallel disks plus caching |
The fundamental principles behind the use of concurrent computers are identical to those used in society - in fact they are partly why society exists. |
If a problem is too large for one person, one does not hire a SUPERman, but rather puts together a team of ordinary people... |
cf. the construction of Hadrian's Wall |
Domain Decomposition is Key to Parallelism |
Need "large" subdomains: l >> l_overlap |
AMDAHL's LAW, or |
Too many cooks spoil the broth |
says that the speedup S is small if the efficiency e is small -- or, for Hadrian's Wall, equivalently that S is small if the length l assigned to each mason is small |
But this is irrelevant, as we do not need parallel processing unless the problem is big! |
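One standard way to write the relation being invoked here (the notation is assumed, not taken from the foil; the wall analogy identifies the overhead with the overlap regions between masons):
  S \;=\; \varepsilon\, N, \qquad \varepsilon \;=\; \frac{1}{1 + f_{\mathrm{overhead}}}, \qquad f_{\mathrm{overhead}} \;\sim\; \frac{l_{\mathrm{overlap}}}{l}
So with N masons (processors) the speedup stays near N only when each stretch of wall l is much longer than the overlap l_overlap -- that is, only when the problem is big.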
"Pipelining" or decomposition by horizontal section is:
|
Hadrian's Wall is one dimensional |
Humans represent a flexible processor node that can be arranged in different ways for different problems |
The lesson for computing is: |
Original MIMD machines used a hypercube topology; the hypercube subsumes several other topologies, including all meshes, so it is a flexible concurrent computer that can tackle a broad range of problems. Current machines use interconnect structures different from the hypercube but preserve this capability. |
Comparing Computer and Hadrian's Wall Cases |
At the finest resolution, collection of neurons sending and receiving messages by axons and dendrites |
At a coarser resolution |
Society is a collection of brains sending and receiving messages by sight and sound |
Ant Hill is a collection of ants (smaller brains) sending and receiving messages by chemical signals |
Lesson: All Nature's Computers Use Message Passing |
With several different Architectures |
Problems are large -- use domain decomposition; overheads are edge effects |
Topology of the processor matches that of the domain -- a processor with a rich, flexible node topology matches most domains |
Regular homogeneous problems are easiest, but irregular or inhomogeneous problems can also be handled |
Can use local and global parallelism |
Can handle concurrent calculation and I/O |
Nature always uses message passing as in parallel computers (at lowest level) |
The case of Programming a Hypercube |
Each node runs software that is similar to sequential code |
e.g., FORTRAN with geometry and boundary value sections changed |
Geometry irregular but each brick takes about the same amount of time to lay. |
Decomposition of wall for an irregular geometry involves equalizing number of bricks per mason, not length of wall per mason. |
Fundamental entities (bricks, gargoyles) are of different complexity |
Best decomposition dynamic |
Inhomogeneous problems run on concurrent computers but require dynamic assignment of work to nodes and strategies to optimize this |
(we use neural networks, simulated annealing, spectral bisection etc.) |
Global Parallelism
|
Local Parallelism
|
Local and Global Parallelism |
Should both be Exploited |
Disk (input/output) Technology is better matched to several modest power processors than to a single sequential supercomputer |
Concurrent Computers natural in databases, transaction analysis |
Simulate atmospheric processes
|
Represent atmosphere state by 3-D grid
|
Computation includes
|
Discretize the (continuous) domain by a regular Nx x Ny x Nz grid
|
Approximate derivatives by finite differences
|
Use domain decomposition
|
Finite difference stencil horizontally
|
Radiation calculations vertically
|
Diagnostic sums
|
In horizontal
|
In vertical, clump all points in column
|
Resulting algorithm "reasonably scalable"
|
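A minimal Java sketch of the horizontal block decomposition described above (an assumed illustration): the Nx x Ny horizontal grid is split over a Px x Py processor array, and every vertical column stays whole on one processor:
  public class BlockDecomposition {
      // Map horizontal grid point (i, j), 0 <= i < nx, 0 <= j < ny,
      // onto a px-by-py processor array.  The whole vertical column at (i, j)
      // lands on one processor, so the radiation calculation needs no communication.
      static int owner(int i, int j, int nx, int ny, int px, int py) {
          int pi = (int) ((long) i * px / nx);   // block index in x
          int pj = (int) ((long) j * py / ny);   // block index in y
          return pi * py + pj;                   // processor rank owning this column
      }
  }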
Technique depends on load distribution; sketches of the block and cyclic mappings follow the list below |
1) Agglomerate to one task per processor
|
2) Extend (1) to incorporate cyclic mapping
|
3) Use dynamic, local load balancing
|
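A sketch of the block versus cyclic mappings referred to in (1) and (2) above (an assumed illustration):
  public class Mappings {
      // Block mapping: contiguous chunks, good locality, but poor balance
      // if the costly grid points are clustered in one region
      static int blockOwner(int i, int n, int p) { return (int) ((long) i * p / n); }

      // Cyclic mapping: point i goes to processor i mod p, spreading clustered
      // work evenly at the price of more neighbour communication
      static int cyclicOwner(int i, int p) { return i % p; }
  }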
HPCC has developed good research ideas but cannot implement them, as it is solving computing's hardest problems with 1 percent of the funding
|
We have learnt to use commodity hardware either
|
Let us do the same with software and design systems with maximum possible commodity software basis |
The world is building a wonderful distributed computing (information processing) environment using Web (dissemination) and distributed object (CORBA COM) technologies |
This includes Java, Web-linked databases and the essential standards such as HTML(documents), VRML(3D objects), JDBC (Java database connectivity).
|
We will "just" add high performance to this commodity distributed infrastructure
|
The alternative strategy starts with HPCC technologies (such as MPI,HPF) and adds links to commodity world. This approach does not easily track evolution of commodity systems and so has large maintenance costs |
Bottom of Pyramid has 100 times dollar value and 1000 times compute power of best supercomputer |
Web Software MUST be cheaper and better than MPP software as factor of 100 more money invested! |
Therefore natural strategy is to get parallel computing environment by adding synchronization of parallel algorithms to loosely coupled Web distributed computing model |
The Java Language has several good design features
|
Java has a very good set of libraries covering everything from commerce, multimedia, images to math functions (under development at http://math.nist.gov/javanumerics) |
Java has best available electronic and paper training and support resources |
Java is rapidly getting best integrated program development environments |
Java is naturally integrated with the network, and its universal virtual machine supports the powerful "write once, run anywhere" model |
There is a large and growing trained labor force |
Can we exploit this in computational science? |
Use of Java for: |
High Performance Network Computing |
Scientific and Engineering Computation |
(Distributed) Modeling and Simulation |
Parallel and Distributed Computing |
Data Intensive Computing |
Communication and Computing Intensive Commercial and Academic Applications |
HPCC Computational Grids ........ |
Very difficult to find a "conventional name" that doesn't get misunderstood by some community! |
The Web integration of Java gives it excellent "network" classes and support for message passing. |
Thus "Java plus message passing" form of parallel computing is actually somewhat easier than in Fortran or C. |
Coarse grain parallelism very natural in Java |
"Data parallel" language features are NOT in Java and have to be added as a translator: NPAC's HPJava translates to Java plus messaging, just as HPF translates to Fortran plus message passing |
Java has built in "threads" and a given Java Program can run multiple threads at a time
|
Can be used to do more general parallel computing but only on shared memory computers
|
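A minimal sketch of the kind of shared-memory data parallelism Java threads provide (an assumed illustration): the iterations of a loop over an array are divided among the threads of a single Java program:
  public class ThreadedUpdate {
      public static void main(String[] args) throws InterruptedException {
          final double[] x = new double[1000000];
          final int nthreads = 4;
          Thread[] workers = new Thread[nthreads];
          for (int t = 0; t < nthreads; t++) {
              final int lo = t * x.length / nthreads;         // this thread's block of iterations
              final int hi = (t + 1) * x.length / nthreads;
              workers[t] = new Thread(new Runnable() {
                  public void run() {
                      for (int i = lo; i < hi; i++) x[i] = Math.sin(i) * Math.sin(i);
                  }
              });
              workers[t].start();
          }
          for (int t = 0; t < nthreads; t++) workers[t].join();
          // All threads share the array x directly -- this only works because the
          // Java VM here runs on a single shared-memory machine
      }
  }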
Combine threads on a shared memory machine with message passing between distinct distributed memories |
"Distributed" or "virtual" shared memory does support the Java VM, since the hardware gives the Java VM the illusion of shared memory |
[Figure: threads running within each shared-memory node, with message passing between the distributed-memory nodes -- see original foil]
So here is a recipe for developing HPCC (parallel) applications as of August 1998 |
Use MPI for data parallel programs, as the only alternatives today are HPF, HPC++ or parallelizing compilers
|
Today Fortran and C are good production languages |
Today Java can be used for client-side applets and in systems middleware, but it is too slow for production scientific code
|
Use metacomputers for pleasingly parallel and metaproblems -- not for closely knit problems with tight synchronization between parts |
Use where possible web and distributed object technology for "coordination" |
Pleasingly parallel problems can use MPI or web/metacomputing technology |