Given by Geoffrey C. Fox and Nancy McCracken for Computational Science for Simulations in Fall Semester 1998. Foils prepared 24 August 98
Fall Semester 1998 |
Geoffrey Fox |
Northeast Parallel Architectures Center |
Syracuse University |
111 College Place |
Syracuse NY |
gcf@npac.syr.edu |
We Introduce Computational Science and Driving Forces
|
We give a simple overview of parallel architectures today with distributed, shared or distributed shared memory |
We describe the growing importance of Java |
We explain pragmatic choices
|
Practical use of leading edge computer science technologies to address "real" applications |
Two tracks at Syracuse |
CPS615/713: Simulation Track |
CPS606/616/640/714: Information Track |
Simulation Track: Large Scale Parallel Computing (low latency, closely coupled components), with topics such as Parallel Algorithms, Fortran90, HPF, MPI, and Interconnection Networks |
Information Track: World Wide Distributed Computing (loosely coupled components), with topics such as Transactions, Security, Compression, PERL, JavaScript, Multimedia, Wide Area Networks, Java, VRML, Collaboration, Integration (middleware), Metacomputing, CORBA, and Databases |
The two forms of Large Scale Computing: scale the computer for the power user, or scale the number of users in proportion to the number of computers |
Internetics technologies span the spectrum from parallel computers and their technologies (HPF, MPI, HPJava) through commodity information systems to distributed computers (HTML, VRML) |
Instructor: Geoffrey Fox -- gcf@npac.syr.edu, 315-443-2163 and Room 3-131 CST |
Backup: Nancy McCracken -- njm@npac.syr.edu, 315-443-4687 and Room 3-234 CST |
Grader: Saleh Elmohamed -- saleh@npac.syr.edu, 315-443-1073 |
NPAC administrative support: Nicole Caza -- ncaza@npac.syr.edu, 315-443-1722 and Room 3-206 CST |
The above can be reached at cps615ad@npac.syr.edu |
Students will be on alias cps615@npac.syr.edu |
Homepage is: http://www.npac.syr.edu/projects/jsufall98 |
Graded on the basis of approximately 8 homework sets, each due on the Thursday of the week after it is handed out (homework is given out on a Tuesday or Thursday) |
There will be one project -- which will start after message passing (MPI) discussed |
Total grade is 70% homework, 30% project |
Languages will be Fortran or C, plus Java -- we will do a survey early on to clarify this |
All homework will be handled through the web and indeed all computer access will be through the VPL or Virtual Programming Laboratory which gives access to compilers, Java visualization etc. through the web |
Status of High Performance Computing and Communications (HPCC) nationally |
What is Computational Science Nationally (and at Syracuse) |
Technology driving forces |
Moore's law and exponentially increasing transistors |
Elementary discussion of Parallel Computing in Society and why it must obviously work in simulation! |
Sequential and Parallel Computer Architectures |
Comparison of Architecture of World Wide Web and Parallel Systems |
Simple base Example -- Laplace's Equation
|
Then we discuss software and algorithms (mini-applications) intermixed |
Programming Models -- SPMD and Message Passing (MPI) with Fortran, C and Java |
Data Parallel (HPF, Fortran90) languages will be discussed but NOT used |
Visualization -- Java Applets |
Other tools such as Collaboration and Performance Measurement will be mentioned |
Parallel computing is multiple processes running on multiple computers, coordinated to solve a single problem! |
Distributed computing can be defined in the same way |
Parallel computing has tight coordination (implying low latency and high bandwidth communication) |
Distributed computing has looser constraints between component processes |
Continuing demands for higher performance |
Physical limits on single processor performance |
Increasing number of transistors in a single chip or on a single board implies parallel architectures |
High costs of internal concurrency |
Result is rise of multiprocessor architectures
|
Increasing importance of DISTRIBUTED computing through the Web and Internet
|
Transistors are getting cheaper and cheaper and it only takes some 0.5 million transistors to make a very high quality CPU
|
Already we build chips with roughly a factor of ten more transistors than this, and the extra transistors are used for "automatic" instruction level parallelism.
|
However getting much more speedup than this requires use of "outer loop" or data parallelism. |
Actually memory bandwidth is an essential problem in any computer as doing more computations per second requires accessing more memory cells per second!
|
Large Scale Simulations in Engineering
|
Large Scale Academic Simulations (Physics, Chemistry, Biology)
|
"Other Critical Real World Applications"
|
HPCC is a maturing field with many organizations installing large scale systems |
These include NSF (academic computations) with PACI activity, DoE (Dept of Energy) with ASCI and DoD (Defense) with Modernization |
There are new applications with new algorithmic challenges
|
However these are not "qualitatively" new concepts |
Software ideas are "sound", but note that simulation is only a few percent of the total computer market and its software is immature
|
[Figure: idealized multicomputer -- several nodes, each containing processor(s) P and memory, typically joined by a bus on a board] |
Multicomputer = nodes + network |
Node = processor(s) + local memory |
Access to local memory is cheap
|
Access to remote memory is expensive
|
Cost of remote memory access/communication (including synchronization)
|
Hence data locality is an important property of good parallel algorithms |
[Figure: communication cost model -- the time T to communicate N words is T = ts + tw*N, where ts is the message startup (latency) time and tw is the time per word] |
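As a rough illustration of this cost model, here is a minimal Java sketch (the parameter names and the latency/bandwidth values are assumptions for illustration, not course measurements): the time to send N words is modeled as a startup latency plus a per-word transfer cost.

    // Sketch of the communication cost model T(N) = ts + tw*N.
    // The ts and tw values below are assumed, illustrative numbers.
    public class CommCost {
        static double commTime(int nWords, double tStartup, double tPerWord) {
            return tStartup + tPerWord * nWords;   // latency term + bandwidth term
        }

        public static void main(String[] args) {
            double ts = 50e-6;    // assumed 50 microsecond message startup time
            double tw = 0.1e-6;   // assumed 0.1 microsecond per word transferred
            for (int n : new int[] {1, 100, 10000}) {
                System.out.println(n + " words: " + commTime(n, ts, tw) + " seconds");
            }
        }
    }

Note how the startup term ts dominates for small messages, which is why parallel algorithms prefer a few large messages to many small ones.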
Data locality implies CPU finds information it needs in cache which stores most recently accessed information |
This means one reuses a given memory reference in many nearby computations e.g. |
A1 = B*C |
A2 = B*D + B*B |
.... Reuses B |
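One standard way to engineer this kind of reuse is loop blocking (tiling); the sketch below is illustrative rather than course material, and the class name, block size, and matrix size are all assumed. Each sub-block of B is reused many times while it is still in cache.

    // Illustrative sketch: loop blocking (tiling) reuses each sub-block of B
    // many times while it is still in cache -- the same idea as reusing B
    // in nearby computations above. Block size 32 is an assumed value.
    public class BlockedMatMul {
        static void multiply(double[][] a, double[][] b, double[][] c, int block) {
            int n = a.length;
            for (int ii = 0; ii < n; ii += block)
                for (int kk = 0; kk < n; kk += block)
                    for (int jj = 0; jj < n; jj += block)
                        // work on one block: the b[k][j] entries touched here
                        // are reused for every i in the ii-block
                        for (int i = ii; i < Math.min(ii + block, n); i++)
                            for (int k = kk; k < Math.min(kk + block, n); k++)
                                for (int j = jj; j < Math.min(jj + block, n); j++)
                                    c[i][j] += a[i][k] * b[k][j];
        }

        public static void main(String[] args) {
            int n = 128;
            double[][] a = new double[n][n], b = new double[n][n], c = new double[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) { a[i][j] = 1.0; b[i][j] = 2.0; }
            multiply(a, b, c, 32);
            System.out.println("c[0][0] = " + c[0][0]);  // expect 2.0 * n = 256.0
        }
    }

The same triple loop without blocking streams through B in a pattern that falls out of cache for large n; blocking trades a little loop overhead for locality.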
[Figure: memory hierarchy -- Processor, Cache, L2 Cache, L3 Cache, Main Memory, Disk; memory capacity increases and memory speed decreases going down the hierarchy (roughly a factor of 100 difference between processor and main memory speed)] |
For both parallel and sequential computers, the cost is accessing remote memories with some form of "communication" |
Data locality addresses this cost in both cases |
The differences are the quantitative size of the effect, and what is done by the user versus what is done automatically |
[Figure: memory hierarchy of a parallel machine -- each processor has its own cache and L2 cache; board-level interconnection networks join processors to L3 cache and main memory (slow); a system-level interconnection network joins the boards (very slow)] |
Major architecture types
|
Idealized Multicomputer Model fits current architectures pretty well |
Parallel computers allow several CPUs to contribute to a computation simultaneously and different architectures take different steps to enable this coordinated action. |
For our purposes, a parallel computer has three types of parts:
|
Some Key points: Easier in Some Architectures/Applications than others!
|
Multiple Instruction/Multiple Data |
Processors with local memory connected by high-speed interconnection network |
Typically high bandwidth, medium latency |
Hardware support for remote memory access |
Model breaks down when topology matters |
Examples: Cray T3E, IBM SP |
Every processor has a memory others can't access. |
Advantages:
|
Disadvantages:
|
Conceptually, the nCUBE, CM-5, Paragon, SP-2, and Beowulf PC cluster are quite similar.
|
To program these machines:
|
Processors access shared memory via bus |
Low latency, high bandwidth |
Bus contention limits scalability |
Search for scalability introduces locality
|
Examples: Cray T90, SGI Power Challenge, Sun |
All processors access the same memory. |
Advantages:
|
Disadvantages:
|
Interconnection network shown here is actually for the BBN Butterfly, but C-90 is in the same spirit. |
These machines share data by direct access.
|
A hybrid of distributed and shared memory |
Small groups of processors share memory; others access across a scalable network |
Low to moderate latency, high bandwidth |
Model simplifies the multilevel hierarchy |
Examples: SGI Origin, HP Exemplar |
Combining the (dis)advantages of shared and distributed memory |
Lots of hierarchical designs are appearing.
|
Workstations connected by network |
Cost effective |
High latency, low to moderate bandwidth |
Often lack integrated software environment |
Multicomputer Model breaks down if connectivity limited |
Example Connectivities: Ethernet, ATM crossbar, Myrinet |
"Ordinary" Network |
Ordinary PC's or |
Workstations |
A parallel algorithm is a collection of tasks and a partial ordering between them. |
Design goals: We will return to this later on
|
Sources of parallelism: We will discuss this now
|
Data-parallel algorithms exploit the parallelism inherent in many large data structures.
|
Analysis:
|
Now we go through a set of semi-realistic examples of parallel computing and show they use various forms of data parallelism |
Seismic Wave Simulation: Regular mesh of grid points |
Cosmology (Universe) Simulation: Irregular collection of stars or galaxies |
Computer Chess: "Data" is now parts of a game tree with major complication of pruning algorithms
|
Defending the Nation: Tracking multiple missiles achieves parallelism from set of missiles |
Finite Element (NASTRAN) structures problem: Irregular collection of nodal (grid) points clustered near a crack |
Functional parallelism exploits the parallelism between the parts of many systems.
|
Analysis:
|
Applications are metaproblems with a mix of module (aka coarse grain functional) and data parallelism |
Modules are decomposed into parts (data parallelism) and composed hierarchically into full applications. They can be the
|
Modules are "natural" message-parallel components of problem and tend to have less stringent latency and bandwidth requirements than those needed to link data-parallel components
|
Assume that the primary goal of a metacomputing system is to add to existing parallel computing environments a higher level supporting module parallelism
|
Use Java/Distributed Object Technology for modules -- note Java is to a growing extent used to write servers for CORBA and COM object systems |
We have multiple supercomputers in the backend -- one doing CFD simulation of airflow, another doing structural analysis -- while in more detail you have linear algebra servers (NetSolve), optimization servers (NEOS), image processing filters (Khoros), databases (NCSA Biology Workbench), and visualization systems (AVS, CAVEs)
|
All linked to collaborative information systems in a sea of middle tier servers (as on the previous page) to support design, crisis management, and multi-disciplinary research |
[Figure: example multi-server architecture -- a Gateway Control layer (agent-based choice of compute engine, multidisciplinary control via WebFlow) links a Database, Matrix Solver, Optimization Service (NEOS control optimization), NetSolve Linear Algebra Server, Parallel DB Proxy, Data Analysis Server, and proxies (Origin 2000 Proxy, IBM SP2 Proxy) for MPP back ends] |
Many applications are what is called (essentially) "embarrassingly parallel" or, more kindly, "pleasingly parallel" |
These are made up of independent concurrent components
|
In contrast, points in a finite difference grid (from a differential equation) canNOT be updated independently |
Pleasingly parallel problems are often formally data-parallel, but they can be handled much more easily -- like functional parallelism |
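A minimal Java sketch of a pleasingly parallel computation (the worker count, task count, and toy workload are assumptions for illustration): each worker processes its own independent tasks, and the only synchronization is the final join.

    // Sketch of a pleasingly parallel computation: each worker handles an
    // independent slice of tasks and never communicates with the others
    // until the final join. Names and the toy workload are illustrative.
    public class PleasinglyParallel {
        public static void main(String[] args) throws InterruptedException {
            final int nTasks = 1000, nWorkers = 4;
            final double[] results = new double[nTasks];
            Thread[] workers = new Thread[nWorkers];
            for (int w = 0; w < nWorkers; w++) {
                final int id = w;
                workers[w] = new Thread(new Runnable() {
                    public void run() {
                        // round-robin assignment of independent tasks
                        for (int t = id; t < nTasks; t += nWorkers)
                            results[t] = Math.sin(t * 0.001);  // toy independent work
                    }
                });
                workers[w].start();
            }
            for (Thread t : workers) t.join();          // only synchronization point
            double sum = 0;
            for (double r : results) sum += r;
            System.out.println("sum = " + sum);
        }
    }

Because the tasks never communicate, the same structure maps equally well onto threads, separate processes, or machines scattered across the Web.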
A parallel language provides an executable notation for implementing a parallel algorithm. |
Design criteria:
|
Usually a language reflects a particular style of expressing parallelism. |
Data parallel expresses concept of identical algorithm on different parts of array |
Message parallel expresses the fact that at a low level parallelism implies information is passed between different concurrently executing program parts |
Data-parallel languages provide an abstract, machine-independent model of parallelism.
|
Advantages:
|
Disadvantages:
|
Examples: HPF, C*, HPC++ |
Originated on SIMD machines where parallel operations are in lock-step, but generalized (not so successfully, as the compilers are too hard) to MIMD |
Program is based on typically coarse-grain tasks |
Separate address space and a processor number for each task |
Data shared by explicit messages
|
Examples: MPI, PVM, Occam for parallel computing |
Universal model for distributed computing to link naturally decomposed parts -- e.g. HTTP, RMI, and IIOP are all message passing
|
Advantages:
|
Disadvantages:
|
A parallel computation is a set of tasks |
Each task has local data, can be connected to other tasks by channels |
A task can
|
A receiving task blocks until data available |
Note ALL coordination is message passing |
Concurrency is enhanced by creating multiple tasks |
Scalability: More tasks than processor nodes |
Locality: Access local data when possible -- minimize message traffic |
A task (with local data and subtasks) is a unit for modular design -- it is an object! |
Mapping to nodes affects performance only |
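A minimal sketch of the task/channel picture in plain Java threads (this illustrates the model, not an MPI binding; the Channel class and the values sent are invented for the example): two tasks are connected by a channel, and the receiving task blocks until data is available.

    // Minimal sketch of the task/channel model: two tasks connected by a
    // channel; the receive() call blocks until a value has been sent.
    public class TaskChannelDemo {
        // A one-slot channel built from wait/notify.
        static class Channel {
            private Double slot = null;
            synchronized void send(double value) {
                slot = Double.valueOf(value);
                notifyAll();
            }
            synchronized double receive() throws InterruptedException {
                while (slot == null) wait();       // receiving task blocks here
                double v = slot.doubleValue();
                slot = null;
                return v;
            }
        }

        public static void main(String[] args) throws InterruptedException {
            final Channel ch = new Channel();
            Thread producer = new Thread(new Runnable() {
                public void run() { ch.send(3.14); }   // local computation, then send
            });
            producer.start();
            System.out.println("received " + ch.receive());  // blocks until sent
            producer.join();
        }
    }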
Partition
|
Communication
|
Agglomeration
|
Mapping
|
Goal: identify opportunities for concurrent execution (define tasks: computation+data) |
Focus on data operated on by algorithm ...
|
... or on the operations performed
|
Identify communication requirements
|
Example: a finite difference computation, where each point must communicate with its neighbors: X_i = (X_(i-1) + 2*X_i + X_(i+1)) / 4 |
Partitioning creates one task per point (i = 2 above) |
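A small Java sketch of this update (array size, boundary values, and iteration count are illustrative assumptions): a Jacobi-style sweep writes into a fresh array, so every interior point can be updated independently -- exactly the data parallelism the partition exploits.

    // Sketch of the 1-D finite difference update x_i = (x_{i-1} + 2*x_i + x_{i+1})/4
    // over an array, using a Jacobi-style sweep so all points can be updated
    // independently. Array size and boundary values are illustrative.
    public class StencilSweep {
        static double[] sweep(double[] x) {
            double[] xNew = x.clone();               // boundaries stay fixed
            for (int i = 1; i < x.length - 1; i++)
                xNew[i] = (x[i - 1] + 2 * x[i] + x[i + 1]) / 4.0;
            return xNew;
        }

        public static void main(String[] args) {
            double[] x = new double[10];
            x[0] = 1.0;                              // boundary condition
            for (int iter = 0; iter < 100; iter++) x = sweep(x);
            System.out.println("x[1] after 100 sweeps = " + x[1]);
        }
    }

In a message passing version each task would own a slice of x and exchange the end points of its slice with its two neighbors before every sweep.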
Once tasks + communication determined, "agglomerate" small tasks into larger tasks |
Motivations
|
Caveats
|
Place tasks on processors, to
|
Techniques:
|
Simulate atmospheric processes
|
Represent atmosphere state by 3-D grid
|
Computation includes
|
Discretize the (continuous) domain by a regular Nx × Ny × Nz grid
|
Approximate derivatives by finite differences
|
Use domain decomposition
|
Finite difference stencil horizontally
|
Radiation calculations vertically
|
Diagnostic sums
|
In horizontal
|
In vertical, clump all points in column
|
Resulting algorithm "reasonably scalable"
|
Technique depends on load distribution |
1) Agglomerate to one task per processor
|
2) Extend (1) to incorporate cyclic mapping
|
3) Use dynamic, local load balancing
|
A parallel computer is any old collection of processing elements that cooperate to solve large problems fast
|
Some broad issues:
|
Role of a computer architect:
|
Parallelism:
|
History: diverse and innovative organizational structures, often tied to novel programming models |
Rapidly maturing under strong technological constraints
|
Technological trends make parallel computing inevitable
|
Need to understand fundamental principles and design tradeoffs, not just taxonomies
|
Application demands: Our insatiable need for computing cycles
|
Technology Trends
|
Architecture Trends
|
Economics |
Current trends:
|
Demand for cycles fuels advances in hardware, and vice-versa
|
Range of performance demands
|
Goal of applications in using parallel machines: Speedup
|
For a fixed problem size (input data set), performance = 1/time
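For concreteness, a tiny Java sketch of this measure (the timings below are made-up illustrative numbers, not benchmark results): speedup on p processors is T(1)/T(p), and efficiency is speedup divided by p.

    // Sketch of the fixed-problem-size speedup measure: performance = 1/time,
    // so speedup on p processors is T(1)/T(p) and efficiency is speedup/p.
    public class Speedup {
        public static void main(String[] args) {
            double t1 = 100.0;                        // assumed 1-processor time (s)
            double[] tp = {100.0, 52.0, 27.0, 15.0};  // assumed times on 1,2,4,8 procs
            int[] procs = {1, 2, 4, 8};
            for (int i = 0; i < procs.length; i++) {
                double speedup = t1 / tp[i];
                System.out.println(procs[i] + " procs: speedup = " + speedup
                        + ", efficiency = " + (speedup / procs[i]));
            }
        }
    }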
|
Results from March 96 |
Parallelism is pervasive |
Small to moderate scale parallelism very important |
Transition to parallel computing has occurred for scientific and engineering computing, but this is 1-2% of the computer market |
The transition is in rapid progress in commercial computing
|
Desktop also uses multithreaded programs, which are a lot like parallel programs (again functional not data parallel) |
Demand for improving throughput on sequential workloads
|
Solid application demand exists and will increase |
The natural building block for multiprocessors is now also about the fastest! |
Microprocessor performance increases 50% - 100% per year |
Transistor count doubles every 3 years |
DRAM size quadruples every 3 years |
Huge investment per generation is carried by huge commodity market |
Note that single-processor performance is plateauing, but that parallelism is a natural way to improve it. |
Sequential or small multiprocessors -- today 2 processor PC's |
Basic advance is decreasing feature size (λ)
|
Die size is growing too
|
Performance > 100x per decade; clock rate 10x, rest of increase is due to transistor count |
How to use more transistors?
|
Fundamental issue is resource distribution, as in uniprocessors |
CPI = Clock Cycles per Instruction |
[Figure: clock frequency and transistor count trends -- clock rate grows roughly 30% per year; transistor count grows much faster, roughly 40% per year (an order of magnitude more contribution over two decades), reaching 100 million transistors on a chip by the early 2000s] |
Divergence between memory capacity and speed more pronounced
|
Larger memories are slower, while processors get faster
|
Parallelism increases effective size of each level of hierarchy, without increasing access time |
Parallelism and locality within memory systems too
|
Disks too: Parallel disks plus caching |
HPCC has developed good research ideas but cannot implement them, as it is solving computing's hardest problem with 1 percent of the funding
|
We have learnt to use commodity hardware either
|
Let us do the same with software and design systems with maximum possible commodity software basis |
The world is building a wonderful distributed computing (information processing) environment using Web (dissemination) and distributed object (CORBA, COM) technologies |
This includes Java, Web-linked databases and the essential standards such as HTML (documents), VRML (3D objects), and JDBC (Java database connectivity).
|
We will "just" add high performance to this commodity distributed infrastructure
|
The alternative strategy starts with HPCC technologies (such as MPI and HPF) and adds links to the commodity world. This approach does not easily track the evolution of commodity systems and so has large maintenance costs |
The bottom of the pyramid has 100 times the dollar value and 1000 times the compute power of the best supercomputer |
Web software MUST be cheaper and better than MPP software, as a factor of 100 more money is invested in it! |
Therefore the natural strategy is to get a parallel computing environment by adding synchronization of parallel algorithms to the loosely coupled Web distributed computing model |
The Java Language has several good design features
|
Java has a very good set of libraries covering everything from commerce, multimedia, images to math functions (under development at http://math.nist.gov/javanumerics) |
Java has the best available electronic and paper training and support resources |
Java is rapidly getting the best integrated program development environments |
Java is naturally integrated with the network, and its universal virtual machine supports the powerful "write once, run anywhere" model |
There is a large and growing trained labor force |
Can we exploit this in computational science? |
Use of Java for: |
High Performance Network Computing |
Scientific and Engineering Computation |
(Distributed) Modeling and Simulation |
Parallel and Distributed Computing |
Data Intensive Computing |
Communication and Computing Intensive Commercial and Academic Applications |
HPCC Computational Grids ........ |
Very difficult to find a "conventional name" that doesn't get misunderstood by some community! |
The Web integration of Java gives it excellent "network" classes and support for message passing. |
Thus "Java plus message passing" form of parallel computing is actually somewhat easier than in Fortran or C. |
Coarse grain parallelism very natural in Java |
"Data Parallel" languages features are NOT in Java and have to be added (as a translator) of NPAC's HPJava to Java+Messaging just as HPF translates to Fortran plus message passing |
Java has built-in "threads" and a given Java program can run multiple threads at a time
|
Can be used to do more general parallel computing but only on shared memory computers
|
Combine threads on a shared memory machine with message passing between distinct distributed memories |
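A minimal Java threads sketch of the shared-memory part of this picture (thread count, array size, and names are assumptions for illustration): each thread sums its own slice of a shared array, and the partial sums are combined after join().

    // Sketch of shared-memory parallelism with built-in Java threads: each
    // thread sums its own slice of a shared array; partial sums are combined
    // after join(). Sizes and thread count are illustrative assumptions.
    public class ThreadedSum {
        public static void main(String[] args) throws InterruptedException {
            final int n = 1000000, nThreads = 4;
            final double[] data = new double[n];
            for (int i = 0; i < n; i++) data[i] = 1.0;

            final double[] partial = new double[nThreads];
            Thread[] threads = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                final int id = t;
                threads[t] = new Thread(new Runnable() {
                    public void run() {
                        int lo = id * n / nThreads, hi = (id + 1) * n / nThreads;
                        double s = 0;
                        for (int i = lo; i < hi; i++) s += data[i];  // local slice
                        partial[id] = s;     // each thread writes only its own entry
                    }
                });
                threads[t].start();
            }
            double total = 0;
            for (int t = 0; t < nThreads; t++) { threads[t].join(); total += partial[t]; }
            System.out.println("total = " + total);   // expect 1000000.0
        }
    }

To go beyond one shared memory machine, each such multithreaded process would then exchange its partial result with the others by message passing, as described in the bullet above.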
"Distributed" or "Virtual" Shared memory does support the JavaVM as hardware gives illusion of shared memory to JavaVM |
So here is a recipe for developing HPCC (parallel) applications as of August 1998 |
Use MPI for data parallel programs, as the alternatives today are HPF, HPC++, or parallelizing compilers
|
Today Fortran and C are good production languages |
Today Java can be used for client side applets and in systems middleware but too slow for production scientific code
|
Use metacomputers for pleasingly parallel and metaproblems -- not for closely knit problems with tight synchronization between parts |
Use where possible web and distributed object technology for "coordination" |
Pleasingly parallel problems can use either MPI or web/metacomputing technology |