Given by Geoffrey C. Fox at the GEM Workshop, Santa Fe, on October 24-25, 1997. Foils prepared October 26, 1997.
Summary of Material
We briefly discuss relevant features of the national scene |
We state the computational problem for GEM and analyze parallel computing issues for key steps |
We map into a particle dynamics analogy using the Green's function formalism and consider O(N^2), cut-off O(N), and fast multipole methods |
We go in detail through the fast multipole method |
We expect object-oriented approaches to be very important both at the language level and in using technologies like CORBA to integrate world-wide collaboration, simulations and experiments. |
Geoffrey Fox |
Syracuse University |
NPAC |
111 College Place, Syracuse NY 13244-4100 |
315-443-2163 |
There are national HPCC programs in: |
The USA activities include: |
Software and Algorithms will be more important to GEM than hardware |
Hardware is driven by "commodity technology", e.g. use of chips from the PC and workstation market |
Some current new developments focus on: |
Linking of geographically distributed observations and computation |
Special purpose computers for O(N^2) particle dynamics, which is used in some approaches to some parts of the problem |
Less legacy code than other application areas |
Green's Functions Relate Stress and Deformation to Integrals over slip deficit f(x,t) = slip(x,t)-V*t |
After discretization, the Green's function integrals become sums, with indices i and j running over fault segments |
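Schematically, the discretized form is the following (a sketch only; the actual Green's function kernel G_ij depends on the fault geometry and the elastic model used):

    \sigma_i(t) \;=\; \sum_{j=1}^{N} G_{ij}\, f_j(t),
    \qquad f_j(t) = \mathrm{slip}_j(t) - V\, t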
The friction law then determines if there is an "earthquake" which is a correlated change in the slip deficit |
The Green's Function is in first approximation independent of time and time only enters through time dependence of slip deficit |
We are evolving a set of N differential equations -- think of each fault segment "i" as a particle -- and can do this either by direct time evolution or by Monte Carlo |
These are two standard choices in "ordinary" particle dynamics |
After each time/Monte Carlo step one needs to see if a slip will occur using friction law |
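A minimal sketch of the time-stepped variant (illustrative only; green, v_plate, dt and sigma_crit are placeholder names, and the toy "catch up" rule stands in for a real friction law and stress redistribution):

    import numpy as np

    def evolve(green, slip, v_plate, sigma_crit, dt, n_steps):
        # Toy time-stepped evolution of N fault segments in the slip-deficit form;
        # green is the N x N Green's function matrix, sigma_crit a crude friction threshold.
        for step in range(1, n_steps + 1):
            t = step * dt
            deficit = slip - v_plate * t          # f_j(t) = slip_j(t) - V*t
            stress = green @ deficit              # O(N^2) Green's function sum
            failing = stress > sigma_crit         # crude stick-slip friction check
            slip[failing] = v_plate * t           # toy "earthquake": failed segments catch up
        return slip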
So the basic computation is "long range" and O(N^2) for N fault segments |
We can compare with other classic long range force problems such as particle dynamics (chemistry, astrophysics) and the vortex method in CFD |
But we have the same essential computational feature: the O(N^2) terms lead to a high computational load which needs to be addressed somehow. |
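For concreteness, a direct-summation sketch of such an O(N^2) kernel (plain NumPy; the 1/r interaction is purely a stand-in for the long-range coupling):

    import numpy as np

    def direct_sum(pos, charge):
        # All-pairs O(N^2) evaluation of a 1/r-style long-range interaction.
        n = len(pos)
        phi = np.zeros(n)
        for i in range(n):
            r = np.linalg.norm(pos - pos[i], axis=1)
            r[i] = np.inf                         # exclude self-interaction
            phi[i] = np.sum(charge / r)
        return phi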
1) Ignore it and implement straightforwardly on a conventional computer |
2) Note that one can build special purpose parallel machines, as these need less communication bandwidth (just a one-dimensional pipeline) and much less memory per unit of CPU power than a general purpose (parallel) computer |
3) Cut off long range forces (a common strategy in chemistry, where forces fall off faster (e.g. 1/r^5) than in gravity) |
4) Use the new "fast multipole" algorithms, which have been very successful in other related areas |
Instead of zeroing out contributions of far away particles, these methods do an ingenious multipole expansion of the contributions of far away regions |
These ideas came from many people: Appel, Barnes-Hut, Greengard and Rokhlin |
Salmon (Caltech) and Warren (Los Alamos) implemented this very well for astrophysics with up to 10^8 particles |
Instead of summing over the M stars in a far away cluster, "just" replace by the contribution of the centroid |
Fast Multipole "just" applies this idea recursively with a sum over multipoles extending simple centroid formula. |
16 Pentium Pro's in a cluster (cost $30K today) reach about 1 gigaflop on a tree code with 10^7 particles |
6800 Pentium Pro's (ASCI Sandia machine) with around 300 million particles |
But the tree code is 10^5 times as efficient as the O(N^2) code on this problem |
Real astrophysics is 3D but illustrate for 2D case |
Chop space recursively into square cells so that at most one (or some specified number) of particles is in each "leaf" cell |
We show this recursive break-up in a simple case |
Note that the refinement depends on the local particle density |
1) Simple direct sum over all particles in the region |
2) Just the single centroid contribution |
3) Open up the region into 8 cells and use the centroid of each cell (see the sketch below) |
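A compressed Barnes-Hut-style sketch of these choices in 2D (illustrative only; `theta` is the usual opening-angle parameter, the 1/r kernel is again a stand-in, and distinct particle positions are assumed):

    import numpy as np

    class Cell:
        # One square cell of the 2D tree, refined until each leaf holds <= 1 particle.
        def __init__(self, center, size, pts, mass):
            self.center, self.size = center, size
            self.mass = mass.sum()
            self.cm = (pts * mass[:, None]).sum(axis=0) / self.mass
            self.pts, self.m = pts, mass
            self.children = []
            if len(pts) > 1 and size > 1e-9:
                quad = 2 * (pts[:, 0] > center[0]) + (pts[:, 1] > center[1])
                for q in range(4):
                    sel = quad == q
                    if sel.any():
                        shift = (np.array([q // 2, q % 2]) - 0.5) * size / 2
                        self.children.append(Cell(center + shift, size / 2, pts[sel], mass[sel]))

    def field_at(x, cell, theta=0.5):
        # 1) leaf: direct sum  2) far cell: single centroid term  3) else: open the cell
        d = np.linalg.norm(x - cell.cm)
        if not cell.children:
            r = np.linalg.norm(cell.pts - x, axis=1)
            return float(np.sum(cell.m[r > 0] / r[r > 0]))
        if cell.size / d < theta:
            return cell.mass / d
        return sum(field_at(x, c, theta) for c in cell.children)

Usage would look like root = Cell(np.array([0.5, 0.5]), 1.0, points, masses) followed by field_at(np.array([0.1, 0.2]), root).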
O(N^2) algorithms are extremely efficient on parallel machines, as the communication overhead falls like 1/n, where n is the number of particles per processor |
This type of performance should be compared with classic grid problems in d dimensions, where the overhead is proportional to edge over area and goes like 1/n^(1/d) |
So the grain size needed in 3D local problems is the "cube" of that needed in the O(N^2) problem |
Of course the reason that the communication overhead is so small is that the computation is exceptionally large! |
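In symbols, with n particles (or grid points) per processor (a standard back-of-the-envelope estimate):

    f_{\rm comm}\big|_{O(N^2)} \;\sim\; \frac{1}{n}
    \qquad\text{versus}\qquad
    f_{\rm comm}\big|_{\text{grid, $d$ dimensions}} \;\sim\; \frac{1}{n^{1/d}}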
Note can "understand" 1/n dependence as write formula as Sj Si Force on i = function(i and j) -- |
important that outer sum is j not i! |
Each j is communicated to each node and is then used to increment the force on n/2 particles in the given node (using Newton's law of action and reaction) |
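A serial sketch of this loop ordering (the comments mark what would be communicated in a parallel run; the gravity-like pairwise kernel is only a placeholder and distinct positions are assumed):

    import numpy as np

    def forces_outer_j(pos, mass):
        # Outer loop over j: in parallel, particle j is broadcast to every node, and each
        # node increments forces only on its locally stored particles (about n/2 of them,
        # thanks to Newton's third law).
        force = np.zeros_like(pos)
        for j in range(len(pos)):                 # j would be broadcast to all nodes
            for i in range(j):                    # local updates for this received j
                d = pos[i] - pos[j]
                f = mass[i] * mass[j] * d / (np.dot(d, d) ** 1.5)
                force[i] -= f                     # action ...
                force[j] += f                     # ... and reaction
        return force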
This is the classic "nearest neighbor" problem where one uses domain decomposition and communicates particles around the edge of each processor's domain. |
Earthquakes cause load imbalance as "action" concentrated in processors where quake occurs |
Can be addressed at cost of increasing communication by using "block-cyclic" decomposition so each processor gets part of action |
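A one-line sketch of such an assignment (block is a tunable block size; both names are illustrative):

    def block_cyclic_owner(i, block, n_procs):
        # Deal out contiguous blocks of fault segments round-robin, so a localized
        # quake is spread over many processors instead of landing on just one.
        return (i // block) % n_procs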
The problem is particularly bad for Monte Carlo, as the classic "cluster algorithms" needed near critical points to get good performance have poor parallel performance |
Long range versions of Monte Carlo will perform satisfactorily, as one can parallelize over the long range force part of the computation |
In 2(3) dimensions add 2(3) binary bits at each step down tree |
Each time one goes down one level of tree, add X Y Z bit |
64 bit key supports 21 levels |
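A sketch of building such a key in 2D by bit interleaving (the 3D version interleaves three bits per level; function and argument names are illustrative):

    def morton_key_2d(ix, iy, levels=21):
        # Interleave the bits of the integer cell coordinates (ix, iy):
        # each level down the tree contributes one X bit and one Y bit.
        key = 0
        for level in range(levels - 1, -1, -1):
            key = (key << 2) | (((ix >> level) & 1) << 1) | ((iy >> level) & 1)
        return key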
First you must decompose the cells among processors, which is nontrivial as the tree-like structure is irregular and dynamic (maybe not in GEM) |
Originally one used orthogonal recursive bisection, chopping space successively in two along different dimensions |
However it is better to run a "space filling curve" through the cells and divide the curve into chunks of equal work (which is NOT equal numbers of particles, as there is more computation in dense areas of particles) |
Simplest is to sort keys and divide up this sorted list (Morton curve) |
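A sketch of that division (work[i] is some per-cell cost estimate, e.g. interaction counts from the previous step; names are illustrative):

    import numpy as np

    def split_by_work(keys, work, n_procs):
        # Sort cells along the Morton curve, then cut the curve into n_procs chunks
        # of roughly equal total work (not equal numbers of cells or particles).
        order = np.argsort(keys)                  # walk the space-filling curve
        cum = np.cumsum(work[order])
        targets = cum[-1] * np.arange(1, n_procs) / n_procs
        cuts = np.searchsorted(cum, targets)
        return np.split(order, cuts)              # one chunk of cell indices per processor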
Parallelism is clear -- update particles in parallel but what about communication? |
It requires a thoughtful algorithm, but one can efficiently fetch the information needed. |
Effectively the top of the tree is replicated in all processors, and they just differ near the bottom, which is expanded in a way that depends on where the processor is |
Having fetched all needed data, one can update all particles in the processors with the high efficiency seen in the O(N^2) problem. |
In astrophysics, one also needs to construct the tree and rebalance the load at essentially each time step |
There is a straightforward parallel tree building algorithm |
These methods outperform O(N^2) when there are more than a few thousand particles (unless the distribution is very inhomogeneous) |
Note that tree codes are a simple example of the general need in GEM for technologies that bridge different physical scales |
Information travels in O(1 or log K) steps and not K steps for a grid of size K |
If we look at |
Computational Fluid Dynamics |
Global Climate or Weather Simulations |
Flow in Porous Media |
Study of Phase Transitions |
We see, as in GEM, uncertainties in some of the critical basic physics, but these fields have made clear progress even with a mix of phenomenology (turbulence models) and fundamental equations (e.g. the Navier-Stokes equations) |