Given by Nancy McCracken and Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science on Fall Semester 95/96/97. Foils prepared 14 October 1997
Nancy McCracken, |
Geoffrey Fox |
NPAC |
Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
This uses the simple O(N²) Particle Dynamics Problem as a motivator to discuss solution of ordinary differential equations |
We discuss Euler, Runge Kutta and predictor corrector methods |
The simple data parallel O(N²) algorithm is given in Fortran90 and HPF |
The better Pipeline version is also given |
We analyse Performance |
Consider models of physical systems represented as sets of particles, rather than densities (fields), evolving over time |
Examples:
|
Laws of motion are typically ordinary differential equations |
Ordinary means differentiation with respect to one variable -- typically time |
N particles, each with a mass m_i, moving with velocity V_i through 3-dimensional space |
are governed by Newton's equations of motion |
Basic Kinematics |
Newton's Second Law |
Incorporate laws into equations of motion |
Example of force law for molecular dynamics |
ODE's give an equation for the derivative of X with respect to time t |
They can be classified (if second order) by the boundary conditions used |
Initial value problems
|
Boundary value problems
|
Second order equations such as |
can always be rewritten as a system of 2 first-order equations involving X and a new variable Y representing the first order derivative: |
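For example, a generic second-order equation can be reduced as follows (an illustration of the rewriting just described, with g standing for the given right hand side):

\frac{d^2X}{dt^2} = g\!\left(t, X, \frac{dX}{dt}\right)
\quad\Longrightarrow\quad
Y = \frac{dX}{dt}, \qquad \frac{dX}{dt} = Y, \qquad \frac{dY}{dt} = g(t, X, Y)

so the pair (X, Y) satisfies a first-order system to which the methods below apply directly.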
For simplicity, we assume just one first-order equation, where f is the function on right hand side which depends on X and t |
We can always solve this by setting up a grid of equidistant points with grid size h=(B-A)/n where n is an integer. |
Starting from the initial value, we calculate positions one step at a time |
Two sources of error: |
Computational error includes such things as roundoff error, etc. and is generally controlled by having enough significant digits in the computer arithmetic |
Discretization error is the accuracy of the numerical method and has two measures:
|
Euler's method is not practical, but illustrates the technique. |
It involves a linear approximation to get the next point |
Use Taylor's theorem to represent the exact solution Y: |
Whenever f satisfies certain smoothness conditions, there is always a sufficiently small step size h such that the difference between the real function value at ti and the approximation Xi+1 is less than some required error magnitude e. [Burden and Faires] |
Euler's method: one computation of the derivative function f at each step. |
Other methods require less computation in order to produce the specified error e. |
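A minimal Fortran 90 sketch of Euler's method, using an illustrative equation dX/dt = -X with X(0) = 1 (this example equation and all names here are ours, not taken from the foils):

program euler_demo
  implicit none
  real :: t, X, h
  integer :: i, n
  real, parameter :: A = 0.0, B = 2.0   ! interval [A,B]
  n = 8                                 ! number of grid points
  h = (B - A)/n                         ! grid size as defined above
  t = A
  X = 1.0                               ! initial value
  do i = 1, n
     X = X + h*f(t, X)                  ! X_{i+1} = X_i + h f(t_i, X_i)
     t = t + h
  end do
  print *, 'Euler approximation at t =', t, ' is ', X, ' exact ', exp(-B)
contains
  real function f(t, X)
    real, intent(in) :: t, X
    f = -X                              ! right hand side of the ODE
  end function f
end program euler_demo

Halving h should roughly halve the final error, the O(h) behaviour noted in the table below.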
Initial Value Problem with known analytical solution: |
The approximation with Euler's method and h=0.25: |
Calculate a few values of the approximation: |
Note it will take about one million iterations to get an error of order O(10⁻⁶) |
The last column shows global error is of order O(h) as expected |
Use the derivative at one time step to extrapolate the midpoint value - use midpoint derivative to extrapolate the function value at the next time step |
Evaluates the derivative function twice at each time step. Global error - O(h²), second order method |
Sometimes called the midpoint method |
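In symbols, the midpoint rule described above is

X_{i+1} = X_i + h\, f\!\left(t_i + \frac{h}{2},\; X_i + \frac{h}{2} f(t_i, X_i)\right)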
Note the global error is now O(h²) and we get an error of O(10⁻⁵) after 128 iterations, which would take about 1000 times more iterations for Euler's method to achieve |
So Euler has roughly half the computational effort per iteration but requires the square of the number of iterations |
Runge Kutta methods achieve better results than Euler by using intermediate computations at intermediate time values |
The fourth-order rule is the favorite method as it achieves good accuracy with modest computational complexity -- the algorithm is in words: |
Use the derivative at the start of the time step to get a trial midpoint |
Use the derivative at that trial midpoint, applied from the start of the time step, to get a second trial midpoint |
Use the derivative at the second trial midpoint to get a trial end point |
Integrate by Simpson's Rule, using the average of the two midpoint estimates |
Global error is fourth order |
Compared with Euler, Runge-Kutta has 4 times more calculation per time step, but for a given accuracy needs only roughly the fourth root of the number of time steps that Euler would need |
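For reference, the classical fourth-order Runge-Kutta rule that these steps describe is

k_1 = f(t_i, X_i)
k_2 = f(t_i + h/2,\; X_i + (h/2)\,k_1)
k_3 = f(t_i + h/2,\; X_i + (h/2)\,k_2)
k_4 = f(t_i + h,\; X_i + h\,k_3)
X_{i+1} = X_i + \frac{h}{6}\,(k_1 + 2k_2 + 2k_3 + k_4)

where k_2 and k_3 are the two midpoint estimates combined with the Simpson weights 1:2:2:1.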
First, predict X_{i+1} using an explicit equation, with O(h^n) error, and known values. |
Then correct this value by using it in an implicit equation, with O(h^(n+1)) error. |
Simple example: |
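A standard simple instance of this pattern (the case n = 1) is the Euler predictor followed by a trapezoidal corrector:

X^{*}_{i+1} = X_i + h\,f(t_i, X_i) \qquad \text{(explicit predictor, global error } O(h)\text{)}
X_{i+1} = X_i + \frac{h}{2}\left[\, f(t_i, X_i) + f(t_{i+1}, X^{*}_{i+1}) \,\right] \qquad \text{(corrector from the implicit trapezoidal rule, global error } O(h^2)\text{)}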
The predictor/corrector methods use previous values X_{i-1}, X_{i-2}, ...... to increase the order of accuracy -- not extra values between X_i and X_{i+1} as in Runge Kutta |
General form of multi-step difference equation: |
X_{i+1} = a_{m-1} X_i + a_{m-2} X_{i-1} + ... + a_0 X_{i+1-m} + h [ b_m f(t_{i+1}, X_{i+1}) + b_{m-1} f(t_i, X_i) + ... + b_0 f(t_{i+1-m}, X_{i+1-m}) ] |
If the coefficient b_m of the f evaluation at t_{i+1} is zero, this is an explicit equation |
Note these are essentially interpolation formulae: if one uses information from m values of t, one can fit a polynomial of degree m-1 |
Taylor expansion and Polynomial fitting are essentially the same thing! |
Implicit Multistep Methods are obtained using backwards difference interpolating polynomials starting at t_{i+1}. But wherever we need f(t_{i+1}, y(t_{i+1})), we use f(t_{i+1}, X*_{i+1}), where X*_{i+1} is derived from the explicit predictor equation |
Note that implicit formulae should be best, as the explicit method involves extrapolation from t_i to t_{i+1}, whereas in the implicit case t_{i+1} is an endpoint of the region in which the interpolation is done |
Extrapolation is always unreliable and to be avoided! |
Adams-Bashforth fourth-order explicit method (4 step) |
Adams-Moulton fourth order method (3 step) is an implicit Multistep method |
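For reference, the standard forms of these two rules, with f_k = f(t_k, X_k) and the local truncation error terms that give the 251/19 ratio quoted below, are

\text{Adams-Bashforth (explicit, 4 step):}\quad X_{i+1} = X_i + \frac{h}{24}\,(55 f_i - 59 f_{i-1} + 37 f_{i-2} - 9 f_{i-3}), \quad \text{error } \tfrac{251}{720} h^5 y^{(5)}(\xi)
\text{Adams-Moulton (implicit, 3 step):}\quad X_{i+1} = X_i + \frac{h}{24}\,(9 f_{i+1} + 19 f_i - 5 f_{i-1} + f_{i-2}), \quad \text{error } -\tfrac{19}{720} h^5 y^{(5)}(\xi)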
Note the error coefficient of the implicit method is a factor of 251/19 smaller than that of the explicit method of the same order -- this is why extrapolation is not so good! |
Introduce 3 vectors X, V, A for the position, velocity and acceleration of the particles |
Numerical techniques iterate equations over time using Runge Kutta (in our detailed example) or more simply:
|
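The simplest such update is an explicit Euler step for every particle (a sketch of the "more simply" option; the detailed example below uses Runge-Kutta instead):

X_i(t+h) = X_i(t) + h\,V_i(t)
V_i(t+h) = V_i(t) + h\,A_i(t), \qquad A_i = F_i(X_1, \dots, X_N)/m_i \ \text{from the force law}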
Note i labels particles not time steps |
Positions and velocities are 3 × N arrays, X and V |
other variables
|
subroutine for numerical method will take these arguments and update X and V. |
A subroutine called Grav (dataparallel) or MPGrav (message parallel) is assumed to compute new accelerations |
Computation of numerical method is inherently iterative: at each time step, the solution depends on the immediately preceding one. |
At each time step, Grav/MPGrav is called (several times as using Runge Kutta):
|
We will use 4th order Runge Kutta to integrate in time, and the program is designed as an overall routine looping over time with parallelism hidden in the Grav/MPGrav routines |
We first analyse Data Parallel (starting with classic SIMD method) and then go through Message Parallel version |
C Solves Newton's equations of motion using Runge-Kutta method |
C which is globally 4th order. X and V are initial positions and |
C velocities. The system is evolved over a time interval h*ns. |
C X and V contain the updated state at that time.
|
C Grav is hard parallel algorithm and will be given later! |
Spread the positions of the particles into two 3D arrays so that the extra dimension is labelled by the index in the sum over particles that interact with a given particle |
Xj is essentially transpose of Xi in second and third dimension |
function Grav(X,M) |
C accepts positions of particles X and masses of particles M |
C returns accelerations in Grav |
C Uses completely parallel calculation, ignoring anti-symmetry of force
|
! calculates acceleration on body i due to body j in entries ( :,i,j )
|
! diag is true for diagonal of N by N slices
|
! set up arrays of particles Xi and particles Xj
|
! displacements and Euclidean distance
|
! calculate accelerations for all pairs except on main diagonal
|
end function Grav |
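The foil gives only the comments of the routine; the following Fortran 90 sketch is our reconstruction of what such a fully data parallel Grav could look like (the array names Xi, Xj, D, diag follow the comments, the gravitational constant is simply set to 1, and the result is built in a local array A rather than in Grav itself):

function Grav(X, M) result(A)
  ! X(3,N) positions, M(N) masses; returns accelerations A(3,N)
  ! Completely parallel calculation, ignoring the antisymmetry of the force
  real, dimension(:,:), intent(in) :: X
  real, dimension(:),   intent(in) :: M
  real, dimension(size(X,1), size(X,2)) :: A
  real, dimension(size(X,1), size(X,2), size(X,2)) :: Xi, Xj, D
  real, dimension(size(X,2), size(X,2)) :: R3, Mj
  logical, dimension(size(X,2), size(X,2)) :: diag
  real, parameter :: G = 1.0            ! units chosen so G = 1 (assumption)
  integer :: N, k

  N = size(X, 2)
  ! diag is true for the diagonal of the N by N slices
  diag = .false.
  do k = 1, N
     diag(k, k) = .true.
  end do
  ! set up arrays of particles Xi and particles Xj: entry (:,i,j) pairs i with j
  Xi = spread(X, dim=3, ncopies=N)      ! Xi(:,i,j) = X(:,i)
  Xj = spread(X, dim=2, ncopies=N)      ! Xj(:,i,j) = X(:,j)
  Mj = spread(M, dim=1, ncopies=N)      ! Mj(i,j)   = M(j)
  ! displacements and Euclidean distance (cubed)
  D  = Xj - Xi
  R3 = sqrt(sum(D*D, dim=1))**3
  where (diag) R3 = 1.0                 ! avoid dividing by zero on the diagonal
  where (diag) Mj = 0.0                 ! so the diagonal contributes nothing
  ! calculate accelerations for all pairs except the main diagonal and sum over j
  do k = 1, 3
     A(k, :) = G * sum(Mj * D(k, :, :) / R3, dim=2)
  end do
end function Grav

With HPF directives distributing the last dimension of the three-dimensional arrays, every operation here is fully data parallel, at the O(N²) memory cost discussed below.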
Symmetry of force on particles: Fij = -Fji (Newton's Law of Action and Reaction!)
|
There is a Load balancing problem with triangular arrays |
Assuming, for example, that processors are assigned with a block distribution in the column direction.
|
Also, all particle information is sent to all processors, taking O(N²) space, whereas natural algorithms use O(N) space and this is how special purpose machines like GRAPE get their cost effectiveness
|
Space is further wasted as everything is spread to 3 dimensional arrays even when arrays like mass are naturally one dimensional! |
Pair together the data for every particle Xi with the data for every particle Xj by iterating over a pipeline (circulating) array. |
The case where i is compared with i is not needed, as particles don't interact with themselves |
At step k, interact particle i with particle j= 1 + mod((i+k-1),N) |
Accumulate force on i due to j in fixed Ai |
Accumulate negative of this as force on j due to i in circulating Acj |
At the end of the algorithm, add Ai and Aci |
In the parallel version, note that Ai will be calculated in the "home" processor for particle i but Aci will travel around the machine, being accumulated in the processor holding particle j |
Thus this violates the owner computes rule and so this parallel algorithm must be implemented by hand -- the compiler will not find it automatically |
The first step is to circulate (shift) one position and calculate accelerations Fij and Fji in all index positions |
Shifting pipeline (N-1) times gives correct algorithm but does not save "Newton's factor of two". |
Just need (N-1)/2 steps when N is odd and N/2 steps when N is even which saves factor of two. |
function Grav(X,M) |
C accepts positions of particles X and masses of particles M |
C returns accelerations in Grav
|
! A is fixed accelerations - X and M are used for fixed positions and masses
|
!Shift Circulating Arrays Xc Mc Ac to the right
|
! calculate R to be distance over 3-D cordinates
|
if ( mod(N, 2) == 0 ) then ! final one way acceleration if N even
|
end if |
! combine accelerations for final result - circulating particle in i'th |
! position corresponds to fixed particle (i-(N-1)/2) |
Grav = A + cshift (Ac, dim=2, shift = (N-1)/2) |
end function Grav |
Distribute arrays in naive block fashion - if Nproc is the number of processors, each processor has N/Nproc particles. |
Consider time for Runge Kutta invocation of function Grav |
Shifting particles communicates one set of particle information - all processors communicate at the same time giving estimate: |
9 * t_comm (the factor should be 7, as we need only 1, not 3, copies of the mass as used in the simple implementation earlier)
|
Floating point calculations: roughly 3(x,y,z) of -, *, sum, sqrt, exp, /, *, +, *, + which can be summarized as estimate: > 30 tfloat |
Each communicated particle is interacted with the N/Nproc particles in the local partition of that processor and each step has one shift giving a total time for (N-1)/2 steps in Grav: |
Then the total time for the Runge-Kutta solver is: |
Giving O(N²/Nproc) running time in the number of particles. |
Note that parallel overhead or communication time/computation time is proportional to 1/n where n = N/Nproc is grain size. |
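Putting the counts quoted above together gives, per processor (a reconstruction of the estimate the foil refers to, using the 9 t_comm per shift and 30 t_float per pair interaction from the previous foil):

T_{Grav} \approx \frac{N-1}{2}\left[\, 9\, t_{comm} + \frac{N}{N_{proc}}\, 30\, t_{float} \right], \qquad
T_{Runge\text{-}Kutta} \approx 4\, T_{Grav} \ \text{per time step}
\frac{\text{communication}}{\text{computation}} \approx \frac{9\, t_{comm}}{30\, n\, t_{float}} \;\propto\; \frac{1}{n}, \qquad n = \frac{N}{N_{proc}}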
Note this algorithm has an overhead characteristic of a one dimensional problem for
|
Not only is the performance model characteristic of a one dimensional problem, but we also used a one-dimensional parallel decomposition |
Note these one dimensional characteristics are independent of dimension of space that particles move in. |
Normally d is thought of as dimension of physical space in which problem posed. |
But the geometrical interpretation of d is only valid where interaction between particles is itself geometrical i.e. short range |
Rather, the N-body algorithm has no geometric structure, and you see the d=1 characteristic of the algorithm |
See Chapter 3 of Parallel Computing Works for a longer discussion of this from a "Complex Systems" point of view |
Note the simple N-body problem has some interesting features
|
! declarations of global variables |
module nbodyvars |
real, dimension ( : , : ), allocatable :: X, V, M |
integer NB ! number of particles |
real G |
end module |
! allocate arrays |
subroutine setup ( ) |
use nbodyvars |
open (10, file=fnm, status="OLD") |
read (10, *) NB |
allocate (X (3, NB)) |
allocate (V (3, NB)) |
. . . |
end subroutine |
subroutine runge_kutta(h, ns) |
use nbodyvars |
real h; integer ns |
INTERFACE
|
end interface |
. . . |
do k = 1, ns
. . . |
end do |
end subroutine |
subroutine Grav(X, M, A)
|
!HPF$ ALIGN Xdelta1, Vdelta1 . . . WITH X |
. . . |
end subroutine |
! main program |
program nbody |
use nbodyvars |
. . . |
call setup ( ) |
!HPF$ DISTRIBUTE X ( :, BLOCK) |
!HPF$ ALIGN V, M WITH X |
. . . |
do k = 1, np |
call runge_kutta(timestep, ns) |
call print_state () |
end do |
end program |
These O(N²) techniques are successful on astrophysical problems of size a few thousand particles. Larger problems, such as those on the scale of galaxies, do not calculate all pairs of particle interactions but use "fast multipole" methods that estimate the force from regions of distant particles. The data structure for this technique is a Barnes-Hut tree. |
Burden, Richard L. and Faires, J. Douglas, Numerical Analysis, fourth edition, PWS-Kent Publishing Company, 1989. This is the basic ODE reference |
There is also the ODE chapter from the CSEP book, http://www.npac.syr.edu/projects/csep/ode/ode.html |
Chapter 9 of Solving Problems on Concurrent Processors, Volume I covers the message parallel O(N²) algorithm |
Salmon, John K., Parallel Hierarchical N-body Methods, dissertation, Caltech, technical report SCCS-52, CRPC-90-14, 1990, is the original practical fast multipole reference |
Nancy McCracken, |
Geoffrey Fox |
NPAC |
Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
This uses the simple O(N²) Particle Dynamics Problem as a motivator to discuss solution of ordinary differential equations |
We discuss Euler, Runge Kutta and predictor corrector methods |
Various Message parallel O(N²) algorithms are described with performance comments |
There is a related data parallel module sharing the same initial foils |
3 Parallel Programming Paradigms
|
2 Important and very different Algorithms
|
Data Parallel approach is really only useful for the simple O(N²) case and even here it is quite tricky to express algorithm so that it is
|
The shared memory approach is effective for a modest number of processors in both algorithms.
|
Message Parallel approach gives you very efficient algorithms in both cases
|
The characteristic structure of N-body problem is an observable that depends on all pairs of entities from a set of N entities. |
This structure is seen in diverse applications: |
1)Look at a database of items and calculate some form of correlation between all pairs of database entries |
2)This was first used in studies of measurements of a "chaotic dynamical system" with points xi which are vectors of length m |
Put r_ij = distance between x_i and x_j in m dimensional space |
Then probability p(r_ij = r) is proportional to r^(d-1)
|
3)Green's Function Approach to simple Partial Differential equations gives solutions as integrals of known Green's functions times "source" or "boundary" terms.
|
4)In the so called vortex method in CFD (Computational Fluid Dynamics) one models the Navier Stokes Equation as the long range interactions between entities which are the vortices |
5)Chemistry (see foil 7) uses molecular dynamics, so the particles are molecules, but the force law is usually not Newton's gravitation but rather Van der Waals forces, which are long range but fall off faster than 1/r² |
Let MPGrav(i) return the acceleration of i'th particle which is specified by position X(i) and velocity V(i) |
The kernel of algorithm increments X(i),V(i) from t to t+h using Runge-Kutta method. |
This involves 4 function calls to MPGrav(i) for the four different choices of position and time needed in the Runge-Kutta method. |
Let Xuse(i) be position vector used in each function call. Then we have |
(time, Xuse) = (t, X), (t + h/2, X + (h/2) Dxa), (t + h/2, X + (h/2) Dxb), (t + h, X + h Dxc) |
where Dxa Dxb Dxc are shift vectors calculated by previous phase of Runge-Kutta method |
Note that MPGrav(i) result depends on the array Xuse and fixed parameters such as mass
|
The calculation involves an iteration over desired number of time steps increasing t by h each time. |
Each time step involves 9 phases -- at each phase, one loops over all N particles (or rather over all N/Nproc particles stored in a given processor) |
Some phases (4 of them) involve communication and computation; the others "just" SPMD computation. |
The number of phases depends on ODE strategy used -- with the simple Euler method, there are 1 or 2 phases, depending on how one counts. |
1)XdeltaA(i) = V(i) |
Xuse(i) = X(i) running over all i in Processor |
2)VdeltaA(i) = MPGrav(i) involves communication |
3)XdeltaB(i) = V(i) + h*VdeltaA(i)/2 |
Xuse(i) = X(i) +h*XdeltaA(i)/2 |
4)VdeltaB(i) = MPGrav(i) involves communication |
5)XdeltaC(i) = V(i) + h*VdeltaB(i)/2 |
Xuse(i) = X(i) +h*XdeltaB(i)/2 |
6)VdeltaC(i) = MPGrav(i) involves communication |
7)XdeltaD(i) = V(i) + h*VdeltaC(i) |
Xuse(i) = X(i) +h*XdeltaC(i) |
8)VdeltaD(i) = MPGrav(i) involves communication |
9)X(i) becomes (X(i) + h*(XdeltaA(i)+2*XdeltaB(i)+2*XdeltaC(i)+XdeltaD(i))/6 ) |
V(i) becomes (V(i) + h*(VdeltaA(i)+2*VdeltaB(i)+2*VdeltaC(i)+VdeltaD(i))/6 ) |
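A Fortran 90 sketch of this nine-phase kernel for the particles owned by one processor (our illustration; MPGrav is written here as a subroutine that fills the accelerations of the locally owned particles and does its own communication, and all names are ours):

subroutine rk4_nbody(X, V, M, h, ns)
  ! Sketch only: X, V hold the locally owned particles (3 x Nlocal); an
  ! explicit interface for MPGrav is assumed available (e.g. from a module).
  implicit none
  real, dimension(:,:), intent(inout) :: X, V
  real, dimension(:),   intent(in)    :: M
  real, intent(in)    :: h
  integer, intent(in) :: ns
  real, dimension(size(X,1), size(X,2)) :: Xuse, XdA, XdB, XdC, XdD, VdA, VdB, VdC, VdD
  integer :: step

  do step = 1, ns
     XdA = V;              Xuse = X                  ! phase 1
     call MPGrav(Xuse, M, VdA)                       ! phase 2 (communication)
     XdB = V + (h/2)*VdA;  Xuse = X + (h/2)*XdA      ! phase 3
     call MPGrav(Xuse, M, VdB)                       ! phase 4 (communication)
     XdC = V + (h/2)*VdB;  Xuse = X + (h/2)*XdB      ! phase 5
     call MPGrav(Xuse, M, VdC)                       ! phase 6 (communication)
     XdD = V + h*VdC;      Xuse = X + h*XdC          ! phase 7
     call MPGrav(Xuse, M, VdD)                       ! phase 8 (communication)
     X = X + h*(XdA + 2*XdB + 2*XdC + XdD)/6         ! phase 9
     V = V + h*(VdA + 2*VdB + 2*VdC + VdD)/6
  end do
end subroutine rk4_nbody

Phases 2, 4, 6 and 8 are the four MPGrav calls that communicate; everything else is purely local array arithmetic.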
This problem is classic SPMD, with different phases of computation and communication; the four MPGrav phases are further subdividable into subphases which are compute-only or communicate-only |
We divide particles equally between processors
|
As no "locality" in force, all particles are "equal" and it does not matter which particle is placed in which processor |
Remember each of nine phases is a full loop over all particles in a given processor with local index running from 1 ... N/Nproc |
We get "automatic" balanced parallel computation in steps 1) 3) 5) 9) with "embarassingly parallel" (i.e. no communication) operation |
MPGrav(i) uses Xuse(i) and mass M(i) and calculates a 3-vector force from the 3-vectors Xuse(j) |
MPGrav(i) = Σ_{j≠i} M(i)*M(j)*( Xuse(j) - Xuse(i) ) / r_{i,j}³ |
where r_{i,j} = | Xuse(i) - Xuse(j) | is the distance between particle i and particle j |
This calculation involves communication, as N - N/Nproc of the values of j (and the parameters Xuse(j), M(j)) are stored outside the processor holding the i'th particle |
We describe algorithms with increasing complexity and efficiency! |
Note in discussions, we are rather sloppy as to whether i,j are "local" (1...N/Nproc) or "global" (1...N) indices |
For each i in each processor, set MPGrav(i) =0 |
We will implement a naive "owner's-compute rule" algorithm where MPGrav(i) is calculated in processor that is home to i |
Now loop over j = 1...N (j ≠ i) |
When j is stored in processor holding i, increment MPGrav(i) by contribution due to j |
When j stored in a different processor, communicate Xuse(j),M(j) and increment MPGrav(i) |
The parallelism is perfect and both communication and computation are load balanced |
However messages are small (as described, 4 words -- a 3-vector Xuse and mass M) and |
communication ≈ 4 t_comm (for small messages) |
where the computation estimate of ~15 t_float could be higher depending on how 1/r_{i,j}³ is calculated. As this involves a division and a square root, it is expensive if done directly. Sometimes it is better to calculate it by table lookup, but this involves slow memory access |
As typically t_comm/t_float ≈ 10 even for quite large messages (and worse for small ones), the above estimate suggests that communication dominates ... |
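A rough worked number using the estimates above makes the point (a sketch, not a measurement):

\frac{\text{communication}}{\text{computation}} \approx \frac{4\, t_{comm}}{15\, t_{float}} \approx \frac{4 \times 10}{15} \approx 2.7

i.e. roughly three times as long is spent communicating as computing, independent of the grain size.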
We can solve the small message size and poor communication/computation ratio by a simple change in algorithm which still preserves owner's compute rule. |
Reverse loops over i and j so that j is outer loop and i inner loop |
First for each i in processor, set MPGrav(i) = 0 |
Now looping over j, increment each MPGrav(i) by contribution of those j that are local to processor |
Now fetch those j which are off processor and communicate as before M(j), Xuse(j) |
For this j, run over all i in processor, incrementing MPGrav(i) by the contribution of this j |
This version of the algorithm reduces communication by a factor of the grain size n = N/Nproc |
and the total communication overhead is therefore roughly proportional to t_comm/(n t_float) |
which is small for n > 100 |
The algorithm as described before still has small messages but this can be addressed for both "very bad" and "much better" algorithm by "blocking" j loop so that one fetches not 1 but J values of j at a time. |
This implies messages can be "arbitrarily" large and the user can choose J so that:
|
See later comments on cache use and pipelining of messages for further related issues |
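A message parallel sketch of the blocked, j-outer ("much better") algorithm, written with MPI and with the block size J taken equal to the grain size n = N/Nproc so that each broadcast sends one processor's particles to everybody (all routine and variable names here are illustrative, not from the foils):

subroutine mpgrav_blocked(Xlocal, Mlocal, Force, comm)
  use mpi
  implicit none
  real, dimension(:,:), intent(in)  :: Xlocal   ! (3, n) locally owned positions
  real, dimension(:),   intent(in)  :: Mlocal   ! (n)    locally owned masses
  real, dimension(:,:), intent(out) :: Force    ! (3, n) force on local particles
  integer, intent(in) :: comm
  real, dimension(3, size(Mlocal)) :: Xblock
  real, dimension(size(Mlocal))    :: Mblock
  real, dimension(3) :: d
  real :: r3
  integer :: me, nproc, owner, i, j, n, ierr

  call MPI_Comm_rank(comm, me, ierr)
  call MPI_Comm_size(comm, nproc, ierr)
  n = size(Mlocal)
  Force = 0.
  do owner = 0, nproc - 1                       ! outer loop over j blocks
     if (owner == me) then                      ! the block is broadcast once ...
        Xblock = Xlocal
        Mblock = Mlocal
     end if
     call MPI_Bcast(Xblock, 3*n, MPI_REAL, owner, comm, ierr)
     call MPI_Bcast(Mblock, n,   MPI_REAL, owner, comm, ierr)
     do j = 1, n                                ! ... then re-used for every local i
        do i = 1, n
           if (owner == me .and. i == j) cycle  ! no self-interaction
           d  = Xblock(:, j) - Xlocal(:, i)
           r3 = sqrt(sum(d*d))**3
           Force(:, i) = Force(:, i) + Mlocal(i)*Mblock(j)*d/r3
        end do
     end do
  end do
end subroutine mpgrav_blocked

Each broadcast block is re-used against all n local particles, which is exactly the factor-of-n reduction in communication overhead described above.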
As owner's compute rule is obeyed, a good parallelizing compiler should be able to "automatically" find the "much better" algorithm as inverting loops and blocking are standard optimization strategies |
Note that "much better" parallel algorithm is also correct sequential algorithm as naturally uses each j value N-1 times as j block fixed in cache and i values are cycled through
|
General lesson is that amount of computation and amount of data re-use are as important as amount of communication |
The O(N²) long range force algorithm has the largest communication load but the smallest known form of overhead (except for pleasingly parallel problems), which is proportional to 1/n where n is the grain size
|
Compare Laplace's equation (or a general PDE), which has overhead proportional to 1/n^(1/2) in two dimensions and 1/n^(1/3) in three dimensions |
Such PDE's have small (edge) communication but an algorithm with computation proportional to N, not N², each iteration |
The N-body problem has computation proportional to N², and the general rule is that as algorithms get either more complex or more compute intense, the relative amount of communication decreases and does NOT increase
|
The previous algorithm is not as good as it looks, for it has an efficiency of 50% (further reduced by terms of order 1/n)
|
Degradation is because of Newton's law of action and reaction, which says that |
F_{i,j} = -F_{j,i} |
which reduces the sequential computation load by a factor of 2 |
This is not trivial to exploit in the parallel algorithm, as F_{i,j} and F_{j,i} are needed in different processors, and so one MUST violate the owner computes rule to exploit it
|
Introduce a new array MPGrav_travel(i) which will travel through the array picking up the symmetrically generated terms |
First in each processor, initialize both MPGrav(i) and MPGrav_travel(i) to zero |
Now we, as before, have an outer loop over j which in practice will be blocked into J items for message size issues
|
Now loop over each i in each processor and have some flag to decide whether or not F_{i,j} is to be calculated in the Home of i or the Home of j |
if ( F_{i,j} is to be computed in the home of i ) then |
find F_{i,j} and use it to increment MPGrav(i) with F_{i,j}, and |
increment MPGrav_travel(j) with F_{j,i} = -F_{i,j} |
We need a criterion for deciding where to compute F_{i,j} so that for EACH j block there is an equal amount of computation in each processor |
If the criterion is to calculate in the Home of i if i < j (the natural sequential choice), then this is NOT load balanced for a one dimensional block decomposition, as the amount of work decreases as i increases and one gets to later numbered processors |
However one can use a cyclic or scattered decomposition (and "interaction criterion" i<j ) as then each processor has particles distributed throughout array
|
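One balanced choice of "interaction criterion" (an illustration; it is essentially the rule the circulating pipeline algorithm applies implicitly) is to compute F_{i,j} in the home of i when j lies in the half ring ahead of i, which gives every particle exactly (N-1)/2 interactions to evaluate when N is odd:

! Hypothetical load-balanced interaction criterion (sketch, N odd):
! true when F(i,j) should be evaluated in the home processor of particle i
logical function compute_here(i, j, N)
  implicit none
  integer, intent(in) :: i, j, N
  compute_here = (i /= j) .and. (modulo(j - i, N) <= (N - 1)/2)
end function compute_here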
Note that this "best" algorithm halves computation and "doubles" communication as one is transferring twice as much information (7 not 4 items for each j) |
In our description of "much better" and "best" algorithm, we assumed that one broadcasts each J block to each processor |
There are some different ways of setting this up which can be more efficient on some architectures
|
The data parallel part of foils in fact describes the natural pipeline algorithm which rotates J blocks through processors one step at a time |
This has the feature (different from previous explanation) that each processor is handling a different set of j's at a given stage in computation. |
Nancy McCracken, |
Geoffrey Fox |
NPAC |
Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
This uses the simple O(N²) Particle Dynamics Problem as a motivator to discuss solution of ordinary differential equations |
We discuss Euler, Runge Kutta and predictor corrector methods |
F90 and HPF Data parallel O(N²) algorithms are described with performance comments |
There is a related message parallel module sharing the same initial foils |
Positions and velocities are 3 × N arrays, X and V |
other variables
|
subroutine for numerical method will take these arguments and update X and V. |
A subroutine called MPGrav (message parallel) is assumed to compute new accelerations |
Computation of numerical method is inherently iterative: at each time step, the solution depends on the immediately preceding one. |
At each time step, MPGrav is called (several times as using Runge Kutta):
|
We will use 4th order Runge Kutta to integrate in time, and the program is designed as an overall routine looping over time with parallelism hidden in the MPGrav routines |
We analyse Message Parallel version and other foils discuss Data Parallel version |
A denotes MPGrav and Ac denotes MPGrav_travel |
The case where i is compared with i is not needed, as particles don't interact with themselves |
At step k, interact particle i with particle j= 1 + mod((i+k-1),N) |
Accumulate force on i due to j in fixed Ai |
Accumulate negative of this as force on j due to i in circulating Acj |
At the end of the algorithm, add Ai and Aci |
In the parallel version, note that Ai will be calculated in the "home" processor for particle i but Aci will travel around the machine, being accumulated in the processor holding particle j |
Thus this violates the owner computes rule and so this parallel algorithm must be implemented by hand -- the compiler will not find it automatically |
The first step is to circulate (shift) one position and calculate accelerations Fij and Fji in all index positions |
Shifting pipeline (N-1) times gives correct algorithm but does not save "Newton's factor of two". |
Just need (N-1)/2 steps when N is odd and N/2 steps when N is even which saves factor of two. |
Shift in blocks of J particles so that message passing is blocked for good performance |
Symmetry of force on particles: Fij = -Fji (Newton's Law of Action and Reaction!)
|
The parallel version has two issues -- firstly one cannot use the "owner-computes" rule directly, and secondly one must worry about load balancing |
Assuming, for example, that processors are assigned with a block distribution in the column direction.
|