Full HTML for

Scripted foilset N Body Problems using Data parallel Approach

Given by Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science on 1998 Enhancements. Foils prepared 22 February 1998
Outside Index Summary of Material


This uses the simple O(N2) Particle Dynamics Problem as a motivator to discuss solution of ordinary differential equations
We discuss Euler, Runge Kutta and predictor corrector methods
F90 and HPF Data parallel O(N2) algorithms are described with performance comments
There is a related message parallel module sharing the same initial foils

Table of Contents for full HTML of N Body Problems using Data parallel Approach

Denote Foils where Image Critical
Denote Foils where HTML is sufficient

1 CPS 615 -- Computational Science in
Simulation Track
Data Parallel Module on ODE's and Particle Dynamics
February 20, 1998

2 Abstract of Message Parallel ODE and Particle Dynamics
3 Particle (N-Body) Applications and Ordinary Differential Equations (ODE's)
4 Particle Applications - Ordinary Differential Equations (ODE's)
5 Particle Applications - the N-body problem
6 Newton's First Law -- The Gravitational Force on a Particle
7 Equations of Motion -- Newton's Second Law
8 Numerical techniques for solving ODE's
9 Second and Higher Order Equations
10 Basic Discretization of Single First Order Equation
11 Errors in numerical approximations
12 Runge-Kutta Methods: Euler's method
13 Estimate of Error in Euler's method
14 Relationship of Error to Computation
15 Example using Euler's method from the CSEP book
16 Approximate solutions at t=1,using Euler's method with different values of h
17 Runge-Kutta Methods: Modified Euler's method
18 Approximate solutions of the ODE for et at t=1, using modified Euler's method with different values of h
19 The Classical Runge-Kutta -- In Words
20 The Classical Runge-Kutta -- Formally
21 The Classical Runge-Kutta Pictorially
22 Predictor / Corrector Methods
23 Definition of Multi-step methods
24 Features of Multi-Step Methods
25 Comparison of Explicit and Implicit Methods
26 Solving the N-body equations of motion
27 Representing the N-Body problem
28 Form of the Computation -- Data v. Message Parallel
29 Summary of Parallel N-Body Programming Methods and Algorithms
30 Status of Parallelism in Various N Body Cases
31 Other N-Body Like Problems - I
32 Other N-Body Like Problems - II
33 N-body Runge Kutta Routine in Fortran90 - I
34 Runge Kutta Routine in Fortran90 - II
35 Computation of accelerations - a simple parallel array algorithm
36 Simple Data Parallel Version of N Body Force Computation -- Grav -- I
37 The Grav Function in Data Parallel Algorithm - II
38 Some Inefficiencies of the Data Parallel N2 Algorithm - I
39 Some Inefficiencies of the Data Parallel N2 Algorithm - II
40 Better Data Parallel Pipeline Algorithm for Computation of Accelerations,
taking 1/2 the iterations of force computation - I

41 Data Parallel Pipeline Algorithm in detail
42 Basic Data Parallel pipeline operation
43 Examples of Data Parallel Pipeline Algorithm
44 Data Parallel Pipeline Algorithm Grav -- Part I
45 Data Parallel Pipeline Algorithm for Grav -- Part II
46 Data Parallel Grav Pipeline Algorithm, concluded
47 Data Parallel Parallel Decomposition
48 Data Parallel Parallel Execution Time -I
49 Data Parallel Parallel Execution Time -II
50 N-body Problem is a one dimensional Algorithm
51 Excerpts from an HPF program for this algorithm
52 HPF program excerpts - II
53 HPF program excerpts - finished
54 Notes and References

Outside Index Summary of Material



HTML version of Scripted Foils prepared 22 February 1998

Foil 1 CPS 615 -- Computational Science in
Simulation Track
Data Parallel Module on ODE's and Particle Dynamics
February 20, 1998

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Nancy McCracken,
Geoffrey Fox
NPAC
Syracuse University
111 College Place
Syracuse NY 13244-4100

HTML version of Scripted Foils prepared 22 February 1998

Foil 2 Abstract of Message Parallel ODE and Particle Dynamics

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
This uses the simple O(N2) Particle Dynamics Problem as a motivator to discuss solution of ordinary differential equations
We discuss Euler, Runge Kutta and predictor corrector methods
F90 and HPF Data parallel O(N2) algorithms are described with performance comments
There is a related message parallel module sharing the same initial foils

HTML version of Scripted Foils prepared 22 February 1998

Foil 3 Particle (N-Body) Applications and Ordinary Differential Equations (ODE's)

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index

HTML version of Scripted Foils prepared 22 February 1998

Foil 4 Particle Applications - Ordinary Differential Equations (ODE's)

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Consider Models of physical systems represented as sets of particles rather than densities (fields) evolving over time
Examples:
  • the solar system
  • an electrical or electonic network
  • a globular star cluster
  • the atoms in a crystal vibrating under interatomic forces
  • the molecule rotating and flexing under interatomic force
Laws of Motion are typically ordinary differential Equations
Ordinary means differentiate wrt One Variable -- typically time

HTML version of Scripted Foils prepared 22 February 1998

Foil 5 Particle Applications - the N-body problem

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
N particles, each with a mass mi , moving with velocity Vi through 3-dimensional space
Are governed by Newton's equations of motion
Basic Kinematics

HTML version of Scripted Foils prepared 22 February 1998

Foil 6 Newton's First Law -- The Gravitational Force on a Particle

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index

HTML version of Scripted Foils prepared 22 February 1998

Foil 7 Equations of Motion -- Newton's Second Law

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Newton's Second Law
Incorporate laws into equations of motion
Example of force law for molecular dynamics

HTML version of Scripted Foils prepared 22 February 1998

Foil 8 Numerical techniques for solving ODE's

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
ODE's give an equation for the derivative of X with respect to time t
They can be classified (if second order) by the boundary conditions used
Initial value problems
  • Give initial value Xo for X and first derivative X'o
  • of X, X', at A
Boundary value
  • Give endpoint
  • values X o X1
  • for X
  • at A and B

HTML version of Scripted Foils prepared 22 February 1998

Foil 9 Second and Higher Order Equations

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Second order equations such as
can always be rewritten as a system of 2 first-order equations involving X and a new variable Y representing the first order derivative:

HTML version of Scripted Foils prepared 22 February 1998

Foil 10 Basic Discretization of Single First Order Equation

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
For simplicity, we assume just one first-order equation, where f is the function on right hand side which depends on X and t
We can always solve this by setting up a grid of equidistant points with grid size h=(B-A)/n where n is an integer.
Starting from the initial value, we calculate positions one step at a time

HTML version of Scripted Foils prepared 22 February 1998

Foil 11 Errors in numerical approximations

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Two sources of error:
Computational error includes such things as roundoff error, etc. and is generally controlled by having enough significant digits in the computer arithmetic
Discretization error is the accuracy of the numerical method and has two measures:
  • Global error - difference between solution and numerical approximation at any point. Difficult or expensive to estimate.
  • Local error - difference between solution and numerical approximation over only one time step. Easier to calculate and indicates global error indirectly.

HTML version of Scripted Foils prepared 22 February 1998

Foil 12 Runge-Kutta Methods: Euler's method

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Euler's method is not practical, but illustrates the technique.
It involves Linear approximation to get next point

HTML version of Scripted Foils prepared 22 February 1998

Foil 13 Estimate of Error in Euler's method

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Use Taylor's theorem to represent Y the exact solution:

HTML version of Scripted Foils prepared 22 February 1998

Foil 14 Relationship of Error to Computation

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Whenever f satisfies certain smoothness conditions, there is always a sufficiently small step size h such that the difference between the real function value at ti and the approximation Xi+1 is less than some required error magnitude e. [Burden and Faires]
Euler's method: one computation of the derivative function f at each step.
Other methods require less computation in order to produce the specified error e.

HTML version of Scripted Foils prepared 22 February 1998

Foil 15 Example using Euler's method from the CSEP book

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Initial Value Problem with known analytical solution:
The approximation with Euler's method and h=0.25:
Calculate a few values of the approximation:

HTML version of Scripted Foils prepared 22 February 1998

Foil 16 Approximate solutions at t=1,using Euler's method with different values of h

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Note it will take about one million iterations to get an error of order O(10-6)
The last column shows global error is of order O(h) as expected

HTML version of Scripted Foils prepared 22 February 1998

Foil 17 Runge-Kutta Methods: Modified Euler's method

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Use the derivative at one time step to extrapolate the midpoint value - use midpoint derivative to extrapolate the function value at the next time step
Evaluates the derivative function twice at each time step. Global error - O(h2), second order method
Sometimes called the midpoint method

HTML version of Scripted Foils prepared 22 February 1998

Foil 18 Approximate solutions of the ODE for et at t=1, using modified Euler's method with different values of h

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Note global error is now O(h2) and we get an error of O(10-5) after 128 iterations which would take about 1000 more iterations for Euler's method to achieve
So Euler has roughly half the computational effort per iteration but requires the square of the number of iterations

HTML version of Scripted Foils prepared 22 February 1998

Foil 19 The Classical Runge-Kutta -- In Words

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Runge Kutta methods achieve better results than Euler by using intermediate computations at intermediate time values
The fourth-order rule is the favorite method as it achieves good accuracy with modest computational complexity -- the algorithm is in words:
Use derivative of first time step to get trial midpoint
Use its derivative at first time step to get second trial midpoint
Use its derivative to get a trial end point
Integrate by Simpon's Rule, using average of two midpoint estimates
Global error is fourth order

HTML version of Scripted Foils prepared 22 February 1998

Foil 20 The Classical Runge-Kutta -- Formally

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index

HTML version of Scripted Foils prepared 22 February 1998

Foil 21 The Classical Runge-Kutta Pictorially

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Compared with Euler, Runge-Kutta has 4 times more calculation per time step, but should use fourth root as many time steps

HTML version of Scripted Foils prepared 22 February 1998

Foil 22 Predictor / Corrector Methods

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
First, predict Xi+1 using an explicit equation, with O(hn) error, and known values.
Then correct this value by using it in an implicit equation, with O(hn+1) error.
Simple example:

HTML version of Scripted Foils prepared 22 February 1998

Foil 23 Definition of Multi-step methods

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
The predictor/corrector methods use previous values Xi-1, Xi-2, ...... to increase accuracy p - not extra values between Xi and Xi+1 as in Runge Kutta
General form of multi-step difference equation:
Xi+1 = am-1Xi + am-2 Xi-1 + ..... + a0Xi+1-m + h(bmf(ti+1, Xi+1) + bm-1 f (ti, Xi) + .....
  • + b0f(ti+1-m, Xi+1-m)))
if coefficient of f evaluation independent on ti+1 bm=0, this is an explicit equation

HTML version of Scripted Foils prepared 22 February 1998

Foil 24 Features of Multi-Step Methods

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Note these are essentially interpolation formulae for if one uses information from m t values, you can fit a degree m-1 polynomial
Taylor expansion and Polynomial fitting are essentially the same thing!
Implicit Multistep Methods are gotten using backwards difference interpolating polynomials starting at ti+1. But wherever we we need f(ti+1,y(ti+1)), we use f(ti+1, X*i+1) where X*i+1is derived from the explicit predictor equation
Note that implicit formulae should be best as explicit method involves extrapolation from ti to ti+1 whereas in implicit case ti+1 is endpoint of region in which interpolation done
Extrapolation is always unreliable and to be avoided!

HTML version of Scripted Foils prepared 22 February 1998

Foil 25 Comparison of Explicit and Implicit Methods

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Adams-Bashforth fourth-order explicit method (4 step)
Adams-Moulton fourth order method (3 step) is an implicit Multistep method
Note coefficient of implicit method is a factor of 251/19 smaller than explicit method of same order -- this is why extrapolation is not so good!

HTML version of Scripted Foils prepared 22 February 1998

Foil 26 Solving the N-body equations of motion

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Introduce 3 vectors X V A for position,velocity and acceleration of particles
Numerical techniques iterate equations over time using Runge Kutta (in our detailed example) or more simply:
  • Euler's Equation which gives:
  • X(t+dt) = X(t) + h * V(t)
  • V(t+dt) = V(t) + h * Grav(X(t))
Note i labels particles not time steps

HTML version of Scripted Foils prepared 22 February 1998

Foil 27 Representing the N-Body problem

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Positions and velocities are 3 N dimensional arrays, X and V
other variables
  • M - spread masses to another 3xN array
  • h - step size
  • ns - number of time steps
subroutine for numerical method will take these arguments and update X and V.
A subroutine called Grav (dataparallel) or MPGrav (message parallel) is assumed to compute new accelerations

HTML version of Scripted Foils prepared 22 February 1998

Foil 28 Form of the Computation -- Data v. Message Parallel

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Computation of numerical method is inherently iterative: at each time step, the solution depends on the immediately preceding one.
At each time step, Grav/MPGrav is called (several times as using Runge Kutta):
  • For each particle i, one must sum over the forces due to all other particles j.
  • Computation is O(N2) - potential for parallel computation in either i (Message Parallel) or j or both (Data parallel if needed).
We will use 4th order Runge Kutta to integrate in time and the program is designed an overall routine looping over time with parallelism hidden in Grav/MPGrav routines
We first analyse Data Parallel (starting with classic SIMD method) and then go through Message Parallel version

HTML version of Scripted Foils prepared 22 February 1998

Foil 29 Summary of Parallel N-Body Programming Methods and Algorithms

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
3 Parallel Programming Paradigms
  • Message Parallel (MIMD)
  • Data Parallel (HPF on MIMD and historically CMFortran on SIMD CM-2)
  • Shared Memory Parallel Fortran especially on distributed shared memory architectures such as Origin 2000
2 Important and very different Algorithms
  • Natural O(N2) which is slowest but needed in cases where objects are often very close as in dynamics of globular clusters mentioned on previous foil
  • Fast Multipole methods O(N) or O(N logN) which are getting increasing use

HTML version of Scripted Foils prepared 22 February 1998

Foil 30 Status of Parallelism in Various N Body Cases

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Data Parallel approach is really only useful for the simple O(N2) case and even here it is quite tricky to express algorithm so that it is
  • both space efficient and
  • captures factor of 2 savings from Newton's law of action and reaction
  • Fij = - Fji
  • We have discussed these issues in previous foils
The shared memory approach is effective for a modest number of processors in both algorithms.
  • It is only likely to scale properly for O(N2) case as the compiler will find it hard to capture clever ideas needed to make fast multipole efficient
Message Parallel approach gives you very efficient algorithms in both cases
  • O(N2) case has very low communication
  • O(NlogN) has quite low communication

HTML version of Scripted Foils prepared 22 February 1998

Foil 31 Other N-Body Like Problems - I

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
The characteristic structure of N-body problem is an observable that depends on all pairs of entities from a set of N entities.
This structure is seen in diverse applications:
1)Look at a database of items and calculate some form of correlation between all pairs of database entries
2)This was first used in studies of measurements of a "chaotic dynamical system" with points xi which are vectors of length m
Put rij = distance between xi and xj in m dimensional space
Then probability p(rij = r) is proportional to r(d-1)
  • where d (not equal to m) is dynamical dimension of system
  • calculate by forming all the rij (for i and j running over observable points from our system -- usually a time series) and accumulating in a histogram of bins in r
  • Parallel algorithm in a nutshell: Store histograms replicated in all processors, distribute vectors equally in each processor and just pipeline xj through processors and as they pass through accumulate rij ; add histograms together at end.

HTML version of Scripted Foils prepared 22 February 1998

Foil 32 Other N-Body Like Problems - II

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
3)Green's Function Approach to simple Partial Differential equations gives solutions as integrals of known Green's functions times "source" or "boundary" terms.
  • For the simulation of earthquakes in GEM project the source terms are strains in faults and the stresses in any fault segment are the integral over strains in all other segments
  • Compared to particle dynamics, Force law replaced by Green's function but in each case total stress/Force is sum over contributions associated with other entities in formulation
4)In the so called vortex method in CFD (Computational Fluid Dynamics) one models the Navier Stokes Equation as the long interactions between entities which are the vortices
5)Chemistry (see foil 7) uses molecular dynamics and so particles are molecules but force is not Newton's laws usually but rather Van der Waals forces which are long range but fall off faster than 1/r2

HTML version of Scripted Foils prepared 22 February 1998

Foil 33 N-body Runge Kutta Routine in Fortran90 - I

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
C Solves Newton's equations of motion using Runge-Kutta method
C which is globally 4th order. X and V are initial positions and
C velocities. The system is evolved over a time interval h*ns.
C X and V contain the updated state at that time.
  • real, dimension (1:3, 1:N) :: X,V, Xdelta1, Vdelta1, Xdelta2, Vdelta2, Xdelta3, Vdelta3, Xdelta4, Vdelta4
  • real h, h2
  • integer ns, k
  • INTERFACE
  • function Grav(X,M)
  • real array(1:3, 1:N) :: X, M , Grav
  • end function
  • end INTERFACE
C Grav is hard parallel algorithm and will be given later!

HTML version of Scripted Foils prepared 22 February 1998

Foil 34 Runge Kutta Routine in Fortran90 - II

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index

HTML version of Scripted Foils prepared 22 February 1998

Foil 35 Computation of accelerations - a simple parallel array algorithm

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Spread positions of particles into 2 3D arrays so that extra dimension is labelled by index in sum over particles that interact with a given particle
Xj is essentially transpose of Xi in second and third dimension

HTML version of Scripted Foils prepared 22 February 1998

Foil 36 Simple Data Parallel Version of N Body Force Computation -- Grav -- I

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
function Grav(X,M)
C accepts positions of particles X and masses of particles M
C returns accelerations in Grav
C Uses completely parallel calculation, ignoring anti-symmetry of force
  • integer, parameter :: N =number of particles
  • real, parameter :: G = gravitational constant
  • real array (1:3, 1:N) :: X,M,Grav
! calculates acceleration on body i due to body j in entries ( :,i,j )
  • real array (1:3, 1:N,1:N) :: A, Xi, Xj, Ms, D, R
  • logical, array (1:3, 1:N,1:N) ::diag
  • integer i,j,k
! diag is true for diagonal of N by N slices
  • forall ( k=1:3, i=1:N, j=1:N ) diag = (i.eq.j)

HTML version of Scripted Foils prepared 22 February 1998

Foil 37 The Grav Function in Data Parallel Algorithm - II

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
! set up arrays of particles Xi and particles Xj
  • Xi = spread (X, dim =3, ncopies = N )
  • Xj = spread (X, dim = 2, ncopies = N )
  • Ms = spread (M, dim=2, ncopies = N )
! displacements and Euclidean distance
  • D = Xj - Xi
  • R = sqrt (spread ( sum ( D*D, dim=1), dim =1, ncopies=3)))
! calculate accelerations for all pairs except on main diagonal
  • where (diag)
    • A=0.0
  • elsewhere
    • A=G*Ms*D / R**3
  • end where
  • Grav = sum (A, dim=3)
end function Grav

HTML version of Scripted Foils prepared 22 February 1998

Foil 38 Some Inefficiencies of the Data Parallel N2 Algorithm - I

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Symmetry of force on particles: Fij = -Fji (Newton's Law of Action and Reaction!)
  • only half need to be computed so should use triangular arrays
  • i.e. just do loops with sum over particles i, sum over particles j <= i and then calculate algebraic form of Fij and then accumulate
  • Force on i increments by Fij
  • Force on j decrements by Fij
There is a Load balancing problem with triangular arrays
Assuming for example that processors assigned with block distribution in column direction.
  • To calculate the force between 2 particles
  • will take N/Nproc iterations in the
  • longest running processor
  • i.e. you don't get factor of two back!

HTML version of Scripted Foils prepared 22 February 1998

Foil 39 Some Inefficiencies of the Data Parallel N2 Algorithm - II

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Also, all particle information is sent to all processors, taking O(N2) space whereas natural algorithms use O(N) space and this is how special purpose machines like GRAPE get their cost effectiveness
  • If N is one million as for globular cluster problem, there is a big difference between DRAM cost of 106 and 1012 units of memory
Space is further wasted as everything is spread to 3 dimensional arrays even when arrays like mass are naturally one dimensional!

HTML version of Scripted Foils prepared 22 February 1998

Foil 40 Better Data Parallel Pipeline Algorithm for Computation of Accelerations,
taking 1/2 the iterations of force computation - I

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Pair together the data for every particle Xi with the data for every particle Xj by iterating over a pipeline (circulating) array.

HTML version of Scripted Foils prepared 22 February 1998

Foil 41 Data Parallel Pipeline Algorithm in detail

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Case when i compared to i is not needed as particles dont interact with themselves
At step k, interact particle i with particle j= 1 + mod((i+k-1),N)
Accumulate force on i due to j in fixed Ai
Accumulate negative of this as force on j due to i in circulating Acj
At the end of the algorithm, add Ai and Aci
In parallel version, Note that Ai will be calculated in "home processor for particle i but Aci will travel around the machine being accumulated in processor holding particle j
Thus this violates the owner computes rule and so this parallel algorithm must be implemented by hand -- the compiler will not find it automatically

HTML version of Scripted Foils prepared 22 February 1998

Foil 42 Basic Data Parallel pipeline operation

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
First step is to Circulate(shift) one position and calculate acelerations Fij and Fji in all index positions
Shifting pipeline (N-1) times gives correct algorithm but does not save "Newton's factor of two".
Just need (N-1)/2 steps when N is odd and N/2 steps when N is even which saves factor of two.

HTML version of Scripted Foils prepared 22 February 1998

Foil 43 Examples of Data Parallel Pipeline Algorithm

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index

HTML version of Scripted Foils prepared 22 February 1998

Foil 44 Data Parallel Pipeline Algorithm Grav -- Part I

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
function Grav(X,M)
C accepts positions of particles X and masses of particles M
C returns accelerations in Grav
  • integer, parameter :: N =number of particles
  • real, parameter :: G = gravitational constant
  • real, dimension (1:3, 1:N) :: X,M,Grav
  • real, dimension (1:3,1:N) :: A, Xc, Mc, Ac, D, R
  • integer k
! A is fixed accelerations - X and M are used for fixed positions and masses
  • A=0.0
  • Ac=0.0 ! circulating accelerations
  • Xc=X ! circulating positions
  • M=G*M ! precalculate mass * gravitational constant
  • Mc=M ! circulating masses

HTML version of Scripted Foils prepared 22 February 1998

Foil 45 Data Parallel Pipeline Algorithm for Grav -- Part II

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
!Shift Circulating Arrays Xc Mc Ac to the right
  • Xc = cshift (Xc, dim=2, shift = -1)
  • Mc = cshift (Mc, dim=2, shift = -1)
  • Ac = cshift (Ac, dim=2, shift = -1)
! calculate R to be distance over 3-D cordinates
  • D= Xc-X
  • R = sqrt (spread (sum (D*D, dim=1), dim=1, ncopies=3)))
  • D = D/R**3
  • A = A + Mc*D ! fixed acceleration
  • Ac = Ac - M*D ! circulating acceleration
  • end do

HTML version of Scripted Foils prepared 22 February 1998

Foil 46 Data Parallel Grav Pipeline Algorithm, concluded

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
if (( N mod 2) = 0 ) then ! final one way acceleration if N even
  • Xc = cshift (Xc, dim=2, shift = -1)
  • Mc = cshift (Mc, dim=2, shift = -1)
  • Ac = cshift (Ac, dim=2, shift = -1)
  • D = Xc-X
  • R = sqrt (spread (sum (D*D, dim=1), dim=1, ncopies=3)))
  • D = D/R**3
  • A = A + Mc*D
end if
! combine accelerations for final result - circulating particle in i'th
! position corresponds to fixed particle (i-(N-1)/2)
Grav = A + cshift (Ac, dim=2, shift = (N-1)/2)
end function Grav

HTML version of Scripted Foils prepared 22 February 1998

Foil 47 Data Parallel Parallel Decomposition

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Distribute arrays in naive block fashion - if Nproc is the number of processors, each processor has N/Nproc particles.

HTML version of Scripted Foils prepared 22 February 1998

Foil 48 Data Parallel Parallel Execution Time -I

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Consider time for Runge Kutta invocation of function Grav
Shifting particles communicates one set of particle information - all processors communicate at the same time giving estimate:
9 * tcomm (factor should be 7 as need only 1 not 3 masses as we used in simple implementation earlier)
  • we are ignoring latency which in practice means best implementation transfers several (not N-1 as in naive data parallel algorithm) particles at a time
Floating point calculations: roughly 3(x,y,z) of -, *, sum, sqrt, exp, /, *, +, *, + which can be summarized as estimate: > 30 tfloat
Each communicated particle is interacted with the N/Nproc particles in the local partition of that processor and each step has one shift giving a total time for (N-1)/2 steps in Grav:

HTML version of Scripted Foils prepared 22 February 1998

Foil 49 Data Parallel Parallel Execution Time -II

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Then the total time for the Runge-Kutta solver is:
Giving O(N2/Nproc) running time in the number of particles.
Note that parallel overhead or communication time/computation time is proportional to 1/n where n = N/Nproc is grain size.
Note this algorithm has an overhead characteristic of a one dimensional problem for
  • overhead goes like (1/n)1/d in d dimensions -- which is edge over area -- giving 1/n for d=1

HTML version of Scripted Foils prepared 22 February 1998

Foil 50 N-body Problem is a one dimensional Algorithm

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
Not only is performance model characteristic of one dimensional problem but we also we used a one-dimension parallel decomposition
Note these one dimensional characteristics are independent of dimension of space that particles move in.
Normally d is thought of as dimension of physical space in which problem posed.
But the geometrical interpretation of d is only valid where interaction between particles is itself geometrical i.e. short range
Rather N body algorithm has no geometric structure and you see d=1 characteristic of algorithm
See Chapter 3 of Parallel Computing Works for a longer discussion of this from a "Complex Systems" point of view
Note the simple N-body problem has some interesting features
  • Very low communication/compute ratios
  • Best parallel methods violate owner computes rule -- also true in recent molecular dynamics problems with shorter range force
  • Unexpected dimension

HTML version of Scripted Foils prepared 22 February 1998

Foil 51 Excerpts from an HPF program for this algorithm

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
! declarations of global variables
module nbodyvars
real, dimension ( : , : ), allocatable :: X, V, M
integer NB ! number of particles
real G
end module
! allocate arrays
subroutine setup ( )
use nbodyvars
open (10, file=fnm, status="OLD")
read (10, *) NB
allocate (X (3, NB))
allocate (V (3, NB))
. . .
end subroutine

HTML version of Scripted Foils prepared 22 February 1998

Foil 52 HPF program excerpts - II

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
subroutine runga-kutte(h, ns)
use nbodyvars
real h; integer ns
INTERFACE
  • subroutine Grav(X, M, A)
  • use nbodyvars
  • real, dimension ( :, : ), intent(in) :: X, M
  • real, dimension ( :, : ), intent(out) :: A
end interface
. . .
do k = 1, ns
  • Xdelta1 = V
  • call Grav(X, M, A)
  • Vdelta1 = A
  • . . . .
end subroutine

HTML version of Scripted Foils prepared 22 February 1998

Foil 53 HPF program excerpts - finished

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
subroutine Grav(X, M, A)
  • use nbodyvars
  • real, dimension ( :, : ), intent(in) :: X, M
  • real, dimension ( :, : ), intent(out) :: A
  • real, dimension ( 3, NB ) :: Xdelta1, Vdelta1, . . .
!HPF ALIGN Xdelta1, Vdelta1 . . . WITH X
. . .
end subroutine
! main program
program nbody
use nbodyvars
. . .
call setup ( )
!HPF DISTRIBUTE X ( :, BLOCK)
!HPF ALIGN V, M WITH X
. . .
do k = 1, np
call runge-kutta(timestep, ns)
call print_state ()
end do
end program

HTML version of Scripted Foils prepared 22 February 1998

Foil 54 Notes and References

From Fox Presentation Fall 1995 CPS615 Basic Simulation Track for Computational Science -- 1998 Enhancements. *
Full HTML Index
These O(N2) techniques successful on astrophysical problems of size a few thousand particles. Larger problems, such as on the scale of galaxies, do not calculate all pairs of particle interactions but use "fast multipole" and estimate force for areas of distant particles. The data structure for this technique is a Barnes-Hut tree.
Burden, Richard L. and Faires, J. Douglas, Numerical Analysis. Fourth edition, PWS-Kent Publishing Company, 1989. This is basic ODE reference
There is also ODE chapter from the CSEP book, http://www.npac.syr.edu/projects/csep/ode/ode.html
Chapter 9 of Solving Problems on Concurrent Processors, Volume I does parallel O(N2) message parallel
Salmon, John K. Parallel Hierarchial N-body Methods, dissertation from Caltech, technical report SCCS-52, CRPC-90-14, 1990 is original practical fast multipole.

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu

If you have any comments about this server, send e-mail to webmaster@npac.syr.edu.

Page produced by wwwfoil on Sun Feb 22 1998