Given by Nancy McCracken and Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science on Fall Semester 95/96/97. Foils prepared 14 October 1997
Nancy McCracken, |
Geoffrey Fox |
NPAC |
Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
This uses the simple O(N²) Particle Dynamics Problem as a motivator to discuss solution of ordinary differential equations |
We discuss Euler, Runge Kutta and predictor corrector methods |
The simple data parallel O(N²) algorithm is given in Fortran90 and HPF |
The better Pipeline version is also given |
We analyse Performance |
Consider models of physical systems represented as sets of particles, rather than densities (fields), evolving over time |
Examples:
|
Laws of motion are typically ordinary differential equations |
Ordinary means differentiation with respect to one variable -- typically time |
N particles, each with a mass m_i, moving with velocity V_i through 3-dimensional space |
are governed by Newton's equations of motion |
Basic Kinematics |
Newton's Second Law |
Incorporate laws into equations of motion |
Example of force law for molecular dynamics |
ODE's give an equation for the derivative of X with respect to time t |
They can be classified (if second order) by the boundary conditions used |
Initial value problems
|
Boundary value problems
|
Second order equations such as |
can always be rewritten as a system of 2 first-order equations involving X and a new variable Y representing the first order derivative: |
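For example, a generic second-order equation can be reduced as follows (an illustration of the rewriting just described, with g standing for the given right hand side):

\frac{d^2X}{dt^2} = g\!\left(t, X, \frac{dX}{dt}\right)
\quad\Longrightarrow\quad
Y = \frac{dX}{dt}, \qquad \frac{dX}{dt} = Y, \qquad \frac{dY}{dt} = g(t, X, Y)

so the pair (X, Y) satisfies a first-order system to which the methods below apply directly.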
For simplicity, we assume just one first-order equation, where f is the function on right hand side which depends on X and t |
We can always solve this by setting up a grid of equidistant points with grid size h=(B-A)/n where n is an integer. |
Starting from the initial value, we calculate positions one step at a time |
Two sources of error: |
Computational error includes such things as roundoff error, etc. and is generally controlled by having enough significant digits in the computer arithmetic |
Discretization error is the accuracy of the numerical method and has two measures:
|
Euler's method is not practical, but illustrates the technique. |
It involves a linear approximation to get the next point |
Use Taylor's theorem to represent the exact solution Y: |
Whenever f satisfies certain smoothness conditions, there is always a sufficiently small step size h such that the difference between the real function value at ti and the approximation Xi+1 is less than some required error magnitude e. [Burden and Faires] |
Euler's method: one computation of the derivative function f at each step. |
Other methods require less computation in order to produce the specified error e. |
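A minimal Fortran 90 sketch of Euler's method, using an illustrative equation dX/dt = -X with X(0) = 1 (this example equation and all names here are ours, not taken from the foils):

program euler_demo
  implicit none
  real :: t, X, h
  integer :: i, n
  real, parameter :: A = 0.0, B = 2.0   ! interval [A,B]
  n = 8                                 ! number of grid points
  h = (B - A)/n                         ! grid size as defined above
  t = A
  X = 1.0                               ! initial value
  do i = 1, n
     X = X + h*f(t, X)                  ! X_{i+1} = X_i + h f(t_i, X_i)
     t = t + h
  end do
  print *, 'Euler approximation at t =', t, ' is ', X, ' exact ', exp(-B)
contains
  real function f(t, X)
    real, intent(in) :: t, X
    f = -X                              ! right hand side of the ODE
  end function f
end program euler_demo

Halving h should roughly halve the final error, the O(h) behaviour noted in the table below.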
Initial Value Problem with known analytical solution: |
The approximation with Euler's method and h=0.25: |
Calculate a few values of the approximation: |
Note it will take about one million iterations to get an error of order O(10⁻⁶) |
The last column shows global error is of order O(h) as expected |
Use the derivative at one time step to extrapolate the midpoint value - use midpoint derivative to extrapolate the function value at the next time step |
Evaluates the derivative function twice at each time step. Global error - O(h²), second order method |
Sometimes called the midpoint method |
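In symbols, the midpoint rule described above is

X_{i+1} = X_i + h\, f\!\left(t_i + \frac{h}{2},\; X_i + \frac{h}{2} f(t_i, X_i)\right)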
Note the global error is now O(h²) and we get an error of O(10⁻⁵) after 128 iterations, which would take about 1000 times more iterations for Euler's method to achieve |
So Euler has roughly half the computational effort per iteration but requires the square of the number of iterations |
Runge Kutta methods achieve better results than Euler by using intermediate computations at intermediate time values |
The fourth-order rule is the favorite method as it achieves good accuracy with modest computational complexity -- the algorithm is in words: |
Use the derivative at the start of the time step to get a trial midpoint |
Use the derivative at that trial midpoint, applied from the start of the time step, to get a second trial midpoint |
Use the derivative at the second trial midpoint to get a trial end point |
Integrate by Simpson's Rule, using the average of the two midpoint estimates |
Global error is fourth order |
Compared with Euler, Runge-Kutta has 4 times more calculation per time step, but for a given accuracy needs only roughly the fourth root of the number of time steps that Euler would need |
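For reference, the classical fourth-order Runge-Kutta rule that these steps describe is

k_1 = f(t_i, X_i)
k_2 = f(t_i + h/2,\; X_i + (h/2)\,k_1)
k_3 = f(t_i + h/2,\; X_i + (h/2)\,k_2)
k_4 = f(t_i + h,\; X_i + h\,k_3)
X_{i+1} = X_i + \frac{h}{6}\,(k_1 + 2k_2 + 2k_3 + k_4)

where k_2 and k_3 are the two midpoint estimates combined with the Simpson weights 1:2:2:1.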
First, predict X_{i+1} using an explicit equation, with O(h^n) error, and known values. |
Then correct this value by using it in an implicit equation, with O(h^(n+1)) error. |
Simple example: |
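A standard simple instance of this pattern (the case n = 1) is the Euler predictor followed by a trapezoidal corrector:

X^{*}_{i+1} = X_i + h\,f(t_i, X_i) \qquad \text{(explicit predictor, global error } O(h)\text{)}
X_{i+1} = X_i + \frac{h}{2}\left[\, f(t_i, X_i) + f(t_{i+1}, X^{*}_{i+1}) \,\right] \qquad \text{(corrector from the implicit trapezoidal rule, global error } O(h^2)\text{)}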
The predictor/corrector methods use previous values X_{i-1}, X_{i-2}, ...... to increase the order of accuracy -- not extra values between X_i and X_{i+1} as in Runge Kutta |
General form of multi-step difference equation: |
X_{i+1} = a_{m-1} X_i + a_{m-2} X_{i-1} + ... + a_0 X_{i+1-m} + h [ b_m f(t_{i+1}, X_{i+1}) + b_{m-1} f(t_i, X_i) + ... + b_0 f(t_{i+1-m}, X_{i+1-m}) ] |
If the coefficient b_m of the f evaluation at t_{i+1} is zero, this is an explicit equation |
Note these are essentially interpolation formulae: if one uses information from m values of t, one can fit a polynomial of degree m-1 |
Taylor expansion and Polynomial fitting are essentially the same thing! |
Implicit Multistep Methods are obtained using backwards difference interpolating polynomials starting at t_{i+1}. But wherever we need f(t_{i+1}, y(t_{i+1})), we use f(t_{i+1}, X*_{i+1}), where X*_{i+1} is derived from the explicit predictor equation |
Note that implicit formulae should be best, as the explicit method involves extrapolation from t_i to t_{i+1}, whereas in the implicit case t_{i+1} is an endpoint of the region in which the interpolation is done |
Extrapolation is always unreliable and to be avoided! |
Adams-Bashforth fourth-order explicit method (4 step) |
Adams-Moulton fourth order method (3 step) is an implicit Multistep method |
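For reference, the standard forms of these two rules, with f_k = f(t_k, X_k) and the local truncation error terms that give the 251/19 ratio quoted below, are

\text{Adams-Bashforth (explicit, 4 step):}\quad X_{i+1} = X_i + \frac{h}{24}\,(55 f_i - 59 f_{i-1} + 37 f_{i-2} - 9 f_{i-3}), \quad \text{error } \tfrac{251}{720} h^5 y^{(5)}(\xi)
\text{Adams-Moulton (implicit, 3 step):}\quad X_{i+1} = X_i + \frac{h}{24}\,(9 f_{i+1} + 19 f_i - 5 f_{i-1} + f_{i-2}), \quad \text{error } -\tfrac{19}{720} h^5 y^{(5)}(\xi)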
Note the error coefficient of the implicit method is a factor of 251/19 smaller than that of the explicit method of the same order -- this is why extrapolation is not so good! |
Introduce 3 vectors X, V, A for the position, velocity and acceleration of the particles |
Numerical techniques iterate equations over time using Runge Kutta (in our detailed example) or more simply:
|
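The simplest such update is an explicit Euler step for every particle (a sketch of the "more simply" option; the detailed example below uses Runge-Kutta instead):

X_i(t+h) = X_i(t) + h\,V_i(t)
V_i(t+h) = V_i(t) + h\,A_i(t), \qquad A_i = F_i(X_1, \dots, X_N)/m_i \ \text{from the force law}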
Note i labels particles not time steps |
Positions and velocities are 3 × N arrays, X and V |
other variables
|
subroutine for numerical method will take these arguments and update X and V. |
A subroutine called Grav (dataparallel) or MPGrav (message parallel) is assumed to compute new accelerations |
Computation of numerical method is inherently iterative: at each time step, the solution depends on the immediately preceding one. |
At each time step, Grav/MPGrav is called (several times as using Runge Kutta):
|
We will use 4th order Runge Kutta to integrate in time, and the program is designed as an overall routine looping over time with parallelism hidden in the Grav/MPGrav routines |
We first analyse Data Parallel (starting with classic SIMD method) and then go through Message Parallel version |
C Solves Newton's equations of motion using Runge-Kutta method |
C which is globally 4th order. X and V are initial positions and |
C velocities. The system is evolved over a time interval h*ns. |
C X and V contain the updated state at that time.
|
C Grav is hard parallel algorithm and will be given later! |
Spread the positions of the particles into two 3D arrays so that the extra dimension is labelled by the index in the sum over particles that interact with a given particle |
Xj is essentially transpose of Xi in second and third dimension |
function Grav(X,M) |
C accepts positions of particles X and masses of particles M |
C returns accelerations in Grav |
C Uses completely parallel calculation, ignoring anti-symmetry of force
|
! calculates acceleration on body i due to body j in entries ( :,i,j )
|
! diag is true for diagonal of N by N slices
|
! set up arrays of particles Xi and particles Xj
|
! displacements and Euclidean distance
|
! calculate accelerations for all pairs except on main diagonal
|
end function Grav |
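The foil gives only the comments of the routine; the following Fortran 90 sketch is our reconstruction of what such a fully data parallel Grav could look like (the array names Xi, Xj, D, diag follow the comments, the gravitational constant is simply set to 1, and the result is built in a local array A rather than in Grav itself):

function Grav(X, M) result(A)
  ! X(3,N) positions, M(N) masses; returns accelerations A(3,N)
  ! Completely parallel calculation, ignoring the antisymmetry of the force
  real, dimension(:,:), intent(in) :: X
  real, dimension(:),   intent(in) :: M
  real, dimension(size(X,1), size(X,2)) :: A
  real, dimension(size(X,1), size(X,2), size(X,2)) :: Xi, Xj, D
  real, dimension(size(X,2), size(X,2)) :: R3, Mj
  logical, dimension(size(X,2), size(X,2)) :: diag
  real, parameter :: G = 1.0            ! units chosen so G = 1 (assumption)
  integer :: N, k

  N = size(X, 2)
  ! diag is true for the diagonal of the N by N slices
  diag = .false.
  do k = 1, N
     diag(k, k) = .true.
  end do
  ! set up arrays of particles Xi and particles Xj: entry (:,i,j) pairs i with j
  Xi = spread(X, dim=3, ncopies=N)      ! Xi(:,i,j) = X(:,i)
  Xj = spread(X, dim=2, ncopies=N)      ! Xj(:,i,j) = X(:,j)
  Mj = spread(M, dim=1, ncopies=N)      ! Mj(i,j)   = M(j)
  ! displacements and Euclidean distance (cubed)
  D  = Xj - Xi
  R3 = sqrt(sum(D*D, dim=1))**3
  where (diag) R3 = 1.0                 ! avoid dividing by zero on the diagonal
  where (diag) Mj = 0.0                 ! so the diagonal contributes nothing
  ! calculate accelerations for all pairs except the main diagonal and sum over j
  do k = 1, 3
     A(k, :) = G * sum(Mj * D(k, :, :) / R3, dim=2)
  end do
end function Grav

With HPF directives distributing the last dimension of the three-dimensional arrays, every operation here is fully data parallel, at the O(N²) memory cost discussed below.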
Symmetry of force on particles: Fij = -Fji (Newton's Law of Action and Reaction!)
|
There is a Load balancing problem with triangular arrays |
Assuming, for example, that processors are assigned with a block distribution in the column direction.
|
Also, all particle information is sent to all processors, taking O(N²) space, whereas natural algorithms use O(N) space and this is how special purpose machines like GRAPE get their cost effectiveness
|
Space is further wasted as everything is spread to 3 dimensional arrays even when arrays like mass are naturally one dimensional! |
Pair together the data for every particle Xi with the data for every particle Xj by iterating over a pipeline (circulating) array. |
The case where i is compared with i is not needed, as particles don't interact with themselves |
At step k, interact particle i with particle j= 1 + mod((i+k-1),N) |
Accumulate force on i due to j in fixed Ai |
Accumulate negative of this as force on j due to i in circulating Acj |
At the end of the algorithm, add Ai and Aci |
In the parallel version, note that Ai will be calculated in the "home" processor for particle i but Aci will travel around the machine, being accumulated in the processor holding particle j |
Thus this violates the owner computes rule and so this parallel algorithm must be implemented by hand -- the compiler will not find it automatically |
The first step is to circulate (shift) one position and calculate accelerations Fij and Fji in all index positions |
Shifting pipeline (N-1) times gives correct algorithm but does not save "Newton's factor of two". |
Just need (N-1)/2 steps when N is odd and N/2 steps when N is even which saves factor of two. |
function Grav(X,M) |
C accepts positions of particles X and masses of particles M |
C returns accelerations in Grav
|
! A is fixed accelerations - X and M are used for fixed positions and masses
|
!Shift Circulating Arrays Xc Mc Ac to the right
|
! calculate R to be distance over 3-D cordinates
|
if ( mod(N, 2) == 0 ) then ! final one way acceleration if N even
|
end if |
! combine accelerations for final result - circulating particle in i'th |
! position corresponds to fixed particle (i-(N-1)/2) |
Grav = A + cshift (Ac, dim=2, shift = (N-1)/2) |
end function Grav |
Distribute arrays in naive block fashion - if Nproc is the number of processors, each processor has N/Nproc particles. |
Consider time for Runge Kutta invocation of function Grav |
Shifting particles communicates one set of particle information - all processors communicate at the same time giving estimate: |
9 * t_comm (the factor should be 7, as we need only 1, not 3, copies of the mass as used in the simple implementation earlier)
|
Floating point calculations: roughly 3(x,y,z) of -, *, sum, sqrt, exp, /, *, +, *, + which can be summarized as estimate: > 30 tfloat |
Each communicated particle is interacted with the N/Nproc particles in the local partition of that processor and each step has one shift giving a total time for (N-1)/2 steps in Grav: |
Then the total time for the Runge-Kutta solver is: |
Giving O(N²/Nproc) running time in the number of particles. |
Note that parallel overhead or communication time/computation time is proportional to 1/n where n = N/Nproc is grain size. |
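Putting the counts quoted above together gives, per processor (a reconstruction of the estimate the foil refers to, using the 9 t_comm per shift and 30 t_float per pair interaction from the previous foil):

T_{Grav} \approx \frac{N-1}{2}\left[\, 9\, t_{comm} + \frac{N}{N_{proc}}\, 30\, t_{float} \right], \qquad
T_{Runge\text{-}Kutta} \approx 4\, T_{Grav} \ \text{per time step}
\frac{\text{communication}}{\text{computation}} \approx \frac{9\, t_{comm}}{30\, n\, t_{float}} \;\propto\; \frac{1}{n}, \qquad n = \frac{N}{N_{proc}}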
Note this algorithm has an overhead characteristic of a one dimensional problem for
|
Not only is the performance model characteristic of a one dimensional problem, but we also used a one-dimensional parallel decomposition |
Note these one dimensional characteristics are independent of dimension of space that particles move in. |
Normally d is thought of as dimension of physical space in which problem posed. |
But the geometrical interpretation of d is only valid where interaction between particles is itself geometrical i.e. short range |
Rather, the N-body algorithm has no geometric structure, and you see the d=1 characteristic of the algorithm |
See Chapter 3 of Parallel Computing Works for a longer discussion of this from a "Complex Systems" point of view |
Note the simple N-body problem has some interesting features
|
! declarations of global variables |
module nbodyvars |
real, dimension ( : , : ), allocatable :: X, V, M |
integer NB ! number of particles |
real G |
end module |
! allocate arrays |
subroutine setup ( ) |
use nbodyvars |
open (10, file=fnm, status="OLD") |
read (10, *) NB |
allocate (X (3, NB)) |
allocate (V (3, NB)) |
. . . |
end subroutine |
subroutine runge_kutta(h, ns) |
use nbodyvars |
real h; integer ns |
INTERFACE
|
end interface |
. . . |
do k = 1, ns
. . . |
end do |
end subroutine |
subroutine Grav(X, M, A)
|
!HPF$ ALIGN Xdelta1, Vdelta1 . . . WITH X |
. . . |
end subroutine |
! main program |
program nbody |
use nbodyvars |
. . . |
call setup ( ) |
!HPF$ DISTRIBUTE X ( :, BLOCK) |
!HPF$ ALIGN V, M WITH X |
. . . |
do k = 1, np |
call runge_kutta(timestep, ns) |
call print_state () |
end do |
end program |
These O(N²) techniques are successful on astrophysical problems of size a few thousand particles. Larger problems, such as those on the scale of galaxies, do not calculate all pairs of particle interactions but use "fast multipole" methods that estimate the force from regions of distant particles. The data structure for this technique is a Barnes-Hut tree. |
Burden, Richard L. and Faires, J. Douglas, Numerical Analysis, fourth edition, PWS-Kent Publishing Company, 1989. This is the basic ODE reference |
There is also the ODE chapter from the CSEP book, http://www.npac.syr.edu/projects/csep/ode/ode.html |
Chapter 9 of Solving Problems on Concurrent Processors, Volume I covers the message parallel O(N²) algorithm |
Salmon, John K., Parallel Hierarchical N-body Methods, dissertation, Caltech, technical report SCCS-52, CRPC-90-14, 1990, is the original practical fast multipole reference |
Nancy McCracken, |
Geoffrey Fox |
NPAC |
Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
This uses the simple O(N²) Particle Dynamics Problem as a motivator to discuss solution of ordinary differential equations |
We discuss Euler, Runge Kutta and predictor corrector methods |
Various Message parallel O(N²) algorithms are described with performance comments |
There is a related data parallel module sharing the same initial foils |
3 Parallel Programming Paradigms
|
2 Important and very different Algorithms
|
Data Parallel approach is really only useful for the simple O(N²) case and even here it is quite tricky to express algorithm so that it is
|
The shared memory approach is effective for a modest number of processors in both algorithms.
|
Message Parallel approach gives you very efficient algorithms in both cases
|
The characteristic structure of N-body problem is an observable that depends on all pairs of entities from a set of N entities. |
This structure is seen in diverse applications: |
1)Look at a database of items and calculate some form of correlation between all pairs of database entries |
2)This was first used in studies of measurements of a "chaotic dynamical system" with points xi which are vectors of length m |
Put r_ij = distance between x_i and x_j in m dimensional space |
Then probability p(r_ij = r) is proportional to r^(d-1)
|
3)Green's Function Approach to simple Partial Differential equations gives solutions as integrals of known Green's functions times "source" or "boundary" terms.
|
4)In the so called vortex method in CFD (Computational Fluid Dynamics) one models the Navier Stokes Equation as the long range interactions between entities which are the vortices |
5)Chemistry (see foil 7) uses molecular dynamics, so the particles are molecules, but the force law is usually not Newton's gravitation but rather Van der Waals forces, which are long range but fall off faster than 1/r² |
Let MPGrav(i) return the acceleration of i'th particle which is specified by position X(i) and velocity V(i) |
The kernel of algorithm increments X(i),V(i) from t to t+h using Runge-Kutta method. |
This involves 4 function calls to MPGrav(i) for the four different choices of position and time needed in the Runge-Kutta method. |
Let Xuse(i) be position vector used in each function call. Then we have |
(time, Xuse) = (t, X), (t + h/2, X + (h/2) Dxa), (t + h/2, X + (h/2) Dxb), (t + h, X + h Dxc) |
where Dxa Dxb Dxc are shift vectors calculated by previous phase of Runge-Kutta method |
Note that MPGrav(i) result depends on the array Xuse and fixed parameters such as mass
|
The calculation involves an iteration over desired number of time steps increasing t by h each time. |
Each time step involves 9 phases -- at each phase, one loops over all N particles (or rather over all N/Nproc particles stored in a given processor) |
Some phases (4 of them) involve communication and computation; the others "just" SPMD computation. |
The number of phases depends on ODE strategy used -- with the simple Euler method, there are 1 or 2 phases, depending on how one counts. |
1)XdeltaA(i) = V(i) |
Xuse(i) = X(i) running over all i in Processor |
2)VdeltaA(i) = MPGrav(i) involves communication |
3)XdeltaB(i) = V(i) + h*VdeltaA(i)/2 |
Xuse(i) = X(i) +h*XdeltaA(i)/2 |
4)VdeltaB(i) = MPGrav(i) involves communication |
5)XdeltaC(i) = V(i) + h*VdeltaB(i)/2 |
Xuse(i) = X(i) +h*XdeltaB(i)/2 |
6)VdeltaC(i) = MPGrav(i) involves communication |
7)XdeltaD(i) = V(i) + h*VdeltaC(i) |
Xuse(i) = X(i) +h*XdeltaC(i) |
8)VdeltaD(i) = MPGrav(i) involves communication |
9)X(i) becomes (X(i) + h*(XdeltaA(i)+2*XdeltaB(i)+2*XdeltaC(i)+XdeltaD(i))/6 ) |
V(i) becomes (V(i) + h*(VdeltaA(i)+2*VdeltaB(i)+2*VdeltaC(i)+VdeltaD(i))/6 ) |
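A Fortran 90 sketch of this nine-phase kernel for the particles owned by one processor (our illustration; MPGrav is written here as a subroutine that fills the accelerations of the locally owned particles and does its own communication, and all names are ours):

subroutine rk4_nbody(X, V, M, h, ns)
  ! Sketch only: X, V hold the locally owned particles (3 x Nlocal); an
  ! explicit interface for MPGrav is assumed available (e.g. from a module).
  implicit none
  real, dimension(:,:), intent(inout) :: X, V
  real, dimension(:),   intent(in)    :: M
  real, intent(in)    :: h
  integer, intent(in) :: ns
  real, dimension(size(X,1), size(X,2)) :: Xuse, XdA, XdB, XdC, XdD, VdA, VdB, VdC, VdD
  integer :: step

  do step = 1, ns
     XdA = V;              Xuse = X                  ! phase 1
     call MPGrav(Xuse, M, VdA)                       ! phase 2 (communication)
     XdB = V + (h/2)*VdA;  Xuse = X + (h/2)*XdA      ! phase 3
     call MPGrav(Xuse, M, VdB)                       ! phase 4 (communication)
     XdC = V + (h/2)*VdB;  Xuse = X + (h/2)*XdB      ! phase 5
     call MPGrav(Xuse, M, VdC)                       ! phase 6 (communication)
     XdD = V + h*VdC;      Xuse = X + h*XdC          ! phase 7
     call MPGrav(Xuse, M, VdD)                       ! phase 8 (communication)
     X = X + h*(XdA + 2*XdB + 2*XdC + XdD)/6         ! phase 9
     V = V + h*(VdA + 2*VdB + 2*VdC + VdD)/6
  end do
end subroutine rk4_nbody

Phases 2, 4, 6 and 8 are the four MPGrav calls that communicate; everything else is purely local array arithmetic.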
This problem is classic SPMD, with different phases of computation and communication; the four MPGrav phases are further subdividable into subphases which are compute-only or communicate-only |
We divide particles equally between processors
|
As no "locality" in force, all particles are "equal" and it does not matter which particle is placed in which processor |
Remember each of nine phases is a full loop over all particles in a given processor with local index running from 1 ... N/Nproc |
We get "automatic" balanced parallel computation in steps 1) 3) 5) 9) with "embarassingly parallel" (i.e. no communication) operation |
MPGrav(i) uses Xuse(i) and mass M(i) and calculates a 3-vector force from the 3-vectors Xuse(j) |
MPGrav(i) = Σ_{j≠i} M(i)*M(j)*( Xuse(j) - Xuse(i) ) / r_{i,j}³ |
where r_{i,j} = | Xuse(i) - Xuse(j) | is the distance between particle i and particle j |
This calculation involves communication, as N - N/Nproc of the values of j (and the parameters Xuse(j), M(j)) are stored outside the processor holding the i'th particle |
We describe algorithms with increasing complexity and efficiency! |
Note in discussions, we are rather sloppy as to whether i,j are "local" (1...N/Nproc) or "global" (1...N) indices |
For each i in each processor, set MPGrav(i) =0 |
We will implement a naive "owner's-compute rule" algorithm where MPGrav(i) is calculated in processor that is home to i |
Now loop over j = 1...N (j ≠ i) |
When j is stored in processor holding i, increment MPGrav(i) by contribution due to j |
When j stored in a different processor, communicate Xuse(j),M(j) and increment MPGrav(i) |
The parallelism is perfect and both communication and computation are load balanced |
However messages are small (as described, 4 words -- a 3-vector Xuse and mass M) and |
communication ≈ 4 t_comm (for small messages) |
where the computation estimate of ~15 t_float could be higher depending on how 1/r_{i,j}³ is calculated. As this involves a division and a square root, it is expensive if done directly. Sometimes it is better to calculate it by table lookup, but this involves slow memory access |
As typically t_comm/t_float ≈ 10 even for quite large messages (and worse for small ones), the above estimate suggests that communication dominates ... |
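A rough worked number using the estimates above makes the point (a sketch, not a measurement):

\frac{\text{communication}}{\text{computation}} \approx \frac{4\, t_{comm}}{15\, t_{float}} \approx \frac{4 \times 10}{15} \approx 2.7

i.e. roughly three times as long is spent communicating as computing, independent of the grain size.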
We can solve the small message size and poor communication/computation ratio by a simple change in algorithm which still preserves owner's compute rule. |
Reverse loops over i and j so that j is outer loop and i inner loop |
First for each i in processor, set MPGrav(i) = 0 |
Now looping over j, increment each MPGrav(i) by contribution of those j that are local to processor |
Now fetch those j which are off processor and communicate as before M(j), Xuse(j) |
For this j, run over all i in processor, incrementing MPGrav(i) by the contribution of this j |
This version of the algorithm reduces communication by a factor of the grain size n = N/Nproc |
and the total communication overhead is therefore roughly proportional to t_comm/(n t_float) |
which is small for n > 100 |
The algorithm as described before still has small messages but this can be addressed for both "very bad" and "much better" algorithm by "blocking" j loop so that one fetches not 1 but J values of j at a time. |
This implies messages can be "arbitrarily" large and the user can choose J so that:
|
See later comments on cache use and pipelining of messages for further related issues |
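A message parallel sketch of the blocked, j-outer ("much better") algorithm, written with MPI and with the block size J taken equal to the grain size n = N/Nproc so that each broadcast sends one processor's particles to everybody (all routine and variable names here are illustrative, not from the foils):

subroutine mpgrav_blocked(Xlocal, Mlocal, Force, comm)
  use mpi
  implicit none
  real, dimension(:,:), intent(in)  :: Xlocal   ! (3, n) locally owned positions
  real, dimension(:),   intent(in)  :: Mlocal   ! (n)    locally owned masses
  real, dimension(:,:), intent(out) :: Force    ! (3, n) force on local particles
  integer, intent(in) :: comm
  real, dimension(3, size(Mlocal)) :: Xblock
  real, dimension(size(Mlocal))    :: Mblock
  real, dimension(3) :: d
  real :: r3
  integer :: me, nproc, owner, i, j, n, ierr

  call MPI_Comm_rank(comm, me, ierr)
  call MPI_Comm_size(comm, nproc, ierr)
  n = size(Mlocal)
  Force = 0.
  do owner = 0, nproc - 1                       ! outer loop over j blocks
     if (owner == me) then                      ! the block is broadcast once ...
        Xblock = Xlocal
        Mblock = Mlocal
     end if
     call MPI_Bcast(Xblock, 3*n, MPI_REAL, owner, comm, ierr)
     call MPI_Bcast(Mblock, n,   MPI_REAL, owner, comm, ierr)
     do j = 1, n                                ! ... then re-used for every local i
        do i = 1, n
           if (owner == me .and. i == j) cycle  ! no self-interaction
           d  = Xblock(:, j) - Xlocal(:, i)
           r3 = sqrt(sum(d*d))**3
           Force(:, i) = Force(:, i) + Mlocal(i)*Mblock(j)*d/r3
        end do
     end do
  end do
end subroutine mpgrav_blocked

Each broadcast block is re-used against all n local particles, which is exactly the factor-of-n reduction in communication overhead described above.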
As owner's compute rule is obeyed, a good parallelizing compiler should be able to "automatically" find the "much better" algorithm as inverting loops and blocking are standard optimization strategies |
Note that "much better" parallel algorithm is also correct sequential algorithm as naturally uses each j value N-1 times as j block fixed in cache and i values are cycled through
|
General lesson is that amount of computation and amount of data re-use are as important as amount of communication |
The O(N²) long range force algorithm has the largest communication load but the smallest known form of overhead (except for pleasingly parallel problems), which is proportional to 1/n where n is the grain size
|
Compare Laplace's equation (or a general PDE), which has overhead proportional to 1/n^(1/2) in two dimensions and 1/n^(1/3) in three dimensions |
Such PDE's have small (edge) communication but an algorithm with computation proportional to N, not N², each iteration |
The N-body problem has computation proportional to N², and the general rule is that as algorithms get either more complex or more compute intense, the relative amount of communication decreases and does NOT increase
|
The previous algorithm is not as good as it looks, for it has an efficiency of 50% (further reduced by terms of order 1/n)
|
Degradation is because of Newton's law of action and reaction, which says that |
F_{i,j} = -F_{j,i} |
which reduces the sequential computation load by a factor of 2 |
This is not trivial to exploit in the parallel algorithm, as F_{i,j} and F_{j,i} are needed in different processors, and so one MUST violate the owner computes rule to exploit it
|
Introduce a new array MPGrav_travel(i) which will travel through the array picking up the symmetrically generated terms |
First in each processor, initialize both MPGrav(i) and MPGrav_travel(i) to zero |
Now we, as before, have an outer loop over j which in practice will be blocked into J items for message size issues
|
Now loop over each i in each processor and have some flag to decide whether or not F_{i,j} is to be calculated in the Home of i or the Home of j |
if ( F_{i,j} is to be computed in the home of i ) then |
find F_{i,j} and use it to increment MPGrav(i) with F_{i,j}, and |
increment MPGrav_travel(j) with F_{j,i} = -F_{i,j} |
We need a criterion for deciding where to compute F_{i,j} so that for EACH j block there is an equal amount of computation in each processor |
If the criterion is to calculate in the Home of i if i < j (the natural sequential choice), then this is NOT load balanced for a one dimensional block decomposition, as the amount of work decreases as i increases and one gets to later numbered processors |
However one can use a cyclic or scattered decomposition (and "interaction criterion" i<j ) as then each processor has particles distributed throughout array
|
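One balanced choice of "interaction criterion" (an illustration; it is essentially the rule the circulating pipeline algorithm applies implicitly) is to compute F_{i,j} in the home of i when j lies in the half ring ahead of i, which gives every particle exactly (N-1)/2 interactions to evaluate when N is odd:

! Hypothetical load-balanced interaction criterion (sketch, N odd):
! true when F(i,j) should be evaluated in the home processor of particle i
logical function compute_here(i, j, N)
  implicit none
  integer, intent(in) :: i, j, N
  compute_here = (i /= j) .and. (modulo(j - i, N) <= (N - 1)/2)
end function compute_here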
Note that this "best" algorithm halves computation and "doubles" communication as one is transferring twice as much information (7 not 4 items for each j) |
In our description of "much better" and "best" algorithm, we assumed that one broadcasts each J block to each processor |
There are some different ways of setting this up which can be more efficient on some architectures
|
The data parallel part of foils in fact describes the natural pipeline algorithm which rotates J blocks through processors one step at a time |
This has the feature (different from previous explanation) that each processor is handling a different set of j's at a given stage in computation. |
Nancy McCracken, |
Geoffrey Fox |
NPAC |
Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
This uses the simple O(N²) Particle Dynamics Problem as a motivator to discuss solution of ordinary differential equations |
We discuss Euler, Runge Kutta and predictor corrector methods |
F90 and HPF Data parallel O(N²) algorithms are described with performance comments |
There is a related message parallel module sharing the same initial foils |
Positions and velocities are 3 × N arrays, X and V |
other variables
|
subroutine for numerical method will take these arguments and update X and V. |
A subroutine called MPGrav (message parallel) is assumed to compute new accelerations |
Computation of numerical method is inherently iterative: at each time step, the solution depends on the immediately preceding one. |
At each time step, MPGrav is called (several times as using Runge Kutta):
|
We will use 4th order Runge Kutta to integrate in time, and the program is designed as an overall routine looping over time with parallelism hidden in the MPGrav routines |
We analyse Message Parallel version and other foils discuss Data Parallel version |
A denotes MPGrav and Ac denotes MPGrav_travel |
The case where i is compared with i is not needed, as particles don't interact with themselves |
At step k, interact particle i with particle j= 1 + mod((i+k-1),N) |
Accumulate force on i due to j in fixed Ai |
Accumulate negative of this as force on j due to i in circulating Acj |
At the end of the algorithm, add Ai and Aci |
In the parallel version, note that Ai will be calculated in the "home" processor for particle i but Aci will travel around the machine, being accumulated in the processor holding particle j |
Thus this violates the owner computes rule and so this parallel algorithm must be implemented by hand -- the compiler will not find it automatically |
The first step is to circulate (shift) one position and calculate accelerations Fij and Fji in all index positions |
Shifting pipeline (N-1) times gives correct algorithm but does not save "Newton's factor of two". |
Just need (N-1)/2 steps when N is odd and N/2 steps when N is even which saves factor of two. |
Shift in blocks of J particles so that message passing is blocked for good performance |
Symmetry of force on particles: Fij = -Fji (Newton's Law of Action and Reaction!)
|
The parallel version has two issues -- firstly one cannot use the "owner-computes" rule directly, and secondly one must worry about load balancing |
Assuming, for example, that processors are assigned with a block distribution in the column direction.
|