Given by Ian Foster, Gina Goff, Ehtesham Hayder, and Chuck Koelbel at the DoD Modernization Tutorial, 1995-1998. Foils prepared August 29, 1998
Summary of Material
Day 1
|
Day 2
|
Introduction |
Parallelism, Synchronization, and Environments |
Restructuring/Designing Programs in OpenMP |
Example Programs |
A portable fork-join parallel model for shared-memory architectures |
Portable
|
Fork-join model
|
Shared memory
|
Computation(s) using several processors
|
Synchronization
|
Two basic flavors of parallelism: |
Data parallelism (parallel DO loops) and functional parallelism (parallel sections) |
"A flexible standard, easily implemented across different platforms" |
Control structures
|
Data environment
|
Synchronization
|
Runtime library
|
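These four components appear together even in a trivial program. The sketch below is not from the tutorial; it assumes nothing beyond standard OpenMP directives and runtime routines, and simply marks where each component shows up.
PROGRAM hello
INTEGER omp_get_thread_num, omp_get_num_threads
INTEGER id
!$OMP PARALLEL PRIVATE(id)
! PARALLEL is the control structure; PRIVATE(id) sets the data environment
id = omp_get_thread_num()
! omp_get_thread_num and omp_get_num_threads are runtime library calls
!$OMP CRITICAL
! CRITICAL is synchronization: one thread prints at a time
PRINT *, 'Hello from thread', id, 'of', omp_get_num_threads()
!$OMP END CRITICAL
!$OMP END PARALLEL
END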
Introduction |
Parallelism, Synchronization, and Environments |
Restructuring/Designing Programs in OpenMP |
Example Programs |
PARALLEL, END PARALLEL
|
SECTIONS, END SECTIONS
|
SECTION
|
SINGLE, END SINGLE
|
DO, END DO
|
Static Scheduling (default)
|
Dynamic Scheduling
|
[Figure: the loop iterations 1-36 shown twice, illustrating how they are assigned to threads under static scheduling (the default) and under dynamic scheduling] |
!$OMP PARALLEL DO & |
!$OMP SCHEDULE(DYNAMIC,1) |
DO J = 1, 36 |
CALL SUBR(J) |
END DO |
!$OMP END PARALLEL DO |
!$OMP PARALLEL DO & |
!$OMP SCHEDULE(GUIDED,1) |
DO J = 1, 36 |
CALL SUBR(J) |
END DO |
!$OMP END PARALLEL DO |
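SINGLE, listed among the control structures above, does not appear in the later examples. A minimal sketch, assuming hypothetical routines init_data and work (work plays the role of SUBR above):
SUBROUTINE driver()
INTEGER J
!$OMP PARALLEL PRIVATE(J)
!$OMP SINGLE
CALL init_data()
! exactly one thread runs the setup; the END SINGLE barrier holds the others
!$OMP END SINGLE
!$OMP DO SCHEDULE(STATIC)
DO J = 1, 36
CALL work(J)
! iterations are divided among the team in contiguous blocks (static schedule)
END DO
!$OMP END DO
!$OMP END PARALLEL
END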
PROGRAM main |
!$OMP PARALLEL |
CALL foo() |
CALL bar() |
CALL error() |
!$OMP END PARALLEL |
END |
SUBROUTINE error() |
! Not allowed: foo and bar contain work-sharing |
! constructs, so this nests control structs |
!$OMP SECTIONS |
!$OMP SECTION |
CALL foo() |
!$OMP SECTION |
CALL bar() |
!$OMP END SECTIONS |
END |
SUBROUTINE foo() |
!$OMP DO |
DO i = 1, n |
... |
END DO |
!$OMP END DO |
END |
SUBROUTINE bar() |
!$OMP SECTIONS |
!$OMP SECTION |
CALL section1() |
!$OMP SECTION |
... |
!$OMP SECTION |
... |
!$OMP END SECTIONS |
END |
Implicit barriers wait for all threads
|
Explicit directives provide finer control
|
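A sketch of both cases, using hypothetical routines phase1, phase2, and local_cleanup: the first END DO keeps its implicit barrier, while the second adds NOWAIT and relies on an explicit BARRIER placed where the wait is actually needed.
SUBROUTINE two_phases(n)
INTEGER n, i
!$OMP PARALLEL PRIVATE(i)
!$OMP DO
DO i = 1, n
CALL phase1(i)
END DO
!$OMP END DO
! implicit barrier: every thread finishes phase1 before any starts phase2
!$OMP DO
DO i = 1, n
CALL phase2(i)
END DO
!$OMP END DO NOWAIT
! no barrier here: each thread continues into its own cleanup immediately
CALL local_cleanup()
!$OMP BARRIER
! explicit barrier: wait only where the algorithm really requires it
!$OMP END PARALLEL
END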
Data can be PRIVATE or SHARED |
Private data is for local variables |
Shared data is global |
Data can be private to a thread -- any routine executed by that thread can access the data, but other threads cannot see it |
COMMON /mine/ z |
INTEGER x(3), y(3), z |
!$OMP THREADPRIVATE(/mine/) |
!$OMP PARALLEL DO DEFAULT(PRIVATE), SHARED(x) |
DO k = 1, 3 |
x(k) = k |
y(k) = k*k |
z = z + x(k)*y(k) |
END DO |
!$OMP END PARALLEL DO |
[Figure: shared memory holds x = (1, 2, 3) and z = 36; thread 0 has private y = 1 and threadprivate copy z' = 1, thread 1 has y = 4 and z' = 4, thread 2 has y = 9 and z' = 9] |
For controlling execution
|
OMP_NUM_THREADS: How many to use in parallel region?
|
OMP_DYNAMIC: Should runtime system choose number of threads?
|
OMP_NESTED: Should nested parallel regions be supported?
|
OMP_SCHEDULE: Choose DO scheduling option
|
omp_in_parallel(): a runtime library call (not an environment variable) that tells whether the program is currently in a parallel region |
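The first four are environment variables set before the run (for example, setenv OMP_NUM_THREADS 4 in csh); the runtime library offers matching calls usable from inside the program. A small sketch using standard OpenMP library routines:
SUBROUTINE report_settings()
INTEGER omp_get_num_threads
LOGICAL omp_in_parallel, omp_get_dynamic
CALL omp_set_num_threads(4)
! overrides OMP_NUM_THREADS for subsequent parallel regions
PRINT *, 'dynamic thread adjustment:', omp_get_dynamic()
PRINT *, 'inside a parallel region? ', omp_in_parallel()
! .FALSE. here; .TRUE. if called from within a parallel region
!$OMP PARALLEL
!$OMP SINGLE
PRINT *, 'team size:', omp_get_num_threads()
!$OMP END SINGLE
!$OMP END PARALLEL
END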
Introduction |
Parallelism, Synchronization, and Environments |
Restructuring/Designing Programs in OpenMP |
Example Programs |
Profiling |
Walk the loop nests |
Multiple parallel loops |
Is dataset large enough? |
At the top of the list, you should find |
|
What is cumulative percent? |
Watch for system libraries near top
|
Usually the outermost parallel loop
|
Don't be put off by --
|
Nested parallel loops are good
|
Non-nested parallel loops |
|
subroutine fem3d(...) |
10 call addmon(...) |
if(numelh.ne.0) call solide |
subroutine solide |
do 20 i=1,nelt |
do 20 j=1,nelg |
call unpki |
call strain |
call force |
20 continue |
if(...) return |
goto 10 |
subroutine force(...) |
do 10 i=lft,llt |
sgv(i) = sig1(i)-qp(i)*vol(i) |
10 continue |
do 50 n=1,nnc |
i0=ia(n) |
i1=ia(n+1)-1 |
do 50 i=i0,i1 |
e(1,ix(i))=e(1,ix(i))+ep11(i) |
50 continue |
Two-level strategy for parallel processing |
Determining shared and local variables |
Adding synchronization |
Two-level approach isolates major concerns -- makes code easier to update |
Algorithm/Architecture Level
|
Platform Specific Level
|
What are the variable classes? |
Process for determining class |
First private/last private |
Start with access patterns
|
Goal: determine storage classes
|
In general, big things are shared
|
Local variables in the program's subroutines usually become PRIVATE variables in the parallel region |
|
Move up from leaf subroutines to parallel region |
EQUIVALENCE statements complicate this classification -- avoid them where possible |
Examine refs to each var to determine shared list
|
Construct private list and declare private commons by examining the types of remaining variables |
Examine the references to each variable in the parallel region: |
Only read in the parallel region: put it on the shared list |
Modified in the parallel region and subscripted by the parallel loop index (different iterations reference different parts): put it on the shared list |
Modified in the parallel region and not subscripted by the parallel loop index: go to the next page |
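A small sketch of the classification applied to a hypothetical loop: n and b are only read, and a is modified but subscripted by the parallel loop index, so all three go on the shared list; t is modified and not subscripted by the loop index, but it is pure per-iteration scratch, so it becomes private.
SUBROUTINE classify_example(a, b, n)
INTEGER n, i
REAL a(n), b(n), t
!$OMP PARALLEL DO SHARED(a,b,n) PRIVATE(i,t)
DO i = 1, n
t = 2.0*b(i)
! t: modified, not indexed by the loop variable -> private scratch
a(i) = a(i) + t
! a: modified, but different iterations touch different elements -> shared
END DO
!$OMP END PARALLEL DO
END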
LASTPRIVATE copies value(s) from local copy assigned on last iteration of loop to global copy of variables or arrays |
FIRSTPRIVATE copies value(s) from the global variables or arrays into each thread's local copy before that thread's first iteration of the loop |
Parallelizing a loop without knowing whether a value computed inside it is used again afterward (a side effect)? |
subroutine foo(n) |
common /foobar/a(1000),b(1000),x |
c$omp parallel do shared(a,b,n) lastprivate(x) |
do 10 i=1,n |
x=a(i)**2 + b(i)**2 |
10 b(i)= sqrt(x) |
end |
Use LASTPRIVATE because we do not know where, or whether, x in COMMON /foobar/ will be used again |
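FIRSTPRIVATE solves the mirror-image problem: a value set before the loop must be visible inside it even though the variable is private. A sketch with hypothetical names; with DEFAULT(PRIVATE), c1 and c2 would otherwise be undefined inside the loop.
subroutine scaleit(a,n)
common /params/ c1,c2
real a(1000)
c each thread's private copies of c1 and c2 are initialized from /params/
c$omp parallel do default(private) shared(a,n)
c$omp& firstprivate(c1,c2)
do 10 i=1,n
10 a(i) = c1*a(i) + c2
end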
Finding variables that need to be synchronized |
Two frequently used types
|
Doing reductions |
Updates: variables that are invariant with respect to the parallel DO but are read and then written inside it |
Place critical/ordered section around groups of updates |
Pay attention to control flow
|
if (ncycle.eq.0) then |
do 60 i=lft,llt |
dt2=amin1(dtx(i),dt2) |
if (dt2.eq.dtx(i)) then |
ielmtc=128*(ndum-1)+i |
ielmtc=nhex(ielmtc) |
ityptc=1 |
endif |
ielmtd=128*(ndum-1)+i |
ielmtd=nhex(ielmtd) |
write (13,90) ielmtd,dtx(i) |
write (13,100)ielmtc |
60 continue |
endif |
do 70 i=lft,llt |
70 dt2=amin1(dtx(i),dt2) |
if (mess.ne.'sw2.') return |
do 80 i=lft,llt |
if (dt2.eq.dtx(i)) then |
ielmtc=128*(ndum-1)+i |
ielmtc=nhex(ielmtc) |
ityptc=1 |
endif |
80 continue |
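One way to synchronize the updates in loops 70-80 above (a sketch of the pattern, not the authors' code): each thread keeps a local minimum and the matching element number, then merges them inside a critical section so dt2, ielmtc, and ityptc are updated consistently.
c$omp parallel private(i,dt2l,ielmtl)
c$omp& shared(dt2,ielmtc,ityptc,dtx,nhex,ndum,lft,llt)
dt2l = dt2
ielmtl = 0
c$omp do
do 80 i=lft,llt
if (dtx(i).lt.dt2l) then
dt2l = dtx(i)
ielmtl = nhex(128*(ndum-1)+i)
endif
80 continue
c$omp critical
c merge each thread's local result into the shared variables
if (dt2l.lt.dt2) then
dt2 = dt2l
ielmtc = ielmtl
ityptc = 1
endif
c$omp end critical
c$omp end parallel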
Serial program is a reduction: |
sum = 0.0 |
do 10 i=1,n |
10 sum = sum + a(i) |
Correct (but slow) program: |
sum = 0.0 |
c$omp parallel private(i) shared(sum,a,n) |
c$omp do |
do 10 i=1,n |
c$omp critical |
sum = sum + a(i) |
c$omp end critical |
10 continue |
c$omp end parallel |
Incorrect parallel program: |
c$omp parallel private(suml,i) |
c$omp& shared(sum,a,n) |
suml = 0.0 |
c$omp do |
do 10 i=1,n |
10 suml = suml + a(i) |
cbug -- need critical section next |
sum = sum + suml |
c$omp end parallel |
Correct reduction: |
c$omp parallel private(suml,i) |
c$omp& shared(sum,a,n) |
suml = 0.0 |
c$omp do |
do 10 i=1,n |
10 suml = suml + a(i) |
c$omp critical |
sum = sum + suml |
c$omp end critical |
c$omp end parallel |
Using Reduction does the same: |
c$omp parallel private(i) |
c$omp& shared(a,n) |
c$omp& reduction(+:sum) |
c$omp do |
do 10 i=1,n |
10 sum = sum + a(i) |
c$omp end parallel |
Problem: incorrectly pointing to the same place
|
Problem: incorrectly pointing to different places
|
Problem: incorrect initialization of parallel regions
|
Problem: not saving values from parallel regions
|
Problem: unsynchronized access
|
Problem: numerical inconsistency
|
Problem: inconsistently synchronized I/O stmts
|
Problem: inconsistent declarations of common vars
|
Problem: parallel stack size problems
|
Introduction |
Parallelism, Synchronization, and Environments |
Restructuring/Designing Programs in OpenMP |
Example Programs |
Partition
|
Communicate
|
Agglomerate
|
Map
|
Partition
|
Communicate
|
Agglomerate
|
Map
|
Numerically solve a PDE on a square mesh |
Method:
|
Partitioning does not change at all
|
Communication does not change at all
|
Agglomeration analysis changes a little
|
Minimize forking and synchronization overhead
|
Keep each processor working on the same data
|
Lay out data to be contiguous
|
(to be continued) |
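A sketch of the resulting OpenMP kernel for one sweep of the mesh; the array names u and unew and the five-point update are assumptions, not taken from the foils. Parallelizing the outer loop over columns keeps each thread's data contiguous in Fortran's column-major layout, and the default (typically static) schedule keeps each thread on the same columns every sweep.
SUBROUTINE sweep(u, unew, n)
INTEGER n, i, j
REAL u(0:n+1,0:n+1), unew(0:n+1,0:n+1)
!$OMP PARALLEL DO PRIVATE(i,j) SHARED(u,unew,n)
DO j = 1, n
! one fork-join per sweep: parallelize only the outermost loop
DO i = 1, n
unew(i,j) = 0.25*(u(i-1,j)+u(i+1,j)+u(i,j-1)+u(i,j+1))
END DO
END DO
!$OMP END PARALLEL DO
END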
The Problem
|
The Approach
|
REAL x(nnode), y(nnode), flux |
INTEGER iedge(nedge,2) |
err = tol * 1e6 |
DO WHILE (err > tol)
|
END DO |
Flux computations are data-parallel
|
Node updates are nearly data-parallel
|
Error check is a reduction
|
Communication needed for all parts
|
Because of the tight ties between flux, x, and err, it is best to keep the loop intact
|
No differences between computation in different iterations
|
There may be significant differences in data movement based on scheduling |
The ideal:
|
The reality:
|
Divide edge list among processors |
Ideally, would like all edges referring to a given vertex to be assigned to the same processor
|
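A sketch of one way to parallelize the loop body, assuming per-edge storage flux(nedge), a simple difference for the flux, and a sum of squares for the error check; none of these details are in the foils. The node updates use ATOMIC because two edges handled by different threads can touch the same vertex.
SUBROUTINE edge_sweep(x, y, flux, iedge, nnode, nedge, err)
INTEGER nnode, nedge, iedge(nedge,2), i
REAL x(nnode), y(nnode), flux(nedge), err
err = 0.0
!$OMP PARALLEL PRIVATE(i) SHARED(x,y,flux,iedge,nedge)
!$OMP DO
DO i = 1, nedge
flux(i) = x(iedge(i,1)) - x(iedge(i,2))
! flux computation: purely data-parallel over edges
END DO
!$OMP END DO
!$OMP DO
DO i = 1, nedge
!$OMP ATOMIC
y(iedge(i,1)) = y(iedge(i,1)) + flux(i)
!$OMP ATOMIC
y(iedge(i,2)) = y(iedge(i,2)) - flux(i)
! node updates: ATOMIC protects vertices shared between edges
END DO
!$OMP END DO
!$OMP DO REDUCTION(+:err)
DO i = 1, nedge
err = err + flux(i)*flux(i)
! error check: a reduction over all edges
END DO
!$OMP END DO
!$OMP END PARALLEL
END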
Based on fork-join parallelism in shared memory
|
Very good for sharing data and incremental parallelization |
Unclear whether it is feasible for distributed memory |
More information at http://www.openmp.org |
HPF
|
MPI
|
OpenMP
|
Can we combine paradigms?
|
Modern parallel machines are often shared memory nodes connected by message passing |
Can be programmed by calling MPI from an OpenMP program
|
ASCI project is using this heavily |
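A minimal sketch of that style, not any particular ASCI code: MPI is initialized and called only from the serial part of the program, while OpenMP threads share the work within each node. The array a and the sum are invented for illustration.
PROGRAM hybrid
INCLUDE 'mpif.h'
INTEGER ierr, rank, i
REAL a(1000), localsum, globalsum
CALL MPI_INIT(ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
DO i = 1, 1000
a(i) = rank + 1.0
END DO
localsum = 0.0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:localsum)
DO i = 1, 1000
localsum = localsum + a(i)
! OpenMP threads share the loop within one MPI process (one node)
END DO
!$OMP END PARALLEL DO
CALL MPI_ALLREDUCE(localsum, globalsum, 1, MPI_REAL, &
MPI_SUM, MPI_COMM_WORLD, ierr)
! message passing combines the per-node sums across nodes
CALL MPI_FINALIZE(ierr)
END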
Many applications (like the atmosphere/ocean model) consist of several data-parallel modules |
Can link HPF codes on different machines using MPI
|
HPFMPI project at Argonne has done proof-of-concept |
HPF can be implemented by translating it to OpenMP
|
HPF may call OpenMP directly
|