Designing and Building Parallel Programs 2: OpenMP Shared Memory Programming Language

Given by Ian Foster, Gina Goff, Ehtesham Hayder, and Chuck Koelbel at the DoD Modernization Tutorial, 1995-1998. Foils prepared August 29, 1998.

Summary of Material


Introduction
Parallelism, Synchronization, and Environments
Restructuring/Designing Programs in OpenMP
Example Programs

Table of Contents for Designing and Building Parallel Programs 2: OpenMP Shared Memory Programming Language


1 Outline
2 Outline
3 OpenMP
4 OpenMP (2)
5 Shared Memory
6 Shared Memory in Pictures
7 OpenMP
8 OpenMP in Pictures
9 Design of OpenMP
10 Design of OpenMP (2)
11 Outline
12 Control Structures
13 Control Structures (2)
14 DO Scheduling
15 DO Scheduling (2)
16 Orphaned Directives
17 OpenMP Synchronization
18 OpenMP Synchronization (2)
19 OpenMP Data Environments
20 OpenMP Data Environments
21 OpenMP Environment & Runtime Library
22 OpenMP Environment & Runtime (2)
23 OpenMP Environment & Runtime (3)
24 Outline
25 Analyzing for Parallelism
26 Program Profile
27 Walking the Key Loop Nest
28 Multiple Parallel Loops
29 Example -- Loop Nest
30 Restructuring Applications
31 Two Levels of Parallel Processing
32 Two Levels of Parallel Processing (cont.)
33 Determining Shared and Private
34 Types of Variables
35 Guidelines for Classifying Variables
36 Process of Classifying Variables
37 Process of Classifying Variables (2)
38 Process of Classifying Vars (3)
39 Firstprivate and Lastprivate
40 Firstprivate and Lastprivate (2)
41 Choosing & Placing Synchronization
42 What to Synchronize
43 Example -- Critical/Ordered Section
44 Reductions
45 (Flawed) Plan For a Good Reduction
46 Good Reductions
47 Typical Parallel Bugs
48 Typical Parallel Bugs (2)
49 Typical Parallel Bugs (3)
50 Outline
51 Designing Parallel Programs in OpenMP
52 Designing Parallel Programs in OpenMP (2)
53 Jacobi Iteration: The Problem
54 Jacobi Iteration: OpenMP Partitioning, Communication, and Agglomeration
55 Partitioning, Communication, and Agglomeration (2)
56 Jacobi Iteration: OpenMP Mapping
57 Jacobi Iteration: OpenMP Program
58 Jacobi Iteration/Program (2)
59 Irregular Mesh: The Problem
60 Irregular Mesh: Sequential Program
61 Irregular Mesh: OpenMP Partitioning
62 Irregular Mesh: OpenMP Communication
63 Irregular Mesh: OpenMP Agglomeration
64 Irregular Mesh: OpenMP Mapping
65 Irregular Mesh: OpenMP Mapping (2)
66 Irregular Mesh: OpenMP Program
67 Irregular Mesh
68 Irregular Mesh: Pictures
69 Irregular Mesh: Bad Data Order
70 Irregular Mesh: Bad Data Order
71 Irregular Mesh: Good Data Order
72 Irregular Mesh: Good Data Order
73 OpenMP Summary
74 Three Systems Compared
75 Three Systems Compared (2)
76 OpenMP + MPI
77 MPI + HPF
78 HPF + OpenMP




Foil 1 Outline

Day 1
  • Introduction to Parallel Programming
  • The OpenMP Programming Language
    • Introduction
    • Parallelism, Synchronization, and Environments
    • Restructuring/Designing Programs in OpenMP
    • Example Programs
Day 2
  • Introduction to MPI
  • The PETSc Library

Foil 2 Outline

Introduction
Parallelism, Synchronization, and Environments
Restructuring/Designing Programs in OpenMP
Example Programs

Foil 3 OpenMP

A portable fork-join parallel model for shared-memory architectures
Portable
  • Based on Parallel Computing Forum (PCF)
  • Fortran 77 binding here today; C coming this year

Foil 4 OpenMP (2)

Fork-join model
  • Execution starts with one thread of control
  • Parallel regions fork off new threads on entry
  • Threads join back together at the end of the region
Shared memory
  • (Some) Memory can be accessed by all threads
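
A minimal sketch of the fork-join model in Fortran (the program is illustrative, not from the foils):

      PROGRAM hello
      INTEGER OMP_GET_THREAD_NUM
! Execution starts with one thread; PARALLEL forks a team
!$OMP PARALLEL
      PRINT *, 'hello from thread ', OMP_GET_THREAD_NUM()
!$OMP END PARALLEL
! The team joins here; the initial thread continues alone
      END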

Foil 5 Shared Memory

Computation(s) using several processors
  • Each processor has some private memory
  • Each processor has access to a memory shared with the other processors
Synchronization
  • Used to protect integrity of parallel program
  • Prevents unsafe memory accesses
  • Fine-grained synchronization (point to point)
  • Barrier: use for global synchronization

Foil 6 Shared Memory in Pictures

(Figure only; image not included in the HTML version.)

Foil 7 OpenMP

Two basic flavors of parallelism:
  • Coarse-grained
    • Program is broken into segments (threads) that can be executed in parallel
    • Use barriers to re-synchronize execution at the end
  • Fine-grained
    • Execute iterations of DO loop(s) in parallel

Foil 8 OpenMP in Pictures

(Figure only; image not included in the HTML version.)

Foil 9 Design of OpenMP

"A flexible standard, easily implemented across different platforms"
Control structures
  • Minimal for simplicity and encouraging common cases
  • PARALLEL, DO, SECTIONS, SINGLE, MASTER
Data environment
  • New data access capabilities for forked threads
  • SHARED, PRIVATE, REDUCTION

Foil 10 Design of OpenMP (2)

Synchronization
  • Simple implicit synch at beginning and end of control structures
  • Explicit synch for more complex patterns: BARRIER, CRITICAL, ATOMIC, FLUSH, ORDERED
Runtime library
  • Manages modes for forking and scheduling threads
  • E.g., OMP_GET_THREAD_NUM

Foil 11 Outline

Introduction
Parallelism, Synchronization, and Environments
Restructuring/Designing Programs in OpenMP
Example Programs

Foil 12 Control Structures

PARALLEL, END PARALLEL
  • Marks beginning and end of parallel segment
SECTIONS, END SECTIONS
  • Segment contains several sections
SECTION
  • Marks beginning of one section among many
SINGLE, END SINGLE
  • Only one section in segment
DO, END DO
  • Marks beginning and end of parallel DO
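
A sketch showing how these constructs combine inside one parallel region (the subroutine names and array update are invented for illustration):

!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
      CALL setup_x()          ! one thread takes this section
!$OMP SECTION
      CALL setup_y()          ! another thread takes this one
!$OMP END SECTIONS
!$OMP DO
      DO i = 1, n
         a(i) = b(i) + c(i)   ! iterations shared across the team
      END DO
!$OMP END DO
!$OMP SINGLE
      PRINT *, 'phase done'   ! exactly one thread executes this
!$OMP END SINGLE
!$OMP END PARALLEL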

Foil 13 Control Structures (2)

(Figure only; image not included in the HTML version.)

Foil 14 DO Scheduling

Static Scheduling (default)
  • Divides loop into equal size iteration chunks
  • Based on runtime loop limits
  • Totally parallel scheduling algorithm
Dynamic Scheduling
  • Threads go to scheduler to get next chunk
  • Guided: chunks taper down at end of loop

Foil 15 DO Scheduling (2)

(Figure: iterations 1 through 36 of the loops below, shown twice -- once as dealt out to threads under DYNAMIC scheduling and once under GUIDED scheduling.)
!$OMP PARALLEL DO &
!$OMP SCHEDULE(DYNAMIC,1)
      DO J = 1, 36
         CALL SUBR(J)
      END DO
!$OMP END PARALLEL DO

!$OMP PARALLEL DO &
!$OMP SCHEDULE(GUIDED,1)
      DO J = 1, 36
         CALL SUBR(J)
      END DO
!$OMP END PARALLEL DO

Foil 16 Orphaned Directives

      PROGRAM main
!$OMP PARALLEL
      CALL foo()
      CALL bar()
      CALL error()
!$OMP END PARALLEL
      END

      SUBROUTINE error()
! Not allowed: foo and bar contain work-sharing
! constructs, so this nests control structures
!$OMP SECTIONS
!$OMP SECTION
      CALL foo()
!$OMP SECTION
      CALL bar()
!$OMP END SECTIONS
      END

      SUBROUTINE foo()
!$OMP DO
      DO i = 1, n
         ...
      END DO
!$OMP END DO
      END

      SUBROUTINE bar()
!$OMP SECTIONS
!$OMP SECTION
      CALL section1()
!$OMP SECTION
      ...
!$OMP SECTION
      ...
!$OMP END SECTIONS
      END

Foil 17 OpenMP Synchronization

Implicit barriers wait for all threads
  • DO, END DO
  • SECTIONS, END SECTIONS
  • SINGLE, END SINGLE
  • MASTER, END MASTER (though MASTER itself implies no barrier)
  • NOWAIT at END can override the implicit synchronization
  • Global barriers ⇒ all threads must hit them in the same order

Foil 18 OpenMP Synchronization (2)

Explicit directives provide finer control
  • BARRIER -- must be hit by all threads in the team
  • CRITICAL (name), END CRITICAL -- only one thread may enter at a time
  • ATOMIC -- single-statement critical section, e.g. for reductions
  • FLUSH (list) -- "synchronization point at which the implementation is required to provide a consistent view of memory"
  • ORDERED -- for pipelining loop iterations
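
For example, ATOMIC protects a single shared update more cheaply than a full critical section; this histogram sketch (array names and the bucket formula are assumptions) depends on it:

!$OMP PARALLEL DO PRIVATE(ib) SHARED(hist, a, n)
      DO i = 1, n
         ib = INT(a(i)*10.) + 1   ! choose a bucket (illustrative)
!$OMP ATOMIC
         hist(ib) = hist(ib) + 1  ! safe single-statement update
      END DO
!$OMP END PARALLEL DO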

Foil 19 OpenMP Data Environments

Data can be PRIVATE or SHARED
Private data is for local variables
Shared data is global
Data can be private to a thread -- all procedures executed by that thread can access the data, but other threads can't see it

Foil 20 OpenMP Data Environments (2)

      COMMON /mine/ z
      INTEGER x(3), y(3), z
!$OMP THREADPRIVATE(/mine/)
!$OMP PARALLEL DO DEFAULT(PRIVATE), SHARED(x)
      DO k = 1, 3
         x(k) = k
         y(k) = k*k
         z = z + x(k)*y(k)
      END DO
!$OMP END PARALLEL DO
(Figure: SHARED MEMORY holds x = 1, 2, 3 and z = 36; Thread 0 has private y = 1 and threadprivate z' = 1; Thread 1 has y = 4 and z' = 4; Thread 2 has y = 9 and z' = 9.)

Foil 21 OpenMP Environment & Runtime Library

For controlling execution
  • Needed for tuning, but may limit portability
  • Control through environment variables or runtime library calls
    • Runtime library takes precedence in conflict

Foil 22 OpenMP Environment & Runtime (2)

OMP_NUM_THREADS: How many to use in parallel region?
  • OMP_GET_NUM_THREADS, OMP_SET_NUM_THREADS
  • Related: OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS, OMP_GET_NUM_PROCS
OMP_DYNAMIC: Should runtime system choose number of threads?
  • OMP_GET_DYNAMIC, OMP_SET_DYNAMIC
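
A sketch of the two control styles together; per the precedence rule on the previous foil, the library call overrides any OMP_NUM_THREADS environment setting:

      PROGRAM teams
      INTEGER OMP_GET_NUM_THREADS
      CALL OMP_SET_NUM_THREADS(4)   ! overrides the environment variable
!$OMP PARALLEL
!$OMP MASTER
      PRINT *, 'team size = ', OMP_GET_NUM_THREADS()
!$OMP END MASTER
!$OMP END PARALLEL
      END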

Foil 23 OpenMP Environment & Runtime (3)

OMP_NESTED: Should nested parallel regions be supported?
  • OMP_GET_NESTED, OMP_SET_NESTED
OMP_SCHEDULE: Choose DO scheduling option
  • Used by RUNTIME clause
OMP_IN_PARALLEL: Is the program in a parallel region?
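
For instance, SCHEDULE(RUNTIME) defers the scheduling choice to OMP_SCHEDULE, so the schedule can be tuned without recompiling (csh syntax shown; work() is a placeholder routine):

setenv OMP_SCHEDULE "GUIDED,4"

!$OMP PARALLEL DO SCHEDULE(RUNTIME)
      DO i = 1, n
         CALL work(i)
      END DO
!$OMP END PARALLEL DO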

Foil 24 Outline

Introduction
Parallelism, Synchronization, and Environments
Restructuring/Designing Programs in OpenMP
Example Programs

Foil 25 Analyzing for Parallelism

Profiling
Walk the loop nests
Multiple parallel loops

Foil 26 Program Profile

Is dataset large enough?
At the top of the list, should find
  • parallel regions
  • routines called within them
What is cumulative percent?
Watch for system libraries near top
  • e.g., spin_wait_join_barrier

Foil 27 Walking the Key Loop Nest

Usually the outermost parallel loop
  • Ignore timestep and convergence loops
  • Ignore loops with few iterations
  • Ignore loops that call unimportant subroutines
Don't be put off by --
  • Loops that write shared data
  • Loops that step through linked lists
  • Loops with I/O

Foil 28 Multiple Parallel Loops

Nested parallel loops are good
  • Pick easiest or most parallel code
  • Think about locality
  • Use IF clause to select best based on dataset
  • Plan on doing one across clusters
Non-nested parallel loops
  • Consider loop fusion (impacts locality)
  • Execute code between in parallel region

Foil 29 Example -- Loop Nest

      subroutine fem3d(...)
   10 call addmon(...)
      if (numelh.ne.0) call solide

      subroutine solide
      do 20 i=1,nelt
      do 20 j=1,nelg
         call unpki
         call strain
         call force
   20 continue
      if (...) return
      goto 10

      subroutine force(...)
      do 10 i=lft,llt
         sgv(i) = sig1(i)-qp(i)*vol(i)
   10 continue
      do 50 n=1,nnc
         i0=ia(n)
         i1=ia(n+1)-1
         do 50 i=i0,i1
            e(1,ix(i))=e(1,ix(i))+ep11(i)
   50 continue

Foil 30 Restructuring Applications

Two level strategy for parallel processing
Determining shared and local variables
Adding synchronization

Foil 31 Two Levels of Parallel Processing

Two level approach isolates major concerns -- makes code easier to update
Algorithm/Architecture Level
  • Unique to your software
  • Provides majority of SMP performance

Foil 32 Two Levels of Parallel Processing (cont.)

Platform Specific Level
  • Vendor provides insight
  • Remove last performance obstacles
  • Be careful to limit use of non-portable constructs

Foil 33 Determining Shared and Private

What are the variable classes?
Process for determining class
First private/last private

Foil 34 Types of Variables

Start with access patterns
  • Read Only -- disposition elsewhere
  • Write then Read -- possibly local
  • Read then Write -- independent or reductions
  • Written -- live on exit?
Goal: determine storage classes
  • Local or private variables are local per thread
  • Shared variables are everything else

Foil 35 Guidelines for Classifying Variables

In general, big things are shared
  • The major arrays that take all the space
  • It's the thread's default model
Program local vars are parallel private vars
  • Temporaries used require one copy per thread
  • Subroutine locals become private automatically
Move up from leaf subroutines to parallel region
Equivalences: ick

Foil 36 Process of Classifying Variables

Examine refs to each var to determine shared list
  • Split common into shared common and private common if vars require different storage classes
  • Use copy-in to private common as alternative
Construct private list and declare private commons by examining the types of remaining variables

Foil 37 Process of Classifying Variables (2)

Examine references to each variable:
  • Only read in parallel region: put on shared list
  • Modified in parallel region:
    • Subscript contains the parallel loop index (different iterations reference different parts): put on shared list
    • Subscript does not contain the parallel loop index: go to next page

Foil 38 Process of Classifying Vars (3)

(Figure only; image not included in the HTML version.)

Foil 39 Firstprivate and Lastprivate

LASTPRIVATE copies value(s) from local copy assigned on last iteration of loop to global copy of variables or arrays
FIRSTPRIVATE copies value(s) from global variables or arrays to local copy for first iteration of loop on each processor
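
A small FIRSTPRIVATE sketch (variable names invented): each thread's private copy of scale starts from the global value rather than being undefined:

      scale = 2.0
!$OMP PARALLEL DO FIRSTPRIVATE(scale) SHARED(a, n)
      DO i = 1, n
         a(i) = a(i) * scale   ! scale was copied in per thread
      END DO
!$OMP END PARALLEL DO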

Foil 40 Firstprivate and Lastprivate (2)

Parallelizing a loop without knowing whether there are side effects?

      subroutine foo(n)
      common /foobar/a(1000),b(1000),x
c$omp parallel do shared(a,b,n) lastprivate(x)
      do 10 i=1,n
         x = a(i)**2 + b(i)**2
   10 b(i) = sqrt(x)
      end

Use LASTPRIVATE because we don't know where, or whether, x in common /foobar/ will be used again.

Foil 41 Choosing & Placing Synchronization

Finding variables that need to be synchronized
Two frequently used types
  • Critical/ordered sections: small updates to shared structures
  • Barriers: delimit phases of computation
Doing reductions

Foil 42 What to Synchronize

Updates: parallel do invariant variables that are read then written
Place critical/ordered section around groups of updates
Pay attention to control flow
  • Make sure you don't branch in or out
  • Pull it outside loop or region if efficient

Foil 43 Example -- Critical/Ordered Section

      if (ncycle.eq.0) then
        do 60 i=lft,llt
          dt2=amin1(dtx(i),dt2)
          if (dt2.eq.dtx(i)) then
            ielmtc=128*(ndum-1)+i
            ielmtc=nhex(ielmtc)
            ityptc=1
          endif
          ielmtd=128*(ndum-1)+i
          ielmtd=nhex(ielmtd)
          write (13,90) ielmtd,dtx(i)
          write (13,100) ielmtc
   60   continue
      endif
      do 70 i=lft,llt
   70 dt2=amin1(dtx(i),dt2)
      if (mess.ne.'sw2.') return
      do 80 i=lft,llt
        if (dt2.eq.dtx(i)) then
          ielmtc=128*(ndum-1)+i
          ielmtc=nhex(ielmtc)
          ityptc=1
        endif
   80 continue

Foil 44 Reductions

Serial program is a reduction:

      sum = 0.0
      do 10 i=1,n
   10 sum = sum + a(i)

Correct (but slow) parallel program:

      sum = 0.0
c$omp parallel private(i) shared(sum,a,n)
c$omp do
      do 10 i=1,n
c$omp critical
         sum = sum + a(i)
c$omp end critical
   10 continue
c$omp end parallel

Foil 45 (Flawed) Plan For a Good Reduction

Incorrect parallel program:

c$omp parallel private(suml,i)
c$omp& shared(sum,a,n)
      suml = 0.0
c$omp do
      do 10 i=1,n
   10 suml = suml + a(i)
cbug -- need critical section next
      sum = sum + suml
c$omp end parallel

Foil 46 Good Reductions

Correct reduction:

c$omp parallel private(suml,i)
c$omp& shared(sum,a,n)
      suml = 0.0
c$omp do
      do 10 i=1,n
   10 suml = suml + a(i)
c$omp critical
      sum = sum + suml
c$omp end critical
c$omp end parallel

Using REDUCTION does the same:

c$omp parallel private(i)
c$omp& shared(a,n)
c$omp& reduction(+:sum)
c$omp do
      do 10 i=1,n
   10 sum = sum + a(i)
c$omp end parallel

Foil 47 Typical Parallel Bugs

Problem: incorrectly pointing to the same place
  • Symptom: bad answers
  • Fix: initialization of local pointers
Problem: incorrectly pointing to different places
  • Symptom: segmentation violation
  • Fix: localization of shared data
Problem: incorrect initialization of parallel regions
  • Symptom: bad answers
  • Fix: Copy in? / Use parallel region outside parallel do.

Foil 48 Typical Parallel Bugs (2)

Problem: not saving values from parallel regions
  • Symptom: bad answers, core dump
  • Fix: transfer from local into shared
Problem: unsynchronized access
  • Symptom: bad answers
  • Fix: critical section / barrier / local accumulation
Problem: numerical inconsistency
  • Symptom: run-to-run variation in answers
  • Fix: different scheduling mechanisms / ordered sections / consistent parallel reductions

Foil 49 Typical Parallel Bugs (3)

Problem: inconsistently synchronized I/O stmts
  • Symptom: jumbled output, system error messages
  • Fix: critical/ordered section around I/O
Problem: inconsistent declarations of common vars
  • Symptom: segment violation
  • Fix: verify consistent declaration
Problem: parallel stack size problems
  • Symptom: core dump
  • Fix: increase stack size

Foil 50 Outline

Introduction
Parallelism, Synchronization, and Environments
Restructuring/Designing Programs in OpenMP
Example Programs

Foil 51 Designing Parallel Programs in OpenMP

Partition
  • Divide problem into tasks
Communicate
  • Determine amount and pattern of communication
Agglomerate
  • Combine tasks
Map
  • Assign agglomerated tasks to physical processors

Foil 52 Designing Parallel Programs in OpenMP (2)

Partition
  • In OpenMP, look for any independent operations (loop parallel, task parallel)
Communicate
  • In OpenMP, look for synch points and dependences
Agglomerate
  • In OpenMP, create parallel loops and/or parallel sections
Map
  • In OpenMP, implicit or explicit scheduling
  • Data mapping goes outside the standard

Foil 53 Jacobi Iteration: The Problem

Numerically solve a PDE on a square mesh
Method:
  • Update each mesh point by the average of its neighbors
  • Repeat until converged

Foil 54 Jacobi Iteration: OpenMP Partitioning, Communication, and Agglomeration

Partitioning does not change at all
  • Data parallelism natural for this problem
Communication does not change at all
  • Related directly to task partitioning

Foil 55 Partitioning, Communication, and Agglomeration (2)

Agglomeration analysis changes a little
  • OpenMP cannot nest control constructs easily
    • Requires intervening parallel section, with OMP_NESTED turned on
  • Major issue on shared memory machines is locality in memory layout
    • Nearest neighbors agglomerated together as blocks
    • Therefore, encourage each processor to keep using the same contiguous section(s) of memory

Foil 56 Jacobi Iteration: OpenMP Mapping

Minimize forking and synchronization overhead
  • One parallel region at highest possible level
  • Mark outermost possible loop for work sharing
Keep each processor working on the same data
  • Consistent schedule for DO loops
  • Trust underlying system not to migrate threads for no reason
Lay out data to be contiguous
  • Column-major ordering in Fortran
  • Therefore, make dimension of outermost work-shared loop the column

Foil 57 Jacobi Iteration: OpenMP Program

(to be continued)
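
The program itself survives only as an image in the original foils; a minimal sketch of one Jacobi sweep along the lines Foil 56 describes (array names, loop bounds, and the convergence test are assumptions) is:

      REAL u(n,n), unew(n,n), err
      err = tol + 1.
      DO WHILE (err > tol)
         err = 0.
! Work-share the outer loop over columns (contiguous in Fortran)
!$OMP PARALLEL DO PRIVATE(i) SHARED(u, unew, n)
!$OMP& REDUCTION(MAX:err)
         DO j = 2, n-1
            DO i = 2, n-1
               unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j)
     &                           + u(i,j-1) + u(i,j+1))
               err = MAX(err, ABS(unew(i,j) - u(i,j)))
            END DO
         END DO
!$OMP END PARALLEL DO
! Copy back with the same loop shape so data stays on the same threads
!$OMP PARALLEL DO PRIVATE(i) SHARED(u, unew, n)
         DO j = 2, n-1
            DO i = 2, n-1
               u(i,j) = unew(i,j)
            END DO
         END DO
!$OMP END PARALLEL DO
      END DO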

Foil 58 Jacobi Iteration/Program (2)

(Figure only; image not included in the HTML version.)

Foil 59 Irregular Mesh: The Problem

The Problem
  • Given an irregular mesh of values
  • Update each value using its neighbors in the mesh
The Approach
  • Store the mesh as a list of edges
  • Process all edges in parallel
    • Compute contribution of edge
    • Add to one endpoint, subtract from the other

Foil 60 Irregular Mesh: Sequential Program

      REAL x(nnode), y(nnode), flux, err
      INTEGER iedge(nedge,2)
      err = tol * 1e6
      DO WHILE (err > tol)
         err = 0.               ! reset the accumulated error each sweep
         DO i = 1, nedge
            flux = (y(iedge(i,1)) - y(iedge(i,2))) / 2
            x(iedge(i,1)) = x(iedge(i,1)) - flux
            x(iedge(i,2)) = x(iedge(i,2)) + flux
            err = err + flux*flux
         END DO
         err = err / nedge
         DO j = 1, nnode
            y(j) = x(j)
         END DO
      END DO

Foil 61 Irregular Mesh: OpenMP Partitioning

Flux computations are data-parallel
  • flux = (x(iedge(i,1))-x(iedge(i,2)))/2
  • Independent because edge_val ≠ node_val
Node updates are nearly data-parallel
  • x(iedge(i,1)) = x(iedge(i,1)) - flux(i)
  • Not truly independent because sometimes iedge(iY,1) = iedge(iX,2)
  • But ATOMIC supports updates using associative operators
Error check is a reduction
  • err = err + flux(i)*flux(i)
  • REDUCTION class for variables

Foil 62 Irregular Mesh: OpenMP Communication

Communication needed for all parts
  • Between edges and nodes to compute flux
  • Edge-node and node-node to compute x
  • Reduction to compute err
  • Edge and node communication is static, local with respect to grid
    • But unstructured with respect to array indices
  • Reduction communication is static, global

Foil 63 Irregular Mesh: OpenMP Agglomeration

Because of the tight ties between flux, x, and err, it is best to keep the loop intact
  • Incremental parallelization via OpenMP works perfectly
No differences between computation in different iterations
  • Any agglomeration scheme is likely to work well for load balance
  • Don't specify SCHEDULE
    • Make the system earn its keep

Foil 64 Irregular Mesh: OpenMP Mapping

There may be significant differences in data movement based on scheduling
The ideal:
  • Every processor runs over its own edges (static scheduling)
  • Endpoints of these edges are not shared by other processors
  • Data moves to its home on the first pass, then stays put

Foil 65 Irregular Mesh: OpenMP Mapping (2)

The reality:
  • The graph is connected ⇒ some endpoints must be shared
  • Memory systems move more than one word at a time ⇒ false sharing
  • OpenMP does not standardize how to resolve this
    • Best bet: Once the program is working, look for good orderings of data

Foil 66 Irregular Mesh: OpenMP Program

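The program foil survives only as an image; a sketch of how the sequential code from Foil 60 maps onto the constructs identified in Foils 61-63 (ATOMIC node updates, a summed error reduction, no SCHEDULE clause) might read:

      err = tol * 1e6
      DO WHILE (err > tol)
         err = 0.
!$OMP PARALLEL DO PRIVATE(flux) SHARED(x, y, iedge, nedge)
!$OMP& REDUCTION(+:err)
         DO i = 1, nedge
            flux = (y(iedge(i,1)) - y(iedge(i,2))) / 2
! ATOMIC makes the nearly-independent node updates safe
!$OMP ATOMIC
            x(iedge(i,1)) = x(iedge(i,1)) - flux
!$OMP ATOMIC
            x(iedge(i,2)) = x(iedge(i,2)) + flux
            err = err + flux*flux
         END DO
!$OMP END PARALLEL DO
         err = err / nedge
!$OMP PARALLEL DO SHARED(x, y, nnode)
         DO j = 1, nnode
            y(j) = x(j)
         END DO
!$OMP END PARALLEL DO
      END DO

Left unscheduled, the static default keeps each thread on the same edges every sweep, which is exactly the mapping goal of Foil 64.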

Foil 67 Irregular Mesh

Divide edge list among processors
Ideally, would like all edges referring to a given vertex to be assigned to the same processor
  • Often easier said than done

Foil 68 Irregular Mesh: Pictures

(Figure only; image not included in the HTML version.)

Foil 69 Irregular Mesh: Bad Data Order

covered on earlier slide; don't print

Foil 70 Irregular Mesh: Bad Data Order

(Figure only; image not included in the HTML version.)

Foil 71 Irregular Mesh: Good Data Order

covered on earlier slide; don't print

Foil 72 Irregular Mesh: Good Data Order

(Figure only; image not included in the HTML version.)

Foil 73 OpenMP Summary

Based on fork-join parallelism in shared memory
  • Threads start at beginning of parallel region, come back together at end
  • Close to some hardware
  • Linked from traditional languages
Very good for sharing data and incremental parallelization
Unclear if it is feasible for distributed memory
More information at http://www.openmp.org

Foil 74 Three Systems Compared

HPF
  • Good abstraction: data parallelism
  • System can hide many details from programmer
    • Two-edged sword
  • Well-suited for regular problems on machines with locality
MPI
  • Lower-level abstraction: message passing
  • System works everywhere, is usually the first tool available on new systems
  • Well-suited to handling data on distributed memory machines, but requires work up front

Foil 75 Three Systems Compared (2)

OpenMP
  • Good abstraction: fork-join
  • System excellent for incremental parallelization on shared memory
    • No implementations yet on distributed memory
  • Well-suited for any parallel application if locality is not an issue
Can we combine paradigms?
  • Yes, although it's still research

Foil 76 OpenMP + MPI

Modern parallel machines are often shared memory nodes connected by message passing
Can be programmed by calling MPI from OpenMP
  • MPI implementation must be thread-safe
ASCI project is using this heavily
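
A minimal sketch of the hybrid style (the names and the work loop are invented). MPI is called only outside the parallel region here, which keeps the demands on the MPI library's thread safety low:

      PROGRAM hybrid
      INCLUDE 'mpif.h'
      INTEGER ierr, rank, i
      REAL a(1000)
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
! Each MPI process forks an OpenMP team over its local data
!$OMP PARALLEL DO SHARED(a, rank)
      DO i = 1, 1000
         a(i) = REAL(rank) + REAL(i)
      END DO
!$OMP END PARALLEL DO
      CALL MPI_FINALIZE(ierr)
      END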

Foil 77 MPI + HPF

Many applications (like the atmosphere/ocean model) consist of several data-parallel modules
Can link HPF codes on different machines using MPI
  • Requires special MPI implementation and runtime
HPFMPI project at Argonne has done proof-of-concept

Foil 78 HPF + OpenMP

HPF can be implemented by translating it to OpenMP
  • Good idea on shared-memory machines
  • May have real advantages for optimizing locality and data layout
HPF may call OpenMP directly
  • Proposal being made at HPF Users Group meeting next week
  • Not trivial, since HPF and OpenMP may not agree on data layout
    • Things could get worse if MPI is also implemented on OpenMP

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu
