Given by Ian Foster, Gina Goff, Ehtesham Hayder, and Chuck Koelbel at the DoD Modernization Tutorial, 1995-1998. Foils prepared August 29, 1998
Summary of Material
Day 1
|
Day 2
|
Introduction |
Parallelism, Synchronization, and Environments |
Restructuring/Designing Programs in OpenMP |
Example Programs |
A portable fork-join parallel model for shared-memory architectures |
Portable
|
Fork-join model
|
Shared memory
|
Computation(s) using several processors
|
Synchronization
|
Two basic flavors of parallelism: |
Data parallelism (parallel DO loops) and functional parallelism (parallel sections) |
"A flexible standard, easily implemented across different platforms" |
Control structures
|
Data environment
|
Synchronization
|
Runtime library
|
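These four components appear together even in a trivial program. The sketch below is not from the tutorial; it assumes nothing beyond standard OpenMP directives and runtime routines, and simply marks where each component shows up.
PROGRAM hello
INTEGER omp_get_thread_num, omp_get_num_threads
INTEGER id
!$OMP PARALLEL PRIVATE(id)
! PARALLEL is the control structure; PRIVATE(id) sets the data environment
id = omp_get_thread_num()
! omp_get_thread_num and omp_get_num_threads are runtime library calls
!$OMP CRITICAL
! CRITICAL is synchronization: one thread prints at a time
PRINT *, 'Hello from thread', id, 'of', omp_get_num_threads()
!$OMP END CRITICAL
!$OMP END PARALLEL
END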
Introduction |
Parallelism, Synchronization, and Environments |
Restructuring/Designing Programs in OpenMP |
Example Programs |
PARALLEL, END PARALLEL
|
SECTIONS, END SECTIONS
|
SECTION
|
SINGLE, END SINGLE
|
DO, END DO
|
Static Scheduling (default)
|
Dynamic Scheduling
|
[Figure: the loop iterations 1-36 shown twice, illustrating how they are assigned to threads under static scheduling (the default) and under dynamic scheduling] |
!$OMP PARALLEL DO & |
!$OMP SCHEDULE(DYNAMIC,1) |
DO J = 1, 36 |
CALL SUBR(J) |
END DO |
!$OMP END PARALLEL DO |
!$OMP PARALLEL DO & |
!$OMP SCHEDULE(GUIDED,1) |
DO J = 1, 36 |
CALL SUBR(J) |
END DO |
!$OMP END PARALLEL DO |
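SINGLE, listed among the control structures above, does not appear in the later examples. A minimal sketch, assuming hypothetical routines init_data and work (work plays the role of SUBR above):
SUBROUTINE driver()
INTEGER J
!$OMP PARALLEL PRIVATE(J)
!$OMP SINGLE
CALL init_data()
! exactly one thread runs the setup; the END SINGLE barrier holds the others
!$OMP END SINGLE
!$OMP DO SCHEDULE(STATIC)
DO J = 1, 36
CALL work(J)
! iterations are divided among the team in contiguous blocks (static schedule)
END DO
!$OMP END DO
!$OMP END PARALLEL
END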
PROGRAM main |
!$OMP PARALLEL |
CALL foo() |
CALL bar() |
CALL error() |
!$OMP END PARALLEL |
END |
SUBROUTINE error() |
! Not allowed: foo and bar contain work-sharing |
! constructs, so this nests control structs |
!$OMP SECTIONS |
!$OMP SECTION |
CALL foo() |
!$OMP SECTION |
CALL bar() |
!$OMP END SECTIONS |
END |
SUBROUTINE foo() |
!$OMP DO |
DO i = 1, n |
... |
END DO |
!$OMP END DO |
END |
SUBROUTINE bar() |
!$OMP SECTIONS |
!$OMP SECTION |
CALL section1() |
!$OMP SECTION |
... |
!$OMP SECTION |
... |
!$OMP END SECTIONS |
END |
Implicit barriers wait for all threads
|
Explicit directives provide finer control
|
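A sketch of both cases, using hypothetical routines phase1, phase2, and local_cleanup: the first END DO keeps its implicit barrier, while the second adds NOWAIT and relies on an explicit BARRIER placed where the wait is actually needed.
SUBROUTINE two_phases(n)
INTEGER n, i
!$OMP PARALLEL PRIVATE(i)
!$OMP DO
DO i = 1, n
CALL phase1(i)
END DO
!$OMP END DO
! implicit barrier: every thread finishes phase1 before any starts phase2
!$OMP DO
DO i = 1, n
CALL phase2(i)
END DO
!$OMP END DO NOWAIT
! no barrier here: each thread continues into its own cleanup immediately
CALL local_cleanup()
!$OMP BARRIER
! explicit barrier: wait only where the algorithm really requires it
!$OMP END PARALLEL
END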
Data can be PRIVATE or SHARED |
Private data is for local variables |
Shared data is global |
Data can be private to a thread -- any routine executed by that thread can access the data, but other threads cannot see it |
COMMON /mine/ z |
INTEGER x(3), y(3), z |
!$OMP THREADPRIVATE(/mine/) |
!$OMP PARALLEL DO DEFAULT(PRIVATE), SHARED(x) |
DO k = 1, 3 |
x(k) = k |
y(k) = k*k |
z = z + x(k)*y(k) |
END DO |
!$OMP END PARALLEL DO |
[Figure: shared memory holds x = (1, 2, 3) and z = 36; thread 0 has private y = 1 and threadprivate copy z' = 1, thread 1 has y = 4 and z' = 4, thread 2 has y = 9 and z' = 9] |
For controlling execution
|
OMP_NUM_THREADS: How many to use in parallel region?
|
OMP_DYNAMIC: Should runtime system choose number of threads?
|
OMP_NESTED: Should nested parallel regions be supported?
|
OMP_SCHEDULE: Choose DO scheduling option
|
omp_in_parallel(): a runtime library call (not an environment variable) that tells whether the program is currently in a parallel region |
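The first four are environment variables set before the run (for example, setenv OMP_NUM_THREADS 4 in csh); the runtime library offers matching calls usable from inside the program. A small sketch using standard OpenMP library routines:
SUBROUTINE report_settings()
INTEGER omp_get_num_threads
LOGICAL omp_in_parallel, omp_get_dynamic
CALL omp_set_num_threads(4)
! overrides OMP_NUM_THREADS for subsequent parallel regions
PRINT *, 'dynamic thread adjustment:', omp_get_dynamic()
PRINT *, 'inside a parallel region? ', omp_in_parallel()
! .FALSE. here; .TRUE. if called from within a parallel region
!$OMP PARALLEL
!$OMP SINGLE
PRINT *, 'team size:', omp_get_num_threads()
!$OMP END SINGLE
!$OMP END PARALLEL
END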
Introduction |
Parallelism, Synchronization, and Environments |
Restructuring/Designing Programs in OpenMP |
Example Programs |
Profiling |
Walk the loop nests |
Multiple parallel loops |
Is dataset large enough? |
At the top of the list, you should find |
|
What is cumulative percent? |
Watch for system libraries near top
|
Usually the outermost parallel loop
|
Don't be put off by --
|
Nested parallel loops are good
|
Non-nested parallel loops |
|
subroutine fem3d(...) |
10 call addmon(...) |
if(numelh.ne.0) call solide |
subroutine solide |
do 20 i=1,nelt |
do 20 j=1,nelg |
call unpki |
call strain |
call force |
20 continue |
if(...) return |
goto 10 |
subroutine force(...) |
do 10 i=lft,llt |
sgv(i) = sig1(i)-qp(i)*vol(i) |
10 continue |
do 50 n=1,nnc |
i0=ia(n) |
i1=ia(n+1)-1 |
do 50 i=i0,i1 |
e(1,ix(i))=e(1,ix(i))+ep11(i) |
50 continue |
Two-level strategy for parallel processing |
Determining shared and local variables |
Adding synchronization |
Two-level approach isolates major concerns -- makes code easier to update |
Algorithm/Architecture Level
|
Platform Specific Level
|
What are the variable classes? |
Process for determining class |
First private/last private |
Start with access patterns
|
Goal: determine storage classes
|
In general, big things are shared
|
Local variables in the program's subroutines usually become PRIVATE variables in the parallel region |
|
Move up from leaf subroutines to parallel region |
EQUIVALENCE statements complicate this classification -- avoid them where possible |
Examine refs to each var to determine shared list
|
Construct private list and declare private commons by examining the types of remaining variables |
Examine the references to each variable in the parallel region: |
Only read in the parallel region: put it on the shared list |
Modified in the parallel region and subscripted by the parallel loop index (different iterations reference different parts): put it on the shared list |
Modified in the parallel region and not subscripted by the parallel loop index: go to the next page |
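A small sketch of the classification applied to a hypothetical loop: n and b are only read, and a is modified but subscripted by the parallel loop index, so all three go on the shared list; t is modified and not subscripted by the loop index, but it is pure per-iteration scratch, so it becomes private.
SUBROUTINE classify_example(a, b, n)
INTEGER n, i
REAL a(n), b(n), t
!$OMP PARALLEL DO SHARED(a,b,n) PRIVATE(i,t)
DO i = 1, n
t = 2.0*b(i)
! t: modified, not indexed by the loop variable -> private scratch
a(i) = a(i) + t
! a: modified, but different iterations touch different elements -> shared
END DO
!$OMP END PARALLEL DO
END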
LASTPRIVATE copies value(s) from local copy assigned on last iteration of loop to global copy of variables or arrays |
FIRSTPRIVATE copies value(s) from the global variables or arrays into each thread's local copy before that thread's first iteration of the loop |
Parallelizing a loop without knowing whether a value computed inside it is used again afterward (a side effect)? |
subroutine foo(n) |
common /foobar/a(1000),b(1000),x |
c$omp parallel do shared(a,b,n) lastprivate(x) |
do 10 i=1,n |
x=a(i)**2 + b(i)**2 |
10 b(i)= sqrt(x) |
end |
Use LASTPRIVATE because we do not know where, or whether, x in COMMON /foobar/ will be used again |
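FIRSTPRIVATE solves the mirror-image problem: a value set before the loop must be visible inside it even though the variable is private. A sketch with hypothetical names; with DEFAULT(PRIVATE), c1 and c2 would otherwise be undefined inside the loop.
subroutine scaleit(a,n)
common /params/ c1,c2
real a(1000)
c each thread's private copies of c1 and c2 are initialized from /params/
c$omp parallel do default(private) shared(a,n)
c$omp& firstprivate(c1,c2)
do 10 i=1,n
10 a(i) = c1*a(i) + c2
end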
Finding variables that need to be synchronized |
Two frequently used types
|
Doing reductions |
Updates: variables that are invariant with respect to the parallel DO but are read and then written inside it |
Place critical/ordered section around groups of updates |
Pay attention to control flow
|
if (ncycle.eq.0) then |
do 60 i=lft,llt |
dt2=amin1(dtx(i),dt2) |
if (dt2.eq.dtx(i)) then |
ielmtc=128*(ndum-1)+i |
ielmtc=nhex(ielmtc) |
ityptc=1 |
endif |
ielmtd=128*(ndum-1)+i |
ielmtd=nhex(ielmtd) |
write (13,90) ielmtd,dtx(i) |
write (13,100)ielmtc |
60 continue |
endif |
do 70 i=lft,llt |
70 dt2=amin1(dtx(i),dt2) |
if (mess.ne.'sw2.') return |
do 80 i=lft,llt |
if (dt2.eq.dtx(i)) then |
ielmtc=128*(ndum-1)+i |
ielmtc=nhex(ielmtc) |
ityptc=1 |
endif |
80 continue |
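One way to synchronize the updates in loops 70-80 above (a sketch of the pattern, not the authors' code): each thread keeps a local minimum and the matching element number, then merges them inside a critical section so dt2, ielmtc, and ityptc are updated consistently.
c$omp parallel private(i,dt2l,ielmtl)
c$omp& shared(dt2,ielmtc,ityptc,dtx,nhex,ndum,lft,llt)
dt2l = dt2
ielmtl = 0
c$omp do
do 80 i=lft,llt
if (dtx(i).lt.dt2l) then
dt2l = dtx(i)
ielmtl = nhex(128*(ndum-1)+i)
endif
80 continue
c$omp critical
c merge each thread's local result into the shared variables
if (dt2l.lt.dt2) then
dt2 = dt2l
ielmtc = ielmtl
ityptc = 1
endif
c$omp end critical
c$omp end parallel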
Serial program is a reduction: |
sum = 0.0 |
do 10 i=1,n |
10 sum = sum + a(i) |
Correct (but slow) program: |
sum = 0.0 |
c$omp parallel private(i) shared(sum,a,n) |
c$omp do |
do 10 i=1,n |
c$omp critical |
sum = sum + a(i) |
c$omp end critical |
10 continue |
c$omp end parallel |
Incorrect parallel program: |
c$omp parallel private(suml,i) |
c$omp& shared(sum,a,n) |
suml = 0.0 |
c$omp do |
do 10 i=1,n |
10 suml = suml + a(i) |
cbug -- need critical section next |
sum = sum + suml |
c$omp end parallel |
Correct reduction: |
c$omp parallel private(suml,i) |
c$omp& shared(sum,a,n) |
suml = 0.0 |
c$omp do |
do 10 i=1,n |
10 suml = suml + a(i) |
c$omp critical |
sum = sum + suml |
c$omp end critical |
c$omp end parallel |
Using Reduction does the same: |
c$omp parallel private(i) |
c$omp& shared(a,n) |
c$omp& reduction(+:sum) |
c$omp do |
do 10 i=1,n |
10 sum = sum + a(i) |
c$omp end parallel |
Problem: incorrectly pointing to the same place
|
Problem: incorrectly pointing to different places
|
Problem: incorrect initialization of parallel regions
|
Problem: not saving values from parallel regions
|
Problem: unsynchronized access
|
Problem: numerical inconsistency
|
Problem: inconsistently synchronized I/O stmts
|
Problem: inconsistent declarations of common vars
|
Problem: parallel stack size problems
|
Introduction |
Parallelism, Synchronization, and Environments |
Restructuring/Designing Programs in OpenMP |
Example Programs |
Partition
|
Communicate
|
Agglomerate
|
Map
|
Partition
|
Communicate
|
Agglomerate
|
Map
|
Numerically solve a PDE on a square mesh |
Method:
|
Partitioning does not change at all
|
Communication does not change at all
|
Agglomeration analysis changes a little
|
Minimize forking and synchronization overhead
|
Keep each processor working on the same data
|
Lay out data to be contiguous
|
(to be continued) |
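A sketch of the resulting OpenMP kernel for one sweep of the mesh; the array names u and unew and the five-point update are assumptions, not taken from the foils. Parallelizing the outer loop over columns keeps each thread's data contiguous in Fortran's column-major layout, and the default (typically static) schedule keeps each thread on the same columns every sweep.
SUBROUTINE sweep(u, unew, n)
INTEGER n, i, j
REAL u(0:n+1,0:n+1), unew(0:n+1,0:n+1)
!$OMP PARALLEL DO PRIVATE(i,j) SHARED(u,unew,n)
DO j = 1, n
! one fork-join per sweep: parallelize only the outermost loop
DO i = 1, n
unew(i,j) = 0.25*(u(i-1,j)+u(i+1,j)+u(i,j-1)+u(i,j+1))
END DO
END DO
!$OMP END PARALLEL DO
END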
The Problem
|
The Approach
|
REAL x(nnode), y(nnode), flux |
INTEGER iedge(nedge,2) |
err = tol * 1e6 |
DO WHILE (err > tol)
|
END DO |
Flux computations are data-parallel
|
Node updates are nearly data-parallel
|
Error check is a reduction
|
Communication needed for all parts
|
Because of the tight ties between flux, x, and err, it is best to keep the loop intact
|
No differences between computation in different iterations
|
There may be significant differences in data movement based on scheduling |
The ideal:
|
The reality:
|
Divide edge list among processors |
Ideally, would like all edges referring to a given vertex to be assigned to the same processor
|
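A sketch of one way to parallelize the loop body, assuming per-edge storage flux(nedge), a simple difference for the flux, and a sum of squares for the error check; none of these details are in the foils. The node updates use ATOMIC because two edges handled by different threads can touch the same vertex.
SUBROUTINE edge_sweep(x, y, flux, iedge, nnode, nedge, err)
INTEGER nnode, nedge, iedge(nedge,2), i
REAL x(nnode), y(nnode), flux(nedge), err
err = 0.0
!$OMP PARALLEL PRIVATE(i) SHARED(x,y,flux,iedge,nedge)
!$OMP DO
DO i = 1, nedge
flux(i) = x(iedge(i,1)) - x(iedge(i,2))
! flux computation: purely data-parallel over edges
END DO
!$OMP END DO
!$OMP DO
DO i = 1, nedge
!$OMP ATOMIC
y(iedge(i,1)) = y(iedge(i,1)) + flux(i)
!$OMP ATOMIC
y(iedge(i,2)) = y(iedge(i,2)) - flux(i)
! node updates: ATOMIC protects vertices shared between edges
END DO
!$OMP END DO
!$OMP DO REDUCTION(+:err)
DO i = 1, nedge
err = err + flux(i)*flux(i)
! error check: a reduction over all edges
END DO
!$OMP END DO
!$OMP END PARALLEL
END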
Based on fork-join parallelism in shared memory
|
Very good for sharing data and incremental parallelization |
Unclear whether it is feasible for distributed memory |
More information at http://www.openmp.org |
HPF
|
MPI
|
OpenMP
|
Can we combine paradigms?
|
Modern parallel machines are often shared memory nodes connected by message passing |
Can be programmed by calling MPI from an OpenMP program
|
ASCI project is using this heavily |
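A minimal sketch of that style, not any particular ASCI code: MPI is initialized and called only from the serial part of the program, while OpenMP threads share the work within each node. The array a and the sum are invented for illustration.
PROGRAM hybrid
INCLUDE 'mpif.h'
INTEGER ierr, rank, i
REAL a(1000), localsum, globalsum
CALL MPI_INIT(ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
DO i = 1, 1000
a(i) = rank + 1.0
END DO
localsum = 0.0
!$OMP PARALLEL DO PRIVATE(i) REDUCTION(+:localsum)
DO i = 1, 1000
localsum = localsum + a(i)
! OpenMP threads share the loop within one MPI process (one node)
END DO
!$OMP END PARALLEL DO
CALL MPI_ALLREDUCE(localsum, globalsum, 1, MPI_REAL, &
MPI_SUM, MPI_COMM_WORLD, ierr)
! message passing combines the per-node sums across nodes
CALL MPI_FINALIZE(ierr)
END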
Many applications (like the atmosphere/ocean model) consist of several data-parallel modules |
Can link HPF codes on different machines using MPI
|
HPFMPI project at Argonne has done proof-of-concept |
HPF can be implemented by translating it to OpenMP
|
HPF may call OpenMP directly
|