Given by Ian Foster, Gina Goff, Ehtesham Hayder, and Chuck Koelbel at DoD Modernization Tutorials, 1995-1998. Foils prepared August 29, 1998
Summary of Material
This is the first in a four-part tutorial built loosely around Ian Foster's book
|
This Introduction covers
|
Later talks cover
|
(c) 1995, 1996, 1997, 1998 |
Ian Foster Gina Goff |
Ehtesham Hayder Charles Koelbel |
A Tutorial Presented by the Department of Defense HPC Modernization Programming Environment & Training Program |
Day 1
|
Day 2
|
Individual consulting in afternoons |
Day 1
|
Day 2
|
Continuing demands for higher performance |
Physical limits on single processor performance |
High costs of internal concurrency |
Result is rise of multiprocessor architectures
|
Networking is another contributing factor |
Future software must be concurrent & scalable |
Multicomputer = nodes + network |
Node = processor(s) + local memory |
Access to local memory is cheap
|
Access to remote memory is expensive
|
Cost of remote memory access/communication (including synchronization)
|
Hence locality is an important property of good parallel algorithms |
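The cost referred to above is modeled throughout this tutorial in the standard linear form (a sketch; ts and tw are the same quantities tabulated in the cost-model section later on):

    T_msg = t_s + t_w L

where t_s is the per-message startup (latency) cost, t_w the per-word transfer cost, and L the message length in words.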
Major architecture types
|
Model fits current architectures pretty well |
Multiple Instruction/Multiple Data |
Processors with local memory connected by high-speed interconnection network |
Typically high bandwidth, medium latency |
Hardware support for remote memory access |
Model breaks down when topology matters |
Examples: Cray T3E, IBM SP |
Processors access shared memory via bus |
Low latency, high bandwidth |
Bus contention limits scalability |
Search for scalability introduces locality
|
Examples: Cray T90, SGI PCA, Sun |
A hybrid of distributed and shared memory |
Small groups of processors share memory; others access across a scalable network |
Low to moderate latency, high bandwidth |
Model simplifies the multilevel hierarchy |
Examples: SGI Origin, HP Exemplar |
Workstations connected by network |
Cost effective |
High latency, low to moderate bandwidth |
Often lack integrated software environment |
Model breaks down if connectivity limited |
Examples: Ethernet, ATM crossbar, Myrinet |
A parallel computation is a set of tasks |
Each task has local data, can be connected to other tasks by channels |
A task can
|
A receiving task blocks until data available |
Concurrency is enhanced by creating multiple tasks |
Scalability: More tasks than nodes |
Locality: Access local data when possible |
A task (with local data and subtasks) is a unit for modular design |
Mapping to nodes affects performance only |
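As a minimal illustration of these semantics (not part of the original foils; POSIX pipes are used purely as a stand-in for a channel), the C sketch below connects two tasks by one channel; the receiving task's read blocks until the sender's data is available:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd[2];
        if (pipe(fd) == -1) return 1;      /* the channel */
        if (fork() == 0) {                 /* sending task */
            double x = 3.14;               /* its local data (hypothetical value) */
            write(fd[1], &x, sizeof x);    /* send on the channel */
            _exit(0);
        } else {                           /* receiving task */
            double y;
            read(fd[0], &y, sizeof y);     /* blocks until data is available */
            printf("received %g\n", y);
        }
        return 0;
    }

MPI, introduced later in this talk, provides the same send/receive semantics across the nodes of a multicomputer.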
Goal: "Develop an efficient (parallel) solution to a programming problem"
|
We present
|
Partition
|
Communication
|
Agglomeration
|
Mapping
|
Goal: identify opportunities for concurrent execution (define tasks: computation+data) |
Focus on data operated on by algorithm ...
|
... or on the operations performed
|
Identify communication requirements
|
Example: finite difference computation |
Must communicate with each neighbor |
X_i = (X_{i-1} + 2 X_i + X_{i+1}) / 4 |
[Figure: a 1-D row of grid points X1, X2, X3, ...] |
Partition creates one task per point |
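A serial C sketch of one sweep of this stencil (hypothetical names n, x, xnew; in the task/channel view each interior point becomes a task that first receives its neighbors' values over channels, then applies the same update):

    /* Update interior points of a 1-D array using the weighted average above. */
    void sweep(int n, const float x[], float xnew[]) {
        for (int i = 1; i < n - 1; i++)
            xnew[i] = (x[i-1] + 2.0f * x[i] + x[i+1]) / 4.0f;
    }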
Once tasks + communication determined, "agglomerate" small tasks into larger tasks |
Motivations
|
Caveats
|
Place tasks on processors, to
|
Techniques:
|
Simulate atmospheric processes
|
Represent atmosphere state by 3-D grid
|
Computation includes
|
Discretize the (continuous) domain by a regular Nx × Ny × Nz grid
|
Approximate derivatives by finite differences
|
Use domain decomposition
|
Finite difference stencil horizontally
|
Radiation calculations vertically
|
Diagnostic sums
|
In horizontal
|
In vertical, clump all points in column
|
Resulting algorithm "reasonably scalable"
|
Technique depends on load distribution |
1) Agglomerate to one task per processor
|
2) Extend (1) to incorporate cyclic mapping (see the sketch after this list)
|
3) Use dynamic, local load balancing
|
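A sketch of the block vs. cyclic mapping rules mentioned in (1) and (2), with hypothetical helper names (T tasks, P processors):

    /* Block mapping: contiguous chunks of tasks per processor. */
    int block_owner(int task, int T, int P)  { return task / ((T + P - 1) / P); }

    /* Cyclic mapping: consecutive tasks go to successive processors,
       spreading systematic load imbalance across the machine. */
    int cyclic_owner(int task, int P)        { return task % P; }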
Execution time (sums are over the P nodes) is the per-node average of computation, communication, and idle time (see the expression after this list)
|
Computation time comprises both
|
Idle time due to
|
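A sketch of the execution-time expression referred to above, in the form used in Foster's book (per-node computation, communication, and idle times summed over the P nodes and averaged):

    T = (1/P) ( \sum_{i=0}^{P-1} T_i^{comp} + \sum_{i=0}^{P-1} T_i^{comm} + \sum_{i=0}^{P-1} T_i^{idle} )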
Recall cost model
|
Model works well for many algorithms, and on many computers |
Computer          ts (µs)   tw (µs/word) |
IBM SP2               40       0.11 |
Intel Paragon        121       0.07 |
Meiko CS-2            87       0.08 |
Sparc/Ethernet      1500       5.0 |
Sparc/FDDI          1150       1.1 |
(ts = per-message startup time, tw = per-word transfer time; both in microseconds) |
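As an illustrative check of the model against the table (IBM SP2 row), sending a message of L = 1000 words costs roughly

    T_msg = t_s + t_w L = 40 + 0.11 * 1000 = 150 microseconds,

so the startup term still accounts for about a quarter of the total even for a fairly long message.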
Finite difference computation on an N × N × Z (N²Z-point) grid
|
Decompose along one horizontal dimension |
Identical computations at each grid point
|
1-D decomposition, so each node sends 2NZ data to 2 neighbors [if ≥ 2 rows/node]
|
No significant idle time if load balanced
|
Therefore, T = tc N²Z/P + 2 ts + 4 tw N Z |
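Reading that expression term by term (a sketch; tc is the per-grid-point computation time): each node updates N²Z/P points and performs two boundary exchanges of 2NZ words each, so

    T = t_c N^2 Z / P  +  2 ( t_s + 2 t_w N Z )

with the first term the per-node computation and the second the two neighbor messages.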
During design
|
During implementation
|
Consider 2-D and 3-D decompositions
|
2-D Decomposition - On a √P × √P processor grid, messages of size 2(N/√P)Z to 4 neighbors (spelled out after this list), so
|
Good if ts < tw N Z (2 - 1/√P) |
3-D Decomposition - On a Px × Py × Pz grid, |
|
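Taking the message count and size stated for the 2-D case at face value (a sketch), the corresponding time is

    T_2D = t_c N^2 Z / P + 4 t_s + 8 t_w N Z / \sqrt{P}

i.e., more message startups than the 1-D decomposition, but less data volume per step.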
What we have here is a failure to communicate... |
Multicomputer model assumes comm cost independent of location & other comms |
Real networks are not fully connected |
Multicomputer model can break down |
Ethernet |
2-D Mesh |
In many cases, a bandwidth-constrained model can give sufficiently accurate results
|
Example: finite difference on Ethernet
|
Bandwidth-constrained model gives better fit |
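One simple bandwidth-constrained refinement (a sketch, not the only possible form): if S processors must send over a shared medium such as an Ethernet bus at the same time, scale the per-word term by S:

    T_msg = t_s + S t_w L

which reduces to the original model when S = 1 and captures why the fully-connected model overestimates performance on Ethernet.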
High Performance Fortran (HPF) |
Message Passing Interface (MPI) |
Parallel Computing Forum (PCF) and OpenMP™ |
Portable, Extensible Toolkit for Scientific Computations (PETSc) |
A standard data-parallel language
|
Programmer specifies
|
Compiler infers
|
PROGRAM hpf_finite_difference |
!HPF$ PROCESSORS pr(4) |
REAL x(100,100), new(100,100) |
!HPF$ ALIGN new(:,:) WITH x(:,:) |
!HPF$ DISTRIBUTE x(BLOCK, *) ONTO pr |
new(2:99,2:99) = (x(1:98,2:99)+x(3:100,2:99)+ & |
& x(2:99,1:98)+x(2:99,3:100)) / 4 |
diff = MAXVAL(ABS(new-x)) |
end |
Advantages
|
Disadvantages
|
Good for regular, SPMD problems |
A standard message-passing library
|
An MPI program defines a set of processes
|
... that communicate by calling MPI functions
|
... and can be constructed in a modular fashion
|
#include <mpi.h>
/* Finite difference fragment: declarations of the local data array,
   lsize, the neighbor ranks lnbr and rnbr, np, ldiff, diff, status,
   and the grid update itself are omitted on this foil. */
int main(int argc, char *argv[]) {
  MPI_Comm com = MPI_COMM_WORLD;
  MPI_Init(&argc,&argv);
  MPI_Comm_size(com,&np);
  /* exchange boundary values with the left and right neighbors */
  MPI_Send(local+1,1,MPI_FLOAT,lnbr,10,com);
  MPI_Recv(local,1,MPI_FLOAT,rnbr,10,com,&status);
  MPI_Send(local+lsize,1,MPI_FLOAT,rnbr,10,com);
  MPI_Recv(local+lsize+1,1,MPI_FLOAT,lnbr,10,com,&status);
  ldiff = maxerror(local);
  /* reduce the per-process errors to a global maximum */
  MPI_Allreduce(&ldiff,&diff,1,MPI_FLOAT,MPI_MAX,com);
  MPI_Finalize();
}
Advantages:
|
Disadvantages:
|
Good for performance-critical codes with natural task modularity |
Standardization (circa 1993) of shared memory parallelism in Fortran |
A PCF program is multithreaded, with explicit synchronization between the threads and shared variables |
A PCF program is divided into regions
|
PCF per se was not widely implemented
|
Its ideas resurfaced in OpenMP
|
!$DOACROSS, LOCAL(I), SHARE(A,B,C), |
!$& REDUCTION(X), |
!$& IF (N.GT.1000), |
!$& MP_SCHEDTYPE=DYNAMIC, CHUNK=100
|
!$DOACROSS, LOCAL(I), SHARE(D,E), |
!$& MP_SCHEDTYPE=SIMPLE
|
X is a summation |
Conditional parallelization |
Iterations managed first-come, first-served, in blocks of 100 |
Iterations blocked evenly among threads (INTERLEAVE, GSS, RUNTIME scheduling also available) |
PCF standard |
!$OMP PARALLEL DO SHARED(A,B,C), &
!$OMP REDUCTION(+:X), &
!$OMP SCHEDULE(DYNAMIC, 100)
|
!$OMP END PARALLEL DO
!$OMP PARALLEL DO SHARED(D,E), &
!$OMP SCHEDULE(STATIC)
|
!$OMP END PARALLEL DO
X is a summation |
Iterations managed first-come, first-served, in blocks of 100 |
Iterations blocked evenly among threads (GUIDED scheduling also available) |
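For comparison, the same pattern expressed in C with OpenMP (a sketch with made-up arrays a, b, c and sum x; not from the original foils):

    #include <stdio.h>
    #define N 100000

    int main(void) {
        static double a[N], b[N], c[N];   /* static: zero-initialized, off the stack */
        double x = 0.0;

        /* Parallel loop with a sum reduction on x and dynamic scheduling
           in chunks of 100 iterations, mirroring the directives above. */
        #pragma omp parallel for shared(a, b, c) reduction(+:x) schedule(dynamic, 100)
        for (int i = 0; i < N; i++) {
            a[i] = b[i] + c[i];
            x += a[i];
        }

        printf("x = %f\n", x);
        return 0;
    }

With an OpenMP-aware compiler the loop iterations are divided among threads just as in the Fortran version; without one, the pragma is ignored and the code runs serially.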
Advantages
|
Disadvantages
|
A good choice for shared-memory and DSM machines, but portability is still hard |
A higher-level approach to solving PDEs
|
User-level library provides:
|
Programmer supplies:
|
SNES snes
Mat J
Vec x,F
integer n,its,ierr
! EvaluateFunction and EvaluateJacobian are user-supplied routines
call MatCreate(MPI_COMM_WORLD,n,n,J,ierr)
call VecCreate(MPI_COMM_WORLD,n,x,ierr)
call VecDuplicate(x,F,ierr)
call SNESCreate(MPI_COMM_WORLD,SNES_NONLINEAR_EQUATIONS,snes,ierr)
call SNESSetFunction(snes,F,EvaluateFunction,PETSC_NULL,ierr)
call SNESSetJacobian(snes,J,EvaluateJacobian,PETSC_NULL,ierr)
call SNESSetFromOptions(snes,ierr)
call SNESSolve(snes,x,its,ierr)
call SNESDestroy(snes,ierr)
A rather different beast from the other tools
|
Advantages:
|
Disadvantages:
|
Good for implicit or explicit PDE solutions |