Full HTML for Basic Foilset: Designing and Building Parallel Programs I: Introduction

Given by Ian Foster, Gina Goff, Ehtesham Hayder, and Chuck Koelbel at the DoD Modernization Tutorial, 1995-1998. Foils prepared August 29, 1998.


This is the first in a four-part tutorial built loosely around Ian Foster's book
This introduction covers
  • Parallel Computers and Algorithms
  • Understanding Performance
  • Parallel Programming Tools
Later talks cover
  • 2: OpenMP Shared Memory Programming
  • 3: MPI Message Passing
  • 4: The PETSc Library

Table of Contents for full HTML of Designing and Building Parallel Programs I: Introduction


1 Designing and Building Parallel Programs
2 Outline
3 Outline
4 Why Parallel Computing?
5 The Multicomputer: an Idealized Parallel Computer
6 Multicomputer Architecture
7 Multicomputer Cost Model
8 How do Real Parallel Computers Fit the Model?
9 Distributed Memory MIMD Multiprocessor
10 Shared Memory MIMD Multiprocessor
11 Distributed Shared Memory (DSM)
12 Workstation Clusters
13 A Simple Parallel Programming Model
14 Properties
15 Parallel Algorithm Design
16 A Design Methodology
17 Partitioning
18 Communication
19 Agglomeration
20 Mapping
21 Example: Atmosphere Model
22 Atmosphere Model: Numerical Methods
23 Atmosphere Model: Partition
24 Atmosphere Model: Communication
25 Atmosphere Model: Agglomeration
26 Atmosphere Model: Mapping
27 Modeling Performance
28 Bandwidth and Latency
29 Measured Costs
30 Typical Communication Costs
31 Example: Finite Difference
32 Time for Finite Difference
33 Using Performance Models
34 Design Alternatives: Finite Difference
35 Design Alternatives (2)
36 Finding Model Discrepancies
37 Impact of Network Topology
38 Competition for Bandwidth
39 Bandwidth-Constrained Model versus Observations
40 Tool Survey
41 High Performance Fortran (HPF)
42 HPF Example
43 HPF Analysis
44 Message Passing Interface (MPI)
45 MPI Example
46 MPI Analysis
47 PCF
48 PCF and OpenMP
49 PCF Example (SGI Variant)
50 OpenMP Example
51 PCF/OpenMP Analysis
52 Portable, Extensible Toolkit for Scientific Computations (PETSc)
53 PETSc Example
54 PETSc Analysis




Foil 1 Designing and Building Parallel Programs

(c) 1995, 1996, 1997, 1998
Ian Foster, Gina Goff, Ehtesham Hayder, Charles Koelbel
A Tutorial Presented by the Department of Defense HPC Modernization Programming Environment & Training Program

Foil 2 Outline

Day 1
  • Introduction to Parallel Programming
  • The OpenMP Programming Language
Day 2
  • Introduction to MPI
  • The PETSc Library
Individual consulting in afternoons

Foil 3 Outline

Day 1
  • Introduction to Parallel Programming
    • Parallel Computers & Algorithms
    • Understanding Performance
    • Parallel Programming Tools
  • The OpenMP Programming Language
Day 2
  • Introduction to MPI
  • The PETSc Library

Foil 4 Why Parallel Computing?

Continuing demands for higher performance
Physical limits on single processor performance
High costs of internal concurrency
Result is rise of multiprocessor architectures
  • And number of processors will continue to increase
Networking is another contributing factor
Future software must be concurrent & scalable

Foil 5 The Multicomputer: an Idealized Parallel Computer


Foil 6 Multicomputer Architecture

Multicomputer = nodes + network
Node = processor(s) + local memory
Access to local memory is cheap
  • 10s of cycles; does not involve network
  • Conventional memory reference
Access to remote memory is expensive
  • 100s or 1000s of cycles; uses network
  • Use I/O-like mechanisms (e.g., send/receive)

Foil 7 Multicomputer Cost Model

Cost of remote memory access/communication (including synchronization)
  • T = ts + N tw
  • ts = per-message cost ("latency")
  • tw = per-word cost
  • N = message size in words
Hence locality is an important property of good parallel algorithms
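A quick way to get a feel for this model is to evaluate it directly. The small C sketch below is illustrative only (the function name and the sample numbers are mine, not from the foils); ts, tw, and nwords are exactly the quantities defined above.

/* Estimated time for one message under the multicomputer cost model
   T = ts + N*tw.  Times in microseconds, message size N in words.   */
double message_time(double ts, double tw, long nwords) {
    return ts + (double)nwords * tw;
}

With ts = 100 and tw = 0.1 microseconds, a 1-word message costs about 100.1 microseconds but a 10,000-word message only about 1,100, which is why sending a few large messages beats sending many small ones and why locality matters.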

Foil 8 How do Real Parallel Computers Fit the Model?

Major architecture types
  • Distributed memory MIMD
  • Shared memory MIMD
  • Distributed shared memory (DSM)
  • Workstation clusters and "metacomputers"
Model fits current architectures pretty well

Foil 9 Distributed Memory MIMD Multiprocessor

Multiple Instruction/Multiple Data
Processors with local memory connected by high-speed interconnection network
Typically high bandwidth, medium latency
Hardware support for remote memory access
Model breaks down when topology matters
Examples: Cray T3E, IBM SP

Foil 10 Shared Memory MIMD Multiprocessor

Processors access shared memory via bus
Low latency, high bandwidth
Bus contention limits scalability
Search for scalability introduces locality
  • Cache (a form of local memory)
  • Multistage architectures (some memory closer)
Examples: Cray T90, SGI PCA, Sun

Foil 11 Distributed Shared Memory (DSM)

A hybrid of distributed and shared memory
Small groups of processors share memory; others access across a scalable network
Low to moderate latency, high bandwidth
Model simplifies the multilevel hierarchy
Examples: SGI Origin, HP Exemplar

Foil 12 Workstation Clusters

Workstations connected by network
Cost effective
High latency, low to moderate bandwidth
Often lack integrated software environment
Model breaks down if connectivity limited
Examples: Ethernet, ATM crossbar, Myrinet

Foil 13 A Simple Parallel Programming Model

A parallel computation is a set of tasks
Each task has local data, can be connected to other tasks by channels
A task can
  • Compute using local data
  • Send to/receive from other tasks
  • Create new tasks, or terminate itself
A receiving task blocks until data available

Foil 14 Properties

Concurrency is enhanced by creating multiple tasks
Scalability: More tasks than nodes
Locality: Access local data when possible
A task (with local data and subtasks) is a unit for modular design
Mapping to nodes affects performance only

Foil 15 Parallel Algorithm Design

Goal: "Develop an efficient (parallel) solution to a programming problem"
  • Identify sensible parallel algorithms
  • Evaluate performance and complexity
We present
  • A systematic design methodology
  • Some basic design techniques
  • Illustrative examples

Foil 16 A Design Methodology

Partition
  • Define tasks
Communication
  • Identify requirements
Agglomeration
  • Enhance locality
Mapping
  • Place tasks

Foil 17 Partitioning

Goal: identify opportunities for concurrent execution (define tasks: computation+data)
Focus on data operated on by algorithm ...
  • Then distribute computation appropriately
  • Domain decomposition
... or on the operations performed
  • Then distribute data appropriately
  • Functional decomposition

Foil 18 Communication

Identify communication requirements
  • If computation in one task requires data located in another, communication is needed
Example: finite difference computation
  • Each point is updated as X(i) = ( X(i-1) + 2*X(i) + X(i+1) ) / 4
  • So each task must communicate with both of its neighbors
[Figure: a 1-D row of grid points X1, X2, X3, ...; the partition creates one task per point]
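A minimal serial sketch of one sweep of this update (my own illustration, not from the foils); the x[0] and x[n-1] references at the ends of a task's block are exactly the values that must be communicated from neighboring tasks.

/* One sweep of the 1-D finite difference update over a task's local
   points.  x[0] and x[n-1] are boundary values owned by neighbors.  */
void sweep(int n, const float *x, float *xnew) {
    for (int i = 1; i < n - 1; i++)
        xnew[i] = (x[i-1] + 2.0f * x[i] + x[i+1]) / 4.0f;
}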

Foil 19 Agglomeration

Once tasks + communication determined, "agglomerate" small tasks into larger tasks
Motivations
  • To reduce communication costs
  • If tasks cannot execute concurrently
  • To reduce software engineering costs
Caveats
  • May involve replicating computation or data

Foil 20 Mapping

Place tasks on processors, to
  • Maximize concurrency
  • Minimize communication
  • (note the potential conflict between these two goals)
Techniques:
  • Regular problems: agglomerate to P tasks
  • Irregular problems: use static load balancing
  • If irregular in time: dynamic load balancing

Foil 21 Example: Atmosphere Model

Simulate atmospheric processes
  • Conservation of momentum, mass, energy
  • Ideal gas law, hydrostatic approximation
Represent atmosphere state by 3-D grid
  • Periodic in two horizontal dimensions
  • Nx × Ny × Nz grid points: e.g., Ny = 50-500, Nx = 2Ny, Nz = 15-30
Computation includes
  • Atmospheric dynamics: finite difference
  • "Physics" (radiation etc.) in vertical only

Foil 22 Atmosphere Model: Numerical Methods

Discretize the (continuous) domain by a regular Nx × Ny × Nz grid
  • Store p, u, v, T, ... at every grid point
Approximate derivatives by finite differences
  • Leads to stencils in vertical and horizontal

Foil 23 Atmosphere Model: Partition

Use domain decomposition
  • Because model operates on large, regular grid
  • Can decompose in 1, 2, or 3 dimensions
  • 3-D decomposition offers greatest flexibility

Foil 24 Atmosphere Model: Communication

Finite difference stencil horizontally
  • Local, regular, structured
Radiation calculations vertically
  • Global, regular, structured
Diagnostic sums
  • Global, regular, structured

Foil 25 Atmosphere Model: Agglomeration

In horizontal
  • Clump so that 4 points per task
  • Efficiency: communicate with 4 neighbors only
In vertical, clump all points in column
  • Performance: avoid communication
  • Modularity: Reuse physics modules
Resulting algorithm "reasonably scalable"
  • (Nx × Ny)/4: at least 1250 tasks

Foil 26 Atmosphere Model: Mapping

Technique depends on load distribution
1) Agglomerate to one task per processor
  • Appropriate if little load imbalance
2) Extend (1) to incorporate cyclic mapping
  • Works well for diurnal cycle imbalance
3) Use dynamic, local load balancing
  • Works well for unpredictable, local imbalances

Foil 27 Modeling Performance

Execution time (sums are over P nodes) is
  • T = (sum[Tcomp]+ sum[Tcomm] + sum[Tidle])/P
Computation time comprises both
  • Operations required by sequential algorithm
  • Additional work, replicated work
Idle time due to
  • Load imbalance, and/or
  • Latency (waiting for remote data)
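As a quick illustrative check (the numbers are mine, not from the foil): with P = 4 nodes, each spending 10 s computing, 2 s communicating, and 1 s idle, T = (40 + 8 + 4)/4 = 13 s.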

Foil 28 Bandwidth and Latency

Recall cost model
  • T = ts + N tw
  • ts = per-message cost ("latency")
  • tw = per-word cost (1/ tw = bandwidth)
  • N = message size in words
Model works well for many algorithms, and on many computers

Foil 29 Measured Costs


Foil 30 Typical Communication Costs

Computer          ts (microsec)   tw (microsec/word)
IBM SP2                 40              0.11
Intel Paragon          121              0.07
Meiko CS-2              87              0.08
Sparc/Ethernet        1500              5.0
Sparc/FDDI            1150              1.1
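For example, plugging the SP2 figures into T = ts + N tw, a 1,000-word message is predicted to take about 40 + 1000 × 0.11 ≈ 150 microseconds, while the same message over Sparc/Ethernet takes about 1500 + 1000 × 5.0 ≈ 6,500 microseconds, roughly 40 times longer.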

Foil 31 Example: Finite Difference

Finite difference computation on an N × N × Z grid
  • 9-point stencil
  • Similar to atmosphere model earlier
Decompose along one horizontal dimension

Foil 32 Time for Finite Difference

Identical computations at each grid point
  • Tcomp = tc N²Z (tc is the compute time per grid point)
1-D decomposition, so each node sends 2NZ words to each of its 2 neighbors [if ≥ 2 rows per node]
  • Tcomm = P (2 ts + 4 tw NZ)
No significant idle time if load balanced
  • Tidle = 0
Therefore, T = tc N²Z/P + 2 ts + 4 tw NZ

Foil 33 Using Performance Models

During design
  • Use models to study qualitative behavior
  • Calibrate models by measuring tc, ts, tw, etc.
  • Use calibrated models to evaluate design alternatives
During implementation
  • Compare predictions with observations
  • Relate discrepancies to implementation or model
  • Use models to guide optimization process

Foil 34 Design Alternatives: Finite Difference

Consider 2-D and 3-D decompositions
  • Are they ever a win?
  • If so, when?

Foil 35 Design Alternatives (2)

2-D decomposition: on a √P × √P processor grid, messages of size 2NZ/√P go to 4 neighbors, so
  • T = tc N²Z/P + 4 ts + 2 tw NZ/√P
Good if ts < tw NZ (2 - 1/√P)
3-D decomposition: on a Px × Py × Pz processor grid,
  • T = tc N²Z/P + 6 ts + 2 tw ( N²/(Px·Py) + NZ/(Px·Pz) + NZ/(Py·Pz) )
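To see when the 2-D decomposition wins, the two expressions can simply be evaluated side by side. The C sketch below is illustrative only; the parameter values (N, Z, P, tc, ts, tw) are my own, chosen roughly in line with the SP2 figures given earlier.

#include <stdio.h>
#include <math.h>

/* Evaluate the 1-D and 2-D decomposition models for the finite
   difference example.  Times in seconds, grid sizes in points.  */
int main(void) {
    double N = 512, Z = 16, P = 64;
    double tc = 1.0e-6;                 /* compute time per grid point */
    double ts = 40.0e-6, tw = 0.11e-6;  /* roughly the SP2 costs       */
    double t1d = tc*N*N*Z/P + 2.0*ts + 4.0*tw*N*Z;
    double t2d = tc*N*N*Z/P + 4.0*ts + 2.0*tw*N*Z/sqrt(P);
    printf("1-D: %.4f s   2-D: %.4f s\n", t1d, t2d);
    return 0;
}

With these numbers the 2-D version is slightly faster, consistent with the condition ts < tw NZ (2 - 1/√P) above; on a network with much higher latency the extra messages of the 2-D decomposition would instead make it lose.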

Foil 36 Finding Model Discrepancies

What we have here is a failure to communicate...

Foil 37 Impact of Network Topology

Multicomputer model assumes comm cost independent of location & other comms
Real networks are not fully connected
Multicomputer model can break down
[Figure: example networks with limited connectivity, an Ethernet bus and a 2-D mesh]

Foil 38 Competition for Bandwidth

In many cases, a bandwidth-constrained model can give sufficiently accurate results
  • "If S processes must communicate over the same wire at the same time, each gets 1/S of the bandwidth"
Example: finite difference on Ethernet
  • All processors share a single Ethernet
  • Hence the bandwidth term is scaled by P
  • T = tc N²Z/P + 2 ts + 4 tw NZ P
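Note that in this model the communication term grows linearly with P while the computation term shrinks as 1/P, so beyond some processor count adding nodes actually increases total time on a shared Ethernet.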

Foil 39 Bandwidth-Constrained Model versus Observations

Bandwidth-constrained model gives better fit

Foil 40 Tool Survey

High Performance Fortran (HPF)
Message Passing Interface (MPI)
Parallel Computing Forum (PCF) and OpenMP™
Portable, Extensible Toolkit for Scientific Computations (PETSc)

Foil 41 High Performance Fortran (HPF)

A standard data-parallel language
  • CM Fortran, C*, HPC++ are related
Programmer specifies
  • Concurrency (concurrent operations on arrays)
  • Locality (data distribution)
Compiler infers
  • Mapping of computation ("owner-computes" rule)
  • Communication

Foil 42 HPF Example

PROGRAM hpf_finite_difference
! Declare 4 abstract processors; distribute x by blocks of rows and
! keep new aligned with x so both arrays are distributed identically.
!HPF$ PROCESSORS pr(4)
REAL x(100,100), new(100,100)
!HPF$ ALIGN new(:,:) WITH x(:,:)
!HPF$ DISTRIBUTE x(BLOCK, *) ONTO pr
! Four-neighbor average over the interior; the compiler infers the
! communication needed at block boundaries (owner-computes rule).
new(2:99,2:99) = (x(1:98,2:99)+x(3:100,2:99)+ &
& x(2:99,1:98)+x(2:99,3:100)) / 4
diff = MAXVAL(ABS(new-x))   ! global reduction
END

Foil 43 HPF Analysis

Advantages
  • High level: preserves sequential semantics
  • Standard
Disadvantages
  • Restricted applicability
  • Requires sophisticated compiler technology
Good for regular, SPMD problems

Foil 44 Message Passing Interface (MPI)

A standard message-passing library
  • p4, NX, Express, and PARMACS are precursors
An MPI program defines a set of processes
  • (usually one process per node)
... that communicate by calling MPI functions
  • (point-to-point and collective)
... and can be constructed in a modular fashion
  • (communicators are the key)

Foil 45 MPI Example

#include <mpi.h>

/* Ghost-cell exchange for a 1-D decomposition.  Declarations of np,
   lnbr, rnbr, lsize, local[], and maxerror() were elided on the
   original foil and are assumed to exist elsewhere. */
main(int argc, char *argv[]) {
  MPI_Comm com = MPI_COMM_WORLD;
  MPI_Status status;
  float ldiff, diff;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(com, &np);
  /* exchange boundary values with left (lnbr) and right (rnbr) neighbors */
  MPI_Send(local+1, 1, MPI_FLOAT, lnbr, 10, com);
  MPI_Recv(local, 1, MPI_FLOAT, rnbr, 10, com, &status);
  MPI_Send(local+lsize, 1, MPI_FLOAT, rnbr, 10, com);
  MPI_Recv(local+lsize+1, 1, MPI_FLOAT, lnbr, 10, com, &status);
  ldiff = maxerror(local);                    /* local error */
  MPI_Allreduce(&ldiff, &diff, 1, MPI_FLOAT, MPI_MAX, com);  /* global max */
  MPI_Finalize();
}

Foil 46 MPI Analysis

Advantages:
  • Wide availability of efficient implementations
  • Support for modular design, code reuse
Disadvantages:
  • Low level ("parallel assembly code")
  • Less well-suited to shared-memory machines
Good for performance-critical codes with natural task modularity

Foil 47 PCF

Standardization (circa 1993) of shared memory parallelism in Fortran
A PCF program is multithreaded, with explicit synchronization between the threads and shared variables
A PCF program is divided into regions
  • Serial regions: Only the master thread executes
  • Parallel regions: Work shared by all threads

Foil 48 PCF and OpenMP

PCF per se was not widely implemented
  • Timing: Distributed memory became popular
  • Complexity: Many details for special cases
  • Not Invented Here (NIH) syndrome
Its ideas resurfaced in OpenMP
  • Primary differences are the spelling and low-level controls
  • Also some significant simplification (claimed to add scalability)

Foil 49 PCF Example (SGI Variant)

!$DOACROSS, LOCAL(I), SHARE(A,B,C),
!$& REDUCTION(X),
!$& IF (N.GT.1000),
!$& MP_SCHEDTYPE=DYNAMIC, CHUNK=100
      DO I = 2, N-1
        A(I) = (B(I-1)+B(I)+B(I+1)) / 3
        X = X + A(I)/C(I)
      END DO
!$DOACROSS, LOCAL(I), SHARE(D,E),
!$& MP_SCHEDTYPE=SIMPLE
      DO I = 1, N
        D(I) = SIN(E(I))
      END DO
Notes:
  • REDUCTION(X): X is a summation
  • IF (N.GT.1000): conditional parallelization
  • MP_SCHEDTYPE=DYNAMIC, CHUNK=100: iterations managed first-come, first-served, in blocks of 100
  • MP_SCHEDTYPE=SIMPLE: iterations blocked evenly among threads (INTERLEAVE, GSS, RUNTIME scheduling also available)
  • PCF standard

Foil 50 OpenMP Example

!$OMP PARALLEL DO SHARED(A,B,C) &
!$OMP   REDUCTION(+:X) &
!$OMP   SCHEDULE(DYNAMIC, 100)
      DO I = 2, N-1
        A(I) = (B(I-1)+B(I)+B(I+1)) / 3
        X = X + A(I)/C(I)
      END DO
!$OMP END PARALLEL DO
!$OMP PARALLEL DO SHARED(D,E) &
!$OMP   SCHEDULE(STATIC)
      DO I = 1, N
        D(I) = SIN(E(I))
      END DO
!$OMP END PARALLEL DO
Notes:
  • REDUCTION(+:X): X is a summation
  • SCHEDULE(DYNAMIC, 100): iterations managed first-come, first-served, in blocks of 100
  • SCHEDULE(STATIC): iterations blocked evenly among threads (GUIDED scheduling also available)

Foil 51 PCF/OpenMP Analysis

Advantages
  • Convenient for shared memory, especially when using vendor extensions
Disadvantages
  • Tied strongly to shared memory
  • Few standard features for locality control
A good choice for shared-memory and DSM machines, but portability is still hard

Foil 52 Portable, Extensible Toolkit for Scientific Computations (PETSc)

A higher-level approach to solving PDEs
  • Not parallel per se, but easy to use that way
User-level library provides:
  • Linear and nonlinear solvers
  • Standard and advanced options (e.g. parallel preconditioners)
Programmer supplies:
  • Application-specific set-up, data structures, PDE operators

Foil 53 PETSc Example

SNES snes
Mat J
Vec x, F
integer n, its, ierr
! create the Jacobian matrix, solution vector x, and residual vector F
call MatCreate(MPI_COMM_WORLD,n,n,J,ierr)
call VecCreate(MPI_COMM_WORLD,n,x,ierr)
call VecDuplicate(x,F,ierr)
! set up the nonlinear solver; EvaluateFunction and EvaluateJacobian
! are user-supplied routines that encode the PDE
call SNESCreate(MPI_COMM_WORLD,SNES_NONLINEAR_EQUATIONS,snes,ierr)
call SNESSetFunction(snes,F,EvaluateFunction,PETSC_NULL,ierr)
call SNESSetJacobian(snes,J,EvaluateJacobian,PETSC_NULL,ierr)
call SNESSetFromOptions(snes,ierr)
call SNESSolve(snes,b,x,its,ierr)
call SNESDestroy(snes,ierr)

Foil 54 PETSc Analysis

A rather different beast from the other tools
  • The "P" does not stand for "Parallel"
  • Most concurrency in user code, often MPI
Advantages:
  • Easy access to advanced numerical methods
Disadvantages:
  • Limited scope
Good for implicit or explicit PDE solutions

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu
