Given by Ian Foster, Gina Goff, Ehtesham Hayder, and Chuck Koelbel at DoD Modernization Tutorials, 1995-1998. Foils prepared August 29, 1998
Summary of Material
This is the first in a four-part tutorial built loosely around Ian Foster's book
|
This Introduction covers
|
Later talks cover
|
(c) 1995, 1996, 1997, 1998 |
Ian Foster Gina Goff |
Ehtesham Hayder Charles Koelbel |
A Tutorial Presented by the Department of Defense HPC Modernization Programming Environment & Training Program |
Day 1
|
Day 2
|
Individual consulting in afternoons |
Day 1
|
Day 2
|
Continuing demands for higher performance |
Physical limits on single processor performance |
High costs of internal concurrency |
Result is rise of multiprocessor architectures
|
Networking is another contributing factor |
Future software must be concurrent & scalable |
Multicomputer = nodes + network |
Node = processor(s) + local memory |
Access to local memory is cheap
|
Access to remote memory is expensive
|
Cost of remote memory access/communication (including synchronization)
|
Hence locality is an important property of good parallel algorithms |
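The cost referred to above is modeled throughout this tutorial in the standard linear form (a sketch; ts and tw are the same quantities tabulated in the cost-model section later on):

    T_msg = t_s + t_w L

where t_s is the per-message startup (latency) cost, t_w the per-word transfer cost, and L the message length in words.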
Major architecture types
|
Model fits current architectures pretty well |
Multiple Instruction/Multiple Data |
Processors with local memory connected by high-speed interconnection network |
Typically high bandwidth, medium latency |
Hardware support for remote memory access |
Model breaks down when topology matters |
Examples: Cray T3E, IBM SP |
Processors access shared memory via bus |
Low latency, high bandwidth |
Bus contention limits scalability |
Search for scalability introduces locality
|
Examples: Cray T90, SGI PCA, Sun |
A hybrid of distributed and shared memory |
Small groups of processors share memory; others access across a scalable network |
Low to moderate latency, high bandwidth |
Model simplifies the multilevel hierarchy |
Examples: SGI Origin, HP Exemplar |
Workstations connected by network |
Cost effective |
High latency, low to moderate bandwidth |
Often lack integrated software environment |
Model breaks down if connectivity limited |
Examples: Ethernet, ATM crossbar, Myrinet |
A parallel computation is a set of tasks |
Each task has local data, can be connected to other tasks by channels |
A task can
|
A receiving task blocks until data available |
Concurrency is enhanced by creating multiple tasks |
Scalability: More tasks than nodes |
Locality: Access local data when possible |
A task (with local data and subtasks) is a unit for modular design |
Mapping to nodes affects performance only |
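As a minimal illustration of these semantics (not part of the original foils; POSIX pipes are used purely as a stand-in for a channel), the C sketch below connects two tasks by one channel; the receiving task's read blocks until the sender's data is available:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd[2];
        if (pipe(fd) == -1) return 1;      /* the channel */
        if (fork() == 0) {                 /* sending task */
            double x = 3.14;               /* its local data (hypothetical value) */
            write(fd[1], &x, sizeof x);    /* send on the channel */
            _exit(0);
        } else {                           /* receiving task */
            double y;
            read(fd[0], &y, sizeof y);     /* blocks until data is available */
            printf("received %g\n", y);
        }
        return 0;
    }

MPI, introduced later in this talk, provides the same send/receive semantics across the nodes of a multicomputer.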
Goal: "Develop an efficient (parallel) solution to a programming problem"
|
We present
|
Partition
|
Communication
|
Agglomeration
|
Mapping
|
Goal: identify opportunities for concurrent execution (define tasks: computation+data) |
Focus on data operated on by algorithm ...
|
... or on the operations performed
|
Identify communication requirements
|
Example: finite difference computation |
Must communicate with each neighbor |
X_i = (X_{i-1} + 2 X_i + X_{i+1}) / 4 |
[Figure: a 1-D row of grid points X1, X2, X3, ...] |
Partition creates one task per point |
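A serial C sketch of one sweep of this stencil (hypothetical names n, x, xnew; in the task/channel view each interior point becomes a task that first receives its neighbors' values over channels, then applies the same update):

    /* Update interior points of a 1-D array using the weighted average above. */
    void sweep(int n, const float x[], float xnew[]) {
        for (int i = 1; i < n - 1; i++)
            xnew[i] = (x[i-1] + 2.0f * x[i] + x[i+1]) / 4.0f;
    }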
Once tasks + communication determined, "agglomerate" small tasks into larger tasks |
Motivations
|
Caveats
|
Place tasks on processors, to
|
Techniques:
|
Simulate atmospheric processes
|
Represent atmosphere state by 3-D grid
|
Computation includes
|
Discretize the (continuous) domain by a regular Nx × Ny × Nz grid
|
Approximate derivatives by finite differences
|
Use domain decomposition
|
Finite difference stencil horizontally
|
Radiation calculations vertically
|
Diagnostic sums
|
In horizontal
|
In vertical, clump all points in column
|
Resulting algorithm "reasonably scalable"
|
Technique depends on load distribution |
1) Agglomerate to one task per processor
|
2) Extend (1) to incorporate cyclic mapping (see the sketch after this list)
|
3) Use dynamic, local load balancing
|
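A sketch of the block vs. cyclic mapping rules mentioned in (1) and (2), with hypothetical helper names (T tasks, P processors):

    /* Block mapping: contiguous chunks of tasks per processor. */
    int block_owner(int task, int T, int P)  { return task / ((T + P - 1) / P); }

    /* Cyclic mapping: consecutive tasks go to successive processors,
       spreading systematic load imbalance across the machine. */
    int cyclic_owner(int task, int P)        { return task % P; }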
Execution time (sums are over the P nodes) is the per-node average of computation, communication, and idle time (see the expression after this list)
|
Computation time comprises both
|
Idle time due to
|
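A sketch of the execution-time expression referred to above, in the form used in Foster's book (per-node computation, communication, and idle times summed over the P nodes and averaged):

    T = (1/P) ( \sum_{i=0}^{P-1} T_i^{comp} + \sum_{i=0}^{P-1} T_i^{comm} + \sum_{i=0}^{P-1} T_i^{idle} )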
Recall cost model
|
Model works well for many algorithms, and on many computers |
Computer          ts (µs)   tw (µs/word) |
IBM SP2               40       0.11 |
Intel Paragon        121       0.07 |
Meiko CS-2            87       0.08 |
Sparc/Ethernet      1500       5.0 |
Sparc/FDDI          1150       1.1 |
(ts = per-message startup time, tw = per-word transfer time; both in microseconds) |
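As an illustrative check of the model against the table (IBM SP2 row), sending a message of L = 1000 words costs roughly

    T_msg = t_s + t_w L = 40 + 0.11 * 1000 = 150 microseconds,

so the startup term still accounts for about a quarter of the total even for a fairly long message.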
Finite difference computation on an N × N × Z (N²Z-point) grid
|
Decompose along one horizontal dimension |
Identical computations at each grid point
|
1-D decomposition, so each node sends 2NZ data to 2 neighbors [if ≥ 2 rows/node]
|
No significant idle time if load balanced
|
Therefore, T = tc N²Z/P + 2 ts + 4 tw N Z |
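Reading that expression term by term (a sketch; tc is the per-grid-point computation time): each node updates N²Z/P points and performs two boundary exchanges of 2NZ words each, so

    T = t_c N^2 Z / P  +  2 ( t_s + 2 t_w N Z )

with the first term the per-node computation and the second the two neighbor messages.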
During design
|
During implementation
|
Consider 2-D and 3-D decompositions
|
2-D Decomposition - On a √P × √P processor grid, messages of size 2(N/√P)Z to 4 neighbors (spelled out after this list), so
|
Good if ts < tw N Z (2 - 1/√P) |
3-D Decomposition - On a Px × Py × Pz grid, |
|
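Taking the message count and size stated for the 2-D case at face value (a sketch), the corresponding time is

    T_2D = t_c N^2 Z / P + 4 t_s + 8 t_w N Z / \sqrt{P}

i.e., more message startups than the 1-D decomposition, but less data volume per step.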
What we have here is a failure to communicate... |
Multicomputer model assumes comm cost independent of location & other comms |
Real networks are not fully connected |
Multicomputer model can break down |
Ethernet |
2-D Mesh |
In many cases, a bandwidth-constrained model can give sufficiently accurate results
|
Example: finite difference on Ethernet
|
Bandwidth-constrained model gives better fit |
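One simple bandwidth-constrained refinement (a sketch, not the only possible form): if S processors must send over a shared medium such as an Ethernet bus at the same time, scale the per-word term by S:

    T_msg = t_s + S t_w L

which reduces to the original model when S = 1 and captures why the fully-connected model overestimates performance on Ethernet.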
High Performance Fortran (HPF) |
Message Passing Interface (MPI) |
Parallel Computing Forum (PCF) and OpenMP™ |
Portable, Extensible Toolkit for Scientific Computations (PETSc) |
A standard data-parallel language
|
Programmer specifies
|
Compiler infers
|
PROGRAM hpf_finite_difference |
!HPF$ PROCESSORS pr(4) |
REAL x(100,100), new(100,100) |
!HPF$ ALIGN new(:,:) WITH x(:,:) |
!HPF$ DISTRIBUTE x(BLOCK, *) ONTO pr |
new(2:99,2:99) = (x(1:98,2:99)+x(3:100,2:99)+ & |
& x(2:99,1:98)+x(2:99,3:100)) / 4 |
diff = MAXVAL(ABS(new-x)) |
end |
Advantages
|
Disadvantages
|
Good for regular, SPMD problems |
A standard message-passing library
|
An MPI program defines a set of processes
|
... that communicate by calling MPI functions
|
... and can be constructed in a modular fashion
|
#include <mpi.h>
/* Finite difference fragment: declarations of the local data array,
   lsize, the neighbor ranks lnbr and rnbr, np, ldiff, diff, status,
   and the grid update itself are omitted on this foil. */
int main(int argc, char *argv[]) {
  MPI_Comm com = MPI_COMM_WORLD;
  MPI_Init(&argc,&argv);
  MPI_Comm_size(com,&np);
  /* exchange boundary values with the left and right neighbors */
  MPI_Send(local+1,1,MPI_FLOAT,lnbr,10,com);
  MPI_Recv(local,1,MPI_FLOAT,rnbr,10,com,&status);
  MPI_Send(local+lsize,1,MPI_FLOAT,rnbr,10,com);
  MPI_Recv(local+lsize+1,1,MPI_FLOAT,lnbr,10,com,&status);
  ldiff = maxerror(local);
  /* reduce the per-process errors to a global maximum */
  MPI_Allreduce(&ldiff,&diff,1,MPI_FLOAT,MPI_MAX,com);
  MPI_Finalize();
}
Advantages:
|
Disadvantages:
|
Good for performance-critical codes with natural task modularity |
Standardization (circa 1993) of shared memory parallelism in Fortran |
A PCF program is multithreaded, with explicit synchronization between the threads and shared variables |
A PCF program is divided into regions
|
PCF per se was not widely implemented
|
Its ideas resurfaced in OpenMP
|
!$DOACROSS, LOCAL(I), SHARE(A,B,C), |
!$& REDUCTION(X), |
!$& IF (N.GT.1000), |
!$& MP_SCHEDTYPE=DYNAMIC, CHUNK=100
|
!$DOACROSS, LOCAL(I), SHARE(D,E), |
!$& MP_SCHEDTYPE=SIMPLE
|
X is a summation |
Conditional parallelization |
Iterations managed first-come, first-served, in blocks of 100 |
Iterations blocked evenly among threads (INTERLEAVE, GSS, RUNTIME scheduling also available) |
PCF standard |
!$OMP PARALLEL DO SHARED(A,B,C), &
!$OMP REDUCTION(+:X), &
!$OMP SCHEDULE(DYNAMIC, 100)
|
!$OMP END PARALLEL DO
!$OMP PARALLEL DO SHARED(D,E), &
!$OMP SCHEDULE(STATIC)
|
!$OMP END PARALLEL DO
X is a summation |
Iterations managed first-come, first-served, in blocks of 100 |
Iterations blocked evenly among threads (GUIDED scheduling also available) |
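For comparison, the same pattern expressed in C with OpenMP (a sketch with made-up arrays a, b, c and sum x; not from the original foils):

    #include <stdio.h>
    #define N 100000

    int main(void) {
        static double a[N], b[N], c[N];   /* static: zero-initialized, off the stack */
        double x = 0.0;

        /* Parallel loop with a sum reduction on x and dynamic scheduling
           in chunks of 100 iterations, mirroring the directives above. */
        #pragma omp parallel for shared(a, b, c) reduction(+:x) schedule(dynamic, 100)
        for (int i = 0; i < N; i++) {
            a[i] = b[i] + c[i];
            x += a[i];
        }

        printf("x = %f\n", x);
        return 0;
    }

With an OpenMP-aware compiler the loop iterations are divided among threads just as in the Fortran version; without one, the pragma is ignored and the code runs serially.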
Advantages
|
Disadvantages
|
A good choice for shared-memory and DSM machines, but portability is still hard |
A higher-level approach to solving PDEs
|
User-level library provides:
|
Programmer supplies:
|
SNES snes
Mat J
Vec x,F
integer n,its,ierr
! EvaluateFunction and EvaluateJacobian are user-supplied routines
call MatCreate(MPI_COMM_WORLD,n,n,J,ierr)
call VecCreate(MPI_COMM_WORLD,n,x,ierr)
call VecDuplicate(x,F,ierr)
call SNESCreate(MPI_COMM_WORLD,SNES_NONLINEAR_EQUATIONS,snes,ierr)
call SNESSetFunction(snes,F,EvaluateFunction,PETSC_NULL,ierr)
call SNESSetJacobian(snes,J,EvaluateJacobian,PETSC_NULL,ierr)
call SNESSetFromOptions(snes,ierr)
call SNESSolve(snes,x,its,ierr)
call SNESDestroy(snes,ierr)
A rather different beast from the other tools
|
Advantages:
|
Disadvantages:
|
Good for implicit or explicit PDE solutions |