1. Allocate enough memory on each processor for its section of each distributed array
Distributed memory: 1 malloc per processor or shrink bounds
Shared memory: 1 shared area, with usage divided
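
As an illustration of the distributed-memory case, here is a minimal sketch that assumes a BLOCK distribution of a 1-based array of N elements over P processors. The function names `block_bounds` and `allocate_local` are hypothetical and chosen only for this example; a Python list stands in for the per-processor malloc.

```python
def block_bounds(n, nprocs, p):
    """Inclusive 1-based global bounds (lb, ub) of processor p's block.

    Assumes a BLOCK distribution with block size ceil(n / nprocs).
    If ub < lb, processor p owns no elements.
    """
    block = -(-n // nprocs)          # ceiling division
    lb = p * block + 1
    ub = min((p + 1) * block, n)
    return lb, ub


def allocate_local(n, nprocs, p):
    """Allocate only processor p's section (stand-in for one malloc per processor)."""
    lb, ub = block_bounds(n, nprocs, p)
    size = max(ub - lb + 1, 0)
    return [0.0] * size
```

With N = 10 and P = 4 under this assumption, the first three processors each hold 3 elements and the last holds 1, so no processor allocates the full array.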
2. Adjust indexing
Distributed memory: translate between global indices and local numbering
Shared memory: permute elements, keep each processor's together
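
The distributed-memory index translation might look like the following sketch, again assuming a BLOCK distribution with 1-based indices. The names `global_to_local` and `local_to_global` are hypothetical helpers, not part of any compiler or library.

```python
def global_to_local(g, n, nprocs):
    """Map 1-based global index g to (owner processor, 1-based local index)."""
    block = -(-n // nprocs)          # ceiling division: assumed block size
    owner = (g - 1) // block
    local = (g - 1) % block + 1
    return owner, local


def local_to_global(owner, local, n, nprocs):
    """Inverse mapping: recover the global index from (owner, local index)."""
    block = -(-n // nprocs)
    return owner * block + local
```

For example, with N = 10 and P = 4, global element 7 is the first local element on processor 2, and the inverse mapping returns 7 again.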
3. Adjust loops (including implicit loops)
Pick a reference in the loop (e.g. A(J+1))
Each processor executes only the iterations for which that reference is local (e.g. J = lb-1:ub-1)
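
A small sketch of the bound adjustment, assuming the chosen reference has the form A(J+offset) and the processor owns global elements lb:ub. The helper `local_iterations` is a hypothetical name; the original loop range is intersected with the shifted ownership range.

```python
def local_iterations(loop_lo, loop_hi, lb, ub, offset):
    """Iterations J in loop_lo..loop_hi for which A(J+offset) lies in lb..ub.

    A(J+offset) is local when lb <= J+offset <= ub, i.e. lb-offset <= J <= ub-offset.
    """
    lo = max(loop_lo, lb - offset)
    hi = min(loop_hi, ub - offset)
    return range(lo, hi + 1)         # empty if the processor has no work


# Example: loop J = 1..9 with reference A(J+1), blocks of a 10-element array on 4 processors
blocks = [(1, 3), (4, 6), (7, 9), (10, 10)]
per_proc = [list(local_iterations(1, 9, lb, ub, 1)) for lb, ub in blocks]
```

In this example every iteration 1..9 is executed by exactly one processor, which is the property the adjustment is meant to preserve.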
4. Handle nonlocal data
Distributed memory: allocate buffer space, SEND/RECV needed data
Shared memory: access data directly, or allocate & make local copies
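
The distributed-memory case can be simulated in a single process, as in the sketch below. It assumes the statement B(J) = A(J+1) for J = 1..N-1 with A and B both BLOCK-distributed, so each processor needs one element from its right neighbour. Plain list copies stand in for SEND/RECV (a real implementation would use a message-passing library), and `shift_left_distributed` is a hypothetical name.

```python
def shift_left_distributed(a_global, nprocs):
    """Simulate B(J) = A(J+1), J = 1..N-1, on nprocs block-distributed processors."""
    n = len(a_global)
    block = -(-n // nprocs)          # ceiling division: assumed block size

    # Each simulated processor holds only its own block of A.
    local_a = [a_global[p * block:min((p + 1) * block, n)] for p in range(nprocs)]

    # Buffer space: one ghost cell per processor, filled by the "SEND/RECV" step.
    ghost = [None] * nprocs
    for p in range(nprocs - 1):
        right = local_a[p + 1]
        if right:                    # right neighbour owns data: it "sends" its first element
            ghost[p] = right[0]

    # Local computation: use local data where possible, the ghost cell at the block edge.
    local_b = []
    for p in range(nprocs):
        a = local_a[p]
        b = []
        for i in range(len(a)):
            if i + 1 < len(a):
                b.append(a[i + 1])
            elif ghost[p] is not None:
                b.append(ghost[p])
            # otherwise this is J = N, which lies outside the loop range
        local_b.append(b)
    return local_b
```

Concatenating the per-processor results gives the same values as the sequential loop; only the block-edge elements required communication.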