In C:
    for (k=0; k<NCB; k++)
      for (i=0; i<NRA; i++) {
        c[i][k] = 0.0;
        for (j=0; j<NCA; j++)
          c[i][k] = c[i][k] + a[i][j] * b[j][k];
      }
In Fortran:
          do 100 k=1, NCB
          do 100 i=1, NRA
            c(i,k) = 0.0
            do 100 j=1, NCA
              c(i,k) = c(i,k) + a(i,j) * b(j,k)
      100 continue
For the C version, the problem is decomposed by assigning each task a number of consecutive rows of matrix A, and replicating matrix B on all tasks. Each task generates one or more rows of the result matrix C. Looking at the C code fragment above, the work is distributed according to the "i" index.
This decomposition is convenient because C stores matrices in row-major order (elements in the same row of the matrix are consecutive in storage). All message data is therefore contiguous. Each task has all the information needed for its part of the problem -- no intertask communication is necessary during the matrix multiply. In addition, since each row of A has to be multiplied against all columns of B, all data passed to each task is used by that task.
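The row decomposition can be sketched without any message passing. This is a minimal illustration, not the code in mm.c: the function name multiply_row_block and its parameters are hypothetical, and the matrices are treated as flat row-major arrays. It shows that a task holding only its own rows of A plus all of B can compute its rows of C with no further input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: multiply a block of consecutive rows of A by all of B.
 *   a_block : rows x nca, row-major (the rows assigned to this task)
 *   b       : nca  x ncb, row-major (replicated on every task)
 *   c_block : rows x ncb, row-major (this task's rows of the result)
 */
void multiply_row_block(const double *a_block, const double *b,
                        double *c_block, int rows, int nca, int ncb)
{
    assert(rows >= 0 && nca > 0 && ncb > 0);
    for (int i = 0; i < rows; i++) {
        for (int k = 0; k < ncb; k++) {
            double sum = 0.0;
            for (int j = 0; j < nca; j++)
                sum += a_block[(size_t)i * nca + j] * b[(size_t)j * ncb + k];
            c_block[(size_t)i * ncb + k] = sum;
        }
    }
}
```

Because the storage is row-major, the rows starting at row index offset form a single contiguous run of rows*nca values beginning at &a[offset][0], so one pointer and one count describe the whole block. The matching result rows start at &c[offset][0].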
Since Fortran stores matrices in column-major order, the Fortran version is decomposed by replicating matrix A on all tasks, and distributing columns of matrix B to each task. Each task generates one or more columns of the result matrix. Looking at the Fortran code fragment, the work is distributed according to the "k" index.
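To make the column-major argument concrete, the following sketch emulates Fortran storage in C with flat arrays and 0-based indices. The names cm_index and multiply_col_block are hypothetical and are not taken from mm.f. The sketch assumes each task holds all of A and a block of consecutive columns of B.

```c
#include <assert.h>
#include <stddef.h>

/* Position of element (i, j) in a column-major matrix with nrows rows. */
static size_t cm_index(int i, int j, int nrows)
{
    return (size_t)j * nrows + i;
}

/* Hypothetical helper: compute cols consecutive columns of C = A * B.
 *   a       : nra x nca,  column-major (replicated on every task)
 *   b_block : nca x cols, column-major (the columns assigned to this task)
 *   c_block : nra x cols, column-major (this task's columns of the result)
 */
void multiply_col_block(const double *a, const double *b_block,
                        double *c_block, int nra, int nca, int cols)
{
    assert(nra > 0 && nca > 0 && cols >= 0);
    for (int k = 0; k < cols; k++) {
        for (int i = 0; i < nra; i++) {
            double sum = 0.0;
            for (int j = 0; j < nca; j++)
                sum += a[cm_index(i, j, nra)] * b_block[cm_index(j, k, nca)];
            c_block[cm_index(i, k, nra)] = sum;
        }
    }
}
```

In this layout, column k of B starts at flat position k*nca and occupies nca consecutive values, so a block of adjacent columns is one contiguous run, just as a block of adjacent rows is in the C version.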
The code is SPMD, i.e. each task runs the same executable. The task with taskid 0 is the master task, which distributes the matrices and collects results, but does not contribute to the calculation.
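Since the master does no computation, the rows (or columns) are shared among numtasks - 1 workers. One common way to split them when the count does not divide evenly is sketched below. The helper worker_rows is hypothetical, and the remainder rule it uses (the first few workers take one extra row) is an assumption for illustration; mm.c may divide the work differently.

```c
#include <assert.h>

/* Hypothetical helper: rows assigned to worker w, where workers are
 * numbered 1..numworkers (taskid 0 is the master and gets no rows).
 * Writes the first row index to *offset and the row count to *rows. */
void worker_rows(int w, int numworkers, int nrows, int *offset, int *rows)
{
    assert(numworkers > 0 && w >= 1 && w <= numworkers && nrows >= 0);
    int base  = nrows / numworkers;   /* rows every worker receives       */
    int extra = nrows % numworkers;   /* leftover rows, one each to the   */
                                      /* first `extra` workers            */
    *rows   = base + (w <= extra ? 1 : 0);
    *offset = (w - 1) * base + (w - 1 < extra ? w - 1 : extra);
}
```

With 4 tasks (3 workers) and 10 rows, this gives worker 1 rows 0-3, worker 2 rows 4-6, and worker 3 rows 7-9. The master would send each worker its offset and row count along with the data, then collect the result rows at the same offsets.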
Tutorial directory: /usr/local/doc/www/Edu/Tutor/MPI/Templates/mm/
Fortran file: mm.f
C file: mm.c