Running an MPI/GPU program on the Delta cluster
MPI code: pmm_mpi.c
#include <mpi.h>

void invoke_cuda_vecadd();   /* defined in dgemm_cuda.cu */

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* get number of processes */
    invoke_cuda_vecadd();                   /* call the CUDA code */
    MPI_Finalize();
    return 0;
}
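In the listing above every rank makes the same call to invoke_cuda_vecadd(). If each process is meant to work on its own portion of the data or its own GPU, the rank and size obtained from MPI can simply be forwarded to the CUDA side. A minimal sketch of that variant (the name invoke_cuda_vecadd_part and its arguments are our own assumption, not part of the original package):

/* pmm_mpi_rank.c -- sketch only: forward the MPI rank and size to CUDA */
#include <mpi.h>

void invoke_cuda_vecadd_part(int rank, int size);  /* assumed CUDA entry point */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    invoke_cuda_vecadd_part(rank, size);   /* each rank handles its own slice or GPU */
    MPI_Finalize();
    return 0;
}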
CUDA code: dgemm_cuda.cu
#include <stdio.h>

__global__ void cuda_vecadd(int *array1, int *array2, int *array3)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    array3[index] = array1[index] + array2[index];
}

extern "C" void invoke_cuda_vecadd()
{
    int hostarray1[10], hostarray2[10], hostarray3[10];   /* host buffers */
    int *devarray1, *devarray2, *devarray3;               /* device buffers */
    int i;

    for (i = 0; i < 10; i++) {              /* fill the host input arrays */
        hostarray1[i] = i;
        hostarray2[i] = 2 * i;
    }

    cudaMalloc((void**) &devarray1, sizeof(int) * 10);
    cudaMalloc((void**) &devarray2, sizeof(int) * 10);
    cudaMalloc((void**) &devarray3, sizeof(int) * 10);
    cudaMemcpy(devarray1, hostarray1, sizeof(int) * 10, cudaMemcpyHostToDevice);
    cudaMemcpy(devarray2, hostarray2, sizeof(int) * 10, cudaMemcpyHostToDevice);
    cuda_vecadd<<<1, 10>>>(devarray1, devarray2, devarray3);   /* 1 block, 10 threads */
    cudaMemcpy(hostarray3, devarray3, sizeof(int) * 10, cudaMemcpyDeviceToHost);
    printf("hostarray3[9] = %d\n", hostarray3[9]);             /* sanity check */
    cudaFree(devarray1);
    cudaFree(devarray2);
    cudaFree(devarray3);
}
Note: Mixing MPI and CUDA code may cause problems during linking because of the difference between C and C++ calling and naming conventions. Wrapping invoke_cuda_vecadd in extern "C" instructs nvcc (which compiles through a C++ front end) to give the function C linkage, so that it can be called from the C code compiled by the MPI compiler.
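Note also that, for brevity, the listing above does not check the return values of the CUDA runtime calls. Every call returns a cudaError_t, and checking it makes failures (for example, no GPU visible on the node) much easier to diagnose. A minimal sketch of such a check (the CUDA_CHECK macro name is our own, not part of the package):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Abort with a readable message if a CUDA runtime call fails. */
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

/* Example usage inside invoke_cuda_vecadd():
 *   CUDA_CHECK(cudaMalloc((void**) &devarray1, sizeof(int) * 10));
 *   CUDA_CHECK(cudaMemcpy(devarray1, hostarray1, sizeof(int) * 10,
 *                         cudaMemcpyHostToDevice));
 */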
Compiling the MPI/CUDA program: load the modules
> module load IntelMPI # load Intel MPI
> module load Intel # load icc
> module load cuda # load cuda tools
This loads Intel MPI, the Intel compiler, and the CUDA tools. Next, compile the code with:
> nvcc -c dgemm_cuda.cu -o dgemm_cuda.o
> mpiicc -c pmm_mpi.c -o pmm_mpi.o
> mpiicc -o mpicuda pmm_mpi.o dgemm_cuda.o -lcudart -lcublas -L /opt/cuda/lib64 -I /opt/cuda/include
Note: The CUDA compiler nvcc is used only to compile the CUDA source file, and the Intel MPI compiler mpiicc is used to compile the C code and to do the linking.
Setting Up and Submitting MPI Jobs:
1. qsub -I -l nodes=4 -q delta # get 4 nodes from FG
2. uniq /var/spool/torque/aux/399286.i136 > gpu_nodes_list # create the machine file list
3. module load IntelMPI # load Intel MPI
4. module load Intel # load icc
5. module load cuda # load cuda tools
6. mpdboot -r ssh -f gpu_nodes_list -n 4 # will start an mpd ring on 4 nodes including local host
7. mpiexec -l -machinefile gpu_nodes_list -n 4 ./mpicuda 10000 1 4 # run the MPI program on 4 nodes (see the device-binding sketch below)
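Each of the 4 MPI processes started above lands on its own GPU node. If a node hosts more than one GPU, or if several ranks share a node, each rank should bind itself to a device before allocating memory. A minimal sketch of such a binding routine (the function name select_gpu_for_rank and the rank-modulo mapping are our own choices, not part of the package):

/* sketch only: bind each MPI rank to one GPU on its node */
#include <cuda_runtime.h>

/* Declare this in the C file like invoke_cuda_vecadd and call it right
 * after MPI_Comm_rank, passing the rank returned by MPI. */
extern "C" void select_gpu_for_rank(int rank)
{
    int device_count = 0;
    cudaGetDeviceCount(&device_count);        /* GPUs visible on this node */
    if (device_count > 0)
        cudaSetDevice(rank % device_count);   /* simple round-robin mapping */
}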
Comparison between four implementations of sequential matrix multiplication on Delta:

Source Code Package
To get the source code: git clone git@github.com:futuregrid/GPU.git
Compiling the source code on the Delta machine:
module load intelmpi
module load intel
module load cuda
cd mpi_cuda_mkl
make
Attachment: mpi_cuda_mkl.zip (888.92 KB)