Running an MPI/GPU program on the Delta cluster

GPUs can perform many mathematical operations at a fraction of the cost, and with higher performance, than current-generation CPUs. FutureGrid makes such an infrastructure available for testing through its Delta cluster. Here we provide a step-by-step guide to running a parallel matrix multiplication program with Intel MPI and CUDA on the Delta machines. MPI distributes the work among the compute nodes, each of which uses CUDA to execute its share of the workload. The complete MPI/CUDA parallel matrix multiplication code, already tested on the Delta cluster, is provided as an attachment.

MPI code: pmm_mpi.c

#include <mpi.h>

void invoke_cuda_vecadd();

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* get number of processes */
    invoke_cuda_vecadd();                  /* call the CUDA code */
    MPI_Finalize();
    return 0;
}
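
The skeleton above only calls a fixed CUDA routine. In the full matrix multiplication program each rank works on its own block of rows, chosen from the rank and the number of processes. Below is a minimal sketch of that partitioning; the function invoke_cuda_dgemm and the command-line handling are illustrative, not the exact interface of the attached code.

#include <stdlib.h>
#include <mpi.h>

/* Hypothetical CUDA entry point: compute rows [row_start, row_end) of C = A * B. */
void invoke_cuda_dgemm(int row_start, int row_end, int n);

int main(int argc, char *argv[])
{
    int rank, size;
    int n = (argc > 1) ? atoi(argv[1]) : 1024;   /* matrix dimension */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Simple block-row decomposition; assumes n is divisible by size. */
    int rows_per_rank = n / size;
    int row_start = rank * rows_per_rank;
    int row_end   = row_start + rows_per_rank;

    invoke_cuda_dgemm(row_start, row_end, n);    /* each rank's GPU computes its block */

    MPI_Finalize();
    return 0;
}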

CUDA code: dgemm_cuda.cu

#include <stdio.h>

__global__ void cuda_vecadd(int *array1, int *array2, int *array3)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    array3[index] = array1[index] + array2[index];
}

extern "C" void invoke_cuda_vecadd()
{
    int hostarray1[10], hostarray2[10], hostarray3[10];
    int *devarray1, *devarray2, *devarray3;

    for (int i = 0; i < 10; i++) {       /* fill the host input arrays */
        hostarray1[i] = i;
        hostarray2[i] = i;
    }

    cudaMalloc((void **) &devarray1, sizeof(int) * 10);
    cudaMalloc((void **) &devarray2, sizeof(int) * 10);
    cudaMalloc((void **) &devarray3, sizeof(int) * 10);
    cudaMemcpy(devarray1, hostarray1, sizeof(int) * 10, cudaMemcpyHostToDevice);
    cudaMemcpy(devarray2, hostarray2, sizeof(int) * 10, cudaMemcpyHostToDevice);
    cuda_vecadd<<<1, 10>>>(devarray1, devarray2, devarray3);  /* 1 block, 10 threads */
    cudaMemcpy(hostarray3, devarray3, sizeof(int) * 10, cudaMemcpyDeviceToHost);
    cudaFree(devarray1);
    cudaFree(devarray2);
    cudaFree(devarray3);
}
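
None of the CUDA calls above check their return values. A small helper along the following lines (not part of the attached package; just a common pattern) makes failures such as a missing device or a bad allocation visible immediately:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Abort with a readable message if a CUDA runtime call fails. */
static void check_cuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

/* Example usage inside invoke_cuda_vecadd():
 *   check_cuda(cudaMalloc((void **) &devarray1, sizeof(int) * 10), "cudaMalloc devarray1");
 *   cuda_vecadd<<<1, 10>>>(devarray1, devarray2, devarray3);
 *   check_cuda(cudaGetLastError(), "kernel launch");
 *   check_cuda(cudaDeviceSynchronize(), "kernel execution");
 */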

Note: Mixing MPI and CUDA code may cause problems during linking because C and C++ use different name mangling and calling conventions. Wrapping invoke_cuda_vecadd in extern "C" instructs nvcc (which compiles as C++) to give the function C linkage so that it can be called from the C code.
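
One common way to keep the declaration consistent on both sides is a small shared header that both pmm_mpi.c and dgemm_cuda.cu include; the file name below is only an example:

/* cuda_kernels.h (hypothetical shared header) */
#ifdef __cplusplus
extern "C" {              /* seen by nvcc when it compiles the .cu file as C++ */
#endif

void invoke_cuda_vecadd(void);   /* plain C linkage, callable from pmm_mpi.c */

#ifdef __cplusplus
}
#endif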

Compiling the MPI/CUDA program:

Load the modules:

> module load IntelMPI   # load Intel MPI
> module load Intel      # load icc
> module load cuda       # load CUDA tools

This loads the Intel MPI library, the Intel compiler, and the CUDA tools. Next, compile the code with:

> nvcc -c dgemm_cuda.cu -o dgemm_cuda.o
> mpiicc -c pmm_mpi.c -o pmm_mpi.o
> mpiicc -o mpicuda pmm_mpi.o dgemm_cuda.o -lcudart -lcublas -L /opt/cuda/lib64 -I /opt/cuda/include

Note: The CUDA compiler nvcc is used only to compile the CUDA source file; the Intel MPI compiler wrapper mpiicc is used to compile the C code and to do the linking.

Setting Up and Submitting MPI Jobs:

1. qsub -I -l nodes=4 -q delta        # get 4 nodes from FG
2. uniq /var/spool/torque/aux/399286.i136 > gpu_nodes_list  # create the machine file list (399286.i136 is this example's job ID; use your own)
3. module load IntelMPI                # load Intel MPI
4. module load Intel                     # load icc
5. module load cuda                     # load cuda tools
6. mpdboot -r ssh -f gpu_nodes_list -n 4  # will start an mpd ring on 4 nodes including local host 
7. mpiexec -l -machinefile gpu_nodes_list -n 4 ./mpicuda 10000 1 4  # run mpi program using 4 nodes
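
To collect numbers for a comparison like the one below, one simple approach (not part of the attached code; a sketch to drop into main() after the MPI_Comm_rank call, with <stdio.h> included) is to time the compute phase with MPI_Wtime and report the slowest rank:

double t0 = MPI_Wtime();                       /* start the clock */
invoke_cuda_vecadd();                          /* or the dgemm call in the full program */
double elapsed = MPI_Wtime() - t0, longest;

/* The job is only as fast as its slowest rank. */
MPI_Reduce(&elapsed, &longest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("compute time: %.3f seconds\n", longest);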

Comparison between four implementations of sequential matrix multiplication on Delta:

Source Code Package:

To get source code: git clone git@github.com:futuregrid/GPU.git

Compiling the source code on the Delta machine:
module load intelmpi
module load intel
module load cuda
cd mpi_cuda_mkl
make

Attachment: mpi_cuda_mkl.zip (888.92 KB)