GPGPU computing with CUDA

25 Nov 2021

Reading time ~10 minutes

Based on the problem from my previous post, this one includes and discusses my first venture into GPGPU computing via a compute cluster in order to handle the computations for both approaches (unoptimized and optimized) to solve the same, with the prime intention of this experiment being to gather some insight into the CPU vs GPU comparisons.

Since I wanted to repeat the computation against several different values of N (number of points), I explicitly set my value for it during compilation using the -D flag (for both my GPU and CPU programs, i.e. for both nvcc and gcc compilers) and trigger changes from my runner script, which is a shell program used to reserve and allocate resources (time, memory, CPU and thread core count) on the compute cluster (Monsoon), and then accordingly schedule my desired instructions via the scheduler (Slurm).

There are two types of shell scripts I created for the execution of my programs on Monsoon - one for the GPU executions, and one for the varying CPU executions (with a different number of cores set inside each). Here’s an example of both with N as 10⁵:

CPU script: (16 cores, one thread per core)

#!/bin/bash
#SBATCH --job-name=runSixteenCoresCPU
#SBATCH --output=/scratch/ac4743/cpuResultsSixteenCores.out
#SBATCH --time=00:01:30 # change for different core & thread count.
#SBATCH --mem=1000
#SBATCH --cpus-per-task 16
	       
trialrunCount=3
# -DNTHREADS=16
n=100000
declare -a epsilonValues=(5 10)
arrayLength=${#epsilonValues[@]}

gcc -DN=$n -fopenmp -O3 CountDistancesLessThanEpsilonBFV.c -lm -o CountDistancesLessThanEpsilonBFV
echo -e "\n Brute force (version one) approach with number of points, N = $n:"
for((i = 0; i < ${arrayLength}; i++));  
do
  for run in $(seq 1 $trialrunCount)
  do
      echo -e "\n\nTrial run #$run for an epsilon of ${epsilonValues[$i]}:"
      srun CountDistancesLessThanEpsilonBFV ${epsilonValues[$i]}
  done
done  

gcc -DN=$n -fopenmp -O3 CountDistancesLessThanEpsilonOV.c -lm -o CountDistancesLessThanEpsilonOV
echo -e "\n\n Optimized (version two) approach with number of points, N = $n:"
for((i = 0; i < ${arrayLength}; i++));  
do
  for run in $(seq 1 $trialrunCount)
  do
      echo -e "\n\nTrial run #$run for an epsilon of ${epsilonValues[$i]}:"
      srun CountDistancesLessThanEpsilonOV ${epsilonValues[$i]}
  done
done 

GPU script: (volta-gen/v100)

#!/bin/bash
#SBATCH --job-name=runGPU
#SBATCH --output=/scratch/ac4743/gpuResults.out
#SBATCH --time=00:01:00
#SBATCH --mem=1000
#SBATCH -G 1
#SBATCH --qos=gpu_class
#SBATCH --constraint v100

module load cuda
srun hostname
srun nvidia-smi

trialrunCount=3
n=100000
declare -a epsilonValues=(5 10)
arrayLength=${#epsilonValues[@]}

nvcc -DN=$n -arch=compute_70 -code=sm_70 -lcuda -Xcompiler -fopenmp CountDistancesLessThanEpsilonBFV.cu -o CountDistancesLessThanEpsilonBFV
echo -e "\n Brute force (version one) approach with number of points, N = $n:"
for((i = 0; i < ${arrayLength}; i++));  
do
  for run in $(seq 1 $trialrunCount)
  do
      echo -e "\n\nTrial run #$run for an epsilon of ${epsilonValues[$i]}:"
      srun CountDistancesLessThanEpsilonBFV ${epsilonValues[$i]}
  done
done 

echo -e "\n\n"
nvcc -DN=$n -arch=compute_70 -code=sm_70 -lcuda -Xcompiler -fopenmp CountDistancesLessThanEpsilonOV.cu -o CountDistancesLessThanEpsilonOV
echo -e "\n\n Optimized (version two) approach with number of points, N = $n:"
for((i = 0; i < ${arrayLength}; i++));  
do
  for run in $(seq 1 $trialrunCount)
  do
      echo -e "\n\nTrial run #$run for an epsilon of ${epsilonValues[$i]}:"
      srun CountDistancesLessThanEpsilonOV ${epsilonValues[$i]}
  done
done 

The Unintuitive

The brute force approach simply uses two nested loops running for N iterations each, where N is the number of points. Each point searches for every other point, computing the distance between them. In order to have the counter variable (which keeps the count of the euclidean distances less than epsilon) and the point data set available to the GPU, I allocate memory for them and transfer the variables/data to the GPU’s global memory using the CUDA version of malloc and memcpy. I pass these along with epsilon (which ends up being a constant with no value modifications throughout the program, and thus it does not require special treatment like the rest of the parameters) to the kernel function, which does the distance computations for N points, one by one against each point (total N*N), which is taken care of by one thread each (total N threads, and for the extra threads that are available, they won’t access our array outside of bounds due to the tid >= N check). If the condition is met (distance < epsilon), then I perform an atomic add on my shared counter, incrementing it by one.

The Optimized

Similar to what I did earlier, I am going with my self-count and double pair-count avoiding loop optimization in addition to avoiding the square-root computations for my kernel function.

There is another version of this function while using shared memory - I declared a shared array of size given by the block-size, and then I initialized the array with zeroes like a sparse one. I then make a block-level synchronization call and proceed to do the euclidean distance computations for each thread. (Note that here I just iterated over and up till N, discarding my optimization on the loops; however the optimization of going without the sqrt function still applies for this variant of the kernel function) For the points that satisfy the condition, I increment the block level counter for that thread. Then I synchronize the threads again and initialize a variable to store the block sums. I use the first thread to do the aggregation and then assign the total count back to my shared counter.

Some observations: (not necessarily astute)

After obtaining and tabulating the speedup and parallel efficiency (for both ε = 5 and 10), I obtained the much expected steady decline in parallel efficiency with the increasing number of cores used (kind of like an inverse relationship). This becomes especially applicable after a certain threshold, wherein the downward trend is more apparent as the number stagnates quickly.
The value for the efficiency is almost halved as compared to what I got for my brute force approach, and this absolutely makes sense because this is my optimized version, and the more optimized a program is, the less of an opportunity it affords for parallelism to be effective. In other words, the slowest configuration (implicatory of going without optimization) would usually give the best parallel scaling (speedup, efficiency), and with more and better optimization, the numbers would just decrease.
For 10⁵ points, The GPU is faster if we were to compare the GPU implementation with the CPU implementation while using 1 core/thread or even for 4 cores/threads. For 8 or more cores though, the CPU can outperform the GPU, but this comparison isn’t necessarily fair, and for larger N, the GPU can outperform the CPU version in both of these cases. For smaller N, the CPU will be faster, which is reasonable as well, since a small computation defeats the purpose of using multiple cores to perform that. In other words, as N scales (increases), the better performance improvement obtained while using the GPU (as compared to the CPU versions).
The kernel times for the GPU are almost comparable to the overall execution times (more than the former, of course) for larger N, since the data transfer is completely shadowed. On the other hand, for smaller N (like 100), one can notice that the difference shows, and is comparatively more than the cases with larger N values.

Here’s my CUDA program (with an alternative kernel that uses shared memory as well) for the optimized version:

#include <omp.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <iostream>
#include <complex.h>

// #define N 100000
#define SEED 72
#define BLOCKSIZE 1024

using namespace std; 
struct point { double x, y; };

void warmUpGPU();
__global__ void kernelFunction(struct point * pointData, unsigned long long int * countDistancesLessThanEpsilon, double epsilonSquare);
__global__ void kernelFunctionSharedMemory(struct point * pointData, unsigned long long int * countDistancesLessThanEpsilon, double epsilonSquare);

int main(int argc, char *argv[])
{
  warmUpGPU();	

  // Take epsilon as command line input:
  if (argc != 2)
  { printf("\nIncorrect number of input parameters. Please input an epsilon distance.\n");
	  return 0;
  } 
  double epsilon = atof(argv[1]);
  double epsilonSquare = epsilon * epsilon;  
  
  // Initialize point values between [0-1] for x and y:  
  struct point * pointData;
  pointData = (struct point *)malloc(sizeof(struct point) * N); 
  // Display memory required for pointData in mebibytes:
  printf("\nSize of the points data set (MiB): %f", (2.0 * sizeof(double) * N * 1.0) / (1024.0 * 1024.0)); 
  // In gibibytes: (N * 1.0 * sizeof(struct point)/(1024.0 * 1024.0 * 1024.0)) 
  // Seed the rng:
  srand(SEED);
  for (unsigned int i = 0; i < N; i++)
  {
    pointData[i].x = 1000.0 * ((double)(rand()) / RAND_MAX); 
    pointData[i].y = 1000.0 * ((double)(rand()) / RAND_MAX);	 
  }

  /*---
   GPU	
  ---*/

  double tstart = omp_get_wtime();
  cudaError_t errCode = cudaSuccess;
  if(errCode != cudaSuccess)
  {  
    cout << "\nLast error: " << errCode << endl; 
  }

  struct point *dev_pointData;
  unsigned long long int *countDistancesLessThanEpsilon;
  unsigned long long int *dev_countDistancesLessThanEpsilon; 
  countDistancesLessThanEpsilon = (unsigned long long int *)malloc(sizeof(unsigned long long int)); 
  dev_countDistancesLessThanEpsilon = (unsigned long long int *)malloc(sizeof(unsigned long long int));
  *countDistancesLessThanEpsilon = 0;

  // Allocate the entire point data set (pointData) on the device:
  errCode = cudaMalloc((struct point**)&dev_pointData, sizeof(struct point) * N);	
  if(errCode != cudaSuccess) 
  {
	cout << "\nError: pointData error with code " << errCode << endl; 
  }
  // Allocate the counter variable (countDistancesLessThanEpsilon) on the device:
  errCode = cudaMalloc((unsigned long long int**)&dev_countDistancesLessThanEpsilon, sizeof(unsigned long long int));	
  if(errCode != cudaSuccess) 
  {
	cout << "\nError: countDistancesLessThanEpsilon error with code " << errCode << endl; 
  }	

  // Copy the point data set to device:
  errCode = cudaMemcpy(dev_pointData, pointData, sizeof(struct point) * N, cudaMemcpyHostToDevice);
  if(errCode != cudaSuccess) 
  {
	cout << "\nError: dev_pointData Memcpy error with code " << errCode << endl; 
  }	
  // Copy the counter variable to device:
  errCode = cudaMemcpy(dev_countDistancesLessThanEpsilon, countDistancesLessThanEpsilon, sizeof(unsigned long long int), cudaMemcpyHostToDevice);
  if(errCode != cudaSuccess) 
  {
	cout << "\nError: countDistancesLessThanEpsilon Memcpy error with code " << errCode << endl; 
  }

  // Calculate block count:
  const unsigned int totalBlocks = ceil(N * 1.0 / 1024.0);
  printf("\nTotal number of blocks (GPU): %d", totalBlocks);

  // Execute kernel:
  double tstartkernel = omp_get_wtime();
  kernelFunction<<<totalBlocks, BLOCKSIZE>>>(dev_pointData, dev_countDistancesLessThanEpsilon, epsilonSquare);
	
  // Sync threads: (run kernel with shared memory before this point)  
  cudaDeviceSynchronize();
  double tendkernel = omp_get_wtime();
  printf("\nExecution time for only the GPU kernel (in seconds): %f", tendkernel - tstartkernel);

  if(errCode != cudaSuccess)
  {
	cout << "\nError after kernel launch! " << errCode << endl;
  }

  // Copy counter from device (GPU) back to the host (CPU):
  errCode = cudaMemcpy(countDistancesLessThanEpsilon, dev_countDistancesLessThanEpsilon, sizeof(unsigned long long int), cudaMemcpyDeviceToHost);
  if(errCode != cudaSuccess) 
  {
	cout << "\nError: getting result form GPU error with code " << errCode << endl; 
  }

  printf("\nTotal number of distances between points that are within a search radius (epsilon) of %.1f, as computed by the GPU: %llu", epsilon, 2 * (*countDistancesLessThanEpsilon) + N);
  double tend = omp_get_wtime();
  printf("\nTotal time taken by the GPU (in seconds): %f", tend - tstart);
  
  free(pointData);
  return 0;
}

__global__ void kernelFunction(struct point * pointData, unsigned long long int * countDistancesLessThanEpsilon, double epsilonSquare)
{
  unsigned int tid = threadIdx.x + (blockIdx.x * blockDim.x); 
  double distance;	
  if (tid >= N)
  {
	  return;
  }
  for(int i = tid + 1; i < N; i = i + (blockDim.x * gridDim.x))
  { 
    distance = pow((pointData[tid].x - pointData[i].x), 2) + pow((pointData[tid].y - pointData[i].y), 2);
    if(distance < epsilonSquare)
      atomicAdd(countDistancesLessThanEpsilon, int(1));
  }
  return;
}

__global__ void kernelFunctionSharedMemory(struct point * pointData, unsigned long long int * countDistancesLessThanEpsilon, double epsilonSquare)
{
  unsigned int tid = threadIdx.x + (blockIdx.x * blockDim.x); 
  double distance;	
  if (tid >= N)
  {
	  return;
  }
  __shared__ unsigned long long int countArray[BLOCKSIZE];
  if (threadIdx.x == 0) 
  {
    for(int i = 0; i < BLOCKSIZE; i++) 
    {
      countArray[i] = 0;
    }
  }  
  __syncthreads();
  for(int i = tid; i < N; i++)
  { 
    distance = pow((pointData[tid].x - pointData[i].x), 2) + pow((pointData[tid].y - pointData[i].y), 2);
    if(distance < epsilonSquare)
      countArray[tid] += 1;
  }
  __syncthreads();
  unsigned int blockSum = 0;
  if(threadIdx.x == 0) 
  {
    for(int i = 0; i < BLOCKSIZE; i++) 
    {
      blockSum += countArray[i];
    }  
    atomicAdd(countDistancesLessThanEpsilon, blockSum);
  }
  return;
}

__global__ void warmup(unsigned int * tmp) 
{
  if (threadIdx.x == 0)
  *tmp = 555;
  return;
}

void warmUpGPU()
{
  // printf("\nWarming up GPU for time trialing...\n");	
  unsigned int * dev_tmp;
  unsigned int * tmp;
  tmp = (unsigned int*)malloc(sizeof(unsigned int));
  *tmp = 0;
  cudaError_t errCode = cudaSuccess;
  errCode = cudaMalloc((unsigned int**)&dev_tmp, sizeof(unsigned int));	
  if(errCode != cudaSuccess)
  {
	  cout << "\nError: dev_tmp error with code " << errCode << endl; 
  }
  warmup<<<1, BLOCKSIZE/4>>>(dev_tmp);
  //copy data from device to host 
  errCode = cudaMemcpy(tmp, dev_tmp, sizeof(unsigned int), cudaMemcpyDeviceToHost);
  if(errCode != cudaSuccess) 
  {
    cout << "\nError: getting tmp result form GPU error with code " << errCode << endl; 
  }
  // printf("\ntmp (changed to 555 on GPU): %d\n", *tmp);
  cudaFree(dev_tmp);
  return;
}

Anirban

11/25/2021