
This talk was given on 7th February 2011 at Trinity College Dublin to share initial experiences of learning GPU programming, with the goal of helping others in the statistics research group begin using nVidia’s CUDA technology.

GPU Programming Basics: Getting Started slides (PDF)

There were two handouts at the talk:

Alternatively, click here to download a zip archive of these 3 files and all the examples below.

GPU Examples

There were two very simple demos in the talk. The first was the classic parallel “hello world”: addition of two vectors. The second was computing the sum of probabilities below a given threshold over all combinations of possible outcomes from 7 consecutive (distinct) events, each with 17 outcomes. The second problem is a simple illustration of the speedups possible on GPUs, since it involves up to 17⁷ (over 410 million) independent calculations.

Vector Addition Example

VecAdd.cu, VecAdd.R
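
To give an idea of what such a file contains, below is a minimal sketch of a vector-addition kernel with an extern "C" wrapper so that, once compiled as a shared object, it can be called from R via .C(). The function and variable names are illustrative only; the actual VecAdd.cu may well differ in detail.

#include <cuda_runtime.h>

/* Kernel: one thread per element, adding two vectors. */
__global__ void vecAddKernel(const double *a, const double *b, double *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      /* guard threads past the end of the vectors */
        c[i] = a[i] + b[i];
}

/* R's .C() interface passes every argument as a pointer, so the host
   wrapper must have C linkage and take only pointer arguments. */
extern "C" void vecAdd(double *a, double *b, double *c, int *n)
{
    int N = *n;
    size_t bytes = N * sizeof(double);
    double *dA, *dB, *dC;

    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    vecAddKernel<<<blocks, threads>>>(dA, dB, dC, N);

    cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}

Once compiled as a shared object, the wrapper would be loaded in R with dyn.load() and called via .C("vecAdd", ...); VecAdd.R presumably does something along these lines.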

Combinatorics Example

combnCPU.c, combnGPU.cu, combnGPU_shared.cu, combn.R

combnCPU.c is a general version which uses a single core of the computer’s CPU — this is provided for speed comparison and makes no use of CUDA.

combnGPU.cu is a naïve implementation for the GPU: easy to understand but not at all optimal.
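
As a rough illustration of the naïve approach (the layout of the probabilities as a 7 x 17 table is an assumption here, and the real combnGPU.cu may be organised quite differently), each combination can be handled by its own thread: the thread decodes its index in base 17, multiplies the corresponding probabilities, and writes one value to global memory, with the partial results summed afterwards.

#define N_EVENTS   7
#define N_OUTCOMES 17

/* Naive sketch: one thread per combination, one global-memory write each.
   p is assumed to be a 7 x 17 table of probabilities (row = event),
   and the kernel is launched over chunks of the 17^7 index space. */
__global__ void combnNaive(const double *p, double threshold,
                           double *out, long offset, long chunk)
{
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= chunk) return;

    long idx = offset + i;          /* global combination index */
    double prob = 1.0;
    for (int k = 0; k < N_EVENTS; k++) {
        int outcome = (int)(idx % N_OUTCOMES);   /* digit k in base 17 */
        idx /= N_OUTCOMES;
        prob *= p[k * N_OUTCOMES + outcome];
    }

    /* Keep only probabilities below the threshold; the host (or a second
       kernel) sums the chunk of partial results afterwards. */
    out[i] = (prob < threshold) ? prob : 0.0;
}

Writing one double per combination back to global memory is what makes this version easy to follow but far from optimal.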

combnGPU_shared.cu is a slightly optimised version which makes use of shared memory. Shared memory was not covered in the talk, but this file is included as an example for future reference.
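
The kind of shared-memory optimisation meant here is typically a per-block reduction: each thread stores its contribution in shared memory and the block then writes out a single partial sum, cutting global-memory traffic by a factor of the block size. A sketch of that pattern (again an illustration, not necessarily how combnGPU_shared.cu is written) is:

/* Sketch of a per-block shared-memory reduction of the same calculation.
   Each block writes one partial sum instead of one value per thread;
   blockSums (length = number of blocks) is then summed on the host. */
__global__ void combnShared(const double *p, double threshold,
                            double *blockSums, long offset, long chunk)
{
    extern __shared__ double cache[];        /* one slot per thread */
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;

    double prob = 0.0;
    if (i < chunk) {
        long idx = offset + i;
        prob = 1.0;
        for (int k = 0; k < 7; k++) {        /* decode index in base 17 */
            prob *= p[k * 17 + (int)(idx % 17)];
            idx /= 17;
        }
        if (prob >= threshold) prob = 0.0;   /* keep only values below it */
    }
    cache[threadIdx.x] = prob;
    __syncthreads();

    /* Standard tree reduction within the block (blockDim.x a power of 2). */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = cache[0];
}

Such a kernel would be launched with a third launch configuration argument, blockDim.x * sizeof(double), to size the dynamic shared memory, with the per-block partial sums then added up on the host.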

combn.R is the R file which runs the example. Approximate timings were:

Using The Examples

GPUcompile_mac, GPUcompile_linux

These are two shell scripts written to make it a little easier to compile .cu CUDA files as shared objects for use with R (simply running R CMD SHLIB myFile.c doesn't work for CUDA code, since .cu files must be compiled with nvcc rather than the standard C compiler). Place the appropriate script (_mac or _linux) in the same directory as the .cu file and, assuming you have CUDA set up correctly, simply run, for example, ./GPUcompile_mac VecAdd.cu

If you downloaded the zip file above, run, for example, ../GPUcompile_mac VecAdd.cu from inside the VecAdd directory instead.

The ‘Linux’ version is actually specific to the CentOS installation on Amazon’s GPU compute nodes, but is easily adapted by changing the relevant paths to R/CUDA header files.

Profiling CUDA Code

cudaprofile

An important topic for which there was insufficient time in the talk is profiling CUDA code. It is very easy to get CUDA to report key performance metrics, such as run time and the numbers of coalesced and uncoalesced global memory accesses. To collect these metrics, place the cudaprofile file above in the same directory you plan to run your code from, then set the following environment variables (Apple Mac example; very similar for Linux):

export CUDA_PROFILE=1
export CUDA_PROFILE_CSV=1
export CUDA_PROFILE_LOG=CUDAProfiler.csv
export CUDA_PROFILE_CONFIG=cudaprofile

When you next run your GPU code in that terminal (even via R), this will create a file called CUDAProfiler.csv in your current working directory with performance data. This can be analysed manually or using the ‘computeprof’ application from nVidia.

Other Resources

The nVidia documentation is outstanding. It is strongly recommended to explore all the reference material available at the nVidia GPU Computing Developer Home Page.
