CUDA Tips

I've been doing a bit more GPU programming recently, and here are some things I found while writing CUDA programs. This all refers to the CUDA compiler in the recent 3.2 RC, and is based on my experience with GTX 275 hardware. In particular, this advice may need to be tweaked for Fermi-architecture GPUs, since I have yet to experiment with one.

1. Expect large structures to be spilled

I had CUDA code along these lines:

__global__ void main()
{
    float3 foo1;
    float3 foo2;
    float3 foo3;
    float3 foo4;
    float3 foo5;
    float3 foo6;
    float bar1;
    float bar2;
    float bar3;
    float3 other1;
    float other2;
    /* etc */
    for (;;)
    {
        /* do stuff that modifies foos, bars and others */
    }
}

There was a lot of code duplication, and my foos and bars were all mixed up with other stuff, so I thought I'd be a good software engineer and tidy the code up with some encapsulation, making use of the C++ support in CUDA. The refactored code was functionally equivalent and looked like this:

class FooBar
{
private:
    __device__ void Operation1();
    __device__ void Operation2();
    /* etc */

public:
    float3 m_foo1;
    float3 m_foo2;
    float3 m_foo3;
    float3 m_foo4;
    float3 m_foo5;
    float3 m_foo6;
    float m_bar1;
    float m_bar2;
    float m_bar3;
};

__global__ void main()
{
    FooBar fooBar;
    float3 other1;
    float other2;
    /* etc */
    for (;;)
    {
        /* do stuff, calling member functions on fooBar, modify others */
    }
}

This new code was much cleaner, so I was happy. I also noticed that the register count had dropped, which I took as a good sign. Then I ran the new kernel: performance was nearly halved, and I was sad.

Inspecting the PTX output, it seems that the compiler had chosen to spill the entire FooBar structure to local memory! This explained the slowdown: every read or write of a member variable now went through local memory on every loop iteration, whereas in the untidy code most of the data had stayed in registers across loop iterations.
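
A quick way to catch this kind of spill without wading through the PTX is to ask ptxas for a per-kernel resource summary when compiling (the file name here is just a placeholder):

nvcc -arch=sm_13 --ptxas-options=-v kernel.cu

The output reports the registers and the bytes of local memory ("lmem") each thread uses; a sudden jump in lmem after a refactor is a strong hint that something got spilled.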

Looking at the CUDA Programming Guide it appears that this is documented as expected behaviour of the compiler. Here's a quote:

Automatic variables that the compiler is likely to place in local memory are:

  • Arrays for which it cannot determine that they are indexed with constant quantities
  • Large structures or arrays that would consume too much register space
  • Any variable if the kernel uses more registers than available (this is also known as register spilling)

This is of course not ideal for performance. Ideally I'd want the CUDA compiler to figure out (using standard alias analysis) that it can colour each member of this class independently, and then I would end up with identical codegen to my initial (untidy) code. I'd even settle for some pragma to force the compiler to keep the whole class in registers. Sadly I can't find a mechanism for either, so the tip is: don't use large classes. It's much better to use several smaller classes and let the spilling heuristics work with smaller units, along the lines of the sketch below.
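
For example, here's a hypothetical split of the FooBar class above into two smaller classes (the names and the grouping of members are made up; in practice you'd group by what gets accessed together), which gives the register allocator smaller units to keep in registers or spill independently:

class FooPosition
{
public:
    __device__ void Operation1();

    float3 m_foo1;
    float3 m_foo2;
    float3 m_foo3;
};

class FooState
{
public:
    __device__ void Operation2();

    float3 m_foo4;
    float3 m_foo5;
    float3 m_foo6;
    float m_bar1;
    float m_bar2;
    float m_bar3;
};

__global__ void mainSplit()    // stand-in for the kernel above
{
    FooPosition fooPos;    // small enough to have a chance of staying in registers
    FooState fooState;     // if this still spills, at least fooPos is unaffected
    /* do stuff as before */
}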

So we need other tricks to reduce code duplication and complexity, such as:

2. Use template functions

Template functions are a great way to generate specialised versions of general purpose functions for a particular kernel. If you want to change the behaviour of part of a kernel at runtime, don't do this:

// something we want to configure globally for the kernel
__constant__ bool useHam;

// kernel entry point
__global__ void main()
{
    /* do stuff */
    if (useHam)
    {
        /* ham-related stuff */
    }
    else
    {
        /* non-ham-related stuff */
    }
    /* do more stuff */
}

Although the branch will be perfectly coherent, you can do much better. Do this:

// kernel template
template <bool useHam> __device__ void mainImpl()
{
    /* do stuff */
    if (useHam)
    {
        /* ham-related stuff */
    }
    else
    {
        /* non-ham-related stuff */
    }
    /* do more stuff */
}

// specialised entry points
__global__ void mainWithHam()
{
    mainImpl<true>();
}
__global__ void mainWithoutHam()
{
    mainImpl<false>();
}

Now, instead of setting the boolean at runtime, the host just selects the correct entry point (a sketch of this is below). The CUDA compiler seems to do all the right things with templates regarding dead code elimination and merging of basic blocks, which saves a few cycles and (more importantly) gets your register pressure down.
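
On the host side, selecting the specialisation is then just a branch around the launch. A minimal sketch (the grid and block dimensions are placeholders):

void LaunchMain(bool useHam)
{
    dim3 grid(256);
    dim3 block(128);
    if (useHam)
        mainWithHam<<<grid, block>>>();
    else
        mainWithoutHam<<<grid, block>>>();
}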

3. Prefer atomics in shared memory

One way to kill performance is to have every thread hammering the same location in global memory with atomic operations. It is much better, where possible, to do the bulk of the atomic work in shared memory and only touch the global location once per warp or per block.
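
As a minimal illustration (the kernel and its predicate are made up for this sketch), here's the usual trick of accumulating a per-block count in shared memory and issuing a single global atomic per block, instead of having every matching thread hit the global counter. Note that shared-memory atomics need compute capability 1.2 or later, which the GTX 275 has:

// global counter, written once per block rather than once per matching thread
__device__ unsigned int g_matchCount;

__global__ void CountMatches(float const *values, unsigned int count)
{
    __shared__ unsigned int s_blockCount;
    if (threadIdx.x == 0)
        s_blockCount = 0;
    __syncthreads();

    // each matching thread bumps the cheap shared counter
    unsigned int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count && values[i] > 0.5f)
        atomicAdd(&s_blockCount, 1u);
    __syncthreads();

    // one thread per block folds the total into the global counter
    if (threadIdx.x == 0)
        atomicAdd(&g_matchCount, s_blockCount);
}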

A well-known example of the same idea is the persistent threads scheme described in the Aila and Laine paper Understanding the Efficiency of Ray Traversal on GPUs, where several warps' worth of jobs are grabbed at once from the global list, stored in shared memory, and then consumed by individual threads (actually without atomics, in that case).

A slightly more complex example is writing out values to a single linear array when you don't know in advance how many values each thread will write. The code below assumes the block geometry is set up as rows of warps, so threadIdx.x is the thread index within each warp and threadIdx.y is the warp index within the block.

typedef unsigned int uint;

// our block geometry
#define THREAD_INDEX          threadIdx.x
#define WARP_INDEX            threadIdx.y
#define MAX_WARPS_PER_BLOCK   16

// the data type we are writing out to a global array
typedef float4 Data;

// a function that returns empty data we can use to fill gaps
__device__ Data CreateEmptyData()
{
    return make_float4(0.0f, 0.0f, 0.0f, 0.0f);
}

// the global offset into the array
__device__ uint g_globalOffset;

// per-warp ranges that are allocated from the global array
__shared__ uint volatile g_sharedOffset[MAX_WARPS_PER_BLOCK];
__shared__ uint volatile g_sharedEnd[MAX_WARPS_PER_BLOCK];

// size of the range each warp should grab in one go (must be at least the warp size)
#define GRAB_COUNT 64

// initialize this warp
__device__ void Init()
{
    // warp shared areas
    if (THREAD_INDEX == 0)
    {
        g_sharedOffset[WARP_INDEX] = 0;
        g_sharedEnd[WARP_INDEX] = 0;
    }
}

// clean up this warp by filling in data we didn't use
__device__ void Cleanup(
    Data *dataArray,
    uint globalCapacity)
{
    // warp cleanup
    if (THREAD_INDEX == 0)
    {
        // zero off unused scatter slots
        uint index = g_sharedOffset[WARP_INDEX];
        uint const end = min(g_sharedEnd[WARP_INDEX], globalCapacity);
        while (index < end)
        {
            dataArray[index] = CreateEmptyData();
            ++index;
        }
    }
}

// grab an offset in the global array for this thread
__device__ uint GetNextOffset()
{
    // grab from the shared range for this warp
    uint offset = atomicAdd((uint *)&g_sharedOffset[WARP_INDEX], 1);

    // exactly one thread could hit the end of the shared range
    uint const end = g_sharedEnd[WARP_INDEX];
    if (offset == end)
    {
        // if this thread hit the end, grab more space from the global array
        uint const start = atomicAdd(&g_globalOffset, GRAB_COUNT);
        g_sharedOffset[WARP_INDEX] = start;
        g_sharedEnd[WARP_INDEX] = start + GRAB_COUNT;
    }

    // now the shared range is updated, handle all threads that went over the old limit
    if (offset >= end)
        offset = atomicAdd((uint *)&g_sharedOffset[WARP_INDEX], 1);

    // done
    return offset;
}

__global__ void main(
    Data *dataArray,
    uint globalCapacity)
{
    // initialise shared memory areas (to empty)
    Init();

    /* rest of init code */

    while (__any(/* some condition */))
    {
        /* ... */

        // get an offset within the global array
        // this is safe to call from within divergent flow control
        uint offset = GetNextOffset();
        if (offset < globalCapacity)
            dataArray[offset] = /* some value we want to write */;

        /* ... */
    }

    // now all threads of the warp are done, fill in gaps in output array
    Cleanup(dataArray, globalCapacity);
}

A summary of what the code above is doing:

  • Init() marks each warp's shared range as empty (offset and end both zero), so the first allocation in each warp falls straight through to a global grab.
  • GetNextOffset() bumps the warp's shared offset with a cheap shared-memory atomic. The one thread whose old offset lands exactly on the end of the range does a single global atomicAdd to reserve GRAB_COUNT more slots, and every thread that ran past the old end then retries against the refreshed shared range. Since GRAB_COUNT is at least the warp size, one retry is always enough.
  • Cleanup() fills any slots the warp reserved but never wrote with empty data, so the output array contains no garbage.

Future kernels or CPU-side code can get the number of entries written by reading g_globalOffset after the kernel has finished (this count includes the empty filler entries, and can exceed globalCapacity if the array fills up).
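
For completeness, here's a minimal host-side sketch of driving this (assuming everything above lives in the same .cu file; the grid size is a placeholder):

void RunAndCount(Data *d_dataArray, uint globalCapacity, uint *entriesWritten)
{
    // reset the global allocation offset before the launch
    uint const zero = 0;
    cudaMemcpyToSymbol(g_globalOffset, &zero, sizeof(zero));

    // rows of warps, matching the block geometry the kernel assumes
    dim3 block(32, MAX_WARPS_PER_BLOCK);
    dim3 grid(64);
    main<<<grid, block>>>(d_dataArray, globalCapacity);

    // after the kernel, g_globalOffset holds the number of slots allocated
    cudaMemcpyFromSymbol(entriesWritten, g_globalOffset, sizeof(uint));
}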