Archive for the ‘GPGPU’ Category
CUDA Mersenne Twister
I needed a random number generator for a CUDA project, and had relatively few requirements:
- It must have a small shared memory footprint
- It must be suitable for Monte Carlo methods (i.e. have a long period and minimal correlation)
- It must allow warps to execute independently when generating random numbers
There seem to be two main approaches to RNG in CUDA:
- Each thread keeps its own local history and operates independently. This can be seen in the Mersenne Twister sample in the CUDA SDK (which has a very short history of 19 values). It usually requires an expensive offline process to seed each thread appropriately to avoid correlation. I can’t spare the registers or local memory for this approach.
- Have a single generator per thread block, parallelise the update across all threads and synchronise using __syncthreads. This is the approach in the recent MTGP CUDA sample. I can’t use it because I am allowing each warp in the block to process jobs independently (using persistent threads), so calls to __syncthreads, which synchronise every thread in the block, are not possible.
What I ended up with is basically a modified version of MTGP (the second approach above), but with each warp able to grab random numbers independently from the shared MT state. This had the nice side-effect of reducing the shared memory footprint to be the same as the equivalent CPU MT implementation.
Adventures in CUDA Path Tracing: Part 1
I thought I’d have a go at implementing some path tracing in CUDA. Let’s start simple: a classical path tracer with explicit direct lighting. Lots of hacks:
- No BVH yet, every ray tests the 30 triangles of the Cornell Box
- Every surface is Lambertian (so cosine-weighted hemisphere sampling for spawning rays)
- Hardcoded for a single area light (which the camera cannot see)
- Uses a Möller ray-triangle intersection test copy-pasted from CPU code
- Random number generation got moved to a texture read (with the texture data updated CPU-side) to avoid absurd register counts
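For reference, cosine-weighted hemisphere sampling can be sketched on the CPU as below. This is a minimal illustration rather than the code from the path tracer itself: the `Vec3` type and both function names are made up for the example, and it assumes a local frame in which +z is the surface normal and two uniform random numbers in [0, 1) are supplied by the caller.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal vector type; directions are in a local frame
// where +z is the surface normal.
struct Vec3 { float x, y, z; };

// Maps two uniform random numbers u1, u2 in [0, 1) to a unit direction on
// the hemisphere around +z with probability density cos(theta) / pi.
// Technique: pick a point uniformly on the unit disk, then project it up
// onto the hemisphere.
Vec3 sampleCosineHemisphere(float u1, float u2) {
    assert(u1 >= 0.0f && u1 < 1.0f && u2 >= 0.0f && u2 < 1.0f);
    const float kTwoPi = 6.28318530718f;
    float r   = std::sqrt(u1);         // radius of the point on the disk
    float phi = kTwoPi * u2;           // angle of the point on the disk
    float z   = std::sqrt(1.0f - u1);  // height above the disk = cos(theta)
    return Vec3{ r * std::cos(phi), r * std::sin(phi), z };
}

// Probability density (with respect to solid angle) of a sampled direction.
float cosineHemispherePdf(const Vec3& d) {
    const float kInvPi = 0.31830988618f;
    return d.z * kInvPi;
}
```

The appeal for a purely Lambertian scene is that the cosine term and the 1/pi of the BRDF cancel against this density, so each bounce just multiplies the path throughput by the surface albedo.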
Convergence
I’m extremely excited about the results of Understanding the Efficiency of Ray Traversal on GPUs, and the related work by NVIDIA on ray traversal. In a programming way of course.
There’s this interesting paradigm shift from a strongly geometric grid model to one where we have persistent threads running small kernels (or actually large kernels due to the way CUDA code is currently linked) and grabbing their own jobs asynchronously. The interesting thing about this shift is that this is the way PS3 developers on Cell have been writing SPU job systems for years. Now I admit that the underlying hardware is radically different (massive hardware threading and wide SIMD vs no hardware threading and more conventional SIMD), but the same simple primitives of a resident kernel using atomic increment to grab from a shared job list still apply. I have no idea where these programming models are going to converge, but it certainly looks like they will.
(Atomic increment only requires CUDA compute capability 1.1, so even your one-year-old laptop with an NVIDIA mobile chipset can probably run this sort of code. Of course it’s nicer with the warp voting primitives found on compute 1.3 hardware, but you can emulate these through shared memory, so no need to go bargain hunting for a GTX 260 just yet.)
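The "resident worker grabs jobs with an atomic increment" primitive is small enough to sketch on the CPU. The following is only an illustration under assumed names: `JobList`, `workerLoop` and the toy "square a number" job are hypothetical, and a CUDA version would use something like `atomicAdd` on a counter in global memory instead of `std::atomic`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shared job list: job i means "square input[i] into output[i]".
struct JobList {
    std::vector<int> input;
    std::vector<int> output;
    std::atomic<std::size_t> next{0};  // index of the next unclaimed job
};

// Resident worker: keeps running until the list is drained, claiming one job
// at a time with an atomic increment. Returns how many jobs this worker ran.
std::size_t workerLoop(JobList& jobs) {
    assert(jobs.input.size() == jobs.output.size());
    std::size_t done = 0;
    for (;;) {
        // fetch_add returns the previous value, so each caller receives a
        // unique index even when many workers race on the counter.
        std::size_t i = jobs.next.fetch_add(1, std::memory_order_relaxed);
        if (i >= jobs.input.size()) break;  // nothing left to claim
        jobs.output[i] = jobs.input[i] * jobs.input[i];
        ++done;
    }
    return done;
}
```

Any number of threads (or SPUs, or warps) can run `workerLoop` against the same `JobList` at once. No job is handed out twice, and a worker that finishes early simply claims more work instead of idling. The relaxed ordering is enough for claiming indices; the results themselves become visible to the launcher once the workers have been joined.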
OpenCL on the CPU
So the old news is that the OpenCL specification has been done in record time and endorsed by all the major GPU manufacturers.
This is many kinds of awesome, but I’m wondering if any particular vendor is going to concentrate on a CL_DEVICE_TYPE_CPU implementation. I think a CPU implementation of OpenCL is important for two reasons:
- Debugging. Have you ever tried to debug a large CUDA kernel? This is my number 1 reason for a CPU implementation, as we can generate some nice debug info and use our favourite debugger.
- Wider Adoption. Not everyone has access to a machine with a 1-million-thread GPU from the future. However, pretty much everyone has multiple SIMD cores, even in one-year-old laptops. If low/mid performance can be achieved by using SIMD, software fibers, and multiple physical cores, then a developer can write extremely scalable code with minimal requirements for a baseline spec.
Wikipedia states that LLVM are doing the initial implementation of OpenCL, but gives no citation. Perhaps I’ve missed some announcement or other, but if I get to read about full-featured CPU OpenCL support for a popular compiler (e.g. GCC, MSVC) then I will be very happy!