Simon's Graphics Blog

Work log for ideas and hobby projects.

Adventures in CUDA Path Tracing: Part 1

with 5 comments

I thought I’d have a go at implementing some path tracing in CUDA. Let’s start simple: a classical path tracer with explicit direct lighting. Lots of hacks:

  • No BVH yet, every ray tests the 30 triangles of the Cornell Box
  • Every surface is lambertian (so cosine weighted hemisphere sampling for spawning rays)
  • Hardcoded for a single area light (which the camera cannot see)
  • Uses a Möller ray/triangle intersection test copy-pasted from CPU code
  • Random number generation got moved to a texture read (with the texture data updated CPU-side) to avoid absurd register counts
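To make the first, second and fourth bullets concrete, here is a minimal host-side C++ sketch, not the kernel from this post. It assumes "Moller" means the Möller-Trumbore ray/triangle test, and every name in it (`Vec3`, `Triangle`, `intersect`, `closestHit`, `cosineSampleHemisphere`) is hypothetical. The brute-force loop stands in for "every ray tests the 30 triangles", and the sampler returns a direction in a local frame whose z axis is the surface normal.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 v0, v1, v2; };

inline Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Möller-Trumbore test for the ray orig + t * dir.
// Returns true for a hit in front of the origin; t is only valid on true.
inline bool intersect(Vec3 orig, Vec3 dir, const Triangle& tri, float& t) {
    const float kEps = 1e-6f;
    Vec3 e1 = sub(tri.v1, tri.v0);
    Vec3 e2 = sub(tri.v2, tri.v0);
    Vec3 p = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < kEps) return false;      // ray parallel to the plane
    float invDet = 1.0f / det;
    Vec3 s = sub(orig, tri.v0);
    float u = dot(s, p) * invDet;                 // first barycentric coordinate
    if (u < 0.0f || u > 1.0f) return false;
    Vec3 q = cross(s, e1);
    float v = dot(dir, q) * invDet;               // second barycentric coordinate
    if (v < 0.0f || u + v > 1.0f) return false;
    t = dot(e2, q) * invDet;
    return t > kEps;
}

// No BVH: test every triangle and keep the nearest hit.
// Returns the triangle index, or -1 if the ray misses everything.
inline int closestHit(Vec3 orig, Vec3 dir, const Triangle* tris, int count, float& tNear) {
    assert(count >= 0);
    int best = -1;
    tNear = std::numeric_limits<float>::infinity();
    for (int i = 0; i < count; ++i) {
        float t = 0.0f;
        if (intersect(orig, dir, tris[i], t) && t < tNear) {
            tNear = t;
            best = i;
        }
    }
    return best;
}

// Cosine-weighted hemisphere sample from two uniform numbers in [0, 1):
// pick a point on the unit disk and project it up onto the hemisphere.
// The resulting pdf is cos(theta) / pi, which cancels the lambertian cosine term.
inline Vec3 cosineSampleHemisphere(float u1, float u2) {
    const float kTwoPi = 6.28318530718f;
    float r = std::sqrt(u1);
    float phi = kTwoPi * u2;
    return {r * std::cos(phi), r * std::sin(phi), std::sqrt(1.0f - u1)};
}
```

In a CUDA build these functions would presumably be marked `__device__`, and `u1`, `u2` would come from the random number texture rather than from a host RNG.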

Some initial kernel stats and performance on my lowly GeForce 9500 GT at 512×512 for a single ray per pixel:

  • 0 bounces: 34 registers: 21ms/frame (12.4 Mrays/s)
  • 1 bounce: 45 registers: 51ms/frame (approx 10.2 Mrays/s)
  • 2 bounces: 45 registers: 81ms/frame (?? Mrays/s)

I’ve no idea how many rays/s I get for 2+ bounces: many rays will have terminated by then, and I haven’t put any debug counters in to count them. Note the big increase in register count at 1 bounce, which comes from adding a classic path tracing loop to the kernel.
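The throughput figures can be roughly reproduced with a little arithmetic. The sketch below assumes the count includes one ray per path segment (so 1 ray per pixel at 0 bounces and 2 at 1 bounce) and leaves out shadow rays; the function name `mraysPerSecond` is made up for illustration.

```cpp
#include <cassert>

// Throughput in millions of rays per second for one frame.
// raysPerPixel is the number of path segments traced per pixel sample
// (an assumption about how the figures in the list were counted).
inline double mraysPerSecond(int width, int height, double raysPerPixel, double frameMs) {
    assert(frameMs > 0.0);
    double rays = static_cast<double>(width) * height * raysPerPixel;
    return rays / (frameMs * 1e-3) / 1e6;
}
```

With these assumptions, `mraysPerSecond(512, 512, 1, 21)` comes to about 12.5 and `mraysPerSecond(512, 512, 2, 51)` to about 10.3, close to the quoted 12.4 and 10.2. For 2 bounces, `mraysPerSecond(512, 512, 3, 81)` gives about 9.7, which can only be an upper bound because some paths terminate early.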

Performance for this simple scene is not good. My occupancy is an awful 17% for each of the kernels. This obviously needs improving, but I’m way over the register limit. To get to 50% occupancy, I need to get down to 20 registers; to get to 100%, down to 10. Switching to a more CUDA-friendly ray/triangle test will probably help a bit, but it isn’t going to perform miracles. The problem is the kernel structure itself: it’s trying to do everything in one loop. From reading the very nice CUDA-related papers from this year’s SIGGRAPH, I realise that I’d have to move to some job-based system eventually, but I found it surprising to suffer from register problems at such low complexity.
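The occupancy numbers follow from a simple resource model, sketched below in host-side C++. The limits are assumptions for a compute capability 1.1 part like the 9500 GT (8192 registers and 24 resident warps per multiprocessor, registers allocated per block in units of 256). The block size of 128 threads is a guess that happens to reproduce the 17% figure, and shared memory limits are ignored. The names `activeWarps` and `occupancy` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-multiprocessor limits for compute capability 1.1.
const int kRegistersPerSM = 8192;
const int kMaxWarpsPerSM  = 24;    // 768 threads
const int kMaxBlocksPerSM = 8;
const int kWarpSize       = 32;
const int kRegAllocUnit   = 256;   // per-block register allocation granularity

// Number of warps that can be resident on one multiprocessor.
inline int activeWarps(int regsPerThread, int threadsPerBlock) {
    assert(regsPerThread > 0 && threadsPerBlock > 0);
    int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    // Registers needed by one block, rounded up to the allocation unit.
    int regsPerBlock = regsPerThread * warpsPerBlock * kWarpSize;
    regsPerBlock = (regsPerBlock + kRegAllocUnit - 1) / kRegAllocUnit * kRegAllocUnit;
    int blocksByRegs  = kRegistersPerSM / regsPerBlock;
    int blocksByWarps = kMaxWarpsPerSM / warpsPerBlock;
    int blocks = std::min(kMaxBlocksPerSM, std::min(blocksByRegs, blocksByWarps));
    return blocks * warpsPerBlock;
}

// Occupancy as a fraction of the maximum resident warps.
inline double occupancy(int regsPerThread, int threadsPerBlock) {
    return static_cast<double>(activeWarps(regsPerThread, threadsPerBlock)) / kMaxWarpsPerSM;
}
```

Under this model both the 34 and 45 register kernels fit only one 128-thread block per multiprocessor, which is 4 of 24 warps, or about 17%. Dropping to 20 registers allows three blocks (50%), and 10 registers allows six (100%), matching the targets above.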

More on this topic soon (I don’t get paid to experiment with CUDA sadly). In the meantime, here are some “novel viewpoint” shots of the 2-bounce kernel as it accumulates rays per pixel (you can see just how much my RNG sucks):

1 ray per pixel

16 rays per pixel

512 rays per pixel

Written by Simon Brown

August 15th, 2009 at 2:55 pm

5 Responses to 'Adventures in CUDA Path Tracing: Part 1'

  1. Looks good. Performance is kind of disappointing, but then maybe the latest beast from nvidia would blow your socks off? I have no idea. Did this take much time to implement? Is it a lot of CUDA code?

    Kevin

    16 Aug 09 at 3:19 pm

  2. According to Wikipedia, the 9500 GT has around 1/8 as many shader cores as a 275 GTX (which I’m tempted to buy), so that should give me an immediate 8x speedup. A 275 GTX (being compute 1.3) would also have double the register count per multiprocessor, and since I’m massively register bound I’d expect a further 2x speedup.

    The kernel is tiny: maybe a couple of hundred lines of C. Took a few evenings to get this far, but I have plenty of CPU reference code to copy from.

    Simon Brown

    16 Aug 09 at 4:48 pm

  3. Nice work Simon!

    Although switching to a persistent-threads-based system probably won’t help your register pressure issues.
    Registers are statically allocated when a kernel is kicked off on a multiprocessor, so for all your jobs handled by persistent threads you will pay the cost of your max reg usage on all of them (and perhaps even more).

    Marco Salvi

    23 Aug 09 at 1:08 am

  4. Hey Marco, nice to hear from you!

    Yep, it wouldn’t be the persistent threads themselves that would reduce pressure; I would need to do less work in each job too. Of course this increases bandwidth requirements, atomic serialisation costs, etc., since I must now save/restore state for each job as the lists get executed, so I have no idea how much of a win this will be in practice.

    Simon Brown

    23 Aug 09 at 10:01 am

  5. Hi Simon,

    I wonder how you get such clean results in the GI using path tracing. Do you also calculate the direct light at each hit point of the path? Otherwise most paths would never reach the small area light source, resulting in a dark pixel color. Or do you use some kind of importance sampling?
    My other question is how you use the CUDA threads. Are you using bucket rendering or some kind of image partitioning? Or do you use other techniques to avoid the watchdog time-out? Do you sum up the sampling values in your kernel or outside using the CPU?

    Thanks in advance.

    Daniel

    Daniel

    28 Dec 09 at 5:19 pm
