Archive for the ‘GPGPU’ Category
Sampling Sun And Sky
In this post I will briefly cover how I implemented sampling of external light sources in a path tracing framework, concluding with an observation about sampling multiple external light sources that are non-zero over very different solid angles. I’m going to assume the reader is familiar with path tracing in the Veach framework.
My definition of an external light source, which I’ve also seen called an “infinite” light source since it is considered to be infinitely far away (and infinitely bright as a result), is as follows:
- Radiance always originates from outside of the scene bounds
- Radiance is a function of world space direction only (not sample position)
A simple example would be a cube map considered to be always centered at the sample point.
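To make the "direction only" property concrete, here is a minimal C++ sketch of a cube map lookup. The names `Vec3`, `CubeCoord` and `cubeLookup`, the face ordering and the unflipped (u, v) mapping are all assumptions for illustration, not a standard convention or the post's own code. Note that the function takes no sample position, and that scaling the direction does not change the result.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Face index plus texture coordinates in [0, 1] on that face.
struct CubeCoord { int face; float u, v; };

// Maps a non-zero world-space direction to a cube map face and (u, v).
// Assumed face order: 0..5 = +X, -X, +Y, -Y, +Z, -Z. No per-face axis
// flipping is applied, so this does not match any particular graphics API.
inline CubeCoord cubeLookup(Vec3 d) {
    const float ax = std::fabs(d.x);
    const float ay = std::fabs(d.y);
    const float az = std::fabs(d.z);
    CubeCoord c;
    float a, b, m;  // the two minor components and the major-axis magnitude
    if (ax >= ay && ax >= az) {
        c.face = d.x >= 0.0f ? 0 : 1; m = ax; a = d.y; b = d.z;
    } else if (ay >= az) {
        c.face = d.y >= 0.0f ? 2 : 3; m = ay; a = d.x; b = d.z;
    } else {
        c.face = d.z >= 0.0f ? 4 : 5; m = az; a = d.x; b = d.y;
    }
    // Project onto the face plane, then remap [-1, 1] to [0, 1].
    c.u = 0.5f * (a / m + 1.0f);
    c.v = 0.5f * (b / m + 1.0f);
    return c;
}
```

The radiance returned for a ray would then be a texel fetch at `(face, u, v)`, whatever point in the scene the ray started from.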
SketchUp Cities
Ray Tracey’s latest blog post has Brigade 2 renders of a nice-looking walled city scene created using Google SketchUp. The model came from this gem of a collection by “LordGood” (who evidently is a big Assassin’s Creed fan) hosted on Google 3D Warehouse.
Currently I only have a Blender exporter, and sadly the SketchUp-COLLADA-Blender path was producing garbage, but even the free version of SketchUp allows custom Ruby plugins. After a bit of hunting around I found this OBJ exporter Ruby plugin which worked very well, and now I have much nicer test meshes than my bad Blender programmer art.
The above images are from my usually-being-refactored path tracer with Preetham (et al) sun/sky. It doesn’t render as quickly as Brigade 2 (also I only have a lowly GTX 460 to render on), and yes it’s all diffuse and I haven’t exported any of the textures, and there are no atmospheric terms or depth of field or shading normals or remote-controlled Stanford bunny, but it’s nice to have some decent public domain data to use.
I’m slowly working on a Bidirectional Instant Radiosity post (hopefully using this scene) but it’ll have to wait until work is less mental.
Hybrid Bidirectional Path Tracing
I’d like to share my results from converting a CPU-only bidirectional path tracer into a CPU/GPU hybrid (CPU used for shading and sampling, GPU used for ray intersections). These results are a bit old… I posted them a while ago as a thread on ompf. I found out later that this thread had been cited in Combinatorial Bidirectional Path-Tracing for Efficient Hybrid CPU/GPU Rendering, so let me summarise it here.
Now You’re Lighting With Portals
I hate dome lights. You always waste a ton of rays that are occluded by geometry, and the situation gets even worse when lighting indoor scenes with exterior dome lights!
So why not help your renderer out and place portals that, when hit, teleport to the dome light? Then instead of sampling the whole skydome, we just sample the portals, and avoid sending rays where we know they will be occluded.
As an example, here’s the Sponza scene using an exterior (uniform) dome light, rendered using unidirectional path tracing with multiple importance sampling:
Lots of rays never manage to find the open roof, so we get plenty of noise. Now let’s replace the dome light with a portal that covers the open roof, then allow that to be sampled instead:
Noise is greatly reduced, for exactly the same number of rays.
The sampling algorithm is simple enough to implement in your GPU path tracer of choice: sample the portal and use the usual conversion between pdf wrt area (the portal) and pdf wrt solid angle (the dome):
\[P_\sigma = \frac{P_A \|\mathbf{v}\|^2}{\cos\theta} = \frac{P_A \|\mathbf{v}\|^3}{\mathbf{v} \cdot \mathbf{n}}\]
where v is the vector from the target point to the portal point, n is the portal normal, and θ is the angle between v and n.
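As a rough sketch of the conversion above in C++ (it should port to a CUDA device function more or less unchanged): the function name `portalPdfSolidAngle` and the small vector helpers are hypothetical. Taking the absolute value of v·n is an added assumption that treats the portal as two-sided, and the function returns zero when the portal is seen edge-on.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Converts a pdf with respect to area on the portal (pdfArea) into a pdf
// with respect to solid angle at the target point:
//   P_sigma = P_A * |v|^3 / |v . n|
// portalNormal is assumed to be unit length.
inline float portalPdfSolidAngle(float pdfArea, Vec3 target,
                                 Vec3 portalPoint, Vec3 portalNormal) {
    const Vec3 v = sub(portalPoint, target);
    const float len2 = dot(v, v);
    // Assumption: two-sided portal, so only the magnitude of v . n matters.
    const float vDotN = std::fabs(dot(v, portalNormal));
    if (len2 <= 0.0f || vDotN <= 0.0f) {
        return 0.0f;  // degenerate or edge-on: this sample cannot be generated
    }
    return pdfArea * len2 * std::sqrt(len2) / vDotN;
}
```

For a uniformly sampled portal, `pdfArea` is simply one over the portal's area. For example, a unit-area portal two units directly above the target and facing it gives a solid angle pdf of 4.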
Two-Way Path Tracing
This post is about a path tracing technique that sits between unidirectional path tracing and bidirectional path tracing.
For want of a better name, let’s call this two-way path tracing. It’s defined as follows:
- Trace eye rays, handle light source intersections and sample light sources explicitly
- Trace light rays, handle sensor intersections and sample sensors explicitly
- When computing weights for multiple importance sampling, take both tracing methods into account
So you can think of this technique as either:
- Unidirectional path tracing in both directions at once
- Bidirectional path tracing, but we only connect sub-paths if one of the sub-paths has one vertex
So why is this interesting? Because:
- Like unidirectional path tracing, you only need to track a fixed amount of state, regardless of maximum path length. This is potentially nice for GPU implementations where you usually want to avoid hitting memory and have a large number of paths in flight.
- You can efficiently multiple importance sample between forward and reverse paths, so you can get reduced variance compared to unidirectional path tracing for some types of scenes (e.g. caustics).
In this post I’d like to cover how to multiple importance sample between forward and reverse paths, and show some test images.
Adventures in CUDA Path Tracing: Part 2
This is really just a teaser post for my next update. I’ve not done much on traversal yet (hence the world of spheres), but I’ve made some progress on shading. Here’s a screenshot of a pure CUDA renderer left for 20 seconds or so to get a nice smooth result:
This scene contains a few BSDFs:
- Lambertian (the cyan, magenta and “wall” spheres)
- Perfect specular (the mirror sphere)
- Fresnel dielectric (the glass sphere)
- Blinn microfacet (the “floor” sphere)
Everything uses importance sampling to reduce variance, which lets caustics converge quite quickly even on glossy surfaces like the one shown here. I’m preparing a post to go into more details of the rendering algorithm, which is a type of path tracing that I think works quite well on the GPU…
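As an illustration of what importance sampling a BSDF means in the simplest case, here is a minimal C++ sketch for the Lambertian surfaces: cosine-weighted hemisphere sampling in a local frame where the surface normal is +z. The name `sampleCosineHemisphere` is hypothetical and this is not taken from the renderer shown; the other BSDFs in the list need their own sampling routines.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Samples a direction on the +z hemisphere with pdf = cos(theta) / pi.
// u1 and u2 are uniform random numbers in [0, 1).
// The returned direction is unit length and expressed in the local
// shading frame (normal = +z); the caller rotates it into world space.
inline Vec3 sampleCosineHemisphere(float u1, float u2, float* pdf) {
    const float kPi = 3.14159265358979f;
    const float r = std::sqrt(u1);            // radius on the unit disk
    const float phi = 2.0f * kPi * u2;        // angle on the unit disk
    const float z = std::sqrt(std::fmax(0.0f, 1.0f - u1));  // cos(theta)
    *pdf = z / kPi;
    return {r * std::cos(phi), r * std::sin(phi), z};
}
```

Because a Lambertian BSDF is albedo / π, the path throughput update BSDF × cos(θ) / pdf collapses to just the albedo, so every sample carries the same weight.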
CUDA Tips
I’ve been doing a bit more GPU programming recently, so here are some things I found when writing CUDA programs. This all refers to the CUDA compiler in the recent 3.2RC, and is based on my experience with GTX 275 hardware. In particular, this advice may need to be tweaked for Fermi architecture GPUs, since I have yet to experiment with one.
Using Optix
Optix version 2.0 was released recently, so I gave it a go by plugging it into an existing multi-core path tracer. This path tracer can submit tens of thousands of ray queries as a batch so should be a good match for Optix and the GPU.
I liked:
- Ease of use. Wow this thing makes GPU ray tracing easy: I wrote a few tiny CUDA functions, the runtime reported nice errors for my bugs, I fixed the bugs and it worked as expected!
- Net performance win. It improved the performance of the path tracer, but not by much (see below).
I disliked:
- Everything is synchronous. All optix calls seem to block for completion, so I couldn’t find a way to pipeline memory transfers with GPU work in a single optix context. Since my use case involved heavy interop between CPU and GPU, this was a big performance loss.
- No CUDA interop. There seems to be no support for using CUDA allocations in Optix kernels. So in particular you can’t use page-locked host memory to remove all those redundant (blocking) copies completely.
In conclusion I have mixed feelings about Optix. I think it’s a great tool for hobby projects or small demos, but I need async calls and much improved CUDA interop before I’d use it for anything larger.
CUDA Mersenne Twister
I needed a random number generator for a CUDA project, and had relatively few requirements:
- It must have a small shared memory footprint
- It must be suitable for Monte Carlo methods (i.e. have long period and minimal correlation)
- It must allow warps to execute independently when generating random numbers
There seem to be two main approaches to RNG in CUDA:
- Each thread has its own local history, operates independently. This can be seen in the Mersenne Twister sample in the CUDA SDK (which has a very short history of 19 values). This usually requires an expensive offline process to seed each thread appropriately to avoid correlation. I can’t spare the registers or local memory for this approach.
- Have a single generator per thread block, parallelise the update between all threads and synchronise using __syncthreads. This is the approach in the recent MTGP CUDA sample. I can’t use this approach because I am allowing each warp in the block to process jobs independently (using persistent threads) – calls to __syncthreads to synchronise every thread in the block are not possible.
What I ended up with is basically a modified version of MTGP (the second approach above), but with each warp able to grab random numbers independently from the shared MT state. This had the nice side-effect of reducing the shared memory footprint to be the same as the equivalent CPU MT implementation.
Adventures in CUDA Path Tracing: Part 1
I thought I’d have a go at implementing some path tracing in CUDA. Let’s start simple: a classical path tracer with explicit direct lighting. Lots of hacks:
- No BVH yet, every ray tests the 30 triangles of the Cornell Box
- Every surface is lambertian (so cosine weighted hemisphere sampling for spawning rays)
- Hardcoded for a single area light (which the camera cannot see)
- Uses copy-pasted Möller intersection test from CPU code
- Random number generation got moved to a texture read (with the texture data updated CPU-side) to avoid absurd register counts
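For reference, the "no BVH" approach in the list above amounts to something like the following C++ sketch. It assumes the intersection test meant is the Möller–Trumbore ray/triangle test, which is a guess since the post's own code is not shown, and the names `Triangle`, `intersectTriangle` and `closestHit` are hypothetical.

```cpp
#include <cmath>
#include <limits>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 v0, v1, v2; };

inline Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Moller-Trumbore test (two-sided). Returns true and writes the ray
// parameter to *tOut if the ray hits the triangle with 0 < t < tMax.
inline bool intersectTriangle(Vec3 orig, Vec3 dir, const Triangle& tri,
                              float tMax, float* tOut) {
    const float kEps = 1e-7f;
    const Vec3 e1 = sub(tri.v1, tri.v0);
    const Vec3 e2 = sub(tri.v2, tri.v0);
    const Vec3 p = cross(dir, e2);
    const float det = dot(e1, p);
    if (std::fabs(det) < kEps) return false;  // ray parallel to triangle
    const float inv = 1.0f / det;
    const Vec3 s = sub(orig, tri.v0);
    const float u = dot(s, p) * inv;          // first barycentric coordinate
    if (u < 0.0f || u > 1.0f) return false;
    const Vec3 q = cross(s, e1);
    const float v = dot(dir, q) * inv;        // second barycentric coordinate
    if (v < 0.0f || u + v > 1.0f) return false;
    const float t = dot(e2, q) * inv;
    if (t <= 0.0f || t >= tMax) return false;
    *tOut = t;
    return true;
}

// Brute force: test every triangle, keep the nearest hit.
// Returns the triangle index, or -1 on a miss.
inline int closestHit(Vec3 orig, Vec3 dir, const Triangle* tris, int count,
                      float* tOut) {
    int hit = -1;
    float tBest = std::numeric_limits<float>::infinity();
    for (int i = 0; i < count; ++i) {
        float t;
        if (intersectTriangle(orig, dir, tris[i], tBest, &t)) {
            tBest = t;
            hit = i;
        }
    }
    *tOut = tBest;
    return hit;
}
```

With only 30 triangles the linear loop is cheap enough to defer building an acceleration structure.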




