With each generation of mobile GPUs the available performance budget sees a significant increase. That performance increase opens up the door to new techniques never tried in mobile before. Post-processing is one of those areas that were traditionally underutilized in mobile - with the vast performance increases of the latest GPUs, that should no longer be the case.

This article is devoted to a post-processing technique we've been experimenting with revolving around realistic volumetric fog with shadows. This technique empowers developers to bring a more cinematic look to their scenes. The heart of the technique is of course ray marching but the interesting part is the de-noising techniques used to achieve good quality with a relatively small performance hit. But first a few screenshots with the current implementation of the effect in action.

Figure 1a and ab: Sponza with clustered volumetic fog

Figure 2a and 2b: Sponza with clustered volumetic fog

Figure 3a and 3b: Sponza with clustered volumetic fog

The final result is calculated in three render passes, the main pass and two blur passes. To avoid confusion, render pass is roughly the work between a glBindFramebuffer() and before the next glBindFramebuffer(). The main pass performs ray marching in view space to calculate and accumulate the light color of the dust particles. The next two passes are a special separable blur filter (more details later). These three passes run at 1/16 of the main framebuffer resolution (width/4, height/4) and the result is blended back to the HDR light buffer (upsampling). Fog is a low frequency effect and working at 1/16 of the resolution gives acceptable results for the most part. High frequency changes are observed in areas with depth discontinuities. In those areas the upsampling operation fails to determine the correct sample and the result can get ugly

Ray marching

The ray marching pass is using the same set of structures that are used for clustered deferred shading and/or clustered forward+ shading. For those not familiar, clustered shading partitions the frustum into a number of clusters. That division allows the application to assign light volumes in those clusters (that normally happens in the CPU) and the various light shaders make use of this classification to access nearby lights.

Figure 4: Dividing the frustum in clusters

In the main pass the fragment shader visits each cluster in the Z direction in order to compute and accumulate the light color of the dust particles. The number of times we sample per cluster is variable and depends on how deep the cluster is and how high the dust particle density is (dust_particles/meters in Z axis). Typically, the clusters are not evenly divided across the Z axis with clusters close to the camera being very small compared to distant ones.

Figure 5: Variation of number of samples between clusters

For reference, in our prototype demo we sample from one time to a maximum of four times per cluster. At the same time the frustum is split 32 times across the Z axis and the far plane is 500 game units. At most, we are facing 4*32=128 samples per fragment. In practice though we never encounter this upper limit if we early exit on clusters that fail the depth test and if we keep the particle density sane.

The naive algorithm is as follows:

MAX_SAMPLES_PER_CLUSTER = 4; DIST_BETWEEN_SAMPLES = 0.25; lightColor = vec3(0.0); depth = readDepthBuffer(); // NDC is the normalized device coordinates of the quad. // fragPos is the view space position of the fragment. farPos = unproject(NDC, depth); // View direction viewDir = normalize(farPos); // kNear is a Z in view space. Our ray marching starts from camera's near so initialy it's -cameraNear. // It's negative because in OpenGL the camera look vector is the (0,0,-1) kNear = -cameraNear; for(k = 0; k < CLUSTER_COUNT_Z; k++) { // kFar is a Z in view space and it give the far bound of the cluster. kFar = computeClusterFar(k); // Compute sample count per cluster clusterDepth = kNear - kFar; samplesf = clamp(clusterDepth / DIST_BETWEEN_SAMPLES, 1.0, float(MAX_SAMPLES_PER_CLUSTER)); dist = 1.0 / samplesf; // Iterate inside the cluster for(j = 0; j < uint(samplesf); ++j) { // zMedian is the Z in view space of the current sample zMedian = kNear + dist * float(j); // Check if the current sample will fall behind the depth if(zMedian < farPos.z) { k = CLUSTER_COUNT_Z; // Break the outer loop break; } // Compute sample's view space position. The equation is derived by colliding the near plane with // the viewDir fragPos = viewDir * (zMedian / viewDir.z); // Now we have everything, compute the light color of the sample and accumulate lightColor += computeLightColor(fragPos, k); } kNear = kFar; } return lightColor;

The outer loop iterates all clusters in the Z axis. Every cluster has a depth that starts from kNear until kFar. For every cluster we compute the number of samples given the dust particle density, the max value and the depth of the cluster (kNear - kFar). Then we iterate the samples to calculate a new Z that lies between kNear and kFar called zMedian. The fragment position (fragPos) is the intersection point of viewDir with the plane defined by the normal (0,0,-1) and offset zMedian. By simplifying the intersection equation we end up with just a multiplication and a division. In our case the fragPos will be used in the light computations. At this point we have to note that only the diffuse term is calculated and the specular is totally ignored. This is based on the assumption that dust particles have roughness close to 1.0.

Figure 6: Cluster N, Sample 0

Figure 7: Cluster N, Sample 1

Figure 8: Cluster N+1, Sample 0

The above figures show how the algorithm walks the cluster.

Video 1: The output of the naive pass (camera moving back and forth)

The result of the naive algorithm above is quite bad since the sample locations are fixed inside the cluster. The obvious and lazy solution is to increase the number of samples per cluster. Unfortunately keeping the number of samples low is a very important factor for the cost of the algorithm.

A better way is to add a random bias to for each fragment. That way, neighbour fragments end up covering more area and catch more details inside the cluster.

Figure 9: Randomly jittered samples

Video 2: Randomly jittered samples in action

For the above result we used a 64x64 blue noise texture (You can download a number of noise textures and read the excellent writeup here http://momentsingraphics.de/?p=127). This texture provides a factor value and that factor is what jitters the starting point of each sample.

The GLSL code is the following:

noiseTexUv = vec2(FB_SIZE) / vec2(NOISE_MAP_SIZE) * UV; randFactor = clamp(texture(noiseTex, noiseTexUv).r, 0.0, 1.0); ... zMedian = kNear + dist * float(j) + dist * randFactor;

Temporal feedback

In the previous section we used noise to randomize the starting position of each sample. The next logical step is to have a different starting point between frames as well. That can be achieved by animating a blue noise 2D texture array or just translating the UV coordinates between frames.

Video 3: Animating the noise between frames

This is how the fog looks like when animating the noise. In the case above we use a 2D array noise texture and animate it by using a different slice for each frame. At the same time we animate the texture in the Y axis to simulate the effect of dust falling from the top of the screen. Animating the noise doesn't solve anything by itself unless we blend the result of the previous frame with the current one. That way and over time the samples will end up covering all of the cluster's range.

This is how the current and previous results are blended together:

history = textureLod(historyTex, UV, 0.0).rgb; lightColor = mix(history, lightColor, 1.0 / 16.0);

Video 4: The result of the temporal modulation under motion

The 1/16 modulation factor is quite aggressive and it creates very ugly ghosting under motion. However, when the camera is still and the scene static the result is quite good. Other temporal effects re-project the history buffer to give some sort of stability under motion. Fog though is not depth dependent unlike SSAO, temporal AA or other temporal filters, thus, in theory re-projection doesn't make much sense for fog. In practice though re-projecting the history buffer gives a result good enough to trick the human eye.

Video 5: Temporal re-projection

Re-projecting the history buffer doesn't completely eliminate the ghosting. In the image above the column's position changes between frames and that creates a black trail. One solution is to "correct" the history buffer feedback.

history = textureLod(historyTex, reprojUV, 0.0).rgb; history = max(lightColor, history); out_color = mix(history, lightColor, 1.0 / 16.0);

The code above just uses a max. This is a simple fix that removes the ghosting but it adds some noise back to the picture.

Video 6: Temporal re-projection with the "max" fix

Separable blur

By introducing temporal feedback in the ray marching pass we removed the majority of the noise. One extra thing we can try is to blur the result to remove it completely.

Figure 10: The reference image without blurring

Figure 11: The reference image with Gaussian blur

In figure 11 we use an ordinary Gaussian blur filter. The problem with that blur filter is that it makes the fog bleed into the column. In some cases this issue can be medicated by using a less aggressive blur kernel or by just ignoring the problem completely. One solution that we tried was to use a depth aware blur filter. Higher depth discontinuities most of the time translate to lower blend weights between the blur samples. In practice though that didn't work well. The main problem is that it's difficult to chose a depth threshold that moderates the weights.

An alternative solution that we tried was to use a luminance-aware blur filter. The luminance is used to compute the blend weight of the samples.

float computeLumaWeight(float refLuma, vec3 col) { float l = computeLuminance(col); float diff = abs(refLuma - l); float weight = 1.0 / (EPSILON + diff); return weight; } out_color = vec3(0.0); float refLuma = computeLuminance(texture(colorTex, UV).rgb); float weight = 0.0; for(uint i = 0u; i < STEP_COUNT; ++i) { vec2 texCoordOffset = OFFSETS[i] * TEXEL_SIZE; vec2 uv = UV + texCoordOffset; vec3 col = texture(u_colorTex, uv).rgb; float w = computeLumaWeight(refLuma, col); out_color += col * w; weight += w; uv = in_uv - texCoordOffset; col = texture(u_colorTex, uv).rgb; w = computeLumaWeight(refLuma, col); out_color += col * w; weight += w; } out_color = out_color / weight;

Figure 12: The reference image with luma-aware blur

Unfortunately the luma-aware blur requires more samples to be comparable to the Gaussian blur. Also, to have the maximum blur effect the input image needs to be as less noisy as possible. The positive from using this luma-aware blur is that it doesn't cause any sort of bleeding.

Something we haven't experimented with yet is to use a median blur filter instead of the luma-aware. Median blur might be a better alternative for noise reduction.

Blending the result back to the light buffer

The final step of the technique is to blend the fog result back to the fullscreen light buffer. The simple solution is to just perform a fullscreen blit and the expensive one is to use a bilateral upsample filter. Compared to the naive blit most bilateral upsample algorithms require sampling two additional textures. These extra textures are two depth buffers. The first depth buffer is the fullcreen depth buffer and the second is the downsampled depth buffer.

The naive blit upsample might cause some bleeding but overall it gives acceptable results. Paying the bandwidth cost of some of the advanced bilateral upsampling algorithms might not worth it though.

Conclusion and performance considerations

The main pass, as presented in the pseudocode, is not the optimal way to implement clustered ray marching in Mali. The main problem is the high register pressure in pre-Bifrost family of GPUs (T8XX and lower). A better alternative would be to simplify the shader and split the main pass into multiple fullscreen drawcalls. Each drawcall will accumulate the result using fixed function blending. Note that fixed function blending is very bandwidth efficient on Mali and a very good way to accumulate color results to a framebuffer.

The outer loop in the main pass iterates from zero to the number of clusters in Z axis minus one (CLUSTER_COUNT_Z - 1). One additional enhancement would be to pre-compute the start and end clusters in the light binning step. The starting cluster would be the first cluster with lights and similarly for the end cluster. Depending on the light density of the scene this trick might decease the loop iterations dramatically.

Another solution for the main pass is to have a compute job that visits the clusters to gather the zMedian and other relevant light data into per-tile lists. By removing redundant cluster calculations to another job, the main pass will be simplified quite a bit.

Bandwidth is another important aspect that needs careful consideration. The main pass samples the depth and the history buffers and the separable blur ping-pongs the main pass' buffer. Depending on the quality you want to achieve or the content you are making it might be a good idea to opt out from using some of the de-noising techniques presented above. Also, you can try trading some ALU performance by increasing the max samples per cluster (MAX_SAMPLES_PER_CLUSTER) for some bandwidth (eg dropping the separable blur passes). Since the technique consists of discrete steps it's easy to remove those steps that don't improve enough the final result.

Feel free to provide feedback and/or share your thoughts on similar experiments. If your engine already supports clustered deferred or forward shading experimenting with clustered fog should be trivial.