In this blog post I will give a quick overview of the API created by our PowerVR Ray Tracing team. This API gives developers access to the PowerVR Ray Tracing hardware known as the Wizard architecture. I will briefly go over the physical changes at the hardware level and what has changed compared to a traditional GPU. Then I will look at new extensions to OpenGL ES 3.1 that we have created to give developers access to the hardware.

An overview of the API exists in PDF form here; the talk was given at Imagination’s idc16 developer conference at GDC 2016.

I will assume the reader knows what ray tracing is. If not, you can have a look at some of the higher-level information we have on ray tracing. This blog post will not go into much detail on each feature; we are planning to post more in-depth articles on the individual features in the future.

Ray tracing blocks

A while ago we introduced our quad-cluster PowerVR GR6500 GPU with ray tracing and more recently we showed off some of the ray tracing applications we now have running on our GPU.

The diagram below shows the hardware features that we have added to accelerate ray tracing:

A block diagram showing the new hardware blocks for ray tracing

Ray Data Master

The Ray Data Master (RDM) block feeds intersections to the Unified Shading Clusters (USCs) which then execute ray shaders.

Scene Hierarchy Generator

The Scene Hierarchy Generator (SHG) block is used to generate the acceleration structure behind the ray intersection processor using world-space vertices streamed from standard vertex shaders.

Coherency Engine

This block is part of the Ray Tracing Unit (RTU) and buffers up a set of rays that traverse the scene hierarchy in the same way and follow similar execution paths. This helps avoid cache misses when fetching the ray data from memory and therefore reduces bandwidth and power usage. There are a lot more details on how we achieve this, but that will need to wait for another blog post. In the meantime, you can find more details about the coherency engine in our 2014 GDC talk.
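The hardware's actual coherency gathering scheme is not public, but the idea of batching rays that traverse the hierarchy similarly can be illustrated with a toy CPU sketch. Everything here (the `Ray` struct, `directionBin`, `binRays`) is hypothetical illustration, not the Coherency Engine's real algorithm: we simply group rays by their dominant direction, since rays pointing the same way tend to touch the same parts of the hierarchy and so share cache lines.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Ray { float dir[3]; };

// Quantise a ray's dominant direction into one of 6 bins (+/-X, +/-Y, +/-Z).
// Rays in the same bin tend to traverse similar parts of the hierarchy.
inline std::size_t directionBin(const Ray& r)
{
    std::size_t axis = 0;
    float best = 0.0f;
    for (std::size_t i = 0; i < 3; ++i)
    {
        const float a = r.dir[i] < 0.0f ? -r.dir[i] : r.dir[i];
        if (a > best) { best = a; axis = i; }
    }
    return axis * 2 + (r.dir[axis] < 0.0f ? 1 : 0);
}

// Gather rays into per-bin packets before traversal, so each packet's
// hierarchy fetches stay coherent.
inline std::array<std::vector<Ray>, 6> binRays(const std::vector<Ray>& rays)
{
    std::array<std::vector<Ray>, 6> packets;
    for (const Ray& r : rays)
        packets[directionBin(r)].push_back(r);
    return packets;
}
```

The real hardware does far more than this (it tracks traversal state, not just direction), but the sketch captures why coherent batches reduce memory traffic.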

Ray Intersection Processor

This block is part of the RTU and tests rays against triangles and against the scene hierarchy located in main memory, generated by the SHG.

Frame Accumulator Cache

This hardware block takes accumulate instructions from the USCs and, using a write-combining cache, accelerates certain image atomic operations that are useful in ray tracing shaders. This means we can issue write-only instructions that are queued up and executed asynchronously.

The PowerVR GR6500 GPU integrated on the PCIe card shown below has a peak performance of 300 million rays per second using the RTU; the SHG is targeted at generating the acceleration structure and vertex output from 100 million dynamic triangles per second.

The API

Our ray tracing hardware augments our typical GPU design rather than being a separate piece of hardware. We decided to extend OpenGL ES 3.1 to add ray tracing functionality so that we can leverage the work that has gone into OpenGL ES, a familiar API for many developers. We chose to use modern paradigms such as direct state access and bindless resources, as we foresee this being the pattern for APIs in the future. We are also aware of the Vulkan API; however, this article will only cover OpenGL ES extensions as Vulkan is still under development.

Vertex processing

The input to a ray tracer is different from the input to a rasteriser: a ray tracer works with objects in world space, because it needs access to the entire scene. An object in the scene could, for example, be reflecting one of the objects behind the camera; in a rasteriser this would be difficult to handle.

Because of this, the output of vertex processing with the PowerVR hardware ray tracing API is in world space, not clip space as with a rasteriser. This means we are going to be running the vertex part of the pipeline at different rates than a rasteriser would – in a rasteriser the camera is usually moving from frame to frame, and even if it isn’t, most games will re-render the scene because something else will change.

This is different in a ray tracer. We are decoupled from camera movement so we could run our vertex shaders at the start of our application, then never have to run them again if nothing in the scene moves in world space. Of course we will still have to run the ray traversal part of the pipeline each frame when the camera moves, but not the scene hierarchy generation part of the pipeline for the static parts of the scene.
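This decoupling can be sketched as a driver loop with a dirty flag. The names below (`FrameCounters`, `renderFrame`) are hypothetical stand-ins, with counters taking the place of the real `glBuildComponentGroupIMG()` and `glDispatchRaysIMG()` calls:

```cpp
#include <cassert>

// Toy sketch of the decoupling: hierarchy builds only happen when world-space
// geometry changes, while ray traversal is dispatched every frame.
struct FrameCounters
{
    int builds = 0;     // stands in for glBuildComponentGroupIMG()
    int dispatches = 0; // stands in for glDispatchRaysIMG()
};

inline void renderFrame(bool sceneGeometryChanged, FrameCounters& c)
{
    if (sceneGeometryChanged)
        ++c.builds;  // rebuild the scene hierarchy (SHG work)
    ++c.dispatches;  // traverse rays for the new camera view (RTU work)
}
```

For a static scene with a moving camera, the build runs once while the dispatch runs every frame, which is exactly the saving described above.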

Varyings and other features of OpenGL ES work as expected:

```glsl
layout(std140, binding=0) uniform UboPerObjectIn
{
    highp mat4 worldFromModel;
    highp mat4 worldFromModelIT;
} UboPerObject[2];

out gl_PerVertex
{
    vec4 gl_Position;
    float gl_PointSize;
};

out PerVertexData
{
    vec3 vertexNormal;
};

void main()
{
    // Output in world space
    gl_Position = UboPerObject[gl_BuildIDIMG].worldFromModel * vec4(inVertex, 1.0);
    vertexNormal = (UboPerObject[gl_BuildIDIMG].worldFromModelIT * vec4(inNormal, 0.0)).xyz;
}
```

(gl_BuildIDIMG explained below)

However, because we initiate vertex processing into world space differently from a rasteriser, we need a different API to start this part of the pipeline…

Scene construction

Vertices and varyings are processed to produce world-space positions, which are then used to generate a scene hierarchy acceleration structure. This acceleration structure is required for the ray traversal stage of the pipeline. Computing it is a relatively heavyweight operation, which is why we have the Scene Hierarchy Generator block in hardware – so we can generate the scene hierarchy acceleration structure efficiently. We have designed the API so that developers can segment their scene into components, component groups, scenes and scene arrays.

This split enables developers to use the hardware efficiently.

When we issue a build command this will execute the vertex shaders for the triangles in the part of the scene we have specified. An acceleration structure will then be computed using the output from the vertex shaders for that part of the scene.

Components

Components are linked to a vertex array object. They are defined similarly to calling glDraw*(). However instead of executing immediately, the call is recorded and deferred until we issue a build command on a component group containing that component.

We can also set other attributes on this component for interaction with the ray traversal phase of the pipeline:

Front face (clockwise or counter clockwise)

If this piece of geometry is an occluder

Whether this piece of geometry is visible to a certain ray type

The visible faces (front and/or back)

An example of creating a component and setting some attributes:

```cpp
glGenVertexArrays(1, &vertexArray);
// ... upload index data and model space vertex data as usual
glCreateComponentsIMG(1, &componentHandle);
glComponentVertexArrayIMG(componentHandle, vertexArray);
glComponentIndexedGeometryIMG(componentHandle, GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, 0, 0);
glComponentBufferRangeIMG(componentHandle, GL_UNIFORM_BUFFER, 0, transformUBOHandle, 0, transformUBOSize);
glComponentVisibleFaceIMG(componentHandle, GL_FRONT_AND_BACK);
glComponentOccluderIMG(componentHandle, GL_TRUE);
```

Note that nothing is being executed on the GPU yet. We are only specifying the data for the component at this point. It is also worth noting that the vertex buffer needs to stay alive for the same duration as the component that references it – it is not copied anywhere.

Component groups

Component groups are the smallest divisible unit of processing that the build or merge commands can consume.

```cpp
glCreateComponentGroupsIMG(1, &componentGroupHandle);
componentGroupHandle = glGetComponentGroupHandleIMG(componentGroupHandle);
glComponentGroupExtentIMG(componentGroupHandle, &extentMin, &extentMax); // Set the extents
```

We also specify the min/max extents of the geometry. After we have created our component groups, we can build them out of the components we have created earlier. This is where we associate the vertex shader and ray shader with the component. We are defining the material of the component at this point. The build command initiates the SHG part of the hardware:

```cpp
componentHandle = glGetComponentProgramHandleIMG(componentHandle, vertexAndRayShaderProgram);
std::vector<GLuint64> components = {{componentHandle}};
glBuildComponentGroupIMG(0, componentGroupHandle, components.size(), components.data());
// Create a fence sync for the scene generation so that we know when it's ready
sceneHierarchyGenerationSync = glFenceSync(GL_SYNC_SHG_COMMANDS_COMPLETE_IMG, 0);
```

We can also issue a merge command, which takes two component groups and merges them together. This gives the developer the option of rebuilding certain parts of the scene at different rates. Building a subset of your scene and merging it with an already built subset should be cheaper than doing a full rebuild, although there are trade-offs in the quality of the hierarchy.

An example might be a car moving along the ground. We can build the ground component group at the start of the application and never change it; we would then rebuild the car component group as it moves, each frame if necessary.

```cpp
std::vector<GLuint64> componentGroups = {{...}};
glMergeComponentGroupsIMG(mergedComponentGroupResultHandle, componentGroups.size(), componentGroups.data(), GL_DONT_CARE);
sceneHierarchyGenerationSync = glFenceSync(GL_SYNC_SHG_COMMANDS_COMPLETE_IMG, 0);
```

The GL fences in the code above synchronise access to memory that the hardware is using. They also make it possible to multi-buffer operations. We could wait on the fence immediately after the call to merge/build; this would block until the hardware is done with the commands and could be a reasonable way of handling scene hierarchy building at the start of an application. However, it introduces a CPU-side stall, which would not be good at runtime. Instead we can multi-buffer the merge/build commands so that while the GPU hardware is reading/writing one set of objects, we set up our second set of objects on the CPU. This is made easier by the first argument to the glBuildComponentGroupIMG() command. This index is passed to the vertex shader through the GLSL built-in gl_BuildIDIMG and, in the example above, helps us multi-buffer the per-object data UBO. This means we could move this component group in world space without stalling the CPU.
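The multi-buffering pattern can be sketched on the CPU side. The `DoubleBufferedUbo` type below is a hypothetical illustration: the build index (which the shader sees as `gl_BuildIDIMG`) selects which UBO slot the hardware reads, leaving the other slot free for CPU writes, and `flip()` models advancing the first argument of `glBuildComponentGroupIMG()` for the next build:

```cpp
#include <cassert>
#include <cstdint>

// Toy sketch: two copies of the per-object transform data. While the SHG
// reads the slot selected by buildId, the CPU safely writes the other slot.
struct DoubleBufferedUbo
{
    float worldFromModel[2][16]; // two copies of the per-object matrix
    std::uint32_t buildId = 0;   // slot the GPU will read this build

    float*       cpuWriteSlot()      { return worldFromModel[1u - buildId]; }
    const float* gpuReadSlot() const { return worldFromModel[buildId]; }
    void         flip()              { buildId = 1u - buildId; } // after issuing the build
};
```

The invariant worth noting is that the CPU write slot and GPU read slot never alias, so updating the transform never races with an in-flight build.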

Scene arrays

A scene array contains a set of one or more component groups that are traversable within a single dispatch. We need to specify the size of each ray type for each scene array we create; this is done with glSceneArrayRayBlockSizeIMG.

```cpp
GLuint sceneArray = glCreateSceneArrayIMG();
glSceneArrayRayBlockSizeIMG(sceneArray, 0, 0);  // Primary rays
glSceneArrayRayBlockSizeIMG(sceneArray, 1, 12); // Shadow rays
glBindSceneArrayComponentGroupIMG(sceneArray, 0, componentGroupHandle);
```

Multiple component groups in a single scene array could be useful for geometry level of detail (LOD), as in rasterisation. For example, if we were ray tracing a scene with complex geometry we might want multiple LODs of that scene to help speed up the intersection tests. Because we can switch to a different scene in our shaders, we could use a test such as distance to switch to the lower-LOD scene.
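The shader-side decision can be sketched as a simple threshold lookup. The `lodSlot` helper below is hypothetical; the slot index it returns corresponds to the index passed to `glBindSceneArrayComponentGroupIMG()` for each detail level:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy sketch of distance-based LOD selection: pick the component group slot
// a secondary ray should traverse. `thresholds` must be sorted ascending;
// the slot past the last threshold is the lowest-detail catch-all.
inline std::size_t lodSlot(float hitDistance, const std::vector<float>& thresholds)
{
    for (std::size_t i = 0; i < thresholds.size(); ++i)
        if (hitDistance < thresholds[i])
            return i; // close enough for this detail level
    return thresholds.size(); // beyond all thresholds: lowest detail
}
```

In the real API this test would live in the ray shader, writing the chosen slot into the emitted ray's gl_SceneIMG.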

A simpler, more intuitive example is a scene containing a virtual monitor and a CCTV camera. When a ray hits the monitor, we can switch to the CCTV scene and initiate the ray from the camera’s point of view.

```cpp
glBindSceneArrayComponentGroupIMG(sceneArray, 0, monitorComponentGroup);
glBindSceneArrayComponentGroupIMG(sceneArray, 1, cameraViewComponentGroup);
```

Multiple scene arrays are useful when we have multiple types of scenes – for example, if we wanted to ray trace the visuals for a game level at the same time as ray tracing the paths of sounds through the level. The sound scene would need different information stored in the rays and the geometry compared to the visual scene. We could separate these details into different scene arrays:

```cpp
glBindSceneArrayIMG(sceneArrayVisuals);
glDispatchRaysIMG(0, 0, 0, frame.w, frame.h, GL_NO_WAIT_BIT_IMG);
glBindSceneArrayIMG(sceneArrayAudio);
glDispatchRaysIMG(0, 0, 0, 1, numSounds * 2, GL_NO_WAIT_BIT_IMG);
```

Alternatively we could use a single scene array and multiple ray types.

We call the currently bound scene array the scene, so that references to “the scene” in our articles make sense.

Bindless

When using textures in regular OpenGL, we first call glActiveTexture, then glBindTexture, and we only have a limited set of binding points. This works at the single-object level: for the next object we can call these functions again and change its textures without worrying about the textures for other objects.

With ray tracing we need to know the textures for all of our objects and shaders when we start ray tracing. We cannot change the bindings during ray traversal, and we do not want to be limited to a small number of textures. So we need another mechanism to bind these textures in our objects’ shaders. This is where our new extension IMG_bindless_texture comes in. Instead of the above GL calls to bind textures, we have the following:

```cpp
GLuint64 textureHandle = glGetTextureHandleIMG(textureObject);
// Similar to glUniform
glProgramUniformHandleui64IMG(rayProgramHandle, samplerLocationInShader, textureHandle);

// Now in GLSL we can access the texture:
// layout(bindless_sampler) uniform sampler2D sTexture;
```

Ray traversal

The ray tracing pipeline

The ray traversal part of the pipeline is initiated by a call to glDispatchRaysIMG. We need to define a maximum number of ray bounces with glRayBounceLimitIMG. There are a few other things we require that help the hardware execute efficiently: glProgramMaxRayEmitsIMG limits the number of rays that a program can emit. We would want to set this to 1 on our shadow ray type if, for example, we were doing hard shadows with one light. This tells the hardware that when we are executing a ray shader, at most one shadow ray can be emitted.

Ray types

A default ray in GLSL will have the following implicit built-in variables:

```glsl
raytypeIMG
{
    highp vec3          gl_OriginIMG;
    highp vec3          gl_DirectionIMG;
    highp rayprogramIMG gl_PrefixRayProgramIMG;
    lowp uint           gl_SceneIMG;
    highp float         gl_MaxDistanceIMG;
    mediump ivec2       gl_PixelIMG;
    mediump uint        gl_BounceCountIMG;
    bool                gl_IsOutgoingIMG;
    bool                gl_FlipFacingIMG;
    bool                gl_RunPrefixProgramIMG;
};
```

We can then define our own ray types by adding user-defined variables to the default ones. For example, a ray that only performs shadowing could look like this:

```glsl
layout(binding = 0, occlusion_test_always) raytypeIMG ShadowRay
{
    vec3 colour;
};
```

With calls to glSceneArrayRayBlockSizeIMG we have defined each of our ray types that we can use. With calls to glGetComponentProgramHandleIMG we have defined what vertex and ray shader each component group should execute. We create a ray shader just like a standard vertex or fragment shader.

Frame shaders

The first part of the pipeline to execute is the frame shader. Frame shaders are GLSL shaders that initiate zero or more ray emissions into the scene for each width × height invocation requested by the parameters of glDispatchRaysIMG. Note that the current frame coordinate in the frame shader does not need to be used as the accumulate location for the rays, i.e. rays are not coupled to pixel locations.

```glsl
layout(rgba8, binding = 0) uniform accumulateonly highp image2D rayTraceDiffuseImage;
layout(max_rays = 1) out;

out ShadowRay shadowRay;
uniform rayprogramIMG defaultRayProgram;

void emitShadowRay(highp vec3 p, highp vec3 normal, highp vec3 dir, highp float maxDistance, vec3 colour)
{
    shadowRay.gl_OriginIMG = p + depthModifier * normal;
    shadowRay.gl_DirectionIMG = dir;
    shadowRay.gl_PrefixRayProgramIMG = gl_NullRayProgramIMG;
    shadowRay.gl_SceneIMG = uint(gl_DispatchRaysIDIMG);
    shadowRay.gl_MaxDistanceIMG = maxDistance;
    shadowRay.gl_PixelIMG = gl_FrameCoordIMG;
    shadowRay.gl_BounceCountIMG = 0u;
    shadowRay.gl_IsOutgoingIMG = true;
    shadowRay.gl_FlipFacingIMG = false;
    shadowRay.gl_RunPrefixProgramIMG = false;
    shadowRay.colour = colour;
    emitRayIMG(shadowRay, defaultRayProgram);
}

void main()
{
    emitShadowRay(vPosition, unpackedNormal, vNormalisedDirectionToLight, length(vDirectionToLight), vec3(1.0, 0.0, 0.0));
    imageAddIMG(rayTraceDiffuseImage, gl_FrameCoordIMG, vec4(0.0, 0.0, 1.0, 0.0));
}
```

We can see that there are a few additions to GLSL for the frame shader. First the builtins:

gl_DispatchRaysIDIMG – This index comes from the first argument to glDispatchRaysIMG to help with multi-buffering.

gl_FrameCoordIMG – This is the coordinate in the current frame shader.

gl_NullRayProgramIMG – This is a no-op program and is useful for comparisons against rayprogramIMGs.

Our shadow ray has all of the implicit raytype variables still available as above and each performs a certain function in the pipeline. In the frame shader we will usually write to these variables as opposed to reading from them.

gl_OriginIMG – The origin of the ray that we want to emit.

gl_DirectionIMG – The direction of the ray that we want to emit.

gl_PrefixRayProgramIMG – An optional prefix ray program to run before we run the intersection ray shader.

gl_SceneIMG – The scene id that we want to emit into. (Specified by the index in glBindSceneArrayComponentGroupIMG in the example above.)

gl_MaxDistanceIMG – The maximum distance a ray can travel before we stop the ray, ignore intersections and run the defaultRayProgram instead.

gl_PixelIMG – The originating pixel this ray was emitted from.

gl_BounceCountIMG – The current bounce count of the ray. (Usually set to zero in a frame shader.)

gl_IsOutgoingIMG – Whether the ray is “outgoing” – this will require a more in-depth article to explain.

gl_FlipFacingIMG – Whether to flip the face that we test the ray against in the next intersection.

gl_RunPrefixProgramIMG – Whether to run the prefix ray program above.

emitRayIMG is the GLSL function that emits the ray and eventually passes it on to the intersection test hardware; imageAddIMG is the accumulate function described below.

shadeRayIMG is also available. This function shades a ray with a given ray program without intersection testing.

Ray shaders

Ray shaders are executed when a ray intersects a triangle, when a ray reaches its maximum distance, or when we want to run a prefix program. Whereas in the frame shader we were only writing to the ray variables, in the ray shader we read from them, and write to them if we want to emit more rays.

```glsl
layout(binding=0, occlusion_test_always) raytypeIMG ShadowRay
{
    highp vec3 diffuseObjectColor;
    highp vec3 ambientObjectColor;
};

layout(binding=1, occlusion_test_never) raytypeIMG ReflectiveRay
{
    highp vec3 reflectiveColor;
};

layout(rgba8, binding=2) uniform accumulateonly highp image2D reflectionOutput;

in perVertexData
{
    highp vec3 vertexNormal;
    highp vec2 vertexTexCoord;
} vertexData[];

rayInputHandlerIMG(ShadowRay inputRay)
{
    void main()
    {
        imageAddIMG(reflectionOutput, inputRay.gl_PixelIMG, vec4(inputRay.ambientObjectColor, 0.0));
    }
}

rayInputHandlerIMG(ReflectiveRay inputRay)
{
    layout(max_rays=2) out;
    out ShadowRay reflectedShadowRay;
    out ReflectiveRay reflectionRay;

    void main()
    {
        // We interpolate the varyings ourselves
        highp vec3 intersectionPoint = interpolateAtRayHitIMG(gl_in[0].gl_Position.xyz, gl_in[1].gl_Position.xyz, gl_in[2].gl_Position.xyz);
        highp vec2 intersectionTextureCoord = interpolateAtRayHitIMG(vertexData[0].vertexTexCoord.xy, vertexData[1].vertexTexCoord.xy, vertexData[2].vertexTexCoord.xy);
        highp vec3 vDirectionToLight = lightData.vLightPosition.xyz - intersectionPoint;
        highp vec3 intersectionNormal = interpolateAtRayHitIMG(vertexData[0].vertexNormal.xyz, vertexData[1].vertexNormal.xyz, vertexData[2].vertexNormal.xyz);
        highp vec3 vNormalisedNormal = normalize(intersectionNormal);
        highp vec4 reflectionTexture = texture(sTexture, intersectionTextureCoord);

        if (aboveReflectionThreshold && inputRay.gl_BounceCountIMG < NUMBER_OF_REFLECTION_RAYS)
        {
            reflectionRay.gl_DirectionIMG = reflectionDirection;
            reflectionRay.gl_OriginIMG = intersectionPoint + reflectionDirectionOffset * reflectionRay.gl_DirectionIMG;
            reflectionRay.gl_MaxDistanceIMG = inputRay.gl_MaxDistanceIMG;
            reflectionRay.gl_SceneIMG = inputRay.gl_SceneIMG;
            reflectionRay.gl_PixelIMG = inputRay.gl_PixelIMG;
            reflectionRay.gl_BounceCountIMG = inputRay.gl_BounceCountIMG + 1u;
            reflectionRay.gl_FlipFacingIMG = // ... set the rest of the ray values
            reflectionRay.reflectiveColor = reflectedObjectColor;
            emitRayIMG(reflectionRay, environmentRayProgram);
        }
        else
        {
            // ...
            imageAddIMG(reflectionOutput, inputRay.gl_PixelIMG, vec4(environmentAccumulationColor, 0.0));
        }
    }
}
```

This shader has a few differences from a normal OpenGL ES shader. Firstly, there are multiple main() entry points, because we have multiple ray types. When a ray intersects some geometry, the entry point for the matching ray type is executed: when a ShadowRay hits a piece of geometry with this ray shader attached, it runs the first main(); when a ReflectiveRay hits, it runs the second main(). The type of a ray is implicitly assigned from its raytypeIMG declaration. In this example there are two ray types: ShadowRay and ReflectiveRay. We can also see that the second main() can emit rays – “layout(max_rays=2) out;” – and the following lines declare the types of ray it can emit, so this main() can emit both types of ray. Looking at the first main(), we can see that this piece of geometry will never emit any more rays when a shadow ray intersects it.

The next thing to notice is perVertexData. This is the varying data that we wrote out in our vertex shader; it has been stored in main memory and is retrieved when we run a ray shader. We have the varyings for each vertex of the triangle, and we use the interpolateAtRayHitIMG function to perform a barycentric interpolation on the varying data. This is different from a rasteriser, where we rely on something out of our control to do the interpolation. With this API we have control over the interpolation, so we could perform a different type of interpolation if we wanted.
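The barycentric blend itself is simple to sketch on the CPU. The names below (`Vec3`, `interpolateAtHit`) are hypothetical; note that the real built-in derives the barycentric weights from the hit point, whereas here we pass u and v in directly:

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

// CPU sketch of the barycentric blend interpolateAtRayHitIMG performs for a
// vec3 varying: u and v weight vertices 1 and 2, vertex 0 gets 1 - u - v.
inline Vec3 interpolateAtHit(const Vec3& v0, const Vec3& v1, const Vec3& v2,
                             float u, float v)
{
    const float w = 1.0f - u - v;
    return { w * v0.x + u * v1.x + v * v2.x,
             w * v0.y + u * v1.y + v * v2.y,
             w * v0.z + u * v1.z + v * v2.z };
}
```

Because the shader calls the blend explicitly, it could just as easily substitute flat shading (take vertex 0's value) or any other weighting scheme.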

We manually increase the bounce count of the ray before we emit again. This is to ensure that we do not encounter an infinite loop.
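The termination argument can be made concrete with a toy recursion. The `remainingBounces` helper is hypothetical; it mirrors the `gl_BounceCountIMG < NUMBER_OF_REFLECTION_RAYS` check in the ray shader above:

```cpp
#include <cassert>

// Toy recursion showing why the bounce count must be incremented before each
// re-emit: the count climbs monotonically toward the limit, so the chain of
// ray shader invocations is guaranteed to terminate.
inline int remainingBounces(unsigned bounceCount, unsigned maxBounces)
{
    if (bounceCount >= maxBounces)
        return 0; // absorb: no further rays emitted
    // emit one more ray carrying bounceCount + 1
    return 1 + remainingBounces(bounceCount + 1u, maxBounces);
}
```

Without the increment, the recursion never reaches the base case: exactly the infinite loop the paragraph warns about.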

We can also call imageAddIMG in a ray shader. The pixel accumulated to is specified by the second argument. Here we take this pixel address from the input ray, but this is not necessary – it could come from anywhere.

Occluder rays

When we set occlusion_test_always on our ray type, we make this ray act differently to other rays. This is an optimisation for rays that only need to test whether an intersection occurs. If the ray intersects any geometry that is an occluder (see glComponentOccluderIMG), the ray is dropped and no further shading is done. If the ray reaches its maximum distance, we can still run a ray shader. This is useful for shadows, where we want to test whether a ray is occluded by any geometry. Occlusion-test rays have a fast path in the hardware because we know this will be a useful mechanism for developers.
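The saving can be sketched as the difference between an any-hit and a closest-hit query. The `isOccluded` helper below is a hypothetical CPU illustration, with `hitDistances` standing in for intersection distances against occluder geometry along the ray:

```cpp
#include <cassert>
#include <vector>

// Toy sketch of the occlusion_test_always fast path: an occlusion ray can
// stop at the FIRST occluder hit within range, whereas a normal ray must
// keep searching for the CLOSEST hit.
inline bool isOccluded(const std::vector<float>& hitDistances, float maxDistance)
{
    for (float t : hitDistances)
        if (t > 0.0f && t < maxDistance)
            return true; // any hit suffices: drop the ray, skip shading
    return false; // reached maxDistance: the ray shader would still run
}
```

For a shadow query the answer is binary, so stopping at the first hit avoids both further traversal and any intermediate shading work.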

Prefix programs

A prefix program is a ray shader that executes before an intersection ray shader executes. We can attach a prefix program to a ray from another ray shader. We envision this being used for distance-based effects such as cloud or water rendering, where we want to know the distance the ray has travelled through a medium. For example, in the image below we know the distance between where we enter the cloud and where we intersect another object; we can run our prefix shader to calculate some shading for the cloud before we run the shader for the object we intersect.

We can envisage more uses for this feature and have articles/blog posts planned on prefix programs and per ray distance culling.

Hybrid rendering

We have found that ray tracing and the PowerVR tile-based deferred rendering hardware work well together. With the pixel local storage (PLS) extension we can render a scene using the rasteriser, write this information to a gbuffer that lives in local memory, and then issue ray tracing commands on that gbuffer. We have used this in many techniques so far, including soft shadows and lighting. Using local memory means we save on memory bandwidth when writing and reading our gbuffer. More information about this is coming soon.

SDK demos

Our SDK team are working on examples and demos with source code along with helper functions to make creation of a ray tracing application easier.

One of our ray tracing SDK demos showing soft shadows

Performance

There is a lot to write about ray tracing performance, but that will need to wait for a later blog post. Our PowerVR SDK performance analysis tool, PVRTune, already supports reading the performance counters from the ray tracing hardware. The two new hardware blocks are visible in the screenshot below.

Ray tracing counters in the PowerVR SDK PVRTune

More information

The extension specification is currently available under NDA; in the near future we will make it public. Our hardware currently exists as a PCIe test chip that we have demonstrated at GDC.