3DMark Port Royal uses a custom engine developed in-house with input from UL Benchmark Development Program members including AMD, Intel, and NVIDIA. We worked especially closely with Microsoft to create a first-class implementation of the DirectX Raytracing API.

Port Royal improves on the Time Spy/Night Raid rendering engine by implementing new effects and integrating DirectX Raytracing.

CPU side

Since Port Royal does not have a CPU test, the main role of the CPU in the test is to compose command lists for the GPU to execute. The task system allows heavily parallel execution. The rendering work, including scene update, visibility evaluation and command list building, is spread across multiple CPU threads, with one thread per available logical CPU core. This shortens the CPU rendering time and reduces the chance of the CPU becoming a bottleneck.
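The one-worker-per-logical-core pattern can be sketched as follows. This is a minimal illustration in Python, not the engine's task system: `build_command_list` and `build_frame` are hypothetical names, the string "commands" stand in for real GPU command recording, and Python threads show only the structure of the work split, not its performance.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def build_command_list(draw_items):
    # Hypothetical stand-in for recording GPU commands for one chunk of the scene.
    return [f"draw {item}" for item in draw_items]

def build_frame(draw_items, worker_count=None):
    # Default to one worker per logical CPU core, as the text describes.
    worker_count = worker_count or os.cpu_count() or 1
    # Split the work into one contiguous chunk per worker (ceiling division).
    chunk = max(1, -(-len(draw_items) // worker_count))
    chunks = [draw_items[i:i + chunk] for i in range(0, len(draw_items), chunk)]
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # map() returns results in chunk order, so submission order is deterministic.
        return list(pool.map(build_command_list, chunks))
```

Each worker produces an independent command list, so no locking is needed while recording; the lists are only gathered, in order, once all workers have finished.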

GPU side

The GPU side of the rendering is composed of multiple rasterization, compute, and ray tracing passes. Some passes run in an asynchronous compute queue.

The engine supports multi-GPU in the form of alternate frame rendering for linked-node setups (homogeneous adapters).

Rendering passes

The image below shows the high-level construction of a typical frame in the Port Royal benchmark. Tasks are color-coded by work type. Arrows show task relationships. The position of each task indicates the queue in which the task is executed.

Shadow map draw

Frustum lights can be shadowed. For each shadowed light, a shadow map is allocated for each frame based on heuristics that determine which resolution is required. The maximum resolution is 1k. The shadow map is used for surface illumination and for generating light shafts for volume illumination.

Shadows are sampled in illumination, cube rendering, and ray tracing shaders.

G-buffer draw

Opaque objects are drawn into the G-buffer in two passes, separating luminous and non-luminous geometries.

The material system uses physically-based materials. The system supports the following material textures: Albedo (RGB) + metalness (A), Normal (RG) + Roughness (B) + Height (A), Luminance, Blend, Opacity, and Light Map. A material might not use all these textures.

The G-buffer is composed of five textures: depth, albedo + metalness, normal + roughness, luminance, and motion vectors for TAA.

Volume illumination

Volume illumination is computed using the tessellated light volume approach. The volume mesh of a light is computed by extruding the shadow map of a light, using the tessellation pipeline and adaptation heuristics to reduce the amount of mesh data. The fragment shader then computes the volumetric illumination using additive blending to sum up the airlight integral for the view ray corresponding to the pixel.

This is only used for frustum lights, and each fragment computes its own contribution to the airlight integral numerically to include the influence of the radial mask and attenuation of the light. The algorithm is explained in this paper.

This pass is the only pass that uses tessellation in this benchmark. The normal geometry pipeline does not use tessellation.
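The numerical evaluation of the airlight integral along one view ray segment can be sketched as below. This is a simplified illustration under stated assumptions: a midpoint-rule integration, inverse-square attenuation, and a caller-supplied `radial_mask` function are all assumptions, and extinction along the view ray and the phase function are ignored. The function name and parameters are hypothetical.

```python
def airlight(ray_origin, ray_dir, t_enter, t_exit, light_pos,
             intensity, radial_mask, steps=16):
    # Integrate in-scattered light along the view ray between the points
    # where it enters and leaves the light volume (midpoint rule).
    dt = (t_exit - t_enter) / steps
    total = 0.0
    for i in range(steps):
        t = t_enter + (i + 0.5) * dt
        point = tuple(o + d * t for o, d in zip(ray_origin, ray_dir))
        dist_sq = sum((p - l) ** 2 for p, l in zip(point, light_pos))
        # Assumed inverse-square attenuation, modulated by the light's radial mask.
        total += intensity * radial_mask(point) / max(dist_sq, 1e-6) * dt
    return total
```

For a ray passing at unit distance from a light with a constant mask, the result approaches the analytic value of the integral of 1/(1 + t^2), which gives a quick sanity check of the integration.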

Cubemap update

Cubemaps are used to cache the radiance for both perspective-correct and traditional specular reflections for static geometries.

Illumination

Cubemaps include static geometries that are drawn once into a G-buffer. The illumination of the cubes is updated dynamically every frame in a compute pass similar to the normal surface illumination pass. The lights are queried from the world-space clusters in the illumination pass. The view ray direction is set to the main camera view direction to match the specular highlights to the screen-space illumination texture that is also sampled for reflections.

Filtering

The mip levels of the cubemaps are calculated in compute passes by averaging each 2x2 quad of the previous, higher-resolution mip level. Each cube face is halved in this way until the highest possible mip level is computed.
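The quad-averaging step can be sketched for a single cube face as follows. This is a minimal CPU illustration rather than the compute shader itself: it assumes a square face with a power-of-two side length and uses one scalar per texel for brevity. The function names are hypothetical.

```python
def downsample(face):
    # Average each 2x2 quad of the finer level into one texel of the next level.
    size = len(face) // 2
    return [[(face[2 * y][2 * x] + face[2 * y][2 * x + 1] +
              face[2 * y + 1][2 * x] + face[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(size)]
            for y in range(size)]

def build_mip_chain(face):
    # face: square list of rows with a power-of-two side length.
    chain = [face]
    while len(chain[-1]) > 1:
        chain.append(downsample(chain[-1]))
    return chain
```

A 4x4 face therefore yields three levels (4x4, 2x2, 1x1), with the last level holding the average of the whole face.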

Transparent geometry draw

For rendering transparent geometries we use a variant of an order-independent transparency technique called Order-Independent Transparency Approximation with Raster Order Views. Simply put, transparent geometry is rendered and a per-pixel visibility function (accumulated transparency) is approximated by merging pixels into the compressed function. Then the transparent geometry is re-rendered, illuminated and additively blended according to the visibility function.

Ambient occlusion

Ambient occlusion uses an adaptive screen-space technique. It is computed using a group of compute shader passes.

Lighting

All frustum lights, omni lights, reflection probes, and decals are clustered in world space into a uniform grid. This is done on the CPU side and then transferred to the GPU in advance.
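Clustering into a uniform world-space grid can be sketched as below. This is an illustration only: it assumes each light is approximated by a bounding sphere and binned conservatively by its axis-aligned bounds, which is not necessarily how the engine bounds frustum lights, probes or decals. All names and parameters are hypothetical.

```python
import math

def cluster_lights(lights, grid_min, cell_size, grid_dims):
    # lights: list of (center_xyz, radius) bounding spheres (an assumption).
    clusters = {}
    for index, (center, radius) in enumerate(lights):
        lo, hi = [], []
        for axis in range(3):
            first = math.floor((center[axis] - radius - grid_min[axis]) / cell_size)
            last = math.floor((center[axis] + radius - grid_min[axis]) / cell_size)
            # Clamp the covered cell range to the grid.
            lo.append(max(first, 0))
            hi.append(min(last, grid_dims[axis] - 1))
        for x in range(lo[0], hi[0] + 1):
            for y in range(lo[1], hi[1] + 1):
                for z in range(lo[2], hi[2] + 1):
                    clusters.setdefault((x, y, z), []).append(index)
    return clusters

def lights_at(clusters, position, grid_min, cell_size):
    # Look up the light indices stored for the cell containing a world position.
    cell = tuple(math.floor((position[a] - grid_min[a]) / cell_size) for a in range(3))
    return clusters.get(cell, [])
```

A shading pass then only needs to evaluate the lights returned by `lights_at` for the cell that contains the shaded point, instead of every light in the scene.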

The main camera lighting is evaluated using a tiled method in multiple, separate passes. Dynamic light evaluation is split into shadowed and unshadowed parts and computed separately.

Before the main illumination passes, asynchronous compute shaders are used to compute screen-space ambient occlusion and calculate unshadowed surface illumination. These tasks are started right after G-buffer rendering has finished and are executed alongside shadow and environment rendering. Ambient occlusion and unshadowed illumination results are written out to their respective targets.

Reflections

Reflection rendering is a combination of multiple rendering passes containing cube rendering, ray tracing of reflection rays, reflection sampling of the cubes, and filtering of the reflection result. The illumination of the reflection cubes is updated in a compute pass as explained earlier.

We cast rays in the importance-sampled direction for each screen-space pixel that is over a roughness threshold. The resulting hit points (or a single hit point if only one ray is used per pixel) are stored and the results are used to sample reflections from the environment maps. For pixels that are not visible from the cubes, or for mirror-like surfaces that require pixel-perfect reflections, we compute the reflection separately to render a correct reflection (re-shade). For reflections of glass, we always run the full shading.

Reflect

The reflect pass uses the ray tracing pipeline to generate a reflection ray for each pixel above a predetermined roughness threshold. The direction is importance sampled according to the same specular BRDF as used in direct illumination. The hit shader writes the ray length, instance ID, primitive index and barycentric coordinates of the hit. The ray generation shader then stores these into textures.
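Importance sampling a reflection direction from a specular BRDF can be sketched as follows. The text does not name the BRDF, so this example assumes a GGX normal distribution with alpha equal to roughness squared; both are assumptions, as are the function names. Directions are expressed in tangent space with the surface normal along +Z.

```python
import math

def sample_ggx_half_vector(roughness, u1, u2):
    # u1, u2: uniform random numbers in [0, 1).
    # Assumed convention: alpha = roughness^2.
    alpha = roughness * roughness
    cos_theta = math.sqrt((1.0 - u1) / (1.0 + (alpha * alpha - 1.0) * u1))
    sin_theta = math.sqrt(max(0.0, 1.0 - cos_theta * cos_theta))
    phi = 2.0 * math.pi * u2
    return (sin_theta * math.cos(phi), sin_theta * math.sin(phi), cos_theta)

def reflect(view, half):
    # view points away from the surface; the result is view mirrored about half.
    d = sum(v * h for v, h in zip(view, half))
    return tuple(2.0 * d * h - v for v, h in zip(view, half))
```

Because the same random numbers reproduce the same direction, a later pass can rebuild the ray direction from its seed and only needs the stored ray length to recover the hit point.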

Cube sampling

The reflection cubes are used for glossy reflections to find the radiance of a ray intersection generated by the ray tracing pass. The intersection point is reconstructed using the same importance sampling routine as in the reflect pass and reading the ray length stored by the reflect pass. This position is then projected into the reflection cubes. The world space position of the projected point in each cube is tested to determine if it corresponds to the same intersection point, or if it was occluded by another geometry.

If the point is not found in any of the cubes or in the screen-space illumination texture, the instance ID, primitive index and barycentric coordinates stored by the reflect pass are read and used to recompute the radiance for the given ray.

If the roughness of the surface is above a certain threshold, the cube sampling is skipped since the resolution is generally not enough for sharp reflections.
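The lookup-with-fallback logic of the cube sampling pass can be sketched as below. This is a simplified illustration: the projection into each cube is assumed to have already happened, so `cube_lookups` holds the world-space position and radiance fetched at the projected texel of each cube. The distance tolerance, the `reshade` callback and all names are hypothetical.

```python
import math

def reconstruct_hit(origin, direction, ray_length):
    # Rebuild the intersection point from the regenerated ray direction
    # and the ray length stored by the reflect pass.
    return tuple(o + d * ray_length for o, d in zip(origin, direction))

def resolve_reflection(hit, cube_lookups, reshade, tolerance=0.05):
    # cube_lookups: list of (stored_world_position, stored_radiance), one per cube.
    for stored_position, radiance in cube_lookups:
        # The cube texel is valid only if it shows the same surface point;
        # a large distance means another geometry occluded the hit point.
        if math.dist(hit, stored_position) <= tolerance:
            return radiance
    # Not found in any cube: recompute the radiance for this ray.
    return reshade()
```

In this sketch `reshade` stands in for the path that reads the stored instance ID, primitive index and barycentric coordinates and shades the hit point directly.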

Filtering

Finally, we execute a spatial-temporal filtering pass for the reflection result and combine it with illumination.

Pre-TAA combine pass

This pass, shown in the GPU task structure as “Combine” inside the “Illumination” block, evaluates the reflection illumination by evaluating a preintegrated specular BRDF and modulating the reflection filtering results with it. The reflection is then added to the surface illumination. Ambient occlusion is also applied here, since it has to be applied before TAA and after the reflection sampling phase, because the screen-space illumination texture is also sampled in that phase.

Decals

The Port Royal engine implements a deferred decal system for increased visual quality and easier scene variation.

Decals are skewed prisms that are applied on top of the rendered G-buffer using a compute shader in the asynchronous compute queue. Decals are clustered similarly to lights to speed up the apply pass. For each pixel, active decals are fetched from the matching cluster and applied on top of the G-buffer using one of the implemented blending modes. Various modes allow changing different attributes in the G-buffer (such as normal only or all channels).

Ray-traced shadows

Ray-traced shadows are implemented in a separate pass running in an asynchronous compute queue. For each fragment, a shadow ray is cast in world space from the fragment towards the light source. The any-hit shader is then used to detect whether the ray is occluded on its way from the fragment to the light.

The output of the ray generation shader is a shadow modulation map, which is a float32 texture filled with values ranging from 0 to 1. The values are generated per-fragment by dividing the energy flux that has reached this fragment by the total energy flux present in the scene (i.e. from all the lights).

One shadow ray is cast from each fragment towards the light source. A post-process filter is not employed for the shadow mask, so the implementation only supports hard-edged shadows.
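The per-fragment shadow modulation value described above can be sketched as a ratio of flux. This is a minimal illustration: the `(light_id, flux)` representation, the `is_occluded` callback (standing in for the single shadow ray per light) and the value returned when no light contributes are all assumptions.

```python
def shadow_modulation(lights, is_occluded):
    # lights: list of (light_id, flux), where flux is the unshadowed energy
    # that the light would deliver to this fragment (hypothetical representation).
    total = sum(flux for _, flux in lights)
    if total <= 0.0:
        # Assumption: with no incoming light there is nothing to darken.
        return 1.0
    # One binary visibility test per light, so shadows are hard-edged.
    reached = sum(flux for light_id, flux in lights if not is_occluded(light_id))
    return reached / total
```

With two lights of flux 3 and 1, blocking only the weaker one gives a modulation value of 0.75; blocking both gives 0.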

Particles

Particles are simulated on the GPU using the asynchronous compute queue. Simulation work is submitted to the asynchronous queue while G-buffer and shadow map rendering commands are submitted to the main command queue.

Particle illumination

Particles are rendered as transparent surfaces with approximated visibility.

Fluids

Fluid simulation is only used in the demo. It does not contribute to the benchmark score.

Simulation

Fluids are simulated on the GPU using the asynchronous compute queue. The simulation is based on the Position Based Fluids method. Radix sort is used in each step to order the fluid particles using the Z-order curve to achieve locality of memory access when calculating interactions. Additionally, spatial hashing is used to accelerate the neighbor search.
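The Z-order sorting and hashed neighbor search can be sketched as follows. This is a CPU illustration only: Python's built-in sort stands in for the GPU radix sort, a dictionary stands in for the spatial hash table, coordinates are assumed to be non-negative, and the search radius is assumed not to exceed the cell size. All function names are hypothetical.

```python
import math

def morton3d(x, y, z, bits=10):
    # Interleave the bits of the three cell coordinates (Z-order curve).
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def cell_of(position, cell_size):
    return tuple(math.floor(c / cell_size) for c in position)

def sort_by_morton(positions, cell_size):
    # Particle indices ordered along the Z-order curve of their cells.
    return sorted(range(len(positions)),
                  key=lambda i: morton3d(*cell_of(positions[i], cell_size)))

def build_hash(positions, cell_size):
    table = {}
    for i, p in enumerate(positions):
        table.setdefault(cell_of(p, cell_size), []).append(i)
    return table

def neighbors(i, positions, table, cell_size, radius):
    # Only the 27 cells around the particle need to be inspected.
    cx, cy, cz = cell_of(positions[i], cell_size)
    found = []
    offsets = (-1, 0, 1)
    for dx in offsets:
        for dy in offsets:
            for dz in offsets:
                for j in table.get((cx + dx, cy + dy, cz + dz), ()):
                    if j != i and math.dist(positions[i], positions[j]) <= radius:
                        found.append(j)
    return sorted(found)
```

Sorting by Morton code places particles from nearby cells close together in memory, which is the locality benefit the text refers to; the hash table then limits each interaction query to adjacent cells.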

Illumination

The liquid surface is constructed in screen space by splatting ellipsoids, doing most of the computation in the vertex shader, and smoothing the result. The liquid surface is illuminated in a compute pass that runs after the main surface illumination, and the surface illumination result is used to apply approximate screen-space refractions.

Post-processing

Temporal anti-aliasing

Temporal anti-aliasing (TAA) is applied to the surface illumination texture that already has reflections applied. The projection matrix used for the G-buffer is jittered each frame so that the sampled subpixel position varies according to a predetermined pattern. TAA then blends these subpixel-jittered samples together using an exponential average. To fetch a sample from a previous frame, motion vectors written by the G-buffer pass are used. Additionally, variance clipping is used to reduce ghosting.
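The exponential average with variance-based history rejection can be sketched for a single pixel as below. This is a scalar simplification under stated assumptions: real implementations work on color vectors, the history value is assumed to have been fetched already using the motion vector, and the blend factor of 0.1 and the `gamma` bound scale are illustrative values, not the benchmark's settings.

```python
import math

def taa_resolve(current, history, neighborhood, blend=0.1, gamma=1.0):
    # neighborhood: current-frame samples around the pixel (for example 3x3).
    n = len(neighborhood)
    mean = sum(neighborhood) / n
    variance = sum(v * v for v in neighborhood) / n - mean * mean
    sigma = math.sqrt(max(variance, 0.0))
    # Restrict the history sample to the range the current frame supports;
    # history far outside this range is treated as stale (ghosting).
    low, high = mean - gamma * sigma, mean + gamma * sigma
    clipped = min(max(history, low), high)
    # Exponential average: move a fraction of the way towards the new sample.
    return clipped + blend * (current - clipped)
```

When the neighborhood is uniform, the bounds collapse to a single value and any mismatching history is discarded entirely, which is what suppresses ghost trails behind moving objects.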

Post-TAA resolve

This pass applies parts that do not use TAA on top of the illumination resolved by TAA. Since TAA only applies to opaque objects, transparent elements within the scene such as volumetric illumination, particles and transparent meshes are directly resolved in this pass, on top of the TAA results.

Depth of field

The effect is computed by scattering the illumination in the out-of-focus parts of the input image over multiple passes. First, a compute shader computes circle-of-confusion radii from the depth texture and adds splatting primitives to a buffer. Then, these primitives are rendered to textures of various resolutions using the normal rasterization pipeline. Last, the out-of-focus illumination is combined with the original illumination.
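How a confusion radius can be derived from depth is sketched below using the standard thin-lens model. The text does not state which camera model the benchmark uses, so the thin-lens formula and the parameter names are assumptions made for illustration.

```python
def coc_radius(depth, focus_distance, focal_length, aperture_diameter):
    # All lengths share one unit (for example metres). Both depth and
    # focus_distance are assumed to be larger than focal_length.
    diameter = (aperture_diameter * focal_length * abs(depth - focus_distance)
                / (depth * (focus_distance - focal_length)))
    # The result is a radius on the sensor plane, in the same unit.
    return 0.5 * diameter
```

The radius is zero exactly at the focus distance and grows as a surface moves away from it in either direction, which determines the size of the splatting primitive for that pixel.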

Bloom

Bloom is based on a compute shader FFT that evaluates several effects with one filter kernel. The effects are blur, streaks, anamorphic flare and lenticular halo. 
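The reason a single FFT-domain multiplication can apply an arbitrarily large filter kernel is the convolution theorem, sketched here in one dimension. This is a CPU illustration, not the benchmark's compute shader: it uses a recursive radix-2 FFT, requires power-of-two lengths, and performs circular convolution. The function names are hypothetical.

```python
import cmath

def fft(values, inverse=False):
    # Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two.
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def circular_convolve(signal, kernel):
    # Multiply the spectra, then transform back and normalize by the length.
    n = len(signal)
    spectrum = [a * b for a, b in zip(fft(signal), fft(kernel))]
    return [v.real / n for v in fft(spectrum, inverse=True)]
```

Because the cost of the multiplication does not depend on the kernel's shape, several effects can be folded into one kernel and applied in a single filtering step, as the text describes for blur, streaks, flare and halo.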

Lens reflections

The effect is computed by first applying a filter to the computed illumination in the frequency domain, as in the bloom effect. The filtered result is then splatted in several scales and intensities on top of the input image using additive blending. The effect is computed in the same resolution as the bloom effect, so the forward FFT needs to be performed only once for both effects. The filtering and inverse FFT are performed using compute shaders and floating-point textures.

Tone mapping

Tone mapping is executed as the last pass of the rendering pipeline. It applies various two-dimensional camera effects (such as vignette) to the final texture and controls the tone reproduction.

Dynamic Global Illumination: Ray traced photon mapping

Dynamic Global Illumination is only used in the demo. It does not contribute to the benchmark score.

We have implemented a dynamic global illumination solution using real-time photon mapping. This is a multi-pass algorithm with components of rasterization, ray tracing and compute work. The main passes of the algorithm are sample generation from reflective shadow maps, photon tracing, photon splatting and irradiance filtering.