3DMark Night Raid uses a DirectX 12 graphics engine that is optimized for integrated graphics hardware. The engine was developed in-house with input from members of the UL Benchmark Development Program.
The rendering, including scene update, visibility evaluation, and command list building, is done with multiple CPU threads using one thread per available logical CPU core. This distributes the CPU load across the available cores and shortens the time spent preparing each frame.
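The per-core command list build can be sketched as follows. `DrawCommand` and the chunked object split are illustrative stand-ins: the real engine records into `ID3D12GraphicsCommandList` objects rather than into vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded draw command; the engine
// records into ID3D12GraphicsCommandList objects instead.
struct DrawCommand { std::size_t objectIndex; };

// Build one command list per worker thread: each thread records the
// draw commands for its contiguous slice of visible objects, and the
// per-thread lists are later submitted together to the GPU queue.
std::vector<std::vector<DrawCommand>> buildCommandLists(std::size_t objectCount)
{
    const std::size_t workers =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    std::vector<std::vector<DrawCommand>> lists(workers);
    std::vector<std::thread> threads;

    const std::size_t chunk = (objectCount + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w) {
        threads.emplace_back([&, w] {
            const std::size_t begin = w * chunk;
            const std::size_t end = std::min(objectCount, begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                lists[w].push_back({i});   // "record" a draw for object i
        });
    }
    for (auto& t : threads) t.join();
    return lists;
}
```

Because each thread writes only to its own list, no synchronization is needed beyond the final join.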
The engine implements multi-GPU support using explicit alternate frame rendering on linked-node configurations. Heterogeneous adapters are not supported.
The Umbra occlusion library (version 3.3.17 or newer) is used to accelerate and optimize object visibility evaluation for all cameras, including the main camera and light views used for shadow map rendering. The culling runs on the CPU and does not consume GPU resources.
One descriptor heap is created for each descriptor type when the scene is loaded. Resource binding Tier 1 hardware is sufficient to contain all the required descriptors in the heaps.
Implicit resource heaps are used for most resources. Explicitly created heaps are used for some resources to reduce memory consumption by aliasing resources that are never needed at the same time in the same memory.
Asynchronous compute is used heavily to overlap multiple rendering passes for maximum utilization of the GPU. The async compute workload per frame varies between 10% and 20%. The forward-rendering path uses less async compute as there are fewer compute passes to run alongside the shadow map and G-buffer passes.
The engine supports Phong tessellation and displacement-map-based detail tessellation.
Tessellation factors are adjusted to achieve the desired edge length for the output geometry on the render target (G-buffer, shadow map or other). For shadow maps, edge length is also calculated from the main camera to reduce aliasing due to different tessellation factors between the main camera and shadow map camera.
Additionally, patches that are back-facing and patches that are outside of the view frustum are culled by setting the tessellation factor to zero.
Tessellation is turned entirely off by disabling hull and domain shaders when the size of an object’s bounding box on the render target drops below a given threshold.
If an object has several geometry LODs, tessellation is used on the most detailed LOD.
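The tessellation-factor logic above can be sketched on the CPU. The target edge length, the clamp to the Direct3D tessellator maximum of 64, and the bounding-box threshold test are illustrative assumptions; the engine evaluates this per patch in the hull shader.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical sketch of per-edge tessellation factor selection.
// targetEdgePixels is the desired on-screen edge length after tessellation.
float tessellationFactor(float projectedEdgePixels,
                         float targetEdgePixels,
                         bool backFacing,
                         bool outsideFrustum)
{
    // Culled patches get factor zero, so the tessellator emits nothing.
    if (backFacing || outsideFrustum)
        return 0.0f;
    // Subdivide until edges reach the target length, clamped to the
    // hardware maximum factor of 64.
    return std::clamp(projectedEdgePixels / targetEdgePixels, 1.0f, 64.0f);
}

// Tessellation (hull/domain shaders) is bypassed entirely once the
// object's projected bounding box drops below a pixel threshold.
bool useTessellation(float bboxPixels, float thresholdPixels)
{
    return bboxPixels >= thresholdPixels;
}
```

Setting the factor to zero for culled patches avoids any domain-shader work for geometry that cannot contribute to the image.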
Graphics Test 1 uses a deferred rendering pipeline. Objects are first rendered into a G-buffer that contains all the geometry attributes that are required for the illumination. Illumination is computed in multiple passes and the final result is blended with transparents and fed to the post-processing stages.
Objects are rendered in two steps depending on the attributes of the geometries. First, all non-transparent objects are drawn into the G-buffer. In the second step, transparent objects are rendered using an order-independent transparency algorithm to another target, which is then resolved on top of surface illumination later on.
Geometry rendering uses a LOD system to reduce the number of vertices and triangles for objects that are far away. This also results in larger on-screen triangles.
The material system uses physically-based materials. The system supports the following material textures: Albedo (RGB) + metalness (A), Roughness (R) + Cavity (G), Normal (RG), Ambient Occlusion (R), Displacement, Luminance, Blend, and Opacity. A material might not use all these textures.
Opaque objects are rendered directly to the G-buffer. The G-buffer is composed of textures for Depth, Normal, Albedo, Material Attributes, and Luminance. A material might not write to all of these targets.
When rendering transparent geometries, the engine uses a technique called “Weighted Order-Independent Transparency” (McGuire & Bavoil, 2013). The technique requires only two render targets and special blending settings to achieve a good approximation of correctly sorted transparency. Transparents are blended on top of the final surface illumination.
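A minimal CPU sketch of the weighted-OIT resolve for one pixel. The depth-based weight function is one of the variants proposed by McGuire & Bavoil (2013), not necessarily the engine's exact choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Rgba { float r, g, b, a; };

// One of the depth-based weight variants from McGuire & Bavoil (2013);
// nearer fragments (smaller depth) receive larger weights.
float weight(float depth, float alpha)
{
    return alpha * std::max(1e-2f, 3e3f * std::pow(1.0f - depth, 3.0f));
}

// Target 1 accumulates weighted premultiplied color; target 2 (here the
// 'revealage' scalar) accumulates the product of (1 - alpha).
Rgba resolve(const std::vector<Rgba>& fragments,
             const std::vector<float>& depths,
             Rgba background)
{
    float accumR = 0, accumG = 0, accumB = 0, accumA = 0;
    float revealage = 1.0f;
    for (std::size_t i = 0; i < fragments.size(); ++i) {
        const float w = weight(depths[i], fragments[i].a);
        accumR += fragments[i].r * fragments[i].a * w;
        accumG += fragments[i].g * fragments[i].a * w;
        accumB += fragments[i].b * fragments[i].a * w;
        accumA += fragments[i].a * w;
        revealage *= 1.0f - fragments[i].a;
    }
    // Normalize the weighted sum and composite over the opaque result.
    const float denom = std::max(accumA, 1e-5f);
    Rgba out;
    out.r = accumR / denom * (1 - revealage) + background.r * revealage;
    out.g = accumG / denom * (1 - revealage) + background.g * revealage;
    out.b = accumB / denom * (1 - revealage) + background.b * revealage;
    out.a = 1.0f;
    return out;
}
```

Because the accumulation is a weighted sum and a product, fragments can be blended in any order, which is what removes the need to sort transparent geometry.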
Lighting is evaluated using a tiled method in multiple separate passes.
Before the main illumination passes, asynchronous compute shaders are used to cull lights, compute screen-space ambient occlusion and evaluate unshadowed illumination. These tasks are started right after G-buffer rendering has finished and are executed alongside shadow rendering. All omni-lights are culled to small tiles (16x16 pixels) and written to an intermediate buffer. Frustum lights and environment cubes are culled for every pixel, because there are only a couple of them. Ambient occlusion and unshadowed illumination results are written out to their respective textures.
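The omni-light binning into 16x16 pixel tiles can be illustrated with a simplified screen-space version. The real pass is a compute shader that also tests per-tile depth bounds; this sketch only tests the light's projected bounding circle, and the structure names are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// A light's projected position and radius in screen pixels.
struct ScreenLight { float x, y, radius; };

constexpr int kTileSize = 16;

// For every light, append its index to the list of each tile its
// bounding circle can overlap; the result is the intermediate buffer
// the illumination passes read per tile.
std::vector<std::vector<uint32_t>> cullLightsToTiles(
    const std::vector<ScreenLight>& lights, int width, int height)
{
    const int tilesX = (width + kTileSize - 1) / kTileSize;
    const int tilesY = (height + kTileSize - 1) / kTileSize;
    std::vector<std::vector<uint32_t>> tiles(tilesX * tilesY);

    for (uint32_t li = 0; li < lights.size(); ++li) {
        const ScreenLight& l = lights[li];
        // Only visit the tile range the light's circle can overlap.
        const int x0 = std::max(0, int((l.x - l.radius) / kTileSize));
        const int x1 = std::min(tilesX - 1, int((l.x + l.radius) / kTileSize));
        const int y0 = std::max(0, int((l.y - l.radius) / kTileSize));
        const int y1 = std::min(tilesY - 1, int((l.y + l.radius) / kTileSize));
        for (int ty = y0; ty <= y1; ++ty)
            for (int tx = x0; tx <= x1; ++tx)
                tiles[ty * tilesX + tx].push_back(li);
    }
    return tiles;
}
```

Each shading thread then iterates only the lights listed for its own tile instead of every light in the scene.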
Illumination for shadowed lights is calculated after the completion of the shadow map rendering. This is also written out to its respective texture.
These results are combined in the global illumination pass while adding probe-based global illumination for objects that do not use light maps.
Reflection illumination is evaluated for the opaque surfaces by combining Screen Space Reflections (SSR) and sampling the precomputed reflection cubes for those surfaces that are rough (above a fixed threshold). Reflections are blended into the illumination in the SSR combination pass.
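The reflection-source selection described above can be sketched as follows; the SSR-miss fallback and all parameter names are assumptions, shown here per shading point.

```cpp
#include <cassert>

// Hypothetical sketch of choosing the reflection source: screen-space
// reflection (SSR) for smooth surfaces where the SSR ray hits, the
// precomputed reflection cube for rough surfaces or SSR misses.
float reflectionRadiance(float ssrSample, bool ssrHit,
                         float cubeSample, float roughness,
                         float roughnessThreshold)
{
    if (roughness > roughnessThreshold || !ssrHit)
        return cubeSample;   // rough surface or SSR miss: use the cube
    return ssrSample;        // sharp reflections come from SSR
}
```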
Final illumination is passed into post-processing.
Graphics Test 2 uses a forward rendering pipeline.
In forward rendering mode, the geometry is rendered in the same order as in the deferred mode. The same input textures are used and illumination is computed similarly. The difference is that the outputs do not contain all the material information; instead, the illumination result itself is computed in the same pixel shader. There is only one color render target, which stores the illumination, and a depth target used for post-processing effects. There is no depth pre-pass. All the lights in the scene are iterated for every pixel and there is no culling step.
Particles are simulated on the GPU using the asynchronous compute queue. Rendering is performed using indirect draw calls with inputs coming from the simulation buffers.
Simulation is executed with multiple compute shader passes in the asynchronous queue alongside shadow map rendering. The following steps are executed per frame for each particle system:
- The alive particle count is cleared
- New particles are emitted
- Particles are simulated
- Alive particles are counted and the count is written into a buffer that is used as an indirect argument buffer in the draw phase
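The per-frame steps above can be mocked on the CPU. `DrawIndirectArgs` mirrors the D3D12 `DrawInstanced` argument layout; the one-second emission lifetime and the four-vertex quad per particle are placeholder assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Particle { float life; };

// Matches the argument layout consumed by ExecuteIndirect for a
// DrawInstanced call.
struct DrawIndirectArgs {
    uint32_t vertexCountPerInstance;
    uint32_t instanceCount;
    uint32_t startVertexLocation;
    uint32_t startInstanceLocation;
};

DrawIndirectArgs updateParticleSystem(std::vector<Particle>& particles,
                                      uint32_t emitCount, float dt)
{
    // 1. The alive count is cleared (recomputed from scratch below).
    // 2. Emit new particles with a placeholder one-second lifetime.
    for (uint32_t i = 0; i < emitCount; ++i)
        particles.push_back({1.0f});
    // 3. Simulate: age every particle by the frame delta time.
    for (Particle& p : particles)
        p.life -= dt;
    // 4. Count alive particles and write the indirect argument buffer;
    //    each particle is drawn as one instanced quad (4 vertices).
    uint32_t alive = 0;
    for (const Particle& p : particles)
        if (p.life > 0.0f) ++alive;
    return {4, alive, 0, 0};
}
```

Writing the alive count into the instance-count slot is what lets the draw happen without a GPU-to-CPU readback.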
Particles can be illuminated with scene lights or they can be self-illuminated. The output buffers of the GPU light culling pass are used as inputs for illuminated particles. The illuminated particles are drawn without tessellation and they are illuminated in either the vertex or pixel shader. Particles are blended together with the same order-independent technique as transparent geometries.
Depth of field
The effect is based on a separable blur filter that is used to create an out-of-focus texture in the following manner.
- The circle of confusion radius is computed for all screen pixels based on the half-resolution depth. The output texture is obtained by multiplying the illumination with the corresponding radii. The average radius is stored in the output alpha channel.
- The result of the previous step is blurred in two passes using a separable filter and two work textures so that we get hexagonal bokehs when the outputs are combined.
- Upon summing the work textures together in the combination step, they are divided by the stored average radii to renormalize the illumination.
- The final result is obtained by linearly interpolating between the original illumination and the out-of-focus illumination based on the radius calculated from the full-resolution depth.
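The radius computation and the final mix can be sketched with a thin-lens model; the camera parameters, the normalization by a maximum radius, and the single-channel composite are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Thin-lens circle of confusion: the radius grows with distance from
// the focus plane and is clamped to the filter's maximum radius.
float cocRadius(float depth, float focusDistance,
                float focalLength, float aperture, float maxRadius)
{
    const float coc = std::fabs(aperture * focalLength *
                                (depth - focusDistance) /
                                (depth * (focusDistance - focalLength)));
    return std::min(coc, maxRadius);
}

// Final composite: linearly interpolate between the sharp color and the
// out-of-focus color by the normalized radius (shown per channel).
float composite(float sharp, float blurred, float radius, float maxRadius)
{
    const float t = std::clamp(radius / maxRadius, 0.0f, 1.0f);
    return sharp + (blurred - sharp) * t;
}
```

A pixel exactly on the focus plane gets radius zero and keeps the original full-resolution illumination.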
Bloom is based on a compute shader FFT that evaluates several effects with one filter kernel. The effects are blur, streaks, anamorphic flare and lenticular halo. Bloom is computed in half resolution to make it faster.
Lens flare
The effect is computed by first applying a filter to the computed illumination in the frequency domain, as in the bloom effect. The filtered result is then splatted at several scales and intensities on top of the input image using additive blending. The effect is computed in the same resolution as the bloom effect, and therefore the forward FFT needs to be performed only once for both effects. The filtering and inverse FFT are performed using compute shaders.
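The reason both effects can share one forward FFT is the convolution theorem: filtering with a kernel is pointwise multiplication in the frequency domain, so one spectrum of the illumination can be multiplied by each effect's kernel. A tiny 1D illustration follows; the engine uses a 2D compute-shader FFT, and this naive O(n²) DFT is only for clarity.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Cplx = std::complex<double>;

// Naive discrete Fourier transform; forward is unnormalized, the
// inverse divides by n, so a round trip reproduces the input.
std::vector<Cplx> dft(const std::vector<Cplx>& x, bool inverse)
{
    const double kPi = std::acos(-1.0);
    const std::size_t n = x.size();
    const double sign = inverse ? 1.0 : -1.0;
    std::vector<Cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        Cplx sum = 0;
        for (std::size_t j = 0; j < n; ++j)
            sum += x[j] * std::polar(1.0, sign * 2.0 * kPi * k * j / n);
        out[k] = inverse ? sum / double(n) : sum;
    }
    return out;
}

// Filter a signal with a kernel by multiplying their spectra pointwise
// and transforming back: this equals circular convolution.
std::vector<double> filterViaDft(const std::vector<double>& signal,
                                 const std::vector<double>& kernel)
{
    std::vector<Cplx> s(signal.begin(), signal.end());
    std::vector<Cplx> k(kernel.begin(), kernel.end());
    auto S = dft(s, false);
    auto K = dft(k, false);
    for (std::size_t i = 0; i < S.size(); ++i)
        S[i] *= K[i];                       // filtering in frequency space
    auto result = dft(S, true);
    std::vector<double> out(result.size());
    for (std::size_t i = 0; i < result.size(); ++i)
        out[i] = result[i].real();
    return out;
}
```

Filtering an impulse reproduces the kernel itself, which is a quick way to verify the transform pair.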