To fully take advantage of the performance improvements that DirectX 12 offers, Time Spy uses a custom game engine developed in-house from the ground up. The engine was created with the input and expertise of AMD, Intel, Microsoft, NVIDIA, and the other members of the UL Benchmark Development Program.
Rendering work, including scene update, visibility evaluation, and command list building, is done with multiple CPU threads, using one thread per available logical CPU core. This spreads the CPU load across all available cores instead of concentrating it on a single one.
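The threading layout above can be illustrated with a minimal sketch. It is not the engine's code: `build_command_list` is a hypothetical stand-in for the per-thread work, and Python threads are used only to show the structure (one worker per logical core, results gathered in a fixed order), not to demonstrate a speed-up.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def build_command_list(chunk):
    # Hypothetical per-thread work: update, cull, and record draws for a chunk.
    # For illustration, objects with odd ids are treated as not visible.
    return [f"draw:{obj}" for obj in chunk if obj % 2 == 0]

def split(items, parts):
    # Split items into `parts` contiguous chunks of nearly equal size.
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def record_frame(objects, workers=None):
    # One worker per logical CPU core unless the caller overrides it.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        lists = list(pool.map(build_command_list, split(objects, workers)))
    # pool.map preserves chunk order, so the combined order is deterministic.
    return [cmd for command_list in lists for cmd in command_list]
```

Calling `record_frame(list(range(10)), workers=4)` records the draws for the even-numbered objects in their original order, regardless of which worker finishes first.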
The engine supports the most common type of multi-GPU configuration, i.e. two identical GPU adapters in Crossfire/SLI, by using explicit multi-adapter with a linked-node configuration to implement explicit alternate frame rendering. Heterogeneous adapters are not supported.
The Umbra occlusion library (version 3.3.17 or newer) is used to accelerate and optimize object visibility evaluation for all cameras, including the main camera and light views used for shadow map rendering. The culling runs on the CPU and does not consume GPU resources.
One descriptor heap is created for each descriptor type when the scene is loaded. Hardware Tier 1 is sufficient for containing all the required descriptors in the heaps. Root signature constants and descriptors are used when suitable.
Implicit resource heaps created by ID3D12Device::CreateCommittedResource() are used for most resources. Explicitly created heaps are used for some target resources to reduce memory consumption by placing resources that are not needed at the same time on top of each other.
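The memory saving from overlapping resources can be sketched as a small planning problem. The following is an illustrative greedy planner, not the engine's allocator: the resource names, sizes, and pass indices are hypothetical, and real heaps also have alignment requirements that are ignored here.

```python
def plan_aliasing(resources):
    """Assign heap offsets so that resources with disjoint lifetimes overlap.

    resources: list of (name, size, first_use, last_use), where the use
    indices are pass numbers within a frame. Returns (offsets, heap_size).
    """
    slots = []        # each slot: {"size": int, "users": [(first, last), ...]}
    assignment = {}   # resource name -> slot index
    # Largest first, so the first resource placed in a slot defines its size.
    for name, size, first, last in sorted(resources, key=lambda r: -r[1]):
        for index, slot in enumerate(slots):
            # The slot can be reused only if no lifetime in it overlaps ours.
            if all(last < f or first > l for f, l in slot["users"]):
                slot["users"].append((first, last))
                assignment[name] = index
                break
        else:
            slots.append({"size": size, "users": [(first, last)]})
            assignment[name] = len(slots) - 1
    slot_offsets, cursor = [], 0
    for slot in slots:
        slot_offsets.append(cursor)
        cursor += slot["size"]
    offsets = {name: slot_offsets[index] for name, index in assignment.items()}
    return offsets, cursor
```

With the hypothetical input `[("ssao", 4, 1, 2), ("bloom_fft", 8, 5, 6), ("dof_splat", 8, 3, 4), ("gbuffer_tmp", 2, 1, 6)]`, the three short-lived targets share offset 0 and the heap needs 10 units instead of the 22 required without aliasing.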
Asynchronous compute is utilized heavily to overlap multiple rendering passes for maximum utilization of the GPU. The async compute workload per frame varies between 10% and 20%.
The engine supports Phong tessellation and displacement-map-based detail tessellation.
Tessellation factors are adjusted to achieve the desired edge length for the output geometry on the render target (G-buffer, shadow map or other). Additionally, patches that are back-facing and patches that are outside of the view frustum are culled by setting the tessellation factor to zero.
Tessellation is turned entirely off by disabling hull and domain shaders when the size of an object’s bounding box on the render target drops below a given threshold.
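The factor selection and the size cutoff described above can be sketched as follows. This is a simplified illustration under stated assumptions: a pinhole projection with a focal length given in pixels, a maximum factor of 64 (the Direct3D limit), and hypothetical parameter names. The engine's actual heuristics are not documented here.

```python
def edge_tess_factor(edge_world_len, distance, focal_px, target_px,
                     max_factor=64.0, culled=False):
    # Back-facing or out-of-frustum patches are culled with a zero factor.
    if culled:
        return 0.0
    # Approximate on-target length of the edge in pixels (pinhole projection).
    projected_px = edge_world_len * focal_px / max(distance, 1e-6)
    # Subdivide until each output edge is roughly target_px long.
    return min(max(projected_px / target_px, 1.0), max_factor)

def use_tessellation(bbox_size_px, threshold_px):
    # Below the threshold, hull and domain shaders would be disabled entirely.
    return bbox_size_px >= threshold_px
```

For example, an edge that projects to 100 pixels with a 10-pixel target gets a factor of 10, while a distant edge that projects to a single pixel is clamped to a factor of 1.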
If an object has several geometry LODs, tessellation is used on the most detailed LOD.
Objects are rendered in two steps. First, all opaque objects are drawn into the G-buffer. In the second step, transparent objects are rendered to an A-buffer, which is then resolved on top of surface illumination later on.
Geometry rendering uses a LOD system to reduce the number of vertices and triangles for objects that are far away. This also results in a bigger on-screen triangle size.
The material system uses physically-based materials.
Opaque objects are rendered directly to the G-buffer. The G-buffer is composed of textures such as depth, normal, albedo, material attributes, and luminance. A material might not use all target textures. For example, the luminance texture is written only when drawing geometries with luminous materials.
For rendering transparent geometries, the engine uses a variant of an order-independent transparency technique called Adaptive Transparency (Salvi et al. 2011). Simply put, a per-pixel list of fragments is created for which a visibility function (accumulated transparency) is approximated. The fragments are blended according to the visibility function and illuminated in the lighting pass to allow them to be rendered in any order. The A-buffer is drawn after the G-buffer to fully take advantage of early depth tests.
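The role of the visibility function can be shown with a small CPU-side sketch. It is a simplification, not the engine's shader: Adaptive Transparency keeps a compressed approximation of the visibility function per pixel, whereas this sketch uses the exact function built from all fragments, and colors are reduced to a single float.

```python
def resolve_fragments(fragments, background):
    """Blend one pixel's fragments, given in any order.

    fragments: list of (depth, color, alpha) tuples.
    """
    result = 0.0
    visibility = 1.0  # accumulated transparency in front of the current fragment
    for depth, color, alpha in sorted(fragments, key=lambda f: f[0]):
        # Each fragment contributes in proportion to what is still visible.
        result += visibility * alpha * color
        visibility *= 1.0 - alpha
    # Whatever visibility remains shows the opaque surface behind.
    return result + visibility * background
```

Because the fragments are ordered by depth inside the resolve, the submission order does not affect the result: the same list reversed produces the same value.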
In addition to the per-pixel lists of fragments, per-quad lists of fragments are created for 2x2 pixel quads. The per-quad lists can be used for selected renderables instead of the per-pixel lists. This saves memory when per-pixel information is not required for a visually satisfying result. When rendering to per-quad lists, a half-resolution viewport and depth texture are used to ignore fragments behind opaque surfaces. When resolving the A-buffer fragments for each pixel, both the per-pixel list and the per-quad list are read and blended in the correct order. Each per-quad list is read for four pixels in the resolve pass.
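How a resolve pass might gather both kinds of list for one pixel is sketched below. The dictionary-based storage and the function name are assumptions made for illustration; the engine's buffer layout is not described in the text.

```python
def fragments_for_pixel(x, y, pixel_lists, quad_lists):
    """Collect the fragments that affect pixel (x, y), nearest first.

    pixel_lists: dict mapping (x, y) -> list of (depth, color, alpha)
    quad_lists:  dict mapping (x // 2, y // 2) -> list of (depth, color, alpha)
    """
    per_pixel = pixel_lists.get((x, y), [])
    # All four pixels of a 2x2 quad map to the same half-resolution entry.
    per_quad = quad_lists.get((x // 2, y // 2), [])
    # Merge both sources into one depth-ordered list before blending.
    return sorted(per_pixel + per_quad, key=lambda f: f[0])
```

A fragment stored once in the per-quad list for quad (0, 0) is returned for pixels (0, 0), (1, 0), (0, 1), and (1, 1), which is where the memory saving comes from.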
Lighting is evaluated using a tiled method in multiple separate passes.
First, before the main illumination passes, asynchronous compute shaders are used to cull lights, evaluate illumination from prebaked environment reflections, compute screen-space ambient occlusion, and calculate unshadowed surface illumination. These tasks are started right after G-buffer rendering has finished and are executed alongside shadow rendering. All frustum lights, omni-lights, and reflection capture probes are culled to small tiles (16x16 pixels) and written to an intermediate buffer. Reflection illumination is evaluated for the opaque surfaces by sampling the precomputed reflection cubes. The results are written out to a separate texture. Ambient occlusion and unshadowed illumination results are written out to their respective targets.
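The tile culling step can be sketched on the CPU as follows. This is an illustration under simplifying assumptions: each light is reduced to a screen-space circle `(cx, cy, radius)` in pixels, depth bounds are ignored, and the result is a dictionary rather than a GPU buffer.

```python
TILE = 16  # tile size in pixels, as stated in the text

def cull_lights(lights, width, height):
    """Return a dict mapping (tile_x, tile_y) -> indices of overlapping lights."""
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    tile_lights = {}
    for index, (cx, cy, radius) in enumerate(lights):
        # Tile range covered by the light's screen-space bounding box.
        tx0 = max(int((cx - radius) // TILE), 0)
        tx1 = min(int((cx + radius) // TILE), tiles_x - 1)
        ty0 = max(int((cy - radius) // TILE), 0)
        ty1 = min(int((cy + radius) // TILE), tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                # Closest point of the tile rectangle to the circle centre.
                nx = min(max(cx, tx * TILE), (tx + 1) * TILE)
                ny = min(max(cy, ty * TILE), (ty + 1) * TILE)
                if (nx - cx) ** 2 + (ny - cy) ** 2 <= radius ** 2:
                    tile_lights.setdefault((tx, ty), []).append(index)
    return tile_lights
```

A lighting shader working on a given tile then only needs to loop over the short list stored for that tile instead of every light in the scene.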
Second, illumination from all lights and GI data is evaluated for the surface. The A-buffer is also resolved in a separate pass and then composed on top of surface illumination. This produces the final illumination that is sampled in the screen space reflection step, which also blends in previously computed environment illumination based on SSR quality. Reflections are applied on top of surface illumination. Surface illumination is also masked with SSAO results.
Third, volume illumination is computed. This includes two passes. The first one evaluates volume illumination from global illumination data and the second one calculates illumination from direct lights. The evaluation is done by raymarching the light ranges.
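A raymarch over a light's range can be sketched as a simple numerical integration. The sketch assumes a homogeneous medium with Beer-Lambert extinction and a caller-supplied `light_at` function; neither is stated in the text, and the engine's actual scattering model may differ.

```python
import math

def raymarch(t_near, t_far, steps, light_at, density):
    """Integrate in-scattered light along a ray between t_near and t_far.

    light_at: function returning the light intensity reaching distance t.
    density:  assumed constant density of the participating medium.
    Returns (scattered_light, remaining_transmittance).
    """
    dt = (t_far - t_near) / steps
    transmittance = 1.0
    scattered = 0.0
    for i in range(steps):
        t = t_near + (i + 0.5) * dt  # sample at the middle of each step
        # Light scattered toward the camera, attenuated by the medium in front.
        scattered += transmittance * light_at(t) * density * dt
        transmittance *= math.exp(-density * dt)
    return scattered, transmittance
```

Restricting `t_near` and `t_far` to the segment where the ray intersects the light's range keeps the step count spent on empty space to a minimum.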
Finally, surface illumination, GI volume illumination, and direct volume illumination are composed into one final texture with some blurring, which is then fed to post-processing stages.
Shadows are sampled in both surface and volume illumination shaders.
Particles are simulated on the GPU using the asynchronous compute queue. Simulation work is submitted to the asynchronous queue while G-buffer and shadow map rendering commands are submitted to the main command queue.
Particles are rendered by inserting particle fragments into an A-buffer. The engine utilizes a separate half-resolution A-buffer for low-frequency particles to allow more of them to be visible in the scene at once. They are blended together with the main A-buffer in the combination step. Particles can be illuminated with scene lights or they can be self-illuminated. The output buffers of the GPU light-culling pass and the global illumination probes are used as inputs for illuminated particles. The illuminated particles are drawn without tessellation and they are illuminated in the pixel shader.
Particles can cast shadows. Shadow-casting particles are rendered into transmittance 3D textures for lights that have particle shadows enabled. Before being used as an input to illumination shaders, an accumulated version of the transmittance texture is created. If typed UAV loads are supported, the transmittance texture is accumulated in-place. Otherwise, the accumulated result is written to an additional texture. The accumulated transmittance texture is sampled when rendering surface, particle, and volume illumination by taking one sample with bilinear filtering per pixel or per raymarching step. The resolution of the transmittance texture for each spotlight is evaluated on each frame based on the screen coverage of the light. For directional lights, fixed-resolution textures are used.
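The accumulation step can be pictured along a single column of the 3D texture. The sketch assumes that accumulation means a running product of per-slice transmittance along the light direction, which the text does not spell out, and it overwrites the input to mirror the in-place path.

```python
def accumulate_transmittance(slices):
    """Replace per-slice transmittance with the accumulated value, in place.

    slices: transmittance of each depth slice, ordered from the light outward.
    After the call, slices[i] holds the fraction of light that reaches past slice i.
    """
    accumulated = 1.0
    for i, transmittance in enumerate(slices):
        accumulated *= transmittance
        slices[i] = accumulated
    return slices
```

With the accumulated values stored, an illumination shader needs only one lookup at the shaded position rather than a walk through every slice between it and the light.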
Depth of field
The effect is computed by scattering the illumination in the out-of-focus parts of the input image using the following procedure.
- Using a compute shader (CS), the circle of confusion radius is computed for all screen pixels based on the depth texture. The information is additionally reduced to half and quarter resolutions. In the same CS pass, a splatting primitive (position, radius, and color) is appended to a buffer for each out-of-focus pixel whose circle of confusion radius exceeds a predefined threshold. For pixel quads and 4x4 tiles that are strongly out of focus, a splatting primitive per quad or tile is appended to the buffer instead of per-pixel primitives.
- The buffer with splatting primitives for the out-of-focus pixels is used as point primitive vertex data and, using a geometry shader, an image of a bokeh is splatted to the positions of these primitives. Splatting is done to a texture that is divided into regions with different resolutions using multiple viewports. The first region is screen resolution and the rest are a series of halved regions down to 1x1 texel resolution. The screen-space radius of the splatted bokeh determines which resolution is used: the larger the radius, the lower the splatting resolution.
- The different regions of the splatting texture are combined by up-scaling the data in the smaller resolution regions step by step to the screen resolution region.
- Finally, the out-of-focus illumination is combined with the original illumination.
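The layout of the splatting regions and the choice of region per bokeh can be sketched as follows. The halving rule for picking a region and the `full_res_max_radius` parameter are assumptions made for illustration; the text only states that larger radii use lower resolutions.

```python
def region_sizes(width, height):
    # Screen resolution first, then successively halved regions down to 1x1.
    sizes = [(width, height)]
    while sizes[-1] != (1, 1):
        w, h = sizes[-1]
        sizes.append((max(w // 2, 1), max(h // 2, 1)))
    return sizes

def region_for_radius(radius_px, full_res_max_radius, region_count):
    # Hypothetical rule: drop one resolution level each time the radius,
    # measured in that level's texels, still exceeds the allowed maximum.
    level = 0
    while radius_px > full_res_max_radius and level < region_count - 1:
        radius_px /= 2
        level += 1
    return level
```

Under this rule a bokeh covers a bounded number of texels in whichever region it lands in, so the splatting cost per primitive stays roughly constant as the circle of confusion grows.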
Bloom is based on a compute shader FFT that evaluates several effects with one filter kernel. The effects are blur, streaks, anamorphic flare and lenticular halo.
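The idea behind frequency-domain filtering is that a convolution with an arbitrarily large kernel becomes a per-element multiplication after the transform. The sketch below checks this on one image row, using a naive DFT in place of an FFT to stay dependency-free; the engine works on 2D images in compute shaders, and the function names here are illustrative only.

```python
import cmath

def dft(signal, inverse=False):
    # Naive O(n^2) discrete Fourier transform; an FFT gives the same result faster.
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out

def filter_in_frequency_domain(image_row, kernel):
    # Multiply the spectra element-wise, then transform back.
    spectrum = [a * b for a, b in zip(dft(image_row), dft(kernel))]
    return [value.real for value in dft(spectrum, inverse=True)]

def circular_convolution(image_row, kernel):
    # Direct spatial-domain reference for comparison.
    n = len(image_row)
    return [sum(image_row[j] * kernel[(i - j) % n] for j in range(n))
            for i in range(n)]
```

Because convolution is linear, the kernels for blur, streaks, flare, and halo can be summed into one kernel before the transform, which is consistent with several effects being evaluated by a single filter.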
The effect is computed by first applying a filter to the computed illumination in the frequency domain, as in the bloom effect. The filtered result is then splatted in several scales and intensities on top of the input image using additive blending. The effect is computed in the same resolution as the bloom effect, so the forward FFT needs to be performed only once for both effects. The filtering and inverse FFT are performed using compute shaders and floating-point textures.