During the benchmark, a set number of images (16 by default) is generated. Image generation is split into batches, and the batch size depends on the image generation model: SD1.5 FP16 uses a batch size of 4, while SD1.5 INT8 and SDXL FP16 each use a batch size of 1.
Within each batch, the detailed results show where most of the time is spent during the workload. Typically, the UNET step takes the most time.
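The relationship between image count and batch size can be worked out with simple arithmetic. The following is a minimal sketch, not part of the benchmark itself: the function name `batch_count` and the dictionary labels are hypothetical, and rounding up for a partial final batch is an assumption that does not arise with the default of 16 images.

```python
import math

# Batch sizes as stated in the text; the keys are illustrative labels only.
BATCH_SIZES = {"SD1.5 FP16": 4, "SD1.5 INT8": 1, "SDXL FP16": 1}


def batch_count(total_images: int, batch_size: int) -> int:
    """Return how many batches are needed to generate total_images.

    Rounding up for a partial last batch is an assumption made for this sketch.
    """
    if total_images < 0 or batch_size <= 0:
        raise ValueError("total_images must be >= 0 and batch_size must be > 0")
    return math.ceil(total_images / batch_size)


# With the default of 16 images, print the number of batches per model.
for model, size in BATCH_SIZES.items():
    print(f"{model}: {batch_count(16, size)} batches of {size}")
```

With the default image count, this gives 4 batches for SD1.5 FP16 and 16 batches for each of the two models that use a batch size of 1.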
- Text encoder – converts the text prompt into tokenized text.
- UNET – takes the tokenized text, adds random noise, then loops through denoising steps to create an image in the latent space.
- VAE – decodes the latent image into the final (actual) output image.
- Pipeline – all of the above steps, plus the initialization of the random latent noise.
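To see how the steps listed above relate to one another, the per-step times can be expressed as shares of the pipeline time. The sketch below is illustrative only: the function `stage_shares`, the key names, and the timing values are made up for this example and are not taken from the benchmark or its result format. It assumes the pipeline time is at least the sum of the three steps, with the remainder attributed to latent noise initialization and any other overhead.

```python
from typing import Dict

# Hypothetical key names for the three measured steps.
STAGES = ("text_encoder", "unet", "vae")


def stage_shares(timings_s: Dict[str, float]) -> Dict[str, float]:
    """Return each step's share of the pipeline time as a fraction.

    The 'other' entry is whatever the three steps do not account for,
    such as initialization of the random latent noise.
    """
    pipeline = timings_s["pipeline"]
    if pipeline <= 0:
        raise ValueError("pipeline time must be positive")
    shares = {name: timings_s[name] / pipeline for name in STAGES}
    shares["other"] = 1.0 - sum(shares.values())
    return shares


# Made-up timings in seconds for a single batch, for illustration only.
example = {"text_encoder": 0.05, "unet": 3.60, "vae": 0.30, "pipeline": 4.00}
shares = stage_shares(example)
dominant = max(STAGES, key=shares.get)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Step with the largest share: {dominant}")
```

In this made-up example the UNET step dominates, which matches the typical pattern described above; actual proportions depend on the model and hardware.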
Diagram showing a simplified view of the Stable Diffusion steps used in the Procyon AI Image Generation Benchmark and the scores they relate to.