During the benchmark, a set number of images (16 by default) is generated. Image generation is split into batches, and the batch size depends on the image generation model: SD1.5 uses a batch size of 4, and SDXL uses a batch size of 1.
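The relationship between image count and batch size can be sketched as simple arithmetic. This is an illustration only, not the benchmark's own code: the function name `batch_count` and the constants are hypothetical, and rounding up a final partial batch is an assumption that does not matter for the default values.

```python
import math

# Values taken from the text above; the names are illustrative only.
DEFAULT_IMAGES = 16
BATCH_SIZES = {"SD1.5": 4, "SDXL": 1}


def batch_count(total_images: int, batch_size: int) -> int:
    """Return how many batches are needed to generate total_images.

    Assumption: a final, partially filled batch still counts as a batch.
    """
    if total_images < 0 or batch_size < 1:
        raise ValueError("total_images must be >= 0 and batch_size >= 1")
    return math.ceil(total_images / batch_size)


for model, size in BATCH_SIZES.items():
    print(f"{model}: {batch_count(DEFAULT_IMAGES, size)} batches of {size}")
```

With the default of 16 images, this gives 4 batches for SD1.5 and 16 batches for SDXL.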

Within each batch, the detailed results show where most of the time is spent during the workload. Typically, most of the time is spent in the UNET step.

  • Text encoder – converts the text prompt into tokenized text.
  • UNET – takes the tokenized text, adds random noise, then loops over denoising steps to create an image in the latent space.
  • VAE – decodes the latent image into the final (actual) image output.
  • Pipeline – all of the above steps, plus the initialization of the random latent noise.
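As a rough illustration of how these four timings relate to each other, the sketch below times one batch with stand-in stages. It is not how Procyon measures the workload: `run_batch`, its parameters, and the lambda stages in the usage example are all hypothetical. The one point it is meant to show is that the pipeline time wraps the three stages plus the latent noise initialization.

```python
import time


def run_batch(prompt, text_encoder, unet, vae, make_noise, clock=time.perf_counter):
    """Run one batch and return (images, per-stage timings in seconds).

    All four callables are hypothetical stand-ins for the real model stages.
    """
    timings = {}
    pipeline_start = clock()

    start = clock()
    encoded = text_encoder(prompt)
    timings["text_encoder"] = clock() - start

    # Noise initialization is counted only toward the pipeline total.
    latents = make_noise()

    start = clock()
    latents = unet(encoded, latents)  # the denoising loop would run in here
    timings["unet"] = clock() - start

    start = clock()
    images = vae(latents)
    timings["vae"] = clock() - start

    timings["pipeline"] = clock() - pipeline_start
    return images, timings


if __name__ == "__main__":
    # Trivial stand-in stages, only to show the call shape.
    images, timings = run_batch(
        "a lighthouse at dusk",
        text_encoder=lambda p: p.split(),
        unet=lambda enc, lat: lat,
        vae=lambda lat: lat,
        make_noise=lambda: [0.0] * 4,
    )
    slowest = max(("text_encoder", "unet", "vae"), key=timings.get)
    print(f"slowest stage: {slowest}")
```

The `clock` parameter exists only so the timing logic can be checked with a deterministic counter instead of a real timer.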

Diagram showing a simplified view of the Stable Diffusion steps used in the Procyon AI Image Generation Benchmark and the scores they relate to.