Image classification

In AI image classification, the AI model inspects an image or video frame and classifies its contents. For the image classification test in the Procyon Computer Vision Benchmark, we use the ConvNeXt-Tiny (ImageNet-1K) AI model.
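
The mechanics can be sketched without the real model: a classifier like ConvNeXt-Tiny ends with one logit per ImageNet-1K class, and the prediction is the class with the highest softmax probability. The toy labels and logit values below are made up for illustration; only the softmax-then-argmax step reflects how classification output is actually read.

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max logit before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three hypothetical classes. A real ConvNeXt-Tiny head
# emits 1000 logits, one per ImageNet-1K class.
labels = ["tabby cat", "golden retriever", "espresso"]
logits = [1.2, 4.7, 0.3]

probs = softmax(logits)
top = max(range(len(probs)), key=probs.__getitem__)
print(labels[top])  # the class with the highest probability
```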

Uses for image classification include smart visual search functions, such as searching, sorting and tagging images or videos in a content library; retail inventory management; and even product quality control or assistance with medical diagnosis.

Image captioning

Image captioning refers to generating natural‑language descriptions of an image using an AI model that combines visual understanding with language generation. In the Procyon Computer Vision Benchmark, this task uses the BLIP (Base) model, where each caption is produced through one encoder pass followed by multiple decoder steps.
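
The encode-once, decode-many pattern described above can be illustrated with stand-in functions (this is not BLIP itself; the "encoder", "decoder" and canned caption below are placeholders): the image is encoded a single time, then the caption is built one token per decoder step until an end-of-sequence marker appears.

```python
def encode_image(image):
    # Stand-in for the vision encoder: runs exactly once per image.
    return sum(image) % 7

def decode_step(features, generated):
    # Stand-in for the language decoder: a real decoder conditions on the
    # image features and the tokens generated so far; here we just replay
    # a canned caption and finish with an end-of-sequence token.
    caption = ["a", "dog", "on", "a", "beach", "<eos>"]
    return caption[len(generated)]

image = [3, 1, 4, 1, 5]        # placeholder pixel data
features = encode_image(image)  # one encoder pass
tokens = []
while True:                     # multiple decoder steps
    token = decode_step(features, tokens)
    if token == "<eos>":
        break
    tokens.append(token)
print(" ".join(tokens))
```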

This workload mirrors several emerging Windows 11 scenarios, such as AI-enhanced accessibility capabilities, smart content tagging in applications like Photos, and screenshot or visual summary features in productivity tools.

Video Pipeline

Video object detection

Object detection identifies both what is in an image and where each object is located, typically outputting a bounding box (its position, width and height) together with a class label for each detection. Any application that needs to both identify and localize objects relies on object detection.

This test uses the Base DETR model with a ResNet50 backbone. As everyday PC tasks increasingly rely on visual understanding to speed up work, this test offers a key insight into the daily impact of AI in the office.
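
Since each detection pairs a class label with a box, detections are commonly scored against ground truth using intersection-over-union (IoU) of the two boxes. The sketch below is a generic illustration, not Procyon's scoring code; the example boxes are invented, and the (x, y, width, height) box convention is an assumption for this sketch.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x, y, width, height).
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(a[0], b[0]))  # overlap width
    iy = max(0, min(ay2, by2) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

# A detection pairs a class label with a localising bounding box.
detection = {"label": "person", "box": (10, 20, 50, 100)}
ground_truth = (12, 22, 50, 100)
print(round(iou(detection["box"], ground_truth), 3))
```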

Video segmentation

Video or image segmentation is the technique of partitioning an image or video frame into distinct regions, identifying which pixels belong to which object.

This test uses the SAM2 small variant AI model by Meta. AI image or video segmentation is used for tasks such as blurring the background of a video or applying masks to objects.
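
A segmentation model such as SAM2 ultimately yields a per-pixel mask; downstream effects like background blur then treat masked and unmasked pixels differently. The toy sketch below (not SAM2 code; the tiny grayscale "frame", mask and dimming factor are all invented) dims every background pixel while leaving the masked foreground untouched.

```python
def apply_mask(frame, mask, dim=0.25):
    # Keep foreground pixels (mask == 1) as-is; dim background pixels.
    return [
        [px if m else int(px * dim) for px, m in zip(frow, mrow)]
        for frow, mrow in zip(frame, mask)
    ]

# 3x3 grayscale "frame" and a binary segmentation mask marking the object.
frame = [
    [100, 100, 100],
    [100, 200, 100],
    [100, 200, 100],
]
mask = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
]

out = apply_mask(frame, mask)
print(out)
```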

Video upscaling

AI-enhanced upscaling takes an initial video or image and improves its fidelity by using AI to infer and re-add missing detail. The video upscaling section of this benchmark uses the Real-ESRGAN AI model.
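
To see what the AI adds, it helps to contrast it with the simplest non-AI baseline: nearest-neighbour upscaling, which just repeats existing pixels and so adds no new detail. The sketch below implements that baseline on a tiny grayscale image (the 2x2 input is invented); a model like Real-ESRGAN instead predicts plausible new detail for the enlarged frame.

```python
def upscale_nearest(img, factor=2):
    # Repeat each pixel `factor` times horizontally and each row
    # `factor` times vertically; no new information is created.
    out = []
    for row in img:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

small = [[0, 255],
         [255, 0]]
big = upscale_nearest(small)
for row in big:
    print(row)
```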

This AI use case can be used to improve low-quality images or video feeds, or reduce the bandwidth needed for a clear picture, such as for a video call made from a place with poor signal.