A New Feature: Accelerated Model Output Post-Processing With CuPy

When deep neural networks are evaluated with the CUDA runtime, the model's input and output tensors are allocated in GPU memory. The next step is to extract high-level data such as bounding boxes, attributes, or masks from those raw GPU-allocated tensors.

There are two approaches:

  • Download the raw tensors to general-purpose RAM and process them with CPU-based algorithms;
  • Utilize CUDA to process the raw tensors directly on the GPU.

Unfortunately, there is no silver bullet that fits every case: sometimes CPU-based processing is so efficient that downloading the raw tensors from GPU RAM to CPU RAM is worth it; in other scenarios, CUDA-accelerated processing significantly outperforms its CPU-based competitors.

Nevertheless, users need tools to choose between the two approaches. For CPU-based post-processing, the first thing that comes to mind is NumPy, the de facto standard, which demonstrates excellent performance when used correctly. Developers can combine NumPy with auxiliary technologies like Numba or OpenCV to improve the performance of the computations.
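
To give a sense of the CPU-bound style, here is a minimal sketch combining NumPy and OpenCV for mask post-processing (the shapes, threshold, and function name are illustrative, not part of Savant):

from typing import List, Tuple

import cv2
import numpy as np

def masks_to_uint8(masks: np.ndarray, size: Tuple[int, int]) -> List[np.ndarray]:
    """Binarize raw mask logits and upscale them to the target (width, height).

    ``masks`` is an (N, H, W) float array already downloaded to CPU RAM.
    """
    binary = (masks > 0.5).astype(np.uint8)  # vectorized NumPy thresholding
    return [
        cv2.resize(m, size, interpolation=cv2.INTER_NEAREST) for m in binary
    ]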

Regarding CUDA-optimized computations, the first thing that comes to mind is writing low-level CUDA kernels in a C-like language. However, it is not the only approach: there is an excellent library, CuPy, a drop-in replacement for NumPy that uses CUDA for accelerated computations. Often, replacing NumPy with CuPy is as simple as changing the module import declaration; and when you need to implement a custom kernel, CuPy provides the means for doing that.
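
To illustrate how literal the replacement can be, here is a hedged sketch (the sigmoid function and custom kernel below are our own examples, not Savant or CuPy built-ins):

import cupy as cp  # often the only change: "import numpy as np" becomes this

def sigmoid(x: cp.ndarray) -> cp.ndarray:
    # The NumPy-style vectorized expression now runs on the GPU unchanged.
    return 1.0 / (1.0 + cp.exp(-x))

# When no ready-made function fits, CuPy compiles a custom CUDA kernel for you:
scaled_sigmoid = cp.ElementwiseKernel(
    'float32 x, float32 scale',       # input parameters
    'float32 y',                      # output parameter
    'y = scale / (1.0f + expf(-x))',  # CUDA C body, executed per element
    'scaled_sigmoid',
)

x = cp.random.randn(1_000_000, dtype=cp.float32)
y = scaled_sigmoid(x, cp.float32(2.0))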

Previously, Savant provided only NumPy-based processing for output tensors, which works perfectly for detection and classification models because they produce small output tensors. However, we recently developed several demos with models that produce large output tensors. Downloading them to CPU RAM and processing them on the CPU caused significant CPU load and a performance decrease. I am primarily talking about segmentation models producing masks and generative models producing whole images (like super-resolution models generating huge raw images).

Considering the situation, we released a new feature that gives access to raw tensors allocated in GPU RAM through CuPy, without the need to download them and process them with conventional CPU-bound algorithms. The first sample that exploits the feature is the instance segmentation sample: we modified it to use either CPU-bound NumPy post-processing or GPU-bound CuPy post-processing, depending on user needs.

Now, when you implement a post-processing converter, you can specify how it receives the data: with NumPy or with CuPy.

CuPy:

"""YOLOv8-seg postprocessing (converter)."""
from typing import Any, List, Tuple

import cupy as cp
import numpy as np

from savant.base.converter import BaseComplexModelOutputConverter, TensorFormat
# import path assumed from the Savant repo layout
from savant.deepstream.nvinfer.model import NvInferInstanceSegmentation

class TensorToBBoxSegConverter(BaseComplexModelOutputConverter):
    """YOLOv8-seg output converter.

    :param confidence_threshold: confidence threshold (pre-cluster-threshold)
    :param nms_iou_threshold: NMS IoU threshold
    :param top_k: leave no more than top K bboxes with maximum confidence
    """

    tensor_format: TensorFormat = TensorFormat.CuPy

    def __call__(
        self,
        *output_layers: cp.ndarray,
        model: NvInferInstanceSegmentation,
        roi: Tuple[float, float, float, float],
    ) -> Tuple[np.ndarray, List[List[Tuple[str, Any, float]]]]:
        ...

NumPy:

"""YOLOv8-seg postprocessing (converter)."""
from typing import Any, List, Tuple

import cv2
import numpy as np

from savant.base.converter import BaseComplexModelOutputConverter, TensorFormat
# import path assumed from the Savant repo layout
from savant.deepstream.nvinfer.model import NvInferInstanceSegmentation

class TensorToBBoxSegConverter(BaseComplexModelOutputConverter):

    tensor_format: TensorFormat = TensorFormat.NumPy

    def __call__(
        self,
        *output_layers: np.ndarray,
        model: NvInferInstanceSegmentation,
        roi: Tuple[float, float, float, float],
    ) -> Tuple[np.ndarray, List[List[Tuple[str, Any, float]]]]:
        ...

When the pipeline initializes, it applies the required mechanisms to deliver the output tensors to each post-processing converter as objects allocated in main memory (NumPy) or in GPU memory (CuPy), according to the converter's tensor_format.
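
Note that even the CuPy converter above is annotated to return np.ndarray objects: the intended pattern is to keep the heavy per-pixel math on the GPU and download only the compact results at the end. A minimal sketch of that pattern (the function is illustrative, not a Savant API):

import cupy as cp
import numpy as np

def finalize_masks(
    mask_logits: cp.ndarray, confidences: cp.ndarray, threshold: float = 0.25
) -> np.ndarray:
    """Threshold on the GPU, then download only the surviving masks."""
    keep = confidences > threshold    # boolean filter computed on the GPU
    binary = mask_logits[keep] > 0.5  # per-pixel work stays in GPU memory
    return cp.asnumpy(binary)         # single explicit device-to-host copy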

There Is No Silver Bullet

We found it hard to predict which approach gives better results in a particular situation. Currently, we consider the following factors that jointly shape the outcome: GPU family, GPU power, CPU family, CPU single-core performance, CPU core count, PCIe bandwidth, and whether the device is a Jetson or uses a discrete GPU.

We have witnessed very mixed results depending on the factors above:

  • A notebook with an RTX 3060 or RTX 4060: a 2.5–3.5× speed-up on the instance segmentation task (40 to 150 FPS);
  • A workstation with an Intel i5-6400 and an RTX 4000: the CPU was offloaded, but FPS did not change significantly;
  • A workstation with an AMD Ryzen 7 3700X and an RTX A4000: the CPU was offloaded, and FPS improved slightly;
  • An Nvidia Jetson Orin Nano 8GB: performance decreased.

Such varied results mean only one thing: you must try both approaches on the hardware planned for production.
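
A quick way to make that decision is to time both paths on the target machine; CuPy ships a micro-benchmark helper that synchronizes the GPU around each run. A sketch with a stand-in workload (replace it with your real converter logic):

import cupy as cp
import numpy as np
from cupyx.profiler import benchmark

masks_cpu = np.random.rand(32, 160, 160).astype(np.float32)
masks_gpu = cp.asarray(masks_cpu)  # one-off host-to-device copy

def cpu_path():
    return (masks_cpu > 0.5).sum()

def gpu_path():
    return (masks_gpu > 0.5).sum()

print(benchmark(cpu_path, n_repeat=100))
print(benchmark(gpu_path, n_repeat=100))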

Sample

A demo featuring GPU-based post-processing is the instance segmentation sample. It is available in the Savant repository.