Everybody loves benchmarking, and we love it, too! We always claim that Savant is fast and highly optimized for Nvidia hardware because it uses TensorRT inference under the hood. However, without numbers, such a claim may sound unfounded, so we decided to publish a benchmark demonstrating inference performance for three technology stacks:
- PyTorch on CUDA + video processing with OpenCV;
- PyTorch on CUDA + hardware-accelerated (NVDEC) video processing with Torchaudio (weirdly, the video processing primitives live in the Torchaudio library);
- Savant.
The first is the de-facto approach most developers use. The second is rarely used because it requires a custom build, and developers often underestimate hardware-accelerated video decoding/encoding as the critical enabler for CUDA-based processing.
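To make the comparison concrete, here is a minimal sketch of approach (1), assuming a CUDA device. The file name and the stand-in module are illustrative, not the benchmark's actual code (which runs YOLOv8m):

```python
# Approach (1): CPU decoding with OpenCV, then a per-frame upload to the
# GPU over PCI-E. A stand-in module keeps the example self-contained.
import cv2
import torch

device = torch.device("cuda")
model = torch.nn.Conv2d(3, 16, 3).half().to(device).eval()  # stand-in for YOLOv8m

cap = cv2.VideoCapture("input.mp4")  # decoding happens on the CPU
with torch.inference_mode():
    while True:
        ok, frame = cap.read()  # raw BGR frame in CPU memory: (H, W, 3), uint8
        if not ok:
            break
        frame = cv2.resize(frame, (640, 640))
        # every raw frame crosses the PCI-E bus: CPU memory -> GPU memory
        x = torch.from_numpy(frame).to(device)
        x = x.permute(2, 0, 1).unsqueeze(0).half() / 255.0
        model(x)
cap.release()
```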
Let us quickly look at why hardware-accelerated decoding is a must for CUDA. As you know, CUDA operations run on data allocated in highly optimized GPU memory. To get data there, we usually upload it from CPU memory over the PCI-E bus. For video, this is inefficient: raw frame buffers are enormous (a single 1080p RGB frame is about 6 MB, so a 30 FPS stream generates roughly 180 MB/s of PCI-E traffic), and modern accelerators are fast, so much of the time is spent just uploading data to GPU memory. Hardware-accelerated (NVDEC) decoding instead receives a compact compressed stream (H.264/HEVC) and decodes it directly into GPU memory, with no raw-frame upload at all. It is a highly efficient mechanism, both in decoding throughput and in memory-transfer efficiency.
This idea is huge!
NVENC and NVDEC are not just offload encoders and decoders; they also serve as the data bridge between GPUs and CPUs. Want to know more about it? Read our article describing it in detail.
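Here is a minimal sketch of how approach (2) keeps decoded frames on the GPU, assuming Torchaudio built against an FFmpeg with NVDEC support (the custom build mentioned above); the file name and chunk size are illustrative:

```python
# Approach (2): NVDEC decodes H.264 directly into GPU memory, so raw
# frames never cross the PCI-E bus.
from torchaudio.io import StreamReader

reader = StreamReader("input.mp4")
# request the NVDEC H.264 decoder and place decoded frames on cuda:0
reader.add_video_stream(frames_per_chunk=1, decoder="h264_cuvid", hw_accel="cuda:0")

for (chunk,) in reader.stream():
    # chunk is already a CUDA tensor; the exact pixel format depends on the build
    assert chunk.device.type == "cuda"
    # preprocessing and inference run here without a host-to-device copy
```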
So, returning to our comparison: participants (2) and (3) use this technology, while (1) does not.
In our benchmark, we compared YOLOv8m in FP16, configured with an input size of 640×640 and a batch size of 1.
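One hedged way to produce such an FP16 engine is the Ultralytics exporter shown below; this is an assumption for illustration, not necessarily how the Savant benchmark builds its engine:

```python
# Export YOLOv8m to a TensorRT engine matching the benchmark setup:
# FP16 precision, 640x640 input, batch size 1. Requires the
# `ultralytics` and `tensorrt` packages and a CUDA-capable GPU.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")
model.export(format="engine", half=True, imgsz=640, batch=1)
```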
The benchmark is implemented as follows:
- Reads a video file.
- Decodes it to get a raw frame tensor in CPU (1) or GPU (2, 3) memory.
- Runs tensor preprocessing operation.
- Runs CUDA-accelerated inference: PyTorch (1,2), TensorRT (3).
- Runs postprocessing involving NMS for the detected classes (a sketch of these steps follows the list).
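A minimal sketch of the preprocessing, inference, and postprocessing steps for the PyTorch configurations might look as follows. It assumes `frame` is a uint8 CUDA tensor of shape (3, H, W) and that `model` returns per-box (x1, y1, x2, y2, score, class) rows; these conventions are illustrative, not the benchmark's actual code:

```python
# Preprocessing, CUDA inference, and NMS postprocessing for the PyTorch runs.
import torch
import torch.nn.functional as F
from torchvision.ops import batched_nms

def infer(model, frame: torch.Tensor, score_thr=0.25, iou_thr=0.45):
    # preprocessing: batch of 1, FP16, [0, 1] range, 640x640 model input
    x = frame.unsqueeze(0).half() / 255.0
    x = F.interpolate(x, size=(640, 640), mode="bilinear")
    # CUDA-accelerated inference; assumed output: (N, 6) rows of
    # x1, y1, x2, y2, score, class
    out = model(x)[0]
    boxes, scores, classes = out[:, :4], out[:, 4], out[:, 5]
    keep = scores > score_thr
    boxes, scores, classes = boxes[keep], scores[keep], classes[keep]
    # postprocessing: per-class non-maximum suppression
    idx = batched_nms(boxes.float(), scores.float(), classes.long(), iou_thr)
    return boxes[idx], scores[idx], classes[idx]
```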
The results are displayed below and confirm our claims:
| Benchmark | FPS | Relative to Savant |
|---|---|---|
| PyTorch CUDA + OpenCV | 75 | 0.294 |
| PyTorch CUDA + HW Decode | 107 | 0.419 |
| Savant TensorRT + HW Decode | 255 | 1 |
As you can see, serving the YOLOv8m model with Savant significantly outperforms the competitors. Different batch sizes may produce different numbers, but Savant consistently demonstrates a significant performance improvement over PyTorch.
The benchmark is located on GitHub.
Don’t forget to subscribe to our X to receive updates on Savant. We also have a Discord, where we help users. To learn more about Savant, read our article: Ten reasons to consider Savant for your computer vision project.