Choosing the technology for a computer vision product in 2025 (Part 1)

Computer vision is a complex field that requires specialized expertise and adherence to best practices recommended by software and hardware vendors. Often, a naïve approach to computer vision application development leads to inefficient solutions that are unreliable or require excessive hardware, resulting in quality or economic failures. 

The problem is that it is all too easy to get things wrong: a promising proof-of-concept often evolves into a fragile application. This problem stems not only from the above-mentioned field complexity but also from the industry’s emerging state: computer vision technology has not yet reached a level of maturity comparable to the traditional software development ecosystem that developers use to build classic desktop, server, or mobile applications. An intuitive approach often leads in the wrong direction.

In this article, we discuss frameworks and tools developers can use to build state-of-the-art computer vision applications, while also identifying conceptual problems that are often overlooked and lead to inferior, if not dramatic, outcomes. We focus primarily on serving computer vision applications rather than training neural models, because serving is often underestimated, while training is well established and mature. Both training and serving are crucial, but our observations show that serving is usually treated as an easy-to-deliver, secondary task, a view that does not reflect reality on the ground.

The article comes in two parts: a landscape overview (this document) and a detailed walkthrough (available soon).

Before we begin, let us define two key terms that we will use throughout the article: computer vision and video analytics. Sometimes, they can be used interchangeably, but it is essential to understand their distinct characteristics.

Computer Vision is a field of artificial intelligence that enables computers to interpret and understand visual information from the world, such as images or videos. It focuses on tasks like object detection, image classification, segmentation, tracking, and scene understanding — essentially teaching machines to “see” and extract meaningful information from visual data.

Video Analytics, on the other hand, is an application area that builds upon computer vision techniques to analyze video streams in real time or offline. Its goal is to detect events, behaviors, or patterns of interest — for example, counting people, recognizing suspicious activities, or monitoring traffic. While computer vision provides the core algorithms and models for understanding visual content, video analytics integrates these capabilities into systems designed for practical, often domain-specific, insights and decisions.

Why is it so difficult: a multidisciplinary field

Computer vision applications require broad knowledge in many areas, and a gap in any one of them may lead to failure. To help you appreciate the problem, let us begin with an example: you need to determine the color of an object in the wild. At first, the problem seems trivial: use RGB values to calculate the color, either with a neural classifier or a classic computer vision algorithm. In reality, this is a vast and complex problem: our brain performs intricate estimations that require significant computing power to reproduce. Many factors influence color perception, including illumination parameters, white balance, shadow presence, and material (even a black metallic surface can appear white-ish under certain conditions).
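
To make the difficulty concrete, here is a tiny illustrative sketch (the pixel values and reference palette are invented for the example): naive nearest-color matching in RGB misclassifies a red object in shadow, while hue, which is roughly invariant to illumination intensity, still identifies it.

```python
import colorsys
import math

# Hypothetical reference palette for nearest-color matching.
REFERENCES = {"red": (255, 0, 0), "black": (0, 0, 0)}

def nearest_rgb(pixel):
    # Naive approach: Euclidean distance in RGB space.
    return min(REFERENCES, key=lambda name: math.dist(pixel, REFERENCES[name]))

def hue_of(pixel):
    # Hue is far less sensitive to illumination intensity than raw RGB values.
    r, g, b = (c / 255.0 for c in pixel)
    return colorsys.rgb_to_hsv(r, g, b)[0]  # 0.0 corresponds to red

bright_red = (200, 30, 30)   # the object in direct light
shadowed_red = (80, 12, 12)  # the same object in shadow

print(nearest_rgb(bright_red))    # red
print(nearest_rgb(shadowed_red))  # black: naive RGB matching fails
print(hue_of(bright_red) == hue_of(shadowed_red))  # True: hue survives the shadow
```

Even the hue-based fix breaks down under colored illumination or white-balance drift, which is exactly why the problem is much harder than it first appears.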

To be productive in the domain, you need to know optics, photography, color representation (RGB, Y’UV, CMYK, HSV, LAB, and how to convert between them), network video streaming protocols (RTP, RTSP, HLS, WebRTC), video codec details (H.264, HEVC, AV1, VP9, MJPEG), video codec pixel formats (yuv420p, yuv422p, yuv444p), how to efficiently load and process data on a GPU or NPU, how to quantize, prune, and tweak networks for particular hardware, how to find GPU/CPU bottlenecks, how to parallelize computations efficiently, and more. And you need to know how to train models.
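
As a small taste of why pixel-format knowledge matters in practice, the chroma subsampling scheme alone determines how much raw data a pipeline must move per frame; a quick back-of-the-envelope sketch:

```python
def frame_bytes(width, height, pixel_format):
    """Uncompressed frame size for a few common 8-bit pixel formats."""
    sizes = {
        "rgb24":   width * height * 3,       # three full-resolution channels
        "yuv444p": width * height * 3,       # full chroma resolution
        "yuv422p": width * height * 2,       # chroma halved horizontally
        "yuv420p": width * height * 3 // 2,  # chroma halved in both axes
    }
    return sizes[pixel_format]

# For a 1080p frame, 4:2:0 subsampling halves the raw data vs. RGB:
print(frame_bytes(1920, 1080, "rgb24"))    # 6220800 bytes (~6 MB)
print(frame_bytes(1920, 1080, "yuv420p"))  # 3110400 bytes (~3 MB)
```

At 30 FPS per camera, that difference alone is roughly 90 MB/s of memory traffic, which is one reason decoders typically emit yuv420p and conversions to RGB are deferred as long as possible.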

In such a cross-knowledge field, doing things right is crucial: a lack of skill or reliance on fragmented knowledge is unacceptable. Aside from the above, general reasoning skills and environmental analysis are required to foresee and model situations that are absent in the laboratory. This often concerns mental synthesis rather than analysis, because you need to model what you have not yet met in your data.

Examples of such modelling questions include:

  • “How will our computer vision system run under heavy rain or a snowstorm?”
  • “What if the camera is dirty? Can we recognize that it is dirty?”
  • “What if the device is overheating and starts throttling?”

What is “throttling”, anyway? It is something usually only hardware engineers are aware of; how does it relate to computer vision? Well, it turns out it does!

Such a vast amount of information and knowledge can be overwhelming, making the field of computer vision challenging to study and risky to develop and commercialize. Democratizing computer vision and video analytics, so they can be adopted efficiently across the economy, is a significant and complex challenge. Keeping this in mind, selecting the right computer vision technology is crucial for practical CV application development.

Computer vision problems

The technology is designed to address multiple problems encountered in computer vision applications. Let us identify and discuss them:

  • neural model inference;
  • classic computer vision algorithms;
  • working with image/video sources and sinks (video files, images, RTSP, RTMP, CSI, USB, WebRTC, etc.);
  • live stream synchronization;
  • dynamic re-configuration; 
  • IoT/edge, mobile, browser, and data-center applicability;
  • control flow management;
  • sensor fusion;
  • benchmarking, profiling, and monitoring;
  • latency control, real-time and batched processing;
  • metadata visualization and video generation;
  • developer ecosystem.

The list covers the major problems, though obviously not all of them. It is fair to say that, depending on the application, particular aspects may be crucial or may not play any valuable role. Let us quickly explain each of them to ensure we discuss the specifics from the same perspective.

Neural model inference

This is a core feature of any computer vision system: neural models, such as deep neural networks (DNNs), are the workhorses of modern computer vision. They simulate biological perception of the outer world and deliver a quality impossible to achieve with traditional computer vision algorithms. However, they are resource-intensive, so hardware vendors provide engineers with specialized ASICs (application-specific integrated circuits) or specialized processors, such as Nvidia GPUs with the CUDA programming toolkit, to accelerate computations. In this article, we will focus on major vendors and brands, as the variety of specialized hardware is quite extensive. Some minor vendors are niche and not yet widely popular.

Computer vision algorithms

Being less “cognitive” and more mathematical, they are the foundation of computer vision. Data preparation, whether for training or inference, relies on classic computer vision algorithms, which range from color manipulation and geometric transformations, such as scaling, to feature matching and edge detection. These algorithms are ubiquitous for feeding neural networks with preprocessed data, thereby maximizing their efficiency and often reducing operational costs. 

For example, larger images are downsampled to the neural network’s input size to optimize model performance. Computer vision algorithms can frequently be executed on CPUs or GPUs, depending on the particular algorithm and framework.
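
One such preprocessing step is letterboxing: computing the scale and padding needed to fit a frame into a square network input while preserving the aspect ratio. A minimal sketch, assuming a 640×640 input size typical for many detectors:

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale and padding to fit a frame into a square dst x dst
    network input while preserving the aspect ratio."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst - new_w) // 2  # horizontal bands, if any
    pad_y = (dst - new_h) // 2  # vertical bands, if any
    return scale, new_w, new_h, pad_x, pad_y

# A 1920x1080 frame: scaled to 640x360 content, centered with
# 140-pixel bands on the top and bottom of the 640x640 input.
print(letterbox_params(1920, 1080))
```

The same parameters are reused after inference to map detected boxes back to the original frame's coordinates, so getting them right (and computing them on the GPU where possible) matters end to end.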

Accessing image and video sources

Data is ingested from third-party systems, and perception results are sent to third-party systems. The technology must address these tasks in a non-blocking, reliable, and recoverable manner. Just as you can read and write files with system calls, computer vision applications read live video and image streams, produce modified video and image streams, and send insights and signals to OLTP or OLAP storage for later use.

Typically, the sink communication mechanism is tailored to the application’s needs; the technology must therefore enable extension and custom implementation in a way that ensures optimal perception operation, providing extensible protocols and design patterns. The sources, in the meantime, are mostly standard, and the technology can support many of them out of the box, including RTSP and video files, along with a mechanism for implementing custom sources.
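
A minimal sketch of the non-blocking ingestion idea (illustrative only; a real source would wrap RTSP demuxing and decoding, and would report drops instead of silently discarding):

```python
import queue
import threading

def reader(frames, q):
    """Reader thread: never blocks the source indefinitely; frames
    that do not fit into the bounded queue in time are dropped."""
    for frame in frames:
        try:
            q.put(frame, timeout=0.1)
        except queue.Full:
            pass  # a production system would count and report dropped frames

q = queue.Queue(maxsize=4)  # bounded buffer between source and pipeline
t = threading.Thread(target=reader, args=(range(10), q))
t.start()
t.join()  # the consumer never ran in this toy example

consumed = []
while not q.empty():
    consumed.append(q.get())
print(consumed)  # [0, 1, 2, 3]: later frames were dropped, not blocked on
```

The bounded queue is the key design choice: when the pipeline stalls, the source keeps running and the system degrades by dropping frames rather than by building up unbounded latency.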

When discussing sources and sinks, it is essential to consider their integration with the other problems discussed in this article. Depending on the technology’s purpose, these problems can be treated as in scope or out of scope.

Live stream synchronization

Computer vision systems are advancing from single-lens to multi-lens setups. A stereo pair is a very popular example, but only a part of the landscape. The real-world need extends well beyond stereo pairs: multi-camera arrays are frequently used synchronously to track object transitions across cameras or to build a three-dimensional representation of a space. Such solutions require millisecond-grade precision to cross-map and identify objects among multiple cameras. Being complex by nature, the problem requires non-trivial approaches to merge the perceptions from numerous video sources into a whole picture. Some computer vision technologies offer support or best practices for implementing such solutions, while others consider the problem outside their scope.
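
The core of cross-camera synchronization is timestamp alignment. A simplified sketch of pairing frames from two cameras by nearest timestamp within a tolerance (a real system must additionally handle clock drift, PTP/NTP synchronization, and network jitter):

```python
def match_frames(ts_a, ts_b, tolerance_ms=10):
    """Pair frames from two cameras whose timestamps (in ms)
    differ by at most tolerance_ms. Both lists must be sorted."""
    pairs, j = [], 0
    for t in ts_a:
        # Skip camera-B frames that are too old to match.
        while j < len(ts_b) and ts_b[j] < t - tolerance_ms:
            j += 1
        if j < len(ts_b) and abs(ts_b[j] - t) <= tolerance_ms:
            pairs.append((t, ts_b[j]))
            j += 1
    return pairs

cam_a = [0, 33, 66, 100]   # ~30 FPS stream
cam_b = [4, 38, 71, 180]   # slightly offset; last frame arrived late
print(match_frames(cam_a, cam_b))  # [(0, 4), (33, 38), (66, 71)]
```

Note that the late frame at 180 ms is left unmatched: deciding whether to wait, drop, or interpolate in such cases is precisely where synchronization support in a framework earns its keep.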

IoT/edge, mobile, browser, and data center

Computer vision AI is applicable in many areas, ranging from smart cameras and inexpensive single-board computers to mobile phones, browsers, and data centers. Vendors promote AI functionality across various platforms to serve diverse business needs. Of course, one size does not fit all: particular technologies are created for specific needs. What counts as platform support is also a contentious topic, but to keep things simple, let us consider a platform supported if the technology allows running computer vision algorithms and neural network inference on it, because these represent the minimum requirements for developing practical computer vision applications.

Dynamic re-configuration

A running application often requires online configuration changes and support for application-defined settings on a per-source basis. This is less frequent in robotics but more common in larger systems that serve multiple cameras, requiring customizations based on each camera’s purpose, tracked events, and deployment conditions. The matter encompasses not only manually initiated configuration changes during deployment, but also changes driven by a feedback loop, in which specialized models trigger configuration changes based on their inference outputs. To appreciate the problem, consider an illumination estimation model that dynamically adjusts the object detection model’s confidence thresholds based on the amount of sunlight.
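
A toy sketch of such a feedback loop (all names and threshold values here are invented for illustration): an illumination estimate drives the detector's confidence threshold at runtime.

```python
class DetectorConfig:
    """Hypothetical per-source detector settings."""
    def __init__(self, base_threshold=0.5):
        self.confidence_threshold = base_threshold

def adjust_for_illumination(config, lux_estimate):
    """Lower the threshold in low light, where raw detector
    confidences tend to drop (illustrative values only)."""
    if lux_estimate < 50:        # near darkness
        config.confidence_threshold = 0.30
    elif lux_estimate < 500:     # indoor / overcast
        config.confidence_threshold = 0.40
    else:                        # daylight
        config.confidence_threshold = 0.50
    return config.confidence_threshold

cfg = DetectorConfig()
print(adjust_for_illumination(cfg, 20))     # 0.3, night mode
print(adjust_for_illumination(cfg, 10000))  # 0.5, daylight
```

The hard part in production is not the adjustment logic itself but applying it safely to a live pipeline, per source, without dropping frames, which is why first-class support for dynamic re-configuration matters.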

As with sources and sinks, a particular technology can either put this problem outside its scope or provide special features that support its implementation.

Control flow management

In general, a computer vision system often forms a graph in which data flows from upstream components to downstream components; these components are either coupled or distributed (e.g., as microservices) across multiple compute tiers, such as edge, mobile, or data center.

The technology supporting control flow management provides functional units for routing, buffering, filtering, repeating, rewinding, duplicating, and reorganizing streams flowing from upstream to downstream. 

Based on the supported features, it can form either a linear upstream-to-downstream video analysis or a non-linear one, in which:

  • Processing and analysis do not strictly follow the timeline (frame by frame).
  • A complex, adaptive, or iterative logic is used instead of simple “linear” rules.
  • It is possible to go back, revise, and refine conclusions as new data accumulates.

Often, a truly efficient video analytics system requires a non-linear approach: complex models that require heavy computation are invoked only when simpler, less computationally intensive models find promising data; the same video stream can be processed multiple times with rewinds.

For example, a simple, lightweight model detects a car once a second. When the vehicle is detected, the system rewinds the stream by a couple of seconds to process it with a full-featured pipeline, profiling the car’s motion using sophisticated models and algorithms.
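
The rewind pattern above can be sketched with a ring buffer (toy code: frame indices stand in for real decoded frames, and both detectors are stubs):

```python
from collections import deque

FPS, REWIND_SECONDS = 30, 2
buffer = deque(maxlen=FPS * REWIND_SECONDS)  # ring buffer: the last 2 seconds
heavy_processed = []

def cheap_detector(frame_idx):
    # Stub: pretend a car becomes visible at frame 90.
    return frame_idx == 90

def heavy_pipeline(frames):
    # Stub for the full-featured, expensive analysis pass.
    heavy_processed.extend(frames)

for frame_idx in range(120):
    buffer.append(frame_idx)
    # Run the lightweight detector only once per second.
    if frame_idx % FPS == 0 and cheap_detector(frame_idx):
        heavy_pipeline(list(buffer))  # "rewind": replay the buffered frames
        buffer.clear()

print(len(heavy_processed))  # 60: the two seconds preceding the detection
```

The heavy pipeline thus runs only on demand, while the ring buffer guarantees it still sees the moments just before the triggering event.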

Sensor fusion

Visual information represents only a part of perception: sound, temperature, GPS coordinates, current speed and acceleration, altitude, air pressure, and other characteristics extend visual information and feed other models to build a generalized picture of the outer world. In this article, we say a technology supports sensor fusion if it allows collecting, transporting, and processing this information within the system.

Benchmarking, profiling, and monitoring

Benchmarking (performance measurement) is one of the most essential activities regularly carried out in software development, including computer vision applications. It typically involves two types of measurement: qualitative and quantitative. The former concerns how far the actual system falls short of the ground-truth data; the latter measures how “fast” the system is.
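
For the qualitative side, detection systems typically compare predictions against ground-truth annotations via intersection-over-union (IoU); metrics such as precision, recall, and mAP are built on top of it. A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes: the
    standard measure for matching detections to ground truth."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # ~0.333: half-overlapping boxes
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))     # 0.0: disjoint boxes
```

A detection is usually counted as a true positive only when its IoU with a ground-truth box exceeds a chosen threshold (0.5 is a common convention), so even the qualitative benchmark has tunable parameters that must be reported for reproducibility.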

Benchmarking is a complex activity that requires evaluating reproducible experiments, which is particularly challenging in applications that utilize runtime heuristics and self-adaptation. Depending on the scope of the framework, developers have access to different benchmarking capabilities, ranging from neural network performance estimation to glass-to-glass benchmarking.

Profiling refers to understanding the timings and frequencies of execution of specific system components, such as functions, methods, algorithms, and operation sequences. Profiling enables developers to identify the system’s hot and cold spots and plan optimizations accordingly.
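
In Python-first pipelines, even a tiny home-grown decorator can reveal hot spots before reaching for heavier tools such as cProfile or vendor profilers; a minimal sketch:

```python
import functools
import time
from collections import defaultdict

# name -> [call count, total wall-clock seconds]
call_stats = defaultdict(lambda: [0, 0.0])

def profiled(fn):
    """Minimal profiling decorator: counts calls and accumulates wall time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            stats = call_stats[fn.__name__]
            stats[0] += 1
            stats[1] += time.perf_counter() - start
    return wrapper

@profiled
def preprocess(frame):
    time.sleep(0.001)  # stand-in for real per-frame work
    return frame

for f in range(5):
    preprocess(f)

calls, total_s = call_stats["preprocess"]
print(calls)  # 5
```

Wall-clock decorators like this miss GPU-side asynchrony (a CUDA kernel launch returns before the kernel finishes), which is why GPU pipelines additionally need vendor tools such as Nsight Systems for honest timings.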

Monitoring enables the gathering of runtime information to predict system performance, recognize issues, troubleshoot, and prevent system degradation during production operations. In the next section, we discuss latency control in real-time and batched processing, and highlight monitoring as a core feature that enables system operators to manage the system’s negotiated SLA.

Depending on the technology’s purpose and scope, these aspects can be perceived as external or internal.

Latency control, real-time and batched processing

Latency is a parameter that estimates the processing delay and its fluctuations. Some systems require very low latency, while others may need only minimal latency variation. Additionally, some systems do not prioritize latency, demanding that the data be eventually processed, but without strict time constraints.

Bandwidth per computing unit is a parameter that concerns how efficiently the hardware is used. Higher bandwidth per unit means a more efficient economy and a lower operational cost.  

It is common to trade latency for bandwidth: the more latency (and latency variation) you can tolerate, the higher the achievable bandwidth. Software systems must operate under specific, business-defined SLAs for bandwidth and latency, and these parameters conflict: the lower the latency you require, the lower the bandwidth you get; the higher the latency you can afford, the higher the bandwidth you can achieve.
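
The trade-off is easy to see in a back-of-the-envelope model of batched inference (the per-batch overhead and per-frame cost below are invented numbers): larger batches amortize fixed costs and raise throughput, but every frame in a batch inherits the whole batch's processing time as latency.

```python
def pipeline_stats(batch_size, per_batch_overhead_ms=8.0, per_frame_ms=2.0):
    """Throughput and worst-case latency for a simple batched stage."""
    batch_time = per_batch_overhead_ms + per_frame_ms * batch_size
    throughput_fps = 1000.0 * batch_size / batch_time
    worst_latency_ms = batch_time  # the first frame waits for the whole batch
    return throughput_fps, worst_latency_ms

for batch in (1, 8, 32):
    fps, lat = pipeline_stats(batch)
    print(f"batch={batch:2d}  throughput={fps:6.1f} fps  latency={lat:5.1f} ms")
```

Under these illustrative numbers, going from batch 1 to batch 32 multiplies throughput by more than four while multiplying worst-case latency by seven, which is exactly the SLA negotiation described above.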

The technology can focus on real-time processing, high-bandwidth processing, or both, offering users options. There is also a soft, or semi-real-time, mode, in which the system operates in real time most of the time but occasionally falls out of real time, resulting in increased latency. Such an approach trades occasional latency spikes for economic efficiency and affordability.

Metadata visualization and video generation

Visual inspection is one of the most popular and valuable methods for observing how computer vision and video analytics work. For business or development needs, people require augmented video streams or slideshows to review the system’s operation. The technology’s ability to efficiently augment streams with bounding boxes, labels, legends, sprites, and other visual elements is crucial for real-world systems. In most situations, such augmentations are computationally expensive, and when a technology provides an efficient, expressive API for implementing them, it can be a game-changer.

Developer ecosystem

We began the article by discussing the field’s complexity, which hinders the adoption and successful development of computer vision. This concerns not only theoretical knowledge but also the hardware/software stack: not all technologies are friendly to developers; some are difficult for a reason, others are used in suboptimal ways (anti-patterns), and still others are straightforward but inefficient.

A typical deep learning researcher is proficient in Python but often struggles with C++. Fortunately, many current products are Python-first, but this is not enough: they can easily allow developers to shoot themselves in the foot by misusing hardware or adopting unwanted implementation patterns. A well-developed ecosystem provides new and experienced developers with tools, best practices, and architectural patterns that inspire them to do things right.

Tools and frameworks

Before we compare the pros and cons of the tools and frameworks that represent the computer vision landscape (in part 2), let’s take a quick look at them. Choosing them is not straightforward, because “popularity” does not necessarily mean the best fit or applicability; it can merely reflect an easy learning curve. Nevertheless, we strive to reflect expert rankings accurately and limit our candidates to those compatible with major computing platforms, such as x86 CPUs, Nvidia GPUs, and AMD GPUs.

The candidates we have selected for the comparison are:

  • OpenCV, including OpenCV CUDA;
  • PyTorch;
  • ONNX Runtime;
  • Ultralytics YOLO;
  • TensorFlow;
  • Keras;
  • NVIDIA TensorRT;
  • NVIDIA Triton;
  • NVIDIA DeepStream;
  • NVIDIA VPI;
  • NVIDIA CV-CUDA;
  • Intel OpenVINO;
  • AMD VVAS and Ryzen™ AI;
  • Savant Framework;
  • Detectron2;
  • MMCV;
  • Paddle;
  • Google MediaPipe.

Let us briefly go through the list to introduce each tool and describe its purpose. In the following sections, we discuss every technology individually.

OpenCV. This is a respected computer vision library that primarily focuses on efficient classic computer vision algorithms, while also supporting neural network inference. You can find OpenCV in literally every computer vision-related product. The library provides CPU-based algorithms and supports CUDA via the OpenCV CUDA extension. It is a must-have library in every project.

PyTorch (incl. Torchvision). As one of the most popular (and often the de facto standard) tools for neural network training, the library provides useful utilities for model inference, image pre- and post-processing, tensor manipulation, and video decoding/encoding. PyTorch is a widely used framework for building computer vision applications, particularly among deep learning researchers who use it as their preferred tool for model training. PyTorch is a universal library that supports multiple inference backends, including x86/ARM CPUs, CUDA, AMD ROCm, and other less common inference technologies.

ONNX Runtime. This is a cross-platform model inference runtime that supports various backends, ranging from CPUs to CUDA, TensorRT, ROCm, and less commonly used hardware accelerators, such as Rockchip NPUs and AMD Xilinx FPGA cards.

Ultralytics YOLO. YOLO is a common name for a zoo of neural models used to detect, classify, segment, and estimate poses. Ultralytics is a commercial company that develops a licensed toolkit for convenient training and inference of YOLO models (both open-source and custom, such as YOLOv8 and YOLO11) in a curated, easy-to-use manner. The framework is freely available for non-commercial use, but a license is required to use its code or models commercially. Ultralytics YOLO is extremely popular due to its very gentle learning curve.

TensorFlow. As a pioneer of deep learning frameworks, TensorFlow is widely adopted and offers features similar to those of PyTorch. However, new projects increasingly choose PyTorch over TensorFlow. Nevertheless, mature and legacy projects continue using it. There is also a regional aspect of popularity and adoption, according to Google Trends.

Keras. Keras is similar to TensorFlow and PyTorch, though it is not as widely used among serious researchers and is not often used in production. It has less buzz but is a beginner-friendly and full-featured technology. Essentially, Keras is a high-level API over TensorFlow, PyTorch, and JAX, with batteries included.

NVIDIA TensorRT. TensorRT is an optimization and inference runtime for NVIDIA hardware. It delivers best-in-class performance on NVIDIA hardware. TensorRT is often used as an inference backend in PyTorch, TensorFlow, ONNX Runtime, and other frameworks. Also, with TensorRT, developers can easily estimate the model’s inference performance for various batch sizes, precisions, and input sizes. 

TensorRT is a must-have technology for neural networks running on NVIDIA hardware. It is worth noting that not all models are TensorRT-compatible; depending on the model architecture, you may need to modify specific layers or use a more general inference engine (e.g., PyTorch), which may result in a performance trade-off.

NVIDIA Triton. Triton is a universal inference server that allows communication with deployed models through HTTP or gRPC. The technology addresses multiple issues, with a primary focus on inference flexibility and horizontal scalability. It is less frequently used in video-oriented computer vision but more relevant to systems that work with pictures (e.g., to implement a photo-enhancing filter or document recognition). 

Often, engineers who come to computer vision from the web service development industry misuse Triton to implement video analytics pipelines: they decode video streams into images and send them to Triton over the network. This is usually an anti-pattern that results in inefficient processing. That said, Triton integrates with DeepStream and can sometimes yield beneficial results when raw images are shared between the two via CUDA memory.

To sum up, use Triton for image processing, not video streams, when you need horizontal scalability and flexibility in model architectures, and when model processing time is significantly larger than transactional costs.

NVIDIA DeepStream. DeepStream is the most optimized and recommended technology for video analytics and computer vision on NVIDIA edge and data center platforms. This steep-learning-curve SDK can be difficult for many developers coming from the web services industry and requires advanced knowledge of GStreamer, a real-time multimedia framework. Not many people are brave enough to work with DeepStream, but the results for those who do are outstanding. 

DeepStream optimizes video and image processing by leveraging NVIDIA’s best practices, including hardware decoding, TensorRT-enabled inference, Triton integration, and hardware encoding. Most importantly, it enables correct memory usage on CUDA, thereby decreasing transactional costs. Nevertheless, DeepStream is a complex, low-level SDK, so when developers need a more affordable solution, they use Savant (see later), which is built on DeepStream and hides its complexity while providing numerous advanced features.

NVIDIA VPI. This is a workhorse library similar to OpenCV but optimized for both CUDA and the specialized acceleration hardware available on the Jetson platform. Although it provides fewer features and is less popular, it is better optimized for the hardware and is officially supported by NVIDIA.

In OpenCV, the CPU is the primary computing resource, while CUDA is supported on a best-effort basis. For VPI, it is the opposite: CUDA and Jetson ASICs, such as PVA and VIC, are the primary compute resources, while the CPU is a fallback used when the former cannot be employed efficiently.

NVIDIA CV-CUDA. CV-CUDA is a library similar to VPI and OpenCV, but mainly optimized for data center use with a focus on batched processing. In this respect, it differs from OpenCV and VPI, which are oriented toward individual image processing. NVIDIA frequently ships concurrent, overlapping implementations of the same functionality targeting slightly different purposes; it takes significant time and expertise to explore this landscape, evaluate it, and select the solution that best fits a particular need.

Intel OpenVINO. OpenVINO is Intel’s framework optimized for inference, image, and video processing on Intel hardware. It provides a model converter, an inference runtime, a post-training optimizer, a model zoo, and a whole bunch of other components. OpenVINO targets Intel CPUs and other Intel accelerators, such as Myriad and Intel Xe, making it a relatively niche AI solution designed to run on Intel hardware. The ecosystem is large and advanced, but the problem lies with the hardware: Intel often cancels products that are not based on x86 CPUs, disappointing developers’ expectations.

Overall, the industry perceives Intel as a company rolling down the hill, losing the markets where it dominated, which makes OpenVINO an exotic option for most developers. We will not compare Intel OpenVINO in the following sections because it is not a first-class citizen in the domain under discussion.

AMD VVAS and Ryzen™ AI. AMD is NVIDIA’s rival, seeking to increase its market share in AI. It aggressively acquires startups and expands its portfolio to compete with NVIDIA. Such aggressive growth leads to a complex ecosystem with numerous competing products that overlap in functionality. It is an unfortunate situation: AMD produces versatile, high-quality hardware, but software is not its strong suit. Several years ago, AMD acquired Xilinx and released VVAS, an ecosystem that works on Xilinx hardware; the VVAS repository on GitHub was last updated in 2023. Ryzen™ AI, on the other hand, looks more promising because the technology is available on AMD-powered laptops via an NPU colocated with the main CPU. Thus, when properly positioned, it can facilitate edge-based computer vision.

Overall, unlike Intel, AMD has had greater success with its hardware but lacks quality software and developer friendliness. As with OpenVINO, we will not compare AMD products in the following sections because they are not yet mature in the domain discussed here.

Savant Framework. Savant is a high-level framework standing on the shoulders of NVIDIA DeepStream. It focuses on real-time, semi-real-time, and non-real-time applications that process dynamic live video streams (RTSP, L4T, GigE) and video files. As a framework, it is not something your code calls (like a library); instead, it calls your code or declarative elements that define the behavior required by an application.

It was developed to be as powerful as DeepStream but easier to use, without the steep learning curve and with batteries included. Savant is a Python-first framework tailored for computer vision and video analytics application development, featuring ready-to-use components for glass-to-glass applications running on edge and data center NVIDIA hardware. The framework provides a friendly API that enables the development of reliable, production-ready applications in days, not weeks.
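
The inversion of control described above ("it calls your code") can be illustrated with a generic sketch (this is not the actual Savant API, just the shape of the pattern): the framework owns the processing loop and invokes user-defined handlers per frame.

```python
class PipelineElement:
    """Base class for user-defined processing elements (hypothetical)."""
    def on_frame(self, frame_meta):
        raise NotImplementedError

class CarCounter(PipelineElement):
    """User code: the framework calls on_frame for every frame's metadata."""
    def __init__(self):
        self.count = 0

    def on_frame(self, frame_meta):
        self.count += sum(1 for obj in frame_meta["objects"] if obj == "car")

def run_pipeline(element, frames):
    """Stand-in for the framework's internal loop (decode, infer, dispatch)."""
    for meta in frames:
        element.on_frame(meta)

counter = CarCounter()
run_pipeline(counter, [{"objects": ["car", "person"]}, {"objects": ["car"]}])
print(counter.count)  # 2
```

The benefit of this design is that decoding, batching, inference, and GPU memory management stay inside the framework, so user code only ever touches metadata and never has to get those hard parts right itself.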

Detectron2. This project is a library containing Facebook’s cutting-edge object detection, pose estimation, and instance segmentation models. The framework is quite popular among individuals who require a higher-level API for detection and segmentation (both training and serving) and who are not willing to deal with the complexity of PyTorch. Often, Detectron2 is associated with panoptic scene perception. Nevertheless, under the hood, it utilizes PyTorch to perform operations.

MMCV. This project is a high-level computer vision library built on PyTorch that simplifies PyTorch for specific machine learning domains. The library focuses on inference and training, but not pipelines or applications. According to GitHub repository activity, the project is in a maintenance state, with the latest update dating back nine months (as of this writing) and numerous unmerged pull requests and open issues.

Paddle (or PaddlePaddle). The name is an abbreviation of Parallel Distributed Deep Learning. Paddle is similar to PyTorch in many ways, and many of its approaches are inspired by PyTorch. Frequently, Paddle is chosen when its ecosystem provides models suitable for solving specific problems, such as optical character recognition. Paddle is widely used in East Asia, particularly in China, because its primary developer is Baidu. However, compared to PyTorch, Paddle is not a sizeable competitor.

Google MediaPipe. MediaPipe is a computer vision framework from Google that provides a graph-based processing architecture and prebuilt ML solutions for face and gesture recognition, pose estimation, and object detection. The framework covers popular platforms such as Android, iOS, desktop, web, and embedded. It is primarily known for ML-powered real-time tracking (like Google’s 3D hand tracking demo). Most people use MediaPipe for its Solutions: ready-to-use pipelines solving specific problems.

Summary

In this article, we outlined the major problems users face in the computer vision industry when building video analytics applications. The major takeaways are: 

  • CV/ML is a multidisciplinary field that requires specialized, niche knowledge and erudition.
  • Many highly optimized, narrow technologies solve particular problems very efficiently but have steep learning curves.
  • Various technologies focus either on core functionality or on the entire stack, from hardware to glass-to-glass applications.
  • Easy-to-use approaches are typically inefficient or suboptimal because they trade efficiency for simplicity.
  • Having a background in other programming areas often leads to the misuse of technology when developers try to apply their current knowledge to the domain.

Choosing the right stack for an efficient and reliable application architecture is a sophisticated task: many tools provide functionality that looks similar at first glance, but detailed exploration and benchmarking quickly shorten the list by eliminating many promising candidates.

In the following article, we will explore how each of the discussed products maps to the problem landscape and dive deeper into their pros and cons.