Approaches to Overcoming Real-Time Video Analytics Challenges

Real-time video analytics is a hot topic today. Needless to say, visual information is essential for us: it is estimated that about 80–90% of the information we receive comes through our sense of sight. Moreover, the human visual system is remarkably efficient at processing a vast amount of information; our visual processing centers are highly adept at interpreting and organizing visual information, allowing us to make sense of the world around us quickly and efficiently.

Article repost

The article was originally published on betterprogramming.pub – a Medium publication. Because the original article is behind a paywall, we decided to repost it on the new blog to make it available to a broader audience.

It is hard to imagine a world where we do not rely so heavily on visual information. Thus, following our behavioral patterns, we develop technologies that coexist with us in our world and must recognize phenomena the same way we do. As a result, computer vision plays a significant role in building intelligent systems that augment our capabilities. Often those systems must react to events quickly, which motivates us to process visual data in real time with low latency.

Such systems rely on specially designed live visual data sources that rapidly deliver video frames to video analysis systems, handling the data with dedicated ASICs. This is an efficient model for IP/USB/GigE cameras because they have no purpose other than producing frames. Of course, vendors are trying to integrate various AI functions into these devices, but we will ignore that because those functions are limited and cover only typical, universal patterns.

When an image is delivered to a real-time computer vision system, it must be processed to produce knowledge. What does ‘real time’ mean within the context of video analytics? It is a broad term; let us dive into it to understand its variations better.

Processing Time Limit and Delay

It makes sense to start with FPS (frames per second), a major factor defining real-time behavior. FPS determines the end-to-end performance characteristics of a pipeline. In short, it specifies how frequently we receive a picture from the camera, analyze it, and deliver the results to interested parties.

Typical FPS values for video streams are between 24 and 30 FPS. The processes happening in such videos look continuous and smooth to our brains, but sometimes rapidly developing processes make us increase the FPS rate dramatically: to 200 FPS and higher.

The FPS value defines the quantum of time a computer vision system can afford to spend on a frame. So, for 30 FPS, the system has roughly 33 ms to process the image: exceeding that budget causes adverse effects like skipping or corruption of the following frames. Besides processing time, there is another factor crucial for real-time processing: delay, which determines whether the information reaches the receiver in time. It is worth noting that data sources, protocols, and codecs greatly influence the delay.
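For illustration, here is a minimal sketch of how that per-frame budget can be derived and checked; process_frame is a hypothetical placeholder for an actual pipeline stage:

import time

FPS = 30
BUDGET = 1.0 / FPS  # ~33 ms per frame at 30 FPS

def process_frame(frame):
    # hypothetical stand-in for detection, tracking, postprocessing, etc.
    time.sleep(0.02)

start = time.monotonic()
process_frame(None)
elapsed = time.monotonic() - start
if elapsed > BUDGET:
    print(f"over budget: {elapsed * 1000:.1f} ms > {BUDGET * 1000:.1f} ms")
else:
    print(f"within budget: {elapsed * 1000:.1f} ms")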

For example, when an H264 frame is delivered to the Nvidia decoder (NVDEC), it delays processing until 3 frames are buffered, so for 30 FPS this adds roughly 100 ms of delay (3 × 33 ms). At the same time, MJPEG, YUYV, and RGB streams require no such buffering, reducing this delay to zero.

Types of Video Analytics Systems

We can classify computer vision systems based on real-time requirements:

  • a low-load real-time system;
  • a high-load almost-real-time system;
  • a high-load non-real-time system.

Each class has its challenges and approaches to surmount them. In the following passages, we will explore every type in detail and develop architectural proposals based on expectations.

Low-Load Real-Time System

When designing such a system, we must ensure it handles data within the allowed timeframe (e.g., 33 ms for 30 FPS) even in the worst operational case. Defining that worst case is the million-dollar question. Let us enumerate the major factors leading to performance degradation:

  1. more objects in a scene than expected;
  2. poorly defined operational constraints;
  3. overheating.

Let us discuss each of them in order.

A frequent situation is when an unexpected number of objects appears in the viewport. This may dramatically affect postprocessing and multi-object tracker performance because the related algorithms often have quadratic complexity. As a result, when many objects are analyzed, processing may take longer than allowed. You must benchmark your pipeline against corner cases with many objects to ensure it can handle them.
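To see why the cost grows quadratically, consider a naive pairwise IoU matching step of the kind many trackers and postprocessing routines rely on (a minimal sketch; the function names are illustrative): with N tracks and M detections, every frame requires N × M comparisons, so doubling the number of objects roughly quadruples the work.

def iou(a, b):
    # boxes given as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_cost_matrix(tracks, detections):
    # N tracks x M detections -> N * M IoU computations per frame
    return [[iou(t, d) for d in detections] for t in tracks]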

Poorly defined operational constraints may cause unexpected behavior of the algorithms, leading to situations similar to the one described above with a large number of objects. For example, a dark scene, a rainy environment, dirt on the camera lens, or even vibration can make the model produce many phantom objects if the model is unaware of such cases. To overcome such problems, you must validate the operational correctness of the pipeline under development in various real-life situations.

Please do not underestimate throttling. It may happen for various reasons: long-term operation even in a typical environment, increased ambient temperature, overuse of computational resources due to intensive processing (point 1 above), or changed operational conditions (point 2 above).

Let us observe several cases from practice:

  • An edge device was tested by developers in a lab at an average operational temperature. When deployed in a bus operating in a hot environment, it demonstrated significant performance degradation and unstable functioning.
  • After several minutes of operation, the Nvidia T4 accelerator under load installed in a server with insufficient airflow demonstrated a significant performance decline.
  • An industrial GigE vision cam started to produce frames at slower FPS and finally hung because its CPU overheated without proper cooling.
  • An edge device receiving raw frames from an industrial GigE vision cam experienced throttling because the conversion between color formats caused the CPU to overheat after a while.

Practical Situation

Once, we used an Nvidia Tesla T4 installed in a liquid-cooled server provided by an innovative cloud provider. It demonstrated outstanding, predictable performance. However, the same card installed in a customer’s server showed significantly lower performance that declined over time because of throttling.

We have discussed the questions related to processing time, but what about the delay? This aspect is less multifaceted: the rule of thumb is to avoid application-level protocols that require waiting for a reply and those that are session-based.

Let us discuss the H264 video-encoding standard. Unlike MJPEG, frames in H264 are not self-sufficient: to decode a specific frame, you need information about keyframes, and there is a variety of such co-dependent frames (I-frames, P-frames, and B-frames). So, to decode such a stream, the decoder needs to keep up with the session; otherwise, multiple frames may be dropped until the next I-frame arrives. The situation becomes even worse when the stream is encoded with B-frames. The technology therefore definitely introduces a delay in processing.

On the other hand, MJPEG encodes every frame separately, so the decoder doesn’t need to wait for anything; decoding can start immediately.

Now, let us consider the interaction with data sink systems. The critical property of video streams is the cause-and-effect relationship between frames: you expect analytics results to be correctly ordered by processing timestamps. This is crucial when tracking technology is used.

However, this creates a tension: on the one hand, we want to avoid session-based technologies to prevent pipeline blocking; on the other hand, we need session-like ordering to preserve the cause-and-effect relationship. Because of this tension, it is essential to select the sink system properly: the best choices are low-latency queueing systems like MQTT, NATS, ZeroMQ, and Redis-like systems. Interacting with systems that can introduce unexpected latency, like MySQL, MongoDB, and even Apache Kafka, is unsafe.
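For example, publishing per-frame results to an MQTT broker keeps the pipeline from blocking on the sink. Below is a minimal sketch assuming the paho-mqtt package and a broker at localhost; the topic name and payload schema are illustrative:

import json
import time

import paho.mqtt.client as mqtt

client = mqtt.Client()  # paho-mqtt 1.x constructor; 2.x also requires a CallbackAPIVersion argument
client.connect("localhost", 1883)
client.loop_start()  # network I/O runs in a background thread, not in the pipeline

def publish_results(frame_pts, objects):
    # keep the processing timestamp so the receiver can restore
    # the cause-and-effect order of frames
    payload = json.dumps({"pts": frame_pts, "objects": objects})
    client.publish("analytics/frames", payload, qos=0)

publish_results(time.time(), [{"label": "person", "bbox": [10, 20, 110, 220]}])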

To summarize, when building a “Strict Real-Time System,” keep it “low-load”: provide a significant margin of spare resources to handle unexpected bursts, and test the pipeline in all possible conditions over an extended period of operation to ensure that overheating does not occur. Also, develop heuristics that help the pipeline endure unexpected situations without losing the system’s real-time capabilities. Finally, to decrease delays, avoid application-level session-based technologies like H264 and communication with systems prone to being slow (RDBMS, OLAP systems, high-load queue brokers). Such systems are challenging to validate and test in various working conditions, but they are not sophisticated from the perspective of data flow processing.

High-Load Near-Real-Time System

In contrast to the systems discussed in the previous section, there is a widespread class of pipelines that do their best to be real-time but utilize the hardware fully and must handle short-term overloads during which they are not real-time for a while. The design of such systems is often dictated by economic limitations or by predictable working conditions. They suffer from all the problems mentioned above to a greater degree but, when appropriately designed, have the means to overcome them.

Let us discuss why their capabilities are affected more than those of systems discussed previously. The answer is simple: they don’t have reserved capacity because the users want to utilize them to maximize their investment return. On the other hand, when appropriately designed, they tackle overloads predictably.

Before we proceed to understand architectural traits, let us mention that their processing time and delay fluctuate in broader ranges compared to those of a low-load real-time system.

For example, you may expect the delay to reach 700–2000 ms for a 30 FPS RTSP stream (500–1500 ms for RTSP access, 99 ms for H264 buffering and decoding, 33 ms for data processing, 10–50 ms for interaction with 3rd-party components, plus other operational delays). Such numbers can make the system unsuitable for use cases that require an instant reaction.

Besides, the processing time can also fluctuate over a broader range: even if it fits into 33 ms (for a 30 FPS stream) under normal operational conditions, it may grow accordingly when the conditions change for the reasons listed in the previous section (throttling, a rainy or dark environment, etc.).

The delay is mostly a business-related parameter that determines how long data recipients can wait for data. In contrast, the processing time is a technical parameter that affects software architecture. It is crucial to have internal queues that buffer video frames when the system operates under an unexpectedly high load. If the buffers are not large enough, the data are dropped or corrupted: in the case of an RTSP stream, it may cause all the frames between keyframes to be corrupted.
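A minimal sketch of such an internal buffer in plain Python (the size and the drop-oldest policy are illustrative choices that must be tuned per pipeline):

from queue import Queue, Full, Empty

# bounded buffer between the capture code and the processing code
buffer = Queue(maxsize=100)

def enqueue_frame(frame):
    # single-producer scenario: when the buffer is full, drop the oldest
    # frame to bound the delay, then store the newest one
    try:
        buffer.put_nowait(frame)
    except Full:
        try:
            buffer.get_nowait()
        except Empty:
            pass
        buffer.put_nowait(frame)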

Architecture for a High-Load Near-Real-Time System

Let us begin by understanding the problem: the pipeline must keep up with the video source’s FPS. Specific video sources have particular rules for working with them. Moreover, skipping frames or capturing them late degrades analytics quality and causes technical problems. Let us consider two widespread sources to appreciate the problem: a USB cam and an RTSP stream.

The following example shows slow processing for frames captured from a USB webcam. I artificially added a pause of 1 second between frame captures, resulting in a massive delay of 4–5 seconds between what happens in the viewport and what is captured from the webcam.

import cv2
import time

cap = cv2.VideoCapture("/dev/video0")
fps = int(cap.get(cv2.CAP_PROP_FPS))
print("fps:", fps)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    print("Captured")
    # simulate slow processing; meanwhile the driver keeps buffering frames
    time.sleep(1)

    cv2.imshow('frame', frame)

    # exit on ESC
    k = cv2.waitKey(1)
    if k == 27:
        break

cap.release()
cv2.destroyAllWindows()

If you expected the code to capture the current frame every second while uncaptured frames are dropped, you would be disappointed: it doesn’t work that way. The captured frames lag by 4–5 seconds. So, the only way to capture and process synchronously is to configure a lower FPS, as shown below. Depending on the USB cam implementation, frames may also be delivered corrupted.
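If the capture device supports it, a lower frame rate can be requested directly (a minimal sketch; whether the setting is honored depends on the camera and driver):

import cv2

cap = cv2.VideoCapture("/dev/video0")
# request a lower frame rate from the camera
cap.set(cv2.CAP_PROP_FPS, 10)
# the driver may silently ignore the request, so check the actual value
print("actual fps:", cap.get(cv2.CAP_PROP_FPS))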

You will be punished more severely if you try the same trick with an H264/HEVC RTSP stream: as RTSP delivers ordered, co-dependent frames, you must keep up with the stream. Otherwise, you will end up with corrupted frames with unrecoverable artifacts.

What does the right solution look like? Well, the only acceptable choice is to decouple capturing and processing. It can be done with threads and a shared buffer. I will demonstrate it with a simple example in Python:

import threading
import time

from queue import Queue
from random import randrange

import cv2

# shared buffer decoupling the capture loop from the processing thread
queue = Queue()
cap = cv2.VideoCapture("/dev/video0")

def worker():
    # processing thread: consumes frames at its own (fluctuating) pace
    while True:
        # simulate a variable processing time of 0-54 ms per frame
        delay = randrange(0, 55)
        time.sleep(float(delay) / 1000.0)
        f = queue.get()
        if f is None:
            # sentinel value: stop processing
            break
        cv2.imshow('frame', f)

worker_h = threading.Thread(target=worker)
worker_h.start()

# capture loop: keeps up with the camera FPS regardless of processing speed
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    queue.put(frame)
    print(f"Queue size: {queue.qsize()}")
    k = cv2.waitKey(1)
    if k == 27:  # ESC pressed
        break

# signal the worker to stop and wait for it to finish
queue.put(None)
worker_h.join()

cap.release()
cv2.destroyAllWindows()

The thread and the queue decouple the processing pipeline from the capturing code. This is a crucial part of every pipeline expected to operate as a High-Load Near-Real-Time System.

About GStreamer

GStreamer is a framework aimed at building real-time multimedia pipelines. To address the above-mentioned issue, it provides the queue element, which creates a thread boundary: the elements upstream and downstream of the queue run in separate threads, decoupling them.

Queues, as decoupling elements, require a data management policy tailored to the pipeline’s expected workload. Several aspects of queues need to be addressed:

  • how large is the queue in terms of stored elements;
  • how large is the queue in terms of stored bytes;
  • what to do with new elements when the queue is full;
  • what to do with queued elements when the queue is full.

Decisions regarding these aspects define data consistency, delay, and the workload adaptation range. If you loosen these parameters too far, you may end up with a high-load non-real-time system, which we discuss in the following section.
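As a minimal sketch of how these policies map onto the queue element’s properties (assuming PyGObject with the GStreamer Python bindings; the source and sink elements are illustrative):

import gi

gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# leaky=downstream drops the oldest buffered frames when the queue is full,
# bounding the delay instead of blocking the upstream capture thread
pipeline = Gst.parse_launch(
    "v4l2src device=/dev/video0 ! videoconvert ! "
    "queue max-size-buffers=30 max-size-bytes=0 max-size-time=0 leaky=downstream ! "
    "autovideosink"
)
pipeline.set_state(Gst.State.PLAYING)

bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)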

High-Load Non-Real-Time System

Although the article is about (near-)real-time video analytics pipelines, I would also like to discuss non-real-time systems. By expanding the queues from in-process buffers to persistent distributed queue brokers, you get a high-load non-real-time system. Such a solution has other challenges to overcome, connected with scaling and with recovering processing state after failures.

Obviously, users don’t expect such systems to be low-latency solutions. Nevertheless, non-real-time video analytics systems are convenient for efficiently processing large volumes of data while utilizing the hardware most economically. For example, a system that accounts for traffic may have a peak load during the day while sitting almost idle at night. Dropping the requirement for real-time operation makes it possible to cut the hardware by a factor of two or three: the traffic generated during work hours is processed 24×7 rather than, let us say, 12×7, as a real-time system would have to do.

In such systems, data sources usually write the data to systems like Apache Kafka, S3, HDFS, or file chunks, which are later processed by a video analytics core at scale.
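For example, an edge capture service might push encoded frames into a broker for later batch processing (a minimal sketch assuming the kafka-python package; the topic name and JPEG encoding are illustrative choices):

import cv2
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
cap = cv2.VideoCapture("/dev/video0")

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # JPEG-encode the frame and hand it to the broker; the analytics core
    # consumes the topic later, at its own pace
    ok, jpeg = cv2.imencode(".jpg", frame)
    if ok:
        producer.send("raw-frames", jpeg.tobytes())

producer.flush()
cap.release()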

Real-life system architects mix and match these properties to address user requirements. For example, the edge part of the system can be strict real-time, enabling a quick response but only basic analytical capabilities; conversely, the core part can be designed as a non-real-time system providing sophisticated insights efficiently.

Conclusion

Real-time video analytics systems face several challenges, which sometimes take significant effort to solve properly. Depending on the requirements, you may need to build a hybrid, sophisticated architecture to fulfill seemingly contradictory customer needs. The more you decouple sources and sinks from the processing pipeline, the more scalable the system becomes: it gains additional delays but acquires properties crucial for working at scale.

To address the problems discussed, we created Savant: an open-source Python framework on top of Nvidia DeepStream that allows deep-learning engineers to effortlessly build pipelines with excellent production properties.