Pipeline Observabilty With Prometheus And Grafana

Monitoring as a part of software observability is crucial for understanding the state and health of the system. Video analytical and computer vision pipelines also benefit from monitoring, allowing SRE engineers to understand and predict system operation and reason about problems based on anomalies and deviations.

Computer vision pipelines represent complex software working in a wild environment, requiring continuous observation to understand trends and correlations between internal and external factors.

To solve that problem, we integrated Prometheus metric exporter in Savant. The exporter allows for exporting two kinds of metrics:

System metrics collected from Savant components;
User-defined metrics collected from Python functions.

Observability in Savant
Savant supports not only metrics but also OpenTelemetry traces allowing to trace every frame passing through the pipeline. Traces help to instrument the system very precisily which is not requried for monitoring needs because it requires a lot of effort to transform tracing data into metrics.

To overcome the situation we implemented metrics showing aggregate performance counters and gauges for a running pipeline.

Currently, metrics are exported in the Prometheus format. However, we intend to modify the exported to support OpenTelemetry when Prometheus supports it (expected to be in the 3.0 release).

System Metrics

They are collected internally for every pipeline stage, including those not managed by the user. Every stage counts the number of processed frames, batches, their metadata objects, and the current queue length. Those metrics are automatically exported to Prometheus when the telemetry.metrics section is configured.

params:
  telemetry:
    metrics:
      # Output stats after every N frames
      frame_period: ${oc.decode:${oc.env:METRICS_FRAME_PERIOD, 10000}}
      # Output stats after every N seconds
      time_period: ${oc.decode:${oc.env:METRICS_TIME_PERIOD, null}}
      # How many last stats to keep in the memory
      history: ${oc.decode:${oc.env:METRICS_HISTORY, 100}}
      # Metrics provider name
      provider: ${oc.decode:${oc.env:METRICS_PROVIDER, null}}
      # Parameters for metrics provider
      provider_params: ${json:${oc.env:METRICS_PROVIDER_PARAMS, null}}

Currently, the only supported provider is Prometheus. The params.telemetry.metrics.provider_params allows setting port and labels attributes:

params:
  telemetry:
    metrics:
      frame_period: 1000
      time_period: 1
      history: 100
      provider: prometheus
      provider_params:
        port: 8000
        labels:
          module_type: detector

The port defines the endpoint for the Prometheus collector to connect to. The labels allow custom labels to be set for all gauges and counters.

User-defined Metrics

As well as system-level metrics, Savant allows developers to collect their own metrics in pyfuncs. We support only two kinds of them: a counter and a gauge. Developers can use the following syntax to implement the functionality:

"""Example of how to use metrics in PyFunc."""
from savant.deepstream.meta.frame import NvDsFrameMeta
from savant.deepstream.pyfunc import NvDsPyFuncPlugin
from savant.gstreamer import Gst
from savant.metrics import Counter, Gauge


class PyFuncMetricsExample(NvDsPyFuncPlugin):
    """Example of how to use metrics in PyFunc.

    Metrics values example:

    .. code-block:: text

        # HELP frames_per_source_total Number of processed frames per source
        # TYPE frames_per_source_total counter
        frames_per_source_total{module_stage="tracker",source_id="city-traffic"} 748.0 1700803467794
        # HELP total_queue_length The total queue length for the pipeline
        # TYPE total_queue_length gauge
        total_queue_length{module_stage="tracker",source_id="city-traffic"} 36.0 1700803467794

    Note: the "module_stage" label is configured in docker-compose file and added to all metrics.
    """

    # Called when the new source is added
    def on_source_add(self, source_id: str):
        # Check if the metric is not registered yet
        if 'frames_per_source' not in self.metrics:
            # Register the counter metric
            self.metrics['frames_per_source'] = Counter(
                name='frames_per_source',
                description='Number of processed frames per source',
                # Labels are optional, by default there are no labels
                labelnames=('source_id',),
            )
            self.logger.info('Registered metric: %s', 'frames_per_source')
        if 'total_queue_length' not in self.metrics:
            # Register the gauge metric
            self.metrics['total_queue_length'] = Gauge(
                name='total_queue_length',
                description='The total queue length for the pipeline',
                # There are no labels for this metric
            )
            self.logger.info('Registered metric: %s', 'total_queue_length')

    def process_frame(self, buffer: Gst.Buffer, frame_meta: NvDsFrameMeta):
        # Count the frame for this source
        self.metrics['frames_per_source'].inc(
            # 1,  # Default increment value
            # Labels should be a tuple and must match the labelnames
            labels=(frame_meta.source_id,),
        )
        try:
            last_runtime_metric = self.get_runtime_metrics(1)[0]
            queue_length = sum(
                stage.queue_length for stage in last_runtime_metric.stage_stats
            )
        except IndexError:
            queue_length = 0

        # Set the total queue length for this source
        self.metrics['total_queue_length'].set(
            queue_length,  # The new gauge value
            # There are no labels for this metric
        )

The metric exporter exports current metric values according to its schedule, not every set value.

End-To-End Sample

We have created a complete demonstration showing how to use metrics. You can find it in the Savant repo by the following link. The sample provides a ready-to-use Grafana dashboard displaying all system metrics.

Links

Interested in diving deeper into observability? Investigate how OpenTelemetry traces can be used in Savant for development and precise instrumenting.

Don’t forget to subscribe to our X to receive updates on Savant. Also, we have Discord, where we help users.

To know more about Savant, read our article: Ten reasons to consider Savant for your computer vision project.