Add OpenTelemetry support for distributed tracing (!708) · Merge requests · tango-controls / pytango

Anton Joubert requested to merge 582-telemetry-support into develop May 15, 2024

Overview

Support distributed tracing of server and client methods, using OpenTelemetry and cppTango v10's new features (part of IDL v6).

The implementation handles the following configurations:

Telemetry support not compiled into cppTango: no telemetry, but dummy functions are available with the same API.
Telemetry support compiled in:
- Python OpenTelemetry API and SDK dependencies installed: full functionality.
- Python OpenTelemetry API dependency installed, but not the SDK dependency: partial functionality. Function calls propagate telemetry information, but no traces are emitted (the tracing backend will show missing traces).
- Python OpenTelemetry dependencies not installed: no telemetry, but dummy functions available for same API.

For servers, devices defined with the high-level API are traced. Devices defined with the low-level API are not traced by PyTango.

For client access, DeviceProxy, AttributeProxy, Group and Database access can be traced.

The new telemetry dependencies are optional. If they are not available, the tracing functionality is disabled.

Implementation details

A new function and a new class are added to allow exchanging OpenTelemetry context between cppTango's C++ context and PyTango's Python context. These are in a module called _telemetry - the leading underscore indicates that they are intended for PyTango's internal use only.

Client access uses TraceContextScope to pass Python context to C++. Device servers use get_trace_context to pass C++ context to Python.

At the Device implementation level, we expose 6 functions from cppTango for enabling telemetry, disabling it, and checking whether it is enabled (for both overall device-level tracing and kernel-level tracing).

Device also gets some new methods to create the OpenTelemetry tracer provider and tracer. These are public, and can be overridden if the user needs specialisation. Each device also gets its own OpenTelemetry tracer, which is used to emit traces belonging to that device class and instance.

For client access, there is a single client tracer. This is used if client access (DeviceProxy, AttributeProxy, Group, Database) is executed outside of a Tango device, e.g., when running from a Python script or interactive terminal.

When doing client access from within a Tango device (e.g., inside a command handler) the traces are associated with the device's tracer instead.
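
One way to picture this selection is a context variable that holds the active device's tracer and falls back to the client tracer when no device handler is running. The sketch below is an assumption made for illustration, not PyTango's implementation: the names are hypothetical and plain strings stand in for tracer objects.

```python
import contextvars

# Hypothetical stand-ins: one process-wide client tracer, plus a context
# variable holding the tracer of the device whose handler is running.
CLIENT_TRACER = "pytango.client"
_device_tracer = contextvars.ContextVar("device_tracer", default=None)


def current_tracer():
    """Return the device tracer inside a device handler, else the client tracer."""
    tracer = _device_tracer.get()
    return tracer if tracer is not None else CLIENT_TRACER


def run_in_device(tracer, handler):
    """Run `handler` with `tracer` active, restoring the previous value afterwards."""
    token = _device_tracer.set(tracer)
    try:
        return handler()
    finally:
        _device_tracer.reset(token)
```

Outside any handler, current_tracer() returns the client tracer; inside run_in_device("Leader", ...) it returns "Leader".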

All client methods that use green mode access will also emit distributed traces.

As telemetry support is a compile-time option for cppTango, we need to do the same in PyTango. This can be checked via the constant tango.constants.TELEMETRY_SUPPORTED. If telemetry support is not compiled in, the telemetry-related functions are still available, but are replaced by no-ops/dummy methods. This means user code doesn't have to check if telemetry is available, unless they want to.

Also fix the argument list of __EnsureOmniThread__exit__ and make it return False, to better match what is expected of a context manager's exit method.
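
For reference, Python's context manager protocol calls the exit method with three arguments (exception type, value and traceback), and a falsy return value lets any exception propagate. The Guard class below is a generic illustration of that protocol, not the EnsureOmniThread code itself.

```python
class Guard:
    """Minimal context manager illustrating the standard protocol."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        # Returning False (or None) lets any exception propagate to the caller.
        return False


def demo():
    """Show that an exception raised inside the block is not suppressed."""
    guard = Guard()
    try:
        with guard:
            raise ValueError("boom")
    except ValueError:
        guard.events.append("propagated")
    return guard.events
```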

Limitations

For devices using asyncio and gevent green modes, some spans within command and attribute handlers will be incorrectly associated with the standard client tracer instead of the device's tracer.

Environment variables

Tango has a number of new environment variables related to telemetry. Of these, PyTango also uses TANGO_TELEMETRY_ENABLE, TANGO_TELEMETRY_TRACES_EXPORTER and TANGO_TELEMETRY_TRACES_ENDPOINT.

PyTango also has some of its own:

PYTANGO_TELEMETRY_SPAN_PROCESSOR_TYPE
- This allows the type of span processor to be overridden. Options are simple for the SimpleSpanProcessor, or batch for the BatchSpanProcessor. If not defined, the default behaviour is to use batch, unless console output is selected, in which case simple is used.
PYTANGO_TELEMETRY_CLIENT_SERVICE_NAME
- The client tracer's logical name (see service.name) is "pytango.client" by default. Users can provide their own name using this variable.
PYTANGO_DISABLE_TELEMETRY_PATCHING
- PyTango needs to patch the device and client methods to enable tracing. This patching can be disabled by setting the variable to on (cppTango will still process telemetry as normal, unaware of the Python code).
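
The defaults described above could be resolved roughly as follows. This is a sketch under stated assumptions rather than PyTango's code: the function names are hypothetical, "console output" is assumed to mean TANGO_TELEMETRY_TRACES_EXPORTER=console, and the case-insensitive matching and the error on unknown values are choices made for this example.

```python
import os


def span_processor_type(env=None):
    """Return 'simple' or 'batch' following the rules described above."""
    env = os.environ if env is None else env
    override = env.get("PYTANGO_TELEMETRY_SPAN_PROCESSOR_TYPE", "").strip().lower()
    if override:
        if override not in ("simple", "batch"):
            # Assumption: reject unknown values instead of guessing.
            raise ValueError(f"unknown span processor type: {override!r}")
        return override
    # Assumption: console output is selected via the traces exporter variable.
    exporter = env.get("TANGO_TELEMETRY_TRACES_EXPORTER", "").strip().lower()
    return "simple" if exporter == "console" else "batch"


def client_service_name(env=None):
    """Return the client tracer's service name, defaulting to 'pytango.client'."""
    env = os.environ if env is None else env
    return env.get("PYTANGO_TELEMETRY_CLIENT_SERVICE_NAME") or "pytango.client"


def patching_disabled(env=None):
    """Return True if the patching of device and client methods is switched off."""
    env = os.environ if env is None else env
    return env.get("PYTANGO_DISABLE_TELEMETRY_PATCHING", "").strip().lower() == "on"
```

Passing a plain dict instead of os.environ makes the rules easy to try out, e.g. span_processor_type({"TANGO_TELEMETRY_TRACES_EXPORTER": "console"}).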

TODO

Add some tests
Use TANGO_TELEMETRY_TRACES_ENDPOINT env var when instantiating exporters
Remove debug print messages
Wait for cppTango changes to be merged, and update CI to run our tests
Update pytango-builder Docker image with new cppTango changes

Other

This is based on an unmerged branch on cppTango, trace_context_propagation#1185: cppTango!1197 (merged)

Closes #582 (closed)

Example of traces from the prototyping example, viewed with SigNoz. Different colours represent different "services":

purple for the top-level Python script (my.app)
orange for client calls from that script (pytango.client)
light blue for the Leader device (Leader)
dark blue for the Follower devices (Follower)

Env vars: TANGO_TELEMETRY_ENABLE=on, TANGO_TELEMETRY_TRACES_EXPORTER=grpc, TANGO_TELEMETRY_LOGS_EXPORTER=grpc.

Edited Jun 17, 2024 by Anton Joubert
