Add OpenTelemetry support for distributed tracing
Overview
Support distributed tracing of server and client methods, using OpenTelemetry and cppTango v10's new features (part of IDL v6).
The implementation handles the following configurations:
- Telemetry support not compiled into cppTango: no telemetry, but dummy functions are available with the same API.
- Telemetry support compiled in:
  - Python OpenTelemetry API and SDK dependencies installed: full functionality.
  - Python OpenTelemetry API dependency installed, but not the SDK dependency: partial functionality. Function calls propagate telemetry information, but no traces are emitted (the tracing backend will show missing traces).
  - Python OpenTelemetry dependencies not installed: no telemetry, but dummy functions are available with the same API.
For servers, devices defined with the high-level API are traced. Devices defined with the low-level API are not traced by PyTango.
For client access, DeviceProxy, AttributeProxy, Group and Database access can be traced.
The new telemetry dependencies are optional. If they are not available, the tracing functionality is disabled.
Implementation details
A new function and a new class are added to allow exchanging OpenTelemetry context between cppTango's C++ context and PyTango's Python context. These are in a module called _telemetry. The leading underscore indicates that they are intended for PyTango's internal use only.
Client access uses TraceContextScope to pass Python context to C++. Device servers use get_trace_context to pass C++ context to Python.
At the Device implementation level, we expose 6 functions from cppTango for enabling and disabling telemetry, and for checking whether it is enabled (both at the overall device level and for kernel-level tracing).
Devices also get some new methods to create the OpenTelemetry tracer provider and tracer. These are public, and can be overridden if the user needs specialisation. Each device also gets its own OpenTelemetry tracer, which is used to emit traces belonging to that device class and instance.
For client access, there is a single client tracer. This is used if client access (DeviceProxy, AttributeProxy, Group, Database) is executed outside of a Tango device, for example when running from a Python script or an interactive terminal.
When doing client access from within a Tango device (e.g., inside a command handler) the traces are associated with the device's tracer instead.
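One way to picture the choice between the client tracer and a device's tracer is a context variable that defaults to the client tracer and is temporarily replaced while a device handler runs. The sketch below is purely illustrative and is not taken from PyTango: the names `_current_tracer`, `run_in_device` and `client_call` are hypothetical, and tracers are represented by plain strings.

```python
import contextvars

# Hypothetical stand-in for the single client tracer.
_CLIENT_TRACER = "pytango.client"

# Holds the tracer that client calls should use in the current context.
_current_tracer = contextvars.ContextVar("current_tracer", default=_CLIENT_TRACER)


def run_in_device(tracer_name, handler):
    """Run `handler` with the device's tracer active, then restore."""
    token = _current_tracer.set(tracer_name)
    try:
        return handler()
    finally:
        _current_tracer.reset(token)


def client_call():
    """Pretend client access: report which tracer would emit the span."""
    return _current_tracer.get()


outside = client_call()                       # called from a plain script
inside = run_in_device("Leader", client_call)  # called inside a handler
```

Here `outside` resolves to the client tracer and `inside` to the device's tracer, mirroring the two cases described above.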
All client methods that use green mode access will also emit distributed traces.
As telemetry support is a compile-time option for cppTango, we need to do the same in PyTango. This can be checked via the constant tango.constants.TELEMETRY_SUPPORTED. If telemetry support is not compiled in, the telemetry-related functions are still available, but are replaced by no-ops/dummy methods. This means user code doesn't have to check if telemetry is available, unless they want to.
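The no-op fallback can be illustrated with a small sketch. It does not reproduce PyTango's functions: `make_span_factory` and the `events` list are hypothetical, and the boolean argument stands in for tango.constants.TELEMETRY_SUPPORTED. The point is only that both variants share one call signature, so calling code needs no availability check.

```python
import contextlib


def make_span_factory(telemetry_supported: bool, recorded: list):
    """Return a function that creates a span-like context manager."""
    if telemetry_supported:

        @contextlib.contextmanager
        def start_span(name):
            # Real variant: record the start and end of the span.
            recorded.append(("start", name))
            try:
                yield
            finally:
                recorded.append(("end", name))

    else:

        def start_span(name):
            # Dummy variant: same signature, does nothing.
            return contextlib.nullcontext()

    return start_span


events = []

# Caller code is identical in both cases.
for supported in (False, True):
    start_span = make_span_factory(supported, events)
    with start_span("read_attribute"):
        pass
```

After the loop, only the run with `supported=True` has added entries to `events`.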
Fix args list in __EnsureOmniThread__exit__ and return false, to better match expectations for context handler exit.
Limitations
For devices using asyncio and gevent green modes, some spans within command and attribute handlers will be incorrectly associated with the standard client tracer instead of the device's tracer.
Environment variables
Tango has a number of new environment variables related to telemetry. PyTango also uses TANGO_TELEMETRY_ENABLE, TANGO_TELEMETRY_TRACES_EXPORTER and TANGO_TELEMETRY_TRACES_ENDPOINT.
PyTango also has some of its own:
- PYTANGO_TELEMETRY_SPAN_PROCESSOR_TYPE
  - This allows the type of span processor to be overridden. Options are simple for the SimpleSpanProcessor, or batch for the BatchSpanProcessor. If not defined, the default behaviour is to use batch, unless console output is selected, in which case simple is used.
- PYTANGO_TELEMETRY_CLIENT_SERVICE_NAME
  - The client tracer's logical name (see service.name) is "pytango.client" by default. Users can provide their own name using this variable.
- PYTANGO_DISABLE_TELEMETRY_PATCHING
  - PyTango needs to patch the device and client methods to enable tracing. This patching can be disabled by setting the variable to on (cppTango will still process telemetry as normal, unaware of the Python code).
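The defaults described above could be resolved along these lines. This is a sketch under stated assumptions, not the merge request's code: the function name and the returned dictionary keys are hypothetical, it assumes that console output is selected with TANGO_TELEMETRY_TRACES_EXPORTER=console, that values are compared case-insensitively, and that an unrecognised processor type falls back to the default.

```python
import os


def read_pytango_telemetry_settings(environ=None):
    """Resolve PyTango telemetry settings from environment variables."""
    env = os.environ if environ is None else environ

    override = env.get("PYTANGO_TELEMETRY_SPAN_PROCESSOR_TYPE", "").strip().lower()
    if override in ("simple", "batch"):
        processor = override
    else:
        # Default: batch, unless console output is selected (assumed to be
        # signalled by the exporter value "console").
        exporter = env.get("TANGO_TELEMETRY_TRACES_EXPORTER", "").strip().lower()
        processor = "simple" if exporter == "console" else "batch"

    disable = env.get("PYTANGO_DISABLE_TELEMETRY_PATCHING", "").strip().lower()

    return {
        "span_processor": processor,
        "client_service_name": env.get(
            "PYTANGO_TELEMETRY_CLIENT_SERVICE_NAME", "pytango.client"
        ),
        "patching_enabled": disable != "on",
    }
```

Passing a plain dictionary instead of `os.environ` makes the resolution easy to exercise in tests, for example `read_pytango_telemetry_settings({"TANGO_TELEMETRY_TRACES_EXPORTER": "console"})`.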
TODO
- Add some tests
- Use TANGO_TELEMETRY_TRACES_ENDPOINT env var when instantiating exporters
- Remove debug print messages
- Wait for cppTango changes to be merged, and update CI to run our tests
- Update pytango-builder Docker image with new cppTango changes
Other
This is based on an unmerged branch on cppTango, trace_context_propagation#1185: cppTango!1197 (merged)
Closes #582 (closed)
Example of traces from the prototyping example, viewed with SigNoz. Different colours represent different "services":
- purple for the top-level Python script (my.app)
- orange for client calls from that script (pytango.client)
- light blue for the Leader device (Leader)
- dark blue for the Follower devices (Follower)
Env vars: TANGO_TELEMETRY_ENABLE=on, TANGO_TELEMETRY_TRACES_EXPORTER=grpc, TANGO_TELEMETRY_LOGS_EXPORTER=grpc.