Telemetry propagation breaks when kernel spans are filtered out by topic
I ran into some problems when implementing the new telemetry topics and runtime configuration in PyTango. Initially, I saw test failures with opentelemetry-cpp 1.26.0 in CI, but not locally, where I had version 1.21.0. Investigating that (with the help of Codex) led to a couple of related problems.

### Examples

It is easiest to show with some screenshots. Here a PyTango client is reading the "voltage" attribute from a PyTango device server.

#### libopentelemetry-cpp 1.21.0

**topics: all (client and server)**

This works correctly. From top to bottom, spans 1-2: PyTango, spans 3-4: cppTango (see `::` in names), spans 5-7: PyTango.

![image](/uploads/d00e5be0586044084b74df730dc563e9/image.png){width=897 height=188}

**topics: user (client and server)** - the server span has an incorrect parent, and the second client span is missing.

![image](/uploads/507562b746dc1598e141db8d23023c78/image.png){width=890 height=85}

#### libopentelemetry-cpp 1.26.0

**topics: all (client and server)**

Correct - same as with 1.21.0.

![image](/uploads/acaa194dc23e169ed5e205fb82875cab/image.png){width=896 height=211}

**topics: user (client and server)**

Just a single PyTango client span; the rest is missing.

![image](/uploads/a5fa62c843db508bded19bf82f44eb86/image.png){width=893 height=59}

#### Expected

**topics: user (client and server)** - with the fixes proposed here, we get this. Expected PyTango spans, with correct parent-child relationship, and no cppTango kernel spans. (It is debatable if span 2 should be included as a "user" span but that is an implementation detail in PyTango).
![image](/uploads/2a7e6aa15a89f0a72ca7b7a388fd7f56/image.png){width=894 height=112}

---

## Problem

cppTango currently couples two different concerns in its kernel telemetry macros:

- activating propagated context
- creating hidden kernel spans

This affects both sides of a remote procedure call (RPC) boundary:

- client-side in [TANGO_TELEMETRY_TRACE_BEGIN](https://gitlab.com/tango-controls/cppTango/-/blob/10.3.0/src/include/tango/internal/telemetry/telemetry_kernel_macros.h?ref_type=tags#L75)
- server-side in [TANGO_TELEMETRY_KERNEL_TRACE_BEGIN](https://gitlab.com/tango-controls/cppTango/-/blob/10.3.0/src/include/tango/internal/telemetry/telemetry_kernel_macros.h?ref_type=tags#L82)

There are two related issues here, but they are not identical.

### 1. Server-side issue

On the server side, the problem became visible with newer opentelemetry-cpp. With opentelemetry-cpp >= 1.24.0, dropped spans clear the sampled bit. This changed in [PR #3745](https://github.com/open-telemetry/opentelemetry-cpp/pull/3745). It is also mentioned in the [v1.24.0](https://github.com/open-telemetry/opentelemetry-cpp/releases/tag/v1.24.0) release notes.

As a result:

- incoming client context is sampled
- the hidden server/kernel span is dropped by topic filtering
- downstream code receives a non-sampled (i.e., do not emit) trace (traceparent will end in ...-00 instead of ...-01).

This is visible in PyTango:

- cppTango C++ user spans can still be emitted when they set tango.telemetry.topic=user
- but PyTango spans disappear, because Python OTel correctly honors the propagated unsampled parent

So the server-side issue is version-sensitive and was exposed by newer opentelemetry-cpp.

### 2. Client-side issue

On the client side, there is a related but separate problem in TANGO_TELEMETRY_TRACE_BEGIN. The hidden client/kernel span is created before trace context is serialized into ClntIdent.
If that client span is filtered out by topic policy, cppTango can still propagate the wrong parent context downstream.

This issue is not specific to newer opentelemetry-cpp. It is a semantic problem in the client-side macro behavior itself:

- an already-active user/client span can be replaced by a hidden client/kernel span
- the propagated parent on the server no longer matches the expected direct parent-child relationship

## Expected behaviour

Topic filtering should control which spans cppTango emits, without breaking context propagation.

Desired exported shapes, based on topics:

- client 'all', device 'all': client -> rpc -> kernel -> user
- client 'all', device 'user': client -> rpc -> user
- client 'user', device 'all': client -> kernel -> user
- client 'user', device 'user': client -> user

This should work the same whether the user span is created in C++ or Python.

## Proposed fix

Decouple propagation from kernel span emission.

Server side:

- keep TANGO_TELEMETRY_KERNEL_TRACE_BEGIN as the entry point
- activate incoming context first
- only create the hidden server/kernel span when kernel spans should actually be emitted
- otherwise keep the propagated client context active so downstream user code attaches directly to it

Client side:

- update TANGO_TELEMETRY_TRACE_BEGIN so it does not always replace an already-active context with a hidden client/kernel span
- if kernel spans are enabled, create the client/kernel span as today
- otherwise preserve the already-active context for propagation
- if there is no active context yet, cppTango may still create a client span, even if topic filtering later drops it

See attached patch file (this is based on cppTango 10.3.0): [cpptango-telemetry.patch](/uploads/20ebb0479d94ac2a5e9fb781088201bb/cpptango-telemetry.patch)

I haven't found a way to fix this without changing the cppTango API, but maybe it is possible?
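
To make the intended rule concrete, here is a minimal toy model in Python. It is not the cppTango implementation and does not use the OpenTelemetry API: `Span`, `kernel_spans_enabled`, `enter_kernel_scope` and `exported_chain` are hypothetical names invented for this sketch. It assumes a user span is already active on the client (so the "no active context yet" case is left out) and that the only topic settings are `all` and `user`. Under those assumptions, the decoupled rule reproduces the four desired shapes listed under "Expected behaviour".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Toy span: a name plus a link to its parent (hypothetical, not an OTel type)."""
    name: str
    parent: Optional["Span"] = None


def kernel_spans_enabled(topics: str) -> bool:
    # Assumed policy for this sketch: "all" emits kernel and user spans,
    # "user" emits user spans only.
    return topics == "all"


def enter_kernel_scope(name: str, active: Optional[Span], topics: str) -> Optional[Span]:
    """Return the context that is active (and propagated) after the kernel macro."""
    if kernel_spans_enabled(topics):
        # Kernel spans are emitted: create one as a child of the active context.
        return Span(name, parent=active)
    # Kernel spans are filtered: keep the already-active context untouched,
    # so downstream spans attach directly to it.
    return active


def exported_chain(client_topics: str, device_topics: str) -> str:
    """Build the parent chain of the device-side user span for one topic combination."""
    client = Span("client")  # user span already active on the client
    propagated = enter_kernel_scope("rpc", client, client_topics)     # client side
    active = enter_kernel_scope("kernel", propagated, device_topics)  # server side
    user = Span("user", parent=active)  # user span created on the device

    names = []
    node: Optional[Span] = user
    while node is not None:
        names.append(node.name)
        node = node.parent
    return " -> ".join(reversed(names))


if __name__ == "__main__":
    for client_topics in ("all", "user"):
        for device_topics in ("all", "user"):
            chain = exported_chain(client_topics, device_topics)
            print(f"client {client_topics!r}, device {device_topics!r}: {chain}")
```

The point of the sketch is that the function deciding whether to create a kernel span always returns a usable context, so propagation never depends on whether the kernel span exists.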
## Validation

A useful regression test is a 2x2 matrix over client/device topics:

- client all, device all
- client all, device user
- client user, device all
- client user, device user

For each case, verify the parent-child relationship of the exported spans, not just trace continuity.

These changes have also been validated with PyTango tests.

## Summary

The issue is not that PyTango ignores sampling rules. The issue is that cppTango can propagate the context of a topic-filtered hidden span instead of preserving the correct active parent context.

Once cppTango propagates the wrong context:

- newer OTel makes the server-side case visible as ...-00
- downstream SDKs correctly honor that propagated context
- trace structure becomes inconsistent across C++ and Python code

This issue suggests a new release will be required to enable PyTango to properly support runtime telemetry configuration of topics.
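
When running the validation matrix above, a quick way to spot the server-side symptom is to look at the trace-flags field of the propagated W3C `traceparent` value. The helper below is a small sketch for that purpose; `is_sampled` is a hypothetical name, it only handles the version `00` layout (`version-traceid-parentid-flags`), and the example values are the ones used for illustration in the W3C Trace Context specification, not captured from cppTango.

```python
def is_sampled(traceparent: str) -> bool:
    """Return True if the sampled flag (bit 0 of trace-flags) is set.

    Only the version-00 layout is handled: four dash-separated hex fields.
    """
    parts = traceparent.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"unexpected traceparent layout: {traceparent!r}")
    _version, trace_id, parent_id, flags = parts
    if len(trace_id) != 32 or len(parent_id) != 16 or len(flags) != 2:
        raise ValueError(f"unexpected field lengths in: {traceparent!r}")
    # trace-flags is an 8-bit field in hex; the least significant bit is "sampled".
    return bool(int(flags, 16) & 0x01)


if __name__ == "__main__":
    sampled = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    dropped = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00"
    print(is_sampled(sampled))
    print(is_sampled(dropped))
```

A value ending in `-00` on the device side, when the client context was sampled, would indicate that a filtered kernel span cleared the sampled bit before propagation.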