Provide performance monitoring for the ZMQ event system
At SKAO, we have had some issues where it appears that Tango events have gone "missing" and it has been difficult to track down the cause of these missing events because the Tango event system is somewhat opaque.
To track down the missing events, we have developed a patched version of cppTango (available in a pytango wheel from https://gitlab.com/ska-telescope/ska-tango-event-monitor) which adds a new QueryEventSystem() command to the DServer so that clients can inspect the state of the ZMQ supplier/consumer. Additionally, there are StartEventSystemMonitoring()/StopEventSystemMonitoring() commands which make the device server start and stop gathering performance metrics for the event processing. These metrics can be retrieved via QueryEventSystem().
Using this new tool, we have been able to track down the issues in SKAO software that were causing these missing events.
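As an illustration of the workflow described above, a client-side monitoring window could look roughly like this. It is a minimal sketch, not part of the patch: the three command names are the ones the patched cppTango adds (a stock device server will reject them), the helper name `sample_event_system` is hypothetical, and the sketch queries before stopping because it is an assumption that samples might not survive StopEventSystemMonitoring().

```python
import json
import time


def sample_event_system(admin, duration_s=1.0):
    """Run one monitoring window and return the decoded QueryEventSystem() report.

    `admin` is any object with a PyTango-style command_inout(name) method,
    for example a tango.DeviceProxy connected to the server's admin device.
    The command names only exist on a device server built against the
    patched cppTango.
    """
    admin.command_inout("StartEventSystemMonitoring")
    try:
        time.sleep(duration_s)  # give the server time to collect samples
        raw = admin.command_inout("QueryEventSystem")  # DevString holding JSON
    finally:
        # Always switch monitoring off again, even if the query failed.
        admin.command_inout("StopEventSystemMonitoring")
    return json.loads(raw)
```

With PyTango, `admin` would typically be `tango.DeviceProxy("dserver/<server>/<instance>")` for the device server under investigation.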
The QueryEventSystem() call returns the following information:
- The contents of ZmqEventSupplier::event_counters
- The contents of ZmqEventConsumer::event_callback_map
- The contents of ZmqEventConsumer::channel_map
- How many events have been received for a particular topic
If performance monitoring has been enabled, it also returns performance samples of events being sent and received (up to 256 each). Each "publisher" sample includes:
- How long since the last event was sent
- How long it took to add the event to the ZMQ publisher queue
Each "subscriber" sample includes:
- How long since the last event was received
- How long it takes to process the event (i.e. call the user callbacks)
- How long the subscriber thread spent sleeping, waiting for this event
- The "latency" of the event, i.e. the time difference between when the event was sent and received[1]
- The name of the attribute
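
To make the relationship between the subscriber measurements concrete, here is a small Python model of how such samples could be derived and kept in a bounded buffer. It is only a sketch under stated assumptions: the real implementation lives in C++ inside cppTango, the names `SubscriberSample` and `SubscriberMonitor` are hypothetical, and for simplicity all timestamps are assumed to share one timebase.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Optional

MAX_SAMPLES = 256  # the issue states "up to 256 each"


@dataclass
class SubscriberSample:
    attr_name: str
    since_last_event_s: Optional[float]  # None for the first event seen
    sleep_s: float     # time spent waiting for this event
    process_s: float   # time spent in the user callback
    latency_s: float   # received time minus (user-settable) sent time


class SubscriberMonitor:
    def __init__(self, capacity=MAX_SAMPLES, clock=time.monotonic):
        # A deque with maxlen silently drops the oldest sample when full.
        self.samples = deque(maxlen=capacity)
        self._clock = clock
        self._last_received = None

    def handle_event(self, attr_name, sent_at, wait_started_at, callback):
        received_at = self._clock()
        callback()  # stands in for the user callbacks
        done_at = self._clock()
        since_last = (
            None if self._last_received is None
            else received_at - self._last_received
        )
        self._last_received = received_at
        self.samples.append(SubscriberSample(
            attr_name=attr_name,
            since_last_event_s=since_last,
            sleep_s=received_at - wait_started_at,
            process_s=done_at - received_at,
            latency_s=received_at - sent_at,
        ))
```

Because the sent time comes from the sender, the latency value is only as trustworthy as that timestamp.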
All the information is returned in a single DevString, holding a JSON object.
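
Since the report is plain JSON, it can be post-processed with standard tools. The sketch below ranks attributes by their slowest callback. The JSON layout is not specified in this issue, so the key names `subscriber_samples`, `attr_name` and `processing_us` are purely hypothetical placeholders and are exposed as parameters; adjust them to whatever the patched QueryEventSystem() actually emits.

```python
import json
from collections import defaultdict


def slowest_callbacks(report_json,
                      samples_key="subscriber_samples",  # hypothetical key
                      name_key="attr_name",              # hypothetical key
                      processing_key="processing_us"):   # hypothetical key
    """Return (attribute, worst processing time) pairs, slowest first."""
    report = json.loads(report_json)
    worst = defaultdict(int)
    # A report without subscriber samples simply yields an empty list.
    for sample in report.get(samples_key, []):
        name = sample[name_key]
        worst[name] = max(worst[name], sample[processing_key])
    return sorted(worst.items(), key=lambda item: item[1], reverse=True)
```

A ranking like this is one way to spot user callbacks that block the subscriber thread long enough for events to be delayed or dropped.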
This issue is to upstream these patches (or something similar), as we feel they would be generally useful for the Tango community. One thing missing from the patches is the ability to query the event consumer from a pure Tango client; it would be nice if we could include this too.
[1] The sent time is taken from the attr_value, which can be set by the user, so this measurement should be taken with a pinch of salt.