Monitor timeout due to random lags in State command

Created by: roger11

Hi everyone,

Let me introduce a problem we have encountered at the ALBA synchrotron.

Error

We recently stumbled upon the following error on our system:

DevFailed[
DevError[
    desc = Not able to acquire serialization (dev, class or process) monitor
  origin = TangoMonitor::get_monitor
  reason = API_CommandTimedOut
severity = ERR]

Problem description

It happens when we are pushing DATA_READY events from an acquisition thread on the server side and, meanwhile, a client executes a command on the same server.
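To make the failure mode easier to picture, here is a minimal stdlib-only sketch. It is an analogy, not Tango's actual implementation: the `SerializationMonitor` class, the `MonitorTimeout` exception and the timing values are all hypothetical. It assumes, for illustration only, that the event-pushing thread holds the monitor longer than the command is willing to wait; the post does not establish that this is the real cause.

```python
import threading
import time


class MonitorTimeout(Exception):
    """Raised when the monitor cannot be acquired in time (hypothetical)."""


class SerializationMonitor:
    """Toy stand-in for a serialization monitor: a lock with a timeout."""

    def __init__(self, timeout):
        self._lock = threading.Lock()
        self._timeout = timeout

    def __enter__(self):
        # Give up instead of blocking forever, like the reported error.
        if not self._lock.acquire(timeout=self._timeout):
            raise MonitorTimeout("Not able to acquire serialization monitor")
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def demo(hold_time=0.5, timeout=0.1):
    """Return True if the 'command' timed out while the 'event' held the monitor."""
    monitor = SerializationMonitor(timeout)
    holding = threading.Event()

    def push_event():
        # Stand-in for the acquisition thread pushing an event.
        with monitor:
            holding.set()
            time.sleep(hold_time)

    pusher = threading.Thread(target=push_event)
    pusher.start()
    holding.wait()  # make sure the pusher owns the monitor first
    try:
        with monitor:  # stand-in for the client command
            timed_out = False
    except MonitorTimeout:
        timed_out = True
    pusher.join()
    return timed_out
```

With the default values the command side waits 0.1 s for a monitor that is held for 0.5 s, so `demo()` reports a timeout; when the hold time is much shorter than the timeout, the command goes through.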

Isolated scenario

As our setup was quite complex, we created a reduced one that consists of a server and a client. A zip file containing both is attached: MonitorLockSerialization.zip

Let's review its components:

  • Server: The server has three attributes: one that generates data_ready events, one that generates change events, and a sleep_time attribute. Each of the two event-generating attributes starts producing events when it is written: its write method starts a thread that pushes an event, waits sleep_time seconds, and repeats. To stop the event generation we have implemented a StopThread command that stops all threads. Last but not least, we have implemented a command that simply sleeps (CommandSleep).
  • Client: The client takes one parameter and, depending on its value, starts one of the server's event generators, then subscribes to the corresponding attribute and, finally, runs a loop that executes a command_inout on the device (CommandSleep).
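The event-generating thread described for the server can be sketched as follows. This is a stdlib-only approximation and not the code from the attached zip: `EventGenerator` and its `push` callback are hypothetical names, and the real server may be structured differently.

```python
import threading


class EventGenerator:
    """Calls push() every sleep_time seconds until stopped (hypothetical helper)."""

    def __init__(self, push, sleep_time):
        self._push = push
        self._sleep_time = sleep_time
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        # In the described server this would be triggered by the attribute write.
        if self._thread is not None and self._thread.is_alive():
            return  # already generating events
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        counter = 0
        while not self._stop.is_set():
            counter += 1
            self._push(counter)
            # wait() returns early when stop() is called, unlike time.sleep().
            self._stop.wait(self._sleep_time)

    def stop(self):
        # Counterpart of the StopThread command for a single generator.
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

In a PyTango device, `push` could for example wrap `push_data_ready_event` or `push_change_event` for the relevant attribute; the attribute names are not given in the post, so they are left out here.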

The setup is based on these scripts. To build it, you need to:

  1. Start the server: Open a console and define the device as follows:
       $ tango_admin --add-server MonitorLockSerializationServer/LockTest MonitorLockSerialization test/monitor_lock/1
     Once the device has been defined, start it from the directory where its file is located:
       $ python MonitorLockSerializationServer.py LockTest -v4
  2. Start the client: From another console on a host with the same TANGO_DB, run the script from the directory where it is located:
       $ ./MonitorLockSerializationClient ChangeEvent
     to test the setup with change events, or
       $ ./MonitorLockSerializationClient DataReadyEvent
     to test the setup with data_ready events instead.
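For readers who do not want to open the zip, the client flow described in the component list (write the attribute to start event generation, subscribe, then loop on CommandSleep) can be sketched roughly as below. This is not the attached script: `run_client`, the written value `1` and the `loops` default are assumptions, and `device` is only expected to behave like a `tango.DeviceProxy`.

```python
def run_client(device, attr_name, event_type, on_event, loops=20):
    """Start event generation, subscribe, then call CommandSleep in a loop.

    Returns the number of CommandSleep calls that completed before the
    first failure (or `loops` if none failed).
    """
    # Writing the attribute starts the event-generating thread on the server.
    # The value written (1) is an assumption; the post does not specify it.
    device.write_attribute(attr_name, 1)
    event_id = device.subscribe_event(attr_name, event_type, on_event)
    completed = 0
    try:
        for _ in range(loops):
            device.command_inout("CommandSleep")
            completed += 1
    except Exception as exc:  # tango.DevFailed in a real run
        print(f"CommandSleep failed after {completed} calls: {exc}")
    finally:
        device.unsubscribe_event(event_id)
    return completed


# Possible use with PyTango (attribute name is a placeholder, not from the post):
#   import tango
#   proxy = tango.DeviceProxy("test/monitor_lock/1")
#   run_client(proxy, "<data_ready attribute>",
#              tango.EventType.DATA_READY_EVENT, print)
```

Passing the proxy and event type in as arguments keeps the loop independent of a running Tango database, so it can be exercised with a stand-in object before pointing it at the real device.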

Error generation

If we run the setup with the ChangeEvent parameter, no error occurs and the system works as expected. If we run it with the DataReadyEvent parameter, after some loops (exactly 10, since CommandSleep executes 10 times faster than the data_ready push) the client crashes with a timeout while executing CommandSleep, and the server reports the "Not able to acquire serialization monitor" error. By adjusting the sleep times used by the client we can force the problem to happen on the first loop (e.g. by increasing the sleep time of CommandSleep), but as it stands the error appears quickly either way.



That is all. If you need any additional information, do not hesitate to ask.

Many thanks, Roger