Multinode Protocol not working (possible LAVA bug)
CONTEXT
Hello everyone. I'm working on a system that uses LAVA to automate test execution for several devices/targets. However, after several weeks of work and testing, and after reading all the pages of the LAVA documentation on the Multinode API and the Multinode Protocol (how to write tests, etc.), I couldn't find a way to synchronize a computer acting as a host with a device acting as a guest, where the guest must wait for the host to be up before starting.
ISSUE
After testing for 2-3 weeks (working through dozens of use cases by brute force, varying parameters and configurations), I finally started studying the LAVA code of the Multinode Protocol implementation in order to locate the error. I discovered that the error was being raised by the function:
def collate(self, reply, params)
in the protocols/multinode.py file (line 451).
I'm not an expert in Python, but I studied the execution flow of the code, the parameters and the results returned by each function, and I think there is a bug in the code, specifically in the function collate(). According to the documentation, in the Multinode Protocol we can use the lava-send command specifying only the messageID and no message. However, when reviewing the collate() function, it seems that if you don't specify a message in the job configuration, the job fails. When I do specify the message, it fails too.
Also, this function is called by the Multinode Protocol when the lava-wait command is used on the target, and it fails when the lava-send command is used on the host. I don't know whether this function is intended to be used by all the commands of the Multinode Protocol.
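The synchronization expected here (the guest blocks on a message ID until the host sends it, with or without a message body) can be illustrated with a minimal Python sketch. This is a toy in-memory model and not LAVA code: the class ToyCoordinator, its send() and wait() methods, the run_demo() helper and the message ID host_ready are all hypothetical names chosen for illustration, and the model assumes that a send without a body should be delivered as an empty message.

```python
import threading


class ToyCoordinator:
    """Hypothetical stand-in for a message coordinator; NOT LAVA's implementation."""

    def __init__(self):
        self._messages = {}  # message ID -> {sender name: message body}
        self._cond = threading.Condition()

    def send(self, sender, message_id, body=None):
        # Models a send where the message body is optional:
        # a missing body is stored as an empty dict (assumption).
        with self._cond:
            self._messages.setdefault(message_id, {})[sender] = dict(body or {})
            self._cond.notify_all()

    def wait(self, message_id, timeout=5.0):
        # Models a wait: block until some node has sent this message ID.
        with self._cond:
            arrived = self._cond.wait_for(
                lambda: message_id in self._messages, timeout=timeout
            )
            if not arrived:
                raise TimeoutError("no message with ID %r arrived" % message_id)
            return dict(self._messages[message_id])


def run_demo():
    """Guest waits for 'host_ready'; host sends it with no body."""
    coordinator = ToyCoordinator()
    received = {}

    def guest():
        received.update(coordinator.wait("host_ready"))

    guest_thread = threading.Thread(target=guest)
    guest_thread.start()
    coordinator.send("host", "host_ready")  # ID only, no message body
    guest_thread.join()
    return received


if __name__ == "__main__":
    print(run_demo())
```

In this model the guest unblocks as soon as the host sends the ID, and a send without a body is not treated as an error. That is the behavior the documentation seems to describe for lava-send and lava-wait, and the one the submitted jobs do not show.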
ATTACHMENTS
I'm attaching the configuration of the device types and devices used in LAVA, as well as the configuration of some jobs that I submitted and their execution traces with the errors. I had to delete some data for confidentiality and security reasons, but it is not related to the problem. I'm also attaching a simple sequence diagram showing the main functions used during the Multinode Protocol execution (according to what I learned when I studied the code). I hope this diagram helps you when debugging the problem.
- Host device type configuration -> testbench.jinja2
- Host device configuration -> testbench-2.jinja2
- Guest device type configuration -> t2080rdb.jinja2
- Guest device configuration -> t2080rdb-1.jinja2
- Job configuration -> job_configuration.yaml
- Host results trace -> testbench.log
- Guest results trace -> t2080rdb.log
I'm trying to patch this bug myself but, as I said, I'm not an expert in Python and I'm not sure how all the functions involved work. I would appreciate any help you can provide and, if this is truly a bug in LAVA, I think fixing it will help us all.
Please comment with any questions, ideas or updates you have on this. Thank you very much in advance.