Draft: dispatcher: fix timeout issue for nested retry.

Rémi Duraffort requested to merge nested_retry into master

This MR follows https://git.lavasoftware.org/lava/lava/-/merge_requests/903. At that time, a kernel exception made LAVA send a #, and that # went directly to the Linux prompt, so we proposed the login retry feature to fix it.

In that case, the login timeout exception was raised within 1 minute, so after the login retry, the uboot-commands timeout was never reached.

What we want to resolve now is the interleaving of kernel messages with the login prompt, as in the following log:

2022-05-07T04:42:11 NXP i.MX Release Distro [   21.710112] caam 292e0000.crypto: rng crypto API alg registered prng-caam
2022-05-07T04:42:11 5.15-kirkstone imx8ulpevk ttyLP1
2022-05-07T04:42:11 
2022-05-07T04:42:11 imx8ulpevk l[   21.721075] caam 292e0000.crypto: registering rng-caam
2022-05-07T04:42:11 ogin: [   21.753843] Device caam-keygen registered

This interleaving happens frequently during boot. We want to use the login retry feature in LAVA to fix it: after the first login timeout, force-send a break so that the board prints a new prompt.

Everything works fine, except that the uboot-commands action times out after the second login attempt succeeds:

- {"dt": "2022-06-02T02:23:56.577407", "lvl": "target", "msg": "imx8ulpevk login: root"}
- {"dt": "2022-06-02T02:23:59.889491", "lvl": "debug", "msg": "Setting prompt string to ['root@(.*):~#']"}
- {"dt": "2022-06-02T02:23:59.889684", "lvl": "debug", "msg": "end: 2.4.4.1 login-action (duration 00:00:04) [test_suite_1]"}
- {"dt": "2022-06-02T02:23:59.889766", "lvl": "results", "msg": {"case": "login-action", "definition": "lava", "duration": "3.70", "extra": {"connection-timeout": ! "300", "duration": ! "166.00010514259338", "fail": "login-action timed out after 166 seconds", "success": "(.*) login:", "timeout": ! "180"}, "level": "2.4.4.1", "namespace": "test_suite_1", "result": "pass"}}
- {"dt": "2022-06-02T02:23:59.889934", "lvl": "exception", "msg": "uboot-commands timed out after 203 seconds"}

From a brief look at RetryAction, it appears to adjust only its own max_end_time. In other words, if RetryActions are nested in the pipeline, or a RetryAction is not at the first level of the pipeline, the parent action of that RetryAction never gets a chance to increase its timeout.

This MR tries to find the ancestor actions and sibling actions of the current RetryAction and adjust their max_end_time, so that after the login retry (a nested action) succeeds, the outer action won't time out.
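
The idea can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this MR: the `Action` class, its `parent`/`children` links, the `extend_for_retry` helper and the sibling action name are all hypothetical stand-ins, and only the `max_end_time` attribute name is taken from the description above. It assumes deadlines are absolute timestamps in seconds and that a retry should push every affected deadline back by the same amount.

```python
from typing import Iterator, List, Optional


class Action:
    """Hypothetical stand-in for a pipeline action (not LAVA's real class)."""

    def __init__(self, name: str, max_end_time: float) -> None:
        self.name = name
        self.max_end_time = max_end_time  # absolute deadline, in seconds
        self.parent: Optional["Action"] = None
        self.children: List["Action"] = []

    def add(self, child: "Action") -> "Action":
        """Attach a child action and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def ancestors(self) -> Iterator["Action"]:
        """Yield the parent, grandparent, ... up to the pipeline root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


def extend_for_retry(retry_action: Action, extra: float) -> None:
    """Push back the deadline of a retrying action, its siblings and its ancestors.

    Assumption: one possible reading of "ancestors and siblings" is that
    every one of them gets the same extra time as the retry itself.
    """
    retry_action.max_end_time += extra
    if retry_action.parent is not None:
        for sibling in retry_action.parent.children:
            if sibling is not retry_action:
                sibling.max_end_time += extra
    for ancestor in retry_action.ancestors():
        ancestor.max_end_time += extra


# Usage: a nested login retry inside an outer uboot-commands action.
uboot = Action("uboot-commands", max_end_time=200.0)
login = uboot.add(Action("login-action", max_end_time=180.0))
sibling = uboot.add(Action("hypothetical-sibling", max_end_time=190.0))

extend_for_retry(login, extra=180.0)
print(uboot.max_end_time, login.max_end_time, sibling.max_end_time)
```

Without the loop over `ancestors()`, only `login.max_end_time` would move, which matches the behaviour described above where the outer uboot-commands action still times out after the retried login succeeds.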

Edited by Rémi Duraffort