scilab 6.1.1 and 2024.1.0: documentation build deadlocks randomly
Hi Scilab Developers,
Bug Description
Santiago Vila identified a deadlock that occurs notably during the documentation build steps, initially with scilab 6.1.1, but we have also seen the problem with the more recent 2024.1.0. The problem was initially reported as Debian bug #1106083. As far as our investigation goes, the problem surfaced when bumping the operating system kernel from Debian 12 bookworm's Linux 6.1.y kernel series to the upcoming Debian 13 trixie's Linux 6.12.y kernel series; the rest of the build environment is Debian 13 trixie's in either case, since the build runs in an isolated environment.
Further investigation turned up some information that is not (yet)
recorded in the Debian bug. The problem does not depend on the
kernel config options (it occurs on either the plain
6.12.y+deb13-amd64 kernel or the cloud-specific
6.12.y+deb13-amd64-cloud one). The deadlock has been seen so far
only on virtual machines; I don't believe anyone has witnessed this
issue on real hardware yet (or if it does occur there, it has less
than a 5% probability of occurring). In addition, stressing the
entropy source with cat /dev/random>/dev/null seemed to
substantially increase the probability of deadlock, but this may
be cargo cult on my part. I also tested whether the behaviour
differs between OpenJDK 21 and 25, both currently available in the
upcoming Debian 13 trixie, and both were affected by the deadlock.
Steps to reproduce
To reproduce the deadlock, we run more or less:
$ cd scilab && make all doc
[…]
As mentioned above, stressing the entropy pool of the system seems to increase the frequency of the problem (I have not managed to get two consecutive builds to go through under such conditions).
$ cat /dev/random > /dev/null
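Since the hang is random, a small wrapper can help tell a hung build from a successful or failed one when repeating the build. This is only a minimal sketch, not part of our actual build setup: the function name run_with_deadline is hypothetical, and it assumes timeout(1) from GNU coreutils, which exits with status 124 when the deadline expires.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a deadline and classify the outcome.
# Assumes timeout(1) from GNU coreutils (exit status 124 means "timed out").
run_with_deadline() {
    deadline=$1
    shift
    rc=0
    # Discard the command's output; only the exit status matters here.
    timeout "$deadline" "$@" >/dev/null 2>&1 || rc=$?
    case $rc in
        0)   echo "ok" ;;        # command finished successfully
        124) echo "hang" ;;      # deadline expired, likely stuck
        *)   echo "fail($rc)" ;; # command failed on its own
    esac
}
```

For instance, something like run_with_deadline 2h make all doc, repeated from a clean tree, would print "hang" for each attempt that did not finish in time; the two-hour deadline is an arbitrary placeholder and has to be well above the normal build duration.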
In any case, the issue has only been witnessed in virtual machines so far, and we have only tested this on Debian 13, since it is our target operating system. The deadlock apparently occurs while building localized documentation (if I understand the steps correctly):
-- Building documentation (pt_BR) --
LANG=pt_BR.UTF-8 LC_ALL=C SCI_DISABLE_TK=1 SCI_JAVA_ENABLE_HEADLESS=1 _JAVA_OPTIONS='-Djava.awt.headless=true' HOME=/tmp ./bin/scilab-adv-cli -noatomsautoload -nb -l pt_BR -nouserstartup -e "try xmltojar([],[],'pt_BR');catch disp(lasterror()); exit(-1);end;exit(0);"
Warning: Localization issue. Failed to change the LC_CTYPE locale category. Does not support the locale 'pt_BR' (null) C.
Did you install the system locales?
Warning: Localization issue. Does not support the locale 'pt_BR'
Returned: NULL
Current system locale: C
Did you install the system locales?
Opening a side shell reveals the last process to run is:
/build/reproducible-path/scilab-2024.1.0+dfsg/scilab/.libs/scilab-bin -noatomsautoload -nb -l pt_BR -nouserstartup -e try xmltojar([],[],'pt_BR');catch disp(lasterror()); exit(-1);end;exit(0); -nw
Stracing the process reveals a system call waiting for a lock:
futex(0x7f4dfc8da220, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
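To complement the strace output, the state of every thread of the stuck process can be read from /proc. The following is a minimal sketch under the assumption of a Linux /proc filesystem; the function name thread_states is hypothetical and was not part of the investigation above.

```shell
#!/bin/sh
# Hypothetical helper: print "<tid> <state> <name>" for each thread of a process.
# Reads /proc/<pid>/task/<tid>/status and /proc/<pid>/task/<tid>/comm (Linux only).
thread_states() {
    pid=$1
    for task in /proc/"$pid"/task/*; do
        tid=${task##*/}
        # State is a single letter, e.g. R (running), S (sleeping), D (disk sleep).
        # Threads may exit while we iterate, so skip entries that vanish.
        state=$(awk '/^State:/ {print $2}' "$task/status" 2>/dev/null) || continue
        name=$(cat "$task/comm" 2>/dev/null) || continue
        echo "$tid $state $name"
    done
}
```

Called with the PID of the scilab-bin process, this would show whether all threads are sleeping, which would be consistent with a deadlock rather than a busy loop. It does not say which lock is involved; for that, a backtrace of all threads (for example with gdb -p <pid> -batch -ex "thread apply all bt") would probably be needed.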
What is the expected correct behavior?
We expect the documentation build steps to complete reliably.
Error log
We can't really provide an error log, because the build process simply stops making progress once it reaches the deadlock. The CPU usage is at 0%.
Have a nice day, :)
Étienne Mollier.