Threaded ECL may hang when using stdio or certain types of pthread(3) locks on NetBSD

The bug may lie in ECL, boehm-gc, NetBSD's pthread(3) implementation, NetBSD's libc stdio, or an uncommon interaction between them. I have used threads extensively on NetBSD for many production tasks without similar issues, but never with boehm-gc or ECL involved.

The bug can be reproduced on NetBSD-6 using thread-enabled builds of boehm-gc and ECL under heavy stdio usage. When I discovered the bug, I could not reproduce it on Linux. However, the boehm-gc library uses different strategies depending on the OS, so it is likely that the bug lies in those OS-specific strategies.

It appears that when the GC must increase the heap size while one of the threads is waiting on a mutex internal to a stdio-related rwlock, the library is unable to restart that thread when it tries to restart the world after the allocation. Even when the library is told to retry sending signals (which it does, according to the logs), the thread remains unresponsive to the signal and stays in a busy loop on the mutex.
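
The pause/restart handshake involved here can be sketched roughly as follows. This is a much simplified illustration and not boehm-gc's actual code: all names (DEMO_SIG_PAUSE, DEMO_SIG_RESTART, on_pause, pause_and_restart_demo, and so on) are hypothetical, there is a single worker thread, and SIGUSR1/SIGUSR2 stand in for the pause and restart signals for portability. The "collector" pauses the worker with a signal, waits for its acknowledgement, then sends the restart signal and polls for a bounded time to see whether the worker makes progress again. In the failure described above, that last step would never succeed.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical stand-ins for the GC's pause and restart signals. */
#define DEMO_SIG_PAUSE   SIGUSR1
#define DEMO_SIG_RESTART SIGUSR2

static sem_t pause_ack;                     /* posted by the paused thread */
static volatile sig_atomic_t restart_seen;  /* set by the restart handler */
static atomic_int worker_ready, worker_quit;
static atomic_long worker_ticks;            /* worker progress counter */

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static void on_restart(int sig) { (void)sig; restart_seen = 1; }

static void on_pause(int sig)
{
    int saved_errno = errno;
    sigset_t mask;
    (void)sig;
    restart_seen = 0;
    sem_post(&pause_ack);                /* async-signal-safe acknowledgement */
    sigfillset(&mask);
    sigdelset(&mask, DEMO_SIG_RESTART);  /* only the restart signal wakes us */
    while (!restart_seen)
        sigsuspend(&mask);
    errno = saved_errno;
}

static void *worker(void *arg)
{
    (void)arg;
    atomic_store(&worker_ready, 1);
    while (!atomic_load(&worker_quit))
        atomic_fetch_add(&worker_ticks, 1);
    return NULL;
}

/* Returns 0 if the worker was seen paused and then running again. */
static int pause_and_restart_demo(void)
{
    struct sigaction sa;
    pthread_t tid;
    long before, during;
    int i, resumed = 0;

    atomic_store(&worker_ready, 0);
    atomic_store(&worker_quit, 0);
    if (sem_init(&pause_ack, 0, 0) != 0)
        return -1;
    sa.sa_flags = 0;
    sigfillset(&sa.sa_mask);             /* block everything inside handlers */
    sa.sa_handler = on_pause;
    sigaction(DEMO_SIG_PAUSE, &sa, NULL);
    sa.sa_handler = on_restart;
    sigaction(DEMO_SIG_RESTART, &sa, NULL);

    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    while (!atomic_load(&worker_ready))
        sleep_ms(1);

    pthread_kill(tid, DEMO_SIG_PAUSE);   /* "stop the world" */
    while (sem_wait(&pause_ack) == -1 && errno == EINTR)
        ;                                /* retry if interrupted */
    before = atomic_load(&worker_ticks);
    sleep_ms(50);
    during = atomic_load(&worker_ticks); /* unchanged while paused */

    pthread_kill(tid, DEMO_SIG_RESTART); /* "restart the world" */
    for (i = 0; i < 1000 && !resumed; i++) {
        sleep_ms(1);
        resumed = atomic_load(&worker_ticks) > during;
    }
    atomic_store(&worker_quit, 1);
    pthread_join(tid, NULL);
    sem_destroy(&pause_ack);
    return (before == during && resumed) ? 0 : -1;
}
```

Calling pause_and_restart_demo() from main() should return 0 on a system where the handshake works. It does not involve boehm-gc or stdio at all, so it only illustrates the mechanism and is not expected to reproduce the hang by itself.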

The netbsd-boehm-gc branch was created with some minor boehm-gc modifications, the heap size limit auto-adjustment (from the heap-size branch), and an additional directory, ecl-gc-bug/, which contains notes and diagnostics for further work. The boehm-gc modifications use SIGPWR for SIG_THR_PAUSE and SIGXCPU for SIG_RESTART, both of which are officially supported by NetBSD-6 and gracefully recognized by gdb. Using the real-time signals also worked, but they are not officially available to userland and are not recognized by gdb.

It would also be nice to reduce the test case to a simpler C program, using both threads and boehm-gc, which reproduces the bug. This would require some work, as I have never used the boehm-gc library in any of my own projects before.

When stdio can be avoided, ECL seems to be very stable on NetBSD. I have seen high uptimes running crow-httpd (http://mmondor.pulsar-zone.net/mmsoftware.html#crow-httpd): the current uptime is 46 days, and it has exceeded 200 days in the past.
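
Where the application controls its own output, one way to bypass stdio is to write straight to a file descriptor with write(2), which does not go through stdio's FILE streams or their locks. The following is a minimal sketch of that idea; write_all and write_str are hypothetical helper names and are not taken from ECL or crow-httpd.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer to fd, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is set by write(2)). */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;     /* interrupted by a signal, try again */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Convenience wrapper for NUL-terminated strings. */
static int write_str(int fd, const char *s)
{
    return write_all(fd, s, strlen(s));
}
```

For example, write_str(STDOUT_FILENO, "ready\n") would replace a printf("ready\n") call. Retrying on EINTR matters in this setting, because the GC's pause and restart signals may interrupt a blocking write.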

Uptimes are also reasonable on a currently closed-source ECL-based IDS and packet analyser, but it requires a watchdog script that occasionally has to restart it when it hits the same bug. Because the problem appears related to growing the GC heap, the IDS mitigates it by pre-allocating some data at startup so that the heap grows early. However, it also currently uses libpcap(3), which lacks a lower-level FD-I/O interface, so it still uses stdio for now.

Also see http://gnats.netbsd.org/41893 for other related information.

If others want to help with debugging, please see the README, the gc-bug-test.lisp program, and the ktrace/gdb logs under the ecl-gc-bug/ directory. I have not yet tried to reproduce the bug on NetBSD-7.