64-bit big-endian SIGSEGV: ecl_symbol_value before init_all_symbols
With ECL 20.4.24 on my 64-bit big-endian machine, ecl_min crashes at startup with a SIGSEGV at address 0x200000000:
Internal or unrecoverable error in:
Got signal before environment was installed on our thread
Abort trap (core dumped)
When ECL boots, it reads the symbol value of *PACKAGE* before it has initialized the symbol, because it calls ecl_symbol_value() before init_all_symbols(). Most platforms would read p = 2, but my 64-bit big-endian machine reads p = 0x200000000. There the tag (p & 3) == 0 makes ECL treat p as a pointer and follow it, which causes the SIGSEGV.
I haven't put a debugger on the big-endian machine, but I can use gdb on a little-endian OpenBSD/amd64 machine to show how ecl reads *PACKAGE* too early. For this gdb session, I built ecl_min from git commit 329b37d8:
$ CC=cc CPPFLAGS=-I/usr/local/include LDFLAGS=-L/usr/local/lib ./configure \
> --enable-boehm=system --enable-libatomic=system --with-system-gmp=gmp
$ gmake ecl_min
$ cd build
$ egdb ecl_min
...
(gdb) break ecl_symbol_value
Breakpoint 10 at 0x114230: file /home/kernigh/park/ecl/src/c/symbol.d, line 140.
(gdb) break init_all_symbols
Breakpoint 11 at 0x112590: file /home/kernigh/park/ecl/src/c/all_symbols.d, line 287.
(gdb) run
Starting program: /home/kernigh/park/ecl/build/ecl_min
Breakpoint 10, ecl_symbol_value (s=0xfddfb5219d8 <cl_symbols+2520>)
at /home/kernigh/park/ecl/src/c/symbol.d:140
140 if (Null(s)) {
(gdb) bt
#0 ecl_symbol_value (s=0xfddfb5219d8 <cl_symbols+2520>)
at /home/kernigh/park/ecl/src/c/symbol.d:140
#1 0x00000fddfb485649 in ecl_find_package_nolock (
name=0xfddfb5554a0 <str_common_lisp_data>)
at /home/kernigh/park/ecl/src/c/package.d:330
#2 0x00000fddfb4853e8 in ecl_make_package (
name=0xfddfb5554a0 <str_common_lisp_data>, nicknames=0xfe050364ed1,
use_list=0x1, local_nicknames=0x1)
at /home/kernigh/park/ecl/src/c/package.d:230
#3 0x00000fddfb4820d4 in cl_boot (argc=<optimized out>, argv=<optimized out>)
at /home/kernigh/park/ecl/src/c/main.d:589
#4 0x00000fddfb480b9c in main (argc=-78505512, args=0x1)
at /home/kernigh/park/ecl/src/c/cinit.d:175
(gdb) print *(cl_symbol_initializer*)s
$3 = {init = {name = 0xfddfb3f9af4 "*PACKAGE*", type = 2, fun = 0x0,
narg = -1, value = 0x0}, data = {t = -12 '\364', m = -102 '\232',
stype = 63 '?', dynamic = -5 '\373', value = 0x2, gfdef = 0x0,
plist = 0xffff, name = 0x0, hpack = 0x0, binding = 0}}
s is a union: s->init is valid and s->data is garbage, because we have not run init_all_symbols() to turn s->init into s->data; but ecl_symbol_value() will return the garbage s->data.value, which is 2 on this little-endian machine.
This is how a compiler packs the union when a pointer has 64-bit size and alignment:
offset | init       | data
0      | char *name | int8_t t
1      | ^          | int8_t m
2      | ^          | int8_t stype
3      | ^          | int8_t dynamic
4      | ^          | (pad)
8      | int type   | cl_object value
12     | (pad)      | ^
16     | void *fun  | cl_object gfdef
24     | ...        | ...
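The layout in the table above can be checked with offsetof. This is a minimal sketch using simplified stand-in structs (demo_init, demo_data, and the two functions are hypothetical names for illustration, not ECL's real definitions); it only models the first few members named in the gdb output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the "init" side of the union (not ECL's type). */
typedef struct {
    const char *name;   /* pointer: 8 bytes on a 64-bit ABI */
    int type;           /* 4 bytes, usually followed by 4 bytes of padding */
    void *fun;
} demo_init;

/* Hypothetical stand-in for the "data" side of the union (not ECL's type). */
typedef struct {
    int8_t t, m, stype, dynamic;  /* 4 header bytes, then padding */
    void *value;                  /* pointer-sized, pointer-aligned */
    void *gfdef;
} demo_data;

/* Returns 1 when data.value starts at the same offset as init.type. */
int value_overlaps_type(void)
{
    return offsetof(demo_data, value) == offsetof(demo_init, type);
}

/* Prints the offsets so they can be compared with the table. */
void print_layout(void)
{
    printf("init.type  at offset %zu, size %zu\n",
           offsetof(demo_init, type), sizeof(int));
    printf("data.value at offset %zu, size %zu\n",
           offsetof(demo_data, value), sizeof(void *));
}
```

Calling print_layout() from a small main() on a typical 64-bit ABI should show both members at offset 8, with data.value twice as wide as init.type.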
The 8-byte s->data.value overlaps the 4-byte s->init.type and 4 bytes of pad. I find type = 2 and pad = 0, so the garbage in s->data.value would be a little-endian 2 or a big-endian 0x200000000. Then ecl_find_package_nolock() in src/c/package.d will pass this garbage value to ECL_PACKAGEP:
p = ecl_symbol_value(@'*package*');
if (ECL_PACKAGEP(p)) {
If p == 2, then the tag (p & 3) != 0, so ECL_PACKAGEP returns false, and ECL does nothing else with this garbage p; but if p == 0x200000000, then the tag is zero and ECL_PACKAGEP tries to follow the pointer, causing SIGSEGV at 0x200000000.
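The difference between the two byte orders can be reproduced on any machine by simulating the loads in software. This is only a sketch under the assumptions stated in this report: a 4-byte type followed by 4 zero bytes of pad, read back as one 8-byte word, and a pointer test modeled as (p & 3) == 0. The function names are hypothetical, and looks_like_pointer() is not the real ECL_PACKAGEP macro.

```c
#include <assert.h>
#include <stdint.h>

/* Write a 32-bit value into the first 4 bytes of an 8-byte slot in the
 * given byte order; the remaining 4 bytes model zeroed padding. */
static void store_type(unsigned char slot[8], uint32_t type, int big_endian)
{
    for (int i = 0; i < 8; i++)
        slot[i] = 0;
    for (int i = 0; i < 4; i++) {
        int shift = big_endian ? 8 * (3 - i) : 8 * i;
        slot[i] = (unsigned char)(type >> shift);
    }
}

/* Read all 8 bytes back as one 64-bit word in the given byte order. */
static uint64_t load64(const unsigned char slot[8], int big_endian)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        int shift = big_endian ? 8 * (7 - i) : 8 * i;
        v |= (uint64_t)slot[i] << shift;
    }
    return v;
}

/* The garbage seen when the 4-byte type plus pad is read as an 8-byte value. */
uint64_t garbage_value(uint32_t type, int big_endian)
{
    unsigned char slot[8];
    store_type(slot, type, big_endian);
    return load64(slot, big_endian);
}

/* Simplified model of the tag test: low two bits clear means "pointer". */
int looks_like_pointer(uint64_t p)
{
    return (p & 3) == 0;
}
```

With type = 2, garbage_value(2, 0) gives 2, which fails the pointer test, while garbage_value(2, 1) gives 0x200000000, which passes it and would then be dereferenced.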
I don't have a fix for this issue. I would suggest calling init_all_symbols() earlier, but the comment in src/c/main.d says that I can't do so.
My big-endian machine runs the new and unstable OpenBSD/powerpc64, but I suspect that one can reproduce this bug on other 64-bit big-endian platforms.