64-bit big-endian SIGSEGV: ecl_symbol_value before init_all_symbols

With ECL 20.4.24 on my 64-bit big-endian machine, ecl_min crashes at startup with a SIGSEGV at address 0x200000000:

Internal or unrecoverable error in:
Got signal before environment was installed on our thread
Abort trap (core dumped) 

When ECL boots, it reads the symbol value of *PACKAGE* before it has initialized the symbol, because it calls ecl_symbol_value() before init_all_symbols(). Most platforms would read p = 2, but my 64-bit big-endian machine reads p = 0x200000000. Because the tag (p & 3) is 0 there, ECL treats p as a pointer and dereferences it, which causes the SIGSEGV.

I haven't put a debugger on the big-endian machine, but I can use gdb on a little-endian OpenBSD/amd64 machine to show how ECL reads *PACKAGE* too early. For this gdb session, I built ecl_min from git commit 329b37d8:

$ CC=cc CPPFLAGS=-I/usr/local/include LDFLAGS=-L/usr/local/lib ./configure \
> --enable-boehm=system --enable-libatomic=system --with-system-gmp=gmp
$ gmake ecl_min
$ cd build
$ egdb ecl_min
...
(gdb) break ecl_symbol_value
Breakpoint 10 at 0x114230: file /home/kernigh/park/ecl/src/c/symbol.d, line 140.
(gdb) break init_all_symbols
Breakpoint 11 at 0x112590: file /home/kernigh/park/ecl/src/c/all_symbols.d, line 287.
(gdb) run
Starting program: /home/kernigh/park/ecl/build/ecl_min

Breakpoint 10, ecl_symbol_value (s=0xfddfb5219d8 <cl_symbols+2520>)
    at /home/kernigh/park/ecl/src/c/symbol.d:140
140       if (Null(s)) {
(gdb) bt
#0  ecl_symbol_value (s=0xfddfb5219d8 <cl_symbols+2520>)
    at /home/kernigh/park/ecl/src/c/symbol.d:140
#1  0x00000fddfb485649 in ecl_find_package_nolock (
    name=0xfddfb5554a0 <str_common_lisp_data>)
    at /home/kernigh/park/ecl/src/c/package.d:330
#2  0x00000fddfb4853e8 in ecl_make_package (
    name=0xfddfb5554a0 <str_common_lisp_data>, nicknames=0xfe050364ed1, 
    use_list=0x1, local_nicknames=0x1)
    at /home/kernigh/park/ecl/src/c/package.d:230
#3  0x00000fddfb4820d4 in cl_boot (argc=<optimized out>, argv=<optimized out>)
    at /home/kernigh/park/ecl/src/c/main.d:589
#4  0x00000fddfb480b9c in main (argc=-78505512, args=0x1)
    at /home/kernigh/park/ecl/src/c/cinit.d:175
(gdb) print *(cl_symbol_initializer*)s
$3 = {init = {name = 0xfddfb3f9af4 "*PACKAGE*", type = 2, fun = 0x0, 
    narg = -1, value = 0x0}, data = {t = -12 '\364', m = -102 '\232', 
    stype = 63 '?', dynamic = -5 '\373', value = 0x2, gfdef = 0x0, 
    plist = 0xffff, name = 0x0, hpack = 0x0, binding = 0}}

s is a union: s->init is valid and s->data is garbage, because we have not run init_all_symbols() to turn s->init into s->data; but ecl_symbol_value() will return the garbage s->data.value, which is 2 on this little-endian machine.

This is how a compiler packs the union, if a pointer has 64-bit size and alignment:

offset | init       | data
0      | char *name | int8_t    t
1      | ^          | int8_t    m
2      | ^          | int8_t    stype
3      | ^          | int8_t    dynamic
4      | ^          | (pad)
8      | int   type | cl_object value
12     | (pad)      | ^
16     | void *fun  | cl_object gfdef
24     | ...        | ...

The 8-byte s->data.value overlaps the 4-byte s->init.type and 4 bytes of pad. I find type = 2 and pad = 0, so the garbage in s->data.value would be a little-endian 2 or a big-endian 0x200000000. Then src/c/package.d ecl_find_package_nolock() will pass this garbage value to ECL_PACKAGEP:

  p = ecl_symbol_value(@'*package*');
  if (ECL_PACKAGEP(p)) {

If p == 2, then the tag (p & 3) != 0, so ECL_PACKAGEP returns false, and ECL does nothing else with this garbage p; but if p == 0x200000000, then the tag is zero and ECL_PACKAGEP tries to follow the pointer, causing SIGSEGV at 0x200000000.

I don't have a fix for this issue. I would say to call init_all_symbols() earlier, but the comment in src/c/main.d says that I can't do so.

My big-endian machine runs the new and unstable OpenBSD/powerpc64, but I suspect that one can reproduce this bug on other 64-bit big-endian platforms.