[bugfix incl.] Solaris Debuggers Panic OS with "Nonparity Synchronous Error"

Host environment

Operating system: Windows 11 Pro 23H2 (using MSYS2 Mingw64)
Architecture: x86_64
QEMU flavor: qemu-system-sparc
QEMU version: 8.2.92 (v9.0.0-rc2-27-gce64e6224a-dirty)

QEMU command line:

./qemu-system-sparc -M SS-5 -m 256 -drive file=D:/qemu-stuff/sunos-hdd2.img,bus=0,unit=0

Emulated/Virtualized environment

Operating system: Solaris 9
OS/kernel version: SunOS blade 5.9 Generic_118558-34 sun4m sparc SUNW,SPARCstation-5
Architecture: sparc

Description of problem

General use of a debugger (mdb, adb, gdb), such as single-stepping, triggering a breakpoint, or simply running a program, causes a kernel panic with "Nonparity Synchronous Error" on many versions of Solaris / SunOS.

This is a well-reported issue.

Related Bugs

The problem stems from QEMU's SPARC softMMU, specifically how it resolves "alternate space" instructions for the UserText ASI. I have come up with a solution and a patch!

I now experience successful debugging sessions in Solaris using mdb, gdb, and adb on both static and dynamically linked executables.
The v1 patch below should also solve all of the related bugs on different Solaris and SunOS versions, since this is an MMU issue.
All tests from make test are passing (see testlog-partial.txt).

Here is the successful fix, v1.patch:

diff --git a/target/sparc/translate.c b/target/sparc/translate.c
index 319934d9bd..543774714f 100644
--- a/target/sparc/translate.c
+++ b/target/sparc/translate.c
@@ -1159,6 +1159,7 @@ static DisasASI resolve_asi(DisasContext *dc, int asi, MemOp memop)
                || (asi == ASI_USERDATA
                    && (dc->def->features & CPU_FEATURE_CASA))) {
         switch (asi) {
+        case ASI_USERTXT:    /* User text access */
         case ASI_USERDATA:   /* User data access */
             mem_idx = MMU_USER_IDX;
             type = GET_ASI_DIRECT;

It took me quite a lot of research to come up with this. Here is a summary of how I discovered this. It starts with an mdb session of the crash dump's backtrace.

> ::stack
vpanic(3, f0083cdc, 84, 10078, b14078, f0059a98)
small_sun4m_ebe_handler+0x50(10036, 10078, 9, fc085a2c, 1, 48010a1)
trap+0x67c(f62cadf0, fc085a2c, 10078, 10000, 0, 200)
fault+0x84(10078, fc085adc, 8, f0064dec, 0, 712f)
prdostep+0x94(f64d34e0, 1, fc085b6c, ffffffff, 0, f0275800)
post_syscall+0x528(fc085bdc, 3b, 0, 1, f64d34e0, 0)
syscall_trap+0x204(252ad, 39fe8, 39fa5, 39edd, 0, 3)

prdostep is called when single-stepping or running the program being debugged. We can see the exact instruction that caused the panic by looking further up the call stack at the 2nd parameter passed to `trap`, the C-level trap handler. It points to the saved registers, notably %pc.

> 0xfc085a2c::print 'struct regs'
{
    r_psr = 0x44000c2
    r_pc = 0xf0064ce8
    r_npc = 0xf0064cf0
    r_y = 0
    r_g1 = 0xf5b035c8
    r_g2 = 0
    r_g3 = 0x100000
    r_g4 = 0xf62cadf0
    r_g5 = 0
    r_g6 = 0xf0041000
    r_g7 = 0xf64d34e0
    r_o0 = 0x10078
    r_o1 = 0xfc085adc
    r_o2 = 0x8
    r_o3 = 0xf0064dec
    r_o4 = 0
    r_o5 = 0x712f
    r_o6 = 0xfc085a78
    r_o7 = 0xf0067860
}
> 0xf0064ce8/i
default_fuword32+0x28:          lda       [%o0] 08, %o0

The problematic instruction is lda (Load Word from Alternate Space). It takes an ASI (address space identifier), which in this case is 8: the User Instruction space.
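
As a minimal sketch of where that ASI lives, the snippet below encodes and decodes the faulting instruction, assuming the SPARC V8 format-3 layout (op, rd, op3, rs1, i, asi, rs2). The helper names are hypothetical and are not taken from QEMU or Solaris; the ASI names for 0x0a and 0x0b come from the SPARC V8 reference MMU assignments rather than from this report.

```c
#include <assert.h>
#include <stdint.h>

/* SPARC V8 format-3 fields relevant to lda. */
enum { SPARC_OP_MEM = 3, SPARC_OP3_LDA = 0x10 };

/* Hypothetical helper: encode "lda [%rs1 + %rs2] asi, %rd" (i = 0 form). */
uint32_t sparc_encode_lda(unsigned rd, unsigned rs1, unsigned rs2, unsigned asi)
{
    assert(rd < 32 && rs1 < 32 && rs2 < 32 && asi < 256);
    return ((uint32_t)SPARC_OP_MEM << 30) | ((uint32_t)rd << 25) |
           ((uint32_t)SPARC_OP3_LDA << 19) | ((uint32_t)rs1 << 14) |
           ((uint32_t)asi << 5) | (uint32_t)rs2;
}

/* True if the word is an lda with a register-register address (i = 0). */
int sparc_is_lda(uint32_t insn)
{
    return (insn >> 30) == SPARC_OP_MEM &&
           ((insn >> 19) & 0x3f) == SPARC_OP3_LDA &&
           ((insn >> 13) & 1) == 0;
}

/* The 8-bit ASI sits in bits 12..5 of the instruction word. */
unsigned sparc_insn_asi(uint32_t insn)
{
    return (insn >> 5) & 0xff;
}

/* Names for the four ASIs discussed in this report. */
const char *sparc_asi_name(unsigned asi)
{
    switch (asi) {
    case 0x08: return "user instruction";
    case 0x09: return "supervisor instruction";
    case 0x0a: return "user data";
    case 0x0b: return "supervisor data";
    default:   return "other";
    }
}
```

With %o0 being register 8, sparc_encode_lda(8, 8, 0, 0x08) models lda [%o0] 08, %o0, and sparc_insn_asi() on the result returns 8, the User Instruction space.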

The instruction lda [%o0] 08, %o0 tries to load from the address in %o0. From the struct regs above, r_o0 was 0x10078, which is a user process address. Unfortunately this address wasn't accessible from any crash dump, even one with all pages (for some reason, running ::context on my test program's proc address failed with 'auxv for proc is missing AT_ENTRY'). So, to see what 0x10078 is, I loaded my test program into gdb, which was usable as long as I didn't step or run the program.

$ file test
test:		ELF 32-bit MSB executable SPARC Version 1, statically linked, not stripped
$ ./test
Hello
$ gdb test
GNU gdb 5.0
[...]
This GDB was configured as "sparc-sun-solaris2.9"...
(gdb) x/i 0x10078
0x10078 <_start>:	clr  %fp

Ah, 0x10078 is the first instruction of the program. So the kernel panic arose from an attempt to single-step (at a low level, prdostep) the very first instruction of this program.

For brevity I won't include all of my research, but eventually I landed on QEMU's SPARC address space resolution, which lives in the resolve_asi() function in target/sparc/translate.c. For example:

        case ASI_USERDATA:   /* User data access */
            mem_idx = MMU_USER_IDX;
            type = GET_ASI_DIRECT;
            break;
        case ASI_KERNELDATA: /* Supervisor data access */
            mem_idx = MMU_KERNEL_IDX;
            type = GET_ASI_DIRECT;
            break;

Among a plethora of other ASI entries, there was none for User or Kernel instruction access. Long story short, adding only an ASI_USERTXT case as per the v1 patch was the holy grail solution. This experimental change is proving successful: the build test cases are all passing, my Solaris 9 boots up and runs fine, and debugging now works 😄
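
To illustrate the effect of the v1 patch, here is a toy model of the lookup, not QEMU's actual code. The names merely echo QEMU's; the real resolve_asi() has additional conditions that are omitted here. The values MMU_KERNEL_IDX = 1, MMU_PHYS_IDX = 2 and GET_ASI_HELPER = 0 match the debug printouts later in this report, while MMU_USER_IDX = 0, the numeric value chosen for the "direct" type, and the data ASI numbers 0x0a/0x0b are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    MODEL_ASI_USERTXT    = 0x08,
    MODEL_ASI_KERNELTXT  = 0x09,
    MODEL_ASI_USERDATA   = 0x0a,  /* assumed, per SPARC V8 reference MMU */
    MODEL_ASI_KERNELDATA = 0x0b   /* assumed, per SPARC V8 reference MMU */
};

enum {
    MODEL_MMU_USER_IDX   = 0,     /* assumed */
    MODEL_MMU_KERNEL_IDX = 1,
    MODEL_MMU_PHYS_IDX   = 2
};

typedef enum {
    MODEL_GET_ASI_HELPER = 0,
    MODEL_GET_ASI_DIRECT = 1      /* value is arbitrary in this model */
} ModelAsiType;

typedef struct {
    int mem_idx;
    ModelAsiType type;
} ModelAsi;

/*
 * Toy version of the switch: anything without a case keeps the current
 * mem_idx and goes through the slow helper path.
 */
ModelAsi model_resolve_asi(int current_mem_idx, int asi, bool with_v1_patch)
{
    ModelAsi r = { current_mem_idx, MODEL_GET_ASI_HELPER };

    assert(asi >= 0 && asi <= 0xff);
    switch (asi) {
    case MODEL_ASI_USERTXT:
        if (!with_v1_patch) {
            break;                /* stock behaviour: no case for UserText */
        }
        /* fall through */
    case MODEL_ASI_USERDATA:
        r.mem_idx = MODEL_MMU_USER_IDX;
        r.type = MODEL_GET_ASI_DIRECT;
        break;
    case MODEL_ASI_KERNELDATA:
        r.mem_idx = MODEL_MMU_KERNEL_IDX;
        r.type = MODEL_GET_ASI_DIRECT;
        break;
    default:
        break;                    /* includes KernelText: left untouched */
    }
    return r;
}
```

In this model, a kernel-mode lda with ASI 8 resolves to the user MMU index and a direct access only when with_v1_patch is true; otherwise it keeps the caller's mem_idx and the helper type, which is the situation the panic was traced to.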

I wonder, was there ever a reason not to include the UserText asi in resolve_asi()? Was it simply missed?

Part 2

This section is optional to read, but it further pinpoints details of how resolve_asi() implicitly handles the KernelText asi.

Adding a case statement for KernelText above KernelData caused a boot crash. This failed "v2" patch looked as follows:

diff --git a/target/sparc/translate.c b/target/sparc/translate.c
index 319934d9bd..84672bf474 100644
--- a/target/sparc/translate.c
+++ b/target/sparc/translate.c
@@ -1159,10 +1159,12 @@ static DisasASI resolve_asi(DisasContext *dc, int asi, MemOp memop)
                || (asi == ASI_USERDATA
                    && (dc->def->features & CPU_FEATURE_CASA))) {
         switch (asi) {
+        case ASI_USERTXT:    /* User text access */
         case ASI_USERDATA:   /* User data access */
             mem_idx = MMU_USER_IDX;
             type = GET_ASI_DIRECT;
             break;
+        case ASI_KERNELTXT:
         case ASI_KERNELDATA: /* Supervisor data access */
             mem_idx = MMU_KERNEL_IDX;
             type = GET_ASI_DIRECT;

The reason this crashed takes some explaining.

There is indeed an alternate space access to kernel text at boot time. OpenBIOS performs an lda on KernelText in /arch/sparc32/entry.S:360:

lda     [%g4] ASI_M_KERNELTXT, %g1

After studying the entry.S code and using QEMU's GDB features, I saw that this boot code uses the lda to copy from ROM (0xffd00000) to RAM. However, the v2 patch caused the lda to read all zeroes, and the subsequent jump into all zeroes causes an Illegal Instruction trap. But why? Let's find out :)

QEMU 8.2.92 monitor - type 'help' for more information
(qemu) qemu: fatal: Trap 0x02 (Illegal Instruction) while interrupts disabled, Error state
pc: ffd062a0  npc: ffd062a4
%g0-7: 00000000 00000001 ffd062a0 00002000 ffd5e004 ffd5e000 04000000 003ffe00
%o0-7: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
%l0-7: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
%i0-7: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
%f00:  0000000000000000 0000000000000000 0000000000000000 0000000000000000
%f08:  0000000000000000 0000000000000000 0000000000000000 0000000000000000
%f16:  0000000000000000 0000000000000000 0000000000000000 0000000000000000
%f24:  0000000000000000 0000000000000000 0000000000000000 0000000000000000
psr: 044000c0 (icc: -Z-- SPE: SP-) wim: 00000001
fsr: 00080000 y: 00000020

Here is a debug printout of the v2-patched QEMU's OpenBIOS-sparc32 loading a bad zero via lda:

Breakpoint 1, 0x00006258 in ?? ()
(gdb) x/i $pc
=> 0x6258:      lda  [ %g4 ] (9), %g1
(gdb) p/x $g1
$1 = 0x10000000
(gdb) si
0x0000625c in ?? ()
(gdb) p/x $g1
$2 = 0x0
(gdb)

Here is a debug printout of it successfully running, as normal (I refer to this as "stock qemu"):

Breakpoint 1, 0x00006258 in ?? ()
(gdb) x/i $pc
=> 0x6258:      lda  [ %g4 ] (9), %g1
(gdb) p/x $g4
$9 = 0xffd00000
(gdb) p/x $g1
$10 = 0x10000000
(gdb) si
0x0000625c in ?? ()
(gdb) p/x $g1
$11 = 0x108017c1

To understand what is happening, let's see how stock qemu is resolving this lda instruction inside resolve_asi(). I added custom debug printouts to detail the important variables. Simplified, it looks like the following in resolve_asi()

     ASIType type = GET_ASI_HELPER;
     int mem_idx = dc->mem_idx;

+    qemu_printf("IN: mem_idx: %d, asi: %d, type: %d\n", mem_idx, asi, type);

[...]

  done:
+    qemu_printf("OUT: mem_idx: %d, asi: %d, type: %d\n\n", mem_idx, asi, type);

In hindsight this should have been conditioned on the kerneltext asi only, so that it's not so spammy and slowing down the emulated system.
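
A minimal sketch of that filtered variant follows. It is standalone, so fprintf stands in for qemu_printf, and the function name trace_asi is hypothetical; the value 9 for the KernelText ASI is the one seen in the printouts in this report.

```c
#include <assert.h>
#include <stdio.h>

enum { TRACE_ASI_KERNELTXT = 0x09 };

/*
 * Print the resolve state only for the KernelText ASI.
 * Returns 1 if a line was written, 0 if the call was filtered out.
 */
int trace_asi(FILE *out, const char *tag, int mem_idx, int asi, int type)
{
    assert(out != NULL && tag != NULL);
    if (asi != TRACE_ASI_KERNELTXT) {
        return 0;   /* skip the common case to keep the log quiet */
    }
    fprintf(out, "%s: mem_idx: %d, asi: %d, type: %d\n",
            tag, mem_idx, asi, type);
    return 1;
}
```

Called as trace_asi(stderr, "IN", mem_idx, asi, type) at the top of the function and with "OUT" at the done label, this would log only the accesses of interest.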

So, the boot-time lda with asi 9 (kerneltext) causes these printouts in stock qemu now:

IN:  mem_idx: 2 (MMU_PHYS_IDX), asi: 9, type: 0 (GET_ASI_HELPER)
OUT: mem_idx: 2 (MMU_PHYS_IDX), asi: 9, type: 0 (GET_ASI_HELPER)

The IN and OUT entries are identical because stock QEMU's resolve_asi() has no code conditional on ASI_KERNELTXT.

Notice from the custom printout above that at boot time, during the lda on KernelText, MMU_PHYS_IDX and GET_ASI_HELPER are used. This is already quite different from how resolve_asi() handles the other relevant ASIs (KernelData, UserData, UserText).

But what about long after boot time? Well, the only known Solaris kernel code that uses an lda/sta on KernelText is in the Illegal Instruction trap handler. For that lda instruction to execute, there must be an illegal instruction in kernel-level code. Using QEMU's GDB debugging features, I hot-patched an instruction in the kernel to be illegal (e.g. 0x00000001), got it to execute, and hit the illegal instruction trap handler's lda instruction. It's worth noting that the value loaded by the lda was verified as accurate (in this case 0x00000001). My custom printout from resolve_asi() then showed:

IN:  mem_idx: 1 (MMU_KERNEL_IDX), asi: 9, type: 0 (GET_ASI_HELPER)
OUT: mem_idx: 1 (MMU_KERNEL_IDX), asi: 9, type: 0 (GET_ASI_HELPER)

This time, the mem_idx is different: it is MMU_KERNEL_IDX. The type is still GET_ASI_HELPER. So, as we can see, stock QEMU's resolve_asi() handles the KernelText ASI with different mem_idx values depending on circumstances (boot time versus later), and with a different type (GET_ASI_HELPER) than the other corresponding ASIs, which all use GET_ASI_DIRECT.

And that is why attempting to treat the Kernel Text like Kernel Data in resolve_asi() simply didn't work.

I experimented further with the KernelText treatment in resolve_asi(), such as changing only the type to GET_ASI_DIRECT, but that caused the same boot failure, stemming from the lda [ %g4 ] (9), %g1 at 0x6258 loading zeroes.

The only other combination worth trying is to use GET_ASI_HELPER with MMU_PHYS_IDX at boot, and GET_ASI_DIRECT with MMU_KERNEL_IDX thereafter. But since all of the KernelText alternate space instructions I encountered already work (using GET_ASI_HELPER), I don't deem that experiment worthwhile.

As a side note, I revisited the ASI_USERTXT case in resolve_asi() and had it use GET_ASI_HELPER rather than GET_ASI_DIRECT, as stock QEMU does for ASI_KERNELTXT. However, upon using a debugger, the nonparity synchronous error kernel panic described in this bug report occurred once again. Thus, after all this experimentation, I feel confident I have the right final values in the v1 patch.

If you made it this far, thank you for your time and concentration. You are a dedicated individual.

References

Solaris 8 Source Code
Sun4M System Architecture Manual
Sparcv8 Manual
Solaris Internals (newer OS, but it can help at times)

Additional information
