SMP: Race condition in x86 BIOS implementation

While extensively testing the new x86 BIOS implementation in QEMU (see #39), I noticed a race condition in the current SMP code. It was fairly hard to track down and occurred far less often under GDB, but after some time I managed to pin down where things break.

My "kernel" that I pass to mkbootimg is a simple assembly program that atomically increments a counter in the .data section and then waits until it reaches 4, a hardcoded constant matching the number of cores I use in my tests. Example code:

.text
.global kernel_main
kernel_main:
  lock incq core_id     # each core atomically checks in

wait:
  cmpq $4, core_id      # explicit 64-bit compare; spin until all 4 cores checked in
  jne wait
  hlt

.data
core_id: .quad 0

While this works smoothly most of the time (ending up with all 4 cores being halted), sometimes an exception happens at either 0x0e88 or 0x0e93. Using objdump -D -b binary -mi386 -Mintel,addr16,data16 --adjust-vma=0x800 dist/bootboot.bin I tracked this down to the following instructions:

     e7b:	0f 01 16 d0 31       	lgdtw  ds:0x31d0
     e80:	0f 20 c0             	mov    eax,cr0
     e83:	0c 01                	or     al,0x1
     e85:	0f 22 c0             	mov    cr0,eax
     e88:	ea 8d 0e 20 00       	jmp    0x20:0xe8d
     e8d:	66 b8 08 00 8e d8    	mov    eax,0xd88e0008
     e93:	8e c0                	mov    es,ax
     e95:	8e e0                	mov    fs,ax
     e97:	8e e8                	mov    gs,ax
     e99:	8e d0                	mov    ss,ax

This was the first clue that the issue is most likely related to the GDT, as it fails exactly on the instructions that reload the code or data segment registers. The QEMU interrupt debug message backs this theory up; here is an example where the startup failed at 0x0e88:

     0: v=0d e=0020 i=0 cpl=0 IP=0000:0000000000000e88 pc=0000000000000e88 SP=0000:0000000000000000 env->regs[R_EAX]=0000000060000011
EAX=60000011 EBX=00000000 ECX=00000000 EDX=00000663
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=00000e88 EFL=00000006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300 DPL=0 DS16 [-WA]
CS =0000 00000000 0000ffff 00009b00 DPL=0 CS16 [-RA]
SS =0000 00000000 0000ffff 00009300 DPL=0 DS16 [-WA]
DS =0000 00000000 0000ffff 00009300 DPL=0 DS16 [-WA]
FS =0000 00000000 0000ffff 00009300 DPL=0 DS16 [-WA]
GS =0000 00000000 0000ffff 00009300 DPL=0 DS16 [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000031a0 00000030
IDT=     00000000 0000ffff
CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000000 CCD=60000011 CCO=LOGICB  
EFER=0000000000000000
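To read the first line of that dump: `v=0d` is vector 13 (#GP), and `e=0020` is a selector-style error code. The following is a minimal Python sketch of how such an error code splits into fields; the helper name `decode_selector_error` is made up for illustration and is not part of QEMU or BOOTBOOT.

```python
def decode_selector_error(code):
    """Split an x86 selector-style exception error code into its fields."""
    return {
        "external": bool(code & 0x1),   # EXT: event was external to the program
        "idt": bool(code & 0x2),        # IDT: index refers to an IDT gate
        "ldt": bool(code & 0x4),        # TI: LDT instead of GDT (only if IDT=0)
        "index": (code >> 3) & 0x1FFF,  # descriptor table index
        "selector": code & 0xFFF8,      # selector with the low three bits cleared
    }

err = decode_selector_error(0x0020)
print(err)
```

The decoded fields point at GDT index 4, i.e. selector 0x20, which is the selector used by the far jump at 0x0e88.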

While this initially did not make much sense to me, the thread information provided by GDB eventually pointed me towards the culprit:

(gdb) info thread
  Id   Target Id                    Frame 
* 1    Thread 1.1 (CPU#0 [running]) 0xffffffffffe05af7 in wait ()
  2    Thread 1.2 (CPU#1 [running]) 0x00000000000022db in ?? ()
  3    Thread 1.3 (CPU#2 [running]) 0x0000000000000e88 in ?? ()
  4    Thread 1.4 (CPU#3 [running]) 0x0000000000000e88 in ?? ()

This reveals that the processors are in three different states:

The bootstrap processor (thread 1) has already exited bootboot and is now running the "kernel"
The first application processor (thread 2) is currently in bootboot_startcore
The second and third application processors (threads 3 and 4) are still in real_protmodefunc

Dumping the CR0 and EFER registers also shows these different states, with T1 and T2 in long mode while T3 and T4 are still stuck in protected mode, where reloading the segment registers failed:

Thread 1 (Thread 1.1 (CPU#0 [running])):
cr0            0xc0000011          [ PG CD ET PE ]
efer           0x500               [ LMA LME ]

Thread 2 (Thread 1.2 (CPU#1 [running])):
cr0            0xc0000011          [ PG CD ET PE ]
efer           0x500               [ LMA LME ]

Thread 3 (Thread 1.3 (CPU#2 [running])):
cr0            0x60000011          [ CD NW ET PE ]
efer           0x0                 [ ]

Thread 4 (Thread 1.4 (CPU#3 [running])):
cr0            0x60000011          [ CD NW ET PE ]
efer           0x0                 [ ]

After analyzing the BOOTBOOT code further, I think the problem is that while the application processors do spin until the BSP is ready (https://gitlab.com/bztsrc/bootboot/-/blob/c8580afba8aad636f7060e2140c2c110be6b30e8/x86_64-bios/bootboot.asm#L866-869), the BSP does not wait for all APs to reach the spinlock.

As a result, .nosmp sometimes overwrites the default GDT provided in the assembly file with the new 64-bit GDT before all APs have reached protected mode. An AP that is still initializing protected mode then triple faults when it tries to reload a segment register from a descriptor that has already been rewritten for 64-bit mode.
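One possible ordering that would avoid this race, sketched as a small Python threading model rather than as bootloader assembly: each AP checks in once it is past the mode switch, and the BSP only replaces the GDT after every AP has checked in. The `aps_in_protmode` counter and all names in this model are assumptions made for illustration; nothing like them exists in bootboot.asm as far as this report shows, and the real mode switch is of course not protected by a lock.

```python
import threading

NUM_CORES = 4  # assumption: same core count as in the tests above


class BootModel:
    def __init__(self, num_cores):
        self.num_aps = num_cores - 1
        self.gdt = "boot GDT"       # stands in for the GDT from the assembly file
        self.aps_in_protmode = 0    # hypothetical check-in counter
        self.bsp_ready = False
        self.cond = threading.Condition()
        self.seen = []              # which GDT each AP saw during its mode switch

    def ap(self):
        with self.cond:
            self.seen.append(self.gdt)   # AP reloads segments from the current GDT
            self.aps_in_protmode += 1    # AP reports it is past the mode switch
            self.cond.notify_all()
            # existing behaviour: wait until the BSP is ready
            self.cond.wait_for(lambda: self.bsp_ready)

    def bsp(self):
        with self.cond:
            # extra step in this sketch: wait for every AP before touching the GDT
            self.cond.wait_for(lambda: self.aps_in_protmode == self.num_aps)
            self.gdt = "64-bit GDT"
            self.bsp_ready = True
            self.cond.notify_all()


def run(num_cores=NUM_CORES):
    model = BootModel(num_cores)
    threads = [threading.Thread(target=model.bsp)]
    threads += [threading.Thread(target=model.ap) for _ in range(model.num_aps)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return model


model = run()
print(model.seen, model.gdt)
```

In this model no AP ever observes the 64-bit GDT during its mode switch, regardless of thread scheduling. Whether a counter like this fits the current architecture is a separate question.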

A memory dump of the GDT also confirms this:

(gdb) x /13xw 0x31a0
0x31a0:	0x00000000	0x00000000	0x0000ffff	0x00209800
0x31b0:	0x0000ffff	0x00809300	0x00000068	0x00008900
0x31c0:	0x00000000	0x00000000	0x00000068	0x00cf8900
0x31d0:	0x31a00030
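To make the dump easier to interpret, here is a minimal Python sketch that decodes each 8-byte slot using the standard x86 segment descriptor layout. The dwords are copied from the dump above; the helper `decode_descriptor` and its field names are made up for illustration.

```python
def decode_descriptor(low, high):
    """Decode one 8-byte x86 segment descriptor given as two little-endian dwords."""
    access = (high >> 8) & 0xFF
    flags = (high >> 20) & 0xF
    return {
        "base": (low >> 16) | ((high & 0xFF) << 16) | (high & 0xFF000000),
        "limit": (low & 0xFFFF) | (high & 0x000F0000),
        "present": bool(access & 0x80),
        "system": not (access & 0x10),      # S=0: system descriptor (e.g. TSS)
        "executable": bool(access & 0x08),  # only meaningful when S=1
        "type": access & 0x0F,
        "long": bool(flags & 0x2),          # L: 64-bit code segment
        "db": bool(flags & 0x4),            # D/B: 32-bit default size
        "granular": bool(flags & 0x8),      # G: limit counted in 4 KiB pages
    }


# Dwords from the `x /13xw 0x31a0` dump, two per 8-byte slot (0x31a0..0x31cf).
dump = [0x00000000, 0x00000000, 0x0000ffff, 0x00209800,
        0x0000ffff, 0x00809300, 0x00000068, 0x00008900,
        0x00000000, 0x00000000, 0x00000068, 0x00cf8900]

# Map selector -> decoded descriptor; a selector is a byte offset, a dword is 4 bytes.
gdt = {sel: decode_descriptor(dump[sel // 4], dump[sel // 4 + 1])
       for sel in range(0, 0x30, 8)}

for sel, desc in gdt.items():
    print(hex(sel), desc)
```

Read this way, the slot at selector 0x20 is not present (all zero), which matches the #GP with error code 0x20 on the far jump. The slot at selector 0x08 is a code segment with the L bit set and type 0x8 (execute-only), which cannot be loaded into a data segment register. The zero slot at 0x20 looks like the upper half of a 16-byte 64-bit TSS descriptor starting at 0x18, though that is an inference from the dump and not something verified against the source.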

I am not exactly sure what the best way to fix this within the current architecture is, but I hope this information is helpful to you. Feel free to ask at any time if I should provide more data or test any images.

Last but not least, one extra question came up on my side while analyzing BOOTBOOT: I noticed that DATA_PROT in bootboot.asm is initialized as 0x0000FFFF 0x008F9200. Shouldn't the second dword be 0x00CF9200 to signal a 32-bit data segment by setting the Sz (D/B) flag?
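For reference, the two candidate values differ only in bit 22 of the second dword. A minimal Python sketch of the flag nibble (bits 20 to 23), with a made-up helper name `segment_flags`:

```python
def segment_flags(high_dword):
    """Extract the flag nibble (bits 20-23) from a descriptor's second dword."""
    flags = (high_dword >> 20) & 0xF
    return {
        "G": bool(flags & 0x8),   # granularity: limit in 4 KiB units
        "DB": bool(flags & 0x4),  # D/B ("Sz"): 32-bit default size
        "L": bool(flags & 0x2),   # long-mode code segment
    }


current = segment_flags(0x008F9200)
proposed = segment_flags(0x00CF9200)
print(current, proposed)
```

Both values set G, and only 0x00CF9200 sets D/B. For a data segment, that bit matters mainly when the segment is used as a stack segment (SP versus ESP) or is expand-down, so whether BOOTBOOT depends on it is not settled by this decoding alone.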

Edited Feb 20, 2021 by Pascal Mathis
