SMP: Race condition in x86 BIOS implementation
While extensively testing the new x86 BIOS implementation in QEMU (see #39), I noticed a race condition in the current SMP code. It was fairly hard to track down and occurred far less often under GDB, but after some time I managed to pin down where things seem to break.
My "kernel" that I'm passing to mkbootimg is a simple assembly file which atomically increments a counter in the .data section and then waits until it reaches 4, a hardcoded constant for the number of cores I'm using during my tests. Example code:
    .text
    .global kernel_main
kernel_main:
    lock incq core_id       # each core atomically registers itself
wait:
    cmpq $4, core_id        # spin until all 4 cores have checked in
    jne wait
    hlt
    .data
core_id: .quad 0
While this works smoothly most of the time (ending up with all 4 cores being halted), sometimes an exception happens at either 0x0e88 or 0x0e93. Using objdump -D -b binary -mi386 -Mintel,addr16,data16 --adjust-vma=0x800 dist/bootboot.bin I tracked this down to the following instructions:
e7b: 0f 01 16 d0 31 lgdtw ds:0x31d0
e80: 0f 20 c0 mov eax,cr0
e83: 0c 01 or al,0x1
e85: 0f 22 c0 mov cr0,eax
e88: ea 8d 0e 20 00 jmp 0x20:0xe8d
e8d: 66 b8 08 00 8e d8 mov eax,0xd88e0008
e93: 8e c0 mov es,ax
e95: 8e e0 mov fs,ax
e97: 8e e8 mov gs,ax
e99: 8e d0 mov ss,ax
This was the first clue that the issue is most likely related to the GDT, since it fails exactly on the instructions that reload the code or data segment registers. The QEMU interrupt debug message backs this theory up. In the following example the startup failed at 0x0e88 with a general protection fault (v=0d) whose error code (e=0020) names selector 0x20, the target of the far jump:
0: v=0d e=0020 i=0 cpl=0 IP=0000:0000000000000e88 pc=0000000000000e88 SP=0000:0000000000000000 env->regs[R_EAX]=0000000060000011
EAX=60000011 EBX=00000000 ECX=00000000 EDX=00000663
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=00000e88 EFL=00000006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300 DPL=0 DS16 [-WA]
CS =0000 00000000 0000ffff 00009b00 DPL=0 CS16 [-RA]
SS =0000 00000000 0000ffff 00009300 DPL=0 DS16 [-WA]
DS =0000 00000000 0000ffff 00009300 DPL=0 DS16 [-WA]
FS =0000 00000000 0000ffff 00009300 DPL=0 DS16 [-WA]
GS =0000 00000000 0000ffff 00009300 DPL=0 DS16 [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000031a0 00000030
IDT= 00000000 0000ffff
CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000000 CCD=60000011 CCO=LOGICB
EFER=0000000000000000
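The `v=0d e=0020` part of the log line can be decoded by hand, but a small helper makes the reading explicit. This is a minimal sketch, assuming the standard x86 selector-format error code layout (EXT in bit 0, IDT in bit 1, TI in bit 2, index in bits 3-15); the function name is made up for illustration and is not part of QEMU or BOOTBOOT.

```python
def decode_selector_error_code(code):
    """Split a selector-format x86 exception error code into its fields."""
    return {
        "ext": bool(code & 0x1),        # EXT: caused by an external event
        "idt": bool(code & 0x2),        # IDT: index refers to the IDT
        "ti": bool(code & 0x4),         # TI: 1 = LDT, 0 = GDT (when IDT is 0)
        "index": (code >> 3) & 0x1FFF,  # descriptor table index
        "selector": code & 0xFFF8,      # selector with the low 3 bits cleared
    }

# Error code taken from the QEMU log line above (e=0020).
fields = decode_selector_error_code(0x0020)
print(fields)
```

With IDT and TI both clear, the fault refers to GDT entry 4, which is the selector 0x20 used by the far jump at 0x0e88.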
While this initially did not make much sense to me, the thread information provided by GDB eventually pointed me towards the culprit:
(gdb) info thread
Id Target Id Frame
* 1 Thread 1.1 (CPU#0 [running]) 0xffffffffffe05af7 in wait ()
2 Thread 1.2 (CPU#1 [running]) 0x00000000000022db in ?? ()
3 Thread 1.3 (CPU#2 [running]) 0x0000000000000e88 in ?? ()
4 Thread 1.4 (CPU#3 [running]) 0x0000000000000e88 in ?? ()
This reveals that the processors are in three different states:
- The bootstrap processor (thread 1) has already exited bootboot and is now running the "kernel"
- The first application processor (thread 2) is currently in bootboot_startcore
- The second and third application processors (threads 3+4) are still in real_protmodefunc
Dumping the CR0 and EFER registers also shows these different states: T1+T2 are in long mode, while T3+T4 are still stuck in protected mode, where reloading the segment registers failed:
Thread 1 (Thread 1.1 (CPU#0 [running])):
cr0 0xc0000011 [ PG CD ET PE ]
efer 0x500 [ LMA LME ]
Thread 2 (Thread 1.2 (CPU#1 [running])):
cr0 0xc0000011 [ PG CD ET PE ]
efer 0x500 [ LMA LME ]
Thread 3 (Thread 1.3 (CPU#2 [running])):
cr0 0x60000011 [ CD NW ET PE ]
efer 0x0 [ ]
Thread 4 (Thread 1.4 (CPU#3 [running])):
cr0 0x60000011 [ CD NW ET PE ]
efer 0x0 [ ]
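The flag names GDB prints can be cross-checked against the raw values. The sketch below assumes the architectural bit positions for CR0 (PE=0, ET=4, NW=29, CD=30, PG=31) and EFER (LME=8, LMA=10); the tables are deliberately partial and the helper names are hypothetical.

```python
# Partial tables of architectural bit positions (only the common flags).
CR0_BITS = {31: "PG", 30: "CD", 29: "NW", 16: "WP", 5: "NE", 4: "ET", 0: "PE"}
EFER_BITS = {11: "NXE", 10: "LMA", 8: "LME", 0: "SCE"}

def flag_names(value, table):
    """Return the names of all set bits, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if value & (1 << bit)]

def cpu_mode(cr0, efer):
    """Classify the operating mode from CR0.PE and EFER.LMA."""
    if not cr0 & 1:
        return "real mode"
    if efer & (1 << 10):
        return "long mode"
    return "protected mode"

# Values copied from the GDB dump (CPU#0 and CPU#2).
for cpu, cr0, efer in [(0, 0xC0000011, 0x500), (2, 0x60000011, 0x0)]:
    print(cpu, flag_names(cr0, CR0_BITS), flag_names(efer, EFER_BITS),
          cpu_mode(cr0, efer))
```

CPU#2 has PE set but neither PG nor LMA, so it has entered protected mode and not progressed any further.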
After analyzing the code of BOOTBOOT further, I think the problem is that while the application processors do spin until the BSP is ready (https://gitlab.com/bztsrc/bootboot/-/blob/c8580afba8aad636f7060e2140c2c110be6b30e8/x86_64-bios/bootboot.asm#L866-869), the BSP does not wait for all APs to reach the spinlock.
As a result, .nosmp sometimes overwrites the default GDT provided in the assembly file with the new 64-bit GDT before all APs have reached protected mode. This is the race condition: an AP tries to initialize protected mode and then triple faults when it attempts to reload a segment which has already been reconfigured for 64-bit.
A memory dump of the GDT also confirms this:
(gdb) x /13xw 0x31a0
0x31a0: 0x00000000 0x00000000 0x0000ffff 0x00209800
0x31b0: 0x0000ffff 0x00809300 0x00000068 0x00008900
0x31c0: 0x00000000 0x00000000 0x00000068 0x00cf8900
0x31d0: 0x31a00030
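To make the dump easier to read, the dwords can be decoded as 8-byte descriptors. This is a rough sketch under two assumptions: the standard legacy descriptor layout (access byte in bits 8-15 of the high dword, flags in bits 20-23), and every slot treated as an independent 8-byte entry, even though a long-mode TSS descriptor spans 16 bytes. The helper name is hypothetical.

```python
# Dwords copied from the `x /13xw 0x31a0` dump, in memory order
# (the 13th dword is the start of the GDT pointer and is left out).
GDT_WORDS = [
    0x00000000, 0x00000000, 0x0000FFFF, 0x00209800,
    0x0000FFFF, 0x00809300, 0x00000068, 0x00008900,
    0x00000000, 0x00000000, 0x00000068, 0x00CF8900,
]

def decode_descriptor(low, high):
    """Decode the access byte and flags of one 8-byte descriptor.

    The low dword only carries limit/base bits, which are not needed here.
    """
    access = (high >> 8) & 0xFF
    flags = (high >> 20) & 0xF
    return {
        "present": bool(access & 0x80),     # P bit
        "segment": bool(access & 0x10),     # S=1: code/data, S=0: system
        "executable": bool(access & 0x08),  # only meaningful when S=1
        "type": access & 0x0F,
        "long": bool(flags & 0x2),          # L: 64-bit code segment
        "db": bool(flags & 0x4),            # D/B: 32-bit default size
        "granular": bool(flags & 0x8),      # G: limit in 4 KiB units
    }

# Key each entry by its selector value (index * 8).
gdt = {i * 4: decode_descriptor(GDT_WORDS[i], GDT_WORDS[i + 1])
       for i in range(0, len(GDT_WORDS), 2)}

for selector, entry in gdt.items():
    print(hex(selector), entry)
```

Read this way, selector 0x08 is now a 64-bit code segment and selector 0x20 is not present (it holds the zeroed upper half of the 16-byte TSS descriptor at 0x18). That matches the #GP with error code 0x20 on the far jump. It would also explain the fault on the data segment loads, assuming the bytes at 0xe8d execute as 32-bit code, where they decode as mov ax,0x8 followed by mov ds,ax.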
I am not exactly sure what the best way to fix this within the current architecture is, but I hope this information is helpful to you. Feel free to ask at any time if I should provide more data and/or test any images.
Last but not least, one extra question came up on my side while analyzing BOOTBOOT: I noticed that DATA_PROT in bootboot.asm is initialized as 0x0000FFFF 0x008F9200. Shouldn't the second dword be 0x00CF9200, so that the Sz flag marks it as a 32-bit data segment?
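The difference between the two candidate values comes down to a single bit in the flags nibble. A minimal sketch, assuming the standard descriptor layout, where bits 20-23 of the high dword hold AVL, L, D/B (the "Sz" flag) and G; the function name is hypothetical.

```python
def descriptor_flags(high_dword):
    """Extract the flags nibble (bits 20-23) of a descriptor's high dword."""
    flags = (high_dword >> 20) & 0xF
    return {
        "G": bool(flags & 0x8),    # granularity: limit counted in 4 KiB pages
        "D/B": bool(flags & 0x4),  # default size: 1 = 32-bit, 0 = 16-bit
        "L": bool(flags & 0x2),    # 64-bit code segment
        "AVL": bool(flags & 0x1),  # available for software use
    }

current = descriptor_flags(0x008F9200)   # value found in bootboot.asm
proposed = descriptor_flags(0x00CF9200)  # value suggested in the question
print("current :", current)
print("proposed:", proposed)
```

Both values set G, so the limit covers the full 4 GiB either way; they differ only in D/B. For a data segment that bit mainly matters when the selector is loaded into SS, where it selects between SP and ESP as the stack pointer.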