I still need confirmation but I think it's coming from 915ecbc4.
The symptom is a sudden freeze of everything: can't ssh into the machine, can't switch VT, SysRq keys don't work, and the only way out is a power reset.
It's not 100% reproducible; it happens after some time during a game or a compilation.
I first checked whether it was an upstream kernel regression by downgrading Linux 5.7.12 to 5.7.10 while keeping the same v5.7-r3 patch, but I still had the issue.
I removed the GPU OC/UV but still the same.
I don't have any CPU OC.
I had no crash for a while with Linux 5.8, but right after patching it with Project-C v5.8-r0 the same type of crash happened less than 5 minutes into a game.
Nothing was present in dmesg/journal.
Please try reverting 915ecbc4 on a 5.7.x kernel with the v5.7-r3 patch, and see if your issue goes away.
As the release note says, I don't have real hardware to test 915ecbc4 on; I only ran a simple kernel boot test using qemu to simulate the CPU topology it needs, so there may be stability issues with this change during long runs on real hardware.
I was asking because it complained about a reversed or previously applied patch on some files, but I ignored those errors and it worked: 0009-prjc_v5.7-r3-no_wake-list.patch
Now I've been playing for an hour and no crash so far.
If I still don't get any crashes after a while, where do we go from there?
If you want to help and join the investigation, there will be some steps to continue.
If reverting wake_list works on 5.7, then I'd like to move our investigation to 5.8, as it would be nice to find a solution on the latest code base.
First, I'd provide a simple patch on top of 5.8-r0 that disables the wake list in the ttwu code path; please verify that it works for you.
Then, I'd provide some patches that re-enable the wake list piece by piece, so we can verify which code path causes your issue.
We will see what we can do once we find out more.
Sounds good. Yes, I'd be more than happy to help.
I'll wait some more time to make sure the freeze is indeed coming from the wake_list but I'm away from the computer for a bit more than a week.
I will ping you when I'm sure it's coming from the wake_list.
Sure. I also want to wait for v5.8.1 or v5.8.2: there are some known issues in mainline v5.8, and hopefully they will be fixed in the .1 or .2 release, which will give us a clean base for debugging.
Just for the record, @Kodehawa is having the same kind of crash. It disappeared when he stopped using BMQ on linux 5.8 but after I told him to try again with the new BMQ release, he started hard freezing again.
Here is the patch, to be applied on top of the v5.8_r0 Project C patch, that disables the wake list functionality. @Kodehawa @terencode can give it a try. But first, I'd like @terencode to confirm that the revert patch works for him on v5.7, which would prove that the wake list is the cause of this issue.
I confirm that not applying 00_v5.8_r0_disable_wake_list.patch made the computer crash the same way, only one day after rebuilding the kernel. This happened during heavy multitasking.
For reference, I never got any crash with the patch applied (tested for 2 weeks).
Also in your patch you added printk(KERN_INFO "sched: sched_ttwu_pending\n"); but my log level was set to 3 so it never got displayed. Let me know if you want me to change it so it's shown.
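For anyone wondering why that message would not show up: as far as I understand it, the kernel only prints a message to the console when its level is numerically lower than the console loglevel. Here is a minimal sketch of that rule in Python, assuming "log level 3" means the console loglevel (first field of /proc/sys/kernel/printk). The helper names are made up for illustration; the message should still land in the ring buffer regardless.

```python
from pathlib import Path

# Standard kernel log levels (lower number = more severe).
KERN_LEVELS = {
    "KERN_EMERG": 0, "KERN_ALERT": 1, "KERN_CRIT": 2, "KERN_ERR": 3,
    "KERN_WARNING": 4, "KERN_NOTICE": 5, "KERN_INFO": 6, "KERN_DEBUG": 7,
}


def shown_on_console(msg_level, console_loglevel):
    """Hypothetical helper: a message reaches the console only if its
    level is strictly lower than the console loglevel."""
    return msg_level < console_loglevel


def read_console_loglevel(path="/proc/sys/kernel/printk"):
    """Return the current console loglevel, or None if it can't be read."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    level = read_console_loglevel()
    if level is None:
        print("could not read /proc/sys/kernel/printk")
    else:
        visible = shown_on_console(KERN_LEVELS["KERN_INFO"], level)
        print(f"console loglevel {level}: KERN_INFO on console = {visible}")
```

With a console loglevel of 3, only levels 0 to 2 reach the console, so a KERN_INFO printk stays hidden there.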
@terencode With 00_v5.8_r0_disable_wake_list.patch, the additional printk should not be hit.
Here comes 01_v5.8_r0_disable_on_cpu_wake_list.patch. There are two places that call ttwu_queue_wakelist(), and both are disabled in 00_v5.8_r0_disable_wake_list.patch. This patch, 01_v5.8_r0_disable_on_cpu_wake_list.patch, enables one of them, and I believe this should be the major code path.
The additional printk is removed from the 01 patch, because it would produce a lot of output once one of the ttwu_queue_wakelist() code paths is enabled.
Yes, we need to know which code path is problematic.
I will prepare a patch which provides more protection in the wakelist ttwu code path. My best guess is that this issue is caused by a very rare condition in BMQ/PDS.
Now that you mention it, I also had the same issue when I started using pds more than a year ago but thought it was a hardware defect. After changing my CPU and motherboard (upgrading from ryzen 1600x to 3600) I still had the same issue so I just gave up on using pds and started using bmq as at the time it didn't have this issue.
@leucome You can check the 00_v5.8_r0_disable_wake_list.patch here and see if it solves your issue.
@terencode For now, I can't think of any other cause in this code path that could freeze your system. I'd need to build a debug patch that simulates cores with different LLCs to enable all these code paths on my system. It will take time to try to reproduce it here. Please use 00_v5.8_r0_disable_wake_list.patch as a temporary workaround.
Just pushed e43a5a5c. This commit is not aimed at the issue reported here, but it is in the same wake-up code path. So I'd suggest using this latest code and re-testing. Thanks.
It was still freezing sometimes with e43a5a5c. So I went back to your original suggestion.
I built one with e43a5a5c and 00_v5.8_r0 applied, no issue since then.
@leucome Thanks for testing anyway.
@terencode BTW, I have built a debug kernel which enables the ttwu_queue_wakelist code path. I'll see whether any stability issue occurs.
I just want to confirm that I encountered two hard and complete freezes on a Ryzen 3900X with both PDS and BMQ. Once when using a VBox VM, and once in normal system use.
Also, I did NOT encounter any crashes on an Intel Kaby Lake processor, might be related to Zen only?
Unfortunately the crash was so "hard" that I didn't get sufficient log entries.
However, I did see the following appear:
kernel: softirq: huh, entered softirq 7 SCHED [something]
and several
kernel: BUG: scheduling while atomic:
including RCU stalls.
I will now try with your patch and report back.
EDIT: I found the original bug message again, it's here.
It's hard to tell. The ttwu_queue_wakelist code path is only enabled on systems with multiple LLCs, and recent AMD CPUs most likely have multiple LLCs. I forced this code path on between real cores of an Intel CPU, but had no luck reproducing the issue.
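To check whether a given machine has more than one LLC, something like the following Python sketch might help. It assumes the usual Linux sysfs cache layout (cpuN/cache/indexM/level and shared_cpu_list) and treats the highest cache level a CPU reports as its LLC. The function names are only for illustration and are not part of any patch here.

```python
from pathlib import Path


def group_by_llc(shared_lists):
    """Group CPU ids by the shared_cpu_list string of their last-level cache.

    CPUs that report the same string share one LLC."""
    groups = {}
    for cpu, shared in shared_lists.items():
        groups.setdefault(shared.strip(), []).append(cpu)
    return {key: sorted(cpus) for key, cpus in groups.items()}


def read_llc_shared_lists(root="/sys/devices/system/cpu"):
    """Return {cpu id: shared_cpu_list of the highest-level cache}.

    Assumption: the highest 'level' found under cache/index* is the LLC."""
    result = {}
    for cpu_dir in Path(root).glob("cpu[0-9]*"):
        best_level, best_shared = -1, None
        for idx in cpu_dir.glob("cache/index[0-9]*"):
            try:
                level = int((idx / "level").read_text())
                shared = (idx / "shared_cpu_list").read_text()
            except (OSError, ValueError):
                continue  # offline CPU or unreadable entry: skip it
            if level > best_level:
                best_level, best_shared = level, shared
        if best_shared is not None:
            result[int(cpu_dir.name[3:])] = best_shared
    return result


if __name__ == "__main__":
    groups = group_by_llc(read_llc_shared_lists())
    print(f"{len(groups)} LLC domain(s) found")
    for shared, cpus in sorted(groups.items(), key=lambda kv: kv[1]):
        print(f"  CPUs {cpus} share LLC (shared_cpu_list={shared})")
```

If this reports more than one LLC domain, the machine is of the kind where the wake list path would be used between cores in different domains.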
@HougeLangley, did you add the wakelist patch as well?
Alfred, I did a test with r3+wakelist patch, running a heavy compilation process, followed by a long idle period. No crash this time, maybe too early to tell, but it looks promising.
I had rare freezes with bmq, but suspected other things. Sometimes it didn't occur for days and other times quite soon after reboot (usually during idle time at night). Couldn't debug it, nothing in logs or on screen, even with verbose kernel. Anyway, I'm testing pds-r3 with 00_v5.8_r0_disable_wake_list.patch and so far second day looks good. I tried patch 02 before and got a freeze.
Based on the information and benchmarks provided by users, changes will be made to disable the wake_list path in the Project C schedulers; the code will be kept in case it is turned on again in the future.