Kernel DRM breaks on some phosh crashes and refuses to give it display access
Sometimes when phosh crashes, the kernel DRM seems to break and phosh can never be relaunched, because it can't get display access again. When that happens, the screen stays frozen on the last frame shown before phosh went down. SSH and everything else works fine, but as far as I know it's not possible to relaunch phosh without a reboot.
Here is the situation right after a phosh crash, with some part of the session still stuck running:
dmesg (doesn't show anything interesting):
[ 296.336076] elogind-daemon[3177]: Watching system buttons on /dev/input/event6 (Bluetooth keyboard Consumer Control)
[ 296.346573] elogind-daemon[3177]: Watching system buttons on /dev/input/event7 (Bluetooth keyboard System Control)
[ 296.347178] elogind-daemon[3177]: Watching system buttons on /dev/input/event5 (Bluetooth keyboard Keyboard)
[ 322.223520] alloc_contig_range: [b0e00, b106b) PFNs busy
[ 322.223756] alloc_contig_range: [b0f00, b116b) PFNs busy
[ 325.473930] alloc_contig_range: [b0e00, b106b) PFNs busy
[ 325.474161] alloc_contig_range: [b0f00, b116b) PFNs busy
[ 330.648017] alloc_contig_range: [b0e00, b1190) PFNs busy
[ 330.648387] alloc_contig_range: [b0f00, b1290) PFNs busy
[ 330.774572] alloc_contig_range: [b0e00, b1190) PFNs busy
[ 330.774853] alloc_contig_range: [b0f00, b1290) PFNs busy
[ 332.505612] alloc_contig_range: [b0e00, b106b) PFNs busy
[ 332.505920] alloc_contig_range: [b0f00, b116b) PFNs busy
[ 393.008399] alloc_contig_range: [b0e00, b1190) PFNs busy
[ 393.008951] alloc_contig_range: [b0f00, b1290) PFNs busy
[ 403.095029] alloc_contig_range: [b0e00, b106b) PFNs busy
[ 403.095323] alloc_contig_range: [b0f00, b116b) PFNs busy
[ 463.411703] alloc_contig_range: [b0e00, b1190) PFNs busy
[ 463.415560] alloc_contig_range: [b0f00, b1290) PFNs busy
[ 536.430685] alloc_contig_range: [b0e00, b106b) PFNs busy
[ 536.430939] alloc_contig_range: [b0f00, b116b) PFNs busy
[ 543.189138] alloc_contig_range: [b0e00, b1190) PFNs busy
[ 543.189365] alloc_contig_range: [b0f00, b1290) PFNs busy
[ 711.607363] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[ 1844.019728] elogind-daemon[3177]: Suspending system...
[ 1844.019782] PM: suspend entry (s2idle)
[ 1844.024282] Filesystems sync: 0.004 seconds
[ 3881.797537] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[12301.094596] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[19184.426138] rfkill: input handler enabled
phosh & phoc are dead:
$ ps aux | grep lightdm
3137 root 0:00 supervise-daemon lightdm --start /usr/bin/lightdm --
9029 root 0:00 /usr/bin/lightdm
9035 user 0:00 grep lightdm
$ ps aux | grep phosh
9037 user 0:00 grep phosh
$ ps aux | grep phoc
9039 user 0:00 grep phoc
$ ps aux | grep session
9102 root 0:00 lightdm --session-child 11 14
9108 user 0:00 grep session
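The checks above can be sketched as one small helper. This is only a sketch: the function name `check_procs` is made up for illustration, and it assumes `pgrep` is available on the device. Unlike `ps aux | grep`, `pgrep` does not match its own command line, so the stray `grep phosh` entries seen above do not show up.

```shell
#!/bin/sh
# Hypothetical helper: report whether each named process is running.
# The process names are the ones checked by hand in this report.
check_procs() {
    for name in "$@"; do
        # -x matches the process name exactly instead of as a substring
        if pgrep -x "$name" >/dev/null 2>&1; then
            echo "$name: running"
        else
            echo "$name: not running"
        fi
    done
}

check_procs lightdm phoc phosh
```

In the state shown above, this would be expected to report lightdm as running and phoc and phosh as not running.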
.xsession-errors:
$ cat ~/.xsession-errors
$
It really looks like nothing is even trying to restart phosh, while something keeps the session alive, exactly as @afontain suggested.
So now let me try to kill the session:
$ sudo kill 9102
[sudo] password for user:
$ ps aux | grep session
9117 user 0:00 grep session
$ ps aux | grep phoc
9119 user 0:00 grep phoc
$ ps aux | grep phosh
9121 user 0:00 grep phosh
$ ps aux | grep lightdm
3137 root 0:00 supervise-daemon lightdm --start /usr/bin/lightdm --
9110 root 0:00 /usr/bin/lightdm
9123 user 0:00 grep lightdm
Let's look at .xsession-errors again:
$ cat ~/.xsession-errors
(phoc:9063): phoc-wlroots-CRITICAL **: 21:34:33.759: [backend/session/logind.c:161] Failed to get session path: Operation timed out
(phoc:9063): phoc-wlroots-CRITICAL **: 21:34:33.760: [backend/session/direct-ipc.c:35] Do not have CAP_SYS_ADMIN; cannot become DRM master
(phoc:9063): phoc-wlroots-CRITICAL **: 21:34:33.760: [backend/session/session.c:96] Failed to load session backend
(phoc:9063): phoc-wlroots-CRITICAL **: 21:34:33.760: [backend/backend.c:195] failed to start a session
(phoc:9063): phoc-wlroots-CRITICAL **: 21:34:33.760: [backend/backend.c:235] failed to start backend 'drm'
(phoc:9063): phoc-server-ERROR **: 21:34:33.761: Could not create backend
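Read in order, the log suggests that phoc first tried the logind session backend, which timed out, and then fell back to the direct backend, which failed for lack of CAP_SYS_ADMIN. A small sketch for pulling just those messages out of the log follows. The function name `critical_messages` is hypothetical, and the pattern assumes the line format shown above.

```shell
#!/bin/sh
# Hypothetical helper: list the wlroots CRITICAL messages in log order,
# stripped of the process/timestamp prefix.
critical_messages() {
    # $1: path to a log such as ~/.xsession-errors
    # Keep only the text that follows the "[file.c:line] " source location.
    grep 'phoc-wlroots-CRITICAL' "$1" | sed 's/^.*\] //'
}

log="$HOME/.xsession-errors"
if [ -f "$log" ]; then
    critical_messages "$log"
fi
```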
dmesg shows no new messages at this point, so it gives no hint about what is going on.
So the kernel won't let it get graphics access again, which seems bad. I think this is worth fixing: when on the go, e.g. on public transportation with no other device at hand to SSH in from, the device becomes unusable until it is hard reset.