ps 4.0.2 segfault printing ID_RUSER
In some background automation tasks, I've recently seen a handful of segfaults from ps. Here is an example:
$ ps -w -w -A -o ruser=user -o pid=pid -o ppid=ppid -o pgid=pgrp -o tpgid=tpgid -o nice=nice -o start_time=start -o vsz=size -o rss=rss -o state=state -o etime=etime -o time=time -o %cpu=pctcpu -o command=command
...
root 346491 1 346491 -1 10 18:01 4539004 126648 S 00:12 00:00:03 28.8 /usr/lib/x86_64-linux-gnu/libexec/drkonqi-coredump-processor d26d01f4fb8e46cfb67b0ce0acb53f2a 12060-346483-0
root 346520 2 0 -1 0 18:01 0 0 I 00:09 00:00:00 0.0 [kworker/3:0-events]
root 346576 1 346576 -1 10 18:01 4542120 127596 S 00:08 00:00:04 52.1 /usr/lib/x86_64-linux-gnu/libexec/drkonqi-coredump-processor d26d01f4fb8e46cfb67b0ce0acb53f2a 12061-346574-0
achutina 346599 257911 346599 -1 0 18:01 96596 13596 S 00:05 00:00:00 0.7 /usr/bin/pulseaudio --daemonize=no --log-target=journal
root 346643 1 346643 -1 9 18:01 492400 133204Signal 11 (SEGV) caught by ps (4.0.2).
R 00:04 00:00:04 99.5 (coredump)
root 346644 1 346644 -1 10 18:01 4542120 127364 S 00:04 00:00:03 84.6 /usr/lib/x86_64-linux-gnu/libexec/drkonqi-coredump-processor d26d01f4fb8e46cfb67b0ce0acb53f2a 12062-346642-0
jaadm 346687 344322 344322 -1 0 18:01 94648 22092 S 00:01 00:00:00 0.5 host fshome04ah
jaadm 346708 344133 346708 -1 0 18:01 97440 15200 S 00:00 00:00:00 5.8 /usr/bin/pulseaudio --daemonize=no --log-target=journal
achutina 346766 258027 258026 -1 0 18:01 17328 4440 R 00:00 00:00:00 300 ps -w -w -A -o ruser=user -o pid=pid -o ppid=ppid -o pgid=pgrp -o tpgid=tpgid -o nice=nice -o start_time=start -o vsz=size -o rss=rss -o state=state -o etime=etime -o time=time -o %cpu=pctcpu -o command=command
ps:src/ps/display.c:71: please report this bug
This is on Debian 12 (current stable), which currently has procps 4.0.2.
I cannot reproduce this on demand, but there have been a few instances in my background automation (a distributed testing tool) which have generated core dumps on this machine. It wouldn't surprise me if there is some intermittent machine-specific issue that is causing unexpected data, but regardless of the cause it seems like procps isn't handling this condition.
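Since the crash is intermittent, one way to try to catch it outside the automation might be a simple retry loop. The following is a minimal sketch, not something I have confirmed triggers the bug: run_until_failure and MAX_ATTEMPTS are hypothetical names, and it assumes ps exits with a nonzero status when it crashes (typically 139 in common shells if the handler re-raises SIGSEGV, as the backtrace suggests).

```shell
#!/bin/sh
# Hypothetical helper: rerun a command until it fails.
# Returns the failing exit status, or 0 if the command never failed
# within MAX_ATTEMPTS runs (default 1000).
run_until_failure() {
    max_attempts=${MAX_ATTEMPTS:-1000}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Discard normal output; only the exit status matters here.
        "$@" >/dev/null 2>&1
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "attempt $attempt: exit status $status" >&2
            return "$status"
        fi
        attempt=$((attempt + 1))
    done
    return 0
}

# Example (assumption: the ruser column alone is enough to hit the crash):
#   run_until_failure ps -A -o ruser=user -o pid=pid
```

Narrowing the format list to ruser plus pid would also help confirm whether the other columns are involved at all.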
In the two core dumps I've looked at, the stack trace is identical:
(gdb) bt
#0 0x00007fcde54b1267 in __GI_kill () at ../sysdeps/unix/syscall-template.S:120
#1 0x00005565a5c52e3b in signal_handler (signo=11) at src/ps/display.c:76
#2 <signal handler called>
#3 escape_str (dst=dst@entry=0x7fcde4fa1090 "", src=0x1 <error: Cannot access memory at address 0x1>, bufsize=bufsize@entry=131072,
maxcells=maxcells@entry=0x7ffcfea95ae4) at src/ps/output.c:245
#4 0x00005565a5c593ec in do_pr_name (outbuf=0x7fcde4fa1090 "", name=<optimized out>, u=0) at src/ps/output.c:1206
#5 0x00005565a5c5b2e1 in show_one_proc (p=p@entry=0x7fcde4fc6800, fmt=0x5565ba359700) at src/ps/output.c:2205
#6 0x00005565a5c52904 in simple_spew () at src/ps/display.c:320
#7 main (argc=<optimized out>, argv=<optimized out>) at src/ps/display.c:672
(gdb) fr 5
#5 0x00005565a5c5b2e1 in show_one_proc (p=p@entry=0x7fcde4fc6800, fmt=0x5565ba359700) at src/ps/output.c:2205
2205 if(p && fmt->pr) amount = (*fmt->pr)(outbuf,p);
(gdb) p fmt->pr
$16 = (int (*)(char * const restrict, const struct pids_stack * const restrict)) 0x5565a5c59b60 <pr_ruser>
Since fmt->pr points to pr_ruser, I believe it is trying to print the real username.
Given the u=0 in the do_pr_name args, I would also guess that the process it's trying to print is owned by root, but that may be a red herring (e.g. maybe both the user ID and the user name are just corrupt). The src=0x1 passed to escape_str is not a valid pointer, so at least the name looks bogus.
Is there anything specific that I can look at in the core dump to help diagnose this? Unfortunately I can't reproduce this on demand, so I can't debug a running ps.
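In case it helps whoever picks this up, here is a sketch of how I could dump more state from a core non-interactively. The helper names gdb_commands and inspect_core are hypothetical, the frame numbers are taken from the backtrace above and would need adjusting for a different trace, and gdb may stop early if a variable has been optimized out.

```shell
#!/bin/sh
# Hypothetical helper: print a gdb command list that dumps the frames
# involved in the crash. Frame numbers follow the backtrace in this report.
gdb_commands() {
    cat <<'EOF'
set pagination off
bt full
frame 5
print *fmt
print *p
frame 4
info args
info locals
frame 3
info args
info registers
EOF
}

# Hypothetical helper: run gdb in batch mode against a core file.
# Usage: inspect_core /path/to/ps /path/to/core
inspect_core() {
    script=$(mktemp) || return 1
    gdb_commands > "$script"
    gdb -batch -x "$script" "$1" "$2"
    status=$?
    rm -f "$script"
    return "$status"
}
```

If the cores are held by systemd-coredump rather than written as plain files, coredumpctl dump should be able to extract one to a file first. I can attach the output of this for both core dumps if that would be useful.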
Also, at first I thought this bug might be https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036631, which is still present in Debian 12, but the stack trace is completely different from the one for that bug (and I'm not using the -m switch).