Don't do anything with capabilities in `--sandbox=none`
We landed a big switch to use this project in our build/test tooling: https://github.com/coreos/coreos-assembler/pull/3428
However, we then discovered a corner case where one of our CI flows was running as uid 0 (inside a Kubernetes pod) but without the capbility to mutate its capabilities.
Again, the goal of specifying --sandbox=none
for this use
case is we don't want virtiofsd to do any additional isolation beyond
that imposed externally; so change the code flow here to skip
capability changes if --sandbox=none
was specified.
Merge request reports
Activity
I’m not 100 % sure on this. I don’t find dropping capabilities to be a sort of sandboxing or isolation, similarly to how I find seccomp is orthogonal.
However, we don’t drop capabilities when
uid != 0
, indicating that we indeed want to skip that if it can’t work. Maybe we could just make it a warning instead of an error if dropping capabilities fails in--sandbox=none
mode? Or add a new switch (similarly to how seccomp is orthogonal to sandboxing and gets its own switch); maybe a special value for--modcaps
.@germag, what’s your opinion?
Edited by Hanna CzenczekI don’t find dropping capabilities to be a sort of sandboxing or isolation, similarly to how I find seccomp is orthogonal.
At a technical level one difference with seccomp is that it actually requires no privileges of the calling process at all (assuming the process has entered no-new-privs).
But changing capabilities requires capabilities! (Also, this does all interact because changing capabilities can be blocked by seccomp too...)
Maybe we could try to tweak things so that we only drop capabilities...the bug i was hitting may be this:
$ oc run --rm -ti --image registry.access.redhat.com/ubi9/ubi:latest rhel9-test If you don't see a command prompt, try pressing enter. # capsh --print Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service=ep Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service
And intersecting that with the
default_caps
it looks like we're trying to addCAP_MKNOD
which isn't in the bounding set. But...again really that's just a more complicated way to no-op in the--sandbox=none
mode, and in other flows we'll have created a user namespace so we'll always be able to drop.We do have
--modcaps
, so you can manually remove capabilities that aren’t available. I think it makes sense that by default we require all capabilities that we will need for unimpeded operation, which includesmknod
. At least we should try, and print a warning if they aren’t available, so that users are aware that certain operations may not work even though virtiofsd runs as root.(When not running a root, I think it’s OK to not modify capabilities at all, because (at least to me personally) it seems obvious that some operations may require privileges that normal users just don’t have.)
As I’ve said, I’d be OK with just a warning: We keep trying to acquire all capabilities, but warn that that wasn’t possible in case of an error (instead of erroring out completely). (Ideally printing what capabilities we tried to set, and what the current set is.)
If you want, after printing the warning, we could still try to at least drop all unneeded capabilities, depending on how much additional complexity that brings. It almost seems better to me if users were to acknowledge the warning message by passing
--modcaps
to explicitly drop the unavailable capabilities, because if that’s done right, we can drop all capabilities we don’t need, making the warning disappear in the process.For that cap set will be enough to run virtiofsd with
--modcaps=-mknod:-setfcap
.This code:
if uid == 0 { drop_capabilities(fs_cfg.inode_file_handles, opt.modcaps); }
it's just an ugly shortcut for
--modcaps=-setfcap:-mknod:-setuid:-setgid:-fsetid:-fowner:-dac_override:-chown
,capng::apply()
will return an error if the process doesn't haveCAP_SETPCAP
(in older versions of libcap-ng it will try to log the error using syslog, so it could fail because of a missing seccomp filter)But, I think it is a good idea to check for
CAP_SETPCAP
first, and also check if we can add each cap and show a warning instead of just failing, but just forsandbox=none
?. I guess to be consistent, maybe we should check and show a warning regardless of the sandbox mode. Just adding a new option like--modcaps=-all
we still need to check forCAP_SETPCAP
, so I think checking each cap is a better option.Just a note: I think we never defined what should be expected for each sandbox mode (see #43 (closed)), also
sandbox=none
is not reallynone
, we still setup seccomp filter, that requires--seccomp=none
to disable it, and drop caps. So, the seccomp filtering is "disconnected" from the sandbox mode.
mentioned in commit freedesktop-sdk/mirrors/github/ostreedev/ostree@2fe88f80