Investigate why Container Scanning is not working with NFS mounts

Summary

The snippet provided for Container Scanning at https://docs.gitlab.com/ee/ci/examples/container_scanning.html does not work when the runner's Docker daemon keeps its data on NFS, for example when /var/lib/docker is on an NFS mount.

```yaml
image: docker:stable
variables:
  DOCKER_DRIVER: overlay2
allow_failure: true
services:
  - docker:stable-dind
script:
  - docker run --rm -d --name db arminc/clair-db:latest
  - docker run -p 6060:6060 --link db:postgres -d --rm --name clair arminc/clair-local-scan:v2.0.5
  - apk add -U wget ca-certificates
  ...
```

Steps to reproduce

When the job runs, the second docker run command fails with the following error:

docker: Error response from daemon: failed to copy xattrs: failed to set xattr "security.selinux" on /images/docker/volumes/03cbbf539ecd615297b09b444175c2bf01b252c3f61582a6ad4fb5bab6bac6e2/_data/config.yaml: operation not supported.

In this case, /var/lib/docker is a symlink to /images/docker and /images is an NFS mount.

Several issues have been raised about this error; here is one: https://github.com/genuinetools/img/issues/45

This comment in that issue specifically mentions GitLab shared runners as having this issue as well: https://github.com/genuinetools/img/issues/45#issuecomment-409511025

NOTE: When this fails, every subsequent pipeline that runs the snippet above also fails, this time when creating the first container. The failure leaves the db container running, so the next docker run --name db cannot reuse the name db and exits with an error. I suspect there should be an after_script: that tears those containers down if they are still there (e.g. docker stop foo || true).
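
A minimal sketch of the suggested after_script, assuming the job is named container_scanning (a hypothetical name; the snippet above shows no job name) and that after_script talks to the same Docker daemon as script. It uses docker rm -f, which stops and removes a container in one step, and keeps the "|| true" pattern from the note so a container that is already gone is not treated as an error.

```yaml
container_scanning:  # hypothetical job name
  # image, variables, services and script as in the snippet above
  after_script:
    # Force-remove the helper containers whether or not script succeeded,
    # so the names "db" and "clair" are free for the next pipeline.
    # "|| true" ignores the error docker returns when a container no longer exists.
    - docker rm -f clair || true
    - docker rm -f db || true
```

Removing clair before db mirrors the reverse of the start order, since clair is linked to db.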

The clair-local-scan repo shows that its image is based on quay.io/coreos/clair:${VERSION} and sets CMD ["-config=/config/config.yaml"], which simply passes that argument to the ENTRYPOINT of the CoreOS clair image. The CoreOS clair image, in turn, declares VOLUME /config. As a result, every container started from an image built on it gets a volume mounted at /config when docker run is executed.

Because the docker run command for the scanner does not specify a mount for /config, Docker creates an anonymous volume under /var/lib/docker/volumes (here /images/docker/volumes, as the error path shows), and that is where the problem lies. When the container is run, Docker copies config.yaml (added by the clair-local-scan Dockerfile) from the image's /config into that new volume (the VOLUME declared in the clair Dockerfile). That copy lands on the NFS mount and tries to preserve the file's extended attributes, including security.selinux, which the NFS mount does not support ("operation not supported"). The copy fails, and so does the container start (which is essentially what the bugs above report).
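
One way to check this explanation on an affected runner is a throwaway diagnostic job. This is a sketch under assumptions: the job name inspect_docker_storage is hypothetical, and it assumes the job reaches the same Docker daemon as the failing job. docker info reports the daemon's data root directory, and docker image inspect lists the volumes the scanner image declares, which should include /config if the reasoning above holds.

```yaml
inspect_docker_storage:  # hypothetical diagnostic job
  image: docker:stable
  services:
    - docker:stable-dind
  script:
    # Data root where this daemon stores images, containers and volumes.
    - docker info --format '{{.DockerRootDir}}'
    # Pull the scanner image so it can be inspected locally.
    - docker pull arminc/clair-local-scan:v2.0.5
    # Volumes declared by the image (inherited from quay.io/coreos/clair).
    - docker image inspect --format '{{json .Config.Volumes}}' arminc/clair-local-scan:v2.0.5
```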

What is the current bug behavior?

The job fails at the second docker run command with the error above.

What is the expected correct behavior?

Job succeeds.

Possible fixes

As a workaround, create a directory ($PWD/config) and bind-mount it over /config by adding -v $PWD/config:/config to the scanner's docker run command. That avoids the NFS issue, but the script then fails on the apk add line with the error apk: command not found.
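
The workaround above, sketched against the script: section of the original snippet (only that section is shown). Docker populates named and anonymous volumes from the image but not bind mounts, which is why the failing xattr copy is skipped. The same property means the bind mount hides the image's own /config/config.yaml, so clair will only find its config if the directory already contains one; this report does not say where that file comes from, so that step is left as a placeholder comment rather than guessed.

```yaml
script:
  - docker run --rm -d --name db arminc/clair-db:latest
  # Host-side directory to bind-mount over the image's VOLUME /config.
  - mkdir -p "$PWD/config"
  # Placeholder: put a clair config.yaml into $PWD/config here (source not stated in this report).
  - docker run -p 6060:6060 -v "$PWD/config:/config" --link db:postgres -d --rm --name clair arminc/clair-local-scan:v2.0.5
  # Still fails on the affected runner with "apk: command not found".
  - apk add -U wget ca-certificates
  # remaining script lines as in the snippet above
```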

Tasks

Investigate the underlying environment and scenario
Consider possible solutions (one option has been provided, but needs to be validated by engineers)

Edited Apr 09, 2019 by Fabio Busatto