Services failing to start should support aborting the CI job
Summary
Services failing to start should support aborting the CI job
Steps to reproduce
Set up a Docker-based GitLab runner with privileged = false, and attempt to use docker:stable-dind in .gitlab-ci.yml. The service startup will fail, and the job will then continue to run.
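A minimal .gitlab-ci.yml that should trigger this on such a runner might look like the sketch below. The job name `build`, the `docker:stable` job image, and the `docker info` command are assumptions for illustration; they are not taken from the affected project.

```yaml
# Sketch only: job name, image, and script are illustrative assumptions.
build:
  image: docker:stable
  services:
    # Cannot start on a runner configured with privileged = false.
    - docker:stable-dind
  script:
    # Still executed after the service health check has timed out;
    # fails here because no Docker daemon is reachable.
    - docker info
```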
Example Project
N/A
What is the current bug behavior?
*** WARNING: Service runner-5gyo7mm-project-2-concurrent-0-243a732bf87ce5ae-docker.example.com__example__docker-0 probably didn't start properly.
Health check error:
service "runner-5gyo7mm-project-2-concurrent-0-243a732bf87ce5ae-docker.example.com__example__docker-0-wait-for-service" timeout
Health check container logs:
Service container logs:
2023-01-13T07:54:31.295368737Z mount: permission denied (are you root?)
2023-01-13T07:54:31.295434061Z Could not mount /sys/kernel/security.
2023-01-13T07:54:31.295444525Z AppArmor detection and --privileged mode might break.
2023-01-13T07:54:31.296469773Z mount: permission denied (are you root?)
The job then continues to run, subsequently failing since it depends on the Docker-in-Docker instance being available.
What is the expected correct behavior?
The log above is fine; the problem is that it can get buried among the other parts of the job log (which, in our case, even exceeds the 500 KiB maximum for unrelated reasons).
The https://docs.gitlab.com/ee/ci/services/ page describes:
If the second stage of the check fails, it prints the warning: *** WARNING: Service XYZ probably didn't start properly. This issue can occur because:
- There is no opened port in the service.
- The service was not started properly before the timeout, and the port is not responding.
In most cases it affects the job, but there may be situations when the job still succeeds even if that warning was printed.
Currently, the semantics are optimized for the "there may be" scenario. For the "in most cases" scenarios, it's currently impossible to "fail-fast" the whole job when the service startup has failed.
Suggested workaround
I am proposing something like this in .gitlab-ci.yml:
services:
  - name: docker:stable-dind
    alias: docker
    fail_fast: true # the new setting
The fail_fast setting would make GitLab Runner abort the job immediately when the health check for the service fails. This gives faster feedback to the person running the job, which can sometimes be very helpful.
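Until something like this exists, a similar effect can be roughly approximated inside the job itself by probing the Docker daemon in before_script. The following is a sketch under stated assumptions, not a replacement for the proposed setting: the job name, the `docker:stable` job image, and the `docker build` step are hypothetical, and it assumes the Docker client in the job image is already configured to reach the service (for example via DOCKER_HOST). The probe only runs after the runner has finished waiting for the service, so it does not shorten the health-check timeout.

```yaml
# Sketch only: job name, image, and build step are illustrative assumptions.
build:
  image: docker:stable
  services:
    - name: docker:stable-dind
      alias: docker
  before_script:
    # `docker info` exits non-zero when the daemon is unreachable,
    # so the job fails here, before the main script runs.
    - |
      if ! docker info > /dev/null 2>&1; then
        echo "Docker-in-Docker service is not reachable; aborting job."
        exit 1
      fi
  script:
    - docker build -t example .
```

This places the failure message near the start of the job's own output instead of after a long script, but the job still starts and fetches sources first, which a runner-level setting would avoid.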
Results of GitLab environment info
System: Ubuntu 18.04
Proxy: no
Current User: git
Using RVM: no
Ruby Version: 2.7.7p221
Gem Version: 3.1.6
Bundler Version: 2.3.15
Rake Version: 13.0.6
Redis Version: 6.2.8
Sidekiq Version: 6.5.7
Go Version: unknown

GitLab information
Version: 15.7.2-ee
Revision: 78771c4b9a0
Directory: /opt/gitlab/embedded/service/gitlab-rails
DB Adapter: PostgreSQL
DB Version: 12.12
URL: https://git.example.com
HTTP Clone URL: https://git.example.com/some-group/some-project.git
SSH Clone URL: git@git.example.com:some-group/some-project.git
Elasticsearch: no
Geo: no
Using LDAP: yes
Using Omniauth: yes
Omniauth Providers:

GitLab Shell
Version: 14.14.0
Repository storages:
- default: unix:/var/opt/gitlab/gitaly/gitaly.socket
GitLab Shell path: /opt/gitlab/embedded/service/gitlab-shell