Services failing to start should support aborting the CI job

Summary

Steps to reproduce

Set up a Docker-based GitLab Runner with privileged = false, and use docker:stable-dind as a service in .gitlab-ci.yml. The service fails to start, and the job then continues to run.
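
For reference, a minimal .gitlab-ci.yml along these lines should be enough to trigger the warning on a runner whose Docker executor has privileged = false. This is a sketch, not the configuration of the affected project: the job name `build`, the `docker:stable` job image, and the `docker info` command are assumptions chosen for illustration.

```yaml
# Hypothetical minimal job; only the docker:stable-dind service comes from the report.
build:
  image: docker:stable
  services:
    - name: docker:stable-dind
      alias: docker
  script:
    # Any command that needs the Docker daemon will do; it only runs
    # after the service health check has already timed out.
    - docker info
```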

Example Project

N/A

What is the current bug behavior?

*** WARNING: Service runner-5gyo7mm-project-2-concurrent-0-243a732bf87ce5ae-docker.example.com__example__docker-0 probably didn't start properly.

Health check error:
service "runner-5gyo7mm-project-2-concurrent-0-243a732bf87ce5ae-docker.example.com__example__docker-0-wait-for-service" timeout

Health check container logs:


Service container logs:
2023-01-13T07:54:31.295368737Z mount: permission denied (are you root?)
2023-01-13T07:54:31.295434061Z Could not mount /sys/kernel/security.
2023-01-13T07:54:31.295444525Z AppArmor detection and --privileged mode might break.
2023-01-13T07:54:31.296469773Z mount: permission denied (are you root?)

The job then continues to run, subsequently failing since it depends on the Docker-in-Docker instance being available.

What is the expected correct behavior?

The log output above is fine. The problem is that it is easily obscured by the rest of the job log (which, in our case, even exceeds the 500 KiB maximum for unrelated reasons).

The https://docs.gitlab.com/ee/ci/services/ page describes this behavior:

If the second stage of the check fails, it prints the warning: *** WARNING: Service XYZ probably didn't start properly. This issue can occur because:

There is no opened port in the service.

The service was not started properly before the timeout, and the port is not responding.

In most cases it affects the job, but there may be situations when the job still succeeds even if that warning was printed.

Currently, the semantics are optimized for the "there may be" scenario. For the "in most cases" scenario, it is impossible to fail fast, that is, to abort the whole job as soon as the service startup has failed.

Suggested workaround

I am proposing something like this, in the .gitlab-ci.yml:

  services:
    - name: docker:stable-dind
      alias: docker
      fail_fast: true # the new setting

The fail_fast setting would make GitLab Runner abort the job immediately when the health check for the service fails. This gives faster feedback to the person running the job, which can sometimes be very helpful.
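
Independently of the proposed setting, a job can approximate this behavior itself by probing the service at the very start of before_script. The following is only a sketch under stated assumptions: the job name, the `docker:stable` image, the plain-TCP connection on port 2375 (TLS disabled through `DOCKER_TLS_CERTDIR`), and the error message are all illustrative and not part of the original report.

```yaml
# Hypothetical job-level check; fail_fast itself is only a proposal.
build:
  image: docker:stable
  services:
    - name: docker:stable-dind
      alias: docker
  variables:
    # Assumption: talk to the dind service over plain TCP with TLS disabled.
    DOCKER_HOST: tcp://docker:2375
    DOCKER_TLS_CERTDIR: ""
  before_script:
    # Abort before any real work if the Docker daemon is unreachable.
    # A block scalar is used so the ": " in the message is not parsed as YAML.
    - |
      if ! docker info > /dev/null 2>&1; then
        echo "ERROR: docker service is not reachable, aborting job" >&2
        exit 1
      fi
  script:
    - docker build -t example .
```

The check still runs only after the runner's own health check has timed out, so it does not shorten the wait. It does stop the job before the remaining steps add to the log, and it puts the failure message at the end of the job log rather than near the top.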

Results of GitLab environment info

System:		Ubuntu 18.04
Proxy:		no
Current User:	git
Using RVM:	no
Ruby Version:	2.7.7p221
Gem Version:	3.1.6
Bundler Version:2.3.15
Rake Version:	13.0.6
Redis Version:	6.2.8
Sidekiq Version:6.5.7
Go Version:	unknown

GitLab information
Version:	15.7.2-ee
Revision:	78771c4b9a0
Directory:	/opt/gitlab/embedded/service/gitlab-rails
DB Adapter:	PostgreSQL
DB Version:	12.12
URL:		https://git.example.com
HTTP Clone URL:	https://git.example.com/some-group/some-project.git
SSH Clone URL:	git@git.example.com:some-group/some-project.git
Elasticsearch:	no
Geo:		no
Using LDAP:	yes
Using Omniauth:	yes
Omniauth Providers: 

GitLab Shell
Version:	14.14.0
Repository storages:
- default: 	unix:/var/opt/gitlab/gitaly/gitaly.socket
GitLab Shell path:		/opt/gitlab/embedded/service/gitlab-shell
