Draft: Add ci service probes (!49191) · Merge requests · GitLab.org / GitLab

Arran Walker requested to merge ajwalker/ci-service-probes into master Dec 04, 2020

What does this MR do?

Runner currently supports only a TCP port check to determine the status of the services it creates for CI jobs. The port checked is the first port exposed in the image's metadata. This approach has worked reasonably well, but it falls short in many use cases:

The container may expose multiple ports, and the first one might not lead to a successful health check.
A port that accepts connections does not mean the service behind it is actually ready.
If the wrong port is chosen, the check runs into a long timeout that can add up to a minute before the job starts executing.
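
To make the limitation concrete, here is a minimal sketch of a TCP-only check. It is not Runner's implementation; the function name `tcp_port_open` and its parameters are hypothetical. It only shows that such a check reports success as soon as something accepts the connection, whether or not the application is ready.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration. A True result only proves that
    something is listening, not that the service can handle requests.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

A bare listening socket that never answers a request would still pass this check, which is the second problem in the list above.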

This MR introduces CI service probes, designed to be familiar to users of Kubernetes startup probes. TCP, HTTP Get, and Exec probes can be configured for each CI service. The probe configuration is passed to Runner, which executes the probes to correctly identify when a service is ready.
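
As a rough illustration of what the HTTP Get and Exec probe kinds check, here is a sketch in Python. It is not the code in this MR. The function names are hypothetical, the "200 to 399 means success" rule is borrowed from Kubernetes HTTP probes as an assumption, and the exec probe here runs a local command, whereas a real exec probe would run inside the service container.

```python
import http.client
import subprocess
import urllib.error
import urllib.request


def parse_headers(lines):
    """Turn ['Name: value', ...] strings into a dict of names to values."""
    headers = {}
    for line in lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def http_get_probe(host, port, path, header_lines, timeout):
    """Pass if the final response status is in the 200-399 range (assumed rule)."""
    request = urllib.request.Request(
        f"http://{host}:{port}{path}", headers=parse_headers(header_lines)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx responses,
        # which urllib raises as HTTPError (a URLError subclass).
        return False


def exec_probe(command, timeout):
    """Pass if the command exits with status 0 before the timeout.

    Assumption: runs locally for illustration only.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return completed.returncode == 0
```

For example, `http_get_probe("localhost", 8080, "/health", ["X-Custom-Header: custom"], 10)` would succeed only once the endpoint actually answers, which is a stronger signal than an open port.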

We've previously explored using Docker's built-in HEALTHCHECK, but this has its own issues:

No docker-library official image uses a HEALTHCHECK. This would mean users would often have to extend the container.
It only supports executing a command inside the container. Many containers rely on Kubernetes' probe functionality, so no internal command is added.
It's worth mentioning that Kubernetes doesn't support HEALTHCHECK either; the command-only approach is likely too limiting.

A commonly requested feature for Kubernetes is multiple probe support. Unfortunately, Kubernetes is at a point where they cannot easily introduce this to their API. The workaround for this problem is a sidecar solution, something that we cannot easily offer for CI services. This MR allows multiple probes to be configured to avoid this issue.

It would be good to settle this problem once and for all: TCP, HTTP Get and Exec probes likely cover most, if not all, of the ways to determine whether a service is available and ready.

An example:

job:
  services:
  - name: service1
    probes:
    - tcp:
        port: 8080
      exec:
        command: ["/bin/check"]
      retries: 10 # optional
      initial_delay: 5 # optional
      period: 10 # optional
      timeout: 10 # optional
    - http_get:
        path: /health
        port: 8080
        headers:
          - 'X-Custom-Header: custom'
    - exec:
        command: ["/bin/check/another/thing"]

  - name: service2
    probes:
    - http_get:
        path: /health
        port: 8080
        headers:
          - 'X-Custom-Header: custom'
      initial_delay: 5 # optional
      period: 10 # optional
      timeout: 10 # optional

  - name: service3
    probes:
    - exec:
        command: ["/bin/check"]
      retries: 10 # optional
      initial_delay: 5 # optional
      period: 10 # optional
      timeout: 10 # optional
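
The description does not spell out how `retries`, `initial_delay`, `period` and `timeout` interact, so the following is only a sketch under assumed semantics modelled on Kubernetes probes: wait `initial_delay` seconds, then make up to `retries` attempts spaced `period` seconds apart, giving each attempt `timeout` seconds. The names `ProbeSettings`, `wait_until_ready` and `service_ready`, the default values, and the rule that a service is ready once every configured probe has passed are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class ProbeSettings:
    # Defaults are placeholders; the MR description does not state them.
    retries: int = 10          # assumed: maximum number of attempts
    initial_delay: float = 0   # seconds to wait before the first attempt
    period: float = 10         # seconds to wait between attempts
    timeout: float = 10        # seconds allowed for a single attempt


def wait_until_ready(
    check: Callable[[float], bool],
    settings: ProbeSettings,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run check(timeout) until it passes or the attempts are used up."""
    sleep(settings.initial_delay)
    for attempt in range(settings.retries):
        if check(settings.timeout):
            return True
        if attempt < settings.retries - 1:
            sleep(settings.period)
    return False


def service_ready(
    probes: Iterable[Tuple[Callable[[float], bool], ProbeSettings]],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Assumed rule: the service is ready only if every probe passes."""
    return all(wait_until_ready(check, cfg, sleep) for check, cfg in probes)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. Each `check` is any callable that takes a timeout and returns a boolean, so TCP, HTTP Get and Exec checks can share the same loop.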

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
Tested in all supported browsers
Informed Infrastructure department of a default or new setting change, if applicable per definition of done

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

Label as security and @ mention @gitlab-com/gl-security/appsec
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
Security reports checked/validated by a reviewer from the AppSec team

Edited Jan 12, 2021 by Arran Walker
