Services within pipeline jobs should have configurable health checks

Problem

Service health checks are inadequate and unreliable, potentially causing jobs to fail unexpectedly or proceed when they shouldn't.

Specific Issues:

  • Non-blocking health checks - Jobs continue running even if services fail health checks
  • Fixed timeout - No way to adjust how long to wait for services to become healthy
  • Inflexible health check logic - Cannot customize what constitutes a "healthy" service
  • Port checking limitations - Only the 20 numerically lowest exposed TCP ports are checked; others that might be critical are ignored
  • Timing/ordering problems - Services may start before the job workspace is properly set up with builds, cache, or artifacts

This creates reliability issues where jobs might run against unhealthy services or services might not be ready when the job expects them to be, leading to unpredictable pipeline behavior.
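Until health checks can block the job, a common workaround is to poll the service's TCP port manually from the job's before_script. A sketch, assuming a Bash shell in the job container (the function name and arguments are placeholders, not existing runner features):

```shell
#!/bin/bash
# Workaround sketch: block the job script until a service's TCP port answers.
# Intended to be called from a job's before_script, e.g.:
#   wait_for_port mssqlserver 1433 180 || exit 1
wait_for_port() {
  host=$1; port=$2; timeout=${3:-30}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Bash's /dev/tcp pseudo-device attempts a TCP connect.
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "service $host:$port not reachable after ${timeout}s" >&2
  return 1
}
```

This only approximates what the proposal asks for: it blocks the script itself, but it cannot influence service startup ordering or replace the runner's built-in checks.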

Problem Description - original

Currently, when using the Docker executor (I don't have experience with other executors), if a service is specified in a pipeline, a number of things happen (ref) to check the health of the service. However:

  • the health of the service is not a blocking condition which prevents the job from proceeding
  • the timeout to wait for the service to pass health checks is not configurable
  • the healthchecks themselves are not configurable
  • the ports which are healthchecked are arbitrarily truncated (only the numerically lowest 20 exposed TCP ports are checked)
  • the order of service startup is not guaranteed relative to the population of /builds or the population of shared cached or artifacts from previous jobs in the pipeline

Proposal

  • the service definition should be extended to add a healthchecks key that takes a list of dicts with the following keys (not all mandatory)
    • ports - a list of ports or port ranges to consider when conducting the health check. When not provided, the existing "legacy" behavior of using the first 20 exposed ports defined for the container would be used
    • timeout - the maximum time in seconds to wait for the service to pass its healthchecks. When not provided the existing default of 30 seconds will be used
    • required - a boolean; when true, a failure of this healthcheck will result in the job being marked as failed. When not provided, the existing default behavior of logging a warning but not failing the job will be used
    • wait, retry - integers, in seconds, controlling the delay before the first healthcheck and the back-off interval between subsequent retries
    • image, entrypoint, command - parameters to specify an alternate healthcheck container to be run instead of simply waiting for exposed TCP ports to be listening. A non-zero return code from PID 1 of this container will be considered a failure of the healthcheck
  • the service definition should be extended to add requires_cache, requires_artifacts, requires... - booleans which can be used to delay the start of the service and/or its healthchecks until the corresponding data has been downloaded/copied/mounted. This would allow services to use this data at startup time to, for instance, provide configuration or fixture data (such as importing or applying database schemas/data/migrations)
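The wait/retry/timeout semantics proposed above could compose as follows: the first probe runs after wait seconds, then repeats every retry seconds until timeout seconds have elapsed. A sketch in shell (run_healthcheck and its argument order are hypothetical illustration, not proposed syntax):

```shell
#!/bin/bash
# Hypothetical sketch of the proposed wait/retry/timeout semantics.
# check_cmd is any command whose exit status is the health probe result.
run_healthcheck() {
  check_cmd=$1; wait=${2:-0}; retry=${3:-5}; timeout=${4:-30}
  sleep "$wait"                              # delay before the first probe
  elapsed=$wait
  while :; do
    if $check_cmd; then return 0; fi         # healthy
    [ "$elapsed" -ge "$timeout" ] && return 1  # give up after the deadline
    sleep "$retry"                           # back off between retries
    elapsed=$((elapsed + retry))
  done
}
```

Note that timeout here bounds the total elapsed time including the initial wait, which matches how the example below combines wait: 30 with timeout: 180.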

Theoretical Example

test_db_queries:
  stage: test
  services:
    - name: mssql/server:2019-latest
      alias: mssqlserver
      requires_builds: true # wait until /builds has a full git clone so we can use an entrypoint script from our project repo
      requires_cache: false # we aren't going to read or write from the cache, so don't wait for it
      requires_artifacts: true # we're going to generate fixture data in an earlier job, wait for the artifacts to sync before starting the database
      entrypoint: # exec-form entrypoint: the script and its argument as separate list items
        - $CI_PROJECT_DIR/load_and_run_fixture_db.sh
        - $CI_PROJECT_DIR/fixtures/test_data.sql
      variables:
        ACCEPT_EULA: "Y" # quoted so YAML 1.1 parsers don't read Y as a boolean
      healthchecks:
        - ports:
            - 1433 # check sql server listening on port 1433
          timeout: 180 # timeout after 3 minutes
          wait: 30 # attempt first check 30 seconds after starting service
          retry: 15 # wait 15 seconds between retries on failure
          required: true # fail job if sql server isn't listening
        - ports:
            - 135 # check for debugger listening on port 135
          required: false # raise a warning if the debugger isn't available
        - image: mcr.microsoft.com/mssql-tools # use the mssql tools container to use a native client
          command: "sqlcmd -S mssqlserver,1433 -U sa -P $SQL_PASSWORD -d main -Q 'SELECT 1;'"  # attempt a native authenticated protocol query
          required: true
  script:
    - test_db_queries.sh
  artifacts:
    paths:
      - fixtures
      - coverage_report.json

The above example is somewhat contrived, but it demonstrates how the proposed features could work in concert: the service becomes a flexible part of test fixturing, with strong guarantees around how it is started, configured, presented to, and made reachable by the test job.
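The required flag in the example determines whether a failed check fails the job or merely warns. A runner-side sketch of that aggregation logic (evaluate_healthchecks and its "name:required:passed" input format are hypothetical illustration):

```shell
#!/bin/bash
# Hypothetical sketch: aggregate healthcheck results into a job verdict.
# Each argument is "name:required:passed" with 1=true, 0=false.
# A failed check with required=1 fails the job; required=0 only warns.
evaluate_healthchecks() {
  job_failed=0
  for spec in "$@"; do
    name=${spec%%:*}; rest=${spec#*:}
    required=${rest%%:*}; passed=${rest#*:}
    if [ "$passed" -eq 0 ]; then
      if [ "$required" -eq 1 ]; then
        echo "ERROR: required healthcheck '$name' failed" >&2
        job_failed=1
      else
        echo "WARNING: optional healthcheck '$name' failed" >&2
      fi
    fi
  done
  return $job_failed
}
```

Under these semantics, the example's failed debugger check (required: false) would print a warning while the job continues, whereas a failed SQL Server check would abort it.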

Links to related issues and merge requests / references

Edited Aug 31, 2025 by Darren Eastman