Services within pipeline jobs should have configurable health checks
## Problem
Service health checks are inadequate and unreliable, potentially causing jobs to fail unexpectedly or proceed when they shouldn't.
### Specific Issues
- Non-blocking health checks - Jobs continue running even if services fail health checks
- Fixed timeout - No way to adjust how long to wait for services to become healthy
- Inflexible health check logic - Cannot customize what constitutes a "healthy" service
- Port checking limitations - Only checks the lowest 20 TCP ports, ignoring others that might be critical
- Timing/ordering problems - Services may start before the job workspace is properly set up with builds, cache, or artifacts
This creates reliability issues where jobs might run against unhealthy services or services might not be ready when the job expects them to be, leading to unpredictable pipeline behavior.
## Problem Description (original)
Currently, when using the Docker executor (I don't have experience with other executors), if a service is specified in a pipeline, a number of things happen (ref) to check the health of the service. However:
- the health of the service is not a blocking condition which prevents the job from proceeding
- the timeout to wait for the service to pass health checks is not configurable
- the healthchecks themselves are not configurable
- the ports which are healthchecked are arbitrarily truncated (only the numerically lowest 20 exposed TCP ports are checked)
- the order of service startup is not guaranteed relative to the population of `/builds` or the population of shared cache or artifacts from previous jobs in the pipeline
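For contrast, here is the shape of a service definition today (job name, image, and script are illustrative); none of the health check behavior described above can be expressed or tuned in the YAML:

```yaml
integration_tests:
  stage: test
  services:
    - name: postgres:16   # runner probes only the lowest 20 exposed TCP ports,
      alias: db           # with a fixed timeout; a failed check merely logs a
                          # warning and the job proceeds anyway
  script:
    - ./run_integration_tests.sh
```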
## Proposal
- the `service` definition should be extended to add a `healthchecks` key that takes a list of dicts with the following keys (not all mandatory):
  - `ports` - a list of ports or port ranges to consider when conducting the health check. When not provided, the existing "legacy" behavior of using the first 20 exposed ports defined for the container would be used
  - `timeout` - the maximum time in seconds to wait for the service to pass its healthchecks. When not provided, the existing default of 30 seconds will be used
  - `required` - a boolean; when true, a failure of this healthcheck will result in the job being marked as failed. When not provided, the existing default behavior of logging a warning but not failing the job will be used
  - `wait`, `retry` - integers that control the retry and back-off timing on healthchecks
  - `image`, `entrypoint`, `command` - parameters to specify an alternate healthcheck container to be run instead of simply waiting for exposed TCP ports to be listening. A non-zero return code from PID 1 of this container will be considered a failure of the healthcheck
- the `service` definition should be extended to add `requires_cache`, `requires_artifacts`, `requires...` - booleans which can be used to force ordering of the service and/or its healthchecks until after various data has been downloaded/copied/mounted as needed. This would allow services to use this data at startup time to, for instance, provide configuration or fixture data (such as importing or applying database schemas/data/migrations)
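As a minimal sketch of the proposed syntax (service name and port are illustrative), a job that only needs one knob changed could specify just that key and inherit the legacy defaults for everything else:

```yaml
services:
  - name: redis:7
    alias: cache
    healthchecks:
      - ports:
          - 6379        # probe only the Redis port, not the first 20 exposed ports
        timeout: 60     # wait up to 60 seconds instead of the default 30
        required: true  # fail the job instead of just logging a warning
```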
## Theoretical Example
```yaml
test_db_queries:
  stage: test
  services:
    - name: mssql/server:2019-latest
      alias: mssqlserver
      requires_builds: true     # wait until /builds has a full git clone so we can use an entrypoint script from our project repo
      requires_cache: false     # we aren't going to read or write from the cache, so don't wait for it
      requires_artifacts: true  # we generate fixture data in an earlier job; wait for the artifacts to sync before starting the database
      entrypoint:
        - $CI_PROJECT_DIR/load_and_run_fixture_db.sh $CI_PROJECT_DIR/fixtures/test_data.sql
      variables:
        ACCEPT_EULA: "Y"
      healthchecks:
        - ports:
            - 1433          # check SQL Server is listening on port 1433
          timeout: 180      # time out after 3 minutes
          wait: 30          # attempt first check 30 seconds after starting the service
          retry: 15         # wait 15 seconds between retries on failure
          required: true    # fail the job if SQL Server isn't listening
        - ports:
            - 135           # check for a debugger listening on port 135
          required: false   # raise a warning if the debugger isn't available
        - image: mcr.microsoft.com/mssql-tools   # use the mssql-tools container for a native client
          command: "sqlcmd -S mssqlserver,1433 -U sa -P $SQL_PASSWORD -d main -Q 'SELECT 1;'"  # attempt a native authenticated protocol query
          required: true
  script:
    - test_db_queries.sh
  artifacts:
    paths:
      - fixtures
      - coverage_report.json
```
The above example is somewhat contrived, but it demonstrates how all of the proposed features could work in concert to make our service a flexible part of test fixturing, with strong guarantees around how it is started and configured and around when it becomes reachable by our test job.