feat: add smoke-test.sh for automated container validation

## What does this MR do?

Add smoke-test.sh and update the README to integrate it into the existing docker-compose testing workflow.

Problem

The container state validation step in the README is manual: the tester runs `docker compose ps -a` repeatedly until the containers reach the desired state. Nothing produces an automated pass/fail signal. Because this step relies on visual inspection, the KAS startup failure in issue #107 went undetected for 6+ months.

Changes

smoke-test.sh (new file): automates the container state validation in three phases:

  1. Starts the compose stack (docker compose up -d)
  2. Polls container states with a configurable timeout (default 300s)
  3. Validates every container against its expected state:
    • Init containers (migrations, registry_migrations): must have exited with code 0
    • Services with healthchecks (postgres, redis, webservice, workhorse, etc.): must report healthy
    • Distroless services without healthchecks (kas, sidekiq): must be running
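
The three validation rules above can be sketched as a small classifier. This is an illustrative Python sketch, not the code in smoke-test.sh. The service groupings come from the list above, but the function name `check_container`, its parameters, and the fallback for services in neither set are assumptions, and the healthchecked set is incomplete because the list above ends in "etc.".

```python
# Illustrative sketch only: names and parameters are assumptions,
# not taken from smoke-test.sh.
INIT_CONTAINERS = {"migrations", "registry_migrations"}
# Incomplete on purpose: the MR lists these services and then "etc."
HEALTHCHECKED = {"postgres", "redis", "webservice", "workhorse"}


def check_container(service, state, health="", exit_code=0):
    """Return (ok, reason) for one container against its expected state."""
    if service in INIT_CONTAINERS:
        # Init containers must have finished, and finished cleanly.
        ok = state == "exited" and exit_code == 0
        return ok, f"init: state={state}, exit_code={exit_code}"
    if service in HEALTHCHECKED:
        # Services with a healthcheck must report healthy.
        return health == "healthy", f"healthcheck: health={health or 'none'}"
    # Assumption: everything else (e.g. kas, sidekiq) only needs to be running.
    return state == "running", f"no healthcheck: state={state}"


if __name__ == "__main__":
    for args in [("migrations", "exited", "", 0),
                 ("postgres", "running", "healthy", 0),
                 ("kas", "exited", "", 1)]:
        ok, reason = check_container(*args)
        print("PASS" if ok else "FAIL", args[0], reason)
```

Returning a reason string alongside the boolean keeps the per-container pass/fail report and the check in one place.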

Exits 0 on success, 1 on any failure. Failed services get their last 20 log lines printed for diagnosis.
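
The polling phase and the exit status can be sketched as a loop with a deadline. This is a hedged Python illustration rather than the script's actual shell logic: `wait_for_stack`, `fetch_states`, `all_ok`, and the 5-second interval are hypothetical, and only the 300-second default and the 0/1 exit codes come from the description above.

```python
import time


def wait_for_stack(fetch_states, all_ok, timeout=300, interval=5,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until all_ok(states) is true or `timeout` seconds have elapsed.

    fetch_states: callable returning the current container states
    all_ok:       callable deciding whether those states are acceptable
    Returns 0 on success and 1 on timeout, mirroring the script's exit codes.
    """
    deadline = clock() + timeout
    while True:
        states = fetch_states()
        if all_ok(states):
            return 0
        if clock() >= deadline:
            return 1
        sleep(interval)


if __name__ == "__main__":
    # Toy run: the "stack" becomes ready on the third poll.
    polls = {"count": 0}

    def fake_fetch():
        polls["count"] += 1
        return polls["count"]

    code = wait_for_stack(fake_fetch, lambda n: n >= 3,
                          interval=0, sleep=lambda _: None)
    print("exit code:", code)
```

Injecting `clock` and `sleep` is a design choice for this sketch only: it lets the timeout path be exercised without waiting in real time.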

README.md: update the "test the main images using docker-compose" section to use the script as the entrypoint for container validation, while preserving the manual browser-based verification steps that follow.

The updated README section would read:

### For non-patch releases, test the main images using docker-compose

You need docker (with the compose plugin) and python3 installed for this to work.

1. Export the local tagged images. If you ran `./build-script.sh v18.0.0-ubi`
   for example, your images would have been tagged as `18.0.0`
1. `export GITLAB_TAG=18.0.0`

#### Automated container validation

Run the smoke test script to bring up the stack and validate that all
containers reach their expected state:

    ./smoke-test.sh

The script starts all services, waits for healthchecks and migrations to
complete (up to 300 seconds), and reports pass/fail for each container.
If any service fails, the script prints its last 20 log lines and exits
with a non-zero status.

To use a custom timeout: `./smoke-test.sh 180`

#### Manual verification

Once the smoke test passes, verify the instance is functional:

1. Reach the local GitLab instance by going to `localhost:3000` in your browser
1. Set your root password and log in
1. Create a new project and initialize a Readme through the UI to confirm
   gitaly is working
1. Add your ssh key to your account, and clone the test project locally
   through ssh
1. Make a file change to the local git checkout of the test project, and
   push it back up over ssh
1. Refresh the project in your browser to see the change, confirming that
   shell works
1. Run `docker compose down` to clean up the test

Design decisions

  • No healthcheck for KAS: KAS uses a distroless image in which /usr/bin/kas is the only binary. With no shell, curl, or grpc_health_probe available, CMD-SHELL healthchecks fail with "exec: /bin/sh: no such file or directory". The script checks container state (running vs exited) instead.

  • Init container detection: Uses exact string matching to avoid substring collisions (e.g., gitaly must not match gitaly-init-cgroups).

  • Python dependency: Uses python3 -c for JSON parsing of docker compose ps output. Python 3 is listed as a new prerequisite in the README.
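
The init-container matching and JSON parsing decisions above can be illustrated together. This is a minimal sketch under stated assumptions, not the script's code: the MR does not say which flags the script passes, so `docker compose ps --format json` and the field names `Service`, `State`, `Health`, and `ExitCode` are assumed from Docker Compose v2 output. The inclusion of `gitaly-init-cgroups` in the init list is inferred from the example above, and the sample data is made up.

```python
import json

# Assumed init container list; gitaly-init-cgroups is inferred, not confirmed.
INIT_CONTAINERS = {"migrations", "registry_migrations", "gitaly-init-cgroups"}


def parse_compose_ps(text):
    """Parse assumed `docker compose ps --format json` output into dicts.

    Accepts either a single JSON array or one JSON object per line,
    since the output shape differs between Compose versions.
    """
    text = text.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_init(service):
    """Exact match: 'gitaly' is not an init container."""
    return service in INIT_CONTAINERS


def is_init_substring(service):
    """Counter-example: a substring test misclassifies 'gitaly'."""
    return service in " ".join(sorted(INIT_CONTAINERS))


# Made-up sample in the one-object-per-line shape.
SAMPLE = "\n".join([
    '{"Service": "gitaly", "State": "running", "Health": "healthy", "ExitCode": 0}',
    '{"Service": "gitaly-init-cgroups", "State": "exited", "Health": "", "ExitCode": 0}',
])

if __name__ == "__main__":
    for record in parse_compose_ps(SAMPLE):
        kind = "init" if is_init(record["Service"]) else "service"
        print(record["Service"], kind, record["State"])
```

With the substring variant, `gitaly` would be treated as an init container and expected to exit, which is the collision the exact match avoids.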

#107

Test plan

Tested on CentOS Stream 10 (GCP, Docker CE 29.3.1) against dsop-scripts compose with CNG v18.10.0 images:

  • Pass case: 3 clean runs from docker compose down -v, all passed (218s, 216s, ~210s)
  • Fail case: Reverted the KAS config (removed the websocket secret). The script correctly detected that KAS exited with code 1 and reported FAIL.

AI-Generated Content Disclosure: This MR contains code generated with assistance from GitLab Duo and OpenCode. The output has been reviewed for correctness, tested, and validated against project requirements per GitLab's AI contribution guidelines.
