Send service finish telemetry when exit code changes

What does this merge request do and why?

This MR reintroduces service finish telemetry, which !5242 (merged) disabled after it caused a spike in events from 60k to 2.5m per day. The spike happened because crash-looping services sent telemetry on every exit, even when the exit code had not changed.

Now we store each service's last exit code and only send telemetry when it changes.
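
The deduplication idea can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the names record_exit, send_event, and SV_DIR are hypothetical, and the only detail taken from this description is the sv/<service-name>/last_exit_code path.

```shell
#!/bin/sh
# Hypothetical sketch: report a service exit only when its exit code changes.
# The layout mirrors the sv/<service-name>/last_exit_code path mentioned in
# the validation steps; everything else here is an assumption.

SV_DIR="${SV_DIR:-sv}"

send_event() {
  # Stand-in for the real telemetry call; it only prints the event.
  echo "service_finish service=$1 exit_code=$2"
}

record_exit() {
  service="$1"
  exit_code="$2"
  state_file="$SV_DIR/$service/last_exit_code"

  # Read the previously stored exit code, if there is one.
  last=""
  if [ -f "$state_file" ]; then
    last="$(cat "$state_file")"
  fi

  # Same exit code as last time (for example, a crash loop): send nothing.
  if [ "$last" = "$exit_code" ]; then
    return 0
  fi

  # The exit code changed: remember it and send a single event.
  mkdir -p "$SV_DIR/$service"
  echo "$exit_code" > "$state_file"
  send_event "$service" "$exit_code"
}
```

With this sketch, calling record_exit runner 1 repeatedly emits one event on the first call and nothing afterwards. Overwriting the stored value with 2 makes the next record_exit runner 1 emit again, which is the behavior the validation steps exercise.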

Closes #2966 (closed)

How to set up and validate locally

  1. Start a crash-looping service (e.g., temporarily misconfigure the runner by running mv ~/.gitlab-runner/config.toml ~/.gitlab-runner/config.toml.bak)
  2. Verify that only one service_finish event is sent to ClickHouse, even though the service keeps restarting
  3. Run echo "2" > sv/<service-name>/last_exit_code to overwrite the stored exit code and simulate an exit code change
  4. Confirm a new service_finish event is sent on the next exit, because the stored exit code (2) no longer matches the service's actual exit code (1)

Impacted categories

The following categories relate to this merge request:

Merge request checklist

  • This MR references an issue describing the change.
  • This change is backward compatible. If not, please include steps to communicate to our users.
  • Tests added for new functionality. If not, please raise an issue to follow-up.
  • Documentation added/updated, if needed.
  • Announcement added, if change is notable.
  • gdk doctor test added, if needed.
Edited by Nao Hashizume
