Add metrics for counting configuration file access

Built on top of !3858 (merged)

What does this MR do?

Adds metrics describing details about configuration file access by the Runner process.

Why was this MR needed?

In !3713 (merged) we introduced a regression that caused Runner to reload the configuration file over and over again. It went unnoticed until the very last moment before the 15.8 release because it caused no user-facing or operator-facing problems.

It was noticed by @josephburnett while testing a totally different thing with a locally started Runner: the constantly repeated log lines saying that the configuration was loaded were a clear sign that something strange was happening.

Unfortunately, the only way to notice this problem was to look at the logs and see that configuration loading happened too often.

This MR adds metrics that can be used to define Prometheus alerts that fire when the configuration loading and/or saving rates are not at the expected levels. The metrics also make it possible to notice, from the Runner process context, that something is wrong with loading and/or saving the configuration file.

Example output of the new metrics:

# HELP gitlab_runner_configuration_loaded_total Total number of times the configuration file was loaded by Runner process
# TYPE gitlab_runner_configuration_loaded_total counter
gitlab_runner_configuration_loaded_total 5
# HELP gitlab_runner_configuration_loading_error_total Total number of times the configuration file was not loaded by Runner process due to errors
# TYPE gitlab_runner_configuration_loading_error_total counter
gitlab_runner_configuration_loading_error_total 8
# HELP gitlab_runner_configuration_saved_total Total number of times the configuration file was saved by Runner process
# TYPE gitlab_runner_configuration_saved_total counter
gitlab_runner_configuration_saved_total 4
# HELP gitlab_runner_configuration_saving_error_total Total number of times the configuration file was not saved by Runner process due to errors
# TYPE gitlab_runner_configuration_saving_error_total counter
gitlab_runner_configuration_saving_error_total 1

And an example Grafana view with all four values before and after applying the fix from !3858 (merged):

Screenshot_2023-01-19_at_20-55-09_Runner_metrics_-_Grafana

The sustained high loading and saving rates between 19:38 and 19:46 are the infinite reloading loop. It ended once Runner was restarted with the fixed version. The small increase at 19:48 is a simulation of a manual configuration file update (by calling touch config.toml) that was properly detected and handled by Runner.

The increases in the loading rate and in the error rates for both load and save between 19:50 and 19:54 are simulations of file access errors for read, parse, and write.

What's the best way to test this MR?

What are the relevant issue numbers?

Edited by Tomasz Maczukin