Add more metrics and logs to spawn token (!6039) · Merge requests · GitLab.org / gitaly

Quang-Minh Nguyen requested to merge qmnguyen0711/improve-spawn-token-observability into master Jul 07, 2023

Spawn Token is a mechanism that limits the number of processes being spawned at a time. We measure its activity mostly via the command.spawn_token_wait_ms log field. That single log field is not enough to diagnose recent incidents, such as https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24029. This MR adds a log field and several metrics to make the mechanism more observable. The new log field and metrics are:

command.spawn_token_fork_ms: log field measuring the time a command spends forking the process after it acquires the token.
gitaly_spawn_forking_time_seconds: Histogram of actual forking time after spawn tokens are acquired.
gitaly_spawn_waiting_time_seconds: Histogram of time waiting for spawn tokens.
gitaly_spawn_token_queue_length: The current length of the queue waiting for spawn tokens.

This MR also removes the gitaly_command_spawn_token_acquiring_seconds_total metric. That metric is a counter that sums all the time spent waiting for spawn tokens, broken down by grpc_service and then cmd. That breakdown doesn't make sense, because spawn token acquisition is process-global: the waiting time doesn't correlate with the command or service. In addition, a counter is not a good type for this measurement, because a running total hides how individual waits are distributed. It is replaced by the new metrics.

Along the way, this MR removes the global state behind spawn tokens. That functionality was introduced several years ago and relies on global variables, which makes it very hard to test and extend. This MR introduces SpawnTokenManager to encapsulate that functionality.
