Ci::Runners::StaleMachinesCleanupCronWorker is not keeping up with stale runner managers
The Ci::Runners::StaleManagersCleanupService service is supposed to eventually delete all stale runner managers, removing at most 1000 of them per daily run. It does so by running an SQL statement such as the following repeatedly, until no more records are deleted or at least 1000 have been deleted:
DELETE FROM "ci_runner_machines"
WHERE "ci_runner_machines"."id" IN (
SELECT "ci_runner_machines"."id"
FROM ((
SELECT "ci_runner_machines".*
FROM "ci_runner_machines"
WHERE "ci_runner_machines"."contacted_at" IS NULL)
UNION ALL (
SELECT "ci_runner_machines".*
FROM "ci_runner_machines"
WHERE "ci_runner_machines"."contacted_at" <= '2023-05-12 11:27:55.541974')) ci_runner_machines
WHERE "ci_runner_machines"."created_at" <= '2023-05-12 11:27:55.541882'
LIMIT 100)
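The loop described above can be sketched in plain Ruby. This is only an in-memory illustration, not the actual service code: the `Manager` struct, the `cleanup_stale_managers` method, and the single `cutoff` argument are hypothetical stand-ins (the real statement uses two nearly identical timestamps), while the sub-batch size of 100 and the cap of 1000 are taken from the statement and description above.

```ruby
# Hypothetical in-memory stand-in for a row of the ci_runner_machines table.
Manager = Struct.new(:id, :created_at, :contacted_at)

SUB_BATCH_SIZE = 100   # mirrors LIMIT 100 in the SQL statement
MAX_DELETIONS  = 1000  # per-run cap described in this issue

# Deletes stale rows in sub-batches and returns the total number deleted.
# Stops when a sub-batch finds nothing to delete or the cap is reached.
def cleanup_stale_managers(table, cutoff)
  total = 0
  while total < MAX_DELETIONS
    # Same predicate as the SQL: never contacted, or last contacted at or
    # before the cutoff, and created at or before the cutoff.
    stale_ids = table
      .select { |m| (m.contacted_at.nil? || m.contacted_at <= cutoff) && m.created_at <= cutoff }
      .first(SUB_BATCH_SIZE)
      .map(&:id)
    break if stale_ids.empty?

    table.reject! { |m| stale_ids.include?(m.id) }
    total += stale_ids.size
  end
  total
end
```

With a table holding more than 1000 stale rows, a single call removes exactly 1000 of them and leaves the rest for the next run, which is the behaviour this issue is questioning.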
By looking at the production Kibana logs (snapshot), we can see that the cron worker has been running as expected and that it reports that records have been deleted:
However, the number of stale runner managers is now at ~76K records from a total of 332K, and the number seems to keep growing:
We can see that in just over 1 day, around 2K runner machines were either deleted or contacted GitLab (no longer being considered stale).
It could be that deleting only 1K runner managers per day is simply not enough for a large installation such as .com.
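A rough back-of-the-envelope calculation illustrates the concern. The backlog (~76K) and the per-run cap (1000) come from this issue; the `days_to_clear` helper is hypothetical, and it optimistically assumes that no new stale runner managers appear and that an hourly schedule would keep the same per-run cap.

```ruby
# Days needed to clear a backlog, assuming it does not grow in the meantime.
def days_to_clear(backlog, deletions_per_run, runs_per_day)
  (backlog.to_f / (deletions_per_run * runs_per_day)).ceil
end

puts days_to_clear(76_000, 1_000, 1)   # daily schedule  → 76
puts days_to_clear(76_000, 1_000, 24)  # hourly schedule → 4
```

Even under these optimistic assumptions, the daily schedule would need more than two months to drain the current backlog, and in practice the backlog appears to be growing.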
Proposals
- Change the service so that it reports the number of records deleted, now that we're using id_in before calling delete_all.
- Change the worker schedule so that it runs hourly instead of daily on .com. This would both provide more frequent feedback and reduce the growth of stale runner managers.
- Perhaps log an attribute containing the individual counts of deleted runner managers per sub-batch.
- Remove the Ci::Runners::StaleManagersCleanupService::MAX_DELETIONS limit.
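As a possible shape for the reporting proposals above, the following plain-Ruby sketch collects the count from each sub-batch and returns both the total and the individual counts, so that a worker could log them. The `run_cleanup` method, its `max_deletions` keyword, and the result keys are hypothetical names, not the existing service API; the block stands in for a single sub-batch delete and must return the number of rows it removed. Passing `max_deletions: nil` corresponds to removing the cap.

```ruby
require 'json'

# Calls the block until it reports zero deletions or max_deletions is reached
# (nil means no cap). Returns the total and the per-sub-batch counts.
def run_cleanup(max_deletions: 1000)
  counts = []
  loop do
    deleted = yield
    break if deleted.zero?

    counts << deleted
    break if max_deletions && counts.sum >= max_deletions
  end
  { total_deleted: counts.sum, sub_batch_counts: counts }
end

# Example: a fake backlog of 250 rows, deleted up to 100 at a time.
backlog = 250
result = run_cleanup do
  batch = [backlog, 100].min
  backlog -= batch
  batch
end
puts JSON.generate(result)  # {"total_deleted":250,"sub_batch_counts":[100,100,50]}
```

Logging the per-sub-batch counts alongside the total would make it visible whether a run stopped because it hit the cap or because it ran out of stale records.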