Ci::Runners::StaleMachinesCleanupCronWorker is not keeping up with stale runner managers

The Ci::Runners::StaleManagersCleanupService service is supposed to delete all stale runner managers, up to a maximum of 1000 per day. It does so by repeatedly running an SQL statement such as the following, which deletes up to 100 records at a time, until no more records are deleted or at least 1000 have been deleted:

DELETE FROM "ci_runner_machines"
WHERE "ci_runner_machines"."id" IN (
    SELECT "ci_runner_machines"."id"
    FROM ((
        SELECT "ci_runner_machines".*
        FROM "ci_runner_machines"
        WHERE "ci_runner_machines"."contacted_at" IS NULL)
      UNION ALL (
        SELECT "ci_runner_machines".*
        FROM "ci_runner_machines"
        WHERE "ci_runner_machines"."contacted_at" <= '2023-05-12 11:27:55.541974')) ci_runner_machines
    WHERE "ci_runner_machines"."created_at" <= '2023-05-12 11:27:55.541882'
    LIMIT 100)
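
The loop around that statement can be sketched in plain Ruby as follows. This is only an illustration, not the actual service code: the in-memory array stands in for the ci_runner_machines table, and delete_sub_batch, cleanup and SUB_BATCH_SIZE are hypothetical names. The values 1000 and 100 are taken from the description and the LIMIT clause above.

```ruby
# Hypothetical stand-ins: MAX_DELETIONS mirrors the daily cap of 1000 and
# SUB_BATCH_SIZE mirrors the LIMIT 100 in the SQL statement.
MAX_DELETIONS = 1000
SUB_BATCH_SIZE = 100

# Removes up to SUB_BATCH_SIZE ids from the array (standing in for one DELETE
# statement) and returns how many were removed.
def delete_sub_batch(stale_ids)
  stale_ids.shift(SUB_BATCH_SIZE).size
end

# Repeats sub-batches until nothing is deleted or the cap has been reached.
def cleanup(stale_ids)
  total = 0
  while total < MAX_DELETIONS
    deleted = delete_sub_batch(stale_ids)
    break if deleted.zero?

    total += deleted
  end
  total
end
```

With a backlog larger than the cap, for example cleanup((1..76_000).to_a), the sketch stops after 1000 deletions, which is the behaviour this issue is questioning.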

Looking at the production Kibana logs (snapshot), we can see that the cron worker has been running as expected and reports that records have been deleted.

However, the number of stale runner managers is now at ~76K records out of a total of 332K, and the number seems to keep growing.

We can see that in just over 1 day, around 2K runner managers were either deleted or contacted GitLab (and so are no longer considered stale).

It could be that deleting only 1K runner managers per day is simply not enough for a large installation such as .com.
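
As a rough back-of-the-envelope check, the time needed to clear the backlog can be estimated as below. The helper days_to_drain is hypothetical, and the rate at which new runner managers become stale is not measured in this issue, so it is left as an input that defaults to zero (the most optimistic case).

```ruby
# Estimates how many days it takes to clear a backlog of stale runner managers.
# new_stale_per_day is an assumed input; this issue does not quantify it.
# Returns nil when deletions do not outpace new stale records.
def days_to_drain(backlog, deletions_per_run, runs_per_day, new_stale_per_day = 0)
  net_per_day = deletions_per_run * runs_per_day - new_stale_per_day
  return nil if net_per_day <= 0

  (backlog.to_f / net_per_day).ceil
end

days_to_drain(76_000, 1_000, 1)        # => 76 (daily run, no new stale records)
days_to_drain(76_000, 1_000, 24)       # => 4  (hourly run, same per-run cap)
days_to_drain(76_000, 1_000, 1, 1_500) # => nil (backlog keeps growing)
```

Even with no new stale records at all, a single daily run capped at 1K deletions would need about 76 days to work through the current ~76K records.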

Proposals

Change the service so that it reports the number of records deleted, now that we're using id_in before calling delete_all.
Change the worker schedule so that it runs hourly instead of daily on .com. This would both provide more frequent feedback and slow the growth of stale runner managers.
Perhaps log an attribute containing the individual counts of deleted runner managers per sub-batch.
Remove Ci::Runners::StaleManagersCleanupService::MAX_DELETIONS limit.
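
A minimal sketch of how the total and the per-sub-batch counts could be reported together, assuming the same sub-batch loop described above. The method name, keyword arguments and result keys are hypothetical, and the array again stands in for the table. Passing Float::INFINITY as max_deletions models the last proposal (no limit).

```ruby
# Hypothetical sketch: runs sub-batches and records how many records each one
# deleted, so the worker could log both the total and the individual counts.
def cleanup_with_counts(stale_ids, max_deletions: 1000, sub_batch_size: 100)
  batch_counts = []
  while batch_counts.sum < max_deletions
    deleted = stale_ids.shift(sub_batch_size).size
    break if deleted.zero?

    batch_counts << deleted
  end
  { total_deleted: batch_counts.sum, batch_counts: batch_counts }
end

cleanup_with_counts((1..250).to_a)
# => { total_deleted: 250, batch_counts: [100, 100, 50] }
```

A final sub-batch smaller than the sub-batch size would then be visible in the logs as a sign that the backlog was fully cleared in that run.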

Edited May 20, 2023 by Pedro Pombeiro