[DRAFT] Help understanding how garbage collection and repacks are triggered
Support Request for the Gitaly Team
The goal is to keep these requests public. However, if customer information is required in the support request, please be sure to mark this issue as confidential.
This request template is part of Gitaly Team's intake process.
Author Checklist
- Fill out customer information section
- Provide a detailed summary under Additional Information
- Severity realistically set
- Provided detailed problem description
- Provided detailed troubleshooting performed
- Clearly articulated what is needed from the Gitaly team to support your request by filling out the What specifically do you need from the Gitaly team section
Customer Information
Salesforce Link: https://gitlab.my.salesforce.com/0018X000038OVPPQA4
Zendesk Ticket: https://gitlab.zendesk.com/agent/tickets/518262
Installation Size: ~13k Seats
Architecture Information:
Sized per the 25k reference architecture; however, they currently only have a single Gitaly node. They are working towards Gitaly Cluster.
Additional Information: We have de-escalated the situation by manually intervening and repacking the affected project, but are now working towards making sure it remains healthy.
Support Request
Severity
Context
This is a follow-up on Clarification on cgroups tuning for overloaded Gitaly. The overload we investigated was caused by a poorly optimized project, which caused pack-objects and other Git processes running on it to pile up and consume huge amounts of CPU cycles. This was solved by manually invoking repository.optimize_repository in a Rails console, since all other attempts were being terminated before the job could finish.
With the crisis averted, we are now looking into why Housekeeping wasn't running regularly to prevent the issue from happening, and how to make sure it runs when expected from now on.
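For reference, this is roughly what the manual intervention looked like. This is a hedged reconstruction from the Rails console; the RepositoryService wrapper call is our assumption, and exact call paths vary between GitLab versions:

project = Project.find_by_full_path('mbient/meta-mbient')
# Ask Gitaly to run OptimizeRepository, which selects and runs the appropriate
# repack/gc tasks for the repository's current state.
Gitlab::GitalyClient::RepositoryService.new(project.repository.raw).optimize_repository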
Problem Description
We still don't feel confident that this problem won't reoccur. After checking their logs for traces of the GitGarbageCollectWorker for that specific mbient/meta-mbient project, we get no hits. Checking their settings for housekeeping periodicity, they seem far off from our default values, but we'd still expect the worker to trigger sometimes.
puts "full: #{Gitlab::CurrentSettings.current_application_settings.housekeeping_full_repack_period} incremental: #{Gitlab::CurrentSettings.current_application_settings.housekeeping_incremental_r
epack_period}"
full: 500 incremental: 100
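A next step we are considering is inspecting the per-project push counter these periods are compared against. A minimal Rails console sketch, assuming Project#pushes_since_gc is available and Redis-backed on this version:

# Housekeeping is expected to fire when this counter reaches a multiple of the
# configured periods; it resets after a successful run.
project = Project.find_by_full_path('mbient/meta-mbient')
puts "pushes_since_gc: #{project.pushes_since_gc}"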
We expect to find instances of incremental_repack and full_repack for that worker and project, but analysing Sidekiq logs covering several days yields nothing:
zcat */*.s | ~/bin/fast-stats -i 96h top
** Splitting input into 96 hour increments **
First event: 2024-04-19 20:09:27.310
Last event: 2024-04-23 07:14:04.120
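To rule out a search artifact on our side, we can also scan the Sidekiq JSON logs directly. A minimal Ruby sketch, assuming gzipped JSON-per-line log files; the file glob and field names are our assumptions:

require 'json'
require 'zlib'

# Print any Sidekiq job line mentioning the GC worker for the affected project.
Dir.glob('*/*.gz').each do |path|
  Zlib::GzipReader.open(path) do |gz|
    gz.each_line do |line|
      event = JSON.parse(line) rescue next
      next unless event['class'].to_s.include?('GitGarbageCollectWorker')
      puts line if line.include?('mbient/meta-mbient')
    end
  end
end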
Troubleshooting Performed
We tried to determine if/when the job was executed during the past weeks: essentially, to find out when the algorithm should decide it's time for a full repack, and whether it's working correctly and managing to execute and finish as expected.
Testing of Housekeeping behavior based on the number of pushes:
pushes  action
99      nothing
100     incremental
200     incremental
300     incremental
400     incremental
500     gc (does that actually do a full repack??)
1       nothing
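To make our reading concrete, here is the behaviour from the table expressed as a small Ruby function. This is a sketch of what we observed, not GitLab's actual implementation; the function name and the reset-after-gc assumption are ours:

# Hypothetical selection function reproducing the table above; periods taken
# from this instance's settings (full: 500, incremental: 100).
def observed_task(pushes, full_repack_period: 500, incremental_repack_period: 100)
  return :nothing if pushes.zero?
  if (pushes % full_repack_period).zero?
    :gc # seen at 500 pushes; unclear to us whether this implies a full repack
  elsif (pushes % incremental_repack_period).zero?
    :incremental_repack
  else
    :nothing
  end
end

# The counter appears to reset after gc, hence the trailing "1 -> nothing" row.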
What specifically do you need from the Gitaly team
We need help understanding whether we are going in the right direction here, and whether our understanding of the algorithm is correct.
- Any ideas on why the incremental/full repacks are not running automatically, or why we can't find the expected entries in the logs even after sufficient pushes have been made.
/cc @mjwood @andrashorvath @jcaigitlab @john.mcdonnell @gerardo