[DRAFT] Help understanding how garbage collection and repacks are triggered
Support Request for the Gitaly Team
The goal is to keep these requests public. However, if customer information is required in the support request, please be sure to mark this issue as confidential.
This request template is part of Gitaly Team's intake process.
Author Checklist
- Fill out customer information section
- Provide a detailed summary under Additional Information
- Severity realistically set
- Provided detailed problem description
- Provided detailed troubleshooting performed
- Clearly articulated what is needed from the Gitaly team to support your request by filling out the What specifically do you need from the Gitaly team section
Customer Information
Salesforce Link: https://gitlab.my.salesforce.com/0018X000038OVPPQA4
Zendesk Ticket: https://gitlab.zendesk.com/agent/tickets/518262
Installation Size: ~13k Seats
Architecture Information:
Sized per the 25k reference architecture; however, they currently only have a single Gitaly node. They are working towards Gitaly Cluster.
Additional Information: We have de-escalated the situation by manually intervening and repacking the affected project, but are now working towards making sure it remains healthy.
Support Request
Severity
Context
This is a follow-up on Clarification on cgroups tuning for overloaded Gitaly. The overload we investigated was caused by a poorly optimized project, which caused pack-objects and other Git processes running on it to pile up and consume huge amounts of CPU cycles. This was solved by manually invoking repository.optimize_repository in a Rails console, since all other attempts were being terminated before the job could finish.
With the crisis averted, we are now looking into why Housekeeping wasn't running regularly to prevent the issue from happening, and how to make sure it runs when expected from now on.
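For reference, this is roughly what the manual intervention looked like. This is a hedged reconstruction from the Rails console; the RepositoryService wrapper call is our assumption, and exact call paths vary between GitLab versions:

project = Project.find_by_full_path('mbient/meta-mbient')
# Ask Gitaly to run OptimizeRepository, which selects and runs the appropriate
# repack/gc tasks for the repository's current state.
Gitlab::GitalyClient::RepositoryService.new(project.repository.raw).optimize_repository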
Problem Description
We still don't feel confident that this problem won't reoccur. After checking their logs for traces of the GitGarbageCollectWorker for that specific mbient/meta-mbient project, we get no hits. Checking their settings for housekeeping periodicity, they seem far off from our default values, but we'd still expect the worker to trigger sometimes.
puts "full: #{Gitlab::CurrentSettings.current_application_settings.housekeeping_full_repack_period} incremental: #{Gitlab::CurrentSettings.current_application_settings.housekeeping_incremental_r
epack_period}"
full: 500 incremental: 100
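A next step we are considering is inspecting the per-project push counter these periods are compared against. A minimal Rails console sketch, assuming Project#pushes_since_gc is available and Redis-backed on this version:

# Housekeeping is expected to fire when this counter reaches a multiple of the
# configured periods; it resets after a successful run.
project = Project.find_by_full_path('mbient/meta-mbient')
puts "pushes_since_gc: #{project.pushes_since_gc}"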
We expect to find instances of incremental_repack and full_repack for that worker and project, but analysing Sidekiq logs covering several days yields nothing:
zcat */*.s | ~/bin/fast-stats -i 96h top
** Splitting input into 96 hour increments **
First event: 2024-04-19 20:09:27.310
Last event: 2024-04-23 07:14:04.120
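To rule out a search artifact on our side, we can also scan the Sidekiq JSON logs directly. A minimal Ruby sketch, assuming gzipped JSON-per-line log files; the file glob and field names are our assumptions:

require 'json'
require 'zlib'

# Print any Sidekiq job line mentioning the GC worker for the affected project.
Dir.glob('*/*.gz').each do |path|
  Zlib::GzipReader.open(path) do |gz|
    gz.each_line do |line|
      event = JSON.parse(line) rescue next
      next unless event['class'].to_s.include?('GitGarbageCollectWorker')
      puts line if line.include?('mbient/meta-mbient')
    end
  end
end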
Troubleshooting Performed
We tried to determine if/when the job was executed during the past weeks: essentially, to find out when the algorithm should decide it's time for a full repack, and whether it's working correctly and managing to execute and finish as expected.
Testing of Housekeeping behavior based on the number of pushes:
pushes  action
99      nothing
100     incremental
200     incremental
300     incremental
400     incremental
500     gc (does that actually do a full repack??)
1       nothing
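To make our reading concrete, here is the behaviour from the table expressed as a small Ruby function. This is a sketch of what we observed, not GitLab's actual implementation; the function name and the reset-after-gc assumption are ours:

# Hypothetical selection function reproducing the table above; periods taken
# from this instance's settings (full: 500, incremental: 100).
def observed_task(pushes, full_repack_period: 500, incremental_repack_period: 100)
  return :nothing if pushes.zero?
  if (pushes % full_repack_period).zero?
    :gc # seen at 500 pushes; unclear to us whether this implies a full repack
  elsif (pushes % incremental_repack_period).zero?
    :incremental_repack
  else
    :nothing
  end
end

# The counter appears to reset after gc, hence the trailing "1 -> nothing" row.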
What specifically do you need from the Gitaly team
We need help understanding whether we are going in the right direction here, and whether our understanding of the algorithm is correct.
- Any ideas on why the incremental/full repacks are not running automatically, or why we can't find the expected entries in the logs even after sufficient pushes have been made.
/cc @mjwood @andrashorvath @jcaigitlab @john.mcdonnell @gerardo