feat(gc): allow gc to back off on intermittent failures

Hayley Swimelar requested to merge 1238/fix-gc-backoff into master

What does this MR do?

This MR adds a cooldown period of 30 minutes after the GC agent encounters an error. During this cooldown period, the agent continues to back off exponentially even after successful runs. This change allows GC workers to respond to periods of intermittent failures, such as high database or object storage latency. Without it, successful jobs interleaved between errors reset the backoff, preventing workers from responding to errors, as in the issue linked below.
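
Below is a minimal sketch of that behavior in Go. The type, field names, and constants are illustrative assumptions, not the registry's actual implementation; the property it demonstrates is that successful runs inside the cooldown window keep growing the backoff instead of resetting it.

```go
// Sketch only: names and constants are assumptions, not this MR's code.
package main

import (
	"errors"
	"fmt"
	"time"
)

type backoff struct {
	base       time.Duration // starting sleep between runs
	max        time.Duration // upper bound on the sleep
	multiplier float64       // growth factor per backoff step
	cooldown   time.Duration // e.g. 30 minutes after the last error
	lastErr    time.Time     // zero value means "never failed"
	current    time.Duration
}

// next returns how long the agent should sleep before its next run.
// Key property: while we are inside the cooldown window after an error,
// even successful runs keep growing the backoff, so successes interleaved
// between intermittent failures cannot reset it.
func (b *backoff) next(err error) time.Duration {
	if err != nil {
		b.lastErr = time.Now()
	}
	inCooldown := !b.lastErr.IsZero() && time.Since(b.lastErr) < b.cooldown
	if err == nil && !inCooldown {
		// Healthy and past the cooldown window: reset to the base sleep.
		b.current = b.base
		return b.current
	}
	// Failing, or still cooling down: grow exponentially up to max.
	if b.current < b.base {
		b.current = b.base
	}
	b.current = time.Duration(float64(b.current) * b.multiplier)
	if b.current > b.max {
		b.current = b.max
	}
	return b.current
}

func main() {
	b := &backoff{base: 500 * time.Millisecond, max: 2 * time.Hour,
		multiplier: 2.5, cooldown: 30 * time.Minute}
	for i := 1; i <= 6; i++ {
		var err error
		if i == 2 { // a single error on the second run
			err = errors.New("storage timeout")
		}
		fmt.Printf("run %d: sleep %v\n", i, b.next(err))
	}
}
```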

In a clustered environment, these changes should gradually reduce the work rate across the entire pool of workers, without taking too many workers offline at once or for too long. In a single-registry environment with a single agent per job type, the gradual backoff, combined with postponing reviews of particular tasks and the relatively short cooldown period, should prevent GC from stopping completely under persistent errors.

The following chart shows the backoff durations that the current cooldown period and multiplier generate. The first case is GC running without errors and always finding the next job. The second case represents the backoff times after encountering a single error. The last case represents continuous backing off, caused either by errors or by not finding the next job. (A small simulation sketch follows the chart.)

Iterations   No Error        One Error          Continuous Errors
1            377.485033ms    427.887727ms       641.117084ms
2            813.08158ms     977.629776ms       1.556249301s
3            946.130319ms    2.392266076s       2.909844249s
4            1.172958952s    6.20768436s        6.847235092s
5            1.244545135s    16.553962728s      23.599521089s
6            1.208952421s    44.885179936s      1m0.462385718s
7            671.374511ms    2m16.834594027s    2m0.397381534s
8            846.136993ms    5m38.305472907s    6m13.408775202s
9            989.698681ms    10m5.59854422s     11m8.80770274s
10           1.101810288s    36m56.278106718s   27m16.637612362s
11           1.166856437s    1.262375171s       1h42m34.022734572s
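
For intuition about the three cases, a hypothetical driver over the earlier sketch is shown below. Note the One Error column resetting to roughly 1.26s at iteration 11: presumably the accumulated sleeps have by then exceeded the 30 minute cooldown, so the next successful run resets the backoff. The driver reproduces only the shape of the columns, not the exact values, since the chart's numbers come from a real run with jitter and real elapsed time, neither of which this instant simulation models.

```go
// Hypothetical driver over the backoff sketch above (same file).
// fails reports whether iteration i should simulate an error.
func printCase(name string, fails func(i int) bool) {
	b := &backoff{base: 400 * time.Millisecond, max: 2 * time.Hour,
		multiplier: 2.5, cooldown: 30 * time.Minute}
	fmt.Println(name)
	for i := 1; i <= 11; i++ {
		var err error
		if fails(i) {
			err = errors.New("simulated failure")
		}
		fmt.Printf("  %2d  %v\n", i, b.next(err))
	}
}

func printAllCases() {
	printCase("No Error", func(i int) bool { return false })
	printCase("One Error", func(i int) bool { return i == 1 })
	printCase("Continuous Errors", func(i int) bool { return true })
}
```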

This chart shows the same cases with the default multiplication factor. The higher multiplier introduced in this MR allows workers to back off faster when encountering problems, without impacting workers that are running continuously without errors. (A hypothetical encoding of the two factors follows the chart.)

Iterations   No Error        One Error          Continuous Errors
1            405.630693ms    453.110756ms       633.344943ms
2            694.006847ms    567.5832ms         754.358943ms
3            889.206589ms    1.029550075s       1.102643646s
4            685.507366ms    2.171233437s       1.699873203s
5            1.03701481s     2.179187595s       3.089507172s
6            765.001584ms    5.01148054s        3.637405975s
7            927.245791ms    6.505289737s       6.096550714s
8            800.533981ms    9.175209708s       9.642728637s
9            835.54655ms     11.85710504s       12.221096116s
10           1.22024873s     21.762579974s      16.944430412s
11           1.173875646s    28.845372026s      25.320037127s
12           673.113884ms    51.588951837s      37.763223682s
13           694.003864ms    45.535294785s      54.546139526s
14           1.253378347s    1m22.131976461s    1m19.496075627s
15           1.096441418s    3m6.349051944s     3m11.825252528s
16           705.507838ms    4m41.304025096s    4m40.222583981s
17           1.048171639s    5m34.314083124s    6m44.455971541s
18           944.868247ms    7m53.103450477s    7m59.200569259s
19           1.269092664s    11m53.014253227s   13m39.353311474s
20           1.099705852s    1.231029703s       13m45.30020798s
21           1.227072977s    1.011153833s       30m47.895501365s
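
One way to read the difference between the two charts is that the MR applies a steeper growth factor while a worker is inside the error cooldown, leaving the default factor in place for workers that are merely idle. A hypothetical encoding, with factor values inferred loosely from the charts' growth rates rather than taken from this MR:

```go
// Illustrative only: the concrete values are assumptions, not this MR's constants.
func growthFactor(inErrorCooldown bool) float64 {
	if inErrorCooldown {
		return 2.5 // steeper backoff while a recent error is cooling down
	}
	return 1.5 // default factor when merely not finding the next job
}
```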

Related to Improve handling of timeout bursts during onlin... (#1238 - closed)

Author checklist

  • Feature flags
    • Added feature flag:
    • This feature does not require a feature flag
  • I added unit tests or they are not required
  • I added documentation (or it's not required)
  • I followed code review guidelines
  • I followed Go Style guidelines
  • For database changes including schema migrations:
    • Manually run up and down migrations in a postgres.ai production database clone and post a screenshot of the result here.
    • If adding new queries, extract a query plan from postgres.ai and post the link here. If changing existing queries, also extract a query plan for the current version for comparison.
      • I do not have access to postgres.ai and have made a comment on this MR asking for these to be run on my behalf.
    • Do not include code that depends on the schema migrations in the same commit. Split the MR into two or more.
  • Ensured this change is safe to deploy to individual stages in the same environment (cny -> prod). State-related changes can be troublesome due to having parts of the fleet processing (possibly related) requests in different ways.

Reviewer checklist

  • Ensure the commit and MR title are still accurate.
  • If the change contains a breaking change, apply the breaking change label.
  • If the change is considered high risk, apply the label high-risk-change
  • Identify if the change can be rolled back safely. (note: all other reasons for not being able to roll back will be sufficiently captured by major version changes).

If the MR introduces database schema migrations:

  • Ensure the commit and MR title start with fix:, feat:, or perf: so that the change appears on the Changelog
If the changes cannot be rolled back, follow these steps:
  • Apply the label cannot-rollback.
  • Add a section to the MR description that includes the following details:
    • The reasoning behind why a release containing the presented MR cannot be rolled back (e.g. schema migrations or changes to the FS structure)
    • Detailed steps to revert/disable a feature introduced by the same change where a migration cannot be rolled back. (note: ideally MRs containing schema migrations should not contain feature changes.)
    • Ensure this MR does not add code that depends on these changes that cannot be rolled back.