feat(gc): allow gc to back off on intermittent failures

Hayley Swimelar requested to merge 1238/fix-gc-backoff into master

What does this MR do?

This MR adds a cooldown period of 30 minutes after the GC agent encounters an error. During this cooldown period, the agent continues to back off exponentially even after successful runs. This change allows GC workers to respond to periods of intermittent failures, such as high database or object storage latency. Without it, successful jobs interleaved between errors reset the backoff, preventing workers from responding to errors, as in the issue linked below.
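
Below is a minimal sketch of that behavior in Go. The type, field names, and constants are illustrative assumptions, not the registry's actual implementation; the property it demonstrates is that successful runs inside the cooldown window keep growing the backoff instead of resetting it.

```go
// Sketch only: names and constants are assumptions, not this MR's code.
package main

import (
	"errors"
	"fmt"
	"time"
)

type backoff struct {
	base       time.Duration // starting sleep between runs
	max        time.Duration // upper bound on the sleep
	multiplier float64       // growth factor per backoff step
	cooldown   time.Duration // e.g. 30 minutes after the last error
	lastErr    time.Time     // zero value means "never failed"
	current    time.Duration
}

// next returns how long the agent should sleep before its next run.
// Key property: while we are inside the cooldown window after an error,
// even successful runs keep growing the backoff, so successes interleaved
// between intermittent failures cannot reset it.
func (b *backoff) next(err error) time.Duration {
	if err != nil {
		b.lastErr = time.Now()
	}
	inCooldown := !b.lastErr.IsZero() && time.Since(b.lastErr) < b.cooldown
	if err == nil && !inCooldown {
		// Healthy and past the cooldown window: reset to the base sleep.
		b.current = b.base
		return b.current
	}
	// Failing, or still cooling down: grow exponentially up to max.
	if b.current < b.base {
		b.current = b.base
	}
	b.current = time.Duration(float64(b.current) * b.multiplier)
	if b.current > b.max {
		b.current = b.max
	}
	return b.current
}

func main() {
	b := &backoff{base: 500 * time.Millisecond, max: 2 * time.Hour,
		multiplier: 2.5, cooldown: 30 * time.Minute}
	for i := 1; i <= 6; i++ {
		var err error
		if i == 2 { // a single error on the second run
			err = errors.New("storage timeout")
		}
		fmt.Printf("run %d: sleep %v\n", i, b.next(err))
	}
}
```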

In a clustered environment, these changes should gradually reduce the work rate across the entire pool of workers, without taking too many workers offline at once or for too long. In a single-registry environment with a single agent per job type, the gradual backoff, combined with postponing reviews of particular tasks and the relatively short cooldown period, should prevent GC from stopping completely under persistent errors.

The following chart shows the backoff durations that the current cooldown period and multiplier generate. The first case is GC running without errors and always finding the next job. The second case represents the backoff times after encountering a single error. The last case represents continuous backing off, caused either by errors or by not finding the next job. (A small simulation sketch follows the chart.)

Iterations   No Error        One Error          Continuous Errors
1            377.485033ms    427.887727ms       641.117084ms
2            813.08158ms     977.629776ms       1.556249301s
3            946.130319ms    2.392266076s       2.909844249s
4            1.172958952s    6.20768436s        6.847235092s
5            1.244545135s    16.553962728s      23.599521089s
6            1.208952421s    44.885179936s      1m0.462385718s
7            671.374511ms    2m16.834594027s    2m0.397381534s
8            846.136993ms    5m38.305472907s    6m13.408775202s
9            989.698681ms    10m5.59854422s     11m8.80770274s
10           1.101810288s    36m56.278106718s   27m16.637612362s
11           1.166856437s    1.262375171s       1h42m34.022734572s
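
For intuition about the three cases, a hypothetical driver over the earlier sketch is shown below. Note the One Error column resetting to roughly 1.26s at iteration 11: presumably the accumulated sleeps have by then exceeded the 30 minute cooldown, so the next successful run resets the backoff. The driver reproduces only the shape of the columns, not the exact values, since the chart's numbers come from a real run with jitter and real elapsed time, neither of which this instant simulation models.

```go
// Hypothetical driver over the backoff sketch above (same file).
// fails reports whether iteration i should simulate an error.
func printCase(name string, fails func(i int) bool) {
	b := &backoff{base: 400 * time.Millisecond, max: 2 * time.Hour,
		multiplier: 2.5, cooldown: 30 * time.Minute}
	fmt.Println(name)
	for i := 1; i <= 11; i++ {
		var err error
		if fails(i) {
			err = errors.New("simulated failure")
		}
		fmt.Printf("  %2d  %v\n", i, b.next(err))
	}
}

func printAllCases() {
	printCase("No Error", func(i int) bool { return false })
	printCase("One Error", func(i int) bool { return i == 1 })
	printCase("Continuous Errors", func(i int) bool { return true })
}
```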

This chart shows the same cases with the default multiplication factor. The higher multiplier introduced in this MR allows workers to back off faster when encountering problems, without impacting workers that are running continuously without errors. (A hypothetical encoding of the two factors follows the chart.)

Iterations   No Error        One Error          Continuous Errors
1            405.630693ms    453.110756ms       633.344943ms
2            694.006847ms    567.5832ms         754.358943ms
3            889.206589ms    1.029550075s       1.102643646s
4            685.507366ms    2.171233437s       1.699873203s
5            1.03701481s     2.179187595s       3.089507172s
6            765.001584ms    5.01148054s        3.637405975s
7            927.245791ms    6.505289737s       6.096550714s
8            800.533981ms    9.175209708s       9.642728637s
9            835.54655ms     11.85710504s       12.221096116s
10           1.22024873s     21.762579974s      16.944430412s
11           1.173875646s    28.845372026s      25.320037127s
12           673.113884ms    51.588951837s      37.763223682s
13           694.003864ms    45.535294785s      54.546139526s
14           1.253378347s    1m22.131976461s    1m19.496075627s
15           1.096441418s    3m6.349051944s     3m11.825252528s
16           705.507838ms    4m41.304025096s    4m40.222583981s
17           1.048171639s    5m34.314083124s    6m44.455971541s
18           944.868247ms    7m53.103450477s    7m59.200569259s
19           1.269092664s    11m53.014253227s   13m39.353311474s
20           1.099705852s    1.231029703s       13m45.30020798s
21           1.227072977s    1.011153833s       30m47.895501365s
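
One way to read the difference between the two charts is that the MR applies a steeper growth factor while a worker is inside the error cooldown, leaving the default factor in place for workers that are merely idle. A hypothetical encoding, with factor values inferred loosely from the charts' growth rates rather than taken from this MR:

```go
// Illustrative only: the concrete values are assumptions, not this MR's constants.
func growthFactor(inErrorCooldown bool) float64 {
	if inErrorCooldown {
		return 2.5 // steeper backoff while a recent error is cooling down
	}
	return 1.5 // default factor when merely not finding the next job
}
```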

Related to Improve handling of timeout bursts during onlin... (#1238 - closed)

Author checklist

  • Feature flags
    • Added feature flag:
    • This feature does not require a feature flag
  • I added unit tests or they are not required
  • I added documentation (or it's not required)
  • I followed code review guidelines
  • I followed Go Style guidelines
  • For database changes including schema migrations:
    • Manually run up and down migrations in a postgres.ai production database clone and post a screenshot of the result here.
    • If adding new queries, extract a query plan from postgres.ai and post the link here. If changing existing queries, also extract a query plan for the current version for comparison.
      • I do not have access to postgres.ai and have made a comment on this MR asking for these to be run on my behalf.
    • Do not include code that depends on the schema migrations in the same commit. Split the MR into two or more.
  • Ensured this change is safe to deploy to individual stages in the same environment (cny -> prod). State-related changes can be troublesome due to having parts of the fleet processing (possibly related) requests in different ways.

Reviewer checklist

  • Ensure the commit and MR title are still accurate.
  • If the change contains a breaking change, apply the breaking change label.
  • If the change is considered high risk, apply the label high-risk-change
  • Identify if the change can be rolled back safely. (note: all other reasons for not being able to roll back will be sufficiently captured by major version changes).

If the MR introduces database schema migrations:

  • Ensure the commit and MR title start with fix:, feat:, or perf: so that the change appears on the Changelog
If the changes cannot be rolled back, follow these steps:
  • Apply the label cannot-rollback.
  • Add a section to the MR description that includes the following details:
    • The reasoning behind why a release containing the presented MR cannot be rolled back (e.g. schema migrations or changes to the FS structure)
    • Detailed steps to revert/disable a feature introduced by the same change where a migration cannot be rolled back. (note: ideally MRs containing schema migrations should not contain feature changes.)
    • Ensure this MR does not add code that depends on these changes that cannot be rolled back.