Spike: Investigate refactoring package metadata sync throttling mechanism

Problem to solve

Package metadata sync uses a simple sleep mechanism for throttling in order to ensure it doesn't monopolize an instance's db resources: https://gitlab.com/gitlab-org/gitlab/-/blob/144c23cf655eae0de5e0baf181b374373a3e2204/ee/app/services/package_metadata/sync_service.rb#L33

Original incident: #396649 (closed).

This mechanism is insufficient because:

- The throttle rate (750 milliseconds) was chosen to keep sync usable on the most under-resourced instances. For larger instances this rate is far too conservative.
- It uses a blocking sleep, which does achieve a resource back-off but keeps the worker and the job alive for the whole wait, when it may be more efficient to stop execution and restart later.
- It is complicated and doesn't match the worker architecture used in the rest of the codebase, which creates maintenance problems and slows down review.
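
To make the first two points concrete, here is a minimal sketch of a fixed-rate blocking throttle. It is not the code in sync_service.rb: the class name, the injectable `sleeper`, and the assumption that the sleep happens after each unit of work are all illustrative. Only the 750 ms rate comes from this issue.

```ruby
# Illustrative stand-in for a sync loop with a fixed blocking throttle.
# All names here are hypothetical; only the 750 ms rate is from the issue.
class BlockingThrottleSync
  THROTTLE_RATE = 0.75 # seconds

  attr_reader :slept

  # The sleeper is injectable so the sketch can run without real delays.
  def initialize(sleeper: ->(seconds) { sleep(seconds) })
    @sleeper = sleeper
    @slept = 0.0
  end

  # Processes each batch, then blocks the calling thread for a fixed interval,
  # regardless of how loaded the database actually is.
  def execute(batches)
    batches.each do |batch|
      yield batch
      @sleeper.call(THROTTLE_RATE)
      @slept += THROTTLE_RATE
    end
  end
end

# Usage with a no-op sleeper: two batches cost 1.5 s of pure waiting.
sync = BlockingThrottleSync.new(sleeper: ->(_seconds) {})
sync.execute([[1, 2], [3]]) { |batch| batch.size }
```

The wait time grows linearly with the number of batches and the worker thread is occupied for all of it, which is the cost the proposal below aims to avoid.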

Proposal

Since the throttle was added to package metadata sync, a new mechanism has become available for throttling Sidekiq workers that are db-write-heavy: https://docs.gitlab.com/ee/development/sidekiq/#deferring-sidekiq-workers

Refactoring package metadata sync to use the "deferred" strategy would help improve performance (by making the throttle time more adaptive), would improve resource utilization for workers by not blocking the sync thread, and would make maintenance easier by being in line with codebase direction and updates.
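
As a rough illustration of the idea (not the GitLab mechanism itself, which is described at the link above), the sketch below shows a worker that returns immediately and reschedules itself when a health check fails, instead of sleeping. `FakeQueue`, `DeferredSyncWorker`, the `db_healthy` callable, and the 60-second delay are all hypothetical names and values chosen for the example.

```ruby
# Illustrative only: a tiny in-memory queue standing in for job scheduling.
class FakeQueue
  attr_reader :scheduled

  def initialize
    @scheduled = []
  end

  # Records that a worker should run again after the given delay.
  def perform_in(delay_seconds, worker_class)
    @scheduled << [delay_seconds, worker_class]
  end
end

# Hypothetical worker that defers itself rather than blocking its thread.
class DeferredSyncWorker
  DEFER_DELAY = 60 # seconds; placeholder value, not a GitLab default

  def initialize(queue:, db_healthy:)
    @queue = queue
    @db_healthy = db_healthy # callable returning true or false
  end

  # When the database is under load, reschedules and returns :deferred
  # without doing any work. Otherwise runs the given block and returns :done.
  def perform
    unless @db_healthy.call
      @queue.perform_in(DEFER_DELAY, self.class)
      return :deferred
    end

    yield if block_given?
    :done
  end
end

queue = FakeQueue.new
DeferredSyncWorker.new(queue: queue, db_healthy: -> { false }).perform
```

Compared with a blocking sleep, the thread is freed as soon as the job is deferred, and the delay only applies when the health signal says it is needed.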

This issue is intended to investigate and capture whether this strategy would work for the current sync requirements and what it would take to make the change.

Note: there is a testing issue for this feature: #414843 (closed)

Spike Criteria

This non-exhaustive list of criteria is meant as a general guide:

Number of workers: can this mechanism keep the number of workers predictable?
- The current number of workers is limited to 1 because of load on the db table: #415102
- ExclusiveLeaseGuard is used to ensure workers are locked out: https://gitlab.com/gitlab-org/gitlab/-/blob/144c23cf655eae0de5e0baf181b374373a3e2204/ee/app/workers/package_metadata/sync_worker.rb#L23
- Note: it is possible to increase the number of workers (e.g. by partitioning work by purl_type) but this is not in the scope of the spike.
Regular sync: we currently use cron to trigger a sync check every 5 minutes to keep the db up to date with any upstream changes.
- Is a form of predictable/regular sync possible with the "deferred" strategy?
Initial sync: initial instance sync vs regular sync.
- Initial sync is the synchronization of the whole dataset and, for certain instances, takes up to 24 hours.
- Regular sync is syncing any new changes that appear in the dataset.
- A database load increase is likelier on initial sync than on regular sync, so the two may be treated differently when selecting sync strategies or timing defaults.
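
For the "Number of workers" criterion, the single-worker constraint can be pictured with the sketch below. It is a simplified in-memory stand-in and does not use ExclusiveLeaseGuard or its API: `InMemoryLease`, `GuardedSync`, the lease key, and the 300-second timeout are hypothetical. Any replacement mechanism would need to preserve the same property, namely that a second run is skipped while one is in progress.

```ruby
# Simplified in-memory lease; a real lease would live in shared storage.
class InMemoryLease
  def initialize
    @mutex = Mutex.new
    @expires_at = {}
  end

  # Returns true if the lease was obtained, false if it is still held.
  def try_obtain(key, timeout:, now: Time.now)
    @mutex.synchronize do
      held_until = @expires_at[key]
      return false if held_until && held_until > now

      @expires_at[key] = now + timeout
      true
    end
  end

  def release(key)
    @mutex.synchronize { @expires_at.delete(key) }
  end
end

# Hypothetical worker wrapper that allows only one sync to run at a time.
class GuardedSync
  LEASE_KEY = "package_metadata_sync" # illustrative key
  LEASE_TIMEOUT = 300                 # seconds; placeholder value

  def initialize(lease)
    @lease = lease
  end

  # Runs the block only when the lease is free; otherwise returns :skipped.
  def perform
    return :skipped unless @lease.try_obtain(LEASE_KEY, timeout: LEASE_TIMEOUT)

    begin
      yield
      :ran
    ensure
      @lease.release(LEASE_KEY)
    end
  end
end
```

One open question for the spike is how a deferred job should interact with such a lease: if the job releases the lease when it defers, the next cron-triggered run may pick it up first.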

Intended users

Feature Usage Metrics


Edited Jul 17, 2023 by Igor Frenkel