Multiple SyncWorker instances running at same time


Summary

For large license package_metadata files, PackageMetadata::SyncService can run well past its ttl and does not terminate until it has finished processing the file.

Note: this does not happen often and is more likely during the initial sync and with version_format v1 because of the large number of lines in those files.

Impact

This allows other sync workers to start and begin processing data at the same time, even though the sync worker is designed to have only a single instance running at a time.

The single-worker design was chosen because of the heavy db thrashing and write lock timeouts that occurred during early testing when 2 (or more) processes were upserting data into the same table. This caused transaction cancellation exceptions and increased sync latency due to re-processing of the same dataset.

Detail

The sync service checks whether it has run out of time via signal.stop? after each processed file: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/sync_service.rb#L53

The defect arises when the time taken to process a file outlasts the lifetime of the sync lease (e.g. processing the slice takes 5 minutes vs the 4 minutes allowed).
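
For illustration only, here is a minimal sketch of the loop shape this describes, not the actual SyncService code; names such as connector, ingest_file and signal are assumptions based on the description above.

    # Hypothetical simplification of the sync loop; the real code lives in
    # ee/app/services/package_metadata/sync_service.rb.
    connector.data_after(checkpoint).each do |file|
      # Processing a single file is not interruptible: if one large file takes
      # 5 minutes, this call alone outlasts the 4-minute lease.
      ingest_file(file)

      checkpoint.update(sequence: file.sequence, chunk: file.chunk)

      # The stop check only happens here, between files, so the lease can
      # expire mid-file while another SyncWorker acquires it and starts work.
      break if signal.stop?
    end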

Steps to reproduce

n/a

Example Project

n/a

What is the current bug behavior?

There are multiple sync workers running at the same time because PackageMetadata::SyncService runs for longer than its allowed run time (4m).

What is the expected correct behavior?

SyncService should terminate close to its allowed run time.

Relevant logs and/or screenshots

This is a time slice of sync worker executions with multiple instances of PackageMetadata::SyncWorker doing work at the same time: https://nonprod-log.gitlab.net/app/r/s/t2DeO

  • jid: 7c43aaf30ce4e2cf013663c2 (start at 14:00, fail at 14:12:44)
  • jid: d6189c6bece691a5e8c4bdce (start at 14:05, fail at 14:12:04)

Note: the failure happens because lease.ttl returns nil rather than a number.

This is a log showing the files which were evaluated by the sync service: https://nonprod-log.gitlab.net/app/r/s/PpFWO

The last file is the one that starts at 14:06 and finishes just before the job with jid d6189c6bece691a5e8c4bdce terminates at 14:12:04.

Output of checks

Results of GitLab environment info

n/a

Results of GitLab application Check

n/a

Possible fixes

The error occurs because a file may take more than MAX_SYNC_DURATION seconds to process: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/workers/package_metadata/sync_worker.rb#L10

The result is that the lease times out while the service is still doing work.

The side-effect is that another worker will start doing the same work at the same time, which is exactly what this system is meant to avoid.

Note: it is not possible to fix this defect by simply moving the stop check into the file ingestion block and leaving the block early, because we cannot create a checkpoint for a file that has not been fully processed. This means the sync service might never get past this point if the file's processing time is always greater than the lease ttl.
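
To make the problem concrete, a hedged sketch of that naive early-exit (not proposed code; each_slice and ingest_slice are placeholders for the file ingestion internals):

    # Naive fix: check the stop signal inside the file ingestion block.
    file.each_slice do |slice|
      ingest_slice(slice)
      break if signal.stop?  # leave the file early when time runs out
    end
    # No checkpoint can be written for the half-processed file, so the next
    # run starts at the same file from the beginning; if that file always
    # takes longer than the lease ttl, sync never progresses past it.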

Several fixes are possible:

  1. Extend max time sync service is allowed to run.
    • Pros: Simple.
    • Cons:
      • Jobs are generally discouraged from running longer than 5 minutes.
      • Sidekiq may restart after a deploy, and the longer the job runs the higher the chance that the grace period given to Sidekiq jobs on deployment expires. Once the job is killed, no new SyncWorker started after the Sidekiq restart can begin syncing until the killed job's MAX_SYNC_DURATION has expired.
      • It's not easy to predict how much time may be needed for processing.
  2. Store a file cursor along with sequence and chunk in the checkpoint. Sync will resume at the cursor position if it is set.
    • Pros: allows checkpointing after each file slice
    • Cons: complicates the syncing logic quite a bit; currently the connectors always start at the file after the checkpoint, but they would now have to return the file at the checkpoint as well.
  3. Renew the lease after each slice until the file is processed (see the sketch after this list).
    • Pros:
      • Simpler fix than point 2 above.
      • More adaptive than the fix in point 1 in that the lease renewal can be incremental and doesn't have to always take the full amount of time. This means that it won't lock out other workers in the case of a job being killed.
    • Cons:
      • Jobs are generally discouraged from running longer than 5 minutes.
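
A sketch of what option 3 might look like, assuming the worker passes its exclusive lease into the service and that the lease object exposes a renew-style call; lease.renew and each_slice are placeholders for whatever the real APIs provide.

    file.each_slice do |slice|
      ingest_slice(slice)

      # Extend the lease by a small increment after each slice rather than
      # reserving the full MAX_SYNC_DURATION up front, so a killed job does
      # not lock out new workers for the whole duration.
      lease.renew
    end

The renewal window effectively becomes one slice's worth of work, which keeps the lock-out period small if the process dies mid-file.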

Implementation plan

  • Update PackageMetadata::Checkpoint to add a line attribute.
    • Type int.
    • Default 0.
    • Add migration.
  • Update SyncService to record the line position when a stop signal is received
    • Track the line position in the slice processing block.
    • Move the stop check into the slice processing block.
    • If a stop signal is received, update the checkpoint with the current line position and break out of execution.
      • checkpoint.update(sequence: file.sequence, chunk: file.chunk, line: line_number)
    • Update existing checkpoint update to set line to 0 (since processing is done).
  • Update SyncService to resume line processing from checkpoint.line (a rough sketch follows below)
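
A rough, hedged sketch of the plan above. The line attribute and model come from this issue, but the table name, migration version, slice iteration and signal API are assumptions about the surrounding code rather than the actual implementation.

    # Migration sketch: add the line cursor to the checkpoints table
    # (table name assumed; adjust to the real one).
    class AddLineToPackageMetadataCheckpoints < Gitlab::Database::Migration[2.1]
      def change
        add_column :pm_checkpoints, :line, :integer, null: false, default: 0
      end
    end

And the resume logic in SyncService, again as a sketch:

    # Track the line position per slice, checkpoint it when a stop signal is
    # received, and reset it to 0 once the file has been fully processed.
    def ingest_file(file)
      file.each_slice(from_line: checkpoint.line) do |slice, line_number| # hypothetical API
        ingest_slice(slice)

        if signal.stop?
          # Persist progress inside the file so the next run resumes here.
          checkpoint.update(sequence: file.sequence, chunk: file.chunk, line: line_number)
          return :stopped
        end
      end

      # File fully processed: reset the cursor for the next file.
      checkpoint.update(sequence: file.sequence, chunk: file.chunk, line: 0)
      :done
    end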