Multiple SyncWorker instances running at the same time
### Summary

For large license `package_metadata` files, `PackageMetadata::SyncService` can run well past its TTL and does not terminate until it has finished processing the file.

Note: this does not happen often and is more likely during an initial sync and with `version_format` v1 because of the large number of lines in those files.
### Impact

This allows other sync workers to start and begin processing data at the same time, even though the sync worker is designed to have a single instance running at a time.

The single-worker design was chosen because of the heavy db thrashing and write-lock timeouts that occurred during early testing when 2 (or more) processes were upserting data into the same table. This caused transaction cancellation exceptions and increased sync latency by forcing re-processing of the same dataset.
### Detail

The sync service checks whether it has run out of time by calling `signal.stop?` after each processed file: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/sync_service.rb#L53

The defect arises when the time taken to process a file outlasts the lifetime of the sync (e.g. processing the slice takes 5 minutes vs the 4 minutes allowed).
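The timing problem can be sketched with plain Ruby stubs (`StopSignal` and `FakeClock` below are illustrative stand-ins, not the actual GitLab classes): because the stop signal is only consulted between files, a single slow file pushes total runtime well past the deadline.

```ruby
# Stand-in for the service's stop signal: reports "stop" once the
# elapsed time reaches the allowed run time.
class StopSignal
  def initialize(deadline_seconds, clock)
    @deadline = deadline_seconds
    @clock = clock
  end

  def stop?
    @clock.elapsed >= @deadline
  end
end

# Fake clock so the sketch runs instantly instead of sleeping.
class FakeClock
  attr_reader :elapsed

  def initialize
    @elapsed = 0
  end

  def advance(seconds)
    @elapsed += seconds
  end
end

clock = FakeClock.new
signal = StopSignal.new(240, clock) # 4-minute allowed run time, as in the issue

# Per-file processing times in seconds; the last file takes 5 minutes.
files = [60, 90, 300]

files.each do |processing_time|
  clock.advance(processing_time) # processing a file cannot be interrupted...
  break if signal.stop?          # ...because the stop check only runs between files
end

clock.elapsed # => 450, well past the 240s deadline
```

The loop does stop, but only after the 300-second file has already run to completion, which is exactly the overrun described above.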
### Steps to reproduce

n/a
### Example Project

n/a
### What is the current bug behavior?

Multiple sync workers run at the same time because `PackageMetadata::SyncService` runs for longer than its allowed run time (4m).
### What is the expected correct behavior?

`SyncService` should terminate close to its allowed run time.
### Relevant logs and/or screenshots

This is a time slice of sync worker executions with multiple instances of `PackageMetadata::SyncWorker` doing work at the same time: https://nonprod-log.gitlab.net/app/r/s/t2DeO

- jid `7c43aaf30ce4e2cf013663c2` (start at 14:00, fail at 14:12:44)
- jid `d6189c6bece691a5e8c4bdce` (start at 14:05, fail at 14:12:04)

Note: the failure happens because `lease.ttl` returns `nil` rather than a number.

This is a log showing the files which were evaluated by the sync service: https://nonprod-log.gitlab.net/app/r/s/PpFWO

The last file starts at 14:06 and finishes just before the job with jid `d6189c6bece691a5e8c4bdce` terminates at 14:12:04.
### Output of checks

#### Results of GitLab environment info

n/a

#### Results of GitLab application Check

n/a
### Possible fixes

The error occurs because a file may take more than `MAX_SYNC_DURATION` seconds to process: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/workers/package_metadata/sync_worker.rb#L10

This means the lease can time out while the service is still doing work.

- The file is processed in this block: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/sync_service.rb#L46-49
- The stop check is done right after: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/sync_service.rb#L53

The side-effect is that another worker will start doing the same work at the same time, which is exactly what this system is meant to avoid.

Note: it is not possible to fix this defect simply by moving the stop check into the file ingestion block and leaving the block early, because we cannot create a checkpoint for a file that is not fully processed. The sync service might then never get past this point if the file's processing time is always greater than the lease TTL.
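A small sketch of that livelock, using made-up constants rather than the real service code: if the file always needs longer than the TTL and an early bail-out records no checkpoint, no run ever makes progress.

```ruby
LEASE_TTL = 240      # seconds the service is allowed to run
FILE_TIME = 300      # this file always needs longer than the TTL

checkpoint_written = false
attempts = 0

3.times do
  attempts += 1
  elapsed = FILE_TIME            # the file is only partially ingested at the TTL
  if elapsed > LEASE_TTL
    next                         # bail out early, but with no checkpoint...
  end
  checkpoint_written = true      # ...so progress is never recorded
end

checkpoint_written # => false after every attempt: the same file restarts forever
```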
Several fixes are possible:

1. Extend the max time the sync service is allowed to run.
   - Pros: Simple.
   - Cons:
     - Jobs are generally discouraged from running longer than 5 minutes.
     - Sidekiq may restart after a deploy, and the grace period given to Sidekiq jobs on deployment has a higher chance of expiring. Once the job is killed, no new sync worker started after the Sidekiq restart can begin syncing until the killed job's `MAX_SYNC_DURATION` has expired.
     - It's not easy to predict how much time may be needed for processing.
2. Store a file cursor along with `sequence` and `chunk` in the checkpoint. Sync will resume at the cursor position if it is set.
   - Pros: allows checkpointing after each file slice.
   - Cons: complicates the syncing logic quite a bit; currently the connectors always start at the file after the checkpoint, whereas they would now have to return the file at the checkpoint.
3. Renew the lease after each slice until the file is processed.
   - Pros:
     - Simpler fix than option 2 above.
     - More adaptive than option 1 in that the lease renewal can be incremental and doesn't always have to take the full amount of time. This means it won't lock out other workers in the case of a job being killed.
   - Cons:
     - Jobs are generally discouraged from running longer than 5 minutes.
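The lease-renewal idea (option 3) can be sketched as follows; the `Lease` and `FakeClock` classes here are stand-ins for illustration, not GitLab's actual exclusive-lease API. Each slice fits comfortably within the TTL even though the whole file does not, so renewing between slices keeps the lease alive for exactly as long as work is happening.

```ruby
# Stand-in lease: expires a fixed TTL after the last renewal.
class Lease
  def initialize(ttl, clock)
    @ttl = ttl
    @clock = clock
    renew
  end

  def renew
    @expires_at = @clock.elapsed + @ttl
  end

  def expired?
    @clock.elapsed >= @expires_at
  end
end

# Fake clock so the sketch runs instantly instead of sleeping.
class FakeClock
  attr_reader :elapsed

  def initialize
    @elapsed = 0
  end

  def advance(seconds)
    @elapsed += seconds
  end
end

clock = FakeClock.new
lease = Lease.new(240, clock) # 4-minute TTL, as in the issue

# One large file split into 10 slices of 60s each: 600s total,
# far more than a single 240s lease would allow.
expired_during_processing = false
10.times do
  clock.advance(60)                        # process one slice
  expired_during_processing ||= lease.expired?
  lease.renew                              # renew before the next slice
end

expired_during_processing # => false: the lease outlives the whole file
```

A nice property of this shape is that if the job is killed mid-file, the lease lapses one TTL after the last renewal rather than one `MAX_SYNC_DURATION` after job start, so other workers are locked out for less time.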
### Implementation plan

1. Update `PackageMetadata::Checkpoint` to add a `line` attribute.
   - Type: `int`.
   - Default: `0`.
   - Add a migration.
2. Update `SyncService` to record the line position when `stop` is received.
   - Track the line position in the slice processing block.
   - Move the stop check into the slice processing block.
   - If `stop` is received, update the checkpoint with the current line position and break out of execution: `checkpoint.update(sequence: file.sequence, chunk: file.chunk, line: line_number)`
   - Update the existing checkpoint update to set `line` to `0` (since processing is done).
3. Update `SyncService` to resume line processing from `checkpoint.line`.
   - Add `next if line_number < checkpoint.line` to the slice processing block.
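The plan above can be sketched with plain Ruby stubs; the `Checkpoint` struct and `process_file` helper are hypothetical stand-ins for the real models (the real code would persist via `checkpoint.update(...)` rather than plain assignment). A first run is stopped mid-file, records its line position, and a second run resumes there without re-ingesting earlier slices.

```ruby
# In-memory stand-in for PackageMetadata::Checkpoint with the new `line` column.
Checkpoint = Struct.new(:sequence, :chunk, :line)

# Process a file's lines in slices of 2, honoring a (simulated) stop signal
# and resuming from checkpoint.line. Returns true if the file completed.
def process_file(lines, checkpoint, ingested, stop_after: nil)
  lines.each_slice(2).with_index do |slice, slice_index|
    line_number = slice_index * 2
    next if line_number < checkpoint.line      # resume: skip already-ingested slices

    ingested.concat(slice)                     # ingest this slice

    if stop_after && line_number + slice.size >= stop_after
      # stop received: record progress so the next run resumes here
      checkpoint.line = line_number + slice.size
      return false
    end
  end
  checkpoint.line = 0                          # file done: reset line for the next file
  true
end

lines = %w[l0 l1 l2 l3 l4 l5]
checkpoint = Checkpoint.new(1, 1, 0)
ingested = []

first_run  = process_file(lines, checkpoint, ingested, stop_after: 4) # stopped early
second_run = process_file(lines, checkpoint, ingested)                # resumes at line 4

ingested # => ["l0", "l1", "l2", "l3", "l4", "l5"], each line ingested exactly once
```

Resetting `line` to `0` on completion matters: without it, the next file (which reuses the checkpoint row) would wrongly skip its own leading slices.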