GithubImporter: Cache imported relation after the importer runs instead of when the work is scheduled
Problem Definition
Currently, Gitlab::GithubImport::ParallelScheduling#parallel_import marks a relation as imported as soon as its job is scheduled. This can cause some relations to be skipped if Sidekiq, for some reason, never processes that job.
The following discussion from !62036 (merged) should be addressed:
- @stanhu started a discussion: In the parallel scheduling case, does this mean that we mark the pull request review as imported after the ImportPullRequestReviewWorker job is scheduled, and not when it actually completes successfully? If so, are we okay with that? For example, I could see the Sidekiq job hitting a rate limit. Would we leave it up to the retry mechanism to finish?
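The failure mode described above can be reproduced with a small in-memory model. This is only a sketch: `ScheduleTimeMarking`, its array queue, and its `Set` cache are hypothetical stand-ins for the scheduler, the Sidekiq queue, and the "already imported" cache, not GitLab's actual classes.

```ruby
require 'set'

# Hypothetical in-memory model; these are not GitLab's actual classes.
class ScheduleTimeMarking
  attr_reader :queue, :already_imported

  def initialize
    @queue = []                  # stands in for the Sidekiq queue
    @already_imported = Set.new  # stands in for the "already imported" cache
  end

  # Mirrors the current behaviour: the ID is cached as soon as the job is enqueued.
  def schedule(ids)
    ids.each do |id|
      next if already_imported.include?(id)

      queue << id
      already_imported << id
    end
  end
end

scheduler = ScheduleTimeMarking.new
scheduler.schedule([1, 2, 3])

# Suppose the job for relation 2 is lost before a worker picks it up.
scheduler.queue.delete(2)
processed = scheduler.queue.dup
scheduler.queue.clear

# A second scheduling pass skips every ID, including the one that never ran.
scheduler.schedule([1, 2, 3])
puts "processed: #{processed.inspect}, rescheduled: #{scheduler.queue.inspect}"
```

In this model relation 2 is never processed and never rescheduled, because the cache already claims it was imported.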
Proposed Solution
Instead of marking a relation as imported after scheduling its job, the relation should only be marked as imported after the import is complete. Something like:
module Gitlab
  module GithubImport
    module Importer
      class SomeRelationImporter
        def execute
          # import!
        ensure
          cache_imported
        end

        # ...

        def cache_imported
          cache_key = ParallelScheduling::ALREADY_IMPORTED_CACHE_KEY % {
            project: project.id,
            collection: :relation_name
          }

          Gitlab::Cache::Import::Caching.set_add(cache_key, relation.id)
        end
      end
    end
  end
end
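To see how the proposal changes the outcome, here is a minimal in-memory sketch. `CompletionTimeMarking`, `pending`, and `import` are hypothetical names used only for illustration, and a `Set` stands in for the cache. One thing the sketch makes visible: because the caching call sits in an `ensure` clause, the relation is cached whenever the importer runs, whether it finishes normally or raises.

```ruby
require 'set'

# Hypothetical in-memory model of the proposal; not GitLab's actual classes.
class CompletionTimeMarking
  attr_reader :already_imported

  def initialize
    @already_imported = Set.new  # stands in for the "already imported" cache
  end

  # Scheduling only filters on the cache; it no longer writes to it.
  def pending(ids)
    ids.reject { |id| already_imported.include?(id) }
  end

  # The importer caches the ID itself. Because of `ensure`, this happens
  # whether the block returns normally or raises.
  def import(id)
    yield id
  ensure
    already_imported << id
  end
end

model = CompletionTimeMarking.new
model.import(1) { |id| id }                 # job ran to completion

# The job for relation 2 is never processed, so nothing is cached for it.

begin
  model.import(3) { raise 'rate limited' }  # job ran but raised
rescue RuntimeError
  # The error still propagates to the caller; it is rescued here only
  # so the sketch can continue.
end

puts model.pending([1, 2, 3]).inspect
```

Here only relation 2 remains pending, so a later scheduling pass would pick it up again. If a failed import should also stay pending, the caching call would need to run only on success rather than in `ensure`; which behaviour is wanted is a design decision for the importer.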