GithubImporter: Cache imported relation after the importer runs instead of when the work is scheduled

Problem Definition

Currently, Gitlab::GithubImport::ParallelScheduling#parallel_import marks a relation as imported as soon as its job is scheduled. This can cause some relations to be skipped if Sidekiq, for some reason, doesn't process that job.

The following discussion from !62036 (merged) should be addressed:

  • @stanhu started a discussion:

    In the parallel scheduling case, does this mean that we mark the pull request review as imported after the ImportPullRequestReviewWorker job is scheduled, and not when it actually completes successfully? If so, are we okay with that? For example, I could see the Sidekiq job hitting a rate limit. Would we leave it up to the retry mechanism to finish?
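The failure mode can be illustrated with a small in-memory simulation. This is only a sketch: `ImportedCache`, `MarkOnScheduleScheduler`, and `requeued_after_lost_jobs` are hypothetical names invented for the example, and a plain `Set` stands in for the real cache. It shows that once IDs are marked at scheduling time, a later scheduling pass skips them even though no job ever ran.

```ruby
require 'set'

# Hypothetical stand-in for the "already imported" cache.
class ImportedCache
  def initialize
    @ids = Set.new
  end

  def add(id)
    @ids.add(id)
  end

  def include?(id)
    @ids.include?(id)
  end
end

# Hypothetical scheduler that marks a relation as imported
# at scheduling time, before any job has actually run.
class MarkOnScheduleScheduler
  attr_reader :queue

  def initialize(cache)
    @cache = cache
    @queue = []
  end

  def schedule(ids)
    ids.each do |id|
      next if @cache.include?(id)

      @queue << id
      @cache.add(id)
    end
  end
end

# Schedules the IDs, drops the queued jobs without running them,
# then schedules again and returns what the second pass enqueued.
def requeued_after_lost_jobs(ids)
  scheduler = MarkOnScheduleScheduler.new(ImportedCache.new)

  scheduler.schedule(ids)
  scheduler.queue.clear # simulate jobs that are never processed

  scheduler.schedule(ids)
  scheduler.queue.dup
end
```

In this sketch the second pass enqueues nothing, so the lost relations are never imported.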

Proposed Solution

Instead of marking a relation as imported after scheduling its job, the relation should only be marked as imported after the import is complete. Something like:

module Gitlab
  module GithubImport
    module Importer
      class SomeRelationImporter
        def execute
          # import!
        ensure
          # `ensure` runs whether or not the import above raised
          cache_imported
        end

        # ...

        # Adds the relation ID to the set that ParallelScheduling checks before scheduling a job
        def cache_imported
          cache_key = ParallelScheduling::ALREADY_IMPORTED_CACHE_KEY % {
            project: project.id,
            collection: :relation_name
          }

          Gitlab::Cache::Import::Caching.set_add(cache_key, relation.id)
        end
      end
    end
  end
end
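A self-contained way to experiment with this idea is sketched below. It is not the GitLab implementation: `SketchImporter`, `IMPORTED`, `imported_after_run?`, and the cache key string are hypothetical, and an in-memory `Hash` of `Set`s stands in for `Gitlab::Cache::Import::Caching`. Note one deliberate difference from the snippet above: this variant calls `cache_imported` only when `import!` returns without raising, rather than from an `ensure` block, so a failed import stays unmarked and can be picked up again.

```ruby
require 'set'

# Hypothetical in-memory stand-in for the import cache.
IMPORTED = Hash.new { |hash, key| hash[key] = Set.new }

# Hypothetical importer that caches the relation ID only after import! returns.
class SketchImporter
  def initialize(relation_id, cache_key:, fail_import: false)
    @relation_id = relation_id
    @cache_key = cache_key
    @fail_import = fail_import
  end

  def execute
    import!
    cache_imported # not reached if import! raises
  end

  private

  # Simulates the actual import work, optionally failing.
  def import!
    raise 'rate limited' if @fail_import
  end

  def cache_imported
    IMPORTED[@cache_key].add(@relation_id)
  end
end

# Runs the importer once and reports whether the relation was cached.
def imported_after_run?(relation_id, fail_import:)
  key = 'sketch/already-imported/1/pull_request_reviews'

  begin
    SketchImporter.new(relation_id, cache_key: key, fail_import: fail_import).execute
  rescue RuntimeError
    # A real worker would leave the retry to its job framework.
  end

  IMPORTED[key].include?(relation_id)
end
```

Whether to mark on success only or in an `ensure` block is a design choice: marking on success keeps failed relations eligible for rescheduling, while `ensure` guarantees each relation is attempted at most once per scheduling pass.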