Concurrency issue with GitHub importer
Every now and then, we see an issue with the GitHub importer on GitLab.com that results in slow (and cancelled) queries: gitlab-com/gl-infra/production#571 (closed)
From the logs, we can tell that all jobs that import a particular model (e.g. issues) end up acquiring a lock on internal_ids
for that project and model type:
```
2018-11-15_15:26:50.99094 2018-11-15 15:26:50 GMT [40636]: [94-1] STATEMENT: SELECT "internal_ids".* FROM "internal_ids" WHERE "internal_ids"."id" = 1193270 LIMIT 1 FOR UPDATE
2018-11-15_15:26:51.05358 2018-11-15 15:26:51 GMT [38418]: [70-1] LOG: duration: 14654.838 ms execute <unnamed>: SELECT "internal_ids".* FROM "internal_ids" WHERE "internal_ids"."id" = 1193270 LIMIT 1 FOR UPDATE
2018-11-15_15:26:51.05364 2018-11-15 15:26:51 GMT [19948]: [68-1] LOG: process 19948 acquired AccessExclusiveLock on tuple (4988,79) of relation 79504434 of database 16385 after 7149.702 ms
```
Background
When we kick off a GitHub import, we basically retrieve the relevant objects from GitHub (in an iterative fashion) and schedule one background (Sidekiq) job per object to insert the object into the database. This is done in https://gitlab.com/gitlab-org/gitlab-ce/blob/master/lib/gitlab/github_import/parallel_scheduling.rb#L67.
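As a rough sketch of that scheduling loop (`FakeWorker` and `schedule_import` are illustrative stand-ins, not the actual importer API):

```ruby
# Simplified model of parallel_scheduling.rb: iterate over objects fetched
# from GitHub and enqueue one background job per object. FakeWorker stands
# in for the real Sidekiq worker class; it just records what was enqueued.
class FakeWorker
  @queue = []
  class << self
    attr_reader :queue

    def perform_async(project_id, attributes)
      queue << [project_id, attributes]
    end
  end
end

def schedule_import(project_id, objects)
  objects.each do |object|
    # One job per object - this is what later makes the jobs contend for
    # the same internal_ids row when they all insert concurrently.
    FakeWorker.perform_async(project_id, object)
  end
end

schedule_import(42, [{ iid: 1, title: 'First issue' }, { iid: 2, title: 'Second issue' }])
```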
FWIW, the importer is optimized for performance in the sense that it completely disables any model hooks and uses batch inserts to persist models into the database.
Now, InternalId introduced consistent management of internal id values per project. This logic is normally triggered from a model hook, but that is not the case with the GitHub importer. Instead, we manually trigger the logic after each insert - which needs to acquire a lock on the record in internal_ids (the scope is project and usage, i.e. issues for the project).
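To illustrate why this locks: the internal id logic boils down to a locked read-increment-write on one counter per (project, usage) pair. A minimal in-memory analogue, with a Mutex standing in for the `SELECT ... FOR UPDATE` row lock (the class name and shape are illustrative, not the real InternalId implementation):

```ruby
# In-memory analogue of the internal_ids mechanism. Each (project, usage)
# pair owns a counter; generating the next iid takes a lock, reads the
# last value, increments it and writes it back - analogous to the
# SELECT ... FOR UPDATE against the internal_ids row.
class FakeInternalId
  def initialize
    @counters = Hash.new(0)
    @lock = Mutex.new # stands in for the row-level lock
  end

  # All concurrent callers for the same scope serialize on the lock,
  # which is exactly what the importer jobs end up doing.
  def generate(project_id, usage)
    @lock.synchronize do
      key = [project_id, usage]
      @counters[key] += 1
    end
  end
end

ids = FakeInternalId.new
threads = 8.times.map do
  Thread.new { 5.times { ids.generate(1, :issues) } }
end
threads.each(&:join)
```

Even with eight concurrent "jobs", every increment passes through the one lock - ids stay unique, but the work is serialized.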
The locking on internal_ids basically serializes model creation (e.g. for issues) for a project. In general this is required to achieve consistency for internal ids (there is a unique constraint on e.g. (project_id, iid) on issues - so we need a mechanism to generate unique internal ids). Serializing model creation on that lock is, however, counter-productive with the concurrency model of the GitHub importer (which schedules one job per object to import).
Proposal
A solution is to remove the internal id tracking from the individual jobs and only do this once per import (and model type) at the end of the process. All relevant models (issues, MRs, milestones) already get an iid value assigned from GitHub anyway, so it is ok to assume those are unique and skip the InternalId mechanic.
This comes with the downside that during the import process, it might not be possible to create new issues, milestones or MRs through GitLab, because the counter in internal_ids is not aware of the instances created by the importer. So basically we're giving up consistency with regard to internal ids while importing from GitHub.
Implementation-wise, we can simply delete the record in internal_ids after the GitHub importer has finished, for all model types (usages). This is safe to do in general, and for any new issues etc. the record is going to be re-initialized correctly. After the GitHub import has finished, those records only exist if a user manually created e.g. an issue while the import was running.
So basically two steps:
- Delete the relevant records in internal_ids after the GitHub importer finished (if any exist)
- Remove individual tracking of internal id values from the GitHub importer
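The two steps could be sketched like this - a self-contained in-memory model, assuming (as InternalId does) that a deleted counter is re-initialized from the largest existing iid; `FakeInternalIdStore` and its method names are illustrative, not the real GitLab code:

```ruby
# Illustrative sketch of the proposal: after the import finishes, drop the
# per-(project, usage) counter records so they get re-initialized (from the
# maximum existing iid) on the next regular model creation.
class FakeInternalIdStore
  def initialize
    @records = {}
  end

  def generate(project_id, usage, existing_iids)
    key = [project_id, usage]
    # Re-initialize from the largest iid already present (e.g. the ones
    # the importer inserted with GitHub-assigned iids), then increment.
    @records[key] ||= existing_iids.max || 0
    @records[key] += 1
  end

  # Step 1 of the proposal: delete the counter records once the import
  # for this project has finished.
  def delete_for_project(project_id)
    @records.delete_if { |(pid, _usage), _value| pid == project_id }
  end
end

store = FakeInternalIdStore.new
store.generate(1, :issues, []) # => 1, counter initialized
# ... importer bulk-inserts issues with GitHub-assigned iids 2..100,
# bypassing the store entirely (step 2 of the proposal) ...
store.delete_for_project(1)
next_iid = store.generate(1, :issues, (1..100).to_a) # re-initialized => 101
```

The deletion makes the stale counter harmless: the next regular issue creation rebuilds it from the imported data instead of handing out an iid that already exists.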