Unexpected interruption of big project imports
Problem statement
Big project imports get interrupted before they can finish.
This appears to happen because Sidekiq workers (the catchall shard on .com) shut down, likely due to Kubernetes pod scaling, pushing the job back onto the queue. The job then restarts, gets interrupted once more, and finally fails.
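For context, here is a minimal sketch of the requeue-on-shutdown behaviour described above, assuming the interruption-counting model used by the sidekiq-reliable-fetcher gem (the limit of 3 is that gem's documented default; the helper and variable names here are hypothetical):

```ruby
require 'json'

# Conceptual sketch only, not GitLab's actual code: when a worker pod shuts
# down, each in-flight job is pushed back to its queue with an incremented
# interruption counter, and dropped once the counter exceeds a limit.
MAX_RETRIES_AFTER_INTERRUPTION = 3

def requeue_on_shutdown(job, queue)
  job['interrupted_count'] = job.fetch('interrupted_count', 0) + 1

  if job['interrupted_count'] > MAX_RETRIES_AFTER_INTERRUPTION
    # Interrupted too many times: give up instead of retrying forever.
    puts "dropping #{job['jid']} after #{job['interrupted_count']} interruptions"
  else
    # Push the job back so another pod can pick it up from the start.
    queue << JSON.generate(job)
  end
end

queue = []
job = { 'jid' => '3997f8b833699759042632c9', 'class' => 'RepositoryImportWorker' }
4.times { requeue_on_shutdown(job, queue) }
# Requeued on the first three interruptions, dropped on the fourth.
```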
Example: see the logs below for a project import attempt - 1.2 GB tar.gz, 2.5 GB decompressed, 1 GB git bundle + 1.5 GB of JSON data.
JID `3997f8b833699759042632c9`, in case you want to check the logs yourself.
- First attempt is interrupted after 45 minutes
- Second attempt is interrupted after ~1 hour
- Third attempt fails immediately due to a missing import file (unrelated to this issue)
The repository import alone for this particular example, a single Gitaly RPC call, takes 95 minutes on .com. The total import time is probably double that.
Third attempt (interrupt count: 2)
Expected behaviour
Project import, while not the fastest or most efficient process, should have an opportunity to finish, and should only be killed/interrupted when one of the following holds:
- job expiration has been reached (the current value is 15 hours: http://gitlab.com/gitlab-org/gitlab/blob/master/app/workers/repository_import_worker.rb#L12-12)
- the job is hard-stuck (which does not seem to be the case here)
- the job exceeds resource usage limits
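As a rough sketch of how such an expiration window can be declared on a worker (illustrative only; see the linked repository_import_worker.rb for the real definition, and note that `IMPORT_JOBS_EXPIRATION` is an assumed constant name here):

```ruby
require 'sidekiq'

class RepositoryImportWorker
  include Sidekiq::Worker

  # Jobs running/stuck longer than this are considered expired and may be
  # reaped by a separate stuck-job cron worker.
  IMPORT_JOBS_EXPIRATION = 15 * 60 * 60 # 15 hours, in seconds

  sidekiq_options status_expiration: IMPORT_JOBS_EXPIRATION

  def perform(project_id)
    # Import logic elided.
  end
end
```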
Success Criterion
After discussion, we've determined that breaking this worker out into its own shard is warranted. Let's ensure all monitoring is in place, that any documentation related to Sidekiq shards is updated, and that the worker is moved from the old shard to the new shard.
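For illustration, moving a worker between shards boils down to Sidekiq queue routing rules plus a new deployment listening on the new queue. A hypothetical sketch in Omnibus `gitlab.rb` syntax (the .com change lives in the Kubernetes Helm values instead, and the shard/queue name `project_import` is an assumption):

```ruby
# Each routing rule is a [worker-matching query, destination queue] pair;
# the first matching rule wins.
sidekiq['routing_rules'] = [
  # Send the import worker to a dedicated queue served by the new shard.
  ['name=repository_import', 'project_import'],
  # Everything else keeps going to the catchall default queue.
  ['*', 'default']
]
```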
Milestones
- Monitoring: gitlab-com/runbooks!3648 (merged)
- Start up a new shard: gitlab-com/gl-infra/k8s-workloads/gitlab-com!930 (merged)
- Remove the worker from the old shard: gitlab-com/gl-infra/k8s-workloads/gitlab-com!949 (merged)
- Documentation: TODO