Unexpected interruption of big project imports
Problem statement
Big project imports get interrupted before they can finish.
This appears to happen because Sidekiq workers (the catchall shard on .com) shut down, likely due to Kubernetes pod scaling, pushing the job back onto the queue. The job then restarts, gets interrupted once more, and finally fails.
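For context, here is a minimal sketch of the requeue-on-shutdown behaviour described above, assuming the interruption-counting model used by the sidekiq-reliable-fetcher gem (the limit of 3 is that gem's documented default; the helper and variable names here are hypothetical):

```ruby
require 'json'

# Conceptual sketch only, not GitLab's actual code: when a worker pod shuts
# down, each in-flight job is pushed back to its queue with an incremented
# interruption counter, and dropped once the counter exceeds a limit.
MAX_RETRIES_AFTER_INTERRUPTION = 3

def requeue_on_shutdown(job, queue)
  job['interrupted_count'] = job.fetch('interrupted_count', 0) + 1

  if job['interrupted_count'] > MAX_RETRIES_AFTER_INTERRUPTION
    # Interrupted too many times: give up instead of retrying forever.
    puts "dropping #{job['jid']} after #{job['interrupted_count']} interruptions"
  else
    # Push the job back so another pod can pick it up from the start.
    queue << JSON.generate(job)
  end
end

queue = []
job = { 'jid' => '3997f8b833699759042632c9', 'class' => 'RepositoryImportWorker' }
4.times { requeue_on_shutdown(job, queue) }
# Requeued on the first three interruptions, dropped on the fourth.
```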
Example: see the logs below for a project import attempt - 1.2 GB tar.gz, 2.5 GB decompressed, 1 GB git bundle + 1.5 GB of JSON data.
JID `3997f8b833699759042632c9`, in case you want to check the logs yourself.
- First attempt is interrupted after 45 minutes
- Second attempt is interrupted after ~1 hour
- Third attempt fails immediately due to a missing import file (unrelated to this issue)
The repository import alone for this particular example, a single Gitaly RPC call, takes 95 minutes on .com. The total import time is probably double that.
Third attempt (interrupt count: 2)
Expected behaviour
Project import, while not the fastest or most efficient process, should have an opportunity to finish, and should only be killed/interrupted when one of the following holds:
- job expiration has been reached (the current value is 15 hours: http://gitlab.com/gitlab-org/gitlab/blob/master/app/workers/repository_import_worker.rb#L12-12)
- the job is hard-stuck (which does not seem to be the case here)
- the job exceeds resource usage limits
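As a rough sketch of how such an expiration window can be declared on a worker (illustrative only; see the linked repository_import_worker.rb for the real definition, and note that `IMPORT_JOBS_EXPIRATION` is an assumed constant name here):

```ruby
require 'sidekiq'

class RepositoryImportWorker
  include Sidekiq::Worker

  # Jobs running/stuck longer than this are considered expired and may be
  # reaped by a separate stuck-job cron worker.
  IMPORT_JOBS_EXPIRATION = 15 * 60 * 60 # 15 hours, in seconds

  sidekiq_options status_expiration: IMPORT_JOBS_EXPIRATION

  def perform(project_id)
    # Import logic elided.
  end
end
```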
Success Criterion
After discussion, we've determined that breaking this worker out into its own shard is warranted. Let's ensure all monitoring is in place, that any documentation related to Sidekiq shards is updated, and that the worker is moved from the old shard to the new shard.
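For illustration, moving a worker between shards boils down to Sidekiq queue routing rules plus a new deployment listening on the new queue. A hypothetical sketch in Omnibus `gitlab.rb` syntax (the .com change lives in the Kubernetes Helm values instead, and the shard/queue name `project_import` is an assumption):

```ruby
# Each routing rule is a [worker-matching query, destination queue] pair;
# the first matching rule wins.
sidekiq['routing_rules'] = [
  # Send the import worker to a dedicated queue served by the new shard.
  ['name=repository_import', 'project_import'],
  # Everything else keeps going to the catchall default queue.
  ['*', 'default']
]
```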
Milestones
- Monitoring: gitlab-com/runbooks!3648 (merged)
- Start up a new shard: gitlab-com/gl-infra/k8s-workloads/gitlab-com!930 (merged)
- Remove the worker from the old shard: gitlab-com/gl-infra/k8s-workloads/gitlab-com!949 (merged)
- Documentation: TODO