Stable Shard for long-lived workers (#508) · Epics · GitLab Infrastructure Team

Stable Shard for long-lived workers

## Background

While working on [data cleanup for CI trace chunks](https://gitlab.com/gitlab-org/gitlab/-/issues/330141#note_603412446), we observed that the job frequently never completed. On further investigation, we discovered that the job regularly takes longer than the lifecycle of the pod on which it runs.

We considered two possible options in a discussion about the [Horizontal Pod Autoscaler](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13582#note_608684968):

1. extend the lifetime of the pod
2. make the worker faster

We determined that while extending the lifetime of a pod is possible by disabling auto-scaling (setting min and max replicas to the same value), it is equally possible that workers could start taking longer and longer as the system changes. For example, they may be operating on sets of data that are larger than anticipated. When we find a worker that is slow (like the `ArchiveTraceWorker`), we need to improve its ability to scale. That may not be possible overnight, so in the meantime we could house these workers on a stable shard, where we need to worry less about pods shutting down while jobs are being processed. Pods on that shard can still shut down because of deploys, intentional reconfiguration, and some rarer system operations, but reducing the frequency is very helpful.

## Proposal

- set up a shard that will have longer-lived pods by disabling auto-scaling
- migrate `ArchiveTraceWorker` to this shard to prove that it can work (also mitigating the [ongoing manual cleanup](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4914) that is required on trace chunks)
- document a process for deciding when to quarantine a worker on this shard. Necessary details include:
  - what metrics we can use to identify such workers,
  - how to bring that identification to someone's attention (automatic alerting, or manual review process),
  - any criteria used to determine whether or not we choose to quarantine the worker,
  - how to initiate work to get the worker *out* of this shard

## After this project

We need to set up the shard and move the worker onto it, but then we need to run a second project to figure out how to stop needing this stable shard. That doesn't mean fixing everything that is not working for this particular worker; it means figuring out how to get long-lived workers to cope on pods that cycle.

### Status 2021-07-21

The project is being scoped.
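
As an illustration of the auto-scaling change mentioned in the Background section, the sketch below pins an autoscaler's minimum and maximum replica counts to the same value, which leaves it no room to add or remove pods. The `minReplicas` and `maxReplicas` keys follow the Kubernetes HorizontalPodAutoscaler spec; the helper functions, the starting values, and the replica count of 4 are hypothetical and not taken from the actual shard configuration.

```python
from copy import deepcopy


def pin_replicas(hpa_spec: dict, replicas: int) -> dict:
    """Return a copy of an HPA-style spec with min and max replicas equal."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    pinned = deepcopy(hpa_spec)
    pinned["minReplicas"] = replicas
    pinned["maxReplicas"] = replicas
    return pinned


def autoscaling_disabled(hpa_spec: dict) -> bool:
    """True when the autoscaler cannot change the number of pods."""
    return hpa_spec["minReplicas"] == hpa_spec["maxReplicas"]


# Hypothetical values for an autoscaled shard (assumption, not real config).
elastic_shard = {"minReplicas": 2, "maxReplicas": 10}

# Hypothetical stable shard: a fixed pod count, so scale-down never evicts
# a pod in the middle of a long-running job.
stable_shard = pin_replicas(elastic_shard, replicas=4)
```

Pinning the count only removes autoscaler-driven pod churn. Deploys and reconfiguration still restart pods, which is why the stable shard reduces interruptions rather than eliminating them.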
