[Gstg | Move LLM Sidekiq workers to Gstg Shard]

In gitlab-org/gitlab#489871 (closed) we discussed possible approaches to improve the resiliency of AI actions to Sidekiq outages. We decided to explore creating a new Sidekiq shard exclusive to the Llm::CompletionWorker to isolate it from other workers. To that end, we'll follow the Creating a Sidekiq Shard runbook. Specifically, the tasks covered here are:

@nateweinshenker [ ] Modify the necessary items in [runbooks] to ensure the new shard has its own dedicated metrics. This includes at least the following:
- Add an entry in shards in metrics-catalog/services/lib/sidekiq-helpers.libsonnet
- ~~Add a line to services in dashboards/delivery/k8s_migration_overview.dashboard.jsonnet~~ (this file no longer seems to exist)
@nateweinshenker [ ] Modify the necessary items in [k8s-workloads/gitlab-helmfiles] such that logging is configured for the new shard.
- Add a new section in lib/fluentd/logging-config.yaml.
~~If necessary create a new dedicated node pool~~: We don't need a new node pool
- ~~Add in terraform; currently in environments/ENV/gke-regional.tf; generally look for the other node pool definitions and duplicate/extend~~
@alejandro: Modify [k8s-workloads/gitlab-com] to add the new Sidekiq shard by adding a new section in gitlab.sidekiq.pods with the settings determined above
- This prepares a place for the jobs to run but does not cause anything to be routed to the shard just yet. The "queues" value is the list of queues (probably just one) that this shard will listen on (used in the next step).
- Also add a new entry in the auto-deploy-image-check list.
@alejandro: Modify global.appConfig.sidekiq.routingRules in [k8s-workloads/gitlab-com] to select the jobs you want and route them to the new queue. Each rule is a two-element array: the first value selects the jobs (by name or other characteristics), and the second value is the name of the queue that the new shard is listening on.
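
For illustration, the last two tasks might look roughly like the following values fragment. This is only a sketch under assumptions, not the actual change: the shard and queue name `llm_completion` is a hypothetical placeholder, the real pod settings (replicas, resources, concurrency) are omitted, and the exact file layout in [k8s-workloads/gitlab-com] may differ. The `worker_name=` selector follows the Sidekiq worker matching query syntax.

```yaml
# Sketch only: "llm_completion" is a hypothetical name, not the agreed shard name.
gitlab:
  sidekiq:
    pods:
      # New shard: gives the jobs a place to run, routes nothing on its own.
      - name: llm_completion
        # Queue(s) this shard listens on; referenced by the routing rule below.
        queues: llm_completion

global:
  appConfig:
    sidekiq:
      routingRules:
        # [selector, queue]: send Llm::CompletionWorker jobs to the new queue.
        - ["worker_name=Llm::CompletionWorker", "llm_completion"]
        # Existing rules stay below this line (elided here).
```

Routing rules are evaluated in order and the first match wins, so the new rule would need to sit above any broader rule (such as a catch-all) that would otherwise match this worker.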

Note: After scalability#1682 is complete, we should move this new shard to use the urgent pgbouncer (see this comment).

Edited Nov 26, 2024 by Nathan Weinshenker
