
Set GITLAB_SIDEKIQ_SIZE_LIMITER_MODE and GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES env variables in Production

Production Change

Change Summary

Previously implemented in staging under #4451 (closed).

In gitlab-org/gitlab!53829 (merged), we introduced a Sidekiq payload size limiter that aims to prevent Sidekiq clients from dispatching an oversized job payload to Redis (which previously led to an incident, https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3404). The size limiter has already been deployed to production for some time, but it is disabled by default. The limiter has two modes (a minimal sketch of the behaviour follows the list):

  • Track mode. When enabled, the limiter sends an error event to Sentry with the precise job size. Oversized jobs are still scheduled and processed.
  • Raise mode. When enabled, oversized jobs are rejected at scheduling time.
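For intuition, the behaviour boils down to a Sidekiq client middleware that measures each job payload before it is pushed to Redis. The sketch below is illustrative only, assuming a standard Sidekiq client middleware and the sentry-raven (Raven) client; the class name PayloadSizeLimiter is hypothetical and this is not GitLab's actual implementation.

# Illustrative sketch only -- not GitLab's actual implementation; the class
# name is hypothetical. A Sidekiq client middleware measures the serialized
# job payload and either reports it or rejects it, depending on the mode.
require 'json'

class PayloadSizeLimiter
  ExceedLimitError = Class.new(StandardError)

  def initialize(mode, limit_bytes)
    @mode = mode               # 'track' or 'raise'
    @limit_bytes = limit_bytes # e.g. 100_000
  end

  def call(worker_class, job, _queue, _redis_pool)
    size = ::JSON.generate(job['args']).bytesize

    if size > @limit_bytes
      error = ExceedLimitError.new("#{worker_class} job payload is #{size} bytes")
      raise error if @mode == 'raise'

      # Track mode: report to the error tracker, but still schedule the job.
      Raven.capture_exception(error) if defined?(Raven)
    end

    yield
  end
end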

We want to set these environment variables on Staging and Production. In detail:

  • Set GITLAB_SIDEKIQ_SIZE_LIMITER_MODE to track.
  • Set GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES to 100000 (100 KB). The reasoning behind this number is explained in #4451 (comment 566543756). See the snippet after this list for how these values would be consumed.
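Assuming the limiter reads its configuration from these variables at process boot (as sketched above), the values in this change would roughly translate to the following. Again, this is illustrative and not the actual GitLab wiring:

# Illustrative only: wiring the sketch above from the environment variables
# this change sets. In track mode nothing is rejected; events only go to Sentry.
mode        = ENV.fetch('GITLAB_SIDEKIQ_SIZE_LIMITER_MODE', 'disabled')          # 'track' after this change
limit_bytes = Integer(ENV.fetch('GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES', '0')) # 100_000 after this change

Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add PayloadSizeLimiter, mode, limit_bytes
  end
end

# Jobs enqueued from within Sidekiq go through the server process's client
# chain, so a real setup would register the same middleware inside
# Sidekiq.configure_server as well.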

After setting these variables, we expect no change in the operation of Sidekiq. Our Sentry server will start to receive ExceedLimitError events as a result.

As this change concerns the Sidekiq client, these environment variables should be set wherever a Sidekiq job can be dispatched, i.e. the Web, API, Git, and Sidekiq services (Sidekiq jobs can schedule other jobs as well).

Change Details

  1. Services Impacted - ServiceWeb ServiceAPI ServiceGit ServiceSidekiq
  2. Change Technician - @cmiskell / @qmnguyen0711
  3. Change Criticality - C3
  4. Change Type - changescheduled
  5. Change Reviewer - @cmiskell / @qmnguyen0711
  6. Due Date - 2021-05-10 03:30 UTC
  7. Time tracking - 120 minutes roll-out, 120 minutes rollback
  8. Downtime Component - None expected

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 10 mins

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 90 minutes

Post-Change Steps - steps to take to verify the change

The whole purpose of setting these environment variables is to track oversized jobs. The change shouldn't affect Sidekiq operations. We can verify the change by observing the events in Sentry.

It's likely that events will appear soon on their own. If they don't, we can verify the change by manually dispatching a job with an oversized payload from a Rails console on a web/API node (not the 'console' VM, which does not have this configuration applied):

# Each SecureRandom.hex(25000) argument is 50,000 characters, so the serialized
# payload exceeds the 100,000-byte limit.
NewIssueWorker.perform_async(SecureRandom.hex(25000), SecureRandom.hex(25000))

The console should output something of the form:

Sending event UID to Sentry
Raven HTTP Transport connecting to https://sentry.gitlab.net

indicating that an event was sent to Sentry. Note that in the implementation, the referenced objects are checked before processing; the script above passes non-existent IDs, so it is safe to run in production.

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 10 mins

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.