
Set GITLAB_SIDEKIQ_SIZE_LIMITER_MODE and GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES env variables in Production

Production Change

Change Summary

Previously implemented in staging under #4451 (closed).

In gitlab-org/gitlab!53829 (merged), we introduced a Sidekiq payload size limiter that aims to prevent Sidekiq clients from dispatching an oversized job payload to Redis (which previously led to an incident, https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3404). The size limiter has already been deployed to production for some time, but it is disabled by default. The limiter has two modes (a minimal sketch of the behaviour follows the list):

  • Track mode. When enabled, the limiter sends an error event to Sentry with the precise job size. Oversized jobs are still scheduled and processed.
  • Raise mode. When enabled, oversized jobs are rejected at scheduling time.
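For intuition, the behaviour boils down to a Sidekiq client middleware that measures each job payload before it is pushed to Redis. The sketch below is illustrative only, assuming a standard Sidekiq client middleware and the sentry-raven (Raven) client; the class name PayloadSizeLimiter is hypothetical and this is not GitLab's actual implementation.

# Illustrative sketch only -- not GitLab's actual implementation; the class
# name is hypothetical. A Sidekiq client middleware measures the serialized
# job payload and either reports it or rejects it, depending on the mode.
require 'json'

class PayloadSizeLimiter
  ExceedLimitError = Class.new(StandardError)

  def initialize(mode, limit_bytes)
    @mode = mode               # 'track' or 'raise'
    @limit_bytes = limit_bytes # e.g. 100_000
  end

  def call(worker_class, job, _queue, _redis_pool)
    size = ::JSON.generate(job['args']).bytesize

    if size > @limit_bytes
      error = ExceedLimitError.new("#{worker_class} job payload is #{size} bytes")
      raise error if @mode == 'raise'

      # Track mode: report to the error tracker, but still schedule the job.
      Raven.capture_exception(error) if defined?(Raven)
    end

    yield
  end
end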

We want to set these environment variables on Staging and Production. In detail:

  • Set GITLAB_SIDEKIQ_SIZE_LIMITER_MODE to track.
  • Set GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES to 100000 (100 KB). The reasoning behind this number is explained in #4451 (comment 566543756). See the snippet after this list for how these values would be consumed.
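Assuming the limiter reads its configuration from these variables at process boot (as sketched above), the values in this change would roughly translate to the following. Again, this is illustrative and not the actual GitLab wiring:

# Illustrative only: wiring the sketch above from the environment variables
# this change sets. In track mode nothing is rejected; events only go to Sentry.
mode        = ENV.fetch('GITLAB_SIDEKIQ_SIZE_LIMITER_MODE', 'disabled')          # 'track' after this change
limit_bytes = Integer(ENV.fetch('GITLAB_SIDEKIQ_SIZE_LIMITER_LIMIT_BYTES', '0')) # 100_000 after this change

Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add PayloadSizeLimiter, mode, limit_bytes
  end
end

# Jobs enqueued from within Sidekiq go through the server process's client
# chain, so a real setup would register the same middleware inside
# Sidekiq.configure_server as well.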

After setting these variables, we expect no change in the operation of Sidekiq. Our Sentry server will start to receive ExceedLimitError events as a result.

As this change concerns the Sidekiq client, these environment variables should be set wherever a Sidekiq job can be dispatched, i.e. the Web, API, Git, and Sidekiq services (Sidekiq jobs can schedule other jobs as well).

Change Details

  1. Services Impacted - ServiceWeb ServiceAPI ServiceGit ServiceSidekiq
  2. Change Technician - @cmiskell / @qmnguyen0711
  3. Change Criticality - C3
  4. Change Type - changescheduled
  5. Change Reviewer - @cmiskell / @qmnguyen0711
  6. Due Date - 2021-05-10 03:30 UTC
  7. Time tracking - 120 minutes roll-out, 120 minutes rollback
  8. Downtime Component - None expected

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 10 mins

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 90 minutes

Post-Change Steps - steps to take to verify the change

The whole purpose of setting these environment variables is to track oversized jobs. The change shouldn't affect Sidekiq operations. We can verify the change by observing the events in Sentry.

It's likely that events will appear soon on their own. If they don't, we can verify the change by manually dispatching a job with an oversized payload from a Rails console on a web/API node (not the 'console' VM, which does not have this configuration applied):

# Each SecureRandom.hex(25000) argument is 50,000 characters, so the serialized
# payload exceeds the 100,000-byte limit.
NewIssueWorker.perform_async(SecureRandom.hex(25000), SecureRandom.hex(25000))

The console should output something of the form:

Sending event UID to Sentry
Raven HTTP Transport connecting to https://sentry.gitlab.net

indicating that an event was sent to Sentry. Note that in the implementation, the referenced objects are checked before processing; the script above passes non-existent IDs, so it is safe to run in production.

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 10 mins

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.