Determine how to handle workers with oversized payloads
This is a follow-up issue to #825 (comment 514946077)
From the original issue, an individual job by itself doesn't affect our Redis cluster much; a wave of big-payload jobs does. There was an incident related to this: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3404. We detected a dozen workers dispatching oversized payloads. Fixing all of them would eat up all of the Scalability team's bandwidth. Hence, we are exploring other approaches to solve this permanently.
Approach 1: Offload the oversized Sidekiq job payload to the object storage
This is a catch-'em-all solution. The idea is to implement an abstraction that handles oversized payloads automatically. In detail:
- Tweak the current Sidekiq job payload limiter (gitlab-org/gitlab!53829 (merged)) to support an offload mode (see the sketch after this list).
- In offload mode, the client middleware pushes the job payload into object storage. The job payload is replaced by a wrapper that includes only the uploaded object ID.
- Implement a server middleware to pull and restore the job payload.
- If the job finishes successfully, remove the object from storage
- If the job needs to be retried, the wrapped job hash is pushed into the retry set
- If the job fails permanently, the wrapped job hash is pushed into the interrupted set or dead set
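A minimal sketch of how the two middlewares could fit together. `PayloadStore` is a hypothetical object-storage helper (stubbed in-memory here so the example runs); the class names and the threshold are illustrative, not a final design:

```ruby
require 'json'
require 'securerandom'

# Hypothetical object-storage client, stubbed with an in-memory hash so the
# sketch is self-contained; the real implementation would talk to GCS/S3.
module PayloadStore
  STORE = {}

  def self.upload(id, data)
    STORE[id] = data
  end

  def self.download(id)
    STORE.fetch(id)
  end

  def self.delete(id)
    STORE.delete(id)
  end
end

OFFLOAD_THRESHOLD_BYTES = 5 * 1024 * 1024 # illustrative threshold

class OffloadClientMiddleware
  def call(worker_class, job, queue, redis_pool)
    serialized = JSON.generate(job['args'])

    if serialized.bytesize > OFFLOAD_THRESHOLD_BYTES
      object_id = SecureRandom.uuid
      PayloadStore.upload(object_id, serialized)
      job['args'] = []                        # strip the oversized arguments
      job['offloaded_object_id'] = object_id  # the wrapper keeps only the ID
    end

    yield
  end
end

class OffloadServerMiddleware
  def call(worker, job, queue)
    object_id = job['offloaded_object_id']
    job['args'] = JSON.parse(PayloadStore.download(object_id)) if object_id

    yield

    # Reached only on success: a raised error propagates above, so Sidekiq's
    # retry/dead handling still sees the wrapped job hash, not the payload.
    PayloadStore.delete(object_id) if object_id
  end
end
```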
This approach has several advantages:
- It is future-proof: we wouldn't need to care about this class of problem again. Instead of playing whack-a-mole with new occurrences, we would emit some metrics and monitor the overall health.
- The solution is straightforward. I like its simplicity.
- It fixes all the workers at once. The effort for this approach is less than the effort to fix all the existing workers.
- It is transparent to the application layer.
In contrast, there are some trade-offs to consider:
- More moving parts. Self-managed instances don't need this feature, but our GitLab.com SaaS has to take on more dependencies.
- A little bit harder to debug if something goes wrong.
Approach 2: Enable raise mode in Sidekiq payload size limiter and provide some utility helpers to resolve the offenses
From a team sync meeting, we came up with another approach:
- Implement some utility helpers to support manual payload offloading. Of course, we should have decent documentation and examples.
- Delegate the troublesome workers to the corresponding stage groups.
- After the top offenders are fixed, we pick a reasonable limit and enable the raise mode (gitlab-org/gitlab!53829 (merged)), sketched below. The error message (on Sentry?) should include a link to the aforementioned documentation.
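To make the raise mode concrete, here is a rough sketch. The class name, mode names, and docs link are illustrative assumptions, not the exact implementation from !53829:

```ruby
require 'sidekiq'

class SizeLimiter
  ExceedLimitError = Class.new(StandardError)

  # Placeholder URL; the real message should link to the offloading guide.
  DOCS_LINK = 'https://example.com/sidekiq-payload-offloading'

  def initialize(mode:, size_limit:)
    @mode = mode              # :track (report only) or :raise (reject the job)
    @size_limit = size_limit
  end

  def validate!(worker_class, payload)
    return if payload.bytesize <= @size_limit

    message = "#{worker_class} payload is #{payload.bytesize} bytes " \
              "(limit: #{@size_limit}). See #{DOCS_LINK}."

    # In raise mode the error surfaces (e.g. on Sentry) with the docs link;
    # in track mode we only log, so nothing is dropped yet.
    @mode == :raise ? raise(ExceedLimitError, message) : Sidekiq.logger.warn(message)
  end
end

# Usage: SizeLimiter.new(mode: :raise, size_limit: 5 * 1024 * 1024)
#          .validate!(worker_class, serialized_payload)
```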
This approach has some advantages:
- More predictable. This gives us more control over which workers should actively offload their payload.
This approach has some disadvantages:
- From my own perspective, payload-size issues are nearly impossible to detect during development. They are even hard to spot on staging. The payload size may gradually build up over time as the underlying data grows, so the responsible stage groups will likely not be aware until some jobs are forced to drop off (likely leading to another incident).
Approach 3: Compress oversized payload
From the analytics in #1054 (comment 567044321), the data (probably biased) suggests that memory is the last factor we need to worry about. Unless a huge wave of jobs of a particular type flows in, the system is fine; it will likely be saturated by other factors first. Hence, a simpler solution would be to reduce the damage instead. In gitlab-org/gitlab!59249 (comment 553666299), @smcgivern suggested that we could compress the data before pushing it into Redis. That's actually a good idea.
Ruby ships with a built-in compression library, Zlib. As all of the job payloads are JSON, the compression ratio should be impressive. Most jobs are under 50MB, which is tiny compared to other use cases that need compression. Hence, at first glance, this is a cheap, low-hanging-fruit approach we can follow.
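As a quick illustration, compressing a job payload with the standard library looks like this (the payload below is made up for demonstration):

```ruby
require 'json'
require 'zlib'
require 'base64'

# A repetitive JSON payload, similar in spirit to an oversized job argument list.
payload = JSON.generate('class' => 'SomeWorker', 'args' => [['item'] * 10_000])

compressed = Zlib::Deflate.deflate(payload)
encoded    = Base64.strict_encode64(compressed) # Redis-safe string (adds ~33% overhead)

puts "original:   #{payload.bytesize} bytes"
puts "compressed: #{encoded.bytesize} bytes"

# Restoring on the server side:
restored = Zlib::Inflate.inflate(Base64.strict_decode64(encoded))
restored == payload # => true
```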
This approach has some advantages:
- Zlib is built-in. We don't need to update the configuration or introduce more complexity to the system
- The compression ratio is good enough [need to update the data for this] that the payload size is no longer a problem
- Easier to debug, at least easier than the object storage approach
And of course some disadvantages:
- Compression is a CPU-intensive operation. When adding the compression layer to the Redis clients (API, Web, Git, etc.), we should be conscious of the compression frequency, or the overall performance of those fleets will be affected. Picking the right limit is the key here.
Conclusion
The compression approach seems to be the best choice we have at the moment. Following this approach, we will:
- Implement compression mode in Sidekiq payload size limiter. This is done in gitlab-org/gitlab!61667 (merged).
- Compress jobs whose payloads are greater than 100KB (reasoning in #1054 (comment 567044321) and #1054 (comment 568129605)).
- Finally, enforce the job hard limit (presumably 5MB after compression). Oversized jobs will be rejected afterward (see the sketch after this list).
- The rollout progress is tracked in #1085 (closed).
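Putting the plan together, a condensed sketch of what the client-side behavior could look like. The middleware and error names are illustrative, not the exact code from !61667; only the 100KB/5MB thresholds come from the plan above:

```ruby
require 'base64'
require 'json'
require 'zlib'

class CompressClientMiddleware
  ExceedLimitError = Class.new(StandardError)

  COMPRESS_THRESHOLD = 100 * 1024       # compress payloads above 100KB
  HARD_LIMIT         = 5 * 1024 * 1024  # reject jobs above 5MB after compression

  def call(worker_class, job, queue, redis_pool)
    args = JSON.generate(job['args'])

    if args.bytesize > COMPRESS_THRESHOLD
      compressed = Zlib::Deflate.deflate(args)

      # Jobs that are still oversized after compression get rejected outright.
      if compressed.bytesize > HARD_LIMIT
        raise ExceedLimitError,
              "#{worker_class} payload exceeds #{HARD_LIMIT} bytes after compression"
      end

      job['args'] = [Base64.strict_encode64(compressed)]
      job['compressed'] = true # tells the server middleware to inflate
    end

    yield
  end
end
```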