Improve `Puma` request handling in multi-threaded mode
## Problem
Puma uses round-robin to schedule requests across a fleet of workers and threads.

This is described in more detail here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247859173
## Capacity of Puma
The capacity of the Puma web server is defined by `workers * threads`. Each (worker, thread) pair forms a slot that can accept a new request, and each slot can accept a new request at random (effectively round-robin).
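As a trivial sketch of the capacity definition above (the method name is ours, not Puma's):

```ruby
# Effective capacity: every worker runs its own pool of threads,
# and each (worker, thread) pair is one slot that can hold a request.
def puma_capacity(workers:, threads:)
  workers * threads
end

puma_capacity(workers: 2, threads: 2)  # => 4
```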
Now, what happens if we have 2 requests waiting to be processed, two workers, and two threads?

1. Our total capacity is 4 (W2 * T2).
1. Each request can be assigned at random to any worker, even a worker that is currently processing a request, as we do not control that.
1. It means that in the ideal scenario, for optimal performance the 2 requests should be assigned to two separate workers.
1. Puma does not implement any mechanism for 3.; rather, it is first accepting, first wins. It means it is plausible that the two requests will be assigned in a sub-optimal way: to a single worker, but multiple threads.
1. If the two requests are being processed by the same worker, they share CPU time, which means that their processing time is increased by the noisy neighbour due to the Ruby MRI GVL.
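A rough Monte Carlo sketch of the problem (illustrative only, this is not Puma internals): with W2 * T2 = 4 slots and 2 simultaneous requests landing on random free slots, how often do both requests end up on the same worker and contend for the GVL?

```ruby
WORKERS = 2
THREADS = 2

# Every (worker, thread) pair is one slot that can accept a request.
SLOTS = (0...WORKERS).flat_map { |w| (0...THREADS).map { |t| [w, t] } }

trials = 20_000
same_worker = trials.times.count do
  a, b = SLOTS.sample(2)  # two requests grab two distinct free slots
  a[0] == b[0]            # same worker => they share CPU time under the GVL
end

frac = same_worker.to_f / trials
puts format("both requests on one worker: %.1f%% of the time", frac * 100)
```

Under this model the sub-optimal assignment happens roughly a third of the time, even though half the capacity is idle.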
Interestingly, the same latency impact is present in Sidekiq; we just don't see it, as we do not care as much about the real-time aspect of background processing. However, the scheduling changes could improve Sidekiq performance as well, if we targeted Sidekiq too.
## Ideal scheduling algorithm in a multi-threaded scenario
In an ideal scenario we would always want to use our capacity efficiently:

- Assign to free workers first, as they offer the best latency,
- Assign to workers with the least amount of other requests being processed, as they are the least busy,
- Assign to workers that are close to finishing their current requests, as they will have free capacity in the future (this is hard to implement, as it requires a quite complex heuristic to estimate request completion time).
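The first two points above can be sketched as a least-loaded-first picker (a hypothetical helper, not code from Puma; a real implementation would need this state shared across worker processes):

```ruby
# Pick the worker with the fewest in-flight requests. Idle workers
# (count 0) naturally win first, then the least busy ones.
def pick_worker(in_flight)
  # in_flight: array of request counts, one entry per worker
  in_flight.each_index.min_by { |w| in_flight[w] }
end

loads = [2, 0, 1]   # worker 1 is idle
pick_worker(loads)  # => 1
```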
Currently, Puma does none of this. With round-robin we pay around a 20% performance penalty due to sub-optimal request assignment. It is expected that the closer we get to 100% CPU usage, the less of a problem Puma's lack of scheduling becomes, as the node is saturated and threads are balanced by the kernel anyway.
## Proposal
We could make Puma inject a very small latency before accepting a new request if the worker is already processing requests. This would allow other, idle workers to accept from the socket much faster than a worker that is already handling traffic.
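The idea can be sketched as follows (`accept_with_backoff` and `BUSY_ACCEPT_DELAY` are names we made up for illustration; this is not the actual patch, which hooks into Puma's reactor/accept loop):

```ruby
require "socket"

BUSY_ACCEPT_DELAY = 0.001 # seconds (~1ms)

# A worker that is already processing requests waits briefly before
# calling accept on the shared listening socket, giving idle workers
# a head start to win the accept race.
def accept_with_backoff(server, busy)
  sleep(BUSY_ACCEPT_DELAY) if busy  # idle workers skip straight to accept
  server.accept
end
```

Since all forked workers accept from the same listening socket, the tiny delay biases the kernel's wakeup toward workers that skipped the sleep.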
## Proof of concept
https://github.com/ayufan-research/puma/commit/e0699024e699e7733c89c2dd7fc06ef942a8696f
It seems that injecting 1ms of latency in a busy worker when it tries to accept a new connection allows another worker to accept the connection first. This makes the less busy workers accept connections first and better utilizes multiple workers instead of multiple threads. It also seems to cause no noticeable impact on the regular processing of requests.