Zooming out to a 7-day view, it is clearer that the unicorn worker saturation is indeed correlated with single core CPU saturation:
That correlation wasn't at all clear on a 1-hour timescale.
It seems like we should increase unicorn concurrency.
Edit: I was labouring under the misconception that single_node_cpu referred to max single-threaded CPU usage, e.g. pegged cores due to unicorn workers. This isn't the case - it shows max CPU saturation per node.
We currently run 38 x 16-core web machines. Total CPU saturation lags behind single-core saturation here:
Looking at these stats, I'd be tempted to scale web unicorn workers per node to 40, but @jarv might want to comment since he reduced that concurrency a few months ago.
Edit: I was labouring under the misconception that single_node_cpu referred to max single-threaded CPU usage, e.g. pegged cores due to unicorn workers. This isn't the case - it shows max CPU saturation per node.
Looking at these stats, I'd be tempted to scale web unicorn workers per node to 40, but @jarv might want to comment since he reduced that concurrency a few months ago.
The main reason we reduced the worker count previously was saturation during the unicorn HUP, which doubled the number of worker processes while we were still taking traffic. The new grace-period logic, which takes nodes out of the load balancer on a HUP, probably means this is no longer a concern. Instead of going back up to 40, maybe we try going to 35 first?
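Concretely, the knob we're talking about is Unicorn's worker_processes setting. A minimal sketch of what the change looks like in a plain Unicorn config (in production this value is driven by configuration management, so treat the snippet as illustrative only):

```ruby
# config/unicorn.rb -- illustrative only; in production this value comes from
# configuration management rather than a hand-edited file.
worker_processes 35   # was 30; the proposal above is 40, stepping to 35 first

# During a HUP, old and new workers briefly run side by side, so peak process
# count per node is roughly 2 x worker_processes.
preload_app true
timeout 60
```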
As I've edited above, I've realised that single_node_cpu is the max CPU saturation of any node in the fleet, and doesn't refer to the core-pegging saturation metric used to track single threaded processes. I'll re-evaluate whether or not the proposed increase in concurrency will cause CPU saturation.
Comparing the non-aggregated unicorn worker saturation to CPU saturation charts for the last 24h, we see that while they are correlated, worker saturation spikes don't appear to coincide with CPU saturation spikes per web worker:
The last 24h of service-level saturation for CPU, workers, and highest per-node CPU saturation shows that while these are all correlated, 100% worker saturation spikes don't appear to cause total saturation of CPU:
With all that in mind I think we can proceed with increasing concurrency.
Also, @ahmadsherif on Slack pointed out that we should check pgbouncer saturation before proceeding. No resource there appears saturated, and we should be able to tolerate the 16% increase in clients.
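For reference, the ~16% figure is just the per-node worker increase from 30 to 35 applied across the fleet, assuming roughly one pgbouncer client per unicorn worker:

```ruby
# Back-of-envelope for the ~16% client increase (assumes ~1 pgbouncer client
# per unicorn worker; node and worker counts are from the comments above).
nodes        = 38
old_clients  = 30 * nodes   # => 1140
new_clients  = 35 * nodes   # => 1330
increase_pct = (new_clients - old_clients) * 100.0 / old_clients
puts increase_pct.round(1)  # => 16.7
```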
To understand this behavior, we need a way to detect or infer what the worker was doing when it spent so much time on-CPU.
Does it affect a single worker or are multiple worker processes concurrently in a similar state?
Is it a side effect of a short-term change in the workload (e.g. a change in frequency of certain request types or request payloads)?
Is it triggered by an event (e.g. deploy, GC pause, CI job with many callbacks, stalled dependency)?
In my mental model of the life of a request, I expect the Rails worker to spend most of its time waiting on service calls it makes to downstream internal services (e.g. Postgres, Gitaly, Redis, etc.). And those service calls typically complete on a time scale of 1-100 milliseconds but occasionally are much slower (1000-10000 milliseconds).
But I expect some requests don't fit that mental model -- some request types may be satisfied with no calls to slow or contended services. Such requests would spend a larger proportion of their time on-CPU in the Rails app, rather than spending a high proportion of time blocked waiting on responses from internal services.
From that perspective, here are some potential patterns that could cause spikes in this metric -- i.e. at least one Unicorn worker process stays busy on-CPU nearly continuously on a time scale at least as long as the metric's sampling interval (presumably seconds):
Measurement error:
Though unlikely, this possibility always belongs in the list.
How is this metric (single_node_unicorn_workers) measured? Is it sampling each process's on-CPU time counter at some polling interval and then calculating the ratio of CPU time to wall clock time for that entire polling interval? And then filtering that list of surveyed processes to only include Unicorn workers? And reporting the highest ratio among those matching processes as the metric value for that polling interval?
If the above assumptions are all true, that would mean the duration of single-process CPU saturation must be on the same time scale as the polling interval length.
In contrast, if the measurement is based on a 1-second sample drawn each N-second polling interval (a technique that requires no state to be persisted by the monitoring agent), then the time scale of the CPU saturation would be just 1 second (or whatever the monitoring agent's sampling duration is).
Either way, our metric makes no claim that the same process is CPU-saturated from one polling interval to the next. It could be a different process each time.
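To make that assumption concrete, here is the kind of interval-based counter sampling I have in mind -- purely a sketch, not a claim about how the real exporter works (the pgrep pattern and the 15-second interval are guesses):

```ruby
# Sketch of sampling each worker's cumulative CPU counter once per polling
# interval and reporting the worst per-process CPU-time / wall-clock ratio.
require 'etc'

HZ       = Etc.sysconf(Etc::SC_CLK_TCK)   # clock ticks per second
INTERVAL = 15                             # hypothetical polling interval, seconds

def cpu_ticks(pid)
  stat   = File.read("/proc/#{pid}/stat")
  fields = stat[(stat.rindex(')') + 2)..-1].split  # skip pid and (comm)
  fields[11].to_i + fields[12].to_i                # utime + stime
end

pids   = `pgrep -f 'unicorn worker'`.split.map(&:to_i)
before = pids.map { |pid| [pid, cpu_ticks(pid)] }.to_h
sleep INTERVAL
ratios = pids.map { |pid| (cpu_ticks(pid) - before[pid]).to_f / HZ / INTERVAL }

puts "max per-process CPU ratio this interval: #{ratios.max.round(2)}"
```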
Why does it matter how long an individual Unicorn worker process must remain CPU-saturated to influence this metric? It helps us prune hypotheses. It tells us the minimum time scale of saturation that any explanatory hypothesis must meet or exceed.
Garbage collection pauses:
Depending on the answers to the above questions about what our metric is measuring, it may be implausible to explain this with GC pauses (either one long pause or many short, frequent pauses).
Ruby must occasionally do GC, and the mark phase should involve pausing application threads.
I think we may have configured GC to run between requests, to reduce the likelihood of a GC pause happening during a request. But of course it would still drive up the CPU-usage metric even if it happens between requests.
Since GC activity is always CPU-intensive, it should saturate a CPU while it's ongoing. We'd need to compare the distribution of GC durations to the time scale of this single-CPU saturation metric to judge the likelihood of this being a contributing factor. And also check the frequency of GC events to see if we have a cluster of pauses (i.e. GC pauses that happen often because they cannot reclaim enough heap space to satisfy demand).
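If we wanted to eyeball that distribution on a single process, Ruby's built-in GC::Profiler is one way to do it (sketch only, and subject to the caveat in the next paragraph):

```ruby
# Rough way to inspect GC pause durations from a console on one worker.
GC::Profiler.enable

sleep 60   # ...or let the process do real work for a while...

pauses_ms = GC::Profiler.raw_data.map { |gc| gc[:GC_TIME] * 1000.0 }
puts "GC events:     #{pauses_ms.size}"
puts "longest pause: #{pauses_ms.max.round(2)} ms" unless pauses_ms.empty?
puts "total GC time: #{(GC::Profiler.total_time * 1000.0).round(2)} ms"

GC::Profiler.clear
GC::Profiler.disable
```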
Note: We learned of a design flaw in the way Ruby exposes data about GC events that makes it hard for metrics frameworks to reliably collect data on all GC events. There is no cumulative counter for GC duration; durations must be summed from the GC event log. That log is of limited size and is prone to being wiped after being read. So in our case, having multiple monitoring frameworks that try to read that GC event log meant that none of them were getting the whole picture. So be skeptical about any GC-related data.
Increased frequency or concentration of fast, CPU-intensive requests:
In this scenario, a burst of many short requests requiring relatively few slow service calls keeps a worker busy.
The premise here is that these requests may happen all the time but would normally be spread thinly across the fleet, such that no one Unicorn worker was likely to handle a lot of them. But an increase in the frequency of this hypothetical request type could cause some Unicorn workers to handle enough of them to fill one or more of our metric's sampling intervals. It wouldn't have to be the same Unicorn worker handling all these requests; rather, there would have to be enough requests to keep at least one Unicorn worker busy for at least the metric's sampling interval to drive the single-CPU-usage metric to 100%.
We know that our system often receives synchronized bursts of requests (e.g. from cronjobs or automation events with callbacks). Some of these may produce concentrated bursts of requests of the same type.
This pattern should affect multiple Unicorn workers, unless we suspect some kind of scheduling bias that concentrates them onto a single worker. At the moment I can't think of a plausible mechanism for such bias, so I tentatively assume there isn't one; I therefore expect multiple Unicorn workers (at least within a given host) to share this pattern of increased CPU usage if the cause is an increase in the frequency of a particular request type.
There would have to be enough requests to keep at least one worker busy for at least one sampling interval for our saturation metric. That presumably means all workers would be kept busy for that duration, either waiting on internal service calls to respond or doing on-CPU computation.
Request type having a long CPU-intensive phase of data-dependent duration:
In this scenario, a Unicorn worker spends several seconds handling a single long-running request with a lot of compute time spent in Rails.
Only slow requests could fit this pattern. Such a hypothetical request may also spend a lot of time in remote service calls (e.g. fetching a lot of data and then doing expensive computation on it), but to produce this symptom it would have to have a phase where it spends a lot of time in Rails itself, not just interleaving Rails compute time with blocking service calls.
Uncorking a stalled downstream:
In this scenario, a fast downstream service (e.g. Redis) stalls for some reason (e.g. network hiccup, VM migration, memory access latency spike, or getting stuck sending a large response to another client and delaying all other pending client requests).
Consequently, incoming requests should fill all idle Unicorn workers and then start accumulating in the TCP listener backlog. Since we're assuming that most requests require access to this stalled shared downstream resource, most Unicorn workers should end up stalled while handling such a request.
When that stalled dependency starts responding again, all pending work becomes unblocked.
This creates a microburst of activity until the request backlog is drained.
Instead of having each Unicorn worker be idle for some percentage of each wall clock second, all workers will be busy handling requests. Even though many of those requests spend significant time making blocking downstream service calls, some request types spend no time in slow service calls. This may drive single-CPU usage towards 100% until the request backlog drains, but only if the backlog contains enough of these hypothetical requests that spend very little time in slow internal service calls.
So this scenario is a specific flavor of the generic "burst of many fast requests" scenario.
Firstly, I merged my MR above right before seeing this message come in - I'm sorry! I've run the manual pipeline step and this change will roll out in production over the next 40 mins. It might make a valuable experiment, at the very least! Proper reply incoming.
we need a way to detect or infer what the worker was doing when it spent so much time on-CPU
I'm actually arguing that web nodes are not spending a lot of time in CPU, and that unicorn workers themselves are the bottleneck. I think we have underprovisioned our "queue count" to service requests relative to other resources.
In my mental model of the life of a request, I expect the Rails worker to spend most of its time waiting on service calls it makes to downstream internal services (e.g. Postgres, Gitaly, Redis, etc.). And those service calls typically complete on a time scale of 1-100 milliseconds but occasionally are much slower (1000-10000 milliseconds).
That was my mental model too - I can't speak to the timescales themselves right now though.
How is this metric (single_node_unicorn_workers) measured?
unicorn worker resource saturation is defined as the ratio of connections to workers per node. The single_node_ part means that the value reported for the fleet is the max value on any node - i.e. the worst-case node.
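In other words (toy numbers, just to illustrate the definition):

```ruby
# Per-node saturation is connections / workers; the single_node_ value is the
# max of those ratios across the fleet. Node names and numbers are invented.
per_node = {
  'web-01' => { connections: 27, workers: 30 },
  'web-02' => { connections: 30, workers: 30 },   # fully saturated node
  'web-03' => { connections: 11, workers: 30 },
}

saturation        = per_node.transform_values { |n| n[:connections].fdiv(n[:workers]) }
single_node_value = saturation.values.max          # => 1.0, the worst-case node

puts "single_node unicorn worker saturation: #{(single_node_value * 100).round}%"
```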
I've not commented on the rest yet, as I suspect that we're not CPU bound. Hopefully we'll gather some good data about the impact of this change tomorrow during peak.
My mistake! I thought the metric you cited was CPU-related. I should have checked that assumption before commenting.
Most of the above was contemplating ways for a Unicorn worker to become CPU-bound nearly continuously for at least several seconds. That's hard to do when most requests spend a significant proportion of their time waiting synchronously on blocking service calls to complete. But since the saturation metric is really talking about the proportion of non-idle Unicorn workers, it doesn't matter if the Unicorn worker is on-CPU or waiting on an internal request. That's a lot easier to explore (e.g. in which Rails controllers was most of the time spent during the spike versus before/after the spike) and plausibly explain with mundane causes (e.g. request rate increase, db connection pool contention, etc.).
Increasing unicorn worker concurrency to 35 didn't totally fix the problem; we still see web latency spikes correlated with periods of unicorn worker saturation. It's too early to judge whether matters improved at all and to make a recommendation for further change, as we currently have an ongoing incident: production#1419 (closed).
Over the last 2 days (as a sample), we still see fairly regular unicorn worker saturation that correlates with web (main fleet) latency spikes, and these are not correlated with CPU saturation:
In fact, there is still a lot of CPU (and memory) headroom in the web fleet.
Even though last week I didn't correlate web latency spikes with fleet-aggregated Gitaly latency, I think it's very likely that our unicorn saturation is due to slow gitaly requests blocking workers when gitaly shards become saturated. If we scale up the unicorn workers, it could allow more requests to be served, which could improve overall latency by giving fast requests capacity to run while slow requests are being processed. This risks increasing pressure on our Gitaly fleet as a whole.
We recently scaled up the web canary fleet, and this appeared to help (production#1456 (comment 258627293)). Many cny reqs go to file-02, which hosts gitlab-org/gitlab, and as such is very frequently saturated. The problems facing cny were more extreme than the problems facing the main fleet here.
For this reason I'd like to scale up the unicorn worker count per node again. This should alleviate the bottleneck in our web fleet as described above, while increasing utilisation of CPU and memory that we already pay for. Last time we went from 30 to 35. We could try 40 as a starting point?
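For context, here's the per-node worker:core ratio at each step, given the 16-core web nodes described earlier in this issue:

```ruby
# worker:core ratio per node for each concurrency level discussed.
cores = 16
[30, 35, 40].each do |workers|
  printf("%d workers/node -> worker:core ratio %.2f\n", workers, workers.fdiv(cores))
end
# 30 workers/node -> worker:core ratio 1.88
# 35 workers/node -> worker:core ratio 2.19
# 40 workers/node -> worker:core ratio 2.50
```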
Closing after slack discussion with @mwasilewski-gitlab and @andrewn about this. In summary, since we see web main node CPU spiking at 70-80% during a HUP, we shouldn't increase the worker:core ratio any more. The path of least resistance to improving web worker saturation is to add more web nodes.
In hindsight, it's obvious that increasing web concurrency would not fix the issue, given how little difference it made going from 30 to 35 workers per node.