Puma tuning - Reduce increased latency introduced by Puma
Looking through the graphs, I see higher throughput from Puma which I would expect given the additional worker threads available (14*4 > 30), but I'm not sure if any latency tests were done?
Canary Testing
Before we lose the data over the retention window in ELK, I thought I would do a quick comparison:
Queries are for 2019-10-27 00:00:00.000 to 2019-10-29 23:59:59.999. During this period, I understand that puma was enabled on web-01-cny and unicorn was enabled on web-02-cny.
Comparison 1: Rails latencies for all traffic excluding healthchecks
Conclusion 2: roughly the same, except at the p99, where puma's latency was much better (half!) than unicorn's. This is likely because unicorn reached saturation for longer periods.
Comparison 3: Workhorse request durations for rails requests
This uses a narrower date range as workhorse has a shorter retention schedule.
| Percentile | web-cny-01-sv-gprd | web-cny-02-sv-gprd | Difference |
| --- | --- | --- | --- |
| p50 | 6 | 5 | 20.00% slower on web-01-cny |
| p95 | 856.622 | 683.831 | 25.27% slower on web-01-cny |
| p99 | 5,745.05 | 4,658.77 | 23.32% slower on web-01-cny |
Conclusion 3: across the board web-cny-01-sv-gprd is ~20% slower when comparing from Workhorse.
It's possible that other factors are affecting these results (the hosts might not be equivalent, load balancing, or other things); however, with a 20% difference in Workhorse latencies I think we should definitely slow down and consider what is causing it.
Conclusion
Considering that on GitLab.com we are running on a fleet that is not memory constrained, and that our primary metrics are error rate and latency, I think we should consider rolling back so that half the canary fleet runs puma and half runs unicorn, allowing a fair comparison while we tune puma further. After further discussion we have decided to enable puma on select web/api/git nodes production#1303 (closed)
I'm fairly sure that we can resolve this problem, but while we do, we should continue to split traffic between the two server types, unicorn and puma, so that we can keep making a fair comparison while we tune puma.
Production node testing
Timeline
Mon Nov 4 10:59:44 - Puma enabled on web-{01,02}, git-{01,02}, api-{01,02} in production with W14/T4
Mon Nov 4 12:09:25 - Puma reconfigured on web-02 for W18/T2
Mon Nov 4 12:54:17 - Puma reconfigured on web-02 for W25/T2
Mon Nov 4 14:16:58 - Puma reconfigured on canary plus single main stage nodes for W30/T1 with db_pool: 1 (the W/T notation is sketched as a Puma config after this timeline)
Wed Nov 6 14:39:00 - Puma disabled on all nodes due to db issues, not related to puma but we saw worse performance production#1327 (closed)
Thu Nov 14 09:33:22 - Puma reconfigured on web-cny-01 for W16/T2
Thu Nov 14 09:37:39 - Puma reconfigured on web-cny-02 for W8/T4
Fri Nov 15 10:36:39 - Puma reconfigured on web-cny-02 for W16/T4
Fri Nov 15 15:50:05 - Puma reconfigured on web-01 for W16/T2
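For reference, the W/T notation in the timeline above maps roughly to Puma's standard configuration DSL as sketched below. This is only an illustration of the knobs being tuned; the real fleet settings were managed via chef/omnibus attributes, and the db_pool value lives in the Rails database configuration, not in Puma itself.

```ruby
# config/puma.rb -- illustrative sketch only, showing what e.g. W16/T2 means.

workers 16     # W16: forked worker processes (one MRI GVL each)
threads 2, 2   # T2: min and max threads per worker

# Effective concurrency = workers * max threads = 32 request slots.
preload_app!   # fork workers from a preloaded app to share memory via CoW
```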
I think these results are great; what remains now is fine-tuning, together with fixing the CPU-intensive endpoints (those with cpu_s > 100ms).
Let's consider the following (a quick capacity calculation is sketched after the list):
Our canary has 16 CPUs,
Previously we were running 30 Unicorn workers,
Now, we run 14 Puma workers, each with 4 threads, which means the effective capacity is 56 concurrent requests,
It means that Puma is currently oversaturating the node (if all traffic is CPU-intensive) by almost 4x,
Also, due to the Ruby MRI GIL, a single worker might end up processing a mixed workload of very CPU-intensive and very light requests, which results in elevated processing times,
The queueing latencies are, I would say, within limits; in general we should see roughly the same queueing latencies as for Unicorn, with the exception of native extensions (native code is not interruptible/interchangeable with other threads),
The processing time is affected by oversubscription and a noisy environment: a single thread can be handed up to 4x the work, which in the most pessimistic case means 4x slower processing; with Unicorn we would likely see something between 1x-2x (depending on the workload across other processes, as the system scheduler balances them).
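A back-of-the-envelope version of the capacity math above, using only the numbers already quoted in the list:

```ruby
# Capacity arithmetic for the canary node (values from the list above).
cpus            = 16
unicorn_workers = 30                            # previous configuration
puma_workers    = 14
puma_threads    = 4

puma_slots       = puma_workers * puma_threads  # => 56 concurrent request slots
oversubscription = puma_slots.to_f / cpus       # => 3.5, i.e. "almost 4x" if all
                                                #    traffic were CPU-bound
puts "unicorn slots=#{unicorn_workers} puma slots=#{puma_slots} " \
     "oversubscription=#{oversubscription}x"
```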
I think that one of the next steps could be to discover, annotate, and likely balance our web servers based on the type of traffic (actually super hard to do), similarly to how we do it for Sidekiq: gitlab-org/gitlab!18066 (merged). If we could keep low-latency/non-intensive calls together, we would achieve much better resource utilisation and a much lower processing cost.
I think that we might prefer to change Puma to W16/T2 or W18/T2 to get pretty much** the same results as Unicorn, but with a severe reduction in resource usage.
**: up to the pessimistic case of request scheduling, with two very expensive requests being processed by a single worker
Also, I think that looking at requests taking <300ms for this comparison might simply not make sense, as this is not a user-visible change. We are trading much more concurrency/resource utilisation for a slightly higher duration, and an increase that stays under ~200ms is mostly not visible to users.
We need to find a better baseline to measure user impact. For example, I think we should be looking for data points showing that something that took 200ms on Unicorn takes 400ms on Puma.
John Jarvis changed title from Puma tuning on Canary - Reduce increased latency introduced by Puma to Puma tuning - Reduce increased latency introduced by Puma
We seem to hover at 35-45% CPU usage (regardless of whether we use Puma or Unicorn),
We could easily run more workers, or increase capacity with the number of threads, and thus reduce either the number of CPUs or the number of machines needed.
I guess we should play with and measure different configurations:
W14/T4: significantly increases duration due to multiple threads,
W18/T2: does increase duration by about 15-20% due to multiple threads,
W20/W25/T2?
W30/T2: should give us a good improvement over Unicorn.
It seems that we should optimise our nodes for 75-85% of constant CPU usage to maximise the use of resources, and it seems that we can achieve that with a higher number of workers. With the blackout period we no longer need the severe over-provisioning that was previously required to accept more connections for short periods of time.
For a bit more historical context: we used to have more unicorn workers, but because the nodes saturated whenever we issued a gitlab-ctl hup unicorn while taking live traffic, we lowered the count, leaving our nodes over-provisioned. Now, with the readiness check and blackout period, this problem goes away completely, so I am pretty sure we can safely increase the worker count for unicorn, but it probably makes more sense to focus on puma.
I'd like to suggest that we simplify the thought process around utilization. Based on the idea that Ruby MRI can only utilize one core per worker, I suggest we use a simple formula of CPUs - 1. This allows for reasonably good utilization, with some spare capacity left over for other system processes.
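A minimal sketch of that rule of thumb, assuming the worker count is derived from the node's core count at configuration time:

```ruby
require "etc"

# "CPUs - 1" rule of thumb: one MRI worker per core, leaving one core free
# for workhorse, exporters and other system processes.
def suggested_worker_count
  [Etc.nprocessors - 1, 1].max
end

puts suggested_worker_count # => 15 on a 16-CPU node
```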
Note that we definitely did have problems with CPU saturation due to too many workers, but it was only seen when we issued a gitlab-ctl hup unicorn, which saturated the CPU after the fork (30x2), in combination with the old workers staying around for a long time due to our very long worker timeout. I think we are going to be in a much better situation now that we have the readiness check with the blackout period, since nodes will be taken out of the load balancer when we issue a hup with unicorn, and this won't be as much of an issue with puma.
Yes, for Puma we might want to ensure that CPU usage with that number of workers stays around 80%-90%.
So, it really depends. So far with W18 we have seen only 30-40% CPU usage, but the nodes were not saturated with requests at all. And we were seeing latency increases anyway (the culprit likely being metrics?).
Using a grid allows you to compare latencies across the entire timeframe, without having to visually compare over the time range
Using workhorse gives us a better idea of end-to-end latency
As things stand at the moment, web-01 is slowest, web-02 is faster, web-03 is the fastest.
@jarv one suggestion: I think that the total concurrency (ie, workers*threads) between the two puma hosts should be equal. This will give us a fairer idea of what happens when we hit saturation.
If there is a mismatch it will skew requests that queue on one of the hosts while still having capacity on the other.
We have 1k requests per minute; the mean time is 400ms, p95 is 1s. Anything that escapes to native code is not interruptible and cannot run concurrently, and this particular endpoint takes a significant amount of time to process.
We do have /metrics on a separate endpoint already implemented; it is not yet fully tested, but I would assume that we might prefer to switch to that: gitlab-org/gitlab#30037 (closed).
Maybe we should open an issue to start scraping /metrics on staging instead of /-/metrics?
@ayufan I'm fairly certain that I wrote that down when I proposed gitlab-org/gitlab#30201 (closed), but looking back at it, I didn't. At the very least, I was thinking it.
I definitely think this makes sense. Mostly for the same reasons as moving the health check makes sense.
We just need to be careful as we do it, of course.
Unfortunately I don't have the time to read all of this discussion but I came here to leave a note that since we switched to Puma on Canary, Canary is basically unusable for me and I had to disable it.
Performance bar shows the following for GitLab.com:
This is a more complex approach, but it could allow us to slash queueing latency: gitlab-org/gitlab#35577, and to selectively affect duration based on the content being processed, by assigning thread priority based on the workload being run.
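A rough illustration of what "assign thread priority based on workload" could look like in MRI. Thread#priority is only a hint to the scheduler, and the endpoint classification here is entirely hypothetical:

```ruby
# Hypothetical sketch: lower the priority of threads serving known
# CPU-heavy endpoints so light requests on sibling threads run first.
CPU_HEAVY_PATHS = ["/-/chaos/cpu_spin"].freeze # made-up classification

def with_workload_priority(request_path)
  Thread.current.priority = CPU_HEAVY_PATHS.include?(request_path) ? -1 : 0
  yield
ensure
  Thread.current.priority = 0 # restore the default for the next request
end
```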
@ayufan what are your thoughts for the rest of this week and puma configuration, should we start by bumping the thread count up on a canary VM with 30W?
@andrewn During the incident related to the database last week we noticed that the puma nodes were having more difficulty than the other nodes so we decided to switch them back to unicorn production#1327 (comment 240898245)
I was hoping to have this started yesterday but I am waiting on approvals to update the db_pool settings to use the new defaults, I believe it will happen today.
The results from testing production#1362 (closed) look much better. To summarize what has been done so far:
all production non-canary nodes are running unicorn
Thu Nov 14 09:33:22 - Puma reconfigured on web-cny-01 for W16/T2
Thu Nov 14 09:37:39 - Puma reconfigured on web-cny-02 for W8/T4
Fri Nov 15 10:36:39 - Puma reconfigured on web-cny-02 for W16/T4
@andrewn @ayufan next we should move forward with a single node in the main stage again. If we decide to move forward with W16/T4, will it be concerning with db_pool set to 4? This will drastically increase the number of connections, since we are currently running W30 with db_pool set to 1.
I wonder. As I said, our nodes are over-provisioned. Maybe we should run W16/T2 for now? It still gives significantly more headroom than Unicorn. The problem is that we have 16 CPUs, and we should keep Wn close to the number of CPUs.
We seem to hover at less than 50% CPU usage (30-40%), which means we could test running W8/T4 on another node, with the intent of eventually shrinking the number of CPUs to something like 8/12.
Now, keeping W16/T2 or W8/T4 gives us a similar number of connections, so I think it is fine to stay like that for now. Once we are confident, we could experiment with W16/T4.
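A rough sketch of the connection counts being compared above, assuming connections per node are approximately workers * db_pool:

```ruby
# Rough connection arithmetic for the configurations discussed above.
configs = {
  "W30 Unicorn, db_pool 1" => 30 * 1, # => 30
  "W16/T2,      db_pool 2" => 16 * 2, # => 32 (similar)
  "W8/T4,       db_pool 4" =>  8 * 4, # => 32 (similar)
  "W16/T4,      db_pool 4" => 16 * 4, # => 64 (the "drastic" increase)
}
configs.each { |name, conns| puts "#{name}: ~#{conns} DB connections" }
```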
To summarize the latest changes after fixing metrics endpoint.
TL;DR It does not bring a substantial improvement; we continue to see performance degradation when running Puma.
Description
The web-01 is significantly slower, as another process besides Puma is running on it that consumes a significant amount of resources and impacts system behaviour; it should not be used for comparison,
The web-02 is statistically relevant: it processes the same amount of requests as web-03 and can be used for comparing Unicorn and Puma under production load,
The web-02 and web-03 CPU usage is roughly similar,
The web-02 and web-03 will be used for the comparison,
The web-02 is configured with W16/T2 running Puma,
The web-03 is configured with W30 running Unicorn,
We will not compare memory usage, as it was validated previously; only the performance of request processing will be evaluated,
Our nodes are severely underutilized; they use mostly around 30-40% of CPU capacity and very little memory (this does impact the timings that we see with Puma).
As expected, we see roughly the same amount of time spent on CPU-thread time. This should not change, as each thread needs to process roughly the same (on average) amount of data.
We see significantly worse DB processing duration, around ~40%. It is hard to pinpoint exactly why this is happening, likely because of interleaved multi-threading. I assume that we calculate DB duration by looking at a monotonic time measurement, which means the DB duration accounted here might include work being done by other threads, even if the response from the DB comes earlier.
Recommendation
We should continue running Puma on canary nodes with T2, as this is the setting that causes the minimal impact on processed requests,
We should continue testing Puma deployment on canary to build confidence in running Puma at scale and using our deployment scripts,
I don't see a strong reason for running Puma on web-01/02 right now, unless we want to continue testing Puma handling production volume (Canary processes 3x fewer requests than web-01/02/...), as it causes performance degradation for our users. It is unlikely that this change is noticeable to users (it adds around 20ms on average). I think it is up to the production team to decide whether they want to continue testing Puma at production volume, taking the increase in latency into account,
Overall it seems that Puma works properly with respect to request processing, the restart cycle, and connection pools. It seems that it is not yet ~performance-ready. However, we should continue monitoring for any additional discoveries related to running services.
I think we should continue running on Canary (single/dual node is enough) to build confidence in Puma that it works properly, but keeping in mind that it is not yet super performant.
I did validate a theory of why we see this performance deficiency. I am currently working on writing an exact description, with data, in an additional issue that confirms it.
> I think we should continue running on Canary (single/dual node is enough) to build confidence in Puma that it works properly, but keeping in mind that it is not yet super performant.
I will also see about resolving delivery#537 (closed) so we can increase the traffic volume on the canary fleet.
> I don't see a strong reason for running Puma on web-01/02 right now
Sounds good, I have gone ahead and switched these two hosts back to unicorn
@ayufan Regarding the DB timings: the database load balancer uses a mutex in a few places. It's possible that when using Puma this leads to degraded database performance, but I would be surprised if it has a serious impact. If the Mutex turns out to be too expensive, maybe we could replace it with a spinlock of sorts. For this we would likely have to use something like http://ruby-concurrency.github.io/concurrent-ruby/master/Concurrent/AtomicBoolean.html.
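A minimal sketch of the kind of spinlock Yorick mentions, built on concurrent-ruby's AtomicBoolean. This is illustrative only; whether it actually beats a Mutex under MRI would need measuring:

```ruby
require "concurrent"

# Tiny test-and-set spinlock sketch using Concurrent::AtomicBoolean.
# make_true returns true only when the value actually flips false -> true,
# which gives us a "try to acquire" primitive.
class SpinLock
  def initialize
    @locked = Concurrent::AtomicBoolean.new(false)
  end

  def synchronize
    Thread.pass until @locked.make_true # spin until we win the flip
    yield
  ensure
    @locked.make_false                  # release
  end
end
```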
@yorickpeterse When I did local tests it was not showing any performance degradation on that front; I was investigating our load-balancing code to understand whether any of it causes bottlenecks, and it does not seem so. You did a great job there; pretty much, I had nothing to improve :)
The interleaved concurrent processing is causing these metrics to be inaccurate. I would assume the same will be seen in Sidekiq when we run it there.
> I don't see a strong reason for running Puma on web-01/02 right now, unless we want to continue testing Puma handling production volume (Canary processes 3x fewer requests than web-01/02/...), as it causes performance degradation for our users.
I would suggest that we don't run in production until we are ready for some additional level of testing and limit Puma to staging and Canary workloads.
Yes, we decided to get back to running Unicorn on all web-XX, leaving Puma only to Canary, Staging and Dev.
The reasons for the increase in latencies introduced by Puma
I took a stab at understanding why Puma performs worse than Unicorn on lightly loaded servers, like web-01/02/....
Lightly loaded means that the servers:
Consume less than 50% of CPU,
Run a number of concurrent requests less than or equal to the number of workers.
Capacity of Puma
The capacity of the Puma web server is defined by workers * threads. Each of these forms a slot that can accept a new request, and each slot can accept a new request at random (effectively round-robin).
Now, what happens if we have 2 requests waiting to be processed, two workers, and two threads:
1. Our total capacity is 4 (W2 * T2),
2. Each request can be assigned at random to any worker, even a worker that is currently processing a request, as we do not control that,
3. It means that in the ideal scenario, if we have 2 requests, for optimal performance they should be assigned to two separate workers.
Puma does not implement any mechanism for 3.; rather, it is first to accept, first wins. It means it is plausible that the two requests will be assigned in a sub-optimal way: to a single worker, but multiple threads.
If the two requests are being processed by the same worker, they share CPU time, which means their processing time is increased by the noisy neighbour due to the Ruby MRI GVL.
Interestingly, the same latency impact is present in Sidekiq; we just don't see it, as we do not care about the real-time aspect of background processing that much. However, the scheduling changes could improve Sidekiq performance as well if we targeted Sidekiq too.
Canary
Why didn't we see this on Canary?
The answer is capacity offered vs capacity consumed. Considering that Production and Canary nodes are configured identically, and given that Canary simply runs 3x fewer requests, the probability of hitting a worker that is already processing a request is reduced significantly. With that traffic volume against the same offered capacity, we simply do not hit the sub-optimal assignment often enough to make it statistically significant (a toy probability sketch follows).
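A rough way to see the effect, as a toy simulation rather than measured data: under random assignment, the chance that two in-flight requests land on the same worker shrinks as the number of concurrent requests drops relative to the worker count. The in-flight numbers below are illustrative, not taken from production.

```ruby
# Toy Monte Carlo sketch: probability that at least two of `inflight`
# concurrent requests land on the same worker under random assignment.
def collision_probability(workers:, inflight:, trials: 100_000)
  hits = trials.times.count do
    Array.new(inflight) { rand(workers) }.uniq.size < inflight
  end
  hits.to_f / trials
end

# Same worker count, different load:
puts collision_probability(workers: 14, inflight: 4)  # canary-like load
puts collision_probability(workers: 14, inflight: 12) # production-like load
```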
Ideal scheduling algorithm in a multi-threaded scenario
In an ideal scenario we would always want to use our capacity efficiently (a small code sketch of the least-busy idea follows below):
Assign to free workers first, as they offer the best latency,
Assign to workers with the least amount of other requests being processed, as they are the least busy,
Assign to workers that are close to finishing the requests being processed, as they will have free capacity in the future (this is hard to implement, as it requires a quite complex heuristic to estimate request completion time).
Currently, Puma does none of that. For round-robin we pay around a 20% performance penalty due to sub-optimal request assignment.
It is expected that the closer we get to 100% CPU usage, the less of a problem Puma's lack of scheduling will be, as the node is saturated and threads are balanced by the kernel.
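To make the least-busy idea from the list above concrete, here is a small hypothetical sketch comparing random assignment with a least-busy heuristic. This is not Puma code; the Worker struct and functions are made up for illustration.

```ruby
# Hypothetical scheduler sketch: pick the worker with the fewest in-flight
# requests instead of letting any worker accept at random.
Worker = Struct.new(:id, :in_flight)

def assign_random(workers)
  workers.sample                # today: effectively random / first-to-accept
end

def assign_least_busy(workers)
  workers.min_by(&:in_flight)   # rules 1 and 2 above: idle/least-loaded first
end

workers = [Worker.new(1, 1), Worker.new(2, 0), Worker.new(3, 2)]
assign_random(workers)     # may pick the already-busy worker 3
assign_least_busy(workers) # always picks the idle worker 2
```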
Tests
I wanted to validate the above theory, so I ran Puma and Unicorn in different scenarios, calculating the saturation factor.
```bash
#!/bin/bash

run_ab() {
  result=$(ab -n "$1" -c "$2" http://localhost:3000/-/chaos/cpu_spin\?token\=secret\&duration_s\=1)
  time_taken=$(echo "$result" | grep "Time taken for tests:" | cut -d" " -f7)
  time_per_req=$(echo "$result" | grep "Time per request:" | grep "(mean)" | cut -d" " -f10)
  echo -e "$1\t$2\t$time_taken\t$time_per_req"
}

for i in 1 2 3 4 5 6 7 8; do
  run_ab $((i*20)) $i
  sleep 1
done
```
Idea for resolving that
Now, when I think about it, the reason is obvious :)
I was thinking that if we could somehow make Puma clever and aware of the other workers, we could delay socket acceptance for request processing on busy workers.
This should allow us to implement a mechanism that prefers non-busy workers first for request processing, but allows more concurrent capacity when the system requires it. That would bring Puma performance on par with Unicorn on lightly saturated nodes (as on GitLab.com), while reducing resource usage significantly.
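A hand-wavy sketch of the delayed-acceptance idea, purely illustrative and not how Puma's internals are actually structured:

```ruby
# Illustrative only: a busy worker backs off briefly before competing for the
# next connection, so idle workers tend to win the accept() race, while full
# capacity is still available once every worker is busy.
def accept_preferring_idle_workers(listener, busy_threads)
  sleep(0.005 * busy_threads) if busy_threads > 0 # back off in proportion to load
  listener.accept
end
```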
@ayufan nice finding! It looks like Puma 'round-robins' across all threads, without considering the worker level. In that case, I can imagine that in the worst case, more threads means more latency.
Suppose we configure Puma as WmTn (m workers, each worker with n threads). If we issue n concurrent requests, they may all be scheduled to T11..T1n (all belonging to W1). Now, if the system actually has more than n cores, only 1 core is used, so it is n times slower than the ideal case (run on n different CPU cores).
After these n requests finish, we issue another n concurrent requests; now the round-robin strategy will schedule them to T21..T2n (all belonging to W2).
Repeating the above scenario, in a WmTn configuration the performance can be n times slower in theory. I guess this is the worst case.
> It looks like Puma 'round-robins' across all threads, without considering the worker level. In that case, I can imagine that in the worst case, more threads means more latency.
It is not that they do it deliberately; this is just how it works. You simply accept another connection on the socket.
> Repeating the above scenario, in a WmTn configuration the performance can be n times slower in theory. I guess this is the worst case.
Yep, fully theoretically possible. Unlikely in a real scenario, but this is how it is implemented today.
On 2019/12/18 we had a production incident: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334, where our webcny stage was breaching SLO thresholds on latency and error rate. From the Puma logs we noticed OOM errors, so we drained the 2 cny Puma nodes, and that resolved the incident.
As of right now (2019/12/18 0700UTC), the 2 nodes (web-cny-01 and web-cny-02) are still in DRAIN state. I think we should figure out why we ran into OOMs and what settings we should look at before we re-enable Puma, and I will be happy to help with it (or any of the other oncalls if I am not oncall).
I'm not seeing any unusual amount of restart behavior from web-cny-01. There is a higher background churn rate, but it's not worse at any specific time. In fact, it was better yesterday than it was the previous day.
I spent some time doing some basic analysis today.
Analysis Technique
Using 24 hours of data (owing to ELK issues, I've been unable to extend this over longer periods). Once our Kibana problems are resolved, I would be happy to rerun this test over a longer period (cc @ansdval)
Measure workhorse latencies. Measuring from workhorse means that we include any time that the request spends queueing in unicorn or puma before being handled.
Comparing like-for-like paths on both servers. This approach means we can (roughly) compare similar workloads passing through unicorn and puma for a more accurate like-for-like comparison, and it rules out (some of) the possibility that one subset of nodes is receiving a favourable workload.
Initially the test compared web-01 and web-02 (puma) to web-08 and web-09 (not puma). On request from @jarv and @ayufan, revised to web-02 (puma) and web-09 (not puma). The same number of nodes was used on both sides to ensure that sample sizes were roughly equal.
This is a redacted copy, excluding private information, of the top 50 paths on GitLab.
How to interpret this table
The rows represent individual paths in the application
The columns are grouped into three sets. The first set is not-puma, the second set is puma, and the third set is how much slower puma is than unicorn. Red indicates worse performance, blue indicates better performance.
Each column group contains latencies for p50 (median), p95 (one in twenty worst requests) and p99 (one in one hundred worst requests). We also include the number of samples for each URL for puma and not-puma. They are reasonably well matched, although for unique URLs the numbers quickly drop down into the hundreds of samples, which is not really enough.
Walk me through the data
There are a lot of reasons to be concerned by these results, although there may be mitigating circumstances (for example, there is concern that web-02 is not a good host to pick, as it may have performance problems).
Multithreading comes with an expected cost to performance, but this cost is currently much higher than I had expected.
Let's look at a single example:
Pretty much the simplest operation in the application is the healthcheck at /-/readiness. This does very little but pass through some basic middlewares and then return. My (naive) expectation had been a ~5% to ~8% slowdown on an endpoint such as this; what we see is much higher. In unicorn, the requests have the following latency profile: p50 = 8ms, p95 = 12ms, p99 = 17ms. In puma, the profile is p50 = 9ms, p95 = 15ms, p99 = 110ms. This means that the 1-in-100 worst case puma request is more than 5 times slower than the 1-in-100 worst case unicorn request for the most basic request in the application.
This trend repeats on other endpoints, although unfortunately with only a 24 hour range, the sample size quickly becomes too small to use the p99 or p95. Having said that, the endpoints that do have enough data for reasonable p95s and p99s are thoroughly in favour of unicorn.
Once our Kibana problems are resolved and we can fetch more data for greater sample sizes, I will try again; alternatively, I'll switch the analysis over to BigQuery instead of Kibana.
Can we gather data from all nodes and compare them?
We are also going to tweak Puma settings tomorrow. At least in the past, when looking at the requests, this discrepancy was comparable, but we also looked at a slightly different data set. The missing ELK data makes it harder.
Let's understand why these values are all over the place for Unicorn as well as for Puma.
all nodes process roughly the same number of requests,
50th percentile of Puma is substantially better than the one for Unicorn,
95th percentile of Puma is substantially better than the one for Unicorn,
99th percentile of Puma is substantially better, except for the /-/readiness check; this is somewhat expected, as sleep introduces a delay that might be visible for very high-throughput traffic.