# Puma tuning: reduce the latency increase introduced by Puma
Breaking this out into its own issue from https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7455#note_239070865
Looking through the graphs, I see higher throughput from Puma, which I would expect given the additional worker threads available (14 workers × 4 threads = 56, versus 30 Unicorn workers), but I'm not sure whether any latency tests were done?
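For concreteness, the concurrency arithmetic behind that throughput expectation is sketched below, assuming the W14/T4 Puma shape from the timeline and a 30-worker Unicorn baseline (the Unicorn worker count is my reading of the thread, not something verified here):

```ruby
# Rough request-slot comparison, assuming Puma at W14/T4 vs Unicorn with 30 workers.
# Each Puma worker serves several requests concurrently via threads;
# each Unicorn worker serves exactly one request at a time.
puma_workers    = 14
puma_threads    = 4   # threads per worker
unicorn_workers = 30

puma_slots = puma_workers * puma_threads
puts "Puma: #{puma_slots} concurrent slots vs Unicorn: #{unicorn_workers}"  # 56 vs 30
```

More slots means more in-flight requests per node, which explains higher throughput but says nothing about per-request latency, hence the comparisons below.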
## Canary Testing
Before we lose the data to the ELK retention window, here is a quick comparison.
Queries cover 2019-10-27 00:00:00.000 to 2019-10-29 23:59:59.999. During this period, my understanding is that Puma was enabled on web-01-cny and Unicorn on web-02-cny.
### Comparison 1: Rails latencies for all traffic, excluding healthchecks
Raw data: https://log.gitlab.net/goto/ceb45f1d2cdadae8b15ed13cb8c8652a
| Percentile | web-01-cny (ms) | web-02-cny (ms) | Difference |
|---|---|---|---|
| p50 | 84.586 | 77.85 | 8% slower on web-01-cny |
| p95 | 979.315 | 914.22 | 7.12% slower on web-01-cny |
| p99 | 2,438.65 | 2,332.90 | 4.53% slower on web-01-cny |
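As a sanity check, the percentage differences can be reproduced from the raw percentiles in the table above (the p50 figure comes out at ~8.65%, which the table rounds to 8%):

```ruby
# Relative slowdown of web-01-cny (Puma) vs web-02-cny (Unicorn),
# using the percentile latencies from the table above (assumed to be milliseconds).
latencies = {
  "p50" => [84.586, 77.85],
  "p95" => [979.315, 914.22],
  "p99" => [2438.65, 2332.90],
}

latencies.each do |pct, (puma, unicorn)|
  slowdown = ((puma / unicorn) - 1) * 100
  puts format("%s: %.2f%% slower on web-01-cny", pct, slowdown)
end
# p50: 8.65% slower on web-01-cny
# p95: 7.12% slower on web-01-cny
# p99: 4.53% slower on web-01-cny
```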
Conclusion 1: during this period, Puma processed requests about 7% slower on average.
### Comparison 2: Rails queuing durations for all traffic, excluding healthchecks
Raw data: https://log.gitlab.net/goto/e4e0a736ec1e4fefb29d4e7ebf3728c1
| Percentile | web-cny-01-sv-gprd (ms) | web-cny-02-sv-gprd (ms) | Difference |
|---|---|---|---|
| p50 | 7.111 | 6.879 | 3% slower on web-01-cny |
| p95 | 22.629 | 24.024 | 5% slower on web-02-cny |
| p99 | 231.045 | 459.171 | 50% slower on web-02-cny |
Conclusion 2: roughly the same, except at p99, where Puma's queuing duration was much better, about half that of Unicorn. This is likely because Unicorn reached saturation for longer periods.
### Comparison 3: Workhorse request durations for Rails requests
Raw data: https://log.gitlab.net/goto/12096d2264e86d77371fe8dff29aa8b6
This uses a narrower date range as workhorse has a shorter retention schedule.
| Percentile | web-cny-01-sv-gprd (ms) | web-cny-02-sv-gprd (ms) | Difference |
|---|---|---|---|
| p50 | 6 | 5 | 20.00% slower on web-01-cny |
| p95 | 856.622 | 683.831 | 25.27% slower on web-01-cny |
| p99 | 5,745.05 | 4,658.77 | 23.32% slower on web-01-cny |
Conclusion 3: measured from Workhorse, web-cny-01-sv-gprd is roughly 20% slower across the board.
Other factors may be affecting these results: the hosts might not be equivalent, load balancing could skew the traffic mix, and so on. However, with a 20% difference in Workhorse latencies, I think we should definitely slow down and work out the cause.
## Conclusion
Considering that the GitLab.com fleet is not memory constrained, and that our primary metrics are error rate and latency, I think we should roll half of the canary fleet back to Unicorn and keep the other half on Puma, so that a fair comparison can be done while we tune Puma further. After further discussion, we decided to enable Puma on select web/api/git nodes: production#1303 (closed)
I'm fairly confident we can resolve this problem, but while we do, we should continue to split traffic between the Unicorn and Puma servers so that the comparison remains fair.
## Production node testing
### Timeline
- Mon Nov 4 10:59:44 - Puma enabled on web-{01,02}, git-{01,02}, api-{01,02} in production with W14/T4 (14 workers, 4 threads per worker)
- Mon Nov 4 12:09:25 - Puma reconfigured on web-02 for W18/T2
- Mon Nov 4 12:54:17 - Puma reconfigured on web-02 for W25/T2
- Mon Nov 4 14:16:58 - Puma reconfigured on canary plus single main-stage nodes for W30/T1 (db_pool: 1)
- Wed Nov 6 14:39:00 - Puma disabled on all nodes due to db issues (not related to Puma, but we saw worse performance during them): production#1327 (closed)
- Thu Nov 14 09:33:22 - Puma reconfigured on web-cny-01 for W16/T2
- Thu Nov 14 09:37:39 - Puma reconfigured on web-cny-02 for W8/T4
- Fri Nov 15 10:36:39 - Puma reconfigured on web-cny-02 for W16/T4
- Fri Nov 15 15:50:05 - Puma reconfigured on web-01 for W16/T2
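The W/T notation in the timeline maps onto Puma's configuration DSL. A minimal sketch of what, say, the W16/T2 shape would look like in a raw `puma.rb` follows; on GitLab.com these values are actually managed through omnibus/chef attributes rather than hand-edited, so this is illustrative only:

```ruby
# Hypothetical puma.rb sketch for the W16/T2 shape (16 workers, 2 threads each).
# Real deployments set these via omnibus attributes, not this file directly.
workers 16
threads 2, 2   # min, max threads per worker

# Note: the Nov 4 W30/T1 step also dropped db_pool to 1 (a Rails database.yml
# setting), since a single-threaded worker needs only one database connection.
```

The thread count and the database pool size need to move together: each Puma thread that handles a Rails request can hold a database connection, so a pool smaller than the per-worker thread count risks connection contention.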