Puma tuning - Reduce increased latency introduced by Puma
Looking through the graphs, I see higher throughput from Puma which I would expect given the additional worker threads available (14*4 > 30), but I'm not sure if any latency tests were done?
Canary Testing
Before we lose the data over the retention window in ELK, I thought I would do a quick comparison:
Queries are for 2019-10-27 00:00:00.000 to 2019-10-29 23:59:59.999. During this period, I understand that puma was enabled on web-01-cny and unicorn was enabled on web-02-cny.
Comparison 1: Rails latencies for all traffic excluding healthchecks
Conclusion 2: roughly the same, except at the p99, where puma's latency was much better (half!) than unicorn's. This is likely because unicorn reached saturation for longer periods.
Comparison 3: Workhorse request durations for rails requests
This uses a narrower date range as workhorse has a shorter retention schedule.
| Percentile | web-cny-01-sv-gprd | web-cny-02-sv-gprd | Difference |
| --- | --- | --- | --- |
| p50 | 6 | 5 | 20.00% slower on web-01-cny |
| p95 | 856.622 | 683.831 | 25.27% slower on web-01-cny |
| p99 | 5,745.05 | 4,658.77 | 23.32% slower on web-01-cny |
Conclusion 3: across the board web-cny-01-sv-gprd is ~20% slower when comparing from Workhorse.
It's possible that other factors are affecting these results (the hosts might not be equivalent, load balancing, or other things); however, with a 20% difference in Workhorse latencies I think we should definitely slow down and consider what is causing it.
Conclusion
Considering that on GitLab.com we are running on a fleet that is not memory constrained, and that our primary metrics are error rate and latency, I think we should consider rolling back so that half the canary fleet runs puma and half runs unicorn, allowing a fair comparison while we tune puma further. After further discussion we have decided to enable puma on select web/api/git nodes production#1303 (closed)
I'm fairly sure that we can resolve this problem, but while we do, we should continue to split traffic between the two server types, unicorn and puma, so that we can keep making a fair comparison while we tune puma.
Production node testing
Timeline
Mon Nov 4 10:59:44 - Puma enabled on web-{01,02}, git-{01,02}, api-{01,02} in production with W14/T4
Mon Nov 4 12:09:25 - Puma reconfigured on web-02 for W18/T2
Mon Nov 4 12:54:17 - Puma reconfigured on web-02 for W25/T2
Mon Nov 4 14:16:58 - Puma reconfigured on canary plus single main stage nodes for W30/T1 with db_pool: 1 (the W/T notation is sketched as a Puma config after this timeline)
Wed Nov 6 14:39:00 - Puma disabled on all nodes due to db issues, not related to puma but we saw worse performance production#1327 (closed)
Thu Nov 14 09:33:22 - Puma reconfigured on web-cny-01 for W16/T2
Thu Nov 14 09:37:39 - Puma reconfigured on web-cny-02 for W8/T4
Fri Nov 15 10:36:39 - Puma reconfigured on web-cny-02 for W16/T4
Fri Nov 15 15:50:05 - Puma reconfigured on web-01 for W16/T2
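For reference, the W/T notation in the timeline above maps roughly to Puma's standard configuration DSL as sketched below. This is only an illustration of the knobs being tuned; the real fleet settings were managed via chef/omnibus attributes, and the db_pool value lives in the Rails database configuration, not in Puma itself.

```ruby
# config/puma.rb -- illustrative sketch only, showing what e.g. W16/T2 means.

workers 16     # W16: forked worker processes (one MRI GVL each)
threads 2, 2   # T2: min and max threads per worker

# Effective concurrency = workers * max threads = 32 request slots.
preload_app!   # fork workers from a preloaded app to share memory via CoW
```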
I think these results are great; what remains now is fine-tuning, together with fixing the CPU-intensive endpoints (those with cpu_s > 100ms).
Let's consider the following (a quick capacity calculation is sketched after the list):
Our canary has 16 CPUs,
Previously we were running 30 Unicorn workers,
Now, we run 14 Puma workers, each with 4 threads, which means the effective capacity is 56 concurrent requests,
It means that Puma is currently oversaturating the node (if all traffic is CPU-intensive) by almost 4x,
Also, due to the Ruby MRI GIL, a single worker might end up processing a mixed workload of very CPU-intensive and very light requests, which results in elevated processing times,
The queueing latencies are, I would say, within limits; in general we should see roughly the same queueing latencies as for Unicorn, with the exception of native extensions (native code is not interruptible/interchangeable with other threads),
The processing time is affected by oversubscription and a noisy environment: a single thread can be handed up to 4x the work, which in the most pessimistic case means 4x slower processing; with Unicorn we would likely see something between 1x-2x (depending on the workload across other processes, as the system scheduler balances them).
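A back-of-the-envelope version of the capacity math above, using only the numbers already quoted in the list:

```ruby
# Capacity arithmetic for the canary node (values from the list above).
cpus            = 16
unicorn_workers = 30                            # previous configuration
puma_workers    = 14
puma_threads    = 4

puma_slots       = puma_workers * puma_threads  # => 56 concurrent request slots
oversubscription = puma_slots.to_f / cpus       # => 3.5, i.e. "almost 4x" if all
                                                #    traffic were CPU-bound
puts "unicorn slots=#{unicorn_workers} puma slots=#{puma_slots} " \
     "oversubscription=#{oversubscription}x"
```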
I think that one of the next steps could be to discover, annotate, and likely balance our web servers based on the type of traffic (actually super hard to do), similarly to how we do it for Sidekiq: gitlab-org/gitlab!18066 (merged). If we could keep low-latency/non-intensive calls together, we would achieve much better resource utilisation and a much lower processing cost.
I think that we might prefer to change Puma to W16/T2 or W18/T2 to get pretty much** the same results as Unicorn, but with a severe reduction in resource usage.
**: up to the pessimistic case of request scheduling, with two very expensive requests being processed by a single worker
Also, I think that looking at requests taking <300ms for this comparison might simply not make sense, as this is not a user-visible change. We are trading much more concurrency/resource utilisation for a slightly higher duration, and an increase that stays under ~200ms is mostly not visible to users.
We need to find a better baseline to measure user impact. For example, I think we should be looking for data points showing that something that took 200ms on Unicorn takes 400ms on Puma.
John Jarvis changed title from Puma tuning on Canary - Reduce increased latency introduced by Puma to Puma tuning - Reduce increased latency introduced by Puma
We seem to hover at 35-45% CPU usage (regardless of whether we use Puma or Unicorn),
We could easily run more workers, or increase capacity with the number of threads, and thus reduce either the number of CPUs or the number of machines needed.
I guess we should play with and measure different configurations:
W14/T4: significantly increases duration due to multiple threads,
W18/T2: does increase duration by about 15-20% due to multiple threads,
W20/W25/T2?
W30/T2: should give us a good improvement over Unicorn.
It seems that we should optimise our nodes for 75-85% of constant CPU usage to maximise the use of resources, and it seems that we can achieve that with a higher number of workers. With the blackout period we no longer need the severe over-provisioning that was previously required to accept more connections for short periods of time.
For a bit more historical context: we used to have more unicorn workers, but because the nodes saturated whenever we issued a gitlab-ctl hup unicorn while taking live traffic, we lowered the count, leaving our nodes over-provisioned. Now, with the readiness check and blackout period, this problem goes away completely, so I am pretty sure we can safely increase the worker count for unicorn, but it probably makes more sense to focus on puma.
I'd like to suggest that we simplify the thought process around utilization. Based on the idea that Ruby MRI can only utilize one core per worker, I suggest we use a simple formula of CPUs - 1. This allows for reasonably good utilization, with some spare capacity left over for other system processes.
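A minimal sketch of that rule of thumb, assuming the worker count is derived from the node's core count at configuration time:

```ruby
require "etc"

# "CPUs - 1" rule of thumb: one MRI worker per core, leaving one core free
# for workhorse, exporters and other system processes.
def suggested_worker_count
  [Etc.nprocessors - 1, 1].max
end

puts suggested_worker_count # => 15 on a 16-CPU node
```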
Note that we definitely did have problems with CPU saturation due to too many workers, but it was only seen when we issued a gitlab-ctl hup unicorn, which saturated the CPU after the fork (30x2), in combination with the old workers staying around for a long time due to our very long worker timeout. I think we are going to be in a much better situation now that we have the readiness check with the blackout period, since nodes will be taken out of the load balancer when we issue a hup with unicorn, and this won't be as much of an issue with puma.
Yes, for Puma we might want to ensure that CPU usage with that number of workers stays around 80%-90%.
So, it really depends. So far with W18 we have seen only 30-40% CPU usage, but the nodes were not saturated with requests at all. And we were seeing latency increases anyway (the culprit likely being metrics?).
Using a grid allows you to compare latencies across the entire timeframe, without having to visually compare over the time range
Using workhorse gives us a better idea of end-to-end latency
As things stand at the moment, web-01 is slowest, web-02 is faster, web-03 is the fastest.
@jarv one suggestion: I think that the total concurrency (ie, workers*threads) between the two puma hosts should be equal. This will give us a fairer idea of what happens when we hit saturation.
If there is a mismatch it will skew requests that queue on one of the hosts while still having capacity on the other.
We have 1k requests per minute; the mean time is 400ms, p95 is 1s. Anything that escapes to native code is not interruptible and cannot run concurrently, and this particular endpoint takes a significant amount of time to process.
We do have /metrics on a separate endpoint already implemented; it is not yet fully tested, but I would assume that we might prefer to switch to that: gitlab-org/gitlab#30037 (closed).
Maybe we should open an issue to start scraping /metrics on staging instead of /-/metrics?
@ayufan I'm fairly certain that I wrote that down when I proposed gitlab-org/gitlab#30201 (closed), but looking back at it, I didn't. At the very least, I was thinking it.
I definitely think this makes sense. Mostly for the same reasons as moving the health check makes sense.
We just need to be careful as we do it, of course.
Unfortunately I don't have the time to read all of this discussion but I came here to leave a note that since we switched to Puma on Canary, Canary is basically unusable for me and I had to disable it.
Performance bar shows the following for GitLab.com:
This is a more complex approach, but it could allow us to slash queueing latency: gitlab-org/gitlab#35577, and to selectively affect duration based on the content being processed, by assigning thread priority based on the workload being run.
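A rough illustration of what "assign thread priority based on workload" could look like in MRI. Thread#priority is only a hint to the scheduler, and the endpoint classification here is entirely hypothetical:

```ruby
# Hypothetical sketch: lower the priority of threads serving known
# CPU-heavy endpoints so light requests on sibling threads run first.
CPU_HEAVY_PATHS = ["/-/chaos/cpu_spin"].freeze # made-up classification

def with_workload_priority(request_path)
  Thread.current.priority = CPU_HEAVY_PATHS.include?(request_path) ? -1 : 0
  yield
ensure
  Thread.current.priority = 0 # restore the default for the next request
end
```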
@ayufan what are your thoughts for the rest of this week and puma configuration, should we start by bumping the thread count up on a canary VM with 30W?
@andrewn During the incident related to the database last week we noticed that the puma nodes were having more difficulty than the other nodes so we decided to switch them back to unicorn production#1327 (comment 240898245)
I was hoping to have this started yesterday but I am waiting on approvals to update the db_pool settings to use the new defaults, I believe it will happen today.
The results from testing production#1362 (closed) look much better. To summarize what has been done so far:
all production non-canary nodes are running unicorn
Thu Nov 14 09:33:22 - Puma reconfigured on web-cny-01 for W16/T2
Thu Nov 14 09:37:39 - Puma reconfigured on web-cny-02 for W8/T4
Fri Nov 15 10:36:39 - Puma reconfigured on web-cny-02 for W16/T4
@andrewn @ayufan next we should move forward with a single node in the main stage again. If we decide to move forward with W16/T4, will it be concerning with db_pool set to 4? This will drastically increase the number of connections, since we are currently running W30 with db_pool set to 1.
I wonder. As I said, our nodes are over-provisioned. Maybe we should run W16/T2 for now? It still gives significantly more headroom than Unicorn. The problem is that we have 16 CPUs, and we should keep Wn close to the number of CPUs.
We seem to hover at less than 50% CPU usage (30-40%), which means we could test running W8/T4 on another node, with the intent of eventually shrinking the number of CPUs to something like 8/12.
Now, keeping W16/T2 or W8/T4 gives us a similar number of connections, so I think it is fine to stay like that for now. Once we are confident, we could experiment with W16/T4.
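A rough sketch of the connection counts being compared above, assuming connections per node are approximately workers * db_pool:

```ruby
# Rough connection arithmetic for the configurations discussed above.
configs = {
  "W30 Unicorn, db_pool 1" => 30 * 1, # => 30
  "W16/T2,      db_pool 2" => 16 * 2, # => 32 (similar)
  "W8/T4,       db_pool 4" =>  8 * 4, # => 32 (similar)
  "W16/T4,      db_pool 4" => 16 * 4, # => 64 (the "drastic" increase)
}
configs.each { |name, conns| puts "#{name}: ~#{conns} DB connections" }
```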
To summarize the latest changes after fixing metrics endpoint.
TL;DR It does not bring a substantial improvement; we continue to see performance degradation when running Puma.
Description
The web-01 is significantly slower, as another process besides Puma is running on it that consumes a significant amount of resources and impacts system behaviour; it should not be used for comparison,
The web-02 is statistically relevant: it processes the same amount of requests as web-03 and can be used for comparing Unicorn and Puma under production load,
The web-02 and web-03 CPU usage is roughly similar,
The web-02 and web-03 will be used for the comparison,
The web-02 is configured with W16/T2 running Puma,
The web-03 is configured with W30 running Unicorn,
We will not compare memory usage, as it was validated previously; only the performance of request processing will be evaluated,
Our nodes are severely underutilized; they use mostly around 30-40% of CPU capacity and very little memory (this does impact the timings that we see with Puma).
As expected, we see roughly the same amount of time spent on CPU-thread time. This should not change, as each thread needs to process roughly the same (on average) amount of data.
We see significantly worse DB processing duration, around ~40%. It is hard to pinpoint exactly why this is happening, likely because of interleaved multi-threading. I assume that we calculate DB duration by looking at a monotonic time measurement, which means the DB duration accounted here might include work being done by other threads, even if the response from the DB comes earlier.
Recommendation
We should continue running Puma on canary nodes with T2, as this is the setting that causes the minimal impact on processed requests,
We should continue testing Puma deployment on canary to build confidence in running Puma at scale and using our deployment scripts,
I don't see a strong reason for running Puma on web-01/02 right now, unless we want to continue testing Puma handling production volume (Canary processes 3x fewer requests than web-01/02/...), as it causes performance degradation for our users. It is unlikely that this change is noticeable to users (it adds around 20ms on average). I think it is up to the production team to decide whether they want to continue testing Puma at production volume, taking the increase in latency into account,
Overall it seems that Puma works properly with respect to request processing, the restart cycle, and connection pools. It seems that it is not yet ~performance-ready. However, we should continue monitoring for any additional discoveries related to running services.
I think we should continue running on Canary (single/dual node is enough) to build confidence in Puma that it works properly, but keeping in mind that it is not yet super performant.
I did validate a theory of why we see this performance deficiency. I am currently working on writing an exact description, with data, in an additional issue that confirms it.
> I think we should continue running on Canary (single/dual node is enough) to build confidence in Puma that it works properly, but keeping in mind that it is not yet super performant.
I will also see about resolving delivery#537 (closed) so we can increase the traffic volume on the canary fleet.
> I don't see a strong reason for running Puma on web-01/02 right now
Sounds good, I have gone ahead and switched these two hosts back to unicorn
@ayufan Regarding the DB timings: the database load balancer uses a mutex in a few places. It's possible that when using Puma this leads to degraded database performance, but I would be surprised if it has a serious impact. If the Mutex turns out to be too expensive, maybe we could replace it with a spinlock of sorts. For this we would likely have to use something like http://ruby-concurrency.github.io/concurrent-ruby/master/Concurrent/AtomicBoolean.html.
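A minimal sketch of the kind of spinlock Yorick mentions, built on concurrent-ruby's AtomicBoolean. This is illustrative only; whether it actually beats a Mutex under MRI would need measuring:

```ruby
require "concurrent"

# Tiny test-and-set spinlock sketch using Concurrent::AtomicBoolean.
# make_true returns true only when the value actually flips false -> true,
# which gives us a "try to acquire" primitive.
class SpinLock
  def initialize
    @locked = Concurrent::AtomicBoolean.new(false)
  end

  def synchronize
    Thread.pass until @locked.make_true # spin until we win the flip
    yield
  ensure
    @locked.make_false                  # release
  end
end
```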
@yorickpeterse When I did local tests it was not showing any performance degradation on that front; I was investigating our load-balancing code to understand whether any of it causes bottlenecks, and it does not seem so. You did a great job there; pretty much, I had nothing to improve :)
The interleaved concurrent processing is causing these metrics to be inaccurate. I would assume the same will be seen in Sidekiq when we run it there.
> I don't see a strong reason for running Puma on web-01/02 right now, unless we want to continue testing Puma handling production volume (Canary processes 3x fewer requests than web-01/02/...), as it causes performance degradation for our users.
I would suggest that we don't run in production until we are ready for some additional level of testing and limit Puma to staging and Canary workloads.
Yes, we decided to get back to running Unicorn on all web-XX, leaving Puma only to Canary, Staging and Dev.
The reasons for the increase in latencies introduced by Puma
I took a stab at understanding why Puma performs worse than Unicorn on lightly loaded servers, like web-01/02/....
Lightly loaded means that the servers:
Consume less than 50% of CPU,
Run a number of concurrent requests less than or equal to the number of workers.
Capacity of Puma
The capacity of the Puma web server is defined by workers * threads. Each of these forms a slot that can accept a new request, and each slot can accept a new request at random (effectively round-robin).
Now, what happens if we have 2 requests waiting to be processed, two workers, and two threads:
1. Our total capacity is 4 (W2 * T2),
2. Each request can be assigned at random to any worker, even a worker that is currently processing a request, as we do not control that,
3. It means that in the ideal scenario, if we have 2 requests, for optimal performance they should be assigned to two separate workers.
Puma does not implement any mechanism for 3.; rather, it is first to accept, first wins. It means it is plausible that the two requests will be assigned in a sub-optimal way: to a single worker, but multiple threads.
If the two requests are being processed by the same worker, they share CPU time, which means their processing time is increased by the noisy neighbour due to the Ruby MRI GVL.
Interestingly, the same latency impact is present in Sidekiq; we just don't see it, as we do not care about the real-time aspect of background processing that much. However, the scheduling changes could improve Sidekiq performance as well if we targeted Sidekiq too.
Canary
Why didn't we see this on Canary?
The answer is capacity offered vs capacity consumed. Considering that Production and Canary nodes are configured identically, and given that Canary simply runs 3x fewer requests, the probability of hitting a worker that is already processing a request is reduced significantly. With that traffic volume against the same offered capacity, we simply do not hit the sub-optimal assignment often enough to make it statistically significant (a toy probability sketch follows).
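A rough way to see the effect, as a toy simulation rather than measured data: under random assignment, the chance that two in-flight requests land on the same worker shrinks as the number of concurrent requests drops relative to the worker count. The in-flight numbers below are illustrative, not taken from production.

```ruby
# Toy Monte Carlo sketch: probability that at least two of `inflight`
# concurrent requests land on the same worker under random assignment.
def collision_probability(workers:, inflight:, trials: 100_000)
  hits = trials.times.count do
    Array.new(inflight) { rand(workers) }.uniq.size < inflight
  end
  hits.to_f / trials
end

# Same worker count, different load:
puts collision_probability(workers: 14, inflight: 4)  # canary-like load
puts collision_probability(workers: 14, inflight: 12) # production-like load
```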
Ideal scheduling algorithm in a multi-threaded scenario
In an ideal scenario we would always want to use our capacity efficiently (a small code sketch of the least-busy idea follows below):
Assign to free workers first, as they offer the best latency,
Assign to workers with the least amount of other requests being processed, as they are the least busy,
Assign to workers that are close to finishing the requests being processed, as they will have free capacity in the future (this is hard to implement, as it requires a quite complex heuristic to estimate request completion time).
Currently, Puma does none of that. For round-robin we pay around a 20% performance penalty due to sub-optimal request assignment.
It is expected that the closer we get to 100% CPU usage, the less of a problem Puma's lack of scheduling will be, as the node is saturated and threads are balanced by the kernel.
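To make the least-busy idea from the list above concrete, here is a small hypothetical sketch comparing random assignment with a least-busy heuristic. This is not Puma code; the Worker struct and functions are made up for illustration.

```ruby
# Hypothetical scheduler sketch: pick the worker with the fewest in-flight
# requests instead of letting any worker accept at random.
Worker = Struct.new(:id, :in_flight)

def assign_random(workers)
  workers.sample                # today: effectively random / first-to-accept
end

def assign_least_busy(workers)
  workers.min_by(&:in_flight)   # rules 1 and 2 above: idle/least-loaded first
end

workers = [Worker.new(1, 1), Worker.new(2, 0), Worker.new(3, 2)]
assign_random(workers)     # may pick the already-busy worker 3
assign_least_busy(workers) # always picks the idle worker 2
```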
Tests
I wanted to validate the above theory, so I ran Puma and Unicorn in different scenarios, calculating the saturation factor.
```bash
#!/bin/bash

run_ab() {
  result=$(ab -n "$1" -c "$2" http://localhost:3000/-/chaos/cpu_spin\?token\=secret\&duration_s\=1)
  time_taken=$(echo "$result" | grep "Time taken for tests:" | cut -d" " -f7)
  time_per_req=$(echo "$result" | grep "Time per request:" | grep "(mean)" | cut -d" " -f10)
  echo -e "$1\t$2\t$time_taken\t$time_per_req"
}

for i in 1 2 3 4 5 6 7 8; do
  run_ab $((i*20)) $i
  sleep 1
done
```
Idea for resolving that
Now, when I think about it, the reason is obvious :)
I was thinking that if we could somehow make Puma clever and aware of the other workers, we could delay socket acceptance for request processing on busy workers.
This should allow us to implement a mechanism that prefers non-busy workers first for request processing, but allows more concurrent capacity when the system requires it. That would bring Puma performance on par with Unicorn on lightly saturated nodes (as on GitLab.com), while reducing resource usage significantly.
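A hand-wavy sketch of the delayed-acceptance idea, purely illustrative and not how Puma's internals are actually structured:

```ruby
# Illustrative only: a busy worker backs off briefly before competing for the
# next connection, so idle workers tend to win the accept() race, while full
# capacity is still available once every worker is busy.
def accept_preferring_idle_workers(listener, busy_threads)
  sleep(0.005 * busy_threads) if busy_threads > 0 # back off in proportion to load
  listener.accept
end
```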
@ayufan nice finding! It looks like Puma 'round-robins' across all threads, without considering the worker level. In that case, I can imagine that in the worst case, more threads means more latency.
Suppose we configure Puma as WmTn (m workers, each worker with n threads). If we issue n concurrent requests, they may all be scheduled to T11..T1n (all belonging to W1). Now, if the system actually has more than n cores, only 1 core is used, so it is n times slower than the ideal case (run on n different CPU cores).
After these n requests finish, we issue another n concurrent requests; now the round-robin strategy will schedule them to T21..T2n (all belonging to W2).
Repeating the above scenario, in a WmTn configuration the performance can be n times slower in theory. I guess this is the worst case.
> It looks like Puma 'round-robins' across all threads, without considering the worker level. In that case, I can imagine that in the worst case, more threads means more latency.
It is not that they do it deliberately; this is just how it works. You simply accept another connection on the socket.
> Repeating the above scenario, in a WmTn configuration the performance can be n times slower in theory. I guess this is the worst case.
Yep, fully theoretically possible. Unlikely in a real scenario, but this is how it is implemented today.
On 2019/12/18 we had a production incident: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334, where our webcny stage was breaching SLO thresholds on latency and error rate. From the Puma logs we noticed OOM errors, so we drained the 2 cny Puma nodes, and that resolved the incident.
As of right now (2019/12/18 0700UTC), the 2 nodes (web-cny-01 and web-cny-02) are still in DRAIN state. I think we should figure out why we ran into OOMs and what settings we should look at before we re-enable Puma, and I will be happy to help with it (or any of the other oncalls if I am not oncall).
I'm not seeing any unusual amount of restart behavior from web-cny-01. There is a higher background churn rate, but it's not worse at any specific time. In fact, it was better yesterday than it was the previous day.
I spent some time doing some basic analysis today.
Analysis Technique
Using 24 hours of data (owing to ELK issues, I've been unable to extend this over longer periods). Once our Kibana problems are resolved, I would be happy to rerun this test over a longer period (cc @ansdval)
Measure workhorse latencies. Measuring from workhorse means that we include any time that the request spends queueing in unicorn or puma before being handled.
Comparing like-for-like paths on both servers. This approach means we can (roughly) compare similar workloads passing through unicorn and puma for a more accurate like-for-like comparison, and it rules out (some of) the possibility that one subset of nodes is receiving a favourable workload.
Initially the test compared web-01 and web-02 (puma) to web-08 and web-09 (not puma). On request from @jarv and @ayufan, revised to web-02 (puma) and web-09 (not puma). The same number of nodes was used on both sides to ensure that sample sizes were roughly equal.
This is a redacted copy, excluding private information, of the top 50 paths on GitLab.
How to interpret this table
The rows represent individual paths in the application
The columns are grouped into three sets. The first set is not-puma, the second set is puma, and the third set is how much slower puma is than unicorn. Red indicates worse performance, blue indicates better performance.
Each column group contains latencies for p50 (median), p95 (one in twenty worst requests) and p99 (one in one hundred worst requests). We also include the number of samples for each URL for puma and not-puma. They are reasonably well matched, although for unique URLs the numbers quickly drop down into the hundreds of samples, which is not really enough.
Walk me through the data
There are a lot of reasons to be concerned by these results, although there may be mitigating circumstances (for example, there is concern that web-02 is not a good host to pick, as it may have performance problems).
Multithreading comes with an expected cost to performance, but this cost is currently much higher than I had expected.
Let's look at a single example:
Pretty much the simplest operation in the application is the healthcheck at /-/readiness. This does very little but pass through some basic middlewares and then return. My (naive) expectation had been a ~5% to ~8% slowdown on an endpoint such as this; what we see is much higher. In unicorn, the requests have the following latency profile: p50 = 8ms, p95 = 12ms, p99 = 17ms. In puma, the profile is p50 = 9ms, p95 = 15ms, p99 = 110ms. This means that the 1-in-100 worst case puma request is more than 5 times slower than the 1-in-100 worst case unicorn request for the most basic request in the application.
This trend repeats on other endpoints, although unfortunately with only a 24 hour range, the sample size quickly becomes too small to use the p99 or p95. Having said that, the endpoints that do have enough data for reasonable p95s and p99s are thoroughly in favour of unicorn.
Once our Kibana problems are resolved and we can fetch more data for greater sample sizes, I will try again; alternatively, I'll switch the analysis over to BigQuery instead of Kibana.
Can we gather data from all nodes and compare them?
We are also going to tweak Puma settings tomorrow. At least in the past, when looking at the requests, this discrepancy was comparable, but we also looked at a slightly different data set. The missing ELK data makes it harder.
Let's understand why these values are all over the place for Unicorn as well as for Puma.
all nodes process roughly the same number of requests,
50th percentile of Puma is substantially better than the one for Unicorn,
95th percentile of Puma is substantially better than the one for Unicorn,
99th percentile of Puma is substantially better, except for the /-/readiness check; this is somewhat expected, as sleep introduces a delay that might be visible for very high-throughput traffic.