Rails appears slower in Kubernetes versus VMs
During the migration of `gitlab-shell` into Kubernetes, we noticed something that seems a little strange. When monitoring the logged `duration_ms` metric from `gitlab-shell` calls, they were consistently taking longer than the same calls from our VM infrastructure.
## Possible Conclusion
A possible conclusion to this investigation can be found in this thread: #1349 (comment 448797972)
## Issue Details
One could quickly assume that this is due to the nature of the infrastructure. It was discussed in the readiness review for this service that we'd see interesting behavior due to a few things:
- The networking inside of Kubernetes is much more complex than that of our VMs inside of a VPC
- `gitlab-shell` reaches out to a different service, `gitlab-webservice` in this case, for any calls that require Rails activity
While it's easy to assume the addition is network latency, @jarv made some fantastic observations. The below is copied directly from his investigation (Source):

We can rule out network hops by drilling into workhorse/rails metrics and filtering on the `GitLab Shell` user agent to exclude other calls. It looks like there is additional latency all the way to Rails:
Workhorse
- Authorized_keys: https://log.gprd.gitlab.net/goto/10213240857959ab9494529469d591d7
- Internal Allowed: https://log.gprd.gitlab.net/goto/2150950732865cdf10d3ae3fbffe0725

Rails
- Internal Allowed: https://log.gprd.gitlab.net/goto/b664417392ff0ecef2dff568a76f8d18

We see this latency increase at Rails: https://log.gprd.gitlab.net/goto/b664417392ff0ecef2dff568a76f8d18

Specifically, `db_duration_s` looks like it is a bit worse on K8s: https://log.gprd.gitlab.net/goto/815190cb3ac1cf77874448273f0d942c
So inter-service communication, and with it network latency, has been ruled out as the primary culprit; while it may still play a role, it is not as significant as the database calls that Rails is performing.
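One way to compare `db_duration_s` between environments directly is to run an identical probe from both a VM and a pod and compare latency percentiles rather than single samples. A minimal Ruby sketch (the `SELECT 1` probe named in the comment is illustrative; any cheap ActiveRecord call would do):

```ruby
# Time a probe repeatedly and report latency percentiles (ms), so the same
# measurement can be compared between a VM and a Kubernetes pod.
# The probe is whatever you want to isolate -- e.g. from a Rails console:
#   latency_percentiles_ms { ActiveRecord::Base.connection.execute("SELECT 1") }
def latency_percentiles_ms(iterations: 100, &probe)
  samples = iterations.times.map do
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    probe.call
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0) * 1000.0
  end.sort
  { p50: samples[iterations * 50 / 100],
    p95: samples[iterations * 95 / 100 - 1],
    p99: samples[iterations * 99 / 100 - 1] }
end
```

Comparing the p95/p99 spread, not just the median, should make it clearer whether the K8s overhead is a constant additive cost or a tail-latency problem.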
Utilize this issue to determine where discrepancies or performance issues may exist. Maybe start with the following questions:
- Is there a configuration difference between our VMs and our Kubernetes installations?
- Is there a way we can measure the latency of simply speaking to postgres/pgbouncer between Kubernetes Pods, to measure and compare network latency? Maybe the use of the Google Load Balancer, or Consul, is playing a role?
- Is there some other library we are potentially missing that our Omnibus installations have but our container images do not?
- Is there an optimization in Kubernetes or Google's VPC that we need to dive into?
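For the question about measuring raw network latency to postgres/pgbouncer, a crude but useful first pass is timing bare TCP connects from inside a pod and from a VM, before involving the database protocol at all. A sketch in Ruby (the `pgbouncer` Service name and port 6432 in the example are assumptions, not our actual configuration):

```ruby
require "socket"

# Median TCP connect latency in milliseconds to host:port.
# Run the same measurement from a VM and from a Kubernetes pod to
# compare raw network latency toward postgres/pgbouncer.
def tcp_connect_latency_ms(host, port, attempts: 20)
  samples = attempts.times.map do
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    Socket.tcp(host, port, connect_timeout: 2) { }  # connect, then auto-close
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0) * 1000.0
  end
  samples.sort[attempts / 2]
end

# Example (hypothetical in-cluster Service name and pgbouncer port):
#   puts tcp_connect_latency_ms("pgbouncer", 6432)
```

If connect latency is comparable in both environments, that would point the investigation back at query execution or connection pooling rather than the network path.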
Potentially we could experiment. `gitlab-shell` currently uses the default of sending its traffic to the `gitlab-webservice` Kubernetes Service object. We could override this and instead send it to our API endpoint. To configure this, see the following documentation: https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/#workhorse
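For the experiment above, the override would live in the chart values for the `gitlab-shell` subchart. The snippet below is only a sketch: the key names should be verified against the linked docs, and the hostname/port shown are hypothetical placeholders, not our real endpoints.

```yaml
# values override sketch -- verify key names against the gitlab-shell
# chart docs before use
gitlab:
  gitlab-shell:
    workhorse:
      # Bypass the in-cluster gitlab-webservice Service and point
      # gitlab-shell's internal API calls at an alternate endpoint.
      host: api.internal.example  # hypothetical API endpoint
```

Comparing `duration_ms` before and after such a change would tell us how much of the overhead is attributable to the in-cluster Service path.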
This issue can be closed when we can answer the following question: we appear to be suffering roughly 400ms of added latency when using Kubernetes as our infrastructure. Is this acceptable, and can we improve it?
/cc @gitlab-org/database-team /cc @gitlab-org/scalability /cc @jarv