Rails appears slower in Kubernetes versus VMs
During the migration of `gitlab-shell` into Kubernetes, we noticed something that seems a little strange. When monitoring the logged `duration_ms` metric from `gitlab-shell` calls, they were consistently taking longer than the same calls from our VM infrastructure.
## Possible Conclusion
A possible conclusion to this investigation can be found in this thread: #1349 (comment 448797972)
## Issue Details
One could quickly assume that this is due to the nature of the infrastructure. It was discussed in the readiness review for this service that we'd see interesting behavior due to a few things:
- The networking inside of Kubernetes is much more complex than that of our VMs inside of a VPC
- `gitlab-shell` reaches out to a different service, `gitlab-webservice` in this case, for any calls that require Rails activity
While it's easy to assume the addition is network latency, @jarv made some fantastic observations. The below is copied directly from his investigation (Source):

We can rule out network hops by drilling into workhorse/rails metrics and filtering on the `GitLab Shell` user agent to exclude other calls. It looks like there is additional latency all the way to Rails:
Workhorse
- Authorized_keys: https://log.gprd.gitlab.net/goto/10213240857959ab9494529469d591d7
- Internal Allowed: https://log.gprd.gitlab.net/goto/2150950732865cdf10d3ae3fbffe0725

Rails
- Internal Allowed: https://log.gprd.gitlab.net/goto/b664417392ff0ecef2dff568a76f8d18

We see this latency increase at Rails: https://log.gprd.gitlab.net/goto/b664417392ff0ecef2dff568a76f8d18

Specifically, `db_duration_s` looks like it is a bit worse on K8s: https://log.gprd.gitlab.net/goto/815190cb3ac1cf77874448273f0d942c
So inter-service communication, and with it network latency, has been ruled out as the primary culprit; while it may still play a role, it is not as significant as the database calls that Rails is performing.
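One way to compare `db_duration_s` between environments directly is to run an identical probe from both a VM and a pod and compare latency percentiles rather than single samples. A minimal Ruby sketch (the `SELECT 1` probe named in the comment is illustrative; any cheap ActiveRecord call would do):

```ruby
# Time a probe repeatedly and report latency percentiles (ms), so the same
# measurement can be compared between a VM and a Kubernetes pod.
# The probe is whatever you want to isolate -- e.g. from a Rails console:
#   latency_percentiles_ms { ActiveRecord::Base.connection.execute("SELECT 1") }
def latency_percentiles_ms(iterations: 100, &probe)
  samples = iterations.times.map do
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    probe.call
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0) * 1000.0
  end.sort
  { p50: samples[iterations * 50 / 100],
    p95: samples[iterations * 95 / 100 - 1],
    p99: samples[iterations * 99 / 100 - 1] }
end
```

Comparing the p95/p99 spread, not just the median, should make it clearer whether the K8s overhead is a constant additive cost or a tail-latency problem.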
Utilize this issue to determine where discrepancies or performance issues may exist. Maybe start with the following questions:
- Is there a configuration difference between our VMs and our Kubernetes installations?
- Is there a way we can measure the latency of simply speaking to postgres/pgbouncer between Kubernetes Pods, to measure and compare network latency? Maybe the use of the Google Load Balancer, or Consul, is playing a role?
- Is there some other library we are potentially missing that our Omnibus installations have but our container images do not?
- Is there an optimization in Kubernetes or Google's VPC that we need to dive into?
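For the question about measuring raw network latency to postgres/pgbouncer, a crude but useful first pass is timing bare TCP connects from inside a pod and from a VM, before involving the database protocol at all. A sketch in Ruby (the `pgbouncer` Service name and port 6432 in the example are assumptions, not our actual configuration):

```ruby
require "socket"

# Median TCP connect latency in milliseconds to host:port.
# Run the same measurement from a VM and from a Kubernetes pod to
# compare raw network latency toward postgres/pgbouncer.
def tcp_connect_latency_ms(host, port, attempts: 20)
  samples = attempts.times.map do
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    Socket.tcp(host, port, connect_timeout: 2) { }  # connect, then auto-close
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0) * 1000.0
  end
  samples.sort[attempts / 2]
end

# Example (hypothetical in-cluster Service name and pgbouncer port):
#   puts tcp_connect_latency_ms("pgbouncer", 6432)
```

If connect latency is comparable in both environments, that would point the investigation back at query execution or connection pooling rather than the network path.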
Potentially we could experiment. `gitlab-shell` currently uses the default of sending its traffic to the `gitlab-webservice` Kubernetes Service object. We could override this and instead send it to our API endpoint. To configure this, see the following documentation: https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/#workhorse
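For the experiment above, the override would live in the chart values for the `gitlab-shell` subchart. The snippet below is only a sketch: the key names should be verified against the linked docs, and the hostname/port shown are hypothetical placeholders, not our real endpoints.

```yaml
# values override sketch -- verify key names against the gitlab-shell
# chart docs before use
gitlab:
  gitlab-shell:
    workhorse:
      # Bypass the in-cluster gitlab-webservice Service and point
      # gitlab-shell's internal API calls at an alternate endpoint.
      host: api.internal.example  # hypothetical API endpoint
```

Comparing `duration_ms` before and after such a change would tell us how much of the overhead is attributable to the in-cluster Service path.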
This issue can be closed when we can answer the following question: we appear to be suffering roughly 400ms of added latency when using Kubernetes as our infrastructure. Is this acceptable, and can we improve it?
/cc @gitlab-org/database-team /cc @gitlab-org/scalability /cc @jarv