Long Polling from GitLab Runners not correctly working in a Kubernetes Deployed Environment

Summary

During Change Request gitlab-com/gl-infra/production#4577 (closed), we are shifting traffic off of our VMs and into Kubernetes in a slow, controlled manner. Earlier, we discovered that the long polling mechanism did not appear to be working correctly. After finding a misconfiguration in our Kubernetes installation, we rolled back to make the necessary correction. However, now that we are trying again, we continue to see similar behavior.

What is the current bug behavior?

The number of requests from Runners to Workhorse increases linearly with the share of traffic shifted into Kubernetes.

What is the expected correct behavior?

We should not see a dramatic increase in RPS on the Workhorse service as traffic moves into Kubernetes.

Relevant logs and/or screenshots

The first sign of this behavior is a jump in RPS.

image

Source

This screenshot shows three clearly defined steps that match the times at which we shifted different percentages of traffic into Kubernetes.

The same can be seen from our metrics of our Runners:

image

Source

The next sign of this behavior is a massive drop in the p50 response time for the endpoint that serves our runner traffic:

image

Source

This suggests that Workhorse is not entering a long poll state.

A much lower p50 looks like a good thing, but here it has negative implications. Runner CPU usage will increase, because each runner has to reach back out to the API for work far more often. It also wastes bandwidth, since Workhorse should be entering a long poll state to reduce the overall resource usage of the systems involved.
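To make the cost concrete, here is a rough back-of-the-envelope sketch of how many requests a single idle runner would send per hour with and without long polling. The model is deliberately simplified and is an assumption, not a description of the actual runner implementation: each cycle is one request (held open for the long poll duration, or answered immediately) followed by a sleep of one check interval. The values 3 seconds and 50 seconds are illustrative placeholders, not settings taken from this environment.

```python
import math


def requests_per_hour(check_interval_s: float, long_poll_s: float) -> int:
    """Requests one idle runner sends per hour when no jobs are available.

    Simplified model (an assumption): the server holds each request for
    long_poll_s seconds (0 if it answers immediately), then the runner
    waits check_interval_s seconds before asking again.
    """
    cycle_s = long_poll_s + check_interval_s
    return math.floor(3600 / cycle_s)


# Hypothetical values, chosen only for illustration.
CHECK_INTERVAL_S = 3.0
LONG_POLL_S = 50.0

with_long_poll = requests_per_hour(CHECK_INTERVAL_S, LONG_POLL_S)
without_long_poll = requests_per_hour(CHECK_INTERVAL_S, 0.0)
print(f"with long polling:    {with_long_poll} requests/hour")
print(f"without long polling: {without_long_poll} requests/hour")
```

Under these assumed numbers the request rate differs by more than an order of magnitude, and nearly every response returns immediately, which is consistent with both the RPS jump and the p50 drop described above.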

The state of long polling can be observed via, at least, the gitlab_workhorse_queueing_waiting metric, whose values we can compare between our VMs and Kubernetes via this really long url to Thanos.
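For anyone who wants to do the same comparison outside the dashboard, the following is a minimal sketch that totals an instant-vector result by one label. It assumes the samples were already fetched from a Prometheus-compatible query API and have the standard `{"metric": {...}, "value": [timestamp, "value"]}` shape. The label that distinguishes the VM fleet from the Kubernetes fleet is not stated in this issue, so the label name is left as a parameter and any name passed to it is hypothetical.

```python
from collections import defaultdict


def sum_by_label(result: list[dict], label: str) -> dict[str, float]:
    """Sum instant-vector sample values, grouped by a single label.

    `result` is the "result" list of a Prometheus-style instant query.
    Samples that lack the label are grouped under "<none>".
    """
    totals: defaultdict[str, float] = defaultdict(float)
    for sample in result:
        key = sample["metric"].get(label, "<none>")
        _timestamp, value = sample["value"]  # the value arrives as a string
        totals[key] += float(value)
    return dict(totals)
```

Calling this on the result for gitlab_workhorse_queueing_waiting with whichever label separates the two fleets gives one total per fleet. If long polling is working in both places, the two totals should be roughly proportional to the share of runner traffic each fleet receives.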

The initial thread about this situation was discussed and resolved in gitlab-com/gl-infra/production#4577 (comment 579990112), but it appears that something else is still at play and causing this behavior.

Output of checks

This bug happens on GitLab.com
