
Latency and errors when pods are cycled for git https

This issue will document our findings on cycling webservice pods while they are taking traffic.

We are frequently seeing errors when we deploy git-https (webservice pods) in our zonal clusters on gitlab.com.

We have made a couple of changes so far:

These changes seem to help, but we are still seeing situations where the number of pods processing git https traffic drops during a deploy. This puts excessive load on the remaining pods, which lowers our Apdex score.

It's clear in this graph:

https://thanos-query.ops.gitlab.net/graph?g0.range_input=2h&g0.max_source_resolution=0s&g0.expr=count(sum(rate(http_requests_total%7Benv%3D%22gprd%22%2C%20app%3D%22webservice%22%2C%20stage%3D%22main%22%2C%20status%3D%22200%22%7D%5B1m%5D))%20by%20(pod%2C%20region)%20%3E%200)%20by%20(region)&g0.tab=0
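For readability, the URL-encoded expression above decodes to the following PromQL, which counts, per region, the webservice pods that served at least one 200 response in the last minute:

```promql
count(
  sum(
    rate(http_requests_total{env="gprd", app="webservice", stage="main", status="200"}[1m])
  ) by (pod, region) > 0
) by (region)
```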

[Graph: count of webservice pods serving traffic, by region, dropping during a deploy]

Here the number of pods taking traffic drops from 50 all the way down to 28.

I believe what is happening here is that maxSurge (25%) is bringing up new pods, but we are terminating old pods before the new ones are ready to take traffic.
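For context, here is a minimal sketch of the rolling-update strategy being described. The field values come from the numbers above; the Deployment name and everything else is assumed for illustration, not copied from our chart:

```yaml
# Rolling-update strategy matching the behavior described above.
# maxSurge: 25% lets Kubernetes create up to 25% extra pods during a deploy;
# maxUnavailable: 0 should mean no old pod is removed until a replacement
# counts as "available".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-webservice-git   # hypothetical name for the git-https webservice Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
```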

For maxUnavailable: 0, what is the definition of "available"? Does this mean "ready"? What if only one container (workhorse) is ready, but rails isn't? One theory I have is that we pass the workhorse readiness check very quickly; maybe this is causing us to terminate old pods too aggressively.
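For reference, Kubernetes only marks a Pod as Ready once every container in it passes its readiness probe, and the Deployment controller only counts a Ready pod as available after it has been Ready for minReadySeconds (default 0). So if only workhorse has a meaningful probe, a pod can flip Ready before rails can actually serve. A minimal sketch of probing both containers, where the container names, ports, and paths are assumptions for illustration rather than our actual chart values:

```yaml
# Sketch: give each container its own readiness probe, and require the pod
# to stay Ready for a while before it counts as "available" to the rollout.
spec:
  minReadySeconds: 15            # pod must stay Ready this long before counting as available
  template:
    spec:
      containers:
        - name: workhorse
          readinessProbe:
            httpGet:
              path: /-/readiness # assumed endpoint
              port: 8181         # assumed workhorse port
        - name: webservice       # rails
          readinessProbe:
            httpGet:
              path: /-/readiness # assumed endpoint
              port: 8080         # assumed puma port
            initialDelaySeconds: 10
```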
