
Latency and errors when pods are cycled for git https

This issue will document our findings on cycling webservice pods while they are taking traffic.

We are frequently seeing errors when we deploy git-https (webservice pods) in our zonal clusters on gitlab.com.

We have made a couple of changes so far:

These changes seem to help, but we are still seeing situations where the number of pods processing git https traffic drops during a deploy. This puts excessive load on the remaining pods, which lowers our Apdex score.

It's clear in this graph:

https://thanos-query.ops.gitlab.net/graph?g0.range_input=2h&g0.max_source_resolution=0s&g0.expr=count(sum(rate(http_requests_total%7Benv%3D%22gprd%22%2C%20app%3D%22webservice%22%2C%20stage%3D%22main%22%2C%20status%3D%22200%22%7D%5B1m%5D))%20by%20(pod%2C%20region)%20%3E%200)%20by%20(region)&g0.tab=0
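For readability, the URL-encoded expression above decodes to the following PromQL, which counts, per region, the webservice pods that served at least one 200 response in the last minute:

```promql
count(
  sum(
    rate(http_requests_total{env="gprd", app="webservice", stage="main", status="200"}[1m])
  ) by (pod, region) > 0
) by (region)
```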

[Graph: count of webservice pods serving traffic, by region, dropping during a deploy]

Here the number of pods taking traffic drops from 50 all the way down to 28.

I believe what is happening here is that maxSurge (25%) is bringing up new pods, but we are terminating old pods before the new ones are ready to take traffic.
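For context, here is a minimal sketch of the rolling-update strategy being described. The field values come from the numbers above; the Deployment name and everything else is assumed for illustration, not copied from our chart:

```yaml
# Rolling-update strategy matching the behavior described above.
# maxSurge: 25% lets Kubernetes create up to 25% extra pods during a deploy;
# maxUnavailable: 0 should mean no old pod is removed until a replacement
# counts as "available".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-webservice-git   # hypothetical name for the git-https webservice Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
```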

For maxUnavailable: 0, what is the definition of "available"? Does this mean "ready"? What if only one container (workhorse) is ready, but rails isn't? One theory I have is that we pass the workhorse readiness check very quickly; maybe this is causing us to terminate old pods too aggressively.
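For reference, Kubernetes only marks a Pod as Ready once every container in it passes its readiness probe, and the Deployment controller only counts a Ready pod as available after it has been Ready for minReadySeconds (default 0). So if only workhorse has a meaningful probe, a pod can flip Ready before rails can actually serve. A minimal sketch of probing both containers, where the container names, ports, and paths are assumptions for illustration rather than our actual chart values:

```yaml
# Sketch: give each container its own readiness probe, and require the pod
# to stay Ready for a while before it counts as "available" to the rollout.
spec:
  minReadySeconds: 15            # pod must stay Ready this long before counting as available
  template:
    spec:
      containers:
        - name: workhorse
          readinessProbe:
            httpGet:
              path: /-/readiness # assumed endpoint
              port: 8181         # assumed workhorse port
        - name: webservice       # rails
          readinessProbe:
            httpGet:
              path: /-/readiness # assumed endpoint
              port: 8080         # assumed puma port
            initialDelaySeconds: 10
```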
