Investigate git deployment problems leading to 502s and 503s
During two consecutive deployments to production, we saw degradation of the error-rate SLI. We were violating the 1-hour SLO for the entire time during which the git fleet was being deployed. This was captured by the following chart:
We can see this also impacts our Kubernetes deployments, but not as severely. The difference is that Kubernetes deployments operate in a significantly different manner.
Deployments exhibiting the questionable behavior:
- https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/348891
- https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/349356
Use this issue to determine why we are receiving such a high rate of HTTP 502s and 503s during a deployment. This behavior is new; we have not experienced deployments this rough for quite a while.
Milestones
- Determine why the deployment is "rough"
  - Were there any recent configuration changes that changed the expected behavior of Workhorse?
  - Are the nodes being properly drained from haproxy? (see the sketch after this list)
  - Are existing connections not being closed properly?
  - ...
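One way to check the haproxy draining question is to query the haproxy runtime API on a load balancer while a git node is being deployed. This is only a sketch: the admin socket path and the backend name (`https_git`) are assumptions and will differ depending on our haproxy configuration.

```shell
# Sketch only: socket path and backend name ("https_git") are assumptions.
# Dump per-server status for the git backend; a node that is currently being
# deployed to should report DRAIN or MAINT here before it receives new traffic.
echo "show stat" | sudo socat stdio /run/haproxy/admin.sock \
  | awk -F, '$1 == "https_git" { print $2, $18 }'   # svname, status
```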
Results of work
We learned about the behavior of nginx-ingress during the investigation of this issue. Nginx was holding onto connections to Pods that were being Terminated, allowing those Pods to continue to serve traffic. Details about this are in this thread: #1358 (comment 459061429)
We tuned nginx to drop those connections quickly, which resolved the issue, as noted here: #1358 (comment 461826610)
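For reference, tuning of this kind can be applied through the nginx-ingress controller ConfigMap. The sketch below is illustrative only: the ConfigMap name, namespace, and value are assumptions rather than what was actually applied (the exact change is recorded in the linked comments). `worker-shutdown-timeout` is one candidate knob; it bounds how long old nginx workers keep serving already-established connections after a configuration reload.

```shell
# Illustrative only: ConfigMap name/namespace and the value are assumptions;
# the setting actually tuned for this incident is in the linked comments.
# Lowering worker-shutdown-timeout shortens how long old nginx workers hold
# established connections open after a reload, so terminating Pods stop
# receiving traffic sooner.
kubectl --namespace gitlab-managed-apps patch configmap \
  ingress-nginx-ingress-controller --type merge \
  -p '{"data":{"worker-shutdown-timeout":"10s"}}'
```

If the controller is managed via Helm, the same key can instead be set in the chart's controller config values so the change survives chart upgrades.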