# Investigation followup: The loadbalancer SLI of the web-pages service in region `us-east` has an error rate violating SLO
## Summary
The web-pages service intermittently returns `502` errors. Looking into this, it appears that bursts of requests to the API service are getting errors caused by failed connections between the workhorse and rails containers in a pod. Typically (but not exclusively), this occurs while fetching domain information.
*Web-Pages Get Domain Errors (graph)*
### Workhorse Failures
Workhorse throws these errors at startup because it polls the rails container in the same pod for information before the rails server is available. We've disabled this feature, and it will be fixed in gitlab-org/gitlab#350202 (closed).
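For illustration, a hedged sketch of the kind of startup poll that avoids this failure mode: retry with backoff until the Rails server accepts connections, rather than treating the first refused dial as an error. The endpoint, timings, and error handling here are assumptions, not the actual workhorse code.

```go
// poll_rails.go: illustrative only; endpoint, timings, and error handling
// are assumptions, not the actual workhorse implementation.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForRails polls the Rails container until it responds, instead of
// treating the first refused connection as fatal.
func waitForRails(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for backoff := 250 * time.Millisecond; time.Now().Before(deadline); backoff *= 2 {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // Rails is up; safe to start polling for real data.
			}
		}
		time.Sleep(backoff) // connection refused: Rails not listening yet.
	}
	return fmt.Errorf("rails not ready after %s", timeout)
}

func main() {
	if err := waitForRails("http://127.0.0.1:8080/-/readiness", time.Minute); err != nil {
		fmt.Println(err)
	}
}
```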
Some questions to try to answer:
- Can we minimize or eliminate the impact on web-pages to prevent 5xx errors on requests for Pages sites?
## Request Flow
### Pages
- GitLab-pages middleware resolves the host and domain: `getHostAndDomain` → `s.GetDomain` → `client.Resolve` → `GetLookup`.
- `GetLookup` sends a request to the API at `/api/v4/internal/pages`, gets an error, and bubbles it up the stack.
- The middleware checks the error: if it is not `ErrDomainDoesNotExist`, it returns a `502` code (see the sketch after this list).
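A hedged sketch of that middleware decision, using the names above; the resolver interface, `Lookup` stand-in, and exact signatures are illustrative assumptions, not the gitlab-pages source:

```go
// Hedged sketch of the middleware decision described above; the Resolver
// interface and error mapping are illustrative stand-ins, not gitlab-pages code.
package pages

import (
	"errors"
	"net/http"
)

// ErrDomainDoesNotExist mirrors the sentinel error named in this issue.
var ErrDomainDoesNotExist = errors.New("domain does not exist")

// Resolver stands in for the client whose GetLookup call hits
// /api/v4/internal/pages.
type Resolver interface {
	Resolve(host string) (lookup any, err error)
}

func withDomain(next http.Handler, res Resolver) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// getHostAndDomain -> s.GetDomain -> client.Resolve -> GetLookup
		if _, err := res.Resolve(r.Host); err != nil {
			if errors.Is(err, ErrDomainDoesNotExist) {
				http.NotFound(w, r) // unknown domain: not a server fault
				return
			}
			// Any other resolution failure (e.g. the internal API call
			// erroring) surfaces as the 502 seen in this incident.
			http.Error(w, http.StatusText(http.StatusBadGateway), http.StatusBadGateway)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```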
### Workhorse
- Workhorse returns `502 Bad Gateway` with `dial tcp 127.0.0.1:8080: connect: connection refused` (reproduced in the sketch after this list).
- The container throwing the error is `gitlab-workhorse`.
- Pod details from `kubectl -n gitlab get po gitlab-webservice-api-5c4f9cd588-8vbx2 -o json`:
  - `webservice` container, image `dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-webservice-ee:14-10-202204011318-58d7b33a450`, ports:

    ```json
    "ports": [
      { "containerPort": 8080, "name": "http-webservice", "protocol": "TCP" },  <---- 127.0.0.1:8080
      { "containerPort": 8083, "name": "http-metrics-ws", "protocol": "TCP" }
    ]
    ```

  - `gitlab-workhorse` container, image `dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-workhorse-ee:14-10-202204011318-58d7b33a450`, ports:

    ```json
    "ports": [
      { "containerPort": 8181, "name": "http-workhorse", "protocol": "TCP" },
      { "containerPort": 9229, "name": "http-metrics-wh", "protocol": "TCP" }
    ]
    ```
- Request flow: `GitLab-pages` → `GitLab-workhorse` → `webservice` (via 127.0.0.1:8080)
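The `connection refused` → `502` pairing is easy to reproduce with a bare reverse proxy; a minimal sketch, assuming nothing is listening on 127.0.0.1:8080 (workhorse's real proxying is more involved than this):

```go
// Minimal reproduction of the failure mode, not workhorse itself: a reverse
// proxy pointed at 127.0.0.1:8080 with no listener behind it answers every
// request with 502 Bad Gateway.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// The rails container's port; with nothing listening, every proxied
	// request fails the dial.
	upstream, err := url.Parse("http://127.0.0.1:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// httputil's default ErrorHandler logs the dial error ("connect:
	// connection refused") and writes 502 Bad Gateway -- the exact
	// pairing observed in the workhorse logs.
	log.Fatal(http.ListenAndServe("127.0.0.1:8181", proxy))
}
```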
## Readiness probes
`webservice` definition:

```json
"readinessProbe": {
  "failureThreshold": 3,
  "httpGet": {
    "path": "/-/readiness",
    "port": 8080,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 60,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 2
},
```
`workhorse` definition:

```json
"readinessProbe": {
  "exec": {
    "command": [
      "/scripts/healthcheck"
    ]
  },
  "failureThreshold": 3,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 2
}
```
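For context on what the `httpGet` probe exercises: until `/-/readiness` returns 200, the kubelet keeps the pod out of Service endpoints. A minimal, hypothetical handler shape (the boot hook and flag are assumptions, not GitLab's implementation):

```go
// Hypothetical sketch of a /-/readiness endpoint; the kubelet withholds
// traffic until this returns 200. Names and boot logic are illustrative
// assumptions, not GitLab's implementation.
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

var ready atomic.Bool // flipped to true once startup work completes

func main() {
	go func() {
		time.Sleep(5 * time.Second) // stand-in for real boot work (DB, caches, ...)
		ready.Store(true)
	}()

	http.HandleFunc("/-/readiness", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":8080", nil)
}
```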
## Related Incident(s)
Originating issue(s):
## Todo/Action Items
- Include `uri` for bad gateway requests 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497#note_899661738
- Disable Geo proxy in gprd in workhorse 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-com!1670 (merged)
- Increase readiness review frequency
- Graceful termination for GitLab-workhorse (see the sketch after this list):
  - Add support to the workhorse image for graceful shutdown 👉 gitlab-org/build/CNG!972 (merged)
  - Enable graceful shutdown in gstg 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-com!1714 (merged)
  - Enable graceful shutdown in gprd-cny 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-com!1720 (merged)
  - Enable graceful shutdown in gprd-us-east1-b 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-com!1721 (merged)
  - Enable graceful shutdown in all of gprd 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-com!1728 (merged)
- Make graceful shutdown for the workhorse container the default by removing the feature flag:
  - Remove it in CNG 👉 gitlab-org/build/CNG!986 (merged)
  - Remove the `GITLAB_WORKHORSE_EXEC` configuration from the helm chart 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-com!1732 (merged)
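For reference, the generic Go graceful-shutdown pattern those MRs enable for the workhorse container: on the SIGTERM Kubernetes sends during a rollout, stop accepting new connections and drain in-flight requests rather than dropping them. A sketch, not the actual workhorse change:

```go
// Generic Go graceful-shutdown pattern, shown for context; the actual
// workhorse change lives in the CNG image and helm chart MRs above.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8181", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Wait for the SIGTERM that Kubernetes sends before killing the pod.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Stop accepting new connections and drain in-flight requests,
	// bounded by the pod's terminationGracePeriodSeconds.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```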
We've fixed the issue; now we need to understand why containers were still receiving requests while the pod was unhealthy 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497#note_920637102
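One hedged way to chase that question is to compare which addresses the Service is actually routing to against pod readiness; a sketch using client-go, where the namespace and Service name are illustrative assumptions:

```go
// Hypothetical investigation helper: list the ready vs. not-ready endpoint
// addresses of a Service, to spot traffic that could be reaching not-ready
// pods. Namespace and Service name are assumptions for illustration.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ep, err := client.CoreV1().Endpoints("gitlab").Get(
		context.Background(), "gitlab-webservice-default", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses {
			fmt.Println("ready, routable:", addr.IP)
		}
		for _, addr := range subset.NotReadyAddresses {
			// Traffic should never reach these; if it does, something
			// upstream is bypassing the Service (the open question above).
			fmt.Println("not ready, should receive no traffic:", addr.IP)
		}
	}
}
```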
## Results
### All errors are gone
The fix has been running for ~90 hours (weekend included), and the graphs below make it clear that we are no longer seeing any 502 issues:
*Pages (graph)*

*HAProxy, including all `5xx` errors (graph)*
As you can see, the graph is much flatter rather than spiky, which is good! We only see two spikes in `502` errors, and those times correlate with https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6888, a DoS event that is being addressed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15618 and https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6882.