Investigation followup: The loadbalancer SLI of the web-pages service in region `us-east` has an error rate violating SLO

Summary

The web-pages service sometimes returns 502 errors. Looking into this, it seems that bursts of requests to the API service are getting errors from failed connections between the workhorse and rails containers in a pod. Typically (but not exclusively), this appears to occur while fetching domain information.

Web-Pages Get Domain Errors

(screenshot: Screen_Shot_2022-03-22_at_2.30.44_PM)

Workhorse Failures

(screenshot: Screen_Shot_2022-03-22_at_2.28.04_PM)

Workhorse throws these errors at startup because it polls the rails container in the same pod for information before the rails server is available. We've disabled this feature, and it's going to be fixed in gitlab-org/gitlab#350202 (closed).
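
For context, the failing behaviour is essentially a poll of the rails container on its pod-local port before anything is listening there. A minimal sketch of that pattern, assuming the address and retry loop (an illustration, not the actual workhorse code):

```go
package main

import (
	"log"
	"net"
	"time"
)

// pollRails keeps dialing the rails container on its pod-local port until it
// accepts a connection. Until rails has bound 127.0.0.1:8080, every attempt
// fails with "connect: connection refused" -- the same error seen in the
// workhorse logs above.
func pollRails(addr string) {
	for {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			log.Printf("rails not ready yet: %v", err)
			time.Sleep(time.Second)
			continue
		}
		conn.Close()
		return
	}
}

func main() {
	pollRails("127.0.0.1:8080")
	log.Println("rails is accepting connections")
}
```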

Some questions to try to answer:

  1. Can we minimize or eliminate the impact on web-pages to prevent 5xx errors on requests for Pages sites?

Request Flow

Pages

  1. GitLab-pages middleware for host and domain
  2. getHostAndDomain
  3. s.GetDomain
  4. client.Resolve
  5. GetLookup
  6. Sends a request to the API at /api/v4/internal/pages, gets an error, and bubbles it up the stack.
  7. The middleware checks that the error is not ErrDomainDoesNotExist and then returns a 502 (see the sketch below).
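
The error handling in steps 6–7 boils down to one branch in the middleware. A rough sketch, assuming hypothetical helper names (only ErrDomainDoesNotExist and the internal API path come from gitlab-pages; the rest is illustrative):

```go
package main

import (
	"errors"
	"net/http"
)

// ErrDomainDoesNotExist mirrors the sentinel error used when the internal
// pages API has no lookup for the requested host.
var ErrDomainDoesNotExist = errors.New("domain does not exist")

// getDomain stands in for the GetDomain -> Resolve -> GetLookup chain that
// calls /api/v4/internal/pages; any transport or 5xx error bubbles up as-is.
func getDomain(host string) error {
	return errors.New("dial tcp 127.0.0.1:8080: connect: connection refused")
}

func domainMiddleware(w http.ResponseWriter, r *http.Request) {
	err := getDomain(r.Host)
	if err != nil && !errors.Is(err, ErrDomainDoesNotExist) {
		// Step 7: any error other than ErrDomainDoesNotExist is surfaced
		// to the Pages visitor as a 502.
		http.Error(w, "502 Bad Gateway", http.StatusBadGateway)
		return
	}
	// ErrDomainDoesNotExist and the happy path are handled elsewhere.
}

func main() {
	http.HandleFunc("/", domainMiddleware)
	_ = http.ListenAndServe("127.0.0.1:8090", nil)
}
```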

Workhorse

  1. Workhorse returns 502 Bad Gateway with `dial tcp 127.0.0.1:8080: connect: connection refused`
  2. kubectl -n gitlab get po gitlab-webservice-api-5c4f9cd588-8vbx2 -o json
    1. webservice: dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-webservice-ee:14-10-202204011318-58d7b33a450

      ports
      "ports": [
          {
              "containerPort": 8080,  <---- 127.0.0.1:8080
              "name": "http-webservice",
              "protocol": "TCP"
          },
          {
              "containerPort": 8083,
              "name": "http-metrics-ws",
              "protocol": "TCP"
          }
      ],
    2. gitlab-workhorse: dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-workhorse-ee:14-10-202204011318-58d7b33a450

      ports
      "ports": [
          {
              "containerPort": 8181,
              "name": "http-workhorse",
              "protocol": "TCP"
          },
          {
              "containerPort": 9229,
              "name": "http-metrics-wh",
              "protocol": "TCP"
          }
      ],
  3. GitLab-pages -> GitLab-workhorse -> Webservice (via 127.0.0.1:8080)
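
Putting the pieces together: workhorse listens on 8181 and proxies to the webservice on 127.0.0.1:8080, and when nothing is listening there yet, the failed dial becomes the 502 that gitlab-pages then sees. A minimal sketch of that mechanism using Go's standard reverse proxy (an illustration, not workhorse's actual code):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Upstream is the webservice (rails) container in the same pod.
	upstream, _ := url.Parse("http://127.0.0.1:8080")
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// If 127.0.0.1:8080 is not listening yet, the dial fails and the
	// proxy's error handler turns it into a 502 Bad Gateway.
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		log.Printf("upstream error: %v", err) // dial tcp 127.0.0.1:8080: connect: connection refused
		http.Error(w, "502 Bad Gateway", http.StatusBadGateway)
	}

	log.Fatal(http.ListenAndServe("127.0.0.1:8181", proxy))
}
```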

Readiness probes

webservice

definition
"readinessProbe": {
    "failureThreshold": 3,
    "httpGet": {
        "path": "/-/readiness",
        "port": 8080,
        "scheme": "HTTP"
    },
    "initialDelaySeconds": 60,
    "periodSeconds": 10,
    "successThreshold": 1,
    "timeoutSeconds": 2
},

workhorse

definition
"readinessProbe": {
  "exec": {
    "command": [
      "/scripts/healthcheck"
    ]
  },
  "failureThreshold": 3,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 2
}
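
Note the asymmetry between the two probes: webservice waits 60 seconds and then polls /-/readiness on port 8080, while the workhorse probe runs /scripts/healthcheck (whose contents aren't shown here) with no initial delay. Purely as an illustration of how an exec-style check could be tied to the rails port actually answering (this is not the actual /scripts/healthcheck, and not necessarily the fix that shipped), a sketch:

```go
// A hypothetical exec-style readiness check: exit 0 only once the webservice
// readiness endpoint in the same pod answers with HTTP 200, otherwise exit 1
// so the kubelet keeps the container out of service.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://127.0.0.1:8080/-/readiness")
	if err != nil {
		fmt.Fprintf(os.Stderr, "upstream not ready: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "upstream not ready: HTTP %d\n", resp.StatusCode)
		os.Exit(1)
	}
}
```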

Related Incident(s)

Originating issue(s):

Pages

Todo/Action Items

Results

Blog post: https://about.gitlab.com/blog/2022/05/17/how-we-removed-all-502-errors-by-caring-about-pid-1-in-kubernetes/

All errors are gone

The fix has been running for around 90 hours (weekend included), and in the graphs below it's clear that we are no longer seeing any 502 issues:

Pages:

rate of 502 errors served by GitLab-pages

Source

HAProxy: this includes all 5xx errors

(screenshot: Screenshot_2022-04-25_at_10.58.16)

Source

As you can see, the graph is much flatter rather than spiky, which is good! We only see two spikes in 502 errors, and those times correlate with https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6888, which was a DoS event that is being addressed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15618 and https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6882.