Investigation followup: The loadbalancer SLI of the web-pages service in region `us-east` has an error rate violating SLO

Summary

The web-pages service sometimes returns 502 errors. Looking into this, it seems that bursts of requests to the API service are getting errors from failed connections between the workhorse and rails containers in a pod. Typically (but not exclusively), this appears to occur while fetching domain information.

Web-Pages Get Domain Errors

(screenshot: Screen_Shot_2022-03-22_at_2.30.44_PM)

Workhorse Failures

(screenshot: Screen_Shot_2022-03-22_at_2.28.04_PM)

Workhorse throws these errors at startup because it polls the rails container in the same pod for information before the rails server is available. We've disabled this feature, and it's going to be fixed in gitlab-org/gitlab#350202 (closed).
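
For context, the failing behaviour is essentially a poll of the rails container on its pod-local port before anything is listening there. A minimal sketch of that pattern, assuming the address and retry loop (an illustration, not the actual workhorse code):

```go
package main

import (
	"log"
	"net"
	"time"
)

// pollRails keeps dialing the rails container on its pod-local port until it
// accepts a connection. Until rails has bound 127.0.0.1:8080, every attempt
// fails with "connect: connection refused" -- the same error seen in the
// workhorse logs above.
func pollRails(addr string) {
	for {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			log.Printf("rails not ready yet: %v", err)
			time.Sleep(time.Second)
			continue
		}
		conn.Close()
		return
	}
}

func main() {
	pollRails("127.0.0.1:8080")
	log.Println("rails is accepting connections")
}
```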

Some questions to try to answer:

  1. Can we minimize or eliminate the impact on web-pages to prevent 5xx errors on requests for Pages sites?

Request Flow

Pages

  1. GitLab-pages middleware for host and domain
  2. getHostAndDomain
  3. s.GetDomain
  4. client.Resolve
  5. GetLookup
  6. Sends a request to the API at /api/v4/internal/pages, gets an error, and bubbles it up the stack.
  7. The middleware checks that the error is not ErrDomainDoesNotExist and then returns a 502 (see the sketch below).
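
The error handling in steps 6–7 boils down to one branch in the middleware. A rough sketch, assuming hypothetical helper names (only ErrDomainDoesNotExist and the internal API path come from gitlab-pages; the rest is illustrative):

```go
package main

import (
	"errors"
	"net/http"
)

// ErrDomainDoesNotExist mirrors the sentinel error used when the internal
// pages API has no lookup for the requested host.
var ErrDomainDoesNotExist = errors.New("domain does not exist")

// getDomain stands in for the GetDomain -> Resolve -> GetLookup chain that
// calls /api/v4/internal/pages; any transport or 5xx error bubbles up as-is.
func getDomain(host string) error {
	return errors.New("dial tcp 127.0.0.1:8080: connect: connection refused")
}

func domainMiddleware(w http.ResponseWriter, r *http.Request) {
	err := getDomain(r.Host)
	if err != nil && !errors.Is(err, ErrDomainDoesNotExist) {
		// Step 7: any error other than ErrDomainDoesNotExist is surfaced
		// to the Pages visitor as a 502.
		http.Error(w, "502 Bad Gateway", http.StatusBadGateway)
		return
	}
	// ErrDomainDoesNotExist and the happy path are handled elsewhere.
}

func main() {
	http.HandleFunc("/", domainMiddleware)
	_ = http.ListenAndServe("127.0.0.1:8090", nil)
}
```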

Workhorse

  1. Workhorse returns 502 Bad Gateway with `dial tcp 127.0.0.1:8080: connect: connection refused`
  2. kubectl -n gitlab get po gitlab-webservice-api-5c4f9cd588-8vbx2 -o json
    1. webservice: dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-webservice-ee:14-10-202204011318-58d7b33a450

      ports
      "ports": [
          {
              "containerPort": 8080,  <---- 127.0.0.1:8080
              "name": "http-webservice",
              "protocol": "TCP"
          },
          {
              "containerPort": 8083,
              "name": "http-metrics-ws",
              "protocol": "TCP"
          }
      ],
    2. gitlab-workhorse: dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-workhorse-ee:14-10-202204011318-58d7b33a450

      ports
      "ports": [
          {
              "containerPort": 8181,
              "name": "http-workhorse",
              "protocol": "TCP"
          },
          {
              "containerPort": 9229,
              "name": "http-metrics-wh",
              "protocol": "TCP"
          }
      ],
  3. GitLab-pages -> GitLab-workhorse -> Webservice (via 127.0.0.1:8080)
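
Putting the pieces together: workhorse listens on 8181 and proxies to the webservice on 127.0.0.1:8080, and when nothing is listening there yet, the failed dial becomes the 502 that gitlab-pages then sees. A minimal sketch of that mechanism using Go's standard reverse proxy (an illustration, not workhorse's actual code):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Upstream is the webservice (rails) container in the same pod.
	upstream, _ := url.Parse("http://127.0.0.1:8080")
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// If 127.0.0.1:8080 is not listening yet, the dial fails and the
	// proxy's error handler turns it into a 502 Bad Gateway.
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		log.Printf("upstream error: %v", err) // dial tcp 127.0.0.1:8080: connect: connection refused
		http.Error(w, "502 Bad Gateway", http.StatusBadGateway)
	}

	log.Fatal(http.ListenAndServe("127.0.0.1:8181", proxy))
}
```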

Readiness probes

webservice

definition
"readinessProbe": {
    "failureThreshold": 3,
    "httpGet": {
        "path": "/-/readiness",
        "port": 8080,
        "scheme": "HTTP"
    },
    "initialDelaySeconds": 60,
    "periodSeconds": 10,
    "successThreshold": 1,
    "timeoutSeconds": 2
},

workhorse

definition
"readinessProbe": {
  "exec": {
    "command": [
      "/scripts/healthcheck"
    ]
  },
  "failureThreshold": 3,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 2
}
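
Note the asymmetry between the two probes: webservice waits 60 seconds and then polls /-/readiness on port 8080, while the workhorse probe runs /scripts/healthcheck (whose contents aren't shown here) with no initial delay. Purely as an illustration of how an exec-style check could be tied to the rails port actually answering (this is not the actual /scripts/healthcheck, and not necessarily the fix that shipped), a sketch:

```go
// A hypothetical exec-style readiness check: exit 0 only once the webservice
// readiness endpoint in the same pod answers with HTTP 200, otherwise exit 1
// so the kubelet keeps the container out of service.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://127.0.0.1:8080/-/readiness")
	if err != nil {
		fmt.Fprintf(os.Stderr, "upstream not ready: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "upstream not ready: HTTP %d\n", resp.StatusCode)
		os.Exit(1)
	}
}
```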

Related Incident(s)

Originating issue(s):

Pages

Todo/Action Items

Results

Blog post: https://about.gitlab.com/blog/2022/05/17/how-we-removed-all-502-errors-by-caring-about-pid-1-in-kubernetes/

All errors are gone

The fix has been running for around 90 hours (weekend included), and in the graphs below it's clear that we are no longer seeing any 502 issues:

Pages:

rate of 502 errors served by GitLab-pages

Source

HAProxy: this includes all 5xx errors

(screenshot: Screenshot_2022-04-25_at_10.58.16)

Source

As you can see, the graph is much flatter rather than spiky, which is good! We only see two spikes in 502 errors, and those times correlate with https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6888, which was a DoS event that is being addressed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15618 and https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6882.