2021-07-21: Various load balanced services are down (GitLab Pages, Registry, Docs site)
Current Status
Impacted services (GitLab Pages, GitLab Registry, GitLab docs) are now available and we are investigating the root cause.
Specific load-balanced services (GitLab Pages, GitLab Registry, and Docs) were unavailable from 06:45 UTC - 08:46 UTC.
GitLab Pages (includes docs.GitLab.com) was unavailable from 06:45 UTC - 08:46 UTC and GitLab Registry was unavailable from 07:18 UTC - 08:36 UTC.
Summary for CMOC notice / Exec summary:
- Customer Impact: All GitLab Pages sites were unavailable, as well as the Container Registry and docs.GitLab.com.
- Customer Impact Duration: 06:45 UTC - 08:46 UTC (121 minutes)
- Current state: See the `Incident::<state>` label
- Root cause:
As stated in #5196 (comment 632054352) by @T4cC0re:

> We hit this bug in parts of our fleet: https://github.com/GoogleCloudPlatform/guest-agent/issues/103
>
> The `fe-registry-XX` and `fe-pages-XX` nodes were running the `google-guest-agent` on version `20201217.02-0ubuntu1~18.04.0`, which misses a `PartOf` dependency between the `systemd-networkd` and `google-guest-agent` services.
>
> Those nodes installed a `systemd` upgrade between roughly 06:27 and 06:50, and each node restarted its `systemd-networkd` as part of that upgrade.
>
> This restart dropped the required route for the TCP LB, which the `google-guest-agent` manages. Because of the missing dependency, the agent was not restarted, so the route remained missing, leading to the observed drop in traffic.
>
> Restarting the `google-guest-agent` service restores the route and restores traffic flow.
>
> The `fe-XX` nodes, which run the traffic for GitLab.com itself, run on the predecessor of the `google-guest-agent` and were not susceptible to this behavior.
>
> Ideally, we should include Google's packages in our unattended-upgrades. An updated `google-guest-agent` package would have included the fix and thus saved us from this outage.
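For illustration, the missing dependency boils down to a single unit-level directive. A minimal sketch of a drop-in that expresses it (the drop-in path is hypothetical, and this is not necessarily how the upstream fix is implemented):

```ini
# /etc/systemd/system/google-guest-agent.service.d/10-partof.conf
# Hypothetical drop-in. With PartOf=, a stop or restart of systemd-networkd
# is propagated to google-guest-agent, so the agent comes back up and
# re-adds the LB routes that the networkd restart flushed.
[Unit]
PartOf=systemd-networkd.service
After=systemd-networkd.service
```

After creating such a drop-in, `sudo systemctl daemon-reload` makes it take effect.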
Internal issue for root cause analysis: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13830
Timeline
View recent production deployment and configuration events / gcp events (internal only)
All times UTC.
2021-07-21
- 06:45 - Blackbox probes for GitLab Pages begin to fail
- 06:49 - Pingdom reports GitLab Pages as down (includes docs.GitLab.com)
- 06:53 - @cindy declares incident in Slack
- 06:55 - Investigation leads to load balancers not seeing healthy hosts
- 07:05 - Hosts are able to be reached and are determined healthy; investigation pivots to changes related to the load balancers
- 07:18 - Pingdom reports registry.GitLab.com as down
- 07:27 - Investigation started with GCP
- 08:15 - An issue with the load balancers has been identified
- 08:20 - Commencing Registry LB reboots
- 08:36 - Registry service is restored
- 08:46 - Pages service is restored
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
These are preliminary and will be converted to issue links once confirmed.
- Ensure `google-guest-agent` is upgraded across all environments: #5198 (closed)
- Add `google-guest-agent` to the list of packages we automatically update: https://gitlab.com/gitlab-cookbooks/gitlab-server/-/merge_requests/276 (a config sketch follows this list)
- Add process monitoring for `google-guest-agent`
- Improve or create a runbook for the specific case of a networking issue between the Cloud LB and HAProxy
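If the automatic-update change follows the standard unattended-upgrades mechanism, it could look roughly like this. The origin/suite pair is an assumption, not a value taken from the MR; verify the real one with `apt-cache policy google-guest-agent` on a node.

```
// /etc/apt/apt.conf.d/51google-guest-agent (hypothetical file name)
// Let unattended-upgrades pull updates from Google's apt repository so
// that fixes like the missing PartOf dependency land automatically.
// "Google LLC:stable" is an assumed origin:suite pair; check the Release
// file of the configured repository for the actual values.
Unattended-Upgrade::Allowed-Origins {
        "Google LLC:stable";
};
```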
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
Starting around 06:45 UTC, GitLab Pages stopped receiving all traffic. The GitLab Pages frontend servers were functional and returned `200`s on their health endpoint, but the Google TCP load balancer marked all backend servers as unhealthy and stopped routing traffic to the HAProxy servers. At around 07:18 UTC the TCP load balancers routing traffic to the GitLab Container Registry started displaying the same behavior, but other TCP load balancers for other services were unaffected. The issue stemmed from a bug in the `google-guest-agent` process, where the agent is not restarted when the systemd network unit is restarted. The restart dropped the required route for the Google TCP load balancer, which is added by the `google-guest-agent` process. The affected nodes installed a `systemd` upgrade between roughly 06:27 and 06:50, and each node restarted its `systemd-networkd` as part of that upgrade.
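For context, the route the agent manages can be inspected directly on a node; a sketch of that check, with `$LB_IP` standing in for the load balancer's forwarding IP:

```sh
# google-guest-agent installs a "local" route for each load balancer
# forwarding IP so the kernel accepts traffic addressed to it.
ip route show table local | grep "$LB_IP"

# On an affected host the route is missing after the systemd-networkd
# restart; restarting the agent re-creates it and restores traffic.
sudo systemctl restart google-guest-agent
```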
- Service(s) affected: Pages and Container Registry
- Team attribution: Infra
- Time to detection: 4 minutes
- Minutes downtime or degradation: 121 minutes
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External and internal customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Service was completely unavailable
- How many customers were affected?
  - Every customer trying to use our docs site, GitLab Pages, or the registry
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Not known
What were the root causes?
The root cause was a bug in `google-guest-agent`: https://github.com/GoogleCloudPlatform/guest-agent/issues/103.
Incident Response Analysis
- How was the incident detected?
  - Alerting
- How could detection time be improved?
  - None
- How was the root cause diagnosed?
  - Engaging with Google support, comparing the routing table of an affected host to that of an unaffected host, and looking at the logs for `google-guest-agent`.
- How could time to diagnosis be improved?
  - Unsure if there was something that could have been improved
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
  - We could have added additional frontend servers sooner to serve traffic. At the time we were unsure whether it was a GLB outage or a service disruption.
- What went well?
  - Finding the service that was the issue and engaging with Google support early on
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
Lessons Learned
- We have a process (`google-guest-agent`) that manages the routes needed for the GLBs
- This process should be kept up to date and monitored on our HAProxy nodes because it is crucial to the GLBs working properly (a monitoring sketch follows this list)
- We were not keeping our Google packages up to date. If we had, we would have avoided this bug, as it was resolved in newer versions of the agent. This is being addressed by @T4cC0re in https://gitlab.com/gitlab-cookbooks/gitlab-server/-/merge_requests/276
- We are also looking into overhauling update management in general in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/545 (internal during implementation)
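One possible shape for the process monitoring called out above, assuming node_exporter runs with its systemd collector enabled (`--collector.systemd`); the alert name and threshold are illustrative, not a rule we have deployed:

```yaml
# Prometheus alerting rule sketch; node_systemd_unit_state comes from
# node_exporter's optional systemd collector.
groups:
  - name: google-guest-agent.rules
    rules:
      - alert: GoogleGuestAgentNotRunning
        expr: node_systemd_unit_state{name="google-guest-agent.service",state="active"} == 0
        for: 5m
        annotations:
          description: >-
            google-guest-agent is not active on {{ $labels.instance }}.
            Local routes for the Google TCP load balancers may be missing;
            restart the agent to restore them.
```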