2021-07-21: Various load balanced services are down (GitLab Pages, Registry, Docs site)
Current Status
Impacted services (GitLab Pages, GitLab Registry, GitLab docs) are now available and we are investigating the root cause.
Specific load-balanced services (GitLab Pages, GitLab Registry, and Docs) were unavailable from 06:45 UTC - 08:46 UTC.
GitLab Pages (includes docs.GitLab.com) was unavailable from 06:45 UTC - 08:46 UTC and GitLab Registry was unavailable from 07:18 UTC - 08:36 UTC.
Summary for CMOC notice / Exec summary:
- Customer Impact: All GitLab Pages sites were unavailable, as well as the Container Registry and docs.GitLab.com.
- Customer Impact Duration: 06:45 UTC - 08:46 UTC (121 minutes)
- Current state: See the `Incident::<state>` label
- Root cause:
As stated in #5196 (comment 632054352) by @T4cC0re:

> We hit this bug in parts of our fleet: https://github.com/GoogleCloudPlatform/guest-agent/issues/103
>
> The `fe-registry-XX` and `fe-pages-XX` nodes were running the `google-guest-agent` on version `20201217.02-0ubuntu1~18.04.0`, which misses a `PartOf` dependency between the `systemd-networkd` and `google-guest-agent` services.
>
> Those nodes installed a `systemd` upgrade between roughly 06:27 and 06:50, and each node restarted its `systemd-networkd` as part of that upgrade.
>
> This restart dropped the required route for the TCP LB, which the `google-guest-agent` manages. Because of the missing dependency, the agent was not restarted, so the route remained missing, leading to the observed drop in traffic.
>
> Restarting the `google-guest-agent` service restores the route and restores traffic flow.
>
> The `fe-XX` nodes, which run the traffic for GitLab.com itself, run on the predecessor of the `google-guest-agent` and were not susceptible to this behavior.
>
> Ideally, we should include Google's packages in our unattended-upgrades. An updated `google-guest-agent` package would have included the fix and thus saved us from this outage.
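For illustration, the missing dependency boils down to a single unit-level directive. A minimal sketch of a drop-in that expresses it (the drop-in path is hypothetical, and this is not necessarily how the upstream fix is implemented):

```ini
# /etc/systemd/system/google-guest-agent.service.d/10-partof.conf
# Hypothetical drop-in. With PartOf=, a stop or restart of systemd-networkd
# is propagated to google-guest-agent, so the agent comes back up and
# re-adds the LB routes that the networkd restart flushed.
[Unit]
PartOf=systemd-networkd.service
After=systemd-networkd.service
```

After creating such a drop-in, `sudo systemctl daemon-reload` makes it take effect.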
Internal issue for root cause analysis: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13830
Timeline
View recent production deployment and configuration events / gcp events (internal only)
All times UTC.
2021-07-21
- 06:45 - Blackbox probes for GitLab Pages begin to fail
- 06:49 - Pingdom reports GitLab Pages as down (includes docs.GitLab.com)
- 06:53 - @cindy declares incident in Slack
- 06:55 - Investigation leads to load balancers not seeing healthy hosts
- 07:05 - Hosts are able to be reached and are determined healthy; investigation pivots to changes related to the load balancers
- 07:18 - Pingdom reports registry.GitLab.com as down
- 07:27 - Investigation started with GCP
- 08:15 - An issue with the load balancers has been identified
- 08:20 - Commencing Registry LB reboots
- 08:36 - Registry service is restored
- 08:46 - Pages service is restored
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
These are preliminary and will be converted to issue links once confirmed.
- Ensure `google-guest-agent` is upgraded across all environments: #5198 (closed)
- Add `google-guest-agent` to the list of packages we automatically update: https://gitlab.com/gitlab-cookbooks/gitlab-server/-/merge_requests/276 (a config sketch follows this list)
- Add process monitoring for `google-guest-agent`
- Improve or create a runbook for the specific case of a networking issue between the Cloud LB and HAProxy
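If the automatic-update change follows the standard unattended-upgrades mechanism, it could look roughly like this. The origin/suite pair is an assumption, not a value taken from the MR; verify the real one with `apt-cache policy google-guest-agent` on a node.

```
// /etc/apt/apt.conf.d/51google-guest-agent (hypothetical file name)
// Let unattended-upgrades pull updates from Google's apt repository so
// that fixes like the missing PartOf dependency land automatically.
// "Google LLC:stable" is an assumed origin:suite pair; check the Release
// file of the configured repository for the actual values.
Unattended-Upgrade::Allowed-Origins {
        "Google LLC:stable";
};
```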
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
Starting around 06:45 UTC, GitLab Pages stopped receiving all traffic. The GitLab Pages frontend servers were functional and returned `200`s on their health endpoint, but the Google TCP load balancer marked all backend servers as unhealthy and stopped routing traffic to the HAProxy servers. At around 07:18 UTC the TCP load balancers routing traffic to the GitLab Container Registry started displaying the same behavior, but other TCP load balancers for other services were unaffected. The issue stemmed from a bug in the `google-guest-agent` process, where the agent is not restarted when the systemd network unit is restarted. The restart dropped the required route for the Google TCP load balancer, which is added by the `google-guest-agent` process. The affected nodes installed a `systemd` upgrade between roughly 06:27 and 06:50, and each node restarted its `systemd-networkd` as part of that upgrade.
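For context, the route the agent manages can be inspected directly on a node; a sketch of that check, with `$LB_IP` standing in for the load balancer's forwarding IP:

```sh
# google-guest-agent installs a "local" route for each load balancer
# forwarding IP so the kernel accepts traffic addressed to it.
ip route show table local | grep "$LB_IP"

# On an affected host the route is missing after the systemd-networkd
# restart; restarting the agent re-creates it and restores traffic.
sudo systemctl restart google-guest-agent
```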
- Service(s) affected: Pages and Container Registry
- Team attribution: Infra
- Time to detection: 4 minutes
- Minutes downtime or degradation: 121 minutes
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External and internal customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Service was completely unavailable
- How many customers were affected?
  - Every customer trying to use our docs site, GitLab Pages, or the registry
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Not known
What were the root causes?
The root cause was a bug in `google-guest-agent`: https://github.com/GoogleCloudPlatform/guest-agent/issues/103.
Incident Response Analysis
- How was the incident detected?
  - Alerting
- How could detection time be improved?
  - None
- How was the root cause diagnosed?
  - Engaging with Google support, comparing the routing table of an affected host to that of an unaffected host, and looking at the logs for `google-guest-agent`.
- How could time to diagnosis be improved?
  - Unsure if there was something that could have been improved
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
  - We could have added additional frontend servers sooner to serve traffic. At the time we were unsure whether it was a GLB outage or a service disruption.
- What went well?
  - Finding the service that was the issue and engaging with Google support early on
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
Lessons Learned
- We have a process (`google-guest-agent`) that manages the routes needed for the GLBs
- This process should be kept up to date and monitored on our HAProxy nodes because it is crucial to the GLBs working properly (a monitoring sketch follows this list)
- We were not keeping our Google packages up to date. If we had, we would have avoided this bug, as it was resolved in newer versions of the agent. This is being addressed by @T4cC0re in https://gitlab.com/gitlab-cookbooks/gitlab-server/-/merge_requests/276
- We are also looking into overhauling update management in general in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/545 (internal during implementation)
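One possible shape for the process monitoring called out above, assuming node_exporter runs with its systemd collector enabled (`--collector.systemd`); the alert name and threshold are illustrative, not a rule we have deployed:

```yaml
# Prometheus alerting rule sketch; node_systemd_unit_state comes from
# node_exporter's optional systemd collector.
groups:
  - name: google-guest-agent.rules
    rules:
      - alert: GoogleGuestAgentNotRunning
        expr: node_systemd_unit_state{name="google-guest-agent.service",state="active"} == 0
        for: 5m
        annotations:
          description: >-
            google-guest-agent is not active on {{ $labels.instance }}.
            Local routes for the Google TCP load balancers may be missing;
            restart the agent to restore them.
```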