
2020-10-28: QA failures on staging blocking the deployment pipeline - Praefect errors

Summary

Multiple QA failures on staging due to 500 errors:

https://sentry.gitlab.net/gitlab/staginggitlabcom/issues/2098861/?query=is%3Aunresolved

https://nonprod-log.gitlab.net/goto/c584300c5718107da65438b194663acb

```
2:accessor call: route repository accessor: get synced node: get shard for "nfs-file22": primary is not healthy.
```

Timeline

All times UTC.

2020-10-28



Incident Review

Summary

For a period of 180 minutes (between 2020-10-28 08:28 UTC and 2020-10-28 11:28 UTC), the Praefect service in the gstg environment began failing to respond to gRPC invocations during QA pipeline stages and exhibited symptoms of a failover loop.

Further investigation ultimately led to the discovery that health checks configured to poll the /metrics route on port 9652 were causing Praefect's metrics scraping process to issue multiple, and progressively stacking, long-running queries against the Praefect staging database running on a GCP CloudSQL instance. The root cause of the slow queries appears to be metric aggregation enabled by a Praefect update deployed to the staging environment.
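
The failure mode described above can be illustrated with a minimal sketch in Go (an illustration only, not Praefect's actual code): a Prometheus collector that runs a database query on every scrape of /metrics. The metric name, query, and driver choice below are assumptions; the point is that every health-check probe of the metrics endpoint triggers database work, so slow queries stack up as probes keep arriving.

```go
// Sketch of a metrics endpoint whose collector queries the database on every
// scrape. Not Praefect's real code; names, query, and driver are placeholders.
package main

import (
	"database/sql"
	"net/http"

	_ "github.com/lib/pq" // assumed PostgreSQL driver for this sketch
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type repoCountCollector struct {
	db   *sql.DB
	desc *prometheus.Desc
}

func (c *repoCountCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

// Collect runs once per scrape, so every health-check probe of /metrics
// triggers this query. If the query is slow and probes arrive faster than it
// completes, the queries pile up on the database.
func (c *repoCountCollector) Collect(ch chan<- prometheus.Metric) {
	var n int64
	// Placeholder aggregation query, standing in for heavier metric aggregation.
	if err := c.db.QueryRow(`SELECT count(*) FROM repositories`).Scan(&n); err != nil {
		return
	}
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, float64(n))
}

func main() {
	// Connection string is a placeholder, not the actual CloudSQL settings.
	db, err := sql.Open("postgres", "host=127.0.0.1 dbname=praefect sslmode=disable")
	if err != nil {
		panic(err)
	}

	prometheus.MustRegister(&repoCountCollector{
		db:   db,
		desc: prometheus.NewDesc("example_repositories_total", "Illustrative repository count.", nil, nil),
	})

	// Same port the internal load balancer health check was polling.
	http.Handle("/metrics", promhttp.Handler())
	if err := http.ListenAndServe(":9652", nil); err != nil {
		panic(err)
	}
}
```

Pointing a frequent internal load balancer health check at such an endpoint multiplies the query load far beyond what a single Prometheus scrape interval would produce, which matches the progressive stacking observed during the incident.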

  1. Service(s) affected: Praefect
  2. Team attribution: Datastores (@gitlab-com/gl-infra/sre-datastores) & Gitaly
  3. Minutes downtime or degradation: 180

Metrics

Rate of requests fell (screenshot).

Error rates erratically jumped beyond SLO (screenshot).

The Praefect CloudSQL database had been overloaded for about a day (screenshot, 2020-10-28 10:59).

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    • Internal customers only -- namely, Delivery and Development.
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    • All QA pipelines were blocked -- preventing all deployments.
  3. How many customers were affected?
    • No external customers.
  4. If a precise customer impact number is unknown, what is the estimated potential impact?
    • If such an incident were to occur in the production environment, all customers whose project repositories reside on shards balanced by Praefect would have been unable to access those repositories, making it at least a severity 2 incident.

Incident Response Analysis

  1. How was the event detected?
    • QA failures on staging
  2. How could detection time be improved?
    • Alerting on specific Praefect errors or on Praefect database resource saturation
  3. How did we reach the point where we knew how to mitigate the impact?
    • Trial and error with toggling Prometheus monitoring
    • Lateral thinking by @cmcfarland
  4. How could time to mitigation be improved?
    • More alerting
    • More monitoring
    • Increase SRE skills for identifying long-running queries in PostgreSQL
    • Better design/implementation/configuration for health checks (see the sketch after this list)
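
On the health-check point above, the sketch below (an assumed design, not Praefect's actual configuration) shows the kind of decoupling that avoids this class of problem: a cheap, time-bounded /health endpoint for the load balancer, kept separate from /metrics so probes never trigger metric collection queries. The path, port, and connection string are placeholders.

```go
// Sketch of a health endpoint decoupled from /metrics. Assumed design only.
package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"

	_ "github.com/lib/pq" // assumed PostgreSQL driver for this sketch
)

func healthHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// A short timeout keeps each probe cheap and prevents probes from stacking.
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "database unreachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	}
}

func main() {
	// Connection string is a placeholder, not the actual CloudSQL settings.
	db, err := sql.Open("postgres", "host=127.0.0.1 dbname=praefect sslmode=disable")
	if err != nil {
		panic(err)
	}

	// /health is for the load balancer; /metrics stays reserved for Prometheus scrapes.
	http.HandleFunc("/health", healthHandler(db))
	if err := http.ListenAndServe(":8080", nil); err != nil {
		panic(err)
	}
}
```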

Post Incident Analysis

  1. How was the root cause diagnosed?
    • It appears that, after some trial and error among a few engineers, @cmcfarland noticed that the internal load balancer health check was hitting the /metrics route, and experimented with disabling that method of health checking.
  2. How could time to diagnosis be improved?
    • Increase SRE skills for identifying long-running queries in PostgreSQL (see the query sketch after this list)
  3. Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
  4. Was this incident triggered by a change (deployment of code or change to infrastructure. If yes, have you linked the issue which represents the change?)?
    • Apparently yes: the staging Praefect service degradation that blocked QA testing stages, and subsequently blocked development continuous deployments, was initially caused by a change deployed to staging.
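
On identifying long-running queries, the sketch below shows the kind of check an on-call engineer could run against pg_stat_activity; the connection details are placeholders rather than the actual CloudSQL instance settings.

```go
// Sketch: list non-idle PostgreSQL backends whose current statement has been
// running for more than a minute, using pg_stat_activity.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumed PostgreSQL driver for this sketch
)

func main() {
	// Connection string is a placeholder, not the actual CloudSQL settings.
	db, err := sql.Open("postgres", "host=127.0.0.1 dbname=praefect sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT pid,
		       extract(epoch FROM now() - query_start) AS seconds,
		       state,
		       coalesce(query, '') AS query
		FROM pg_stat_activity
		WHERE state <> 'idle'
		  AND now() - query_start > interval '1 minute'
		ORDER BY seconds DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var (
			pid     int
			seconds float64
			state   string
			query   string
		)
		if err := rows.Scan(&pid, &seconds, &state, &query); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("pid=%d running=%.0fs state=%s query=%.80s\n", pid, seconds, state, query)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

Running the equivalent SQL directly from a psql session against the Praefect database surfaces the same information without a compiled tool.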

5 Whys

  1. Staging QA pipelines began to fail after a deployment of new Praefect code (apparently to both the staging and production environments), blocking the continuous deployments initiated after that Praefect deployment. Why did smoke testing not detect this service degradation earlier, when that Praefect deployment was conducted or completed?
  2. Why were internal load balancer health checks using an http request to the /metrics route?
  3. Why is it necessary for the /metrics service to conduct so many queries against the database? That is to ask, why could there not be alternative methods to obtain such information?
  4. Why does enabling metric aggregation slow down queries that were occurring during the incident?
  5. Why did it take so long to establish a Cloud Shell terminal session to the CloudSQL database node through the GCP website?

Lessons Learned

  1. The speed at which we identify slow queries in PostgreSQL could be improved.
  2. Praefect errors and its CloudSQL instance could benefit from expanded monitoring and alerting.

Corrective Actions

@craigf, could you add any additional corrective action issues/MRs to this list?

