
2020-10-28: QA failures on staging blocking the deployment pipeline - Praefect errors

Summary

Multiple QA failures on staging due to 500 errors:

https://sentry.gitlab.net/gitlab/staginggitlabcom/issues/2098861/?query=is%3Aunresolved

https://nonprod-log.gitlab.net/goto/c584300c5718107da65438b194663acb

```
2:accessor call: route repository accessor: get synced node: get shard for "nfs-file22": primary is not healthy.
```

Timeline

All times UTC.

2020-10-28



Incident Review

Summary

For a period of 180 minutes (between 2020-10-28 08:28 UTC and 2020-10-28 11:28 UTC), the Praefect service in the gstg environment began failing to respond to gRPC invocations during QA pipeline stages and exhibited symptoms of a failover loop.

Further investigation ultimately led to the discovery that health checks configured to poll the /metrics route on port 9652 were causing Praefect's metrics scraping process to issue multiple, and progressively stacking, long-running queries against the Praefect staging database running on a GCP CloudSQL instance. The root cause of the slow queries appears to be metric aggregation enabled by a Praefect update deployed to the staging environment.
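
The failure mode described above can be illustrated with a minimal sketch in Go (an illustration only, not Praefect's actual code): a Prometheus collector that runs a database query on every scrape of /metrics. The metric name, query, and driver choice below are assumptions; the point is that every health-check probe of the metrics endpoint triggers database work, so slow queries stack up as probes keep arriving.

```go
// Sketch of a metrics endpoint whose collector queries the database on every
// scrape. Not Praefect's real code; names, query, and driver are placeholders.
package main

import (
	"database/sql"
	"net/http"

	_ "github.com/lib/pq" // assumed PostgreSQL driver for this sketch
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type repoCountCollector struct {
	db   *sql.DB
	desc *prometheus.Desc
}

func (c *repoCountCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

// Collect runs once per scrape, so every health-check probe of /metrics
// triggers this query. If the query is slow and probes arrive faster than it
// completes, the queries pile up on the database.
func (c *repoCountCollector) Collect(ch chan<- prometheus.Metric) {
	var n int64
	// Placeholder aggregation query, standing in for heavier metric aggregation.
	if err := c.db.QueryRow(`SELECT count(*) FROM repositories`).Scan(&n); err != nil {
		return
	}
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, float64(n))
}

func main() {
	// Connection string is a placeholder, not the actual CloudSQL settings.
	db, err := sql.Open("postgres", "host=127.0.0.1 dbname=praefect sslmode=disable")
	if err != nil {
		panic(err)
	}

	prometheus.MustRegister(&repoCountCollector{
		db:   db,
		desc: prometheus.NewDesc("example_repositories_total", "Illustrative repository count.", nil, nil),
	})

	// Same port the internal load balancer health check was polling.
	http.Handle("/metrics", promhttp.Handler())
	if err := http.ListenAndServe(":9652", nil); err != nil {
		panic(err)
	}
}
```

Pointing a frequent internal load balancer health check at such an endpoint multiplies the query load far beyond what a single Prometheus scrape interval would produce, which matches the progressive stacking observed during the incident.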

  1. Service(s) affected: Praefect
  2. Team attribution: Datastores (@gitlab-com/gl-infra/sre-datastores) & Gitaly
  3. Minutes downtime or degradation: 180

Metrics

Rate of requests fell (screenshot).

Error rates erratically jumped beyond SLO (screenshot).

The Praefect CloudSQL database had been overloaded for about a day (screenshot, 2020-10-28 10:59).

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    • Internal customers only -- namely, Delivery and Development.
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    • All QA pipelines were blocked -- preventing all deployments.
  3. How many customers were affected?
    • No external customers.
  4. If a precise customer impact number is unknown, what is the estimated potential impact?
    • If such an incident were to occur in the production environment, all customers whose project repositories reside on shards balanced by Praefect would have been unable to access those repositories, making it at least a severity 2 incident.

Incident Response Analysis

  1. How was the event detected?
    • QA failures on staging
  2. How could detection time be improved?
    • Alerting on specific Praefect errors or on Praefect database resource saturation
  3. How did we reach the point where we knew how to mitigate the impact?
    • Trial and error with toggling Prometheus monitoring
    • Lateral thinking by @cmcfarland
  4. How could time to mitigation be improved?
    • More alerting
    • More monitoring
    • Increase SRE skills for identifying long-running queries in PostgreSQL
    • Better design/implementation/configuration for health checks (see the sketch after this list)
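
On the health-check point above, the sketch below (an assumed design, not Praefect's actual configuration) shows the kind of decoupling that avoids this class of problem: a cheap, time-bounded /health endpoint for the load balancer, kept separate from /metrics so probes never trigger metric collection queries. The path, port, and connection string are placeholders.

```go
// Sketch of a health endpoint decoupled from /metrics. Assumed design only.
package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"

	_ "github.com/lib/pq" // assumed PostgreSQL driver for this sketch
)

func healthHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// A short timeout keeps each probe cheap and prevents probes from stacking.
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "database unreachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	}
}

func main() {
	// Connection string is a placeholder, not the actual CloudSQL settings.
	db, err := sql.Open("postgres", "host=127.0.0.1 dbname=praefect sslmode=disable")
	if err != nil {
		panic(err)
	}

	// /health is for the load balancer; /metrics stays reserved for Prometheus scrapes.
	http.HandleFunc("/health", healthHandler(db))
	if err := http.ListenAndServe(":8080", nil); err != nil {
		panic(err)
	}
}
```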

Post Incident Analysis

  1. How was the root cause diagnosed?
    • It appears that, after some trial and error among a few engineers, @cmcfarland noticed that the internal load balancer health check was hitting the /metrics route, and experimented with disabling that method of health checking.
  2. How could time to diagnosis be improved?
    • Increase SRE skills for identifying long-running queries in PostgreSQL (see the query sketch after this list)
  3. Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
  4. Was this incident triggered by a change (deployment of code or change to infrastructure. If yes, have you linked the issue which represents the change?)?
    • Apparently yes: the staging Praefect service degradation that blocked QA testing stages, and subsequently blocked development continuous deployments, was initially caused by a change deployed to staging.
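
On identifying long-running queries, the sketch below shows the kind of check an on-call engineer could run against pg_stat_activity; the connection details are placeholders rather than the actual CloudSQL instance settings.

```go
// Sketch: list non-idle PostgreSQL backends whose current statement has been
// running for more than a minute, using pg_stat_activity.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumed PostgreSQL driver for this sketch
)

func main() {
	// Connection string is a placeholder, not the actual CloudSQL settings.
	db, err := sql.Open("postgres", "host=127.0.0.1 dbname=praefect sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT pid,
		       extract(epoch FROM now() - query_start) AS seconds,
		       state,
		       coalesce(query, '') AS query
		FROM pg_stat_activity
		WHERE state <> 'idle'
		  AND now() - query_start > interval '1 minute'
		ORDER BY seconds DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var (
			pid     int
			seconds float64
			state   string
			query   string
		)
		if err := rows.Scan(&pid, &seconds, &state, &query); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("pid=%d running=%.0fs state=%s query=%.80s\n", pid, seconds, state, query)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

Running the equivalent SQL directly from a psql session against the Praefect database surfaces the same information without a compiled tool.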

5 Whys

  1. Staging QA pipelines began to fail after a deployment of new Praefect code (apparently to both the staging and production environments), blocking the continuous deployments initiated after that Praefect deployment. Why did smoke testing not detect this service degradation earlier, when that Praefect deployment was conducted or completed?
  2. Why were internal load balancer health checks using an http request to the /metrics route?
  3. Why is it necessary for the /metrics service to conduct so many queries against the database? That is to ask, why could there not be alternative methods to obtain such information?
  4. Why does enabling metric aggregation slow down queries that were occurring during the incident?
  5. Why did it take so long to establish a Cloud Shell terminal session to the CloudSQL database node through the GCP website?

Lessons Learned

  1. The speed at which we identify slow queries in PostgreSQL could be improved.
  2. Praefect errors and its CloudSQL instance could benefit from expanded monitoring and alerting.

Corrective Actions

@craigf, could you add any additional corrective action issues/MRs to this list?

