2021-03-20: Increased Error Rate Across Fleet

Current Status

We initially saw increased error rates for the API and CI Runners, followed by spikes of errors across all other services. During the investigation and remediation efforts we saw degradation of the API and an outage of CI Runners, and then a complete outage of all services, due to loss of the primary database, for 4 minutes from 20:03 UTC to 20:07 UTC. Recovery of all services back to a healthy state began at 20:22 UTC, and full recovery of all services was seen at 20:28 UTC.

Summary for CMOC notice / Exec summary:

  1. Customer Impact: Runners, API, Web, Git
  2. Customer Impact Duration: 18:45 - 20:28 UTC (103 minutes)
  3. Current state: Resolved
  4. Known cause: Continuing investigation, but symptoms and eventual solution similar to recent postgres query issues (#3875 (closed) and #4011 (closed))
  5. True root cause: TBD

Timeline

View recent production deployment and configuration events (internal only)

All times UTC.

2021-03-20

Corrective Actions

Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.

  • Increase the frequency of the programmatic ANALYZE namespaces; job from every 3 hours to every 30 minutes - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12893 (a hypothetical scheduling sketch follows this list)
  • Communicate the importance of immediately executing an ANALYZE namespaces; operation in reaction to high saturation of either the single_node_cpu component or the disk_sustained_write_throughput component.
    • This situation could have been avoided had such action been taken earlier.
  • For myself, @nnelson, specifically: I need more practice with the instructions that our Infrastructure SREs and the Datastores Reliability Team received on how to terminate long-running or locking postgres processes.
  • It would also be beneficial to improve the searchability of specific runbooks and runbook sections, perhaps with a tagging system or keyword notation.
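
As a rough illustration of the first corrective action, here is a minimal sketch of a 30-minute schedule. It assumes a plain root cron entry on the patroni leader and the standard Omnibus gitlab-psql wrapper path; the linked infrastructure issue may implement the job through a different mechanism entirely.

    # Hypothetical cron entry (e.g. /etc/cron.d/analyze-namespaces) on the patroni leader.
    # Refreshes planner statistics for the namespaces table every 30 minutes so stale
    # statistics cannot degrade into the pathological recursive plan seen in this incident.
    */30 * * * * root /opt/gitlab/bin/gitlab-psql --command='ANALYZE namespaces;' >> /var/log/gitlab/analyze-namespaces.log 2>&1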

Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, which might include the summary, timeline, or other information, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.


Incident Review

Summary

  1. Service(s) affected: Primarily CI Runners, Patroni; Secondarily almost all services
  2. Team attribution: @gitlab-com/gl-infra/sre-datastores
  3. Time to detection: 1 minute
  4. Minutes downtime or degradation: 118 minutes (18:42 - 20:40 UTC)


Metrics

Screenshots (sources):

  • GitLab Triage
  • patroni: Overview
  • ci-runners: Overview
  • postgresql queries
  • web: Overview
  • Cloudflare traffic overview

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Both internal and external customers were impacted by this incident.
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. For between 1 and 4 minutes, API, Git, and web requests returned 500 server errors.
  3. How many customers were affected?
    1. All customers issuing client requests to gitlab.com, or whose CI pipelines were executing a job, were affected.
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. Research for this is still in progress.

What were the root causes?

  • There were effectively two causes:
    1. Queries related to the namespaces table were using a recursive query derived from a bad plan, due to a postgresql bug.
    2. I used sudo kill -9 <procpid> instead of sudo gitlab-psql --command='SELECT pg_cancel_backend(procpid);'.

Incident Response Analysis

  1. How was the incident detected?
    1. An alert was triggered that paged me through PagerDuty: Increased Error Rate Across Fleet: https://gitlab.pagerduty.com/incidents/PQD6NBW
  2. How could detection time be improved?
    1. Put more monitoring and alerting in place around long-running queries against the namespaces table and the use of recursive functions in such queries (an example detection query follows this list).
  3. How was the root cause diagnosed?
    1. An OnGres engineer suggested that we try invoking ANALYZE namespaces; doing so immediately reduced the database saturation.
  4. How could time to diagnosis be improved?
    1. At least attempt to invoke ANALYZE namespaces; immediately upon noticing any elevated saturation, particularly with respect to operations on the namespaces table.
  5. How did we reach the point where we knew how to mitigate the impact?
    1. We exhausted investigation into a couple of red-herring avenues.
  6. How could time to mitigation be improved?
    1. Mitigation followed directly from a successful experiment (running ANALYZE namespaces;) during diagnosis.
  7. What went well?
    1. OnGres engineers responded quickly to the PagerDuty escalation that added them as contributors.
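
As a sketch of the detection improvement described in item 2 above: a query against pg_stat_activity that surfaces long-running backends touching the namespaces table. The 5-minute threshold is illustrative, and wiring it into the existing alerting is an assumption rather than anything decided in this incident.

    -- Save as long_namespaces_queries.sql and run with gitlab-psql, or paste into a psql session.
    -- Lists active backends that have been running a namespaces-related query
    -- for longer than 5 minutes (threshold is an illustrative placeholder).
    SELECT pid,
           now() - query_start AS runtime,
           state,
           left(query, 120)    AS query_snippet
    FROM pg_stat_activity
    WHERE state = 'active'
      AND query ILIKE '%namespaces%'
      AND now() - query_start > interval '5 minutes'
    ORDER BY runtime DESC;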

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. Yes.
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. Yes.
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. No.

Lessons Learned

  • The kill -9 operation (SIGKILL) should never be one's first-choice default. It should be a last resort, and even then only when the process does not respond to its normal shutdown requests and a kill -15 (SIGTERM) has had no effect. That is true of postgresql and pretty much everything else.
  • Use sudo gitlab-psql --command='SELECT pg_cancel_backend(procpid);' instead.
    • To kill the database connection: SELECT pg_terminate_backend(procpid);
    • To get an overview of current transactions and obtain process identifiers: SELECT * FROM pg_stat_activity; (on current PostgreSQL the column is named pid; it was called procpid only before version 9.2)
  • Use sudo gitlab-psql --command='ANALYZE namespaces;' on the leader (read-write) patroni node when elevated saturation is noticed, especially in conjunction with any activity on the namespaces table. A combined sketch of these steps follows this list.
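
Putting the lessons above together, here is a minimal end-to-end sketch as it might be run on the patroni leader. The pid value 12345 is an illustrative placeholder, not a value from this incident.

    # 1. Identify the offending backend with a pg_stat_activity query (see the example
    #    under "Incident Response Analysis") and note its pid.

    # 2. Cancel the running query first; this leaves the client connection intact.
    sudo gitlab-psql --command='SELECT pg_cancel_backend(12345);'

    # 3. Only if the cancel has no effect, terminate the whole connection.
    sudo gitlab-psql --command='SELECT pg_terminate_backend(12345);'

    # 4. Refresh planner statistics so the bad recursive plan is not chosen again.
    sudo gitlab-psql --command='ANALYZE namespaces;'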

Guidelines

Resources

  1. If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)