2021-03-20: Increased Error Rate Across Fleet
Current Status
We initially saw increased error rates with APIs and CI Runners, and then spikes of errors in all other services. During the investigation and remediation efforts we saw degradation of APIs and an outage of CI Runners, followed by a complete outage of all services, due to loss of the primary database, for 4 minutes from 20:03 UTC to 20:07 UTC. Recovery of all services back into a healthy state began at 20:22 UTC, and full recovery of all services was seen at 20:28 UTC.
Summary for CMOC notice / Exec summary:
- Customer Impact: Runners, API, Web, Git
- Customer Impact Duration: 18:45 - 20:28 UTC (103 minutes)
- Current state: Resolved
- Known cause: Continuing investigation, but symptoms and eventual solution similar to recent postgres query issues (#3875 (closed) and #4011 (closed))
- True root cause: TBD
Timeline
View recent production deployment and configuration events (internal only)
All times UTC.
2021-03-20
- 18:47 - Alert triggered: Increased Error Rate Across Fleet: https://gitlab.pagerduty.com/incidents/PQD6NBW
- 18:49 - @nnelson declares incident in Slack.
- 18:52 - Alert triggered: The `rails_primary_sql` SLI of the patroni service (main stage) has an apdex violating SLO: https://gitlab.pagerduty.com/incidents/P5HDHT9
- 18:52 - Alert fired: `IncreasedErrorRateOtherBackends`
- 19:58 - Emergency ticket from large customer reporting the runner job pickup issue: https://gitlab.zendesk.com/agent/tickets/200844
- 20:03 - patroni-03 is inadvertently taken out of the rotation.
- 20:07 - patroni-03 is back in the rotation.
- 20:20 - `ANALYZE namespaces;` is run against the database on patroni-03.
- 20:22 - We begin to see recovery across all services.
- 20:28 - Full recovery is seen across all services.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Increase rate of the programmatic `ANALYZE namespaces` job to every 30 minutes from every 3 hours - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12893
- Communicate the importance of immediately executing an `ANALYZE namespaces;` operation in reaction to high saturation of either the `single_node_cpu` component or the `disk_sustained_write_throughput` component.
  - This situation could have been avoided had such action been taken earlier.
- I (@nnelson) specifically need more practice with the instructions our Infrastructure SREs and Datastores Reliability Team have received on how to terminate long-running or locking postgres processes.
- I think it would also be beneficial to improve the searchability/findability of specific runbooks and runbook sections. Perhaps a tagging system or keyword notation would be helpful.
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases; this might include the summary, timeline, or other information, as laid out in our handbook page. Any such confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
- Service(s) affected: Primarily CI Runners, Patroni; Secondarily almost all services
- Team attribution: @gitlab-com/gl-infra/sre-datastores
- Time to detection: 1 minute
- Minutes downtime or degradation: 118 minutes (18:42 - 20:40 UTC)
Metrics
Source: GitLab Triage
Source: patroni: Overview
Source: ci-runners: Overview
Source: postgresql queries
Source: web: Overview
Source: Cloudflare traffic overview
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Both internal and external customers were impacted by this incident.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - For between 1 and 4 minutes, api, git, and web requests returned 500 server errors.
- How many customers were affected?
  - All customers making requests to gitlab.com, or whose CI pipelines were executing a job, were affected.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Research for this is still in progress.
What were the root causes?
- Effectively two causes:
  1. Queries related to the `namespaces` table were using a recursive query derived from a bad plan due to a postgresql bug.
  2. I used `sudo kill -9 <procpid>` instead of `sudo gitlab-psql --command='SELECT pg_cancel_backend(procpid);'`.
Incident Response Analysis
- How was the incident detected?
  - An alert was triggered that paged me through PagerDuty: Increased Error Rate Across Fleet: https://gitlab.pagerduty.com/incidents/PQD6NBW
- How could detection time be improved?
  - Put more monitoring and alerting in place around long queries related to the `namespaces` table and the use of recursive functions in such queries.
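As a sketch of the kind of check such an alert could be built on (the 5-minute threshold and the text filter are illustrative assumptions, not an actual alerting rule; note that on PostgreSQL 9.2 and later the `pg_stat_activity` column is `pid`, not `procpid`):

```sql
-- Hypothetical monitoring query: surface non-idle backends that have been
-- running a query touching the namespaces table for longer than 5 minutes.
SELECT pid,
       now() - query_start AS runtime,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query ILIKE '%namespaces%'
  AND now() - query_start > interval '5 minutes'
ORDER BY runtime DESC;
```

An alerting pipeline could export the row count of this query as a metric and page when it is non-zero for several consecutive scrapes.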
- How was the root cause diagnosed?
  - An OnGres engineer suggested that we try to invoke `ANALYZE namespaces;`, and it immediately reduced the database saturation.
- How could time to diagnosis be improved?
  - At least attempt to invoke `ANALYZE namespaces;` immediately upon noticing any elevated saturation, particularly with respect to operations on the `namespaces` table.
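As a sketch of what that check-then-mitigate step might look like at the psql prompt (assuming stale planner statistics are the suspect, as they were here):

```sql
-- Check when planner statistics for the namespaces table were last
-- refreshed; stale statistics can produce the kind of bad query plan
-- seen in this incident.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'namespaces';

-- The mitigation itself: refresh the table's statistics so the planner
-- stops choosing the pathological plan.
ANALYZE namespaces;
```

`ANALYZE` is cheap relative to the cost of a saturated primary, which is why attempting it early is a reasonable default.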
- How did we reach the point where we knew how to mitigate the impact?
  - We first exhausted investigation into a couple of red-herring avenues.
- How could time to mitigation be improved?
  - Mitigation was the result of a successful experiment during diagnosis, so the two largely coincided.
- What went well?
  - OnGres engineers responded quickly to a PagerDuty contributor-addition escalation.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Yes.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Yes.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No.
Lessons Learned
- The `kill -9` operation, or `SIGKILL`, should never be one's first-choice default. It should be one's last resort, and even then only if the process doesn't respond to its normal shutdown requests and a `kill -15`, or `SIGTERM`, has had no effect. That's true of `postgresql` and pretty much everything else.
- Use `sudo gitlab-psql --command='SELECT pg_cancel_backend(procpid);'` instead.
  - To kill the database connection: `SELECT pg_terminate_backend(procpid);`
  - To get an overview of the current transactions and obtain process identifiers (`procpid`): `SELECT * FROM pg_stat_activity;`
- Use `sudo gitlab-psql --command='ANALYZE namespaces;'` on the leader patroni read-write node when one notices elevated saturation, especially in conjunction with any activity on the `namespaces` table.
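The escalation order above can be sketched end to end as follows (the pid value is illustrative; on PostgreSQL 9.2 and later the `pg_stat_activity` column is named `pid` rather than `procpid`):

```sql
-- 1. Identify the offending backend and how long it has been running.
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
ORDER BY runtime DESC;

-- 2. Gentlest option first: cancel only the backend's current query.
--    The session stays connected. (12345 is an illustrative pid.)
SELECT pg_cancel_backend(12345);

-- 3. Only if cancellation has no effect: terminate the backend,
--    which also drops its connection.
SELECT pg_terminate_backend(12345);
```

Both functions deliver signals (`SIGINT` and `SIGTERM` respectively) through postgres itself, letting it clean up properly; `kill -9` bypasses that cleanup, which is why it can force a crash-recovery cycle.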
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)