2021-03-20: Increased Error Rate Across Fleet
Current Status
We initially saw increased error rates with APIs and CI Runners, and then spikes of errors in all other services. During the investigation and remediation efforts we saw degradation of APIs and an outage of CI Runners, followed by a complete outage of all services, due to loss of the primary database, for 4 minutes from 20:03 UTC to 20:07 UTC. Recovery of all services back into a healthy state began at 20:22 UTC, and full recovery of all services was seen at 20:28 UTC.
Summary for CMOC notice / Exec summary:
- Customer Impact: Runners, API, Web, Git
- Customer Impact Duration: 18:45 - 20:28 UTC (103 minutes)
- Current state: Resolved
- Known cause: Continuing investigation, but symptoms and eventual solution similar to recent postgres query issues (#3875 (closed) and #4011 (closed))
- True root cause: TBD
Timeline
View recent production deployment and configuration events (internal only)
All times UTC.
2021-03-20
- 18:47 - Alert triggered: Increased Error Rate Across Fleet: https://gitlab.pagerduty.com/incidents/PQD6NBW
- 18:49 - @nnelson declares incident in Slack.
- 18:52 - Alert triggered: The `rails_primary_sql` SLI of the patroni service (main stage) has an apdex violating SLO: https://gitlab.pagerduty.com/incidents/P5HDHT9
- 18:52 - Alert fired: `IncreasedErrorRateOtherBackends`
- 19:58 - Emergency ticket from large customer reporting the runner job pickup issue: https://gitlab.zendesk.com/agent/tickets/200844
- 20:03 - patroni-03 is inadvertently taken out of the rotation.
- 20:07 - patroni-03 is back in the rotation.
- 20:20 - `ANALYZE namespaces;` is run against the database on patroni-03.
- 20:22 - We begin to see recovery across all services.
- 20:28 - Full recovery is seen across all services.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Increase rate of the programmatic `ANALYZE namespaces` job to every 30 minutes from every 3 hours - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12893
- Communicate the importance of immediately executing an `ANALYZE namespaces;` operation in reaction to high saturation of either the `single_node_cpu` component or the `disk_sustained_write_throughput` component.
  - This situation could have been avoided had such action been taken earlier.
- I (@nnelson) specifically need more practice with the instructions our Infrastructure SREs and Datastores Reliability Team have received on how to terminate long-running or locking postgres processes.
- I think it would also be beneficial to improve the searchability/findability of specific runbooks and runbook sections. Perhaps a tagging system or keyword notation would be helpful.
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases; this might include the summary, timeline, or other information, as laid out in our handbook page. Any such confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
- Service(s) affected: Primarily CI Runners, Patroni; Secondarily almost all services
- Team attribution: @gitlab-com/gl-infra/sre-datastores
- Time to detection: 1 minute
- Minutes downtime or degradation: 118 minutes (18:42 - 20:40 UTC)
Metrics
Source: GitLab Triage
Source: patroni: Overview
Source: ci-runners: Overview
Source: postgresql queries
Source: web: Overview
Source: Cloudflare traffic overview
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Both internal and external customers were impacted by this incident.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - For between 1 and 4 minutes, api, git, and web requests returned 500 server errors.
- How many customers were affected?
  - All customers making requests to gitlab.com, or whose CI pipelines were executing a job, were affected.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Research for this is still in progress.
What were the root causes?
- Effectively two causes:
  1. Queries related to the `namespaces` table were using a recursive query derived from a bad plan due to a postgresql bug.
  2. I used `sudo kill -9 <procpid>` instead of `sudo gitlab-psql --command='SELECT pg_cancel_backend(procpid);'`.
Incident Response Analysis
- How was the incident detected?
  - An alert was triggered that paged me through PagerDuty: Increased Error Rate Across Fleet: https://gitlab.pagerduty.com/incidents/PQD6NBW
- How could detection time be improved?
  - Put more monitoring and alerting in place around long queries related to the `namespaces` table and the use of recursive functions in such queries.
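As a sketch of the kind of check such an alert could be built on (the 5-minute threshold and the text filter are illustrative assumptions, not an actual alerting rule; note that on PostgreSQL 9.2 and later the `pg_stat_activity` column is `pid`, not `procpid`):

```sql
-- Hypothetical monitoring query: surface non-idle backends that have been
-- running a query touching the namespaces table for longer than 5 minutes.
SELECT pid,
       now() - query_start AS runtime,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query ILIKE '%namespaces%'
  AND now() - query_start > interval '5 minutes'
ORDER BY runtime DESC;
```

An alerting pipeline could export the row count of this query as a metric and page when it is non-zero for several consecutive scrapes.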
- How was the root cause diagnosed?
  - An OnGres engineer suggested that we try to invoke `ANALYZE namespaces;`, and it immediately reduced the database saturation.
- How could time to diagnosis be improved?
  - At least attempt to invoke `ANALYZE namespaces;` immediately upon noticing any elevated saturation, particularly with respect to operations on the `namespaces` table.
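As a sketch of what that check-then-mitigate step might look like at the psql prompt (assuming stale planner statistics are the suspect, as they were here):

```sql
-- Check when planner statistics for the namespaces table were last
-- refreshed; stale statistics can produce the kind of bad query plan
-- seen in this incident.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'namespaces';

-- The mitigation itself: refresh the table's statistics so the planner
-- stops choosing the pathological plan.
ANALYZE namespaces;
```

`ANALYZE` is cheap relative to the cost of a saturated primary, which is why attempting it early is a reasonable default.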
- How did we reach the point where we knew how to mitigate the impact?
  - We first exhausted investigation into a couple of red-herring avenues.
- How could time to mitigation be improved?
  - Mitigation was the result of a successful experiment during diagnosis, so the two largely coincided.
- What went well?
  - OnGres engineers responded quickly to a PagerDuty contributor-addition escalation.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Yes.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Yes.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No.
Lessons Learned
- The `kill -9` operation, or `SIGKILL`, should never be one's first-choice default. It should be one's last resort, and even then only if the process doesn't respond to its normal shutdown requests and a `kill -15`, or `SIGTERM`, has had no effect. That's true of `postgresql` and pretty much everything else.
- Use `sudo gitlab-psql --command='SELECT pg_cancel_backend(procpid);'` instead.
  - To kill the database connection: `SELECT pg_terminate_backend(procpid);`
  - To get an overview of the current transactions and obtain process identifiers (`procpid`): `SELECT * FROM pg_stat_activity;`
- Use `sudo gitlab-psql --command='ANALYZE namespaces;'` on the leader patroni read-write node when one notices elevated saturation, especially in conjunction with any activity on the `namespaces` table.
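The escalation order above can be sketched end to end as follows (the pid value is illustrative; on PostgreSQL 9.2 and later the `pg_stat_activity` column is named `pid` rather than `procpid`):

```sql
-- 1. Identify the offending backend and how long it has been running.
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
ORDER BY runtime DESC;

-- 2. Gentlest option first: cancel only the backend's current query.
--    The session stays connected. (12345 is an illustrative pid.)
SELECT pg_cancel_backend(12345);

-- 3. Only if cancellation has no effect: terminate the backend,
--    which also drops its connection.
SELECT pg_terminate_backend(12345);
```

Both functions deliver signals (`SIGINT` and `SIGTERM` respectively) through postgres itself, letting it clean up properly; `kill -9` bypasses that cleanup, which is why it can force a crash-recovery cycle.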
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)