2022-01-18: QA tests in staging are failing due to pgbouncer issues

Current Status

QA tests have been failing when deploying to gstg-cny since yesterday (probably starting after 19:00 UTC). This is blocking our deployments. Note that gstg-cny tests had only recently been made blocking as part of the Staging Improvements project. The gstg-cny test suites hit the gstg-cny environment as well as the staging environment to test for mixed-deployment problems, so when the failures occurred in the staging environment, they blocked deployment to gstg-cny.

gstg-cny QA was reverted to non-blocking as a short-term mitigation to unblock deployments: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/merge_requests/456

Following investigation, the cause of the flakiness was found to be pgbouncer issues: after the reprovisioning of the Patroni CI cluster, the local pgbouncer was not configured as expected. We created and merged https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1204/diffs to fix the issue.
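
For illustration, a minimal sketch of the kind of node-local check that would have surfaced the misconfiguration earlier, assuming the local pgbouncer listens on the default port 6432 and fronts a CI database; the host, port, database, and user names below are hypothetical, not the actual staging configuration:

    import sys

    import psycopg2  # third-party PostgreSQL driver, assumed to be installed

    def check_pgbouncer(host="127.0.0.1", port=6432,
                        dbname="gitlabhq_ci", user="gitlab"):
        """Return True if pgbouncer accepts a connection and answers a trivial query."""
        try:
            with psycopg2.connect(host=host, port=port, dbname=dbname,
                                  user=user, connect_timeout=5) as conn:
                with conn.cursor() as cur:
                    cur.execute("SELECT 1")
                    return cur.fetchone() == (1,)
        except psycopg2.OperationalError as exc:
            print(f"pgbouncer check failed: {exc}", file=sys.stderr)
            return False

    if __name__ == "__main__":
        sys.exit(0 if check_pgbouncer() else 1)

Run against each patroni-ci node after a reprovision, a non-zero exit code from a probe like this would have flagged the missing local pgbouncer configuration before QA started failing.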

Timeline

Recent Events (available internally only):

  • Deployments
  • Feature Flag Changes
  • Infrastructure Configurations
  • GCP Events (e.g. host failure)

All times UTC.

2022-01-17

  • 14:30 - patroni-ci re-creation CR is executed, error rates in staging go up
  • 18:52 - gstg-cny reliable gitlab-qa smoke QA job fails for 14.7.202201171506; failing jobs take very long.
  • 22:17 - gstg-cny QA fails randomly for 14.7.202201172020 even after several retries, and jobs take very long, e.g. this job.
  • 22:53 - same for 14.7.202201171820

2022-01-18

  • 08:51 - 14.7.202201180621 gstg-cny QA fails with same symptoms. Several retries do not make it pass.
  • 12:31 - @hphilipps declares incident in Slack.
  • 13:21 - @svistas creates an MR to remove flaky npm tests from the reliable suite: gitlab-org/gitlab!78466 (merged) (deployed in gstg with 14.7.202201181820 around 21:30)
  • 14:35 - 14.7.202201180621 gstg-cny QA succeeds after many retries
  • 15:06 - 14.7.202201181320 gstg-cny deploy starts
  • 15:34 - @mayra-cabrera prepares an MR to make gstg-cny QA tests non-blocking: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/merge_requests/456
  • 16:05 - gstg-cny QA for 14.7.202201181320 succeeds without any failures
  • 16:58 - @zeffmorgan confirms that gstg-cny test failures are the result of Staging issues
  • 17:03 - @T4cC0re finds kube_container_memory problems on Staging
  • 17:39 - @rspeicher disables the corpus_management feature flag on Staging
  • 18:18 - @rspeicher confirms Staging pipelines look stable following the feature flag change, but because the failures were not appearing consistently it is unknown whether this fixed the problem
  • 20:30 - 14.7.202201181820 deployed in gstg-cny; this package removes the flaky npm tests from the reliable suite (gitlab-org/gitlab!78466 (merged)).
  • 21:30 - 14.7.202201181820 deployed in gstg

2022-01-19

  • 08:12 - @amyphillips sets the incident to IncidentMitigated as tests seem to be passing reliably since the feature flag change
  • 17:47 - A fix to resolve the root cause is merged - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1204. Error rates in staging go down again.

2022-01-20

  • 07:50 - @hphilipps sets the incident to resolved

Takeaways

  • ...

Corrective Actions

Corrective actions should be added here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.

  • Increase awareness of how QA tests are working: delivery#2155 (closed)
  • Add an alias for pinging the IM on-call in Slack - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15059
  • Add monitoring for Pgbouncer and PostgreSQL ports #6232 (closed) (see the port-check sketch after this list)
  • Improve the precision of staging service-level monitoring alerts &668
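
As a rough sketch of what the port monitoring above could probe, a plain TCP check against the pgbouncer and PostgreSQL ports is often enough to catch a listener that never came up. The hosts and ports below (pgbouncer on 6432, PostgreSQL on 5432) are illustrative assumptions, not the actual staging endpoints, and a real check would run from the alerting stack rather than as an ad-hoc script:

    import socket

    # Illustrative targets only: pgbouncer and PostgreSQL default ports on localhost.
    # The real staging hosts/ports may differ.
    CHECKS = {
        "pgbouncer": ("127.0.0.1", 6432),
        "postgresql": ("127.0.0.1", 5432),
    }

    def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
        """Attempt a TCP connection; True means something is listening on the port."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        for name, (host, port) in CHECKS.items():
            status = "open" if port_is_open(host, port) else "CLOSED"
            print(f"{name:11s} {host}:{port} -> {status}")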

Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, as laid out in our handbook page. This might include the summary, the timeline, or any other bits of information. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.



Incident Review

  • Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated and relevant graphs are included in the summary
  • If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Action" section
  • Fill out relevant sections below or link to the meeting review notes that cover these topics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. NA
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. NA
  3. How many customers were affected?
    1. NA
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. NA

What were the root causes?

  • We rebuilt the patroni-ci cluster in staging with a missing pgbouncer configuration while the cluster was already in use, which caused DB requests in staging to fail.

Incident Response Analysis

  1. How was the incident detected?
    1. Seemingly random QA failures in gstg-cny deployments
  2. How could detection time be improved?
    1. Checking the monitoring section while executing CR #6150 (closed) would have shown CI DB error rates going up and requests going down.
  3. How was the root cause diagnosed?
    1. Dashboards showed a correlation between the Apdex drop, patroni-ci error rates, and the traffic drop
    2. Logs showed failing DB requests to addresses of the patroni-ci cluster
  4. How could time to diagnosis be improved?
    1. Earlier escalation to the infrastructure and development departments; most of the initial investigation was done by the Quality and Delivery teams.
  5. How did we reach the point where we knew how to mitigate the impact?
    1. ...
  6. How could time to mitigation be improved?
    1. ...
  7. What went well?
    1. @niskhakova did a great job identifying the failure cause.

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. No
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. ...
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. Yes, #6150 (closed).

What went well?

  • ...

Guidelines

  • Blameless RCA Guideline

Resources

  1. If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)