2021-11-16 pipelines failing on public and private runners
Current Status
This incident has been mitigated. @ayufan is working on this MR and @ahegyi will review it.
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
All times UTC.
2021-11-16
- 11:54 - Zendesk ticket reported by a customer stating they are seeing the error "There has been a structural integrity problem detected, please contact system administrator"
- 12:00 - @mwasilewski-gitlab declares incident in Slack
- 12:02 - @rehab noticed there was a slight increase in the pending queue size
- 12:02 - @engwan observed the same error on a gitlab-org/gitlab MR pipeline
- 12:10 - Support engineers invited to the call (@asmaa.hassan & @niklasjanz)
- 12:18 - @mbruemmer reports that the failing pages:deploy job is throwing an undefined method `load_balancer` error
- 12:18 - @manojmj confirms that this is seen in multiple places in the Sentry log
- 12:23 - @rmongare, support engineer, joins the call
- 12:24 - @ahegyi identifies the MR causing the problem
- 12:25 - @rehab pinged release managers in case we had to do a rollback
- 12:26 - @ayufan determines that the `load_balancer` accessor had been removed and was causing the problem
- 12:26 - @rehab pinged CMOC
- 12:31 - @rehab determined we needed to draft a message for status.io
- 12:31 - @ayufan opened an MR to fix this bug
- 12:32 - @smcgivern determines that disabling `preserve_latest_wal_locations_for_idempotent_jobs` can mitigate the risk by blocking `pg_wal_lsn_diff` from being called
- 12:32 - @rmongare sent the first message through status.io reporting an increase in error rates in pipelines (both shared and private runners)
- 12:34 - @mwasilewski updates the severity to severity 1
- 12:35 - @smcgivern disabled the feature flag
- 12:48 - @mwasilewski marks the incident as mitigated
- 12:49 - @rmongare sent the second message through status.io reporting that we identified the cause of the incident and rolled out a change to mitigate it
- 13:09 - @rmongare sent the third message through status.io reporting that the increase in error rates in pipelines is no longer occurring and the incident is officially resolved
Corrective Actions
- corrective action Update specs in gitlab-org/gitlab!74615 (merged) ✅
- corrective action Re-expose `model.connection.load_balancer` to have the LB of the connection ✅
- corrective action Improve transparency of files included in a release
- corrective action Re-enable the `preserve_latest_wal_locations_for_idempotent_jobs` FF (gitlab-org/gitlab#338350 (closed)) ✅
- corrective action Ensure that staging and gitlab.com are configured as closely as possible in terms of feature flags
- corrective action Ensure that each new code path of a new method is executed at least once as part of the test suite
- corrective action Implement a zero Sentry exception policy for staging
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Both internal and external customers were impacted by this incident.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Customers saw failed pipeline jobs on both shared and private runners, with the error "There has been a structural integrity problem detected, please contact system administrator".
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
What were the root causes?
Note: written by @ayufan
We were hit by a very tricky combination of merge requests and feature flags, which made this failure hard to discover. The problematic code path was mitigated by disabling `preserve_latest_wal_locations_for_idempotent_jobs`.
- On 16 September we merged gitlab-org/gitlab!69372 (merged), which added a `pg_wal_lsn_diff` method. This method was stubbed in specs and therefore never executed. The MR also added a feature flag that was disabled by default: `preserve_latest_wal_locations_for_idempotent_jobs`.
- We set `preserve_latest_wal_locations_for_idempotent_jobs=true` for staging on 25 September: https://gitlab.com/gitlab-com/gl-infra/feature-flag-log/-/issues/7397
- We set `preserve_latest_wal_locations_for_idempotent_jobs=true` for gitlab.com on 27 October: https://gitlab.com/gitlab-com/gl-infra/feature-flag-log/-/issues/7869
- On 9 November we merged a fix to the load balancer sticking behavior: gitlab-org/gitlab!73949 (merged), which removed the accessor `model.connection.load_balancer`.
- On 15 November we merged gitlab-org/gitlab!72381 (merged), a refactor introducing a `model.database` concept and moving the `pg_wal_lsn_diff` method into the `LoadBalancer` class. This MR introduced an invalid call to `model.connection.load_balancer.pg_wal_lsn_diff` that caused the production incident (see the sketch below).
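For illustration, a minimal sketch of the failing call chain. The class shapes here are hypothetical simplifications of the GitLab load-balancing internals; only `load_balancer` and `pg_wal_lsn_diff` come from the actual incident:

```ruby
# Hypothetical simplification of the call chain that failed in production.
class LoadBalancer
  # After the 15 November refactor, pg_wal_lsn_diff lives on the load balancer.
  def pg_wal_lsn_diff(location1, location2)
    # In GitLab this runs SELECT pg_wal_lsn_diff($1, $2) against the primary;
    # stubbed here so the sketch runs standalone.
    0
  end
end

class Connection
  def initialize(load_balancer)
    @load_balancer = load_balancer
  end
  # The 9 November MR removed the public reader, roughly the equivalent of
  # deleting this line:
  #   attr_reader :load_balancer
end

connection = Connection.new(LoadBalancer.new)

# The refactor still went through the removed reader:
connection.load_balancer.pg_wal_lsn_diff("0/1", "0/2")
# => NoMethodError: undefined method `load_balancer' for #<Connection ...>
```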
The triggering conditions:
- The invalid call to `pg_wal_lsn_diff` was executed only under a very particular configuration: when scheduling a Sidekiq job that had been configured as `idempotent!` and deduplicated with either `deduplicate :until_executed` or `deduplicate :until_executing` (see the worker sketch after this list).
- For this code path to be executed, `preserve_latest_wal_locations_for_idempotent_jobs` had to be set to `true`. The flag is disabled by default: https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/feature_flags/development/preserve_latest_wal_locations_for_idempotent_jobs.yml#L8
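For reference, a sketch of the kind of worker declaration that met these conditions. `ExampleWorker` and its `perform` arguments are made up; `idempotent!` and `deduplicate` are the existing GitLab Sidekiq worker DSL:

```ruby
# Hypothetical worker illustrating the triggering configuration: only jobs
# declared idempotent and deduplicated (:until_executed or :until_executing)
# take the code path that records WAL locations when they are scheduled, and
# only while preserve_latest_wal_locations_for_idempotent_jobs is enabled.
class ExampleWorker
  include ApplicationWorker

  idempotent!
  deduplicate :until_executed

  def perform(project_id)
    # ... actual work ...
  end
end
```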
The discovery of the bug:
- We noticed this bug on gitlab.com via a customer report.
- We did not notice this bug on staging.gitlab.com when running QA tests, even though Sentry shows it happening there: https://sentry.gitlab.net/gitlab/staginggitlabcom/?query=is%3Aunresolved+undefined+method+%60load_balancer. The errors are clearly correlated with CI pipelines.
- We did not notice this bug in `rspec`, since the method was stubbed and therefore never executed as part of unit tests: gitlab-org/gitlab!69372 (diffs). This hid the bug from CI pipelines before gitlab-org/gitlab!72381 (merged) was merged (see the spec sketch after this list).
- We did not notice this in manual testing, as it requires `preserve_latest_wal_locations_for_idempotent_jobs` to be set to `true`, which in the development environment defaults to `false`: https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/feature_flags/development/preserve_latest_wal_locations_for_idempotent_jobs.yml#L8
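To illustrate how the stub hid the breakage, a hypothetical spec (the classes and names below are illustrative stand-ins, not the real GitLab code or specs): with the stub in place the example passes even though the real call chain raises, while exercising the real method surfaces the `NoMethodError` in CI.

```ruby
require "rspec"

# Stand-in for the connection object after the reader was removed:
# no `load_balancer` method is exposed.
class Connection
end

# Stand-in for the code introduced by the refactor.
class WalTracker
  def initialize(connection)
    @connection = connection
  end

  def pg_wal_lsn_diff(a, b)
    # Broken: Connection no longer exposes load_balancer.
    @connection.load_balancer.pg_wal_lsn_diff(a, b)
  end
end

RSpec.describe WalTracker do
  subject(:tracker) { described_class.new(Connection.new) }

  it "passes when the method is stubbed, hiding the bug" do
    allow(tracker).to receive(:pg_wal_lsn_diff).and_return(0)

    expect(tracker.pg_wal_lsn_diff("0/1", "0/2")).to eq(0)
  end

  it "fails when the real method is exercised" do
    # Without the stub the broken call chain raises, which is what the
    # corrective action in gitlab-org/gitlab!74615 makes the specs do.
    expect { tracker.pg_wal_lsn_diff("0/1", "0/2") }
      .to raise_error(NoMethodError, /load_balancer/)
  end
end
```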
Corrective actions:
- Ensure that we don't stub the method under test and that we actually exercise the invocation and real behavior: gitlab-org/gitlab!74615 (merged)
- Decide whether we can or should review staging Sentry after QA runs to acknowledge all new errors. The error was clearly visible in staging. gitlab-org/gitlab#345957 (closed)
- This was not the issue here, but it could go unnoticed next time: ensure that staging and gitlab.com are configured as closely as possible in terms of feature flags. If staging had had `preserve_latest_wal_locations_for_idempotent_jobs=false`, the error would have gone completely unnoticed there. gitlab-org/gitlab#345958 (closed)
We were fortunate to be able to disable this code path via the `preserve_latest_wal_locations_for_idempotent_jobs` feature flag without creating a production patch.
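For context, a minimal sketch of why disabling the flag was enough, assuming a guard of this shape (`Feature.enabled?` is GitLab's existing feature flag check; the module, method, and argument names are hypothetical):

```ruby
# Illustrative guard, not the exact GitLab code: the path that ends in
# connection.load_balancer.pg_wal_lsn_diff only runs when the feature flag
# is enabled, so turning the flag off skipped the broken call without a
# rollback or an emergency patch.
module WalLocationGuard
  def self.replica_caught_up?(connection, job_location, latest_location)
    # With the flag disabled (the default), we never reach the broken call.
    return true unless Feature.enabled?(:preserve_latest_wal_locations_for_idempotent_jobs)

    # Broken after the refactor: `connection` no longer exposes `load_balancer`.
    connection.load_balancer.pg_wal_lsn_diff(latest_location, job_location) <= 0
  end
end
```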
In general, I think this bug was very hard to notice ahead of time from the development and infrastructure side. However, we definitely saw it happening in staging and it did not trigger an alarm. We likely need to figure out how to better review exceptions visible in staging.
Incident Response Analysis
- How was the incident detected?
  - A customer submitted a Zendesk ticket
  - GitLab team members reported pipeline integrity issues internally
- How could detection time be improved?
  - This was a relatively low error rate, so our monitoring would not have caught it without being overly sensitive: #5931 (comment 735721855)
- How was the root cause diagnosed?
  - Viewed the Sentry logs
  - Errors began at a time that coincided with the start of the pipeline errors
  - Investigation of the logs showed that the errors also coincided with a recent canary deployment
  - Investigated the list of files that changed in the recent release
  - From this investigation, the SREs were able to diagnose the problem
- How could time to diagnosis be improved?
  - Faster ways of finding the files changed in the latest deploy: #5931 (comment 735163254)
- How did we reach the point where we knew how to mitigate the impact?
  - Through investigating and reviewing logs
- How could time to mitigation be improved?
  - The MR in question was not a convenient fit for its own feature flag, though it may have been possible. In this case we had a related feature flag that we were able to use; otherwise we would have needed to ship a patch.
- What went well?
  - @smcgivern determined that an existing feature flag could be disabled to avoid the code path that triggered this error. After it was disabled, the errors stopped within a few minutes. Had he not found this flag, we would have had two choices: roll back or make a code change to fix the problem. It was also determined that disabling this flag would not have a significant negative impact; since the flag was not related to the MR that caused the problem, this was an important factor to check.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, this incident was caused by this MR: Refactor Database::Connection into separate types
Lessons Learned
- Using feature flags in critical areas such as job scheduling is very important. As it turned out, we could use a feature flag to quickly restore functionality.
- This is a great example of a mixture of things going wrong at once (see gitlab-org/gitlab#345962 (comment 736404291) for details), shared with #backend_maintainers and #backend:
  - Stubbing out the class under test, plus
  - Removing a key method that is widely used (`model.load_balancer`), plus
  - Filtering out the class from coverage reports
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)