2021-11-16 pipelines failing on public and private runners
Current Status
This incident has been mitigated. @ayufan is working on this MR and @ahegyi will review it.
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
All times UTC.
2021-11-16
- 11:54 - Zendesk ticket reported by a customer stating they are seeing the error "There has been a structural integrity problem detected, please contact system administrator"
- 12:00 - @mwasilewski-gitlab declares incident in Slack
- 12:02 - @rehab noticed there was a slight increase in the pending queue size
- 12:02 - @engwan observed the same error on a gitlab-org/gitlab MR pipeline
- 12:10 - Support engineers invited to the call (@asmaa.hassan & @niklasjanz)
- 12:18 - @mbruemmer reports that the failing pages:deploy job is throwing an undefined method `load_balancer` error
- 12:18 - @manojmj confirms that this is seen in multiple places in the Sentry log
- 12:23 - @rmongare, support engineer, joins the call
- 12:24 - @ahegyi identifies the MR causing the problem
- 12:25 - @rehab pinged release managers in case we had to do a rollback
- 12:26 - @ayufan determines that the `load_balancer` accessor had been removed and was causing the problem
- 12:26 - @rehab pinged CMOC
- 12:31 - @rehab determined we needed to draft a message for status.io
- 12:31 - @ayufan opened an MR to fix this bug
- 12:32 - @smcgivern determines that disabling `preserve_latest_wal_locations_for_idempotent_jobs` can mitigate the risk by blocking `pg_wal_lsn_diff` from being called
- 12:32 - @rmongare sent the first message through status.io reporting an increase in error rates in pipelines (both shared and private runners)
- 12:34 - @mwasilewski updates the severity to severity 1
- 12:35 - @smcgivern disabled the feature flag
- 12:48 - @mwasilewski marks the incident as mitigated
- 12:49 - @rmongare sent the second message through status.io reporting that we identified the cause of the incident and rolled out a change to mitigate it
- 13:09 - @rmongare sent the third message through status.io reporting that the increase in error rates in pipelines is no longer occurring and the incident is officially resolved
Corrective Actions
- corrective action Update specs in gitlab-org/gitlab!74615 (merged) ✅
- corrective action Re-expose `model.connection.load_balancer` to have the LB of the connection ✅
- corrective action Improve transparency of files included in a release
- corrective action Re-enable the `preserve_latest_wal_locations_for_idempotent_jobs` FF (gitlab-org/gitlab#338350 (closed)) ✅
- corrective action Ensure that staging and gitlab.com are configured as closely as possible in terms of feature flags
- corrective action Ensure that each new code path of a new method is executed at least once as part of the test suite
- corrective action Implement a zero Sentry exception policy for staging
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Both internal and external customers were impacted by this incident.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Customers saw failed pipeline jobs on both shared and private runners, with the error "There has been a structural integrity problem detected, please contact system administrator".
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
What were the root causes?
Note: written by @ayufan
We were hit by a very tricky combination of merge requests and feature flags, which made this failure hard to discover. The problematic code path was mitigated by disabling `preserve_latest_wal_locations_for_idempotent_jobs`.
- On 16 September we merged gitlab-org/gitlab!69372 (merged), which added a `pg_wal_lsn_diff` method. This method was stubbed in specs and therefore never executed. The MR also added a feature flag that was disabled by default: `preserve_latest_wal_locations_for_idempotent_jobs`.
- We set `preserve_latest_wal_locations_for_idempotent_jobs=true` for staging on 25 September: https://gitlab.com/gitlab-com/gl-infra/feature-flag-log/-/issues/7397
- We set `preserve_latest_wal_locations_for_idempotent_jobs=true` for gitlab.com on 27 October: https://gitlab.com/gitlab-com/gl-infra/feature-flag-log/-/issues/7869
- On 9 November we merged a fix to the load balancer sticking behavior: gitlab-org/gitlab!73949 (merged), which removed the accessor `model.connection.load_balancer`.
- On 15 November we merged gitlab-org/gitlab!72381 (merged), a refactor introducing a `model.database` concept and moving the `pg_wal_lsn_diff` method into the `LoadBalancer` class. This MR introduced an invalid call to `model.connection.load_balancer.pg_wal_lsn_diff` that caused the production incident (see the sketch below).
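For illustration, a minimal sketch of the failing call chain. The class shapes here are hypothetical simplifications of the GitLab load-balancing internals; only `load_balancer` and `pg_wal_lsn_diff` come from the actual incident:

```ruby
# Hypothetical simplification of the call chain that failed in production.
class LoadBalancer
  # After the 15 November refactor, pg_wal_lsn_diff lives on the load balancer.
  def pg_wal_lsn_diff(location1, location2)
    # In GitLab this runs SELECT pg_wal_lsn_diff($1, $2) against the primary;
    # stubbed here so the sketch runs standalone.
    0
  end
end

class Connection
  def initialize(load_balancer)
    @load_balancer = load_balancer
  end
  # The 9 November MR removed the public reader, roughly the equivalent of
  # deleting this line:
  #   attr_reader :load_balancer
end

connection = Connection.new(LoadBalancer.new)

# The refactor still went through the removed reader:
connection.load_balancer.pg_wal_lsn_diff("0/1", "0/2")
# => NoMethodError: undefined method `load_balancer' for #<Connection ...>
```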
The triggering conditions:
- The invalid call to `pg_wal_lsn_diff` was executed only under a very particular configuration: when scheduling a Sidekiq job that had been configured as `idempotent!` and deduplicated with either `deduplicate :until_executed` or `deduplicate :until_executing` (see the worker sketch after this list).
- For this code path to be executed, `preserve_latest_wal_locations_for_idempotent_jobs` had to be set to `true`. The flag is disabled by default: https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/feature_flags/development/preserve_latest_wal_locations_for_idempotent_jobs.yml#L8
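For reference, a sketch of the kind of worker declaration that met these conditions. `ExampleWorker` and its `perform` arguments are made up; `idempotent!` and `deduplicate` are the existing GitLab Sidekiq worker DSL:

```ruby
# Hypothetical worker illustrating the triggering configuration: only jobs
# declared idempotent and deduplicated (:until_executed or :until_executing)
# take the code path that records WAL locations when they are scheduled, and
# only while preserve_latest_wal_locations_for_idempotent_jobs is enabled.
class ExampleWorker
  include ApplicationWorker

  idempotent!
  deduplicate :until_executed

  def perform(project_id)
    # ... actual work ...
  end
end
```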
The discovery of the bug:
- We noticed this bug on gitlab.com via a customer report.
- We did not notice this bug on staging.gitlab.com when running QA tests, even though Sentry shows it happening there: https://sentry.gitlab.net/gitlab/staginggitlabcom/?query=is%3Aunresolved+undefined+method+%60load_balancer. The errors are clearly correlated with CI pipelines.
- We did not notice this bug in `rspec`, since the method was stubbed and therefore never executed as part of unit tests: gitlab-org/gitlab!69372 (diffs). This hid the bug from CI pipelines before gitlab-org/gitlab!72381 (merged) was merged (see the spec sketch after this list).
- We did not notice this in manual testing, as it requires `preserve_latest_wal_locations_for_idempotent_jobs` to be set to `true`, which in the development environment defaults to `false`: https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/feature_flags/development/preserve_latest_wal_locations_for_idempotent_jobs.yml#L8
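To illustrate how the stub hid the breakage, a hypothetical spec (the classes and names below are illustrative stand-ins, not the real GitLab code or specs): with the stub in place the example passes even though the real call chain raises, while exercising the real method surfaces the `NoMethodError` in CI.

```ruby
require "rspec"

# Stand-in for the connection object after the reader was removed:
# no `load_balancer` method is exposed.
class Connection
end

# Stand-in for the code introduced by the refactor.
class WalTracker
  def initialize(connection)
    @connection = connection
  end

  def pg_wal_lsn_diff(a, b)
    # Broken: Connection no longer exposes load_balancer.
    @connection.load_balancer.pg_wal_lsn_diff(a, b)
  end
end

RSpec.describe WalTracker do
  subject(:tracker) { described_class.new(Connection.new) }

  it "passes when the method is stubbed, hiding the bug" do
    allow(tracker).to receive(:pg_wal_lsn_diff).and_return(0)

    expect(tracker.pg_wal_lsn_diff("0/1", "0/2")).to eq(0)
  end

  it "fails when the real method is exercised" do
    # Without the stub the broken call chain raises, which is what the
    # corrective action in gitlab-org/gitlab!74615 makes the specs do.
    expect { tracker.pg_wal_lsn_diff("0/1", "0/2") }
      .to raise_error(NoMethodError, /load_balancer/)
  end
end
```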
Corrective actions:
- Ensure that we don't stub the method under test and that we actually exercise the invocation and real behavior: gitlab-org/gitlab!74615 (merged)
- Decide whether we can or should review staging Sentry after QA runs to acknowledge all new errors. The error was clearly visible in staging. gitlab-org/gitlab#345957 (closed)
- This was not the issue here, but it could go unnoticed next time: ensure that staging and gitlab.com are configured as closely as possible in terms of feature flags. If staging had had `preserve_latest_wal_locations_for_idempotent_jobs=false`, the error would have gone completely unnoticed there. gitlab-org/gitlab#345958 (closed)
We were fortunate to be able to disable this code path via the `preserve_latest_wal_locations_for_idempotent_jobs` feature flag without creating a production patch.
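For context, a minimal sketch of why disabling the flag was enough, assuming a guard of this shape (`Feature.enabled?` is GitLab's existing feature flag check; the module, method, and argument names are hypothetical):

```ruby
# Illustrative guard, not the exact GitLab code: the path that ends in
# connection.load_balancer.pg_wal_lsn_diff only runs when the feature flag
# is enabled, so turning the flag off skipped the broken call without a
# rollback or an emergency patch.
module WalLocationGuard
  def self.replica_caught_up?(connection, job_location, latest_location)
    # With the flag disabled (the default), we never reach the broken call.
    return true unless Feature.enabled?(:preserve_latest_wal_locations_for_idempotent_jobs)

    # Broken after the refactor: `connection` no longer exposes `load_balancer`.
    connection.load_balancer.pg_wal_lsn_diff(latest_location, job_location) <= 0
  end
end
```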
In general, I think this bug was very hard to notice ahead of time from the development and infrastructure side. However, we definitely saw it happening in staging and it did not trigger an alarm. We likely need to figure out how to better review exceptions visible in staging.
Incident Response Analysis
- How was the incident detected?
  - A customer submitted a Zendesk ticket
  - GitLab team members reported pipeline integrity issues internally
- How could detection time be improved?
  - This was a relatively low error rate, so our monitoring would not have caught it without being overly sensitive: #5931 (comment 735721855)
- How was the root cause diagnosed?
  - Viewed the Sentry logs
  - Errors began at a time that coincided with the start of the pipeline errors
  - Investigation of the logs showed that the errors also coincided with a recent canary deployment
  - Investigated the list of files that changed in the recent release
  - From this investigation, the SREs were able to diagnose the problem
- How could time to diagnosis be improved?
  - Faster ways of finding the files changed in the latest deploy: #5931 (comment 735163254)
- How did we reach the point where we knew how to mitigate the impact?
  - Through investigating and reviewing logs
- How could time to mitigation be improved?
  - The MR in question was not a convenient fit for its own feature flag, though it may have been possible. In this case we had a related feature flag that we were able to use; otherwise we would have needed to ship a patch.
- What went well?
  - @smcgivern determined that an existing feature flag could be disabled to avoid the code path that triggered this error. After it was disabled, the errors stopped within a few minutes. Had he not found this flag, we would have had two choices: roll back or make a code change to fix the problem. It was also determined that disabling this flag would not have a significant negative impact; since the flag was not related to the MR that caused the problem, this was an important factor to check.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, this incident was caused by this MR: Refactor Database::Connection into separate types
Lessons Learned
- Using feature flags in critical areas such as job scheduling is very important. As it turned out, we could use a feature flag to quickly restore functionality.
- This is a great example of a mixture of things going wrong at once (see gitlab-org/gitlab#345962 (comment 736404291) for details), shared with #backend_maintainers and #backend:
  - Stubbing out the class under test, plus
  - Removing a key method that is widely used (`model.load_balancer`), plus
  - Filtering out the class from coverage reports
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)