2021-02-11 redis-sidekiq unavailable
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally.
By default, all information we can share will be public, in accordance with our transparency value.
Summary
In preparation for a change, an effort was made to remove drift between the infrastructure state and the terraform code.
There were multiple existing diffs on each of the redis-sidekiq nodes, among others. A safety net was not in place at the time the changes were applied, which caused the nodes to shut down. The safety net's absence was not noticed until after it had caused the incident.
As a result, sidekiq was unable to process jobs, causing git pushes, CI pipelines, incoming mail, and potentially other sidekiq activities such as mirror updates to fail.
Timeline
All times UTC.
2021-02-11
- 10:41 - A metadata change was applied.
- 10:42 - redis-sidekiq nodes start to shut down and services start to be unavailable.
- 10:44 - 3/3 redis-sidekiq nodes are offline.
- 10:44 - mailroom queue starts to fill up, resulting in incoming mail being unable to be processed.
- 10:46 - @andrewn declares incident in Slack.
- in between:
  - Incident bridge is filled.
  - It is determined that the nodes did not automatically reboot, but needed to be started manually.
  - Once started, redis did not start automatically either, so it also needed to be started manually.
- 10:51 - redis is manually restarted on redis-sidekiq nodes.
- 10:52 - sidekiq is healthy again and services start to recover almost instantly.
- in between:
  - It was detected that mailroom did not process any more emails.
  - Due to the way it is deployed, we agreed to delete the currently hanging mailroom pod to force it to be rescheduled.
- 11:29 - mailroom processed the full backlog of emails.
Corrective Actions
- Create an alert/warning mechanism and process for an unclean TF plan on gprd / ops
- Update allow_stopping_for_update = true -> false in terraform
- Prevent checking in allow_stopping_for_update=true in terraform (add CI checks)
- Issue template created detailing how to accomplish Terraform changes in Production
- Dogfood Terraform integration in Merge Requests
- Fix: Mailroom doesn't fail liveness checks when it's unable to connect to redis-sidekiq (and remains in this state after redis is back online)
Incident Review
Summary
- Service(s) affected: Service::Infrastructure, Service::API, Service::CI Runners, Service::Git, Service::GitLab Rails, Service::Mailroom, Service::Sidekiq
- Team attribution: ~"team::Core-Infra"
- Time to detection: 0 minutes
- Minutes of downtime or degradation: ~11 minutes (~48 minutes for Service::Mailroom)
Unless otherwise noted, the links to metrics in this paragraph are only accessible for GitLab Team members. Screenshots are attached in the metrics section.
For a period of 11 minutes (between 2021-02-11 10:41 UTC and 2021-02-11 10:52 UTC), GitLab.com experienced increased error rates due to the loss of the whole redis-sidekiq fleet.
For a period of 48 minutes (between 2021-02-11 10:41 UTC and 2021-02-11 11:29 UTC), incoming emails to the GitLab.com SaaS application were not processed. Emails sent to other @gitlab.com email addresses were not affected.
In preparation for a change, an effort was made to remove drift between the infrastructure state and the terraform code.
There were multiple existing diffs on each of the redis-sidekiq nodes, among others. One diff was removed via a merge request which re-aligned the node type in terraform with what is deployed in GCP. After it was merged, the instance type disappeared from the terraform diff.
The change which was left for the redis-sidekiq nodes was this (redacted for this issue, full text in #3579 - internal):
~ service_account {
email = "<REDACTED>",
~ scopes = [
- "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
]
}
A service account change can sometimes require a system restart to be applied, but that is not always the case.
Usually, we have allow_stopping_for_update set to false on all machines. At some point in the past, this was changed to true for a subset of machines, which also includes the redis-sidekiq nodes.
With this setting at false, a terraform apply fails (with an error originating from the GCP API) if a restart is required to apply changes. A value of true, however, will do whatever it takes to apply the change, including a reboot.
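For illustration, here is a minimal sketch of how this flag is set on a GCP instance definition in terraform. The resource name, machine type, and service account below are hypothetical placeholders, not the actual gitlab-com configuration:

# Hypothetical sketch, not the actual gitlab-com configuration.
resource "google_compute_instance" "redis_sidekiq_example" {
  name         = "redis-sidekiq-example-01"
  machine_type = "n1-standard-8"
  zone         = "us-east1-c"

  # Safety net: with this set to false (the provider default), a
  # terraform apply fails with a GCP API error whenever a change
  # would require stopping the instance. With true, terraform is
  # allowed to stop and restart the instance to apply the change.
  allow_stopping_for_update = false

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-10"
    }
  }

  network_interface {
    network = "default"
  }

  service_account {
    email  = "example-sa@example-project.iam.gserviceaccount.com"
    scopes = ["cloud-platform"]
  }
}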
This safety net was not in place at the time of the apply at around 10:41, and its absence was not noticed until after it had already caused the incident at 10:45.
The redis servers were shut down cleanly, and redis persisted its state prior to the reboot. Data loss might have occurred for in-flight requests whose data was not fully acknowledged by redis. Acknowledged data was persisted.
Immediately following this, pushes to git repositories, inbound email, and other sidekiq-dependent services stopped processing requests while they struggled to reach redis.
Because the sidekiq redis was unavailable, the sidekiq processes dropped their connections to the pgbouncer-sidekiq nodes, causing an overall reduction of tuple updates on the postgres database; transaction counts on the primary also dropped.
The impact of this is clearly visible in the metrics for enqueued jobs. For the period during which redis was offline, no new jobs were queued, followed by two bumps: the first from general application load, and the second from scheduled jobs, among them project mirrors that had not run the hour before.
During the redis downtime, we observed error rates increase to up to 4.5% of failed requests (internal), visible across various backends, with api hit the worst. This is presumably because we also consume the API internally, so this backend's metric might be inflated. api_rate_limit, the backend used for most customers, was not hit as hard, with an increase to about 2.7% of requests, followed by the web backend peaking at around 1.1%.
Nonetheless, even though a large portion of requests were successful, it is safe to assume that any action not considered read-only failed during this time. For example, it was not possible to start a CI job or to push log traces of CI jobs.
The machines were not rebooted, but rather just shut down, so during the investigation we needed to manually start the machines and the redis process, which did not start automatically either.
After the machines were brought back online, everything except mailroom recovered within a short period of time.
Because mailroom doesn't fail liveness checks when it's unable to connect to redis-sidekiq (gitlab-org/charts/gitlab#2576 (closed)), mailroom did not automatically restart after it crashed due to the missing redis.
Metrics
- For approximately 15 minutes, from 10:45 to 11:00, there were increased error rates across the web, api, and git services, which prevented Git pushes and caused CI failures. (This metric is slightly delayed; a more detailed graph is below.)
- For approximately 48 minutes, from 10:45 to 11:30, we were unable to process incoming mail.
- Reduction of tuple updates on the postgres database (internal)
- Transactions on the postgres primary dropped (internal)
- pgbouncer-sidekiq dropped its connection count to 0 (internal)
- Enqueued jobs bottomed out (internal)
- Comparison of project mirrors during the incident (green) vs. 1 hour before (yellow) (internal)
- The error rate increased from 0.01% to up to 4.5% (internal)
- CI runner queues were depleted for 7 minutes (internal), indicating reduced or missing functionality around CI jobs.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - internal customers
  - external customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - unable to push to git repositories
  - CI pipelines may not have been started
  - incoming mail was queued up
  - project mirrors were not executed during this time
- How many customers were affected?
  - All customers performing the actions above.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Somewhere between 1.1% and 2.7% of requests from external customers failed, whereas 4.5% of requests failed for internal API use and customers on an allowlist.
What were the root causes?
Incident Response Analysis
- How was the incident detected?
  - The engineer executing on the terraform codebase detected the issue immediately, followed by alerts firing.
- How could detection time be improved?
  -
- How was the root cause diagnosed?
  - Part of the terraform maintenance was the root cause.
- How could time to diagnosis be improved?
  -
- How did we reach the point where we knew how to mitigate the impact?
  - We assessed whether the nodes were still available in GCP.
  - Once we found they were, we started them to assess the status of the machines.
  - We found we needed to manually start redis. Services recovered afterwards.
- How could time to mitigation be improved?
  -
- What went well?
  - Collaboration of the Engineer On-call, fellow team members, and the executing engineer led to an immediately identified root cause.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Possibly, although I could not find a specific issue.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9479
  - A dirty terraform codebase was brought up in the past.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - There is no issue for this change, as the change was already merged (a result of 2. above). More details in this thread: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12600#note_507858366
  - But as laid out in Lessons Learned 5. below, it should have been accompanied by an issue.
Lessons Learned
1. We need to make sure to always have a clean codebase without diffs.
   - When we need to change something managed in terraform, we need to prioritize integrating those changes into terraform.
   - This change was made by someone unfamiliar with the intricacies of the outstanding diff. Even when trying to the best of their abilities, it is not possible for someone to know the full blast radius of a change unless they are very familiar with it.
   - If possible, we should seek to implement asks and requirements in our tooling, and only diverge if a quick remediation is critical. In those cases, we should cease all action on terraform until the change is implemented in the codebase.
2. We should never trust assumptions around safety nets being in place.
   - While the aforementioned flag would have prevented this specific incident, other changes might not have a safety net either.
3. Execute manual changes on production, or production-relevant environments, only while another engineer is double-checking in a pairing session.
   - This might have prompted more questions and caught the allow_stopping_for_update flag being true when it should not have been.
4. Do not have allow_stopping_for_update set to true in checked-in terraform code.
   - Whenever we need this flag to be true, the appropriate change requires manual action in most cases anyway. We should strive to not commit this flag as true, but only ever change it locally when we absolutely have to, and keep the change local.
   - Ideally, this should be enforced via CI.
5. Every change should have a change management issue.
   - This one did not, as the engineer's rationale was that the change was already merged into the default branch, they deemed it to be a safe operation (see 2.), and it was not a dedicated change.
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)