2021-02-11 redis-sidekiq unavailable
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally.
By default, all information we can share will be public, in accordance with our transparency value.
Summary
In preparation for a change, an effort was made to remove drift between the infrastructure state and the terraform code.
There were multiple existing diffs on each of the redis-sidekiq nodes, among others. A safety net was not in place at the time the changes were applied, which caused the nodes to shut down. The safety net's absence was not noticed until after it had caused the incident.
As a result, sidekiq was unable to process jobs, causing git pushes, CI pipelines, incoming mail, and potentially other sidekiq activities such as mirror updates to fail.
Timeline
All times UTC.
2021-02-11
- 10:41 - A metadata change was applied.
- 10:42 - redis-sidekiq nodes start to shut down and services start to be unavailable.
- 10:44 - 3/3 redis-sidekiq nodes are offline.
- 10:44 - mailroom queue starts to fill up, resulting in incoming mail being unable to be processed.
- 10:46 - @andrewn declares incident in Slack.
- in between:
  - Incident bridge is filled.
  - It is determined that the nodes did not automatically reboot, but needed to be started manually.
  - Once started, redis did not start automatically either, so it also needed to be started manually.
- 10:51 - redis is manually restarted on redis-sidekiq nodes.
- 10:52 - sidekiq is healthy again and services start to recover almost instantly.
- in between:
  - It was detected that mailroom did not process any more emails.
  - Due to the way it is deployed, we agreed to delete the currently hanging mailroom pod to force it to be rescheduled.
- 11:29 - mailroom processed the full backlog of emails.
Corrective Actions
- Create an alert/warning mechanism and process for an unclean TF plan on gprd / ops
- Update allow_stopping_for_update = true -> false in terraform
- Prevent checking in allow_stopping_for_update=true in terraform (add CI checks)
- Issue template created detailing how to accomplish Terraform changes in Production
- Dogfood Terraform integration in Merge Requests
- Fix: Mailroom doesn't fail liveness checks when it's unable to connect to redis-sidekiq (and remains in this state after redis is back online)
Incident Review
Summary
- Service(s) affected: Service::Infrastructure, Service::API, Service::CI Runners, Service::Git, Service::GitLab Rails, Service::Mailroom, Service::Sidekiq
- Team attribution: ~"team::Core-Infra"
- Time to detection: 0 minutes
- Minutes of downtime or degradation: ~11 minutes (~48 minutes for Service::Mailroom)
Unless otherwise noted, the links to metrics in this paragraph are only accessible for GitLab Team members. Screenshots are attached in the metrics section.
For a period of 11 minutes (between 2021-02-11 10:41 UTC and 2021-02-11 10:52 UTC), GitLab.com experienced increased error rates due to the loss of the whole redis-sidekiq fleet.
For a period of 48 minutes (between 2021-02-11 10:41 UTC and 2021-02-11 11:29 UTC), incoming emails to the GitLab.com SaaS application were not processed. Emails sent to other @gitlab.com email addresses were not affected.
In preparation for a change, an effort was made to remove drift between the infrastructure state and the terraform code.
There were multiple existing diffs on each of the redis-sidekiq nodes, among others. One diff was removed via a merge request which re-aligned the node type in terraform with what is deployed in GCP. After it was merged, the instance type disappeared from the terraform diff.
The change which was left for the redis-sidekiq nodes was this (redacted for this issue, full text in #3579 - internal):
~ service_account {
email = "<REDACTED>",
~ scopes = [
- "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
+ "https://www.googleapis.com/auth/<REDACTED>",
]
}
A service account change can sometimes require a system restart to be applied, but that is not always the case.
Usually, we have allow_stopping_for_update set to false on all machines. At some point in the past, this was changed to true for a subset of machines, which also includes the redis-sidekiq nodes.
With this setting at false, a terraform apply fails (with an error originating from the GCP API) if a restart is required to apply changes. A value of true, however, will do whatever it takes to apply the change, including a reboot.
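For illustration, here is a minimal sketch of how this flag is set on a GCP instance definition in terraform. The resource name, machine type, and service account below are hypothetical placeholders, not the actual gitlab-com configuration:

# Hypothetical sketch, not the actual gitlab-com configuration.
resource "google_compute_instance" "redis_sidekiq_example" {
  name         = "redis-sidekiq-example-01"
  machine_type = "n1-standard-8"
  zone         = "us-east1-c"

  # Safety net: with this set to false (the provider default), a
  # terraform apply fails with a GCP API error whenever a change
  # would require stopping the instance. With true, terraform is
  # allowed to stop and restart the instance to apply the change.
  allow_stopping_for_update = false

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-10"
    }
  }

  network_interface {
    network = "default"
  }

  service_account {
    email  = "example-sa@example-project.iam.gserviceaccount.com"
    scopes = ["cloud-platform"]
  }
}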
This safety net was not in place at the time of the apply at around 10:41, and its absence was not noticed until after it had already caused the incident at 10:45.
The redis servers were shut down cleanly, and redis persisted its state prior to the reboot. Data loss might have occurred for in-flight requests whose data was not fully acknowledged by redis. Acknowledged data was persisted.
Immediately following this, pushes to git repositories, inbound email, and other sidekiq-dependent services stopped processing requests while they struggled to reach redis.
Because the sidekiq redis was unavailable, the sidekiq processes dropped their connections to the pgbouncer-sidekiq nodes, causing an overall reduction of tuple updates on the postgres database; transaction counts on the primary also dropped.
The impact of this is clearly visible in the metrics for enqueued jobs. For the period during which redis was offline, no new jobs were queued, followed by two bumps: the first from general application load, and the second from scheduled jobs, among them project mirrors that had not run the hour before.
During the redis downtime, we observed error rates increase to up to 4.5% of failed requests (internal), visible across various backends, with api hit the worst. This is presumably because we also consume the API internally, so this backend's metric might be inflated. api_rate_limit, the backend used for most customers, was not hit as hard, with an increase to about 2.7% of requests, followed by the web backend peaking at around 1.1%.
Nonetheless, even though a large portion of requests were successful, it is safe to assume that any action not considered read-only failed during this time. For example, it was not possible to start a CI job or to push log traces of CI jobs.
The machines were not rebooted, but rather just shut down, so during the investigation we needed to manually start the machines and the redis process, which did not start automatically either.
After the machines were brought back online, everything except mailroom recovered within a short period of time.
Because mailroom doesn't fail liveness checks when it's unable to connect to redis-sidekiq (gitlab-org/charts/gitlab#2576 (closed)), mailroom did not automatically restart after it crashed due to the missing redis.
Metrics
- For approximately 15 minutes, from 10:45 to 11:00, there were increased error rates across the web, api, and git services, which prevented Git pushes and caused CI failures. (This metric is slightly delayed; a more detailed graph is below.)
- For approximately 48 minutes, from 10:45 to 11:30, we were unable to process incoming mail.
- Reduction of tuple updates on the postgres database (internal)
- Transactions on the postgres primary dropped (internal)
- pgbouncer-sidekiq dropped its connection count to 0 (internal)
- Enqueued jobs bottomed out (internal)
- Comparison of project mirrors during the incident (green) vs. 1 hour before (yellow) (internal)
- The error rate increased from 0.01% to up to 4.5% (internal)
- CI runner queues were depleted for 7 minutes (internal), indicating reduced or missing functionality around CI jobs.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - internal customers
  - external customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - unable to push to git repositories
  - CI pipelines may not have been started
  - incoming mail was queued up
  - project mirrors were not executed during this time
- How many customers were affected?
  - All customers performing the actions above.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Somewhere between 1.1% and 2.7% of requests from external customers failed, whereas 4.5% of requests failed for internal API use and customers on an allowlist.
What were the root causes?
Incident Response Analysis
- How was the incident detected?
  - The engineer executing on the terraform codebase detected the issue immediately, followed by alerts firing.
- How could detection time be improved?
  -
- How was the root cause diagnosed?
  - Part of the terraform maintenance was the root cause.
- How could time to diagnosis be improved?
  -
- How did we reach the point where we knew how to mitigate the impact?
  - We assessed whether the nodes were still available in GCP.
  - Once we found they were, we started them to assess the status of the machines.
  - We found we needed to manually start redis. Services recovered afterwards.
- How could time to mitigation be improved?
  -
- What went well?
  - Collaboration of the Engineer On-call, fellow team members, and the executing engineer led to an immediately identified root cause.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Possibly, although I could not find a specific issue.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9479
  - A dirty terraform codebase was brought up in the past.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - There is no issue for this change, as the change was already merged (a result of 2. above). More details in this thread: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12600#note_507858366
  - But as laid out in Lessons Learned 5. below, it should have been accompanied by an issue.
Lessons Learned
1. We need to make sure to always have a clean codebase without diffs.
   - When we need to change something managed in terraform, we need to prioritize integrating those changes into terraform.
   - This change was made by someone unfamiliar with the intricacies of the outstanding diff. Even when trying to the best of their abilities, it is not possible for someone to know the full blast radius of a change unless they are very familiar with it.
   - If possible, we should seek to implement asks and requirements in our tooling, and only diverge if a quick remediation is critical. In those cases, we should cease all action on terraform until the change is implemented in the codebase.
2. We should never trust assumptions around safety nets being in place.
   - While the aforementioned flag would have prevented this specific incident, other changes might not have a safety net either.
3. Execute manual changes on production, or production-relevant environments, only while another engineer is double-checking in a pairing session.
   - This might have prompted more questions and caught the allow_stopping_for_update flag being true when it should not have been.
4. Do not have allow_stopping_for_update set to true in checked-in terraform code.
   - Whenever we need this flag to be true, the appropriate change requires manual action in most cases anyway. We should strive to not commit this flag as true, but only ever change it locally when we absolutely have to, and keep the change local.
   - Ideally, this should be enforced via CI.
5. Every change should have a change management issue.
   - This one did not, as the engineer's rationale was that the change was already merged into the default branch, they deemed it to be a safe operation (see 2.), and it was not a dedicated change.
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)