2020-06-10: Elevated web latency - Incident Review
Summary
For a period of 3 hours and 38 minutes, between 2020-06-10 13:35 UTC and 2020-06-10 17:13 UTC, customers experienced increased latency when using GitLab.com. The latency was caused by a sizeable increase in I/O related to local disk utilization and by saturation of Puma workers on nodes of the web fleet. The root cause of the increase in disk utilization was a deployment containing an MR (gitlab-org/gitlab!33943 (merged)) which inadvertently used the local disk for caching instead of the dedicated Redis cache cluster. A first patch was developed, but it failed in staging because it introduced errors causing 503s, which blocked its deployment to production. A second patch was developed and deployed, correctly redirecting caching activity to the Redis cache cluster.
- Service(s) affected: GitLab.com web, api, and git servers
- Team attribution: Plan:Project Management
- Minutes downtime or degradation: 218 minutes
Metrics
Metric 1 - Degradation in Apdex
Metric 2 - Puma worker threads
Customer Impact
100% of GitLab.com users experienced slower load times for a period of 3 hours and 38 minutes (between 2020-06-10 13:35 UTC and 2020-06-10 17:13 UTC).
Incident Response Analysis
- How was the event detected?
- The event was detected by an alert from PagerDuty on elevated latency.
- How could detection time be improved?
- The alert threshold could be lowered so that latency degradation is detected sooner.
- Alarms on Redis cache utilization based on deviations from normal usage may have indicated an issue.
- How did we reach the point where we knew how to mitigate the impact?
- Investigation into recent deployments quickly identified the MR that needed to be patched.
- How could time to mitigation be improved?
- The ability to roll back entire deployments to GitLab.com would have cut mitigation time by hours.
Post Incident Analysis
- How was the root cause diagnosed?
- An investigation into the recent deployment began right away. Assistance from the Development on-call helped narrow the diagnosis down to the MR.
- How could time to diagnosis be improved?
- Smaller deployment packages would make it faster to identify which change caused a regression.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
- Yes; it was triggered by a deployment containing MR gitlab-org/gitlab!33943 (merged), linked in the summary and timeline.
Timeline
- 2020-06-10 13:28 UTC: Deployment (containing affecting MR gitlab-org/gitlab!33943 (merged)) begins to GitLab.com web nodes. (https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/1292274)
- 2020-06-10 13:35 UTC: Graphs show increasing latency in the GitLab.com web fleet.
- 2020-06-10 13:46 UTC: EOC is paged for component_apdex_ratio_burn_rate_slo_out_of_bounds_upper and begins investigating, seeing an increase in latency in the web fleet.
- 2020-06-10 13:46 UTC: EOC opens this incident issue and declares the incident. Degradation in Apdex (a derived metric of latency) is moderate, but stable (see Metric 1 above).
- 2020-06-10 13:55 UTC: The deploy proceeds; web nodes are becoming saturated on Puma worker threads (see Metric 2 above).
- 2020-06-10 14:00 UTC: Inspection of dashboards indicates that the latency degradation is correlated with high root disk I/O, and that the root filesystem is filling up on each node. This is unusual for stateless nodes.
- 2020-06-10 14:12 UTC: IMOC joins the Zoom call after the incident severity is raised to ~S2 and asks the CMOC to join the call as well to begin customer communications.
- 2020-06-10 14:22 UTC: IMOC notifies the Development on-call via Slack.
- 2020-06-10 14:25 UTC: Development on-call joins the incident call.
- 2020-06-10 14:30 UTC: Development on-call inspects the diff of what was deployed and identifies the MR (gitlab-org/gitlab!33943 (merged)) as the likely culprit. They explain that switching the cache to local disk explains the sudden latency degradation ("dropped" cache), as well as the slow, imperfect warm-up (the cache is local to each web node and not shared).
- 2020-06-10 ~14:30 UTC: Team begins exploring strategies to mitigate disk space consumption.
- 2020-06-10 15:00 UTC: EOC shift change, previously engaged EOC remains on the call.
- 2020-06-10 15:14 UTC: It is determined that there are roughly 3 hours until disk space is 95% utilized. Investigation into mitigation strategies continues while a patch is developed.
- 2020-06-10 15:30 UTC: A patch rollout is attempted on staging. In testing on staging, it is determined that the patch generates 503 errors, and deployment to production is prevented. Work on a new patch begins.
- 2020-06-10 17:13 UTC: The second patch is successfully applied to production, ending the period of elevated latency.
- 2020-06-11 08:09 UTC: Infrastructure team is notified of cache inconsistency issues for long-TTL keys. This is mitigated by dropping the cache. Discussion in #2263 (closed).
5 Whys
- There was high latency and disk I/O on the web nodes. Why?
- Cache entries that were normally stored in Redis were being written to local disk.
- Why did the cache start getting saved on disk?
- When config/initializers/7_redis.rb was run, which should have set the cache store to Redis, it no longer had any effect.
- Why?
- Some code checking feature flags triggered an access to Rails.cache before the cache store was set up, resulting in Rails falling back to the default, which is the file store (a minimal sketch of this ordering hazard appears after this list).
- Why?
- In gitlab-org/gitlab!33943 (merged), we moved the initialization of Rails' config.cache_store, because the Redis pool size now depends on reading a value from gitlab.yml, which can only be done after config/initializers/1_settings.rb is run.
- Why was this not caught locally?
- Manual testing wasn't fully conducted for the caching aspect of this change.
- Why was this not caught on staging?
- We don't have adequate load testing or smoke testing on staging. When the QA test passed, we continued with the deployment.
- Why was this caught on dev.gitlab.org (#2259 (comment 358954870)) but not investigated as a serious issue?
- It was being investigated by Development, but investigations on dev.gitlab.org include no step that asks "Is the code being investigated on dev.gitlab.org in the process of being deployed, or already deployed, to GitLab.com?", and there is no "andon cord" to pull when a breaking issue is being investigated on dev.gitlab.org that would prevent deployments and notify Infrastructure about errors which may have escaped into production.
- Why was a patch necessary?
- It is not presently possible to safely conduct a rollback of deployments to gitlab.com.
- Why is it not safe to conduct rollbacks?
- Deployments can contain database migrations and code dependent on the migration outcome, creating a one-way door.
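To make the second and third whys concrete, here is a minimal sketch of the ordering hazard under a simplified initializer layout; the file names, the hypothetical feature-flag read, and the simplified Redis configuration are illustrative assumptions, not GitLab's actual code.

```ruby
# Minimal sketch only; file names and the feature-flag read are hypothetical.
# Rails runs files in config/initializers/ in lexical order.

# --- config/initializers/4_feature_flags.rb (hypothetical) ---
# This runs before 7_redis.rb. At this point the cache store has not been
# pointed at Redis yet, so touching Rails.cache builds a cache against the
# default file store, and subsequent Rails.cache calls on this node keep
# writing to local disk instead of the Redis cache cluster.
Rails.cache.fetch('feature_flag:example_flag') { false }

# --- config/initializers/7_redis.rb (simplified) ---
# The cache-store assignment had to move here, after 1_settings.rb has
# loaded gitlab.yml (on which the Redis pool size now depends), but by now
# the file-store-backed cache above is already in use, so this assignment
# no longer takes effect.
Rails.application.config.cache_store = :redis_cache_store,
  { url: ENV.fetch('REDIS_CACHE_URL', 'redis://localhost:6379/0') }
```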
Lessons Learned
- The git and api servers were affected too, but this was not realized until well into the cleanup effort.
- We are missing alerting/testing on staging to catch significant variances in things like disk space utilization.
- This problem occurred on dev.gitlab.org before the deployment to production, but it wasn't identified as an issue indicating the need to stop deployment pipelines and evaluate whether the code on dev.gitlab.org had been deployed to GitLab.com.
- Two weeks is quite a long TTL; a lower one would have prevented post-incident cache clean-up activities (see the sketch after this list).
- Patching is expensive in terms of time spent to mitigate.
- An ineffective patch in staging, which needed to be rolled back before an effective patch could be applied, made this even more expensive.
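As a purely illustrative sketch of the TTL point above (the key, value, and durations are arbitrary, not GitLab's actual cache keys or settings), a shorter expiry bounds how long misplaced or stale entries survive:

```ruby
payload = { example: true } # arbitrary cached value, for illustration only

# A two-week TTL means cache residue from a bad deploy lingers long after
# the incident unless it is cleaned up manually.
Rails.cache.write('some:cached:key', payload, expires_in: 2.weeks)

# A shorter TTL lets the same residue age out on its own shortly after a fix.
Rails.cache.write('some:cached:key', payload, expires_in: 8.hours)
```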
Corrective Actions
- delivery#916 (closed)
- delivery#909 (closed)
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10580
- Allow for rollbacks of deployments to gitlab.com