2020-06-10: Elevated web latency - Incident Review
Summary
For a period of 3 hours and 38 minutes, between 2020-06-10 13:35 UTC and 2020-06-10 17:13 UTC, customers experienced increased latency when using GitLab.com. The latency was caused by a sizeable increase in I/O related to local disk utilization and by saturation of Puma workers on nodes of the web fleet. The root cause of the increase in disk utilization was a deployment containing an MR (gitlab-org/gitlab!33943 (merged)) which inadvertently used the local disk for caching instead of the dedicated Redis cache cluster. A first patch was developed, but it failed in staging because it introduced errors causing 503s, which blocked its deployment to production. A second patch was developed and deployed, correctly redirecting caching activity to the Redis cache cluster.
- Service(s) affected: GitLab.com web, api, and git servers
- Team attribution: Plan:Project Management
- Minutes downtime or degradation: 218 minutes
Metrics
Metric 1 - Degradation in Apdex
Metric 2 - Puma worker threads
Customer Impact
100% of GitLab.com users experienced slower load times for a period of 3 hours and 38 minutes (between 2020-06-10 13:35 UTC and 2020-06-10 17:13 UTC).
Incident Response Analysis
- How was the event detected?
- The event was detected by an alert from PagerDuty on elevated latency.
- How could detection time be improved?
- The alert threshold could be lowered so that latency degradation is detected sooner.
- Alarms on Redis cache utilization based on deviations from normal usage may have indicated an issue.
- How did we reach the point where we knew how to mitigate the impact?
- Investigation into recent deployments quickly identified the MR that needed to be patched.
- How could time to mitigation be improved?
- The ability to roll back entire deployments to GitLab.com would have cut mitigation time by hours.
Post Incident Analysis
- How was the root cause diagnosed?
- An investigation into the recent deployment began right away. Assistance from the Development on-call helped narrow the diagnosis down to the MR.
- How could time to diagnosis be improved?
- Smaller deployment packages would make it faster to identify which change caused a regression.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
- Yes; it was triggered by a deployment containing MR gitlab-org/gitlab!33943 (merged), linked in the summary and timeline.
Timeline
- 2020-06-10 13:28 UTC: Deployment (containing affecting MR gitlab-org/gitlab!33943 (merged)) begins to GitLab.com web nodes. (https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/1292274)
- 2020-06-10 13:35 UTC: Graphs show increasing latency in the GitLab.com web fleet.
- 2020-06-10 13:46 UTC: EOC is paged for component_apdex_ratio_burn_rate_slo_out_of_bounds_upper and begins investigating, seeing an increase in latency in the web fleet.
- 2020-06-10 13:46 UTC: EOC opens this incident issue and declares the incident. Degradation in Apdex (a derived metric of latency) is moderate, but stable (see Metric 1 above).
- 2020-06-10 13:55 UTC: The deploy proceeds; web nodes are becoming saturated on Puma worker threads (see Metric 2 above).
- 2020-06-10 14:00 UTC: Inspection of dashboards indicates that the latency degradation is correlated with high root disk I/O, and that the root filesystem is filling up on each node. This is unusual for stateless nodes.
- 2020-06-10 14:12 UTC: IMOC joins the Zoom call after the incident severity is raised to ~S2 and asks the CMOC to join the call as well to begin customer communications.
- 2020-06-10 14:22 UTC: IMOC notifies the Development on-call via Slack.
- 2020-06-10 14:25 UTC: Development on-call joins the incident call.
- 2020-06-10 14:30 UTC: Development on-call inspects the diff of what was deployed and identifies the MR (gitlab-org/gitlab!33943 (merged)) as the likely culprit. They explain that switching the cache to local disk explains the sudden latency degradation ("dropped" cache), as well as the slow, imperfect warm-up (the cache is local to each web node and not shared).
- 2020-06-10 ~14:30 UTC: Team begins exploring strategies to mitigate disk space consumption.
- 2020-06-10 15:00 UTC: EOC shift change, previously engaged EOC remains on the call.
- 2020-06-10 15:14 UTC: It is determined that there are roughly 3 hours until disk space is 95% utilized. Investigation into mitigation strategies continues while a patch is developed.
- 2020-06-10 15:30 UTC: A patch rollout is attempted on staging. In testing on staging, it is determined that the patch generates 503 errors, and deployment to production is prevented. Work on a new patch begins.
- 2020-06-10 17:13 UTC: The second patch is successfully applied to production, ending the period of elevated latency.
- 2020-06-11 08:09 UTC: Infrastructure team is notified of cache inconsistency issues for long-TTL keys. This is mitigated by dropping the cache. Discussion in #2263 (closed).
5 Whys
- There was high latency and disk I/O on the web nodes. Why?
- Cache entries that were normally stored in Redis were being written to local disk.
- Why did the cache start getting saved on disk?
- When config/initializers/7_redis.rb was run, which should have set the cache store to Redis, it no longer had any effect.
- Why?
- Some code checking feature flags triggered an access to Rails.cache before the cache store was set up, resulting in Rails falling back to the default, which is the file store (a minimal sketch of this ordering hazard appears after this list).
- Why?
- In gitlab-org/gitlab!33943 (merged), we moved the initialization of Rails' config.cache_store, because the Redis pool size now depends on reading a value from gitlab.yml, which can only be done after config/initializers/1_settings.rb is run.
- Why was this not caught locally?
- Manual testing wasn't fully conducted for the caching aspect of this change.
- Why was this not caught on staging?
- We don't have adequate load testing or smoke testing on staging. When the QA test passed, we continued with the deployment.
- Why was this caught on dev.gitlab.org (#2259 (comment 358954870)) but not investigated as a serious issue?
- It was being investigated by Development, but investigations on dev.gitlab.org include no step that asks "Is the code being investigated on dev.gitlab.org in the process of being deployed, or already deployed, to GitLab.com?", and there is no "andon cord" to pull when a breaking issue is being investigated on dev.gitlab.org that would prevent deployments and notify Infrastructure about errors which may have escaped into production.
- Why was a patch necessary?
- It is not presently possible to safely conduct a rollback of deployments to gitlab.com.
- Why is it not safe to conduct rollbacks?
- Deployments can contain database migrations and code dependent on the migration outcome, creating a one-way door.
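To make the second and third whys concrete, here is a minimal sketch of the ordering hazard under a simplified initializer layout; the file names, the hypothetical feature-flag read, and the simplified Redis configuration are illustrative assumptions, not GitLab's actual code.

```ruby
# Minimal sketch only; file names and the feature-flag read are hypothetical.
# Rails runs files in config/initializers/ in lexical order.

# --- config/initializers/4_feature_flags.rb (hypothetical) ---
# This runs before 7_redis.rb. At this point the cache store has not been
# pointed at Redis yet, so touching Rails.cache builds a cache against the
# default file store, and subsequent Rails.cache calls on this node keep
# writing to local disk instead of the Redis cache cluster.
Rails.cache.fetch('feature_flag:example_flag') { false }

# --- config/initializers/7_redis.rb (simplified) ---
# The cache-store assignment had to move here, after 1_settings.rb has
# loaded gitlab.yml (on which the Redis pool size now depends), but by now
# the file-store-backed cache above is already in use, so this assignment
# no longer takes effect.
Rails.application.config.cache_store = :redis_cache_store,
  { url: ENV.fetch('REDIS_CACHE_URL', 'redis://localhost:6379/0') }
```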
Lessons Learned
- The git and api servers were affected too, but this was not realized until well into the cleanup effort.
- We are missing alerting/testing on staging to catch significant variances in things like disk space utilization.
- This problem occurred on dev.gitlab.org before the deployment to production, but it wasn't identified as an issue indicating the need to stop deployment pipelines and evaluate whether the code on dev.gitlab.org had been deployed to GitLab.com.
- Two weeks is quite a long TTL; a lower one would have prevented post-incident cache clean-up activities (see the sketch after this list).
- Patching is expensive in terms of time spent to mitigate.
- An ineffective patch in staging, which needed to be rolled back before an effective patch could be applied, made this even more expensive.
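As a purely illustrative sketch of the TTL point above (the key, value, and durations are arbitrary, not GitLab's actual cache keys or settings), a shorter expiry bounds how long misplaced or stale entries survive:

```ruby
payload = { example: true } # arbitrary cached value, for illustration only

# A two-week TTL means cache residue from a bad deploy lingers long after
# the incident unless it is cleaned up manually.
Rails.cache.write('some:cached:key', payload, expires_in: 2.weeks)

# A shorter TTL lets the same residue age out on its own shortly after a fix.
Rails.cache.write('some:cached:key', payload, expires_in: 8.hours)
```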
Corrective Actions
- delivery#916 (closed)
- delivery#909 (closed)
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10580
- Allow for rollbacks of deployments to gitlab.com