2020-10-30: haproxy errors on https-git backends - action cable related
Summary
From approximately 09:30 to 10:26 UTC on 2020-10-30 we saw increased error rates for the Git HTTPS service. The cause was increased memory pressure after the action cable feature flag was enabled in production, as part of delivery#1210 (closed). The feature flag had been enabled the previous day, but with additional traffic workhorse hit its memory limit (currently set to 150MB); workhorse memory normally sits around 50MB.
To resolve the issue we disabled the feature flag.
https://thanos-query.ops.gitlab.net/graph?g0.range_input=4d&g0.max_source_resolution=0s&g0.expr=max%20(container_memory_working_set_bytes%7Benv%3D~%22gprd%22%2C%20id!%3D%22%2F%22%2Cpod%3D~%22%5Egitlab-(cny-)%3Fwebservice.%24%22%2Cnode%3D~%22%5E.%24%22%2C%20container%3D%22gitlab-workhorse%22%2C%20namespace%3D%22gitlab%22%7D)%20by%20(region)&g0.tab=0&g1.range_input=4d&g1.max_source_resolution=0s&g1.expr=sum(kube_replicaset_spec_replicas%7Breplicaset%3D~%22%5Egitlab-webservice.%22%2C%20cluster%3D~%22gprd-us-east1-b.%22%7D)&g1.tab=0
It is also worth noting that as we started seeing memory issues with workhorse (top graph), we were scaling down pods. We think this may have been a contributing factor, though we have kept the pod minimum the same since the incident and have not seen the large memory consumption again.
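For readers who cannot open the dashboard, the two queries behind the graphs can be recovered by URL-decoding the link above. The following is a minimal sketch using only the Python standard library; the helper name `panel_expressions` is our own and is not part of any tooling mentioned in this report. It prints the selectors exactly as they are encoded in the link, without correcting or expanding them.

```python
from urllib.parse import parse_qs, urlsplit

# Dashboard link copied verbatim from this report.
URL = (
    "https://thanos-query.ops.gitlab.net/graph?g0.range_input=4d"
    "&g0.max_source_resolution=0s"
    "&g0.expr=max%20(container_memory_working_set_bytes%7Benv%3D~%22gprd%22"
    "%2C%20id!%3D%22%2F%22%2Cpod%3D~%22%5Egitlab-(cny-)%3Fwebservice.%24%22"
    "%2Cnode%3D~%22%5E.%24%22%2C%20container%3D%22gitlab-workhorse%22"
    "%2C%20namespace%3D%22gitlab%22%7D)%20by%20(region)"
    "&g0.tab=0&g1.range_input=4d&g1.max_source_resolution=0s"
    "&g1.expr=sum(kube_replicaset_spec_replicas%7Breplicaset%3D~%22%5Egitlab-webservice.%22"
    "%2C%20cluster%3D~%22gprd-us-east1-b.%22%7D)&g1.tab=0"
)


def panel_expressions(url: str) -> dict[str, str]:
    """Return the decoded query expression for each graph panel (g0, g1, ...)."""
    params = parse_qs(urlsplit(url).query)
    return {
        key.split(".", 1)[0]: values[0]
        for key, values in params.items()
        if key.endswith(".expr")
    }


for panel, expr in panel_expressions(URL).items():
    print(f"{panel}: {expr}")
```

The first panel (g0) is the workhorse container working-set memory by region, and the second (g1) is the webservice replica count, which is the pairing the paragraph above describes.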
Timeline
All times UTC.
2020-10-29
- 08:42 - actioncable feature flag enabled
- 18:36 - alert on higher than normal error rates for git fetches
2020-10-30
- 09:33 - cfurman declares incident in Slack.
- 10:06 - actioncable feature flag disabled
- 10:26 - error rates return to normal
Incident Review
Summary
- Service(s) affected: git https / websockets
- Team attribution: Infrastructure
- Minutes downtime or degradation: 60 minutes
Metrics
Customer Impact
- Who was impacted by this incident? all users of git https
- What was the customer experience during the incident? failures for a subset of git https connections
- How many customers were affected? Approximately 6% of customers making requests using git https
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected? Alerting
- How could detection time be improved? Better dashboarding on container memory utilization
- How did we reach the point where we knew how to mitigate the impact? Noticed that workhorse was failing to start due to memory
- How could time to mitigation be improved? Once the root cause was found, we were able to mitigate quickly
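The dashboarding gap mentioned above comes down to comparing each container's memory against its own limit rather than looking only at the pod total. Below is a minimal Python sketch of that comparison. The 150MB workhorse limit and the `gitlab-workhorse` container name come from this report; the pod name, the `puma` container, all usage figures, the second container's limit, and the 80% threshold are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: the report's "MB" is treated as MiB here


@dataclass(frozen=True)
class ContainerSample:
    pod: str
    container: str
    working_set_bytes: int
    limit_bytes: int


def near_limit(samples, threshold=0.8):
    """Return the samples whose working set is at or above `threshold` of their own limit."""
    return [
        s for s in samples
        if s.limit_bytes > 0 and s.working_set_bytes / s.limit_bytes >= threshold
    ]


def pod_ratio(samples):
    """Return summed usage divided by summed limits, i.e. the pod-level view."""
    used = sum(s.working_set_bytes for s in samples)
    limit = sum(s.limit_bytes for s in samples)
    return used / limit


# Hypothetical readings for one pod; only the 150MB limit is taken from the report.
samples = [
    ContainerSample("webservice-example", "gitlab-workhorse", 140 * MB, 150 * MB),
    ContainerSample("webservice-example", "puma", 2000 * MB, 4000 * MB),
]

print("pod-level ratio:", round(pod_ratio(samples), 2))
for s in near_limit(samples):
    print("near limit:", s.pod, s.container)
```

With these made-up numbers the pod as a whole looks comfortable while the workhorse container is close to its limit, which is the kind of signal a pod-level graph can hide.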
Post Incident Analysis
- How was the root cause diagnosed? Examining metrics and checking the status of the cluster
- How could time to diagnosis be improved? Better dashboarding on container memory utilization
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? Yes, gitlab-org/charts/gitlab#2334 (closed) to isolate action cable traffic to its own node pool
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change? Yes, a feature flag change, part of delivery#1210 (closed)
5 Whys
Lessons Learned
- We might have avoided this if we had been more cautious with the feature-flag rollout, for example by using an incremental, percentage-based rollout. Some of these process improvements are being discussed in gitlab-org/gitlab#260342 (closed)
- For anything that may impact memory utilization we should monitor workhorse and puma separately. In this case our memory tracking was at the pod level, so we missed the additional memory utilization on workhorse
- For new workloads like this, we should try to isolate as much as possible by using a separate deployment or even an isolated node pool. For this change it wasn't possible, but it will be as soon as this is added to charts in gitlab-org/charts/gitlab#2334 (closed)
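To make the first lesson concrete, here is a minimal Python sketch of a percentage-based rollout using deterministic hashing. This is a generic illustration and not how GitLab feature flags are implemented; the function names, the `actor_id` concept, and the use of SHA-256 are all our assumptions.

```python
import hashlib


def bucket(flag: str, actor_id: str) -> int:
    """Map a (flag, actor) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{actor_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def enabled(flag: str, actor_id: str, percentage: int) -> bool:
    """Return True if the flag is on for this actor at the given rollout percentage."""
    return bucket(flag, actor_id) < percentage


# Hypothetical actors; the flag name is used only as a hash input.
actors = [f"user-{n}" for n in range(10_000)]
for pct in (1, 10, 50, 100):
    share = sum(enabled("actioncable", a, pct) for a in actors) / len(actors)
    print(f"{pct:>3}% target -> {share:.1%} of actors enabled")
```

Because each actor always lands in the same bucket, raising the percentage only adds actors and never removes them, so memory and error-rate graphs can be checked at each step before going further.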
Corrective Actions
- Improve dashboarding for k8s delivery#1286 (closed)
- Isolate actioncable to its own node pool gitlab-org/charts/gitlab#2334 (closed)
- Disable actioncable on git https pods gitlab-com/gl-infra/k8s-workloads/gitlab-com!496 (merged)
- Increase limits for workhorse (while we continue to investigate) gitlab-com/gl-infra/k8s-workloads/gitlab-com!497 (merged)
- Improve Guidance around Progressive Feature Rollouts gitlab-org/gitlab#260342 (closed)
- HPA configuration for webservice delivery#1338 (closed)