2021-01-19 - Elevated Error rates across fleet, workhorse
GitLab.com is currently stable, but we are continuing to investigate the root cause of the error spikes which occurred at 15:07 and again at 18:28 UTC. This could be related to a Puma issue, and we are engaging more teams to look at the conditions that arose around the times of the errors.
All times UTC.
14:02 - A deployment to canary starts https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/429843 with this diff https://gitlab.com/gitlab-org/security/gitlab/compare/d06ebade1b4...1194e3e3c4d
16:05 - A deployment to prod starts https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/429843 with this diff https://gitlab.com/gitlab-org/security/gitlab/compare/d06ebade1b4...1194e3e3c4d
17:08 - Errors start increasing.
17:21 - @alex declares incident in Slack.
17:26 - Web deployment finishes.
17:36 - Second large error spike.
17:41 - Error rates return to normal.
18:14 - Deployment finishes.
18:27 - Errors spike again.
18:29 - Error rates return to normal.
- 09:00 - `"ENABLE_PUMA_NAKAYOSHI_FORK": "false"` deployed to cny.
- 10:10 - `"ENABLE_PUMA_NAKAYOSHI_FORK": "false"` deployed to all of gprd.
More information will be added as we investigate the issue.
"ENABLE_PUMA_NAKAYOSHI_FORK": "false"env vars after we got a fix deployed in code, as this is a delta between staging and gprd now (here and here).
Cleanup elastic cloud watcher, or decide if we want to keep this alert in some way: gitlab-com/runbooks!3135 (merged)
Between 2021-01-19 17:08 and 18:29 UTC, GitLab.com experienced a rise in error rates. The underlying cause has been determined to be an issue in an upstream library (Hamlit) triggered by an upstream change in Ruby. The incident was mitigated by setting an environment variable that disables this new behavior until upstream Hamlit has been patched.
- Service(s) affected: web and api
- Team attribution:
- Time to detection: 13 minutes
- Minutes downtime or degradation: ~81 minutes (17:08–18:29 UTC)
Web and API graphs showing an increase in error rates:
Who was impacted by this incident? (i.e. external customers, internal customers)
- all GitLab.com users
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- Users may have received 500 errors or experienced slow requests
How many customers were affected?
- There were 82,592 errors during this time period, measured against the same window in the previous and following weeks. Of these errors, 26,878 came from unique IP addresses.
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
- The error rate was about 1%
What were the root causes?
- Puma was segfaulting because of memory issues relating to a `GC.compact` call in Hamlit.
`GC.compact` attempts to defragment memory by moving data from one heap slot to another. This helps reduce memory bloat, since new heap pages don't need to be allocated if enough contiguous slots are available. Ruby gems that use C extensions have two options (see the sketch after this list):
- Handle these "move" events.
- Pin the pointers so that the garbage collector will not attempt to move this data.
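A minimal sketch of what these two options can look like for a C extension object that holds a Ruby reference (illustrative names, not Hamlit's actual code; assumes Ruby 2.7+, which added `rb_gc_mark_movable`, `rb_gc_location`, and the `dcompact` callback):

```c
#include <ruby.h>

struct wrapper { VALUE cached; };  /* C struct holding a reference to a Ruby object */

/* Option 1: handle "move" events. Mark the reference as movable, then update
 * it in the compaction callback when GC.compact relocates the object. */
static void wrapper_mark(void *ptr)
{
    struct wrapper *w = ptr;
    rb_gc_mark_movable(w->cached);          /* mark, but allow the object to be moved */
}

static void wrapper_compact(void *ptr)
{
    struct wrapper *w = ptr;
    w->cached = rb_gc_location(w->cached);  /* fetch the object's new address */
}

static const rb_data_type_t movable_wrapper_type = {
    "wrapper(movable)",
    { wrapper_mark, RUBY_DEFAULT_FREE, NULL, wrapper_compact },
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};

/* Option 2: pin. rb_gc_mark() marks *and* pins, so the compactor never moves
 * the referenced object and no compaction callback is needed. */
static void wrapper_mark_pinned(void *ptr)
{
    struct wrapper *w = ptr;
    rb_gc_mark(w->cached);
}

static const rb_data_type_t pinned_wrapper_type = {
    "wrapper(pinned)",
    { wrapper_mark_pinned, RUBY_DEFAULT_FREE, NULL, NULL },
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

Pinning is the simpler fix, at the cost of leaving those slots unmovable and therefore slightly reducing how much compaction can achieve.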
In Hamlit, there were a number of statically-defined pointers to constants (e.g. `=`) that were not pinned. A call to `rb_const_get()` to look up these pointers would occasionally fail with a segfault. While we were not able to confirm that these pointers had been moved by the garbage collector, the Hamlit maintainer worked around the segfaults by removing the `rb_const_get()` calls and pinning the values. This appears to have solved the problem, although one remaining `rb_const_get()` call was removed later.
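For illustration only (this is not Hamlit's actual source; the cached constant and function names below are made up), the unsafe pattern and the two ways out look roughly like this:

```c
#include <ruby.h>

/* Unsafe pattern: cache the result of a constant lookup in a static C variable.
 * The GC is never told about this reference, so GC.compact is free to move the
 * object behind it; the stale VALUE can then point at a stale heap slot and
 * later use of it can segfault. */
static VALUE cached_helpers;  /* hypothetical cached constant */

static VALUE
unsafe_lookup(void)
{
    if (!cached_helpers) {
        cached_helpers = rb_const_get(rb_cObject, rb_intern("Comparable"));
    }
    return cached_helpers;  /* may reference a moved object after GC.compact */
}

/* Fix A: drop the cache and look the constant up every time it is needed. */
static VALUE
safe_lookup(void)
{
    return rb_const_get(rb_cObject, rb_intern("Comparable"));
}

/* Fix B: keep the cache, but register the static slot as a GC root so the
 * value stays alive and (in CRuby) is pinned during compaction. */
void
Init_example(void)
{
    cached_helpers = rb_const_get(rb_cObject, rb_intern("Comparable"));
    rb_gc_register_address(&cached_helpers);
}
```

Under these assumptions, the maintainer's workaround of removing the `rb_const_get()` calls and pinning the values corresponds roughly to Fix B.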
Incident Response Analysis
How was the incident detected?
- PagerDuty alerted the EOC to the issue.
How could detection time be improved?
- The issue was showing in Sentry before the alert; however, we do not alert on Sentry errors.
How was the root cause diagnosed?
- We had to go deep into the code and stacktraces to track down where these errors were coming from.
How could time to diagnosis be improved?
- This particular issue was quite tricky, as it wasn't readily apparent what was even happening. Even after we were able to get dev and others involved, it took a while to narrow the problem down to Hamlit and the underlying Ruby change.
How did we reach the point where we knew how to mitigate the impact?
- We had to dig into the GitLab code and onward into upstream libraries to trace the changes, before determining that we could set an environment variable to disable the new, broken behavior.
How could time to mitigation be improved?
- This was a very obscure issue that required lots of people to track down. There isn't really a way to improve mitigation time on this kind of deep, upstream issue.
What went well?
- We were able to get dev and others involved quickly.
- There was excellent collaboration from everyone in order to drill down very deep into an obscure problem.
Post Incident Analysis
Did we have other events in the past with the same root cause?
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- Not to my knowledge
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
- The issue was made manifest by a deployment, but it was not caused by a deployment.
- This went deeper into Ruby and its system interactions than SRE was really prepared to troubleshoot. We should consider additional training to deepen our Ruby knowledge.
- Feature flags would have let us test this change safely.