2023-08-16: Connection errors on gitaly causing 500s
Customer Impact
3.35% of Git requests resulted in 500 errors for the user.
Current Status
Issue was mitigated. Still waiting for deployments to mark it as resolved.
We are seeing memory buildup on file-81 that started on August 14th. The buildup resets when Gitaly nodes are redeployed, and it seems to coincide with gitlab-org/gitaly@bb342a59, where a new adaptive limiter was deployed to production systems. We also saw the adaptive limiter in some other metrics.
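The pattern described above (memory climbing steadily, then dropping back after a redeploy) can be checked numerically. The following is a minimal sketch, not the tooling used during this incident: the function names, the `drop_ratio` threshold, and the sample values are all hypothetical and only illustrate the idea of splitting a memory series at each restart and estimating the growth rate per segment.

```python
def split_on_restart(samples, drop_ratio=0.5):
    """Split (hour, memory) samples into segments at each sharp drop.

    A drop below `drop_ratio` of the previous sample is treated as a
    restart. The threshold is an assumption chosen for illustration.
    """
    segments, current = [], []
    for hour, mem in samples:
        if current and mem < current[-1][1] * drop_ratio:
            segments.append(current)
            current = []
        current.append((hour, mem))
    if current:
        segments.append(current)
    return segments


def growth_per_hour(segment):
    """Least-squares slope of memory over time for one segment."""
    if len(segment) < 2:
        raise ValueError("need at least two samples to estimate a slope")
    n = len(segment)
    mean_t = sum(t for t, _ in segment) / n
    mean_m = sum(m for _, m in segment) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in segment)
    den = sum((t - mean_t) ** 2 for t, _ in segment)
    return num / den


# Made-up samples (hour, memory in GiB); NOT measurements from file-81.
samples = [(0, 10), (1, 14), (2, 18), (3, 22), (4, 9), (5, 13), (6, 17)]
segments = split_on_restart(samples)
slopes = [growth_per_hour(seg) for seg in segments]
```

A consistently positive slope in every segment, with the series resetting only at restarts, is what a leak of this kind would be expected to look like.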
Next steps:
- Mitigation: To work around the memory leak, @sxuereb will set up a cron job that periodically (every hour) restarts Gitaly, by adding a recipe to the gitlab-server cookbook.
  - Create cron job 👉 https://gitlab.com/gitlab-cookbooks/gitlab-server/-/merge_requests/355
  - Release new version
    - Publish new version of cookbook 👉 https://ops.gitlab.net/gitlab-cookbooks/gitlab-server/-/pipelines/2223736
    - Update version in gstg 👉 https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3889
    - Update version in gprd 👉 https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3890
  - Enable cookbook on Gitaly fleet 👉 https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3891
  - Monitor that we are restarting Gitaly server in https://log.gprd.gitlab.net/app/r/s/85qJQ
- Gitaly Fix: @pks-gitlab is going to revert the change in Gitaly, and we will then deploy a new version (roll forward).
  - Revert Gitaly 👉 gitlab-org/gitaly!6236 (merged)
  - Update version to be deployed 👉 gitlab-org/gitlab!129585 (merged)
- Moving this incident to IncidentResolved:
  - @sabrams to confirm when gitlab-org/gitlab!129585 (merged) is in production
    - Coordinator pipeline 👉 https://ops.gitlab.net/gitlab-org/release/tools/-/pipelines/2224111
  - @sxuereb and @pks-gitlab to validate that memory leak is no longer happening
  - @stejacks-gitlab to disable cron job: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3896
📝 Summary for CMOC notice / Exec summary:
- Customer Impact: Elevated 500 errors on GitLab.com due to specific Gitaly nodes becoming unresponsive
- Service Impact: ServiceGitaly ServiceGit
- Impact Duration: 2023-08-16 23:23 UTC - 2023-08-17 00:35 UTC (72 minutes) and 2023-08-17 01:02 - 01:13 UTC (11 minutes)
- Root cause: The new adaptive limiter in Gitaly caused a memory leak. The problematic commit was reverted in gitlab-org/gitaly!6236 (merged)
- Corrective Action(s): see #16190 (closed)
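As a quick check of the impact durations quoted in the summary, the two windows can be recomputed from their timestamps. This is an illustrative sketch only: the helper name `window_minutes` is hypothetical, and the second window is assumed to fall entirely on 2023-08-17 UTC, as the summary implies.

```python
from datetime import datetime, timezone


def window_minutes(start: str, end: str) -> int:
    """Return the length of a UTC time window in whole minutes."""
    fmt = "%Y-%m-%d %H:%M"
    start_dt = datetime.strptime(start, fmt).replace(tzinfo=timezone.utc)
    end_dt = datetime.strptime(end, fmt).replace(tzinfo=timezone.utc)
    return int((end_dt - start_dt).total_seconds() // 60)


# Impact windows as stated in the summary (all times UTC).
windows = [
    ("2023-08-16 23:23", "2023-08-17 00:35"),
    ("2023-08-17 01:02", "2023-08-17 01:13"),
]
durations = [window_minutes(start, end) for start, end in windows]
total = sum(durations)
print(durations, total)  # [72, 11] 83
```

The two windows come to 72 and 11 minutes, matching the figures above, for a combined 83 minutes of customer impact.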
📚 References and helpful links
Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | Gitlab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks, Hot Patching or speeding up deployments. | Rollback Runbook | Hot Patch Runbook
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page. This might include the summary, timeline, or any other bits of information. Any such confidential data will be in a linked issue that is only visible internally. By default, all information we can share will be public, in accordance with our transparency value.