2023-06-27: Investigate `ResourceExhausted` error resulting from spawn token timeouts
Customer Impact
Users sometimes see 500 errors in pipelines or in operations such as git-fetch or git-clone. Depending on the git client, the error description may read something like `process spawn timed out after 5s`. Retrying the same operation with some backoff succeeds.
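As a client-side workaround, the retry-with-backoff behavior mentioned above could look roughly like the sketch below. It is an illustration only: `SpawnTimeoutError`, `retry_with_backoff`, and the delay values are hypothetical names and numbers chosen for this example, not part of any git client or GitLab API.

```python
import random
import time


class SpawnTimeoutError(Exception):
    """Hypothetical stand-in for a failure such as 'process spawn timed out'."""


def retry_with_backoff(operation, attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up.

    The wait doubles after each failure (capped at max_delay), plus random
    jitter so that many clients do not retry at the same instant.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except SpawnTimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter is injectable only so the function can be exercised without real waiting; in normal use the default `time.sleep` applies.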
Current Status
Customers have recently reported receiving `ResourceExhausted` errors caused by spawn token timeouts, which disrupt their workflows.
There is a timeout limit on acquiring spawn tokens for git commands. This limit was initially reduced gradually from 10s to 2s to mitigate a high-latency issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23577
The root cause of resource exhaustion during spikes in spawn token timeout errors has been hard to determine, because the spikes are generally very short-lived and may resolve on their own between scrapes of our monitoring system.
In response to customer reports, we increased both the maximum number of parallel processes and the spawn token timeout. However, this did not eliminate the errors completely.
We are investigating further to determine whether the current timeout and max-processes limits can be tuned to reduce the number of timeout errors without bringing down the whole server.
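To make the mechanism concrete, here is a minimal sketch of a token pool with an acquisition timeout. It is not the server's actual implementation: `SpawnTokenPool`, its parameters, and the `ResourceExhausted` exception class are hypothetical names for this example, and a plain counting semaphore stands in for whatever queueing the real server uses.

```python
import threading
from contextlib import contextmanager


class ResourceExhausted(Exception):
    """Hypothetical error raised when no spawn token frees up in time."""


class SpawnTokenPool:
    """Caps concurrent process spawns with a fixed number of tokens."""

    def __init__(self, max_parallel, timeout_seconds):
        # One semaphore slot per token; max_parallel is the concurrency cap.
        self._tokens = threading.BoundedSemaphore(max_parallel)
        self._timeout = timeout_seconds

    @contextmanager
    def token(self):
        # Wait at most timeout_seconds for a free token, then give up.
        if not self._tokens.acquire(timeout=self._timeout):
            raise ResourceExhausted(
                f"process spawn timed out after {self._timeout}s")
        try:
            yield
        finally:
            # Always hand the token back, even if the caller's work failed.
            self._tokens.release()
```

The two knobs in this sketch correspond to the two limits discussed in this section: a larger `max_parallel` lets more spawns proceed at once, while a longer `timeout_seconds` lets callers queue for longer before failing. Raising either reduces timeout errors at the cost of more load or more latency on the host.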
📚 References and helpful links
Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | GitLab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks, Hot Patching or speeding up deployments. | Rollback Runbook | Hot Patch Runbook
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue that is only visible internally. By default, all information we can share will be public, in accordance with our transparency value.