
Return ResourceExhausted instead of Internal for Spawn token timeout

For #5096 (closed)

This MR returns ResourceExhausted instead of Internal for Spawn token timeout. The official gRPC documentation (https://grpc.github.io/grpc/core/md_doc_statuscodes.html) clearly distinguishes between different status codes, in particular:

  • RESOURCE_EXHAUSTED: Some resource has been exhausted, perhaps a per-user quota, or perhaps the entire file system is out of space.
  • INTERNAL: Internal errors. This means that some invariants expected by the underlying system have been broken. This error code is reserved for serious errors.

The spawn token system, developed by our team, manages the underlying fork/exec operations. When a process cannot be spawned because no spawn token is available, that is a resource issue, which matches the definition of the ResourceExhausted error code. This kind of error is common in comparable scenarios, so it is a predictable outcome. The Internal error code should be reserved for unexpected failures.

One more thing: in Allow Gitaly to push back on traffic surges (&7891 - closed), I'm currently implementing a pushback feature for clients that encounter specific error codes. Once these errors are converted to ResourceExhausted, clients will retry transparently and automatically with exponential backoff. This should ease the load on Gitaly during traffic surges.

How would you be able to verify this change?

  • Set the spawn token settings to extremely low values: GITALY_COMMAND_SPAWN_TIMEOUT to 100ms and MaxParallel to 1.
  • Stress-test Gitaly via the API until an error is returned.
| Before | After |
| --- | --- |
| Gitaly returns the Internal response code. (Screenshot_2023-05-10_at_12.28.19) | Gitaly returns the ResourceExhausted response code. (Screenshot_2023-05-10_at_12.19.18) |
| gRPC logs don't include the spawn token error, although the final response error includes a portion of it. There is another dedicated log line for this kind of error. (Screenshot_2023-05-10_at_12.28.30) | The dedicated log line is removed. It's now merged into the gRPC logs. (Screenshot_2023-05-10_at_12.41.16) |
| API returns 503 Unavailable. It's due to Gitaly returning the Internal code. (Screenshot_2023-05-10_at_12.27.57) | API returns a 429 status with a nice message. It's the consequence of Propagate Gitaly ResourceExhausted errors to cl... (gitlab!119054 - merged). (Screenshot_2023-05-10_at_12.17.31) |
Edited by Quang-Minh Nguyen
