Corrective action: Investigate how to prevent FD exhaustion when a GitLab instance consistently returns 429 Too Many Requests

MR: Fix FD exhaustion during retry requests (!6041 - merged) • Ashvin Sharma • 18.9

In https://gitlab.com/gitlab-org/gitlab/-/issues/572149, an incorrectly implemented rate limit on the PATCH /jobs/:id/trace endpoint exhausted the available file descriptors on all runner managers, causing an S1 incident. It is not yet clear why the rate limiting led to FD exhaustion, so for the sake of future reliability we should investigate it, perhaps by reproducing the scenario locally with a GDK patched to always return 429 from that endpoint:

diff --git a/lib/api/ci/runner.rb b/lib/api/ci/runner.rb
index ce7c3f3bd8ee..9185ab293da8 100644
--- a/lib/api/ci/runner.rb
+++ b/lib/api/ci/runner.rb
@@ -274,6 +274,8 @@ class Runner < ::API::Base
           optional :debug_trace, type: Boolean, desc: 'Enable or Disable the debug trace'
         end
         patch '/:id/trace', urgency: :low, feature_category: :continuous_integration do
+          render_api_error!({ error: _('This endpoint has been requested too many times. Try again later.') }, 429)
+
           job = authenticate_job!(heartbeat_runner: true)
 
           error!('400 Missing header Content-Range', 400) unless request.headers.key?('Content-Range')

One thing to note is that the 429 response did not include a Retry-After header, so the runner had no server-provided hint for how long to back off.
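Because the server sent no Retry-After, any client-side retry logic has to supply its own backoff. Below is a minimal Ruby sketch of that decision; the helper name, base, and cap are illustrative assumptions, not the runner's actual code (the runner itself is written in Go):

```ruby
require "net/http"

# Illustrative helper (not the runner's actual code): pick a delay before
# retrying a 429 response. Honour Retry-After when the server sends it
# (delta-seconds form only); otherwise fall back to capped exponential
# backoff so retries slow down instead of tight-looping.
def retry_delay(response, attempt, base: 1.0, cap: 60.0)
  header = response["Retry-After"]
  return header.to_f if header && header.to_f > 0

  [base * (2**attempt), cap].min
end
```

With this shape, a server that omits Retry-After still gets bounded, progressively slower retries rather than a tight retry loop that can pile up open sockets.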

We could probably check resource usage (e.g. file descriptor usage) in the autoscaler, and terminate jobs that are waiting to report to GitLab if load gets too high.
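As a concrete starting point for such a check, here is a minimal Linux-only sketch that measures file-descriptor pressure for the current process; the helper is hypothetical, and since the autoscaler is written in Go this Ruby version only illustrates the idea:

```ruby
# Hypothetical helper: report this process's open file descriptors
# against its soft NOFILE limit (Linux-only, reads /proc/self/fd).
def fd_usage
  open_fds = Dir.children("/proc/self/fd").size
  soft_limit, = Process.getrlimit(:NOFILE)
  { open: open_fds, limit: soft_limit, ratio: open_fds.to_f / soft_limit }
end
```

A supervisor could refuse new work, or start terminating jobs stuck reporting to GitLab, once the ratio crosses some threshold (say, 0.8).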
