Consider new api deployment for route /api/v4/jobs/request
Overview
During incident production#4666 (closed), we noted that a specific endpoint, used only by runner traffic, relies on what we call a long poll. This forces workhorse to wait 50 seconds before responding when there is no work available for a runner. This throttling method prevents the runner from polling too often, which would drain both the runner and workhorse of resources.
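To make the long poll behavior concrete, here is a minimal sketch of the idea, not workhorse's actual implementation. It assumes an in-memory queue stands in for the real job source; the function name `request_job` and the 201/204 status codes are illustrative assumptions, and only the 50 second wait comes from the description above.

```python
import queue

# Long poll duration taken from the description above.
LONG_POLL_SECONDS = 50


def request_job(pending_jobs, timeout=LONG_POLL_SECONDS):
    """Hold the request open for up to `timeout` seconds waiting for a job.

    `pending_jobs` is a queue.Queue standing in for the real job source
    (an assumption made for illustration). Returns a (status, job) tuple.
    """
    try:
        # Blocks until a job arrives or the timeout expires.
        job = pending_jobs.get(timeout=timeout)
    except queue.Empty:
        # No work showed up within the long poll window (assumed status).
        return 204, None
    # A job was available, so respond immediately (assumed status).
    return 201, job
```

The point of the design is that an idle runner makes roughly one request per long poll window instead of one request per polling interval.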
Problem Statement
The downside is that we must configure terminationGracePeriod to a value larger than the long poll duration, so that existing connections waiting for a response are not interrupted. In general the runner handles an interruption gracefully and simply retries, but at the expense of our load balancers seeing a mass influx of HTTP 502s during Pod rotations.
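The constraint above is simple arithmetic, sketched here under stated assumptions. The helper names and the 10 second safety margin are hypothetical; only the 50 second long poll comes from this issue.

```python
def drains_cleanly(grace_period_seconds, long_poll_seconds):
    """True if a Pod given this grace period can finish any in-flight long poll."""
    # A request that started just before termination may still wait for the
    # full long poll duration, so the grace period must strictly exceed it.
    return grace_period_seconds > long_poll_seconds


def minimum_grace_period(long_poll_seconds, margin_seconds=10):
    """Smallest grace period we would consider, with an assumed safety margin."""
    return long_poll_seconds + margin_seconds
```

For example, with the 50 second long poll, `drains_cleanly(30, 50)` is false, so a 30 second grace period would cut off waiting runners and surface as 502s at the load balancer.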
Consideration
Since this is very specific to the endpoint /api/v4/jobs/request, consider creating a new specialty API deployment that handles ONLY this endpoint, with terminationGracePeriod set very high. We could then reduce the same value for the existing API deployment, and maybe get rid of the long poll configuration for workhorse.
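The split described above amounts to an exact-path routing rule. The sketch below illustrates the decision only; the deployment names `api-jobs-request` and `api` are hypothetical, and in practice this rule would live in the ingress or load balancer configuration rather than in application code.

```python
# The only path the dedicated deployment would serve (from this issue).
LONG_POLL_PATH = "/api/v4/jobs/request"


def pick_deployment(request_path):
    """Return the (hypothetical) deployment name that should serve this path."""
    # Ignore any query string and a trailing slash before comparing.
    path = request_path.split("?", 1)[0].rstrip("/")
    if path == LONG_POLL_PATH:
        # Dedicated deployment with a very high terminationGracePeriod.
        return "api-jobs-request"
    # Everything else stays on the existing API deployment.
    return "api"
```

Matching on the exact path keeps other job endpoints on the existing deployment, so only the long-polling traffic needs the long grace period.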
This proposal may change pending the outcome of the following: