Investigate Pod overcommit in Production
In infrastructure#10930 (comment 388472848) we determined that we are overcommitting our nodes. The impact is currently low, and we can rely on the Sidekiq reliable fetcher to recover jobs that get lost. But for some workloads this is unacceptable: on some shards it's fine if a job gets killed and is retried later (for example, `project_export`), whereas jobs on the `urgent-cpu-bound` shard are quite literally urgent and must run within a latency threshold. One way to quantify the overcommit per node is sketched below.
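As a starting point, something like the following could sum container limits per node and compare them to allocatable capacity (requests are capped by the scheduler, but limits are what actually overcommit a node). This is a minimal sketch that shells out to kubectl; it assumes a configured kubeconfig for the target cluster, ignores init containers, and only parses the common quantity suffixes:

```python
#!/usr/bin/env python3
"""Rough per-node overcommit check: sums container CPU/memory limits per
node and compares them to node allocatable. Sketch only; assumes kubectl
is configured against the target cluster."""
import json
import subprocess
from collections import defaultdict

def kubectl_json(*args):
    """Run a kubectl command and parse its JSON output."""
    out = subprocess.check_output(["kubectl", *args, "-o", "json"])
    return json.loads(out)

def parse_cpu(q):
    """Convert a Kubernetes CPU quantity ('250m', '2') to cores."""
    return int(q[:-1]) / 1000 if q.endswith("m") else float(q)

def parse_mem(q):
    """Convert a Kubernetes memory quantity ('128Mi', '1Gi', ...) to bytes."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
             "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return float(q[:-len(suffix)]) * factor
    return float(q)  # plain byte count

# Node name -> (allocatable CPU cores, allocatable memory bytes).
allocatable = {}
for node in kubectl_json("get", "nodes")["items"]:
    alloc = node["status"]["allocatable"]
    allocatable[node["metadata"]["name"]] = (
        parse_cpu(alloc["cpu"]), parse_mem(alloc["memory"]))

# Node name -> [summed CPU limits, summed memory limits] of running pods.
limits_by_node = defaultdict(lambda: [0.0, 0.0])
for pod in kubectl_json("get", "pods", "--all-namespaces",
                        "--field-selector=status.phase=Running")["items"]:
    node = pod["spec"].get("nodeName")
    if not node:
        continue
    for c in pod["spec"]["containers"]:
        limits = c.get("resources", {}).get("limits", {})
        limits_by_node[node][0] += parse_cpu(limits.get("cpu", "0"))
        limits_by_node[node][1] += parse_mem(limits.get("memory", "0"))

for node, (cpu_alloc, mem_alloc) in sorted(allocatable.items()):
    cpu_lim, mem_lim = limits_by_node[node]
    print(f"{node}: cpu limits {cpu_lim:.1f}/{cpu_alloc:.1f} cores "
          f"({cpu_lim / cpu_alloc:.0%}), mem limits "
          f"{mem_lim / 2**30:.1f}/{mem_alloc / 2**30:.1f} GiB "
          f"({mem_lim / mem_alloc:.0%})")
```

Nodes where the limit sum lands well above 100% of allocatable are the candidates to look at first; grouping the output by the node-pool label would map this back to the pools we need to resize.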
Use this issue to determine which node pools are oversubscribed. Investigate better memory requests as necessary, and spin up follow-up issues for any deeper investigations along the way. Also consider whether we can fix the Vertical Pod Autoscaler (VPA), since we should be able to rely on its recommendations to inform these decisions.
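On the VPA side, a quick way to see whether the recommender is producing usable numbers is to dump its current recommendations and compare them against the requests we have deployed. A minimal sketch, assuming the VPA CRD is installed and VPA objects exist for the relevant workloads (the comparison against deployed requests is left manual here):

```python
#!/usr/bin/env python3
"""Dump VPA target recommendations per container. Sketch only; assumes the
VPA CRD is installed and kubectl points at the target cluster."""
import json
import subprocess

out = subprocess.check_output(
    ["kubectl", "get", "vpa", "--all-namespaces", "-o", "json"])
for vpa in json.loads(out)["items"]:
    name = f'{vpa["metadata"]["namespace"]}/{vpa["metadata"]["name"]}'
    target_ref = vpa["spec"].get("targetRef", {})
    recs = (vpa.get("status", {})
               .get("recommendation", {})
               .get("containerRecommendations", []))
    if not recs:
        # An empty status is one symptom of a broken recommender.
        print(f"{name}: no recommendation (VPA recommender unhealthy?)")
        continue
    for rec in recs:
        print(f'{name} [{target_ref.get("kind", "?")}/'
              f'{target_ref.get("name", "?")}] '
              f'container={rec["containerName"]}: '
              f'target cpu={rec["target"].get("cpu")} '
              f'memory={rec["target"].get("memory")}')
```

If most objects report no recommendation, that points at the recommender itself being broken rather than the numbers merely being stale, which would shape where the VPA fix starts.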
Noting this as medium priority since restarts are not terribly frequent and we aren't triggering any SLA alerts at the moment, but we should be cognizant that we are actively impacting users.