Add Feature Flag to Disable Job Pruning with Auto-Scaler

Description

The customer is facing significant delays and high costs associated with acquiring new fleeting instances due to their custom compute platform. Specifically, when the main fleeting instance fails, all running jobs are terminated, and the nodes must be re-queued. This situation leads to job queues extending for more than a day. The current behavior in GitLab's autoscaler prunes jobs because there is no available state information about them. However, the customer is capable of handling state recovery internally on the node by checking the current state of running jobs, their lifetimes, and manually triggering a prune or recreate if necessary. This approach helps them avoid the risk of re-queuing all nodes in the pool simultaneously.

Proposal

Implement a feature flag to disable the pruning behavior, allowing customers to manage state recovery internally. This change would prevent the unnecessary termination of jobs and re-queuing of nodes.

Add Feature Flag to Disable Job Pruning with Auto-Scaler

Description

Proposal

Links to related issues and merge requests / references