Track instance health and remove unhealthy instances after exceeding a configurable threshold

As described at gitlab-org/ci-cd/shared-runners/infrastructure#73 (comment 1385207644), an instance may occasionally become unhealthy and unreachable. Taskscaler should track failures where an instance couldn't be reached or where some other failure occurred that was not related to job execution itself. After exceeding a threshold of failures (which should have a default value but be user configurable), such an instance should be marked for deletion.
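A minimal sketch of the failure tracking described above, not Taskscaler's actual implementation. All names here (`HealthTracker`, `RecordFailure`, `RecordSuccess`, `DefaultFailureThreshold`) and the default value of 3 are hypothetical. It also assumes that a successful interaction resets the count, which this issue does not specify.

```go
package main

import "sync"

// DefaultFailureThreshold is an assumed default; the issue only says
// that some default should exist and that users can override it.
const DefaultFailureThreshold = 3

// HealthTracker counts failures per instance that are unrelated to job
// execution (for example, the instance could not be reached).
type HealthTracker struct {
	mu        sync.Mutex
	threshold int
	failures  map[string]int
}

// NewHealthTracker returns a tracker using the given threshold, falling
// back to DefaultFailureThreshold when the value is not positive.
func NewHealthTracker(threshold int) *HealthTracker {
	if threshold <= 0 {
		threshold = DefaultFailureThreshold
	}
	return &HealthTracker{threshold: threshold, failures: make(map[string]int)}
}

// RecordFailure registers one failure for the instance and reports
// whether its failure count now exceeds the threshold, meaning the
// instance should be marked for deletion.
func (t *HealthTracker) RecordFailure(instanceID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failures[instanceID]++
	return t.failures[instanceID] > t.threshold
}

// RecordSuccess clears the failure count for the instance.
// Resetting on success is an assumption of this sketch.
func (t *HealthTracker) RecordSuccess(instanceID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.failures, instanceID)
}
```

With a threshold of 2, the first two calls to `RecordFailure` for an instance return `false` and the third returns `true`. Whether "exceeding" should mean strictly greater than the threshold, as here, or reaching it, is a design choice left open by the issue.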

When deleting an instance, we should record the reason for deletion and include it in metrics, so that we can observe how many deletions are due to exceeding max_use_count, how many are due to exceeding idle_time with no load to handle, and how many are due to detected instance health problems.
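The deletion reasons listed above could be modeled as a small set of labels with one counter per reason. This is a sketch only: the type and constant names (`RemovalReason`, `RemovalCounter`, and the reason strings) are hypothetical, and a plain in-memory counter stands in for whatever metrics library the project actually uses.

```go
package main

import "sync"

// RemovalReason labels why an instance was deleted.
// The names and string values are assumptions for illustration.
type RemovalReason string

const (
	ReasonMaxUseCount RemovalReason = "max_use_count" // max_use_count exceeded
	ReasonIdleTime    RemovalReason = "idle_time"     // idle_time exceeded with no load
	ReasonUnhealthy   RemovalReason = "unhealthy"     // failure threshold exceeded
)

// RemovalCounter counts deletions per reason. A real implementation
// would likely export these as a labeled metric instead.
type RemovalCounter struct {
	mu     sync.Mutex
	counts map[RemovalReason]int
}

// NewRemovalCounter returns an empty counter.
func NewRemovalCounter() *RemovalCounter {
	return &RemovalCounter{counts: make(map[RemovalReason]int)}
}

// Inc records one deletion for the given reason.
func (c *RemovalCounter) Inc(reason RemovalReason) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[reason]++
}

// Count returns how many deletions were recorded for the given reason.
func (c *RemovalCounter) Count(reason RemovalReason) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[reason]
}
```

Keeping the reason as a single label on one counter, rather than three separate counters, makes it straightforward to compare the share of deletions caused by health problems against the other two causes.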

Update 12/12/2023:

Before I introduce this change, which will kill non-viable instances, I want to put monitoring in place to track how often this happens. Our system is resilient and will probably hide some churn, so we need to proactively look for it. (This was the case with our bug that scaled above max instances, fixed in !35, "Do not scale above max instances", merged.)

Edited Dec 12, 2023 by Joe Burnett