Failed Instances consume ScaleSet capacity
Failed instances consume ScaleSet capacity; once the capacity is full, no new jobs are picked up. Looking at the code, I see there is no case for the "Failed" provisioning state. The other option would be to configure the ScaleSet with a Health Check that recreates the instance, but that doesn't seem like a good idea (and may not even work), since the ScaleSet is managed manually by the fleeting plugin.
https://gitlab.com/gitlab-org/fleeting/plugins/azure/-/blob/main/provider.go?ref_type=heads#L223
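For context, a quick way to confirm the behavior (a diagnostic sketch only; it assumes `az` is logged in and uses the same `$RESOURCE_GROUP`/`$SCALE_SET` variables as the cleanup commands further down):

```shell
# The ScaleSet's configured capacity (the slots the runner manager sees)
az vmss show --resource-group $RESOURCE_GROUP --name $SCALE_SET \
  --query "sku.capacity" -o tsv

# How many of those slots are occupied by instances stuck in "Failed"
az vmss list-instances --resource-group $RESOURCE_GROUP --name $SCALE_SET \
  --query "length([?provisioningState=='Failed'])" -o tsv
```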
Requesting an optional flag to delete instances in the "Failed" state.
For now I am going to look into creating a cron job that runs the cleanup commands below; a sketch of the script follows them.
Delete Failed Instances
List all instances in the scale set with the "Failed" provisioning state:

```shell
az vmss list-instances --resource-group $RESOURCE_GROUP --name $SCALE_SET \
  --query "[?provisioningState=='Failed']" -o table
```

Delete all instances with the "Failed" state:

```shell
az vmss delete-instances --resource-group $RESOURCE_GROUP --name $SCALE_SET \
  --instance-ids $(az vmss list-instances --resource-group $RESOURCE_GROUP --name $SCALE_SET \
    --query "[?provisioningState=='Failed'].instanceId" -o tsv)
```
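A minimal sketch of the cron approach, combining the two commands above into a script (the script name, schedule, and login method are assumptions, not a tested setup):

```shell
#!/usr/bin/env bash
# cleanup-failed-instances.sh: delete VMSS instances stuck in "Failed".
# Assumes an identity already logged in to az (e.g. a service principal
# or managed identity) and the same variables as the commands above.
set -euo pipefail

RESOURCE_GROUP="${RESOURCE_GROUP:?set RESOURCE_GROUP}"
SCALE_SET="${SCALE_SET:?set SCALE_SET}"

# Collect the instance IDs of all "Failed" instances (one per line).
FAILED_IDS=$(az vmss list-instances \
  --resource-group "$RESOURCE_GROUP" --name "$SCALE_SET" \
  --query "[?provisioningState=='Failed'].instanceId" -o tsv)

# delete-instances requires at least one ID, so skip the call when
# there is nothing to clean up.
if [ -n "$FAILED_IDS" ]; then
  # Intentionally unquoted: the IDs must be passed as separate arguments.
  az vmss delete-instances \
    --resource-group "$RESOURCE_GROUP" --name "$SCALE_SET" \
    --instance-ids $FAILED_IDS
fi
```

It could then run from crontab, e.g. every 15 minutes: `*/15 * * * * RESOURCE_GROUP=... SCALE_SET=... /usr/local/bin/cleanup-failed-instances.sh`.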