Failed Instances consume ScaleSet capacity
Failed instances consume ScaleSet capacity; once the capacity is full, no new jobs are picked up. Looking at the code, I see there is no case for the "Failed" provisioning state. The other option would be to configure the ScaleSet with a Health Check that recreates the instance, but that doesn't seem like a good idea (and may not even work), since the ScaleSet is managed manually by the fleeting plugin.
https://gitlab.com/gitlab-org/fleeting/plugins/azure/-/blob/main/provider.go?ref_type=heads#L223
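For context, a quick way to confirm the behavior (a diagnostic sketch only; it assumes `az` is logged in and uses the same `$RESOURCE_GROUP`/`$SCALE_SET` variables as the cleanup commands further down):

```shell
# The ScaleSet's configured capacity (the slots the runner manager sees)
az vmss show --resource-group $RESOURCE_GROUP --name $SCALE_SET \
  --query "sku.capacity" -o tsv

# How many of those slots are occupied by instances stuck in "Failed"
az vmss list-instances --resource-group $RESOURCE_GROUP --name $SCALE_SET \
  --query "length([?provisioningState=='Failed'])" -o tsv
```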
Requesting an optional flag to delete instances in the "Failed" state.
For now I am going to look into creating a cron job that runs the cleanup commands below; a sketch of the script follows them.
Delete Failed Instances
List all instances in the scale set with the "Failed" provisioning state:

```shell
az vmss list-instances --resource-group $RESOURCE_GROUP --name $SCALE_SET \
  --query "[?provisioningState=='Failed']" -o table
```

Delete all instances with the "Failed" state:

```shell
az vmss delete-instances --resource-group $RESOURCE_GROUP --name $SCALE_SET \
  --instance-ids $(az vmss list-instances --resource-group $RESOURCE_GROUP --name $SCALE_SET \
    --query "[?provisioningState=='Failed'].instanceId" -o tsv)
```
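A minimal sketch of the cron approach, combining the two commands above into a script (the script name, schedule, and login method are assumptions, not a tested setup):

```shell
#!/usr/bin/env bash
# cleanup-failed-instances.sh: delete VMSS instances stuck in "Failed".
# Assumes an identity already logged in to az (e.g. a service principal
# or managed identity) and the same variables as the commands above.
set -euo pipefail

RESOURCE_GROUP="${RESOURCE_GROUP:?set RESOURCE_GROUP}"
SCALE_SET="${SCALE_SET:?set SCALE_SET}"

# Collect the instance IDs of all "Failed" instances (one per line).
FAILED_IDS=$(az vmss list-instances \
  --resource-group "$RESOURCE_GROUP" --name "$SCALE_SET" \
  --query "[?provisioningState=='Failed'].instanceId" -o tsv)

# delete-instances requires at least one ID, so skip the call when
# there is nothing to clean up.
if [ -n "$FAILED_IDS" ]; then
  # Intentionally unquoted: the IDs must be passed as separate arguments.
  az vmss delete-instances \
    --resource-group "$RESOURCE_GROUP" --name "$SCALE_SET" \
    --instance-ids $FAILED_IDS
fi
```

It could then run from crontab, e.g. every 15 minutes: `*/15 * * * * RESOURCE_GROUP=... SCALE_SET=... /usr/local/bin/cleanup-failed-instances.sh`.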