Draft: fix(provider): Correctly map 'stopped' state to resolve autoscaler deadlock
Summary
This change resolves a critical deadlock in the autoscaler where the system would stop cleaning up failed instances and stop provisioning new ones. The root cause was that the plugin mapped the Scaleway stopped server state to Fleeting's provider.StateDeleting.
By correcting the mapping to the more accurate provider.StateStopped, we allow Fleeting's reconciliation logic to correctly identify and garbage-collect failed instances, restoring the self-healing capabilities of the autoscaler.
Problem
When a Scaleway instance fails to start or unexpectedly stops, its state becomes stopped. The plugin was incorrectly reporting this state to the Fleeting framework as provider.StateDeleting.
This caused a deadlock for the following reasons:
- Fleeting saw instances in StateDeleting and assumed a cleanup was already in progress.
- It would therefore never call the plugin's Decrease method for these instances, waiting forever for a deletion that was never initiated.
- The autoscaler's capacity calculations were frozen, as it believed these "deleting" instances still held capacity, preventing it from provisioning new, healthy instances to meet demand.
This behavior was observed in logs where the capacity snapshot remained stuck (e.g., instance_count=9, pending=7, reserved=7) and no cleanup actions were performed on the stopped instances.
Solution
This PR changes the state mapping in provider.go. The Scaleway states ServerStateStopped and ServerStateStoppedInPlace are now correctly mapped to provider.StateStopped.
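As a rough illustration of the corrected mapping, here is a minimal, self-contained sketch. It is not the actual contents of provider.go: the State and ServerState types are local stand-ins for the Fleeting and Scaleway SDK types, mapServerState is a hypothetical helper name, and the running case is an assumed mapping included only for contrast.

```go
package main

import "fmt"

// State stands in for the Fleeting provider state type.
// Assumption: defined locally for illustration instead of importing the real package.
type State string

const (
	StateRunning State = "running"
	StateStopped State = "stopped"
)

// ServerState stands in for the Scaleway SDK server state type (also an assumption).
type ServerState string

const (
	ServerStateRunning        ServerState = "running"
	ServerStateStopped        ServerState = "stopped"
	ServerStateStoppedInPlace ServerState = "stopped in place"
)

// mapServerState is a hypothetical helper. It reports ok=false for any
// state that this sketch does not cover.
func mapServerState(s ServerState) (State, bool) {
	switch s {
	case ServerStateStopped, ServerStateStoppedInPlace:
		// The fix: both stopped variants map to StateStopped, not StateDeleting.
		return StateStopped, true
	case ServerStateRunning:
		// Assumed mapping, not stated in this description.
		return StateRunning, true
	default:
		return "", false
	}
}

func main() {
	states := []ServerState{ServerStateStopped, ServerStateStoppedInPlace, ServerStateRunning}
	for _, s := range states {
		state, ok := mapServerState(s)
		fmt.Printf("%q -> %q (covered=%v)\n", s, state, ok)
	}
}
```

Returning a boolean alongside the state keeps the sketch from guessing at mappings this description does not mention.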
This simple change produces the correct behavior by relying on Fleeting's internal state machine:
- For an instance that was running (acquired): Fleeting sees the unexpected transition to StateStopped as an error and immediately schedules the instance for cleanup by calling Decrease.
- For an instance being created (acquiring): Fleeting correctly interprets StateStopped as a valid, temporary state while resources are being attached and waits for it to become Running (or for the acquisition to time out).
This resolves the deadlock, ensures that failed or stuck instances are properly garbage-collected during reconciliation, and allows the autoscaler to function correctly.
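The reconciliation behavior described above can be sketched as a small decision function. This is a simplified model for reviewers, not Fleeting's actual implementation: the Phase and Action types, the decide function, and the acquireTimedOut flag are all hypothetical names, and treating an acquisition timeout as a trigger for cleanup is an assumption.

```go
package main

import "fmt"

// State is a local stand-in for the provider state reported by the plugin.
type State string

const (
	StateRunning  State = "running"
	StateStopped  State = "stopped"
	StateDeleting State = "deleting"
)

// Phase is a hypothetical model of Fleeting's view of an instance.
type Phase string

const (
	PhaseAcquiring Phase = "acquiring"
	PhaseAcquired  Phase = "acquired"
)

// Action is a hypothetical description of what reconciliation does next.
type Action string

const (
	ActionNone     Action = "none"
	ActionWait     Action = "wait"
	ActionDecrease Action = "decrease"
)

// decide models the cases described in this PR.
func decide(phase Phase, state State, acquireTimedOut bool) Action {
	switch state {
	case StateDeleting:
		// Cleanup is assumed to be in progress, so Decrease is never called.
		// This is the branch that caused the deadlock for stopped servers.
		return ActionNone
	case StateStopped:
		if phase == PhaseAcquired {
			// A running instance that stops unexpectedly is cleaned up.
			return ActionDecrease
		}
		if acquireTimedOut {
			// Assumption: a timed-out acquisition is also cleaned up.
			return ActionDecrease
		}
		// Still being created: stopped is treated as temporary.
		return ActionWait
	default:
		return ActionNone
	}
}

func main() {
	fmt.Println("old mapping:", decide(PhaseAcquired, StateDeleting, false))
	fmt.Println("new mapping:", decide(PhaseAcquired, StateStopped, false))
	fmt.Println("during creation:", decide(PhaseAcquiring, StateStopped, false))
}
```

Under this model, the old mapping never reaches the Decrease branch for a stopped server, which matches the stuck capacity snapshot seen in the logs.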