Draft: fix(provider): Correctly map 'stopped' state to resolve autoscaler deadlock

Summary

This change resolves a critical deadlock in the autoscaler where the system would stop cleaning up failed instances and stop provisioning new ones. The root cause was that the plugin mapped the Scaleway stopped server state to Fleeting's provider.StateDeleting.

By correcting the mapping to the more accurate provider.StateStopped, we allow Fleeting's reconciliation logic to correctly identify and garbage-collect failed instances, restoring the self-healing capabilities of the autoscaler.

Problem

When a Scaleway instance fails to start or unexpectedly stops, its state becomes stopped. The plugin was incorrectly reporting this state to the Fleeting framework as provider.StateDeleting.

This caused a deadlock for the following reasons:

  1. Fleeting saw instances in StateDeleting and assumed a cleanup was already in progress.
  2. It would therefore never call the plugin's Decrease method for these instances, waiting forever for a deletion that was never initiated.
  3. The autoscaler's capacity calculations were frozen, as it believed these "deleting" instances still held capacity, preventing it from provisioning new, healthy instances to meet demand.

This behavior was observed in logs where the capacity snapshot remained stuck (e.g., instance_count=9, pending=7, reserved=7) and no cleanup actions were performed on the stopped instances.

Solution

This PR changes the state mapping in provider.go. The Scaleway states ServerStateStopped and ServerStateStoppedInPlace are now correctly mapped to provider.StateStopped.
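
The corrected mapping can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual code in provider.go: the State type and its constants are local stand-ins for Fleeting's provider package, the Scaleway states are written as plain strings instead of the SDK's instance.ServerState type, and the function name mapServerState is hypothetical. Mapping running to StateRunning is an assumption; only the two stopped states are described in this MR.

```go
package main

import "fmt"

// State is a local stand-in for Fleeting's provider.State (assumption).
type State string

const (
	StateRunning  State = "running"
	StateStopped  State = "stopped"  // name taken from this MR's description
	StateDeleting State = "deleting" // what stopped servers were reported as before
)

// Scaleway server states as plain strings. The real plugin would use the
// SDK's typed constants (ServerStateStopped, ServerStateStoppedInPlace, ...).
const (
	ServerStateRunning        = "running"
	ServerStateStopped        = "stopped"
	ServerStateStoppedInPlace = "stopped in place"
)

// mapServerState is a hypothetical helper showing the corrected mapping.
// It returns ok=false for states this sketch does not cover.
func mapServerState(s string) (State, bool) {
	switch s {
	case ServerStateRunning:
		// Assumption: running servers are reported as running.
		return StateRunning, true
	case ServerStateStopped, ServerStateStoppedInPlace:
		// Before this change, these were reported as StateDeleting.
		return StateStopped, true
	default:
		return "", false
	}
}

func main() {
	for _, s := range []string{ServerStateRunning, ServerStateStopped, ServerStateStoppedInPlace, "locked"} {
		state, ok := mapServerState(s)
		fmt.Printf("scaleway=%q fleeting=%q covered=%v\n", s, state, ok)
	}
}
```

Returning a second boolean keeps the sketch from guessing how the plugin treats states that this MR does not mention.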

This small change produces the desired behavior because of how Fleeting's internal state machine handles StateStopped:

  • For an instance that was running (acquired): Fleeting sees the unexpected transition to StateStopped as an error and immediately schedules the instance for cleanup by calling Decrease.
  • For an instance being created (acquiring): Fleeting correctly interprets StateStopped as a valid, temporary state while resources are being attached and waits for it to become Running (or for the acquisition to time out).
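
The two cases above, together with the old deadlock, can be modeled with a small decision function. This is a toy model of the behavior described in this MR, not Fleeting's actual reconciliation code: the Phase, Action, and decide names are hypothetical, the State constants are local stand-ins, and calling Decrease once an acquisition has timed out is an assumption.

```go
package main

import "fmt"

// Local stand-ins for Fleeting's provider states (assumption).
type State string

const (
	StateRunning  State = "running"
	StateStopped  State = "stopped"
	StateDeleting State = "deleting"
)

// Phase is a hypothetical label for where an instance is in its lifecycle.
type Phase int

const (
	Acquiring Phase = iota // instance is still being created
	Acquired               // instance was running and in use
)

// Action is what the toy reconciler decides to do with an instance.
type Action string

const (
	ActionNone     Action = "none"
	ActionWait     Action = "wait"
	ActionDecrease Action = "decrease"
)

// decide models the behavior described above for a single instance.
func decide(phase Phase, state State, acquisitionTimedOut bool) Action {
	switch state {
	case StateDeleting:
		// Cleanup is assumed to be in progress, so Decrease is never called.
		// With the old mapping, stopped servers stayed here forever.
		return ActionWait
	case StateStopped:
		if phase == Acquired {
			// Unexpected stop of a running instance: schedule cleanup.
			return ActionDecrease
		}
		if acquisitionTimedOut {
			// Assumption: a timed-out acquisition is cleaned up.
			return ActionDecrease
		}
		// Still acquiring: stopped is treated as a temporary state.
		return ActionWait
	default:
		return ActionNone
	}
}

func main() {
	fmt.Println("old mapping, failed instance:", decide(Acquired, StateDeleting, false))
	fmt.Println("new mapping, failed instance:", decide(Acquired, StateStopped, false))
	fmt.Println("new mapping, still acquiring:", decide(Acquiring, StateStopped, false))
	fmt.Println("new mapping, acquire timeout:", decide(Acquiring, StateStopped, true))
}
```

In this model, only the StateStopped branch ever reaches Decrease, which is why reporting stopped servers as StateDeleting left them uncollected.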

This resolves the deadlock, ensures that failed or stuck instances are properly garbage-collected during reconciliation, and allows the autoscaler to function correctly.

Edited by zadkiel
