Ensure that dialing fleeting instance can be canceled

Arran Walker requested to merge ajwalker/fleeting-cancel-dial into main

What does this MR do?

In !4669 (merged) we implemented WithContext support, which cancels the build if the instance it is running on disappears.

That support extended to ConnectInfo() and was also intended to cover dialing the actual instance, but it didn't.

This MR fixes that and treats cancelations of the dial as a build error so that we don't retry. This works fine for now, as these cancelations are not recoverable. In the future, it might be better to wrap fleeting/taskscaler errors with an interface that indicates whether the context.Cause error can be recovered from. That's a larger change, so for now this is a bit of a hack, but it works.

What's the best way to test this MR?

  • Set up fleeting without idle scale (so instances are provisioned as a job comes in). This typically makes dialing the instance take longer (waiting for the instance to start and SSH to be set up)
  • Kill the instance out-of-band (in the autoscaler UI for example)
  • The instance being deleted should cancel the dialing of the instance

What are the relevant issue numbers?
