Fix instance termination

Alejo Carballude requested to merge feature/fix_instance_termination into develop

Related tasks

Context

Instance termination sometimes failed to delete EC2 instances because it used MachineInstance records to get the instance IDs. The failure affected simulations that got a valid response when requesting EC2 instances but were not given enough machines to launch the simulation. MachineInstance records were created right after receiving a response from EC2, before the returned values were validated, and were not deleted before retrying. As a result, the EC2 termination process attempted to terminate EC2 instances that no longer existed, and failed.

Because MachineInstance records were used to delete EC2 instances, EC2 instances were sometimes not properly deleted when a launch was retried. The most common sign of this error was finding two Gazebo instances for the same simulation. With two instances carrying the same Kubernetes labels, the copy pod was sometimes placed on a different instance from its target pod, so the copy pod's target directory was empty and it uploaded empty logs.

Change

Instances are now deleted using EC2 tags rather than MachineInstance records. MachineInstance records are now created only after the launched EC2 instances have been validated, so the records reflect the simulation's actual EC2 instances.
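
For illustration only, tag-based termination could look roughly like the sketch below. It is not the code in this merge request. It assumes a boto3-style EC2 client (`describe_instances` with `Filters`, `terminate_instances` with `InstanceIds`). The tag key `simulation-id` and both function names are hypothetical, since the description does not name the tags actually used.

```python
from typing import Any, List

# Hypothetical tag key; the merge request does not say which tag identifies a simulation.
SIMULATION_TAG = "simulation-id"

# Only look at instances that can still be terminated.
LIVE_STATES = ["pending", "running", "stopping", "stopped"]


def find_instance_ids(ec2: Any, simulation_id: str) -> List[str]:
    """Return the IDs of live EC2 instances tagged with the given simulation.

    `ec2` is assumed to behave like a boto3 EC2 client. Pagination
    (NextToken) is ignored here to keep the sketch short.
    """
    response = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{SIMULATION_TAG}", "Values": [simulation_id]},
            {"Name": "instance-state-name", "Values": LIVE_STATES},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def terminate_simulation_instances(ec2: Any, simulation_id: str) -> List[str]:
    """Terminate every live instance tagged for the simulation and return their IDs."""
    instance_ids = find_instance_ids(ec2, simulation_id)
    if instance_ids:
        # EC2 is the source of truth here, so stale database records
        # cannot point the call at instances that no longer exist.
        ec2.terminate_instances(InstanceIds=instance_ids)
    return instance_ids
```

Because the instance IDs come from EC2 itself at termination time, a retry that left behind outdated MachineInstance records would not affect which instances get terminated.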

Other information

Additional documentation

Edited by Alejo Carballude