Fix instance termination
Related tasks
- [cloudsim] Fix simulation termination EC2 instance not found error
- [cloudsim] Fix empty Gazebo server copy pod logs directory
Context
Instance termination sometimes failed to delete EC2 instances because it used MachineInstance records to get the instance IDs. The failure affected simulations that received a valid response when requesting EC2 instances but were not given enough machines to launch. MachineInstance records were created right after receiving the response from EC2, before the returned values were validated, and were not deleted before retrying. As a result, the EC2 termination process attempted to terminate instances that no longer existed, and failed.
Because MachineInstance records were used to delete EC2 instances, instances were sometimes not properly deleted when retrying. The most common sign of this error was finding two Gazebo instances for the same simulation. With two instances carrying the same Kubernetes labels, the copy pod was sometimes placed on a different instance from its target pod, so the copy pod's target directory was empty and it uploaded empty logs.
Change
Instances are now deleted using EC2 tags rather than MachineInstance records. MachineInstance records are now created only after the launched EC2 instances are validated, ensuring the records reflect the simulation's EC2 instances.
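
The two changes can be illustrated with a minimal Python sketch. It is not the code in this change: the tag key `cloudsim-simulation-id`, the helper names, and the in-memory `FakeEC2` class are all assumptions made for illustration. `FakeEC2` only mimics the call shapes of boto3's `describe_instances` (with a `tag:<key>` filter) and `terminate_instances`, so the sketch runs without AWS credentials.

```python
from dataclasses import dataclass, field

# Hypothetical tag key; the real tag used by cloudsim is not stated here.
SIM_TAG = "cloudsim-simulation-id"


@dataclass
class FakeEC2:
    """In-memory stand-in for an EC2 client, for illustration only."""
    instances: dict = field(default_factory=dict)  # instance ID -> tags

    def describe_instances(self, Filters):
        # Keep instances whose tags satisfy every "tag:<key>" filter.
        matches = [
            {"InstanceId": iid}
            for iid, tags in self.instances.items()
            if all(tags.get(f["Name"][len("tag:"):]) in f["Values"] for f in Filters)
        ]
        return {"Reservations": [{"Instances": matches}] if matches else []}

    def terminate_instances(self, InstanceIds):
        for iid in InstanceIds:
            if iid not in self.instances:
                raise KeyError(f"instance not found: {iid}")
            del self.instances[iid]


def terminate_by_tag(ec2, sim_id):
    """Terminate whatever EC2 currently reports for the simulation tag."""
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag:" + SIM_TAG, "Values": [sim_id]}]
    )
    ids = [i["InstanceId"] for r in resp["Reservations"] for i in r["Instances"]]
    if ids:  # nothing to terminate is not an error
        ec2.terminate_instances(InstanceIds=ids)
    return ids


def record_after_validation(launched_ids, required, records):
    """Create records only once the launch result has been validated."""
    if len(launched_ids) < required:
        return False  # not enough machines: leave no records behind
    records.extend(launched_ids)
    return True
```

Because the instance IDs come from EC2 itself at termination time, a retry cannot try to terminate an instance that no longer exists, and an instance without a record is still found through its tag.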