`run_inspection_cluster` leaves cluster running if no messages found
Summary
`run_inspection_cluster` reads messages and checks that they correspond to valid `AwsMachineImage` and `MachineImage` objects in our database. If `run_inspection_cluster` either receives no messages or none of the messages has a matching object in the database, an early return is triggered:
if not relevant_messages:
    # Early return if nothing actually needs inspection.
    return
However, by the time `run_inspection_cluster` has been called, `scale_up_inspection_cluster` must also have completed successfully, meaning the inspection cluster has been scaled up but not yet given a task to run. Because `run_inspection_cluster` returns early, nothing scales the cluster down. Normally the subsequent task `persist_inspection_cluster_results_task` does that, but if there are no results to process, it does not call `scale_down_cluster`.
That behavior in `persist_inspection_cluster_results_task` is correct and important because we do not want it to shut down the cluster while an inspection is still running. My initial recommendation is to change the code quoted above in `run_inspection_cluster` to call `scale_down_cluster` instead of simply returning.
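A minimal sketch of that recommendation, not the actual cloudigrade code: the message shape (an `ami_id` key) and the injected callables `image_exists`, `start_inspection`, and `scale_down_cluster` are hypothetical stand-ins so the example runs on its own.

```python
from typing import Callable, Iterable, List


def run_inspection_cluster_sketch(
    messages: Iterable[dict],
    image_exists: Callable[[str], bool],
    start_inspection: Callable[[List[dict]], None],
    scale_down_cluster: Callable[[], None],
) -> bool:
    """Start an inspection if any message is relevant; otherwise scale down.

    Returns True when an inspection was started, False otherwise.
    All parameters are hypothetical stand-ins for the real task's dependencies.
    """
    # Keep only messages whose image still exists in the database.
    relevant_messages = [m for m in messages if image_exists(m["ami_id"])]

    if not relevant_messages:
        # Nothing needs inspection. The cluster was already scaled up by
        # scale_up_inspection_cluster, so release it before returning.
        scale_down_cluster()
        return False

    start_inspection(relevant_messages)
    return True


# Usage with stand-in callables that record what happened.
events = []
started = run_inspection_cluster_sketch(
    messages=[{"ami_id": "ami-759bc50a"}, {"ami_id": "ami-065d0be9883aa064a"}],
    image_exists=lambda ami_id: False,  # both images were deleted
    start_inspection=lambda msgs: events.append("start"),
    scale_down_cluster=lambda: events.append("scale_down"),
)
print(started, events)
```

Scaling down only on the early-return path leaves the normal path untouched, so `persist_inspection_cluster_results_task` remains the only place that scales down after a real inspection.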
Another idea to consider, suggested by @werwty, would be to not discard messages when we no longer have the original `AwsMachineImage` objects, but to let them proceed through inspection and discard the then-irrelevant records only when we process the results. This would allow us to simplify some of this workflow's "sanity checks."
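As a rough illustration of that alternative, the filtering could move to the result-processing step. The function name, the result payloads, and the `known_ami_ids` set below are all hypothetical; in the real code the lookup would go through the database.

```python
from typing import Dict, List, Set, Tuple


def split_results_by_known_images(
    results: Dict[str, dict], known_ami_ids: Set[str]
) -> Tuple[Dict[str, dict], List[str]]:
    """Separate inspection results into those to persist and those to drop.

    results maps an AMI ID to its (hypothetical) inspection payload.
    known_ami_ids stands in for a database lookup of existing images.
    """
    # Persist results only for images we still know about.
    kept = {
        ami_id: payload
        for ami_id, payload in results.items()
        if ami_id in known_ami_ids
    }
    # Everything else is irrelevant by now and can be discarded.
    discarded = sorted(ami_id for ami_id in results if ami_id not in known_ami_ids)
    return kept, discarded


kept, discarded = split_results_by_known_images(
    results={
        "ami-759bc50a": {"status": "inspected"},
        "ami-065d0be9883aa064a": {"status": "inspected"},
    },
    known_ami_ids={"ami-065d0be9883aa064a"},
)
print(kept, discarded)
```

With this arrangement every scaled-up cluster would always receive a task, so the results step would always run and could scale the cluster down as it does today.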
The steps to reproduce here are hypothetical based on what I've gathered from reading logs and code.
Steps to Reproduce
- Register a cloud account that has an AWS EC2 instance with an AMI that would normally be inspected and that cloudigrade does not yet know about.
- Wait for the async tasks to run that perform the image copy, volume, etc. operations.
- Before the scheduled `scale_up_inspection_cluster` runs, delete the cloud account and any of its image objects.
- Observe as scheduled task `scale_up_inspection_cluster` runs.
- Observe as scheduled task `persist_inspection_cluster_results_task` runs.
Expected Result
- Inspection cluster scales down.
Actual Result
- Inspection cluster never scales down.
Additional context
Here is a rough timeline of events that @infinitewarp saw in our QA environment when investigating this issue. Times are UTC.
2020-03-19 08:35:52,761 scale_up_inspection_cluster starts
2020-03-19 08:35:53,475 scale_up_inspection_cluster successfully completes
2020-03-19 08:35:53,506 run_inspection_cluster starts
2020-03-19 08:35:53,544 Skipping inspection; no AwsMachineImage for ami-759bc50a
2020-03-19 08:35:53,545 Skipping inspection; no AwsMachineImage for ami-065d0be9883aa064a
2020-03-19 08:35:53,546 run_inspection_cluster returns early, successfully completes
2020-03-19 08:39:20,021 persist_inspection_cluster_results_task starts
2020-03-19 08:39:30,783 persist_inspection_cluster_results_task successfully completes
...
2020-03-19 09:35:53,009 scale_up_inspection_cluster starts
2020-03-19 09:35:53,010 scale_up_inspection_cluster says cluster was not scaled down
2020-03-19 09:38:xx,xxx Reaper scaled down instance i-0da140c2425ebf2ee.
2020-03-19 10:35:53,153 scale_up_inspection_cluster does not scale up because no volumes found that need inspection
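To put a number on how long the cluster sat idle, the gap between the first scale-up completing and the next scheduled `scale_up_inspection_cluster` run can be computed from the two complete timestamps in the timeline. This is only a lower bound on the idle time, since the Reaper's exact timestamp is not recorded above. The format string assumes the log's `YYYY-MM-DD HH:MM:SS,mmm` layout.

```python
from datetime import datetime

# Matches timestamps such as "2020-03-19 08:35:53,475" from the timeline.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# scale_up_inspection_cluster successfully completes.
scaled_up = datetime.strptime("2020-03-19 08:35:53,475", LOG_TIME_FORMAT)
# The next scale_up_inspection_cluster run finds the cluster still up.
next_attempt = datetime.strptime("2020-03-19 09:35:53,009", LOG_TIME_FORMAT)

idle = next_attempt - scaled_up
print(f"Cluster idle for at least {idle.total_seconds():.3f} seconds")
```

That is a little under one hour with the cluster scaled up and nothing to run, before the Reaper stepped in a few minutes later.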