
`run_inspection_cluster` leaves cluster running if no messages found

Summary

run_inspection_cluster reads messages and checks that they correspond to valid AwsMachineImage and MachineImage objects in our database. If run_inspection_cluster receives no messages, or none of the messages correspond to objects in the database, an early return is triggered:

    if not relevant_messages:
        # Early return if nothing actually needs inspection.
        return

However, by the time run_inspection_cluster is called, scale_up_inspection_cluster must already have completed successfully, meaning the inspection cluster has been scaled up but not yet given a task to run. Since run_inspection_cluster returns early, nothing scales the cluster back down: normally the subsequent task persist_inspection_cluster_results_task does that, but when there are no results to process, it never calls scale_down_cluster.
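For context, the scale-down normally happens at the end of the results task, roughly along these lines (a simplified, hypothetical sketch; the helper names _fetch_inspection_results and _persist_results are placeholders, not the real function names):

    def persist_inspection_cluster_results_task():
        results = _fetch_inspection_results()  # e.g. read the inspection results queue
        if not results:
            # With no results we never reach the scale-down call below, so a
            # cluster that was scaled up but never given a task keeps running.
            return
        _persist_results(results)
        scale_down_cluster()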

That behavior in persist_inspection_cluster_results_task is correct and important because we do not want it to shut down the cluster while it's still running. My initial recommendation is to change the code quoted above in run_inspection_cluster to call scale_down_cluster instead of simply returning.
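Concretely, the fix could look something like this (a minimal sketch of the change to the snippet quoted above, assuming scale_down_cluster is importable in that module):

    if not relevant_messages:
        # Nothing actually needs inspection, but the cluster has already been
        # scaled up, so scale it back down before bailing out.
        scale_down_cluster()
        return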

Another idea, suggested by @werwty, would be not to discard messages when we no longer have the original AwsMachineImage objects, but to let them proceed through inspection and discard the then-irrelevant records only when we process the results. This would let us simplify some of this workflow's "sanity checks." A rough sketch follows.
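If we went that route, the filtering would move into result processing, roughly like this (a hypothetical sketch; the field name ec2_ami_id, the result structure, and save_inspection_result are illustrative assumptions, not the real code):

    for result in inspection_results:
        image = AwsMachineImage.objects.filter(ec2_ami_id=result["ami_id"]).first()
        if image is None:
            # The image was deleted while inspection was in flight; the result
            # is no longer relevant, so drop it here instead of filtering
            # messages up front in run_inspection_cluster.
            continue
        save_inspection_result(image, result)  # placeholder for the real persistence logic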

The steps to reproduce here are hypothetical, based on what I've gathered from reading logs and code.

Steps to Reproduce

  1. Register a cloud account that has an AWS EC2 instance with an AMI that would normally be inspected and that cloudigrade does not yet know about.
  2. Wait for the async tasks to run that perform the image copy, volume, etc. operations.
  3. Before the scheduled scale_up_inspection_cluster runs, delete the cloud account and any of its image objects.
  4. Observe as scheduled task scale_up_inspection_cluster runs.
  5. Observe as scheduled task persist_inspection_cluster_results_task runs.

Expected Result

  • Inspection cluster scales down.

Actual Result

  • Inspection cluster never scales down.

Additional context

Here is a rough timeline of events that @infinitewarp saw in our QA environment when investigating this issue. Times are UTC.

2020-03-19 08:35:52,761 scale_up_inspection_cluster starts
2020-03-19 08:35:53,475 scale_up_inspection_cluster successfully completes
2020-03-19 08:35:53,506 run_inspection_cluster starts
2020-03-19 08:35:53,544 Skipping inspection; no AwsMachineImage for ami-759bc50a
2020-03-19 08:35:53,545 Skipping inspection; no AwsMachineImage for ami-065d0be9883aa064a
2020-03-19 08:35:53,546 run_inspection_cluster returns early, successfully completes
2020-03-19 08:39:20,021 persist_inspection_cluster_results_task starts
2020-03-19 08:39:30,783 persist_inspection_cluster_results_task successfully completes
...
2020-03-19 09:35:53,009 scale_up_inspection_cluster starts
2020-03-19 09:35:53,010 scale_up_inspection_cluster says cluster was not scaled down
2020-03-19 09:38:xx,xxx Reaper scaled down instance i-0da140c2425ebf2ee.
2020-03-19 10:35:53,153 scale_up_inspection_cluster does not scale up because no volumes found that need inspection