Corrective Action: Improve safety around Terraform destroy operations (disks)

Summary

We should add safeguards against Terraform destroy operations that are triggered by accident.

An apply should fail if it would delete multiple critical resources that contain data (disks).


Suggested by @ahmadsherif during production#15997 (closed) after the accidental deletion of three recently created Gitaly nodes.

Luckily, we were able to recover most of the data relatively quickly, but with under 30 minutes of data loss on each node.

Related Incident(s)

Originating issue(s): production#15997 (closed)

Desired Outcome/Acceptance Criteria

  • Are we able to set this lifecycle setting for the data disk component?
  • If yes, what are the possible caveats (e.g., being unable to legitimately delete data disks via CI when desired)?
  • Decide on a path forward based on the answers to the previous questions.
  • Update the lifecycle in Terraform.
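
For reference while investigating the criteria above, here is a minimal sketch of what such a lifecycle setting could look like, assuming the setting in question is Terraform's `prevent_destroy` and that the data disks are `google_compute_disk` resources. The resource name and all attribute values are hypothetical and not taken from our actual configuration.

```hcl
# Hypothetical example: the resource type, names, and values below are
# assumptions for illustration, not our real Gitaly disk definition.
resource "google_compute_disk" "gitaly_data" {
  name = "gitaly-data-example"
  type = "pd-ssd"
  zone = "us-east1-c"
  size = 100 # GB

  lifecycle {
    # Terraform rejects any plan that would destroy this disk
    # and exits with an error instead of applying it.
    prevent_destroy = true
  }
}
```

Some known properties of `prevent_destroy` that bear on the caveats question: it blocks every destroy of the protected resource (a single disk as well as several), so it is stricter than "fail only when multiple disks are deleted". It must be a literal value and cannot be driven by a variable, so a legitimate deletion via CI would likely need a preceding change that removes or disables the setting. It also offers no protection if the whole resource block is removed from the configuration, because the setting is removed along with it.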

Associated Services

Corrective Action Issue Checklist

  • Link the incident(s) this corrective action arose from
  • Give context for what problem this corrective action is trying to prevent re-occurring
  • Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
Edited by Filipe Santos