Perform a single zonal cluster rebuild in staging
The number of dependencies identified in #2342 (closed) raised enough concern about our procedures and documentation that we may not have everything adequately documented to rebuild a cluster without issue. Use this issue to plan the steps needed to perform the following operations:
- Test turning down a cluster for a period of time - this test is primarily to determine which alerts will need to be silenced
- Create a 4th cluster to test booting new ones, and adapt the Terraform code if needed.
- Remove a zonal cluster
- Rebuild that same cluster. Let's avoid configuration changes and instead bring up a cluster as similar as possible to the one removed, so that we can test our documentation of the rebuild and improve it where needed
- Complete deployment of all objects and ensure that auto-deploy works as desired
While performing the above, create issues or MRs that target improvements to our documentation. Feed any learnings from the process into #2342 (closed) so that we can better understand which items to prioritize.
Milestones
- Test turning down a cluster (removing traffic) for a period of time
- Create a 4th cluster
- Deploy workloads
- CR is built with the intent to replace a cluster production#7598 (closed)
- A single cluster was indeed replaced
- Any necessary cleanup/new issues/MRs are completed
- Learnings from this exercise that provide feedback for #2342 (closed) are documented gitlab-com/runbooks!5041 (merged)
- Issues are created for various improvements inside of epic &776 (closed)
Edited by Ahmad Tolba