Improve Elasticsearch Index creation process and monitoring.
I discovered an issue with the nonprod cluster yesterday: the master nodes had hit capacity, which also caused one of them to fall over and become unresponsive. Ignoring the fact that nodes can become unresponsive when they hit the disk watermark... (Elastic things TM), I [opened an incident](https://app.incident.io/gitlab/incidents/7022) to remediate, and this has since been fixed.

There were two primary problems leading up to this:

- Disk saturation monitoring is not paging. It does alert in our `#g_infra_observability_alerts` channel, where it should have been noticed. However, that channel has been plagued with noise from other saturation alerts, so this one got lost.
- An index had been created but not configured correctly with ILM.
  - This can easily happen because the index is often created via an API request, while the ILM policy and index template/settings are added later via Runbooks. The initial index creation then often leaves the index without a correctly configured `rollover_alias`. In this case it led to the index growing to 600G on the hot tier, which in nonprod is also the master nodes.

From this we need two action items:

- [ ] Determine whether nonprod saturation should in fact be paging on-call.
  - [ ] Regardless of the answer, we need to clean up the noise in our alerts channel, as it has been ignored for a while.
- [ ] Provide a better mechanism for creating indexes in Elastic.
  - Ideally the shift to Terraform solves this: index creation is done via Terraform where possible, with no way for an index to end up created without the correct ILM and rollover configuration.
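To make the ILM problem concrete, here is a minimal Python sketch of the kind of check that could flag a misconfigured index before it grows. It is not our tooling: the function names and the `logs-write` / `logs-policy` values are hypothetical. It assumes settings are fetched in flat form (for example with `flat_settings=true`), and it relies only on the standard `index.lifecycle.name` and `index.lifecycle.rollover_alias` settings plus the `is_write_index` alias flag.

```python
import re
from typing import Any

POLICY_KEY = "index.lifecycle.name"
ALIAS_KEY = "index.lifecycle.rollover_alias"


def bootstrap_index_body(alias: str, policy: str) -> dict[str, Any]:
    """Build a create-index body for the first index of a rollover series.

    Hypothetical helper: the point is that the ILM policy, the rollover
    alias, and the write alias are all set in the same request.
    """
    return {
        "settings": {POLICY_KEY: policy, ALIAS_KEY: alias},
        "aliases": {alias: {"is_write_index": True}},
    }


def ilm_problems(
    index_name: str,
    settings: dict[str, Any],
    aliases: dict[str, Any],
) -> list[str]:
    """Return human-readable problems that would stop ILM rollover."""
    problems = []
    if not settings.get(POLICY_KEY):
        problems.append(f"{index_name}: {POLICY_KEY} is not set")
    alias = settings.get(ALIAS_KEY)
    if not alias:
        problems.append(f"{index_name}: {ALIAS_KEY} is not set")
    elif not aliases.get(alias, {}).get("is_write_index"):
        problems.append(f"{index_name}: alias {alias!r} is not its write alias")
    # Alias-based rollover expects names ending in a number, e.g. logs-000001.
    if not re.search(r"-\d+$", index_name):
        problems.append(f"{index_name}: name does not end in -<number>")
    return problems


if __name__ == "__main__":
    body = bootstrap_index_body("logs-write", "logs-policy")
    print(ilm_problems("logs-000001", body["settings"], body["aliases"]))
    # An index created by a bare API request, with no settings or aliases:
    for problem in ilm_problems("logs", {}, {}):
        print(problem)
```

A check like this could run against every index on the hot tier and report into the alerts channel. The Terraform approach would make it redundant by never allowing the bare-index case in the first place.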
issue