Corrective action: Update runbook for increasing Prometheus storage

Summary

The MR gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!3752 (merged) triggered the incident because of unexpected behaviour in the Prometheus operator.

The Prometheus operator does not follow the usual conventions for operating on common Kubernetes resources, so a change that is simple in plain Kubernetes resulted, when made through the Prometheus operator, in downtime across both Prometheus instances.

The outcome of this corrective action is to update all documentation and code so that everyone is aware that the usual Kubernetes change is not safe in this case, and to provide a safe runbook for performing the storage increase.
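
As a starting point for that runbook, here is a minimal sketch of one possible sequence. It is modelled on the volume-resize steps described in the upstream Prometheus Operator storage documentation, not on the procedure that will be adopted here. It only builds and prints `kubectl` commands and executes nothing. The function name `build_resize_plan`, the resource name `example`, the namespace `monitoring`, the size, and the PVC names are all hypothetical. It also assumes the operator labels its resources with `operator.prometheus.io/name`, and that the storage class allows volume expansion.

```python
import json
import shlex


def build_resize_plan(name, namespace, new_size, pvc_names):
    """Build kubectl commands (as argument lists) for a PVC-first resize.

    Assumed order, to be verified before use:
      1. pause the operator for this Prometheus and record the new size in the CR
      2. patch each existing PVC to the new size
      3. delete the StatefulSet but orphan its pods, so they keep running
      4. unpause, letting the operator recreate the StatefulSet
    """
    base = ["kubectl", "--namespace", namespace]
    # Assumed label; check it against the labels actually present in the cluster.
    selector = f"operator.prometheus.io/name={name}"
    storage = {"resources": {"requests": {"storage": new_size}}}

    pause_patch = {
        "spec": {
            "paused": True,
            "storage": {"volumeClaimTemplate": {"spec": storage}},
        }
    }
    resume_patch = {"spec": {"paused": False}}

    plan = [
        base + ["patch", f"prometheus/{name}", "--type", "merge",
                "--patch", json.dumps(pause_patch)],
    ]
    for pvc in pvc_names:
        plan.append(
            base + ["patch", f"pvc/{pvc}", "--patch", json.dumps({"spec": storage})]
        )
    plan.append(
        base + ["delete", "statefulset", "--selector", selector, "--cascade=orphan"]
    )
    plan.append(
        base + ["patch", f"prometheus/{name}", "--type", "merge",
                "--patch", json.dumps(resume_patch)]
    )
    return plan


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for cmd in build_resize_plan(
        "example", "monitoring", "100Gi", ["pvc-example-0", "pvc-example-1"]
    ):
        print(shlex.join(cmd))
```

Printing the plan rather than running it keeps the sketch reviewable. Where the Prometheus custom resource is managed through Helm or helmfiles, the storage size would also have to be changed there, which this sketch does not cover.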

Related Incident(s)

Originating issue(s): production#17184

Desired Outcome/Acceptance Criteria

Associated Services

Service::Prometheus in Production Engineering

Corrective Action Issue Checklist

  • Link the incident(s) this corrective action arose from
  • Give context for what problem this corrective action is trying to prevent re-occurring
  • Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
  • Assign a service label
Edited by Raúl Naveiras