Migrate gprd prometheus tsdb disks to SSD
Summary
We have had several recent incidents in which our production prometheus VMs locked up and could only be recovered with a hard reset.
There has been a recent effort to clean up the number of series/labels generated: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16777.
However, we are still seeing VMs hit a large disk latency spike that bottlenecks the system badly enough that it cannot recover.
While we still need to isolate the exact cause of this, migrating to SSD should help the VMs survive these sudden changes in workload.
The machines are still using standard persistent disks for their prometheus tsdb.
We should migrate this to SSD to help with IO consistency and latency on these systems.
Specifically:
prometheus-01-inf-gprd
prometheus-02-inf-gprd
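One possible shape for the migration is sketched below as a dry-run planner that only prints commands and executes nothing. It assumes the VMs run on GCP (where "standard persistent disk" corresponds to `pd-standard` and SSD to `pd-ssd`) and that a snapshot-and-swap approach is acceptable. The zone, the `-data` disk naming, and the snapshot/disk suffixes are hypothetical placeholders, not values taken from this issue. If these disks are managed through Terraform or another tool, the change would need to be made there instead.

```python
import shlex

# Hosts named in this issue.
HOSTS = ["prometheus-01-inf-gprd", "prometheus-02-inf-gprd"]


def plan_ssd_migration(host, zone, disk=None):
    """Return gcloud commands (as argv lists) to move one host's tsdb disk to pd-ssd.

    Nothing is executed here; the caller decides whether to print or run them.
    """
    disk = disk or f"{host}-data"   # ASSUMPTION: hypothetical disk name
    snapshot = f"{disk}-pre-ssd"    # ASSUMPTION: hypothetical snapshot name
    ssd_disk = f"{disk}-ssd"        # ASSUMPTION: hypothetical new disk name
    return [
        # Stop the VM so the snapshot is consistent.
        ["gcloud", "compute", "instances", "stop", host, "--zone", zone],
        # Snapshot the existing standard persistent disk.
        ["gcloud", "compute", "disks", "snapshot", disk,
         "--snapshot-names", snapshot, "--zone", zone],
        # Create an SSD persistent disk from that snapshot.
        ["gcloud", "compute", "disks", "create", ssd_disk,
         "--source-snapshot", snapshot, "--type", "pd-ssd", "--zone", zone],
        # Swap the disks on the instance. The device name may need to match
        # the old attachment if the mount depends on it.
        ["gcloud", "compute", "instances", "detach-disk", host,
         "--disk", disk, "--zone", zone],
        ["gcloud", "compute", "instances", "attach-disk", host,
         "--disk", ssd_disk, "--zone", zone],
        ["gcloud", "compute", "instances", "start", host, "--zone", zone],
    ]


if __name__ == "__main__":
    # ASSUMPTION: placeholder zone; replace with the real zone of each VM.
    for host in HOSTS:
        for cmd in plan_ssd_migration(host, zone="us-east1-c"):
            print(shlex.join(cmd))
```

Doing one host at a time would keep the other prometheus instance scraping while its peer is stopped.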
Related Incident(s)
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8047
production#8020 (closed)
production#7988 (closed)
production#7906 (closed)
Originating issue(s): gitlab-com/gl-infra/production#8047
Desired Outcome/Acceptance Criteria
Migrate the prometheus tsdb disks on both hosts to SSD to improve IO consistency and latency.
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4')

