Migrate gprd Prometheus TSDB disks to SSD

Summary

We have had several recent incidents in which our production Prometheus VMs locked up and could only be recovered with a hard reset.
There has been a recent effort to reduce the number of series and labels generated: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16777

However, we are still seeing cases where a VM hits a large disk latency spike, which bottlenecks the system badly enough that it cannot recover.
While we still need to isolate the exact cause, migrating to SSD should help the VMs survive these sudden changes in workload.

The machines are still using standard persistent disks for their Prometheus TSDB.
We should migrate these disks to SSD to improve IO consistency and latency on these systems.

Specifically:
  • prometheus-01-inf-gprd
  • prometheus-02-inf-gprd
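
As a rough illustration of what the migration involves per host, the sketch below builds (but does not run) the gcloud commands for a snapshot-and-swap approach: snapshot the existing disk, create a pd-ssd disk from the snapshot, then swap the attached disk. This is only one possible approach and not necessarily how it will be carried out; the change may well go through infrastructure-as-code tooling instead. The function name, the disk names, the `-pre-ssd` / `-ssd` suffixes, and the zone are all hypothetical placeholders, not the real values for these hosts.

```python
from typing import List


def plan_ssd_migration(instance: str, old_disk: str, zone: str) -> List[List[str]]:
    """Return the gcloud commands (as argv lists) for one host. Nothing is executed."""
    snapshot = f"{old_disk}-pre-ssd"  # hypothetical snapshot name
    new_disk = f"{old_disk}-ssd"      # hypothetical name for the new SSD disk
    return [
        # 1. Snapshot the existing standard persistent disk.
        ["gcloud", "compute", "disks", "snapshot", old_disk,
         f"--snapshot-names={snapshot}", f"--zone={zone}"],
        # 2. Create a pd-ssd disk from that snapshot.
        ["gcloud", "compute", "disks", "create", new_disk,
         f"--source-snapshot={snapshot}", "--type=pd-ssd", f"--zone={zone}"],
        # 3. Detach the old disk from the VM.
        ["gcloud", "compute", "instances", "detach-disk", instance,
         f"--disk={old_disk}", f"--zone={zone}"],
        # 4. Attach the new SSD disk in its place.
        ["gcloud", "compute", "instances", "attach-disk", instance,
         f"--disk={new_disk}", f"--zone={zone}"],
    ]


if __name__ == "__main__":
    # Hostnames come from this issue; disk name and zone are placeholders.
    for host in ["prometheus-01-inf-gprd", "prometheus-02-inf-gprd"]:
        for cmd in plan_ssd_migration(host, f"{host}-data", "example-zone"):
            print(" ".join(cmd))
```

Prometheus should be stopped before the snapshot so the TSDB is consistent on disk. The new disk also needs to appear under the same device name or mount point the VM expects, so the attach step may need extra options. Doing one host at a time keeps the other available for scraping and alerting during the swap.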

Related Incident(s)

  • https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8047
  • production#8020 (closed)
  • production#7988 (closed)
  • production#7906 (closed)

Originating issue(s): gitlab-com/gl-infra/production#8047

Desired Outcome/Acceptance Criteria

Migrate the Prometheus TSDB disks on both hosts to SSD to improve IO consistency and latency.

Associated Services

Service::Prometheus

Corrective Action Issue Checklist

  • Link the incident(s) this corrective action arose out of
  • Give context for what problem this corrective action is trying to prevent from re-occurring
  • Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • Assign a priority (this will default to 'Reliability::P4')
Edited by Nick Duff