New Staging track: Have the same monitoring as the current production alerts. We may not be able to set this up in the first iteration, but overall it would be great to have monitoring alerts. GET does have Prometheus and Grafana support, which is used in our performance tests (example 10k dashboard), and GET will support adding custom Grafana dashboards in v1.2.0.
@amyphillips Can you please provide an update for this issue?
Thank you for creating the epic &594 (closed) to implement standardized monitoring for all GET environments. Given that this particular issue will be the first iteration towards monitoring, can you please list the specific tasks that are required to complete it?
@amyphillips yes, we had a coffee chat a couple of weeks ago to discuss it, which resulted in the milestones listed in the epic. I think we can create a new issue for each one and maybe sync again to decide who does what.
Thanks @pguinoiseau, do you have the details needed to populate the issue descriptions too? It would be good to know how much work we're facing for this.
I've added the label admin details onto &594 (closed) to make it easy to label the issues too.
Where does this issue sit amongst the work? Do we have a first iteration of monitoring that we could start off with, or would you prefer to take a small set of the issues on &594 (closed) as the first iteration of staging-ref monitoring? cc @cindy for your thoughts here too.
We are removing the existing built-in (Helm chart) and Omnibus monitoring in favor of deploying a separate/dedicated chart with kube-prometheus-stack, then connecting the Prometheus instance to Thanos so we can leverage our existing metric storage and Grafana service.
Once that is in place, we can start looking at the metrics and Alertmanager.