Setup monitoring for staging-ref

New Staging track: Have the same monitoring as the current production alerts. We may not be able to set this up in the first iteration, but overall it would be great to have monitoring alerts. GET already has Prometheus and Grafana support, which is used in our performance tests (example: 10k dashboard), and GET will support adding custom Grafana dashboards in v1.2.0.

added to epic &559 (closed)

mentioned in epic &559 (closed)

assigned to @pguinoiseau

added staging-improvements label

added teamDelivery label

Editing this to be the first iteration of monitoring for Staging-10K. We'll follow up with more thorough monitoring, in a way that can be applied to all GET environments: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14054

mentioned in issue #14054 (closed)

changed the description

changed title from Setup monitoring for gstg-10k to Setup monitoring for staging-ref

changed epic to &594 (closed)

@amyphillips Can you please provide an update for this issue?

Thank you for creating the epic &594 (closed) to implement standardized monitoring for all GET environments. Given that this particular issue will be the first iteration towards monitoring, can you please list the specific tasks that are required to complete this issue?

@pguinoiseau did you and @cindy manage to sync on &594 (closed) and work out what will be achievable for this task?

cc/ @kwanyangu in case you have additional ideas for what a good first iteration might look like.

@amyphillips the list looks good for a first iteration.

Let me work with @cindy and @pguinoiseau to align on the same

@amyphillips yes we had a coffee chat a couple weeks ago to discuss it, which resulted in the milestones listed in the epic. I think we can create new issues for each one and maybe sync again to decide who does what.

@pguinoiseau great stuff. Would you be able to create the issues and add to this epic?

@amyphillips done.

Thanks @pguinoiseau, do you have the details needed to populate the issue descriptions too? It would be good to know how much work we're facing for this.

I've added the label admin details onto &594 (closed) to make it easy to label the issues too.

Where does this issue sit amongst the work? Do we have a first iteration of monitoring that we could start with, or would you prefer to treat a small set of the issues on &594 (closed) as the first iteration of staging-ref monitoring? cc/ @cindy for your thoughts here too

@amyphillips our squad will sync next week to discuss the epic.

marked this issue as related to gitlab-org/gitlab#338978 (closed)

mentioned in issue gitlab-org/gitlab#344223

removed the relation with gitlab-org/gitlab#338978 (closed)

mentioned in issue gitlab-org/gitlab#338978 (closed)

added teamReliability label and removed teamDelivery label

Created https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14726, #14727 (closed) and #14728 (closed) to set up the basic services required.

We are removing the existing built-in (Helm chart) and Omnibus monitoring in favor of deploying a separate/dedicated chart with kube-prometheus-stack, then connecting the Prometheus instance to Thanos so we can leverage our existing metric storage and Grafana service.

Once that is in place we can start looking at the metrics and alertmanager.

Sounds good. Thanks for creating the new issues and updating here @f_santos

@amyphillips Given that the above 3 issues are already in progress (1 completed), is it fair to assume this work will be delivered as part of %14.8?

@vincywilson &594 (closed) is progressing and I believe is on track for completion in 14.8. @cindy please shout if this is untrue!

cc/ @amoter

Thanks Amy

@amyphillips Based on the above comment, I have added %14.8 to this issue.

mentioned in epic &594 (closed)

changed milestone to %14.8

The work for the epic &594 (closed) has been completed.

Metrics and dashboards are available at dashboards.gitlab.net.
Monitoring documentation is in the runbook.

Closing this issue.

Thank you team for your awesome work here

closed

added workflow-infraDone label

mentioned in issue #24725 (closed)
