1.4 Implement Monitoring for Geo on Staging
The first step is a discovery item to establish exactly what needs to be monitored. We then need to decide how to monitor it, and finally implement the monitoring dashboards themselves.
Status
Geo Primary insights
Grafana: This was originally built for the primary. During the GCP migration we had two separate Prometheus environments (Azure and GCP). There's not a lot there that's really useful for seeing the status of Geo.
What we can change on this dashboard:
- Geo events: I'm seeing a total between 10k and 20k, but that is not reflected in the sub-types of events. We should add all types and make sure they add up to the total. This might require a code change if we realize we are not reporting all sub-types.
- Repo verification: we have 154 verified repos and wikis. This does not match the number of "checksummed" repos the Geo admin panel reports (3M+).
- We can probably delete all other graphs. They are irrelevant and are not showing anything at the moment.
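The "sub-types should add up to the total" check for Geo events can be sketched as below. This is only an illustration of the arithmetic: the function names and the event type names in the usage note are hypothetical, and the numbers are made up rather than read from the dashboard.

```python
from typing import Mapping, Tuple


def unaccounted_events(total: float, by_type: Mapping[str, float]) -> float:
    """Return how many events in the total are not covered by any sub-type."""
    return total - sum(by_type.values())


def check_event_breakdown(
    total: float, by_type: Mapping[str, float], tolerance: float = 0.0
) -> Tuple[bool, float]:
    """Return (ok, gap), where ok is True when the sub-types add up to the total.

    `tolerance` allows for small differences, for example when the total and
    the per-type counters are scraped at slightly different times.
    """
    gap = unaccounted_events(total, by_type)
    return abs(gap) <= tolerance, gap
```

For example, `check_event_breakdown(15000, {"type_a": 9000, "type_b": 1000})` reports a gap of 5000 events, which would point to sub-types that are either not graphed or not reported at all.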
Geo Secondary status
Grafana: This mostly monitors the secondary side of Geo operations.
- The graphs per type are looking mostly good. For uploads there are 48 failures, matching what the Geo admin panel says.
- We might be missing graphs for design repos.
- Are graphs for the container registry also missing?
- DB replication lag is showing NaN. We should double check.
- We're using deprecated single stat graphs and should migrate them. Grafana shows: "Gauge visualizations within the Singlestat panel are deprecated. Please migrate this panel to use the Gauge panel. NOTE: Sparklines are not supported in the gauge panel"
- There are some Sidekiq graphs which I don't think are relevant and we might want to delete.
Alerts
As far as I know we don't have alerts set up for Geo on staging. It would probably be useful to have at least a few alerts:
- When database replication lag is getting really high (this one might already exist, we should ask SREs)
- When failures are getting too high
I don't have any experience with Alertmanager, so I'm not sure what's needed to get this done.
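To make the two proposed conditions concrete, here is a minimal sketch of the decision logic in Python. It is not Alertmanager or Prometheus configuration. The thresholds, alert names, and function name are all assumptions for illustration, and real values would need to be agreed with the SREs. It also treats a missing or NaN lag value as its own condition, since the secondary dashboard currently shows NaN for DB replication lag.

```python
import math
from typing import List, Optional

# Hypothetical thresholds, chosen only for illustration.
MAX_REPLICATION_LAG_SECONDS = 300.0
MAX_FAILED_SYNCS = 100


def geo_alerts(replication_lag_seconds: Optional[float], failed_syncs: int) -> List[str]:
    """Return the names of the (hypothetical) alerts that would fire."""
    alerts = []
    if replication_lag_seconds is None or math.isnan(replication_lag_seconds):
        # No usable lag value: flag it rather than silently treating it as healthy.
        alerts.append("replication_lag_no_data")
    elif replication_lag_seconds > MAX_REPLICATION_LAG_SECONDS:
        alerts.append("replication_lag_high")
    if failed_syncs > MAX_FAILED_SYNCS:
        alerts.append("failures_high")
    return alerts
```

With these assumed thresholds, the current state of 48 upload failures and a NaN lag would raise only the "no data" condition, which is a reminder that the lag metric needs fixing before a lag alert can be trusted.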
Sentry
- We need to make sure code exceptions are reported to Sentry, including those happening on the secondary.
Looking at https://sentry.gitlab.net/gitlab/staginggitlabcom I don't see any errors from the secondary (a search for "secondary" didn't return anything). I guess the secondary is not set up to report errors.
Logs
- Ideally we should be able to access geo.log through https://nonprod-log.gitlab.net/
I don't remember how we had this configured during the GCP migration. I think geo.log was aggregated into pubsub-rails-inf-gstg at the time, but I think it would be better to have a separate pubsub for that.