
1.4 Implement Monitoring for Geo on Staging

The first step here is a discovery item to establish what exactly needs to be monitored (and how). Then we need to decide on how to monitor this, and ultimately implement the monitoring dashboards themselves.

Status

Grafana: Geo Primary insights

This was originally built for the primary. During the GCP migration we had two separate Prometheus environments (Azure and GCP). Not much on this dashboard is really useful for seeing the status of Geo.

What we can change on this dashboard:

Geo events: I'm seeing a total between 10k and 20k, but that is not reflected in the sub-types of events. We should add all types and make sure they add up to the total. This might require a code change if we realize we are not reporting all sub-types.
Repo verification: we have 154 verified repos and wikis. This does not match the number of "checksummed" repos the Geo admin panel reports (3M+).
We can probably delete all other graphs. They are irrelevant and are not showing anything at the moment.
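
The "add up to the total" check for Geo events can be sketched as a small comparison. This is only an illustration: the function name, the sub-type names, and the numbers are hypothetical and are not taken from the dashboard or from the Geo code.

```python
def unreported_events(total, by_subtype):
    """Return how many events in `total` are not covered by any sub-type count."""
    return total - sum(by_subtype.values())


# Hypothetical numbers and sub-type names, for illustration only.
counts = {"subtype_a": 9000, "subtype_b": 500}
gap = unreported_events(12000, counts)
if gap != 0:
    print(f"{gap} events are not attributed to any reported sub-type")
```

A non-zero gap would suggest that at least one sub-type is either missing from the graph or not being reported at all.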

Grafana: Geo Secondary status

This is mostly monitoring the secondary side of Geo operations.

The graphs per type are looking mostly good. For uploads there are 48 failures, matching what the Geo admin panel says
We might be missing graphs for design repos
Are graphs for the container registry missing as well?
DB replication lag is showing NaN. We should double check
We're using deprecated single stat graphs. We should migrate them:

Gauge visualizations within the Singlestat panel are deprecated. Please migrate this panel to use the Gauge panel.
NOTE: Sparklines are not supported in the gauge panel
There are some Sidekiq graphs which I don't think are relevant and we might want to delete.

Alerts

As far as I know we don't have alerts set up for Geo on staging. Maybe it will be useful to have at least a few alerts:

When database replication lag is getting really high (this one might already exist, we should ask SREs)
When failures are getting too high

I don't have any experience with Alert Manager. So I'm not sure what's needed to get this done.
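
Independent of how this ends up being configured in Alert Manager, the two conditions above can be sketched as plain logic. This is not Alert Manager configuration. The thresholds and names are assumptions made for illustration, and the real values would need to be agreed with the SREs. The sketch also treats a NaN lag value (as currently shown on the secondary dashboard) as a condition of its own, since a NaN never compares as greater than a threshold.

```python
import math

# Hypothetical thresholds, for illustration only.
MAX_REPLICATION_LAG_SECONDS = 300.0
MAX_FAILURES = 100


def geo_alerts(replication_lag_seconds, failures):
    """Return a list of alert messages for the given readings."""
    alerts = []
    if math.isnan(replication_lag_seconds):
        # NaN means the metric is not being reported, which a plain
        # "lag > threshold" comparison would silently miss.
        alerts.append("replication lag is not being reported")
    elif replication_lag_seconds > MAX_REPLICATION_LAG_SECONDS:
        alerts.append("replication lag is too high")
    if failures > MAX_FAILURES:
        alerts.append("too many failures")
    return alerts


print(geo_alerts(float("nan"), 48))
```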

Sentry

We need to make sure code exceptions are reported to Sentry, including those happening on the secondary.

Looking at https://sentry.gitlab.net/gitlab/staginggitlabcom I don't see any errors from the secondary (did a search on secondary, didn't return anything). I guess the secondary is not set up to report errors.

Logs

Ideally we should be able to access the geo.log through https://nonprod-log.gitlab.net/

I don't remember how we had this configured during the GCP migration. I think geo.log was aggregated into pubsub-rails-inf-gstg at the time, but I think it would be better to have a separate pubsub for that.

References

Handbook - Monitoring of GitLab.com
Geo Status Dashboard
GitLab Triage Dashboard
Runbooks - How to - Monitoring Overview
Grafana Dashboards repo
WIP: Add Geo DR to staging
GitLab Cookbooks / Chef Repo
Handbook - GitLab.com Production Architecture

Edited Mar 17, 2020 by Toon Claes
