Feedback from the Dedicated Team after creating an external monitoring solution for Geo
Hello from the Dedicated Team
What is this about?
In Dedicated, customers have a GitLab instance provisioned for them. One of the pillars of this tenancy is privacy. For that reason, the Dedicated Team has no access to the instances themselves.
Recently, we introduced Geo instances but quickly found ourselves unable to monitor them, since doing so requires access to the Geo Dashboard Page in the Admin panel, which we do not have.
We've now completed an Epic that enabled us to monitor the state of Geo secondaries without direct access to this Dashboard. Here are some of the struggles and insights we've had in the process. Hopefully they can be of value to you guys.
Grafana Dashboards
When looking for Grafana Dashboards, we came across these, but we also found this Epic that described them as outdated. We couldn't find the exact reason why they were outdated, so we tried deploying them expecting them not to work at all. Surprisingly, they did: they seem to produce correct values and appear Good Enough.
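In case it saves someone else the trial run: dashboards like these can generally be pushed to a Grafana instance via its HTTP API. This is a generic sketch, not GitLab's own process; the URL, token, and file name are placeholders:

# geo-dashboard.json is assumed to wrap the exported dashboard JSON as
# {"dashboard": { ... }, "overwrite": true}
$ curl -sX POST "$GRAFANA_URL/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @geo-dashboard.json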
Alerting
Having the dashboards in place is a good step towards inspecting current (and past) state, but it's the alerts that are the first signal that this inspection is necessary in the first place.
We looked for guidance here and couldn't find any: no recommended list of metrics to keep an eye on, nor the ranges they should fall within during normal operation.
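As a concrete illustration of the sort of guidance we were hoping for, below is a minimal sketch of an alerting rule one could start from, based on the replication-lag metric we discuss next. The threshold and duration are our own guesses, not an official recommendation:

groups:
  - name: geo-secondary
    rules:
      - alert: GeoDatabaseReplicationLagHigh
        # 600s is an assumed threshold; we found no recommended range
        expr: geo_db_replication_lag_seconds > 600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Geo secondary database replication lag is above 10 minutes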
Our first attempt was to monitor database write-ahead log lag (geo_db_replication_lag_seconds), but it quickly became apparent that in RDS this replication will never be a bottleneck. We tested this via sysbench:
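# note: the oltp_write_only tables must be created beforehand by running
# the same command with 'prepare' in place of 'run'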
$ sysbench oltp_write_only --db-driver=pgsql \
--table-size=100000 \
--tables=50 \
--threads=10 \
--pgsql-host=$HOST \
--pgsql-port=$PORT \
--pgsql-user=$DBUSER \
--pgsql-password=$DBPASS \
--pgsql-db=$DB \
run
and via GPT (the GitLab Performance Tool):
docker run -e ACCESS_TOKEN=<TOKEN> -v ~/gitlab/amp/10k.json:/10k.json -it gitlab/gpt-data-generator --environment /10k.json
The load may not be the highest, but what was curious is that we never saw this lag go above 58 seconds.
This leads us to believe there is some throttling involved, but we couldn't find any such setting in either the Geo or the RDS configuration.
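If anyone wants to dig further, one way to cross-check the exported metric would be to ask the replica directly, since PostgreSQL itself tracks the last replayed transaction. A sketch, with placeholder connection variables:

$ psql -h $REPLICA_HOST -p $PORT -U $DBUSER -d $DB \
  -c 'SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;'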
The behaviour was also erratic: multiple sysbench and GPT runs would yield completely different lag high-watermarks. Bizarrely, a few of the test runs showed no lag at all. In the end, we decided to abandon this metric altogether for alerting purposes, as we couldn't reason about its behaviour.
We then turned our attention to the values shown on the Geo Dashboard Page in the GitLab.com instances.
Ideally, we'd be able to alert on those values, but we had to resort to a combination of source-code inspection and trial-and-error to replicate them using the available metrics. We still haven't managed to replicate that "31 seconds" reliably. It would be nice to have the derivation of these dashboard values described in more detail somewhere.
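For reference, the closest approximations we can sketch from the documented Geo metrics are below, queried via the Prometheus HTTP API. Whether these match the dashboard's exact computation is our guess, and $PROMETHEUS_URL is a placeholder:

# approximate "repositories synced" percentage
$ curl -sG "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=geo_repositories_synced / geo_repositories * 100'
# our best guess at the lag figure: the gap between the last event
# generated on the primary and the last one processed by the secondary
$ curl -sG "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=geo_last_event_timestamp - geo_cursor_last_event_timestamp'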
N.B. This dashboard only shows Git and File synchronisation. We do not know if these two are enough for an overall assessment of system health.
Runbooks
Of course, all of this effort becomes most useful when there are clear, well-defined processes in place to resolve an issue once it becomes apparent that there is one.
We found the Troubleshooting Page to be quite complete. The problem we faced was identifying the triggers that point to each of the resolutions on that page. Ideally, there would be a comprehensive list of external signals (i.e. metric values and behaviours), each corresponding to a potential solution on that page, rather than having to hypothesise about what could be happening and then use the page to validate and fix. Log investigation is also an option, but that falls more into logging than into observability.
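To make that concrete, here is a hypothetical sketch of such a mapping, expressed as Prometheus alert annotations. The signal, threshold, and runbook link are illustrative choices of ours, not taken from the Troubleshooting Page:

groups:
  - name: geo-runbook-mapping
    rules:
      - alert: GeoRepositorySyncFailures
        # hypothetical trigger: repositories stuck in the failed state
        expr: geo_repositories_failed > 0
        for: 30m
        annotations:
          summary: Geo secondary has repositories failing to sync
          # point the responder straight at the matching resolution
          runbook_url: https://docs.gitlab.com/ee/administration/geo/replication/troubleshooting.html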
Conclusion
Hopefully, this is useful feedback for the Geo team. Having, in my opinion, so few pain points in monitoring a system as complex as Geo is a testament to the work your team has done.
Also, a special thank you to @mkozono and @lkorbasiewicz for proactively replying to my (many) queries in the g_geo Slack channel.