GET Monitoring: Preparing the Metrics Catalog for monitoring GET and other GitLab instances
The metrics-catalog is the name of the Jsonnet tooling we use to define service-level monitoring, saturation and utilization metrics on GitLab.com.
This tooling has evolved over the past 3 years or so, starting with hand-coded, YAML-based "general metrics" and growing over time into the generated, declarative tooling we use today.
However, this tooling is very much designed exclusively for GitLab.com.
With some upcoming projects, such as Project Horse, there is a need to extend this tooling to non-GitLab.com instances so that we can monitor them using the same tools.
A side-effect of this is that it may also be possible for other self-managed GitLab instances to start using this tooling.
Barriers to General GitLab Monitoring
- Label Taxonomy
- Thanos / Prometheus split
- Multiple Jsonnet Entrypoints
- Reusable library code outside of libsonnet
- Saturation metrics are linked to specific GitLab.com services
Label Taxonomy
The GitLab.com label taxonomy is hard-coded into the metrics-catalog at present. This includes our standard labels such as env, environment, tier, type, stage and shard. While these are critical for the operation of GitLab.com, smaller instances do not need them, and requiring their presence would be clunky.
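One way to remove the hard-coding is to pass the label taxonomy in as data. The following is a minimal sketch under that assumption: the `environmentLabels` and `serviceLabels` fields, the helper functions, and the `http_requests_total` metric are all hypothetical and are not taken from the metrics-catalog.

```jsonnet
// Hypothetical sketch: the label taxonomy is supplied as configuration
// rather than being hard-coded. None of these names come from the
// metrics-catalog itself.
local gitlabComTaxonomy = {
  environmentLabels: ['env', 'environment'],
  serviceLabels: ['tier', 'type', 'stage', 'shard'],
};

// A smaller instance that only distinguishes services by `type`.
local smallInstanceTaxonomy = {
  environmentLabels: [],
  serviceLabels: ['type'],
};

// Build the label list for a PromQL `sum by (...)` clause from whichever
// labels the taxonomy declares.
local aggregationLabels(taxonomy) =
  std.join(', ', taxonomy.environmentLabels + taxonomy.serviceLabels);

// `http_requests_total` is an illustrative metric name only.
local errorRateExpression(taxonomy) =
  'sum by (%s) (rate(http_requests_total{code=~"5.."}[5m]))' % [aggregationLabels(taxonomy)];

{
  gitlabCom: errorRateExpression(gitlabComTaxonomy),
  smallInstance: errorRateExpression(smallInstanceTaxonomy),
}
```

With this shape, the same expression template serves both instances, and a small instance never has to define labels such as stage or shard.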
Thanos / Prometheus split
Because of the scale of GitLab.com, we use a two-tier system for evaluating service-level indicator metrics (see https://www.youtube.com/watch?v=6sfr2IGJQXk for a talk on this subject). This architecture is reflected in our metrics structure. Smaller GitLab instances do not need both Thanos and Prometheus, and requiring both would not be a good operator experience. For this reason, the metrics catalog needs to deal with both "single-tier" and "two-tier" evaluation. Luckily, the aggregation sets concept that we already have should help with this.
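As an illustration of how aggregation sets could cover both modes, the sketch below chooses a list of aggregation sets based on an evaluation mode. It is an assumption-laden example, not the existing aggregation set definition: the `aggregationSet` fields, the `mode` values, and all recording rule names are hypothetical.

```jsonnet
// Hypothetical sketch of selecting aggregation sets by evaluation mode.
// Field names and recording rule names are illustrative only.
local aggregationSet(name, evaluator, sourceMetric, targetMetric, labels) = {
  name: name,
  evaluator: evaluator,  // which system would evaluate the rule
  sourceMetric: sourceMetric,
  targetMetric: targetMetric,
  labels: labels,
};

local aggregationSetsFor(mode) =
  if mode == 'two-tier' then [
    // Tier 1: each Prometheus aggregates the series it scrapes.
    aggregationSet('component-local', 'prometheus', 'sli:ops:rate_5m', 'local:component:ops:rate_5m', ['env', 'type', 'component']),
    // Tier 2: Thanos aggregates across all Prometheus servers.
    aggregationSet('component-global', 'thanos', 'local:component:ops:rate_5m', 'global:component:ops:rate_5m', ['env', 'type', 'component']),
  ]
  else if mode == 'single-tier' then [
    // A single Prometheus evaluates everything in one step.
    aggregationSet('component', 'prometheus', 'sli:ops:rate_5m', 'component:ops:rate_5m', ['type', 'component']),
  ]
  else error 'unknown evaluation mode: ' + mode;

// Turn an aggregation set into a Prometheus-style recording rule.
local ruleFor(set) = {
  record: set.targetMetric,
  expr: 'sum by (%s) (%s)' % [std.join(', ', set.labels), set.sourceMetric],
};

{
  singleTier: std.map(ruleFor, aggregationSetsFor('single-tier')),
  twoTier: std.map(ruleFor, aggregationSetsFor('two-tier')),
}
```

The point of the design is that the rule generation code (`ruleFor` here) stays the same; only the list of aggregation sets differs between a single-tier and a two-tier deployment.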
Multiple Jsonnet Entrypoints
At present, there are multiple Jsonnet entrypoints in rules-jsonnet, rules-thanos-jsonnet and elsewhere. Each of these files contains a mixture of shared logic and GitLab.com-specific configuration. Splitting this up and using a single entrypoint for each set of YAML output (i.e. one entrypoint for Prometheus rules, one entrypoint for Thanos rules, and so on) would make this easier.
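A single entrypoint per output set might look roughly like the sketch below, where shared logic lives in a function and the entrypoint only supplies instance-specific configuration. The `ruleFiles` function, the `instanceConfig` fields, the output file names, the `registry` service entry, and the `http_requests_total` metric are all hypothetical.

```jsonnet
// Hypothetical sketch: shared logic is a function of configuration, and the
// entrypoint only combines it with instance-specific settings.
local ruleFiles(config) = {
  // One output file per service. The keys become file names when the file
  // is rendered in multi-file mode.
  ['%s-service-slis.yml' % [service]]: std.manifestYamlDoc({
    groups: [{
      name: 'SLIs: %s' % [service],
      rules: [{
        record: 'service:ops:rate_5m',
        expr: 'sum by (type) (rate(%s{type="%s"}[5m]))' % [config.requestMetric, service],
      }],
    }],
  })
  for service in config.services
};

// Instance-specific configuration, kept apart from the shared logic above.
local instanceConfig = {
  services: ['webservice', 'registry'],
  requestMetric: 'http_requests_total',
};

ruleFiles(instanceConfig)
```

Rendered with `jsonnet -S -m <output-dir>`, each key is written out as its own YAML file. A GitLab.com entrypoint and a smaller instance's entrypoint would then differ only in the configuration object they pass in.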
Reusable library code outside of libsonnet
In order to reuse the metrics catalog, we should move things like resource saturation metrics into the libsonnet directory, so that it can be shared using jsonnet bundler.
Saturation metrics are linked to specific GitLab.com services
Compared to other GitLab instances, GitLab.com has a unique set of services. For example, GitLab.com uses separate web, api, and git services, where other GitLab instances make do with a single webservice to handle all three types.
The problem is that the saturation framework is currently hard-coded to point to specific GitLab.com services. One solution would be to match saturation metrics to services on a tagging basis. We are already doing this for some services (for example, we use tags to match Golang services with Golang saturation metrics), but we should extend this to all services.
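Tag-based matching can be sketched as follows. This is an illustration of the idea rather than the saturation framework's actual data model: the service definitions, tag names, saturation point names, and the `appliesToTags` field are all hypothetical.

```jsonnet
// Hypothetical sketch of matching saturation metrics to services by tag.
// Service names, tags and saturation point names are illustrative only.
local services = {
  webservice: { tags: ['rails'] },
  gitaly: { tags: ['golang', 'disk'] },
};

local saturationPoints = {
  go_memory: { appliesToTags: ['golang'] },
  ruby_thread_contention: { appliesToTags: ['rails'] },
  disk_space: { appliesToTags: ['disk'] },
};

// A saturation point applies to a service when they share at least one tag.
local appliesTo(point, service) =
  std.length(std.setInter(std.set(point.appliesToTags), std.set(service.tags))) > 0;

// For each service, list the saturation points whose tags match.
{
  [serviceName]: [
    pointName
    for pointName in std.objectFields(saturationPoints)
    if appliesTo(saturationPoints[pointName], services[serviceName])
  ]
  for serviceName in std.objectFields(services)
}
```

Because saturation points refer only to tags, an instance with a single webservice and an instance with separate web, api, and git services can share the same saturation definitions, provided each service declares the appropriate tags.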
Proof-of-concept: !3859 (closed)