Monitoring improvements focusing on Thanos

This adds a few improvements to the monitoring service:

  • Remove the public dashboards Thanos component: we don't have that anymore 😢.
  • Merge all memcached components into one, measured from the client side.
  • Remove `fqdn` significant labels for components that now only run on Kubernetes.
  • Tighten up some apdex durations based on the past week of data.
  • Add some significant labels to the rule evaluations so we can distinguish failures in Prometheus and Thanos on the detail panels.
  • Add real-world target durations on gRPC apdexes.
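
To illustrate why the apdex thresholds matter, here is a minimal sketch of the common Apdex formula, (satisfied + tolerated / 2) / total. It is not the definition used in the metrics catalog; the function name, the sample durations, and the thresholds are all made up for the example.

```python
def apdex(durations_s, satisfied_s, tolerated_s=None):
    """Common Apdex formula: (satisfied + tolerated / 2) / total.

    Hypothetical helper for illustration only; not taken from this change.
    """
    if not durations_s:
        raise ValueError("need at least one sample")
    # Requests at or below the satisfied threshold count fully.
    satisfied = sum(1 for d in durations_s if d <= satisfied_s)
    tolerated = 0
    if tolerated_s is not None:
        # Requests between the two thresholds count for half.
        tolerated = sum(1 for d in durations_s if satisfied_s < d <= tolerated_s)
    return (satisfied + tolerated / 2) / len(durations_s)


# Invented request durations in seconds.
samples = [0.02, 0.05, 0.08, 0.3, 0.9, 1.5, 4.0, 12.0]

loose = apdex(samples, satisfied_s=5.0)  # generous target
tight = apdex(samples, satisfied_s=1.0)  # tightened target
print(loose, tight)
```

With the same traffic, the tightened target gives a lower score, so a threshold that is too loose can hide slow requests that a realistic one would surface.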

I looked into separating out the Thanos service entirely for https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14335#note_732980227, but that would require reworking some "type" labels on source metrics and I'm not quite sure where all of these live.

This already cleans up the monitoring service a bit, so the SLIs will be easier to move when we do split it up.

Edited by Bob Van Landuyt
