Discussion: Topology Service application/DB-level metrics collection

Context

Cells-infra/Tenant Services have been discussing additional monitoring for the topology service (TS) across multiple issues (gitlab-com/gl-infra/tenant-scale/cells-infrastructure/team#526 (comment 2863582756) & gitlab-com/gl-infra/tenant-scale/cells-infrastructure/team#528 (closed)).

This discussion issue consolidates our thinking and invites collaboration with Observability and Runway on implementation approaches.

Current state: Runway's out-of-the-box monitoring aggregates HTTP/gRPC metrics via the runway_ingress and runway_lb SLIs (latency, request rate, error rate). These provide dashboards for both topology-service-rest and topology-service-grpc.

The opportunity: As topology-service becomes the authoritative routing layer for Cells, we need visibility into specific failure modes that aggregate metrics won't surface.

Key Metrics We're Considering

After analyzing the TS code, we've identified four high-value SLIs we'd like to see in Grafana:

  1. Database Connection Pool Health - Track Spanner session utilization to prevent pool exhaustion
  2. Per-Method gRPC Performance - Distinguish the ultra-critical Classify RPCs (<10ms target) from admin operations, which are more tolerant (a sketch of how this could be expressed in the metrics catalog follows this list)
  3. Auth/Authz Failure Tracking - Monitor mTLS and RBAC failures (currently invisible, since they happen before the RPC handlers run)
  4. Database Query Performance - Track query latency by operation type to catch performance regressions
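
As a sketch of how point 2 could be expressed as a catalog SLI component: the block below assumes TS exposes (or can expose) go-grpc-prometheus-style server metrics (grpc_server_handling_seconds, grpc_server_handled_total). The selector values and thresholds are placeholders, not confirmed TS metric names or targets.

    // Sketch only: assumes go-grpc-prometheus style server metrics are
    // exported by TS; metric/label names and thresholds are placeholders.
    local metricsCatalog = import 'servicemetrics/metrics.libsonnet';
    local histogramApdex = metricsCatalog.histogramApdex;
    local rateMetric = metricsCatalog.rateMetric;

    {
      grpc_classify: {
        userImpacting: true,
        description: 'Ultra-critical Classify RPCs served by the topology service.',

        requestRate: rateMetric(
          counter='grpc_server_handled_total',
          selector={ grpc_method: 'Classify' },
        ),

        apdex: histogramApdex(
          histogram='grpc_server_handling_seconds_bucket',
          selector={ grpc_method: 'Classify' },
          satisfiedThreshold=0.01,  // 10ms target for Classify
          toleratedThreshold=0.05,  // placeholder tolerated threshold
        ),

        significantLabels: ['grpc_method'],
      },
    }

Admin-facing methods could get a separate component with looser thresholds, so a slow admin call does not drag down the Classify apdex.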

Implementation Paths

More Context

Our TS GCP projects live in a standalone BYO folder, so there is no connectivity with the rest of GitLab's VPCs.

We know that Runway exports the default SLIs for Runway services via Stackdriver/Cloud Logging.

So we would like to open a discussion here on the best path forward to:

  1. Harvest metrics from Cloud Spanner (the database layer for TS) and surface them in a dashboard.
  2. Add metrics/SLIs for per-method, application-level concerns and add them to the existing service dashboards, likely via this Custom Metrics process (a sketch is included after this list).
  3. Roll up all of these dashboards (TS REST, TS gRPC, TS Spanner) into a single pane of glass that is convenient for the EOC on-call.
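
For point 2, one possible shape is an error-rate style component built from a custom application-level counter. This is only a sketch: topology_service_auth_failures_total is a hypothetical counter TS would need to export (it does not exist today), grpc_server_started_total assumes go-grpc-prometheus instrumentation, and the type selector follows the convention used in the Phase 1 snippet below.

    // Sketch only: topology_service_auth_failures_total is a hypothetical
    // counter TS would need to export (e.g. incremented in the mTLS/RBAC
    // interceptors, since rejections never reach the RPC handlers).
    local metricsCatalog = import 'servicemetrics/metrics.libsonnet';
    local rateMetric = metricsCatalog.rateMetric;

    {
      grpc_auth: {
        userImpacting: true,
        description: 'mTLS/RBAC rejections that never reach the RPC handlers.',

        requestRate: rateMetric(
          counter='grpc_server_started_total',             // all incoming RPCs
          selector={ type: 'topology-grpc' },
        ),

        errorRate: rateMetric(
          counter='topology_service_auth_failures_total',  // hypothetical app-level counter
          selector={ type: 'topology-grpc' },
        ),

        significantLabels: ['reason'],
      },
    }

Getting such a counter out of the BYO-folder projects and into the shared dashboards is exactly where the Custom Metrics process and the OTEL questions below come in.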

Questions for Discussion

  1. Metrics gathering: How can we optimally retrieve metrics from Cloud Spanner in the GCP project gitlab-runway-topo-svc-prod?
  2. OTEL collector: Are there any gotchas to exporting app-level metrics via Runway + the OTEL collector?
  3. Phased approach: Does this sequence of focus areas make sense?
    • Phase 1: Export Stackdriver Spanner metrics into the metrics-catalog as a new service/topology-service-spanner.jsonnet, e.g.:
      // something like this
      apdex: histogramApdex(
        histogram='stackdriver_spanner_api_request_latencies_bucket',
        selector={
          type: 'topology-grpc',
          database: 'topology-db',
          method: { oneOf: ['ExecuteSql', 'Read'] },
          status: 'OK',
        },
        satisfiedThreshold=20,  // 20ms
        toleratedThreshold=50,  // 50ms
        unit='ms',
      ),
    • Phase 2: Per-method gRPC
    • Phase 3: Auth/authz tracking
    • Phase 4: App-level DB metrics (if needed)
  4. Service catalog: Should Cloud Spanner be tracked as a separate service from the Topology Service in the service/metrics catalog? (A rough sketch of what that could look like is included below.)
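
To make question 4 concrete, a separate catalog entry could look roughly like the skeleton below. This is only a sketch: it reuses the Stackdriver metric name and thresholds from the Phase 1 snippet above, the request-rate counter name is assumed rather than verified, and fields like tier and monitoringThresholds are placeholders that would need to match the metrics-catalog schema.

    // services/topology-service-spanner.jsonnet (sketch only; values are placeholders)
    local metricsCatalog = import 'servicemetrics/metrics.libsonnet';
    local histogramApdex = metricsCatalog.histogramApdex;
    local rateMetric = metricsCatalog.rateMetric;

    metricsCatalog.serviceDefinition({
      type: 'topology-service-spanner',
      tier: 'db',                        // placeholder
      monitoringThresholds: {
        apdexScore: 0.999,               // placeholder targets
        errorRatio: 0.999,
      },
      serviceLevelIndicators: {
        spanner_api: {
          userImpacting: true,
          description: 'Cloud Spanner API calls made by the topology service.',

          requestRate: rateMetric(
            counter='stackdriver_spanner_api_request_count',  // assumed exporter metric name
            selector={ database: 'topology-db' },
          ),

          apdex: histogramApdex(
            histogram='stackdriver_spanner_api_request_latencies_bucket',
            selector={ database: 'topology-db', method: { oneOf: ['ExecuteSql', 'Read'] } },
            satisfiedThreshold=20,  // 20ms, as in the Phase 1 sketch
            toleratedThreshold=50,  // 50ms
          ),

          significantLabels: ['method', 'status'],
        },
      },
    })

Keeping Spanner as its own catalog entry would give it independent SLOs and alerting, while the roll-up in point 3 of the implementation paths could still pull the TS REST, TS gRPC, and TS Spanner dashboards into one view.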

Your subject matter expertise here would be greatly appreciated to help guide this effort 🙏

cc @gitlab-com/gl-infra/observability @gitlab-com/gl-infra/platform/runway @reprazent @fforster @gitlab-com/gl-infra/tenant-scale/cells-infrastructure @gitlab-com/gl-infra/tenant-scale/tenant-services @jarv
