Kubernetes Observability: Metrics

Overview

This issue tracks the implementation of metrics collection and integration for the Runway GKE clusters. We need to establish proper observability for both platform-level metrics (from GKE) and application-level metrics (from our in-house services) to enable effective monitoring, alerting, and SLO management.

Background

We want to onboard the first pilot customer to Runway on Kubernetes. This requires a baseline level of production readiness, including observability.

Current observability capabilities are insufficient to track service health and performance. Platform metrics are visible in the GCP cloud console, but we lack alerting capabilities and integration with https://dashboards.gitlab.net/.

Objectives

  1. Implement platform metrics collection from GKE
  2. Implement application metrics collection from our services
  3. Integrate metrics with our dashboard system

Implementation Details

Platform Metrics (GKE)

  • Identify relevant platform-level metrics for monitoring cluster and node health, for example:
    • Request rate by response status (for availability SLI)
    • Request latency (for latency SLI)
  • Modify the existing Stackdriver exporter configuration to capture the selected metrics
  • Test and validate data flow to our metrics backend
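
As a rough illustration of how the two example metrics above could feed the availability and latency SLIs, here is a minimal Python sketch. The function names, the choice to count only 5xx responses as failures, and the cumulative-bucket layout for latency are assumptions made for illustration; they are not the actual Runway SLI definitions.

```python
import math
from typing import Mapping


def availability_sli(requests_by_status: Mapping[int, float]) -> float:
    """Fraction of requests that did not end in a 5xx response.

    Assumption: only 5xx statuses count against availability.
    """
    total = sum(requests_by_status.values())
    if total == 0:
        # Assumption: a window with no traffic is treated as fully available.
        return 1.0
    errors = sum(
        count
        for status, count in requests_by_status.items()
        if 500 <= status <= 599
    )
    return (total - errors) / total


def latency_sli(cumulative_buckets: Mapping[float, float], threshold_seconds: float) -> float:
    """Fraction of requests served within threshold_seconds.

    Assumption: cumulative_buckets maps an upper bound in seconds to the
    count of requests at or below that bound, with math.inf holding the
    total, and threshold_seconds matches one of the bucket bounds.
    """
    total = cumulative_buckets[math.inf]
    if total == 0:
        return 1.0
    return cumulative_buckets[threshold_seconds] / total


if __name__ == "__main__":
    print(availability_sli({200: 990, 404: 5, 500: 3, 503: 2}))
    print(latency_sli({0.1: 800, 0.5: 950, math.inf: 1000}, 0.5))
```

Request counts labelled by response status and a latency histogram are enough to derive both SLIs, so those two are the minimum the exporter needs to capture.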

Application Metrics

  • Deploy OpenTelemetry (OTEL) collector to clusters using the k8s-mgmt repository
  • Configure collectors to scrape application metrics endpoints
  • Implement appropriate aggregation and processing rules
  • Test and validate data flow to our metrics backend
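
To make "application metrics endpoint" concrete, the sketch below renders a counter in the Prometheus text exposition format, which is one format a collector can be configured to scrape. This issue does not specify the format our services expose, so the format choice, the function names, and the metric name are assumptions for illustration only.

```python
from typing import Iterable, Mapping, Tuple


def _escape_label_value(value: str) -> str:
    # The text format requires escaping backslashes, double quotes, and newlines.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(
    name: str,
    help_text: str,
    samples: Iterable[Tuple[Mapping[str, str], float]],
) -> str:
    """Render one counter metric family as Prometheus text exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sort labels so the output is stable between scrapes.
        label_str = ",".join(
            f'{key}="{_escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric name and labels.
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"status": "200"}, 990), ({"status": "500"}, 3)],
    ))
```

A service would serve text like this from an HTTP path that the collector is pointed at. In practice an existing client library would normally produce this output instead of hand-written rendering.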

Definition of Done

  • Platform metrics are being collected from GKE clusters
  • Application metrics are being collected from our services
  • Metrics are displayed on Runway service dashboards at https://dashboards.gitlab.net/
  • User-facing documentation is updated to outline which platform metrics are collected out of the box and how application metrics can be implemented
  • Runway developer-facing documentation is updated to describe the metric collection process
Edited by Dan Ryan