Kubernetes Observability: Metrics
Overview
This issue tracks the implementation of metrics collection and integration for the Runway GKE clusters. We need to establish proper observability for both platform-level metrics (from GKE) and application-level metrics (from our in-house services) to enable effective monitoring, alerting, and SLO management.
Background
We want to onboard the first pilot customer to Runway on Kubernetes. This requires a baseline level of productionization, including observability.
Current observability capabilities are insufficient to track service health and performance. Platform metrics are visible in the GCP Cloud Console, but we lack alerting capabilities and integration with https://dashboards.gitlab.net/.
Objectives
- Implement platform metrics collection from GKE
- Implement application metrics collection from our services
- Integrate metrics with our dashboard system
Implementation Details
Platform Metrics (GKE)
- Identify relevant platform-level metrics for monitoring cluster and node health, for example:
  - Request rate by response status (for availability SLI)
  - Request latency (for latency SLI)
- Modify existing stackdriver exporter configuration to capture selected metrics
- Test and validate data flow to our metrics backend
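To make the two example SLIs concrete, here is a minimal sketch of how they could be derived once the request metrics are flowing. It assumes request counts keyed by HTTP status code and a cumulative latency histogram keyed by upper bound in seconds. The function names, the choice to count only 5xx responses as errors, and the treatment of zero traffic are illustrative assumptions, not decisions made in this issue.

```python
from typing import Mapping


def availability_sli(requests_by_status: Mapping[int, float]) -> float:
    """Fraction of requests that did not end in a 5xx response.

    Assumption: only 5xx responses count against availability.
    """
    total = sum(requests_by_status.values())
    if total == 0:
        # Assumption: with no traffic, report the service as fully available.
        return 1.0
    errors = sum(
        count
        for status, count in requests_by_status.items()
        if 500 <= status <= 599
    )
    return (total - errors) / total


def latency_sli(
    cumulative_buckets: Mapping[float, float], threshold_seconds: float
) -> float:
    """Fraction of requests served within threshold_seconds.

    cumulative_buckets maps a bucket upper bound (seconds) to the number of
    requests at or below that bound, and must include float("inf") as the
    total. The threshold must be one of the bucket bounds; a KeyError is
    raised otherwise.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        # Assumption: with no traffic, report the latency target as met.
        return 1.0
    return cumulative_buckets[threshold_seconds] / total


# Example inputs (made-up numbers, for illustration only).
sample_status_counts = {200: 90, 404: 5, 503: 5}
sample_latency_buckets = {0.1: 50, 0.5: 80, float("inf"): 100}
print(availability_sli(sample_status_counts))
print(latency_sli(sample_latency_buckets, 0.5))
```

In practice these ratios would more likely be expressed as queries or recording rules in the metrics backend; the sketch only shows the arithmetic behind each SLI.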
Application Metrics
- Deploy OpenTelemetry (OTEL) collector to clusters using the k8s-mgmt repository
- Configure collectors to scrape application metrics endpoints
- Implement appropriate aggregation and processing rules
- Test and validate data flow to our metrics backend
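For context on what a collector would scrape, the following is a rough sketch of an application metrics endpoint that serves counters in the Prometheus text exposition format, using only the Python standard library. The `/metrics` path, the `http_requests_total` metric name, and the `inc` and `render_metrics` helpers are hypothetical. A real service would more likely use an existing metrics client library, and this issue does not prescribe one.

```python
from http.server import BaseHTTPRequestHandler
from typing import Dict, Mapping, Tuple

# Hypothetical in-process store: (metric name, sorted label pairs) -> value.
LabelSet = Tuple[Tuple[str, str], ...]
COUNTERS: Dict[Tuple[str, LabelSet], float] = {}


def inc(name: str, labels: Mapping[str, str], amount: float = 1.0) -> None:
    """Increment a counter identified by its name and label set."""
    key = (name, tuple(sorted(labels.items())))
    COUNTERS[key] = COUNTERS.get(key, 0.0) + amount


def render_metrics() -> str:
    """Render all counters as Prometheus text exposition lines.

    Label values are not escaped here, so this sketch assumes they contain
    no quotes, backslashes, or newlines.
    """
    lines = []
    typed = set()
    for (name, labels), value in sorted(COUNTERS.items()):
        if name not in typed:
            # Emit the TYPE line once per metric name.
            lines.append(f"# TYPE {name} counter")
            typed.add(name)
        if labels:
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics on GET /metrics (path is an assumption)."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header(
            "Content-Type", "text/plain; version=0.0.4; charset=utf-8"
        )
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, the handler can be passed to `http.server.HTTPServer(("", 8000), MetricsHandler).serve_forever()`; the port is arbitrary. The collector's scrape configuration would then point at this endpoint.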
Definition of Done
- Platform metrics are being collected from GKE clusters
- Application metrics are being collected from our services
- Metrics are displayed on Runway service dashboards at https://dashboards.gitlab.net/
- User-facing documentation is updated to outline which (platform) metrics are collected out of the box and how application metrics can be implemented
- Runway developer-facing documentation is updated to describe the metric collection process
Edited by Dan Ryan