Observability Roadmap Sync - Summary Sep 2025

This issue summarizes the discussion from the Observability Roadmap sync on 25th Sep 2025.

Key Challenge

The team identified a fundamental capacity issue: 14 major initiatives in roadmap::next are competing for resources across a 7-person team with zero spare capacity. This roadmap review focused on prioritizing critical business initiatives while acknowledging significant technical debt and infrastructure needs.

Priority Projects

1. Observability for Cells & Component Ownership Model

Create a standardized observability solution fo... (gitlab-com/gl-infra&1711 - closed)

Proposed DRI: @stejacks-gitlab OR @knottos
Overview: Create a detailed epic scoping what is needed for the component ownership model rollout. At least 4 existing clusters need migration, with the Auth service as the Q1 priority. A missing logging pipeline component was identified as a critical gap.

2. SLA Calculation Framework (In Progress)

https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1642

Proposed DRI: @reprazent
Overview: Must deliver the first iteration by end of October. Focused on 99.9% uptime measured at the edge via Cloudflare, following competitor standards. The Support team will manage the operational process once the tooling is complete.
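
For context on what a 99.9% target implies, here is a minimal sketch of the downtime budget arithmetic. It assumes a simple time-based definition of uptime; the actual calculation method and measurement window for the framework are not specified in this summary, and the function name is hypothetical.

```python
def downtime_budget_minutes(target: float, window_days: int) -> float:
    """Return the allowed downtime in minutes for an availability target.

    Assumption: availability is time-based, i.e. (uptime / total time).
    The SLA framework itself may define it differently.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return round((1.0 - target) * total_minutes, 2)


# A 99.9% target leaves 0.1% of the window as downtime budget.
monthly = downtime_budget_minutes(0.999, 30)   # 30-day window
yearly = downtime_budget_minutes(0.999, 365)   # 365-day window
print(f"30 days: {monthly} min, 365 days: {yearly} min")
```

Under these assumptions, 99.9% allows roughly 43 minutes of downtime in a 30-day window and under 9 hours in a year.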

3. Vector Migration & Log Processing

Proposed DRI: @nduff
Overview: $500K+ annual cost savings opportunity through eliminating duplicate log processing with the Security team. Addresses Fluentd scaling issues and enables better log field standardization.

4. Standardized SDK & Telemetry Framework

Proposed DRI: @abrandl OR @hmerscher
Overview: Essential foundation to control ingest of four-pillars observability. The current fragmented approach (3+ different methods in Ruby, 4+ in Golang) blocks advanced observability features. Will work with Elliot from the DevEx team on implementation.

5. ClickHouse Evaluation

Proposed DRI: TBC
Overview: Consider a spike with GitLab Dedicated to test Rails log processing. Data source improvements (~60% of the work remaining) need to be completed before a broader rollout is considered.

Deferred Items

  • Sentry Replacement: Status quo maintained while exploring alternatives to solve the RUM (client-side) problem
  • Distributed Tracing: Blocked pending SDK standardization work. Previous auto-instrumentation attempts produced unusable noise
  • Cross-tenant Error Budgets: Increasingly urgent as more services span multiple Mimir tenants, affecting Usage Billing and AI Gateway monitoring
  • Next Gen Service Catalog: Impossible to prioritize, although its foundational impact will be significant

Decisions

  1. Growing Importance of Cells/COM Tenant Model: @stejacks-gitlab to work with the team to put together a project structure

  2. Focus on Logging Infrastructure: Vector migration and log field standardization identified as highest ROI work, enabling both cost savings and better observability foundations

  3. SDK-First Approach: All future observability capabilities (tracing, improved metrics) depend on standardized telemetry SDKs across Ruby, Golang, and upcoming Rust services

  4. Collaborative Model: Increased partnership with Developer Experience team, particularly Elliot, for SDK development and developer adoption

Next Steps

  • Stephanie to schedule component ownership model scoping session
  • DRIs to ensure epics are workflow-infra::Ready for top priority items
  • Regular roadmap reviews every 6 weeks to maintain momentum
  • Quarterly sync with Product/Engineering leadership on capacity and priorities

Cost Impact Summary

  • Vector migration: ~$500K annual savings
  • Log field standardization: Additional ~$500K+ annual savings (25% volume reduction)
  • Combined initiatives represent ~$1M+ annual cost optimization opportunity
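
The combined figure above can be checked with a short sketch. The two inputs are the approximate lower-bound estimates quoted in this summary, not measured values, and the variable names are illustrative.

```python
# Approximate annual savings estimates from this summary (USD, lower bounds).
estimated_savings = {
    "vector_migration": 500_000,           # "~$500K annual savings"
    "log_field_standardization": 500_000,  # "~$500K+" from a 25% volume reduction
}

# Summing lower bounds gives a floor for the combined opportunity.
combined_floor = sum(estimated_savings.values())
print(f"Combined floor: ${combined_floor:,} per year")
```

Because the log field standardization estimate is quoted as "$500K+", the total is a floor rather than a point estimate, which matches the "~$1M+" wording.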

Edited by Liam McAndrew