Observability Roadmap Sync - Summary Sep 2025
This issue summarizes the discussion from the Observability Roadmap sync on 25th Sep 2025.
Key Challenge
The team identified a fundamental capacity issue: 14 major initiatives in roadmapnext competing for resources across a 7-person team with zero spare capacity. This roadmap review focused on prioritizing critical business initiatives while acknowledging significant technical debt and infrastructure needs.
Priority Projects
1. Observability for Cells & Component Ownership Model
Create a standardized observability solution fo... (gitlab-com/gl-infra&1711 - closed)
Proposed DRI: @stejacks-gitlab OR @knottos
Overview: Create a detailed epic scoping what is needed for the component ownership model rollout. At least 4 existing clusters need migration, with the Auth service as the Q1 priority. A missing logging pipeline component was identified as a critical gap.
2. SLA Calculation Framework (In Progress)
https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1642
Proposed DRI: @reprazent
Overview: Must deliver the first iteration by end of October. Focused on 99.9% uptime measured at the edge via Cloudflare, following competitor standards. The Support team will manage the operational process once tooling is complete.
3. Vector Migration & Log Processing
Proposed DRI: @nduff
Overview: $500K+ annual cost savings opportunity through eliminating duplicate log processing with Security team. Addresses FluentD scaling issues and enables better log field standardization.
4. Standardized SDK & Telemetry Framework
Proposed DRI: @abrandl OR @hmerscher
Overview: Essential foundation to control ingest of four-pillars observability. Current fragmented approach (3+ different methods in Ruby, 4+ in Golang) blocks advanced observability features. Will work with DevEx team's Elliot on implementation.
5. ClickHouse Evaluation
Proposed DRI: TBC
Overview: Consider a spike with GitLab Dedicated to test Rails log processing. Data source improvements (~60% of the work remaining) need to be completed before a broader rollout is considered.
Deferred Items
- Sentry Replacement: Status quo maintained while exploring alternatives to solve RUM (client-side) problem
- Distributed Tracing: Blocked pending SDK standardization work. Previous auto-instrumentation attempts produced unusable noise
- Cross-tenant Error Budgets: Increasingly urgent as more services span multiple Mimir tenants, affecting Usage Billing and AI Gateway monitoring
- Next Gen Service Catalog: Cannot be prioritized at present, although its foundational impact will be significant
Decisions
- Growing Importance of Cells/COM Tenant Model: @stejacks-gitlab to work with the team to put together a project structure
- Focus on Logging Infrastructure: Vector migration and log field standardization identified as highest ROI work, enabling both cost savings and better observability foundations
- SDK-First Approach: All future observability capabilities (tracing, improved metrics) depend on standardized telemetry SDKs across Ruby, Golang, and upcoming Rust services
- Collaborative Model: Increased partnership with Developer Experience team, particularly Elliot, for SDK development and developer adoption
Next Steps
- Stephanie to schedule component ownership model scoping session
- DRIs to ensure epics are workflow-infraReady for top priority items
- Regular roadmap reviews every 6 weeks to maintain momentum
- Quarterly sync with Product/Engineering leadership on capacity and priorities
Cost Impact Summary
- Vector migration: ~$500K annual savings
- Log field standardization: Additional ~$500K+ annual savings (25% volume reduction)
- Combined initiatives represent ~$1M+ annual cost optimization opportunity
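The arithmetic behind the combined figure can be sketched as follows. This is a minimal illustration using the approximate estimates quoted above; the constant and function names are hypothetical, and the implied baseline cost is not a figure from the sync. It is only what the 25% and ~$500K numbers would suggest if cost scaled linearly with log volume, which is an assumption.

```python
# Approximate estimates quoted in the Cost Impact Summary (USD per year).
VECTOR_MIGRATION_SAVINGS = 500_000   # ~$500K
LOG_FIELD_STD_SAVINGS = 500_000      # ~$500K+ (treated as a lower bound)
LOG_VOLUME_REDUCTION = 0.25          # 25% volume reduction


def combined_savings(*savings: int) -> int:
    """Sum the individual annual savings estimates."""
    return sum(savings)


def implied_baseline_cost(savings: float, reduction: float) -> float:
    """Volume-dependent cost implied by a saving and a volume reduction.

    Assumption (not stated in the sync): cost scales linearly with log volume.
    """
    return savings / reduction


total = combined_savings(VECTOR_MIGRATION_SAVINGS, LOG_FIELD_STD_SAVINGS)
baseline = implied_baseline_cost(LOG_FIELD_STD_SAVINGS, LOG_VOLUME_REDUCTION)

print(f"Combined annual savings: ~${total:,}")
print(f"Implied volume-dependent log cost (linear assumption): ~${baseline:,.0f}")
```

Because both inputs are rough estimates (one of them a lower bound), the outputs should be read as order-of-magnitude figures rather than a forecast.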