Development Analytics Q4 Planning

Q4

Epic Description DRIs

Build CI Failure Signatures for Pattern Detecti... (&27)

Complete the work carried over from Q3: add failure categories and signatures to the ClickHouse datastore. This will enable real-time dashboards and alerts on CI failures, and put in place the data we need to identify true master-broken incidents quickly. It is a foundational element that feeds into pipeline stability improvements. Lohit/Pranshu

Build single backend test observability solutio... (&28)

Complete the work carried over from Q3 on our ClickHouse-based Test Observability dashboards. These dashboards will underpin our work on identifying flaky tests and fixing, deleting, or quarantining them. They will also support deep links into the specific flaky test issues we create, giving engineers much better visibility into the health of their tests. Andrey

Improve the quarantine process for flaky tests (gitlab-org/quality&259)

Improve flaky test detection by moving to ClickHouse-based data, and support an auto-quarantine system with Test Governance, to drive CI stability.

Our success metric here is a reduction in the number of flaky tests and in unneeded pipeline failures.

David/Ievgen

Review CI failures and ensure top infrastructur... (gitlab-org/quality&263)

Aligned with our DX survey actions around CI stability, we will review the top reasons for CI failures (such as infrastructure issues or timeouts) and create issues with the responsible teams to work through and resolve them. Our success metric here is a reduction in the number of unneeded pipeline failures. Pranshu

Introduce test coverage observability with Clic... (gitlab-org/quality&240)

Engineering teams lack visibility into test coverage trends and patterns across our codebase. While coverage data is generated during CI/CD, it's trapped in short-lived artifacts. This epic is a foundational component for surfacing coverage to teams, so that they can understand how changes such as quarantined or deleted tests impact their coverage. Richard
EPIC TBD Support SaaS availability call with dashboards TBD

Migrate CI related Development Analytics snowfl... (&31)

Migrate CI-related Development Analytics Snowflake dashboards and data to ClickHouse/Grafana to improve discoverability. TBD

Migrate existing Devex Dashboards to new Data Path (&29)

Support consolidation of Devex dashboards and data to Grafana/ClickHouse TBD
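
To illustrate the failure categories and signatures idea from the CI Failure Signatures epic above, here is a minimal sketch of mapping a job log to a category with regular expressions. The category names, the patterns, and the function names are all hypothetical and chosen only for illustration. They are not the signatures actually stored in ClickHouse.

```python
import re
from collections import Counter

# Hypothetical signatures: (category, pattern). Real signatures would be
# maintained by the team; these patterns are illustrative assumptions.
SIGNATURES = [
    ("timeout", re.compile(r"timed out|execution took longer than", re.IGNORECASE)),
    ("infrastructure", re.compile(r"connection refused|no space left on device", re.IGNORECASE)),
]


def classify_failure(log_text: str) -> str:
    """Return the first category whose pattern matches the log, else 'unknown'."""
    for category, pattern in SIGNATURES:
        if pattern.search(log_text):
            return category
    return "unknown"


def count_categories(logs):
    """Count failures per category, e.g. to feed a dashboard panel."""
    return Counter(classify_failure(log) for log in logs)
```

Counting per category in this way is one possible input for reviewing the top reasons for CI failures. The first-match-wins ordering is a design choice of this sketch, so more specific patterns would need to come first.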

Q1

Epic Description

Improve Master-Broken Incident Detection Accuracy (&24)

Improve the mean time to recovery from master broken incidents by improving detection and notification.

This is also foundational to being able to review master-broken incidents easily, so that we can add extra preventative measures where needed to stop particular failures from recurring.

Build scalable CI job telemetry reporting (&22)

EPIC TBD Clarify ownership of EP related tools (DangerFiles, Renovatebot, Triage Ops support)
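
As a point of reference for the mean time to recovery goal in the master-broken detection epic above, here is a minimal sketch of how that metric could be computed from incident timestamps. The function name and the `(broken_at, recovered_at)` tuple shape are assumptions made for illustration, not an existing data model.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (recovered_at - broken_at) over all incidents.

    `incidents` is a list of (broken_at, recovered_at) datetime pairs.
    Returns None when there are no incidents.
    """
    durations = [recovered_at - broken_at for broken_at, recovered_at in incidents]
    if not durations:
        return None
    # Summing timedeltas needs an explicit timedelta() start value.
    return sum(durations, timedelta()) / len(durations)
```

Faster detection and notification would shorten each incident's duration and therefore lower this average.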
Edited by Paul Phillips