# DX Insights dashboard
## Overview
This epic delivers a Grafana Insights home page: a single cross-team monitoring dashboard that shows the health and activity of every GitLab group across all triage report criteria in one view.
The dashboard serves two purposes simultaneously. First, it gives the Development Analytics team a way to monitor the status of all ~75 groups at once without opening individual triage reports. Second, it acts as a discovery surface for the detailed dashboards (code coverage, test mapping, quarantined tests, etc.) by linking each metric to the relevant drill-down view.
---
## Core component: metrics overview table
The centrepiece is a single table with groups as rows and triage report criteria as columns. Each cell shows the count or key metric for that group and criterion, using the same attention signals that the triage reports already surface.
### Criteria covered
All ~20 sections from [#436](https://gitlab.com/gitlab-org/quality/analytics/team/-/work_items/436):
| Domain | Sections |
|--------|----------|
| Security | Security issues |
| Availability | Availability issues |
| Test health | Quarantined tests, Flaky tests, Test coverage, Test duration |
| Infradev | New infradev, Overdue infradev, Infradev heatmap |
| Feature proposals | Customer feature proposals, Non-customer feature proposals |
| UX debt | Unscheduled UX debt |
| Bugs | All bugs heatmap, Customer bugs heatmap, Frontend customer bugs, Frontend bugs, Backend customer bugs, Backend bugs, Past SLO bugs, Vintage bugs, Blocked bugs |
| Community | New untriaged community issues |
### How attention is surfaced
The triage reports do not use RAG (red/amber/green) classification. Instead, each section surfaces issues needing attention through SLO breach detection, attention filtering, or raw counts. The dashboard mirrors these signals:
- **SLO-backed sections** (Security, Availability, Infradev, Test Duration): Show counts of issues approaching or past SLO. These numbers are inherently meaningful -- any non-zero value warrants investigation.
- **Attention-filtered sections** (Deferred UX, Quarantine, Coverage Risk): Show count of issues flagged as requiring attention (missing labels, severity too high, tests without tracking issues, etc.).
- **Count-only sections** (Feature Proposals, Flaky Tests, Bugs): Show raw counts. Viewers identify outliers by relative magnitude across groups.
Grafana's built-in value-based colouring (colour intensity scaling with count magnitude) highlights cells with notably high values, letting problem areas stand out visually without introducing arbitrary thresholds.
### Granularity
A dropdown variable controls the level of aggregation: Section (~5 rows), Stage (~15 rows), or Group (~75 rows). One dashboard, one table, re-aggregated based on the selection.
At group level, cells show individual counts. At rolled-up levels, cells show aggregated totals. The exact presentation at rolled-up levels will be decided during implementation.
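The roll-up itself is plain summation over an org-structure mapping. A minimal Python sketch of the idea, using hypothetical group, stage, and section names (the real values come from the `group`, `stage`, and `section` columns in ClickHouse):

```python
from collections import defaultdict

# Hypothetical group-level counts and org mapping, for illustration only.
GROUP_COUNTS = {"pipeline_execution": 12, "runner": 4, "environments": 7}
GROUP_TO_STAGE = {"pipeline_execution": "verify", "runner": "verify", "environments": "deploy"}
STAGE_TO_SECTION = {"verify": "ci", "deploy": "cd"}

def rollup(counts, level):
    """Aggregate group-level counts to the selected granularity."""
    if level == "group":
        return dict(counts)
    totals = defaultdict(int)
    for group, count in counts.items():
        stage = GROUP_TO_STAGE[group]
        key = stage if level == "stage" else STAGE_TO_SECTION[stage]
        totals[key] += count
    return dict(totals)

print(rollup(GROUP_COUNTS, "stage"))    # {'verify': 16, 'deploy': 7}
print(rollup(GROUP_COUNTS, "section"))  # {'ci': 16, 'cd': 7}
```

In the dashboard this would be a `GROUP BY` keyed off the dropdown variable rather than application code, but the aggregation semantics are the same.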
### Navigation
Each cell links to the relevant detailed dashboard or triage report section, giving teams a path from the overview to the drill-down.
### Data source
All data required for the dashboard is available in ClickHouse today. No new ingestion work is needed.
- **Issue-based domains** (Security, Availability, Infradev, Bugs, Feature Proposals, UX Debt, Community): Queryable from `work_item_metrics.issue_metrics`, which has a `labels` array containing group, severity, priority, and type labels. Data is current with ~1 year of history.
- **Test-based domains** (Quarantine, Flaky Tests, Test Duration): Queryable from `test_metrics.test_results`, which has `group`, `stage`, and `section` columns for direct group-level breakdown.
- **Coverage Risk**: Queryable from `code_coverage.test_file_risk_summary`, joinable to group via `code_coverage.coverage_metrics`.
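For the issue-based domains, the group breakdown hinges on reading the group out of the `labels` array. Assuming the labels follow GitLab's scoped-label convention (`group::<name>`, `severity::<n>`, etc.), the extraction looks like this; the exact prefix must be confirmed against the stored data:

```python
def group_from_labels(labels):
    """Return the group name from a labels array, or None if absent.

    Assumes the scoped-label convention 'group::<name>'; adjust the
    prefix if the labels stored in issue_metrics differ.
    """
    for label in labels:
        if label.startswith("group::"):
            return label.split("::", 1)[1]
    return None

print(group_from_labels(["severity::2", "group::pipeline execution", "type::bug"]))
# pipeline execution
```

In ClickHouse this maps to filtering the `labels` array for the `group::` prefix directly in SQL; the child issues document the exact label mappings.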
SLO threshold logic for sections that use it (Security, Availability, Infradev, Test Duration) is replicated independently in the dashboard queries, matching the handbook-defined thresholds used by the triage reports. Specific query details and label mappings are documented in each child issue.
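The replicated SLO check reduces to date arithmetic per severity. A minimal sketch with placeholder thresholds and a placeholder "approaching" window; the real severity-to-days mapping comes from the handbook, not from this example:

```python
from datetime import date

# Placeholder thresholds for illustration; dashboard queries must mirror
# the handbook-defined SLO days per severity, which may differ.
SLO_DAYS = {"severity::1": 30, "severity::2": 60, "severity::3": 90}
APPROACHING_WINDOW = 14  # days before breach at which to start flagging

def slo_status(opened_on, severity, today):
    """Classify an issue as 'ok', 'approaching', or 'past' its SLO."""
    slo = SLO_DAYS.get(severity)
    if slo is None:
        return "ok"  # no SLO defined for this severity label
    age = (today - opened_on).days
    if age > slo:
        return "past"
    if age > slo - APPROACHING_WINDOW:
        return "approaching"
    return "ok"

today = date(2025, 6, 1)
print(slo_status(date(2025, 4, 1), "severity::1", today))   # past (61 days old)
print(slo_status(date(2025, 5, 10), "severity::1", today))  # approaching (22 days old)
```

Because the logic is replicated rather than shared with the triage reports, any handbook threshold change has to be applied in both places; the child issues should note this coupling.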
---
## Supporting panels
Ideas for additional panels that could complement the core table:
- **Worst offenders** -- Top N groups with the highest attention counts per domain
- **Summary counters** -- Stat panels showing org-wide totals (e.g., "8 groups with issues past SLO", "12 groups with >20 open bugs")
- **Domain health overview** -- Domain-level summary showing distribution of attention counts across groups
- **Trend over time** -- Week-over-week metric changes, highlighting groups trending in the wrong direction