Feature Category Summary for Error Budgets

Workflow board for Error Budgets: Board

Project Work

Topic Links Status Summary
Error Budgets as a Performance Indicator
@reprazent
&880
Board
workflow-infraReady 2023-07-05

Status changed back to Ready while we take the time to check the July error budgets report. To be resumed soon.
Improve observability of the websockets service
&1111
Board
workflow-infraTriage
Include patroni SLIs into the error budget for stage groups
&951
Board
workflow-infraTriage 2023-06-28

Including these SLIs in the Error Budget skews the calculation in favour of these high-occurrence metrics. We must reevaluate how we present Error Budgets.
Improve GraphQL SLI and include it in error budgets for stage groups
&665
Board
workflow-infraProposal 2022-01-19

The GraphQL SLI has been created, but it only measures apdex. The default urgency currently has a 1s target duration, which probably needs to be revisited. All stage groups have been opted out of using the SLI in their error budget by default.

Issues in gitlab-org/gitlab:

- gitlab-org/gitlab#349546 (closed): Refine the urgencies defined for GraphQL queries
- gitlab-org/gitlab#345263 (closed): Graphql query error SLI
- gitlab-org/gitlab#328535 (closed): GraphQL query-to-feature correlation mechanism
- gitlab-org/gitlab#345141: GraphQL query-to-urgency correlation
Alert stage groups to SLO violations
&615
Board
workflow-infraTriage 2022-05-06

We are working on the issues and epics that block this epic, namely #1395 (closed), which will be followed by &663 and &700.
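
The concern recorded for the patroni SLIs above, that high-occurrence metrics dominate the calculation, can be illustrated with a small sketch. It assumes the budget spend is the pooled ratio of failed operations to total operations across all SLIs, which this page does not spell out. The SLI names and all counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sli:
    """Operation counts for one SLI over the reporting window (hypothetical)."""
    name: str
    total: int   # operations measured
    failed: int  # operations that violated the SLI


def budget_spent(slis):
    """Fraction of failed operations, pooled across all given SLIs.

    Assumption: every operation carries equal weight, so an SLI with
    far more operations dominates the result.
    """
    total = sum(s.total for s in slis)
    failed = sum(s.failed for s in slis)
    return failed / total


# Hypothetical numbers: request traffic fails 0.5% of the time,
# while the much larger database query volume fails 0.01% of the time.
rails = Sli("rails_requests", total=1_000_000, failed=5_000)
db = Sli("patroni_queries", total=99_000_000, failed=9_900)

without_db = budget_spent([rails])
with_db = budget_spent([rails, db])

print(f"spend without database SLI: {without_db:.4%}")
print(f"spend with database SLI:    {with_db:.4%}")
```

Under these assumed numbers the request failures are unchanged, yet the reported spend drops sharply once the database operations are pooled in. That is one way to read why the presentation of Error Budgets would need to be reevaluated before such SLIs are included.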

Issues Not in Epics

Summary of issues that are not in an Epic (for Error Budgets)

Total Issues: 31

Topic Service Board Workflow Status
Discussion: Proposal of Red/Green Error Budgets
#2391
workflow-infraTriage
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2275
workflow-infraTriage
High traffic endpoints should not have low-urgency thresholds
#2237 (closed)
workflow-infraTriage
Highlights for Error Budget Report
#2142 (closed)
workflow-infraIn Progress
Aggregate the metric application_sli_aggregation:rails_request:apdex:weight:score_1h with wider time frames
#2111 (closed)
workflow-infraProposal
Evaluate endpoint Repositories::GitHttpController#info_refs default urgency, given high traffic share (8%)
#2107 (closed)
workflow-infraIn Progress
Gather usage information on existing stage (group) dashboards
#2094 (closed)
Handle unknown endpoints in error budgets
#2087 (closed)
workflow-infraTriage
Source Code Error Budgets dashboard displaying "No Data"
#2068 (closed)
workflow-infraTriage
Improve SLI metrics implementation design to avoid misinterpretation of prometheus labels required
#2058 (closed)
workflow-infraProposal
"Replica was not up to date" exceptions affect Error Budgets
#1980 (closed)
ServiceSidekiq workflow-infraTriage
Feature Category Summary for Error Budgets
#1871 (closed)
workflow-infraTriage
Make Gitaly per-node alerts less noisy
#1781 (closed)
ServiceGitaly workflow-infraTriage
Stage group apdex-ratio recording is erratic
#1702 (closed)
workflow-infraTriage
Support multiple urgencies for Grape endpoints with the same path but different HTTP verb
#1670 (closed)
workflow-infraTriage
Include 'urgency' in Rails request metrics that result in an ETag cache hit
#1600 (closed)
ServiceWeb workflow-infraTriage
Feedback on Infradev Reports and Error Budget Reports
#1569 (closed)
Define a process for investigating groups with suspiciously high availability
#1530 (closed)
Extract a webservice archetype to be reused for the web, api and git services.
#1512 (closed)
workflow-infraTriage
Remove custom feature category recordings for the puma component
#1481
workflow-infraTriage
Improve the Gitaly SLI using an urgency per RPC
#1450 (closed)
ServiceGitaly boardplanning workflow-infraTriage
Add an SLI for the monitoring service reporting failed
#1443 (closed)
ServiceMonitoring-Other workflow-infraTriage
Revisit rails_requests apdex SLO for git, api and web after urgencies have been set
#1353 (closed)
Add more validations on application SLI definitions
#1317 (closed)
ServiceMonitoring-Other workflow-infraProposal
We don't calculate k8s resource saturation when no limits are set
#1060 (closed)
ServicePrometheus workflow-infraTriage
Allow users to configure default panels
#994 (closed)
Stage Group Dashboards workflow-infraTriage
How can we make sure error budgets don't optimise for the wrong things
#974 (closed)
Run an up-to-date version of GitLab-exporter
#797 (closed)
ServiceMonitoring-Other workflow-infraTriage
[Continuous Integration] Add metrics for stale build traces
#781 (closed)
ServiceMonitoring-Other workflow-infraBlocked
Pass feature_category information to gitaly
#759 (closed)
ServiceGitaly workflow-infraBlocked
Git HTTP redirects show up as 'unknown' feature category
#639 (closed)
ServiceWeb workflow-infraTriage
Edited by Rachel Nienaber