Feature Category Summary for Error Budgets
Workflow board for Error Budgets: Board
Project Work
| Topic | Links | Status | Summary |
|---|---|---|---|
| Error Budgets as a Performance Indicator @reprazent | &880 Board | workflow-infraReady | 2023-07-05: Changed the status back to ready while we take time to check the July error budgets report. To be resumed soon. |
| Improve observability of the websockets service | &1111 Board | workflow-infraTriage | |
| Include patroni SLIs into the error budget for stage groups | &951 Board | workflow-infraTriage | 2023-06-28: Including these SLIs in the Error Budget skews the calculation in favour of these high-occurrence metrics. We must reevaluate how we present Error Budgets. |
| Improve GraphQL SLI and include it in error budgets for stage groups | &665 Board | workflow-infraProposal | 2022-01-19: The GraphQL SLI has been created but only measures apdex. The default urgency currently has a 1s target duration, but this probably needs to be revisited. All stage groups have been opted out of using the SLI in their error budget by default. Issues in gitlab-org/gitlab: gitlab-org/gitlab#349546 (closed): Refine the urgencies defined for GraphQL queries; gitlab-org/gitlab#345263 (closed): Graphql query error SLI; gitlab-org/gitlab#328535 (closed): GraphQL query-to-feature correlation mechanism; gitlab-org/gitlab#345141: GraphQL query-to-urgency correlation |
| Alert stage groups to SLO violations | &615 Board | workflow-infraTriage | 2022-05-06: We are working on the issues and epics that block this epic, namely #1395 (closed), which will be followed by &663 and &700. |
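
The 2023-06-28 note on the patroni SLIs can be illustrated with a small sketch. It assumes the error budget is derived from an operation-weighted availability ratio (successful operations divided by total operations across all included SLIs). The SLI names and counts below are hypothetical and are not taken from any dashboard.

```python
# Hypothetical sketch: how a high-volume SLI can dominate an
# operation-weighted availability figure. All numbers are made up.

def availability(slis):
    """Return successful operations / total operations across all SLIs.

    `slis` maps an SLI name to a tuple of (operations, errors).
    """
    total = sum(ops for ops, _ in slis.values())
    errors = sum(errs for _, errs in slis.values())
    return (total - errors) / total

# One imaginary stage group, first with a single request SLI ...
rails_only = {"rails_requests": (1_000_000, 5_000)}
# ... then with a much higher-volume database SLI added.
with_patroni = {**rails_only, "patroni_queries": (99_000_000, 0)}

print(f"{availability(rails_only):.4%}")    # 99.5000%
print(f"{availability(with_patroni):.4%}")  # 99.9950%
```

Under these assumptions the request error rate is unchanged, yet the combined figure moves from 99.5% to 99.995% because the added SLI contributes 99% of the operations.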
Issues Not in Epics
Summary of issues that are not in an Epic (for Error Budgets)
Total Issues: 31
| Topic | Service | Board | Workflow Status |
|---|---|---|---|
| Discussion: Proposal of Red/Green Error Budgets #2391 | | | workflow-infraTriage |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2275 | | | workflow-infraTriage |
| High traffic endpoints should not have low-urgency thresholds #2237 (closed) | | | workflow-infraTriage |
| Highlights for Error Budget Report #2142 (closed) | | | workflow-infraIn Progress |
| Aggregate the metric application_sli_aggregation:rails_request:apdex:weight:score_1h with wider time frames #2111 (closed) | | | workflow-infraProposal |
| Evaluate endpoint Repositories::GitHttpController#info_refs default urgency, given high traffic share (8%) #2107 (closed) | | | workflow-infraIn Progress |
| Gather usage information on existing stage (group) dashboards #2094 (closed) | | | |
| Handle unknown endpoints in error budgets #2087 (closed) | | | workflow-infraTriage |
| Source Code Error Budgets dashboard displaying "No Data" #2068 (closed) | | | workflow-infraTriage |
| Improve SLI metrics implementation design to avoid misinterpretation of prometheus labels required #2058 (closed) | | | workflow-infraProposal |
| "Replica was not up to date" exceptions affect Error Budgets #1980 (closed) | ServiceSidekiq | | workflow-infraTriage |
| Feature Category Summary for Error Budgets #1871 (closed) | | | workflow-infraTriage |
| Make Gitaly per-node alerts less noisy #1781 (closed) | ServiceGitaly | | workflow-infraTriage |
| Stage group apdex-ratio recording is erratic #1702 (closed) | | | workflow-infraTriage |
| Support multiple urgencies for Grape endpoints with the same path but different HTTP verb #1670 (closed) | | | workflow-infraTriage |
| Include 'urgency' in Rails request metrics that result in an ETag cache hit #1600 (closed) | ServiceWeb | | workflow-infraTriage |
| Feedback on Infradev Reports and Error Budget Reports #1569 (closed) | | | |
| Define a process for investigating groups with suspiciously high availability #1530 (closed) | | | |
| Extract a webservice architype to be reused for the web, api and git services. #1512 (closed) | | | workflow-infraTriage |
| Remove custom feature category recordings for the puma component #1481 | | | workflow-infraTriage |
| Improve the Gitaly SLI using an urgency per RPC #1450 (closed) | ServiceGitaly | boardplanning | workflow-infraTriage |
| Add an SLI for the monitoring service reporting failed #1443 (closed) | ServiceMonitoring-Other | | workflow-infraTriage |
| Revisit rails_requests apdex SLO for git, api and web after urgencies have been set #1353 (closed) | | | |
| Add more validations on application SLI definitions #1317 (closed) | ServiceMonitoring-Other | | workflow-infraProposal |
| We don't calculate k8s resource saturation when no limits are set #1060 (closed) | ServicePrometheus | | workflow-infraTriage |
| Allow users to configure default panels #994 (closed) | Stage Group Dashboards | | workflow-infraTriage |
| How can we make sure error budgets don't optimise for the wrong things #974 (closed) | | | |
| Run an up-to-date version of GitLab-exporter #797 (closed) | ServiceMonitoring-Other | | workflow-infraTriage |
| [Continuous Integration] Add metrics for stale build traces #781 (closed) | ServiceMonitoring-Other | | workflow-infraBlocked |
| Pass feature_category information to gitaly #759 (closed) | ServiceGitaly | | workflow-infraBlocked |
| Git HTTP redirects show up as 'unknown' feature category #639 (closed) | ServiceWeb | | workflow-infraTriage |
Edited by Rachel Nienaber