Error Budget Improvement: Investigate slow requests and failures for Product Planning
# Summary
We investigated the increase in error budget spend for gitlab~10690700 by analysing usage logs over a one-month period. Product Planning has had an [exception](https://handbook.gitlab.com/handbook/engineering/error-budgets/#stage-groups-with-different-error-budgets), with its target set to 99.89%, since May 2024.
## Overview of requests
<details><summary>2025-01-15</summary>
**:link: [Kibana Dashboard (Last 7 Days)](https://log.gprd.gitlab.net/app/r/s/KTHEH)**
The investigation below uses log data as of 2025-01-15 with a 7-day look-back, i.e. a time range of **2025-01-08 to 2025-01-15**. Kibana does not retain logs older than 7 days, so if you open the dashboard linked above at a later date, the actual numbers will differ. If the underlying distribution of failing and slow requests stays consistent, the percentages should remain similar.
During this period, gitlab~10690700 served about **6,198,216 requests**. The chart below shows the distribution across Success, Failure, Invalid, and Other requests:
```mermaid
%%{
init: {
'theme': 'base',
'themeVariables': {
'pie1': '#aed581',
'pie2': '#e0e0e0',
'pie3': '#fff59d',
'pie4': '#e57373'
}
}
}%%
pie showData
title Request Distribution
"Success" : 5422761
"Failures" : 260
"Invalid" : 41268
"Other" : 745604
```
## Failing requests
- A total of **260 requests** failed during this time.
- 128 hits point to the `groupWorkItems` query, i.e. ~50% of all failures.
- 63 hits point to the `getWorkItemStateCounts` query, i.e. ~25% of all failures.
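The shares quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the counts from this section; the `share` helper and the variable names are illustrative and not part of any GitLab tooling.

```python
# Failure counts from the 2025-01-08 to 2025-01-15 window described above.
TOTAL_FAILURES = 260
failures_by_query = {
    "groupWorkItems": 128,
    "getWorkItemStateCounts": 63,
}


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share(n, TOTAL_FAILURES) for name, n in failures_by_query.items()}

# Failures not attributed to either of the two queries listed above.
remaining = TOTAL_FAILURES - sum(failures_by_query.values())

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"other queries: {remaining} failures ({share(remaining, TOTAL_FAILURES)}%)")
```

Together the two queries account for roughly three quarters of the failures in this window.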
## Slow requests
- We have [documented urgency levels](https://docs.gitlab.com/ee/development/application_slis/rails_request.html#how-to-adjust-the-urgency) that determine the maximum duration (in seconds) within which a request should complete. Urgency can be set individually per endpoint. If no urgency is specified, the default urgency is 1 second.
- If a request takes longer than the target duration for its endpoint's urgency, it counts against our error budget.
- In this section, we look at the **5,422,761 successful requests**, along with their urgency level and the actual time they took to complete.
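The rule described above can be sketched as a simple threshold check. This is only an illustration: the `Request` class, the `breaches_target` helper, and the sample durations are hypothetical, and only the three urgency levels discussed in this issue are included.

```python
from dataclasses import dataclass

# Target durations in seconds for the urgency levels discussed in this issue.
URGENCY_TARGETS = {"medium": 0.5, "default": 1.0, "low": 5.0}


@dataclass
class Request:
    endpoint: str
    duration_s: float
    urgency: str = "default"  # endpoints without an explicit urgency use the default


def breaches_target(request: Request) -> bool:
    """Return True when the request took longer than its urgency target."""
    target = URGENCY_TARGETS.get(request.urgency, URGENCY_TARGETS["default"])
    return request.duration_s > target


# Hypothetical sample durations, for illustration only.
samples = [
    Request("/autocomplete", 0.7, "medium"),
    Request("/api/v4/groups/:id/epics/:epic_iid", 0.4),
    Request("workItemTreeQuery", 6.2, "low"),
    Request("groupWorkItems", 3.1, "low"),
]

breaching = [r.endpoint for r in samples if breaches_target(r)]
print(breaching)
```

A request that is slow in absolute terms (3.1 seconds for `groupWorkItems` here) does not count as breaching when its endpoint is marked `low` urgency, which is why the urgency level has to be read alongside the duration.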
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px'}}}%%
graph TD
subgraph Total["Total Successful Requests: 5,422,761"]
subgraph Medium["Medium Urgency (<0.5s)"]
M1["Total Requests: 7,758"]
M2["Breaching Requests: 178"]
end
subgraph Default["Default Urgency (<1s)"]
D1["Total Requests: 2,286,570"]
D2["Breaching Requests: 8,310"]
D3["/api/v4/groups/:id/epics/:epic_iid: 5,433"]
D4["Epic Web UI: 297"]
end
subgraph Low["Low Urgency (<5s)"]
L1["Total Requests: 3,128,046"]
L2["Breaching Requests: 2,181"]
L3["workItemTreeQuery: 444"]
L4["workItemParticipants: 221"]
L5["/api/v4/groups/:id/epics: 220"]
L6["groupWorkItems: 136"]
L7["getWorkItems: 103"]
L8["groupEpics: 101"]
end
end
classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px;
classDef medium fill:#e1f5fe,stroke:#0288d1,stroke-width:2px;
classDef defaultUrg fill:#e8f5e9,stroke:#388e3c,stroke-width:2px;
classDef low fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
class Total default;
class Medium,M1,M2 medium;
class Default,D1,D2,D3,D4 defaultUrg;
class Low,L1,L2,L3,L4,L5,L6,L7,L8 low;
```
### Medium urgency (<`0.5s`)
- Of the successful requests, 7,758 are of `medium` urgency. These come mainly from usage of the `/autocomplete` endpoint.
- Of these, 178 requests took more than 0.5 seconds, breaching the urgency level's target duration.
### Default urgency (<`1s`)
- Of the successful requests, 2,286,570 are of `default` urgency.
- Of these, 8,310 requests (i.e. 0.36%) took anywhere between 1 and 41 seconds, breaching the target for this level.
  [Image: Kibana snapshot of the duration distribution for these requests]
- Of these breaching requests, 5,433 (i.e. ~65%) come from public REST API usage of the endpoint `/api/:version/groups/:id/epics/:epic_iid`, suggesting that the epics being accessed are inherently slow to load.
- 297 hits come from accessing the Epic web UI on GitLab.
```mermaid
pie title Breaching Requests (Total: 8,310)
"Public REST API" : 5433
"Epic Web UI" : 297
"Other Requests" : 2580
```
### Low urgency (<`5s`)
- Of the successful requests, 3,128,046 are of `low` urgency.
- Of these, 2,181 requests (i.e. 0.07%) took anywhere between 5 and 54 seconds, breaching the target for this level.
  [Image: Kibana snapshot of the duration distribution for these requests]
- Of these breaching requests, 444 (i.e. ~20%) use the [`workItemTreeQuery`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/assets/javascripts/work_items/graphql/work_item_tree.query.graphql) query, which comes from the Hierarchy Widget.
- 221 hits (i.e. ~10%) use the [`workItemParticipants`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/assets/javascripts/work_items/graphql/work_item_participants.query.graphql) query, which loads the participants list on the Work Item detail page.
- 220 hits (i.e. ~10%) use the public REST API endpoint `/api/:version/groups/:id/epics`.
- 136 hits (i.e. ~6%) use the [`groupWorkItems`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/assets/javascripts/work_items/graphql/group_work_items.query.graphql) query, which is used within the Work Item detail page in places like the `Parent` dropdown and hierarchy/linked items widget autocompletion.
- 103 hits (i.e. ~5%) use the [`getWorkItems`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/assets/javascripts/work_items/graphql/list/get_work_items.query.graphql) query, which comes from the Work Items-based Epics list, currently enabled only on internal groups, including `gitlab-org` and `gitlab-com`.
- 101 hits (i.e. ~5%) use the [`groupEpics`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/assets/javascripts/epics_list/queries/group_epics.query.graphql) query, which comes from the Epics list using the legacy Epics GraphQL API.
```mermaid
pie title Breaching Requests (Total: 2,181)
"workItemTreeQuery (20%)" : 444
"workItemParticipants (10%)" : 221
"Public REST API (10%)" : 220
"groupWorkItems (6%)" : 136
"getWorkItems (5%)" : 103
"groupEpics (5%)" : 101
"Other Requests (44%)" : 956
```
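To compare the three urgency levels side by side, the totals and breaching counts reported above can be turned into breach rates. This is a minimal sketch; the dictionary layout and the `breach_rate` helper are illustrative only.

```python
# Totals and breaching counts per urgency level, as reported in this section.
levels = {
    "medium": {"target_s": 0.5, "total": 7_758, "breaching": 178},
    "default": {"target_s": 1.0, "total": 2_286_570, "breaching": 8_310},
    "low": {"target_s": 5.0, "total": 3_128_046, "breaching": 2_181},
}


def breach_rate(total: int, breaching: int) -> float:
    """Percentage of requests that exceeded their urgency target."""
    return 100 * breaching / total


rates = {
    name: round(breach_rate(data["total"], data["breaching"]), 2)
    for name, data in levels.items()
}
total_breaching = sum(data["breaching"] for data in levels.values())

for name, rate in rates.items():
    print(f"{name} (<{levels[name]['target_s']}s): {rate}% breaching")
print(f"breaching requests across all three levels: {total_breaching:,}")
```

On these numbers, `medium` urgency has the highest breach rate but the smallest volume, while `default` urgency contributes the most breaching requests in absolute terms.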
</details>
### 2025-07-07
**Range**: Last 14 days
**Links**:
- [Error budget Grafana dashboard](https://dashboards.gitlab.net/goto/soAG5AsHg?orgId=1)
- [Sidekiq error and Apdex Grafana dashboard](https://dashboards.gitlab.net/goto/PzkI5AyNg?orgId=1)
- [Rails requests error and Apdex Grafana dashboard](https://dashboards.gitlab.net/goto/ilPn50sNg?orgId=1)
- [Plan stage error budget Kibana dashboard](https://log.gprd.gitlab.net/app/r/s/KXUfR)
#### List of contributions to the budget, in descending order
1. API Apdex
| Endpoint | Urgency | Operations | Apdex | Impact :arrow_up_small: |
| ---------------------------------- | ------- | ---------- | ------ | ------- |
| graphql:workItemParticipants | default | 154,100 | 0.8869 | 17,352 |
| graphql:projectWorkItemAwardEmojis | default | 17,188,611 | 0.9965 | 51,566 |
| graphql:workItemTreeQuery | default | 216,587 | 0.6291 | 80,224 |
| graphql:namespaceWorkItem | default | 1,668,087 | 0.9406 | 98,250 |
| graphql:workItemNotesByIid | default | 4,366,358 | 0.9055 | 410,438 |
The two highest-impact queries, `graphql:workItemNotesByIid` and `graphql:namespaceWorkItem`, are also the highest contributors to the Project Management API Apdex, so focusing on these will help both budgets.
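To see how concentrated the API Apdex impact is, the values from the table above can be ranked and expressed as shares of the listed total. This is a minimal sketch using only the five listed endpoints; the variable names are illustrative.

```python
# Impact values copied from the API Apdex table above.
api_apdex_impact = {
    "graphql:workItemParticipants": 17_352,
    "graphql:projectWorkItemAwardEmojis": 51_566,
    "graphql:workItemTreeQuery": 80_224,
    "graphql:namespaceWorkItem": 98_250,
    "graphql:workItemNotesByIid": 410_438,
}

total_impact = sum(api_apdex_impact.values())

# Sort endpoints from highest to lowest impact.
ranked = sorted(api_apdex_impact.items(), key=lambda item: item[1], reverse=True)

for endpoint, impact in ranked:
    pct = 100 * impact / total_impact
    print(f"{endpoint}: {impact:,} ({pct:.1f}% of listed impact)")

# Combined share of the two highest-impact endpoints.
top_two_share = round(100 * sum(impact for _, impact in ranked[:2]) / total_impact, 1)
print(f"top two endpoints: {top_two_share}% of listed impact")
```

Among the endpoints listed, the top two account for a little over three quarters of the impact, which supports prioritising them first.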
2. Web Apdex
| Endpoint | Urgency | Operations | Apdex | Impact :arrow_up_small: |
| --------------------------------------------- | ------- | ---------- | ------ | ----------- |
| Projects::AutocompleteSourcesController#epics | medium | 9,345 | 0.9492 | 93,394,405 |
| Groups::EpicsController#index | default | 12,261 | 0.9812 | 122,536,665 |
| Groups::EpicsController#show | default | 53,374 | 0.9672 | 533,421,507 |
3. Sidekiq error
| worker | job_urgency | operations | errors | queue | impact :arrow_up_small: |
| ------------------------------------------------------------------------- | ----------- | ---------- | ------ | ------------ | ------ |
| DesignManagement::NewVersionWorker | low | 2,836 | 0.18% | memory_bound | 5 |
| Issuable::RelatedLinksCreateWorker | high | 23,941 | 0.05% | urgent_other | 11 |
| WorkItems::RolledupDates::UpdateMilestoneRelatedWorkItemDatesEventHandler | low | 2,100 | 0.57% | default | 12 |
| WorkItems::RolledupDates::UpdateMultipleRolledupDatesWorker | low | 100,789 | 0.07% | default | 70 |
This is low priority, but the biggest contributor to Sidekiq errors is `WorkItems::RolledupDates::UpdateMultipleRolledupDatesWorker`, which raises the following exception:
```
PG::QueryCanceled: ERROR: canceling statement due to statement timeout CONTEXT: SQL statement "UPDATE issues SET start_date = NEW.start_date, due_date = NEW.due_date WHERE issues.id = NEW.issue_id" PL/pgSQL function sync_issues_dates_with_work_item_dates_sources() line 3 at SQL statement
```
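For the Sidekiq table above, the impact column appears to track the number of failed jobs, i.e. operations multiplied by the error rate. That reading is an assumption and is not stated by the dashboard. Because the error rates are rounded to two decimals, the estimate only matches to within about one job. A minimal sketch of the check:

```python
# Rows copied from the Sidekiq error table above:
# (worker, operations, error rate in percent, impact as listed)
sidekiq_rows = [
    ("DesignManagement::NewVersionWorker", 2_836, 0.18, 5),
    ("Issuable::RelatedLinksCreateWorker", 23_941, 0.05, 11),
    ("WorkItems::RolledupDates::UpdateMilestoneRelatedWorkItemDatesEventHandler", 2_100, 0.57, 12),
    ("WorkItems::RolledupDates::UpdateMultipleRolledupDatesWorker", 100_789, 0.07, 70),
]


def estimated_errors(operations: int, error_pct: float) -> float:
    """Estimate failed jobs from the operation count and the rounded error rate."""
    return operations * error_pct / 100


estimates = {
    worker: estimated_errors(ops, pct) for worker, ops, pct, _ in sidekiq_rows
}

for worker, ops, pct, impact in sidekiq_rows:
    print(f"{worker}: ~{estimates[worker]:.1f} estimated errors, listed impact {impact}")
```

Under that assumption, `UpdateMultipleRolledupDatesWorker` stands out because of its volume rather than its error rate, which is among the lowest in the table.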