Error Budget Improvement: Investigate slow requests and failures for Product Planning
# Summary

We investigated the increase in error budget spend for gitlab~10690700 by reviewing usage logs over a one-month period. Product Planning is covered by an [exception](https://handbook.gitlab.com/handbook/engineering/error-budgets/#stage-groups-with-different-error-budgets), with its target set to 99.89% since May 2024.

## Overview of requests

<details><summary>2025-01-15</summary>

**:link: [Kibana Dashboard (Last 7 Days)](https://log.gprd.gitlab.net/app/r/s/KTHEH)**

The investigation below uses log data as of 2025-01-15 with a look-back of 7 days, i.e. the range **2025-01-08 to 2025-01-15**. Since Kibana does not retain logs older than 7 days, the numbers will differ if you open the dashboard linked above at a later date. Provided the underlying distribution of failing and slow requests stays consistent, the percentages may still be similar.

During this period, gitlab~10690700 served about **6,198,216 requests**. The following chart shows the distribution across Success, Failure, Invalid, and Other requests:

```mermaid
%%{ init: { 'theme': 'base', 'themeVariables': { 'pie1': '#aed581', 'pie2': '#e0e0e0', 'pie3': '#fff59d', 'pie4': '#e57373' } } }%%
pie showData
    title Request Distribution
    "Success" : 5422761
    "Failures" : 260
    "Invalid" : 41268
    "Other" : 745604
```

## Failing requests

- A total of **260 requests** failed during this time.
- 128 hits point to the `groupWorkItems` query, i.e. ~50% of all failures.
- 63 hits point to the `getWorkItemStateCounts` query, i.e. ~25% of all failures.

## Slow requests

- We have [documented urgency levels](https://docs.gitlab.com/ee/development/application_slis/rails_request.html#how-to-adjust-the-urgency) that determine the maximum duration (in seconds) within which a request should complete. Urgency can be set individually per endpoint. If no urgency is specified, the default urgency is 1 second.
- If a request takes longer than the urgency duration set for its endpoint, it counts against our error budget.
- In this section, we looked into **5,422,761 successful requests**, along with their urgency level and the actual time they took to complete.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px'}}}%%
graph TD
    subgraph Total["Total Successful Requests: 5,422,761"]
        subgraph Medium["Medium Urgency (<0.5s)"]
            M1["Total Requests: 7,758"]
            M2["Breaching Requests: 178"]
        end
        subgraph Default["Default Urgency (<1s)"]
            D1["Total Requests: 2,286,570"]
            D2["Breaching Requests: 8,310"]
            D3["/api/v4/groups/:id/epics/:epic_iid: 5,433"]
            D4["Epic Web UI: 297"]
        end
        subgraph Low["Low Urgency (<5s)"]
            L1["Total Requests: 3,128,046"]
            L2["Breaching Requests: 2,181"]
            L3["workItemTreeQuery: 444"]
            L4["workItemParticipants: 221"]
            L5["/api/v4/groups/:id/epics: 220"]
            L6["groupWorkItems: 136"]
            L7["getWorkItems: 103"]
            L8["groupEpics: 101"]
        end
    end
    classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px;
    classDef medium fill:#e1f5fe,stroke:#0288d1,stroke-width:2px;
    classDef defaultUrg fill:#e8f5e9,stroke:#388e3c,stroke-width:2px;
    classDef low fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
    class Total default;
    class Medium,M1,M2 medium;
    class Default,D1,D2,D3,D4 defaultUrg;
    class Low,L1,L2,L3,L4,L5,L6,L7,L8 low;
```

### Medium urgency (<`0.5s`)

- Of the successful requests, 7,758 are of `medium` urgency. This is mainly usage of the `/autocomplete` endpoint.
- Of these, 178 requests take more than 0.5 seconds, breaching the target duration for this urgency level.

### Default urgency (<`1s`)

- Of the successful requests, 2,286,570 are of `default` urgency.
- Of these, 8,310 requests (i.e. 0.36%) take anywhere between 1 and 41 seconds, breaching the target for this level. Here's a distribution snapshot from Kibana:

  ![image](/uploads/d1c174b616775d2938c85f33b3cb7e0e/image.png){width=1093 height=422}
- Of these hits, 5,433 requests (i.e. ~65%) come from public REST API usage of the endpoint `/api/:version/groups/:id/epics/:epic_iid`, indicating that the epics being accessed are inherently slow.
- 297 hits come from accessing the Epic web UI on GitLab.

```mermaid
pie title Breaching Requests (Total: 8,310)
    "Public REST API" : 5433
    "Epic Web UI" : 297
    "Other Requests" : 2580
```

### Low urgency (<`5s`)

- Of the successful requests, 3,128,046 are of `low` urgency.
- Of these, 2,181 requests (i.e. 0.07%) take anywhere between 5 and 54 seconds, breaching the target for this level. Here's a distribution snapshot from Kibana:

  ![image](/uploads/6dfea1fa15bd987a3d2f7b7a663df661/image.png){width=1082 height=422}
- Of these hits, 444 requests (i.e. ~20%) use the [`workItemTreeQuery`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/assets/javascripts/work_items/graphql/work_item_tree.query.graphql) query, which comes from the Hierarchy Widget.
- 221 hits (i.e. ~10%) use the [`workItemParticipants`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/assets/javascripts/work_items/graphql/work_item_participants.query.graphql) query, which loads the participants list on the Work Item detail page.
- 220 hits (i.e. ~10%) use the public REST API endpoint `/api/:version/groups/:id/epics`.
- 136 hits (i.e. ~6%) use the [`groupWorkItems`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/assets/javascripts/work_items/graphql/group_work_items.query.graphql) query, which is used within the Work Item detail page in places like the `Parent` dropdown and hierarchy/linked items widget autocompletion.
- 103 hits (i.e. ~5%) use the [`getWorkItems`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/assets/javascripts/work_items/graphql/list/get_work_items.query.graphql) query, which comes from the Work Items-based Epics list, currently enabled only on internal groups, including `gitlab-org` & `gitlab-com`.
- 101 hits (i.e. ~5%) use the [`groupEpics`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/assets/javascripts/epics_list/queries/group_epics.query.graphql) query, which comes from the Epics list using the legacy Epics GraphQL API.

```mermaid
pie title Breaching Requests (Total: 2,181)
    "workItemTreeQuery (20%)" : 444
    "Public REST API (10%)" : 220
    "groupWorkItems (6%)" : 136
    "getWorkItems (5%)" : 103
    "groupEpics (5%)" : 101
    "Other Requests (54%)" : 1177
```

</details>

### 2025-07-07

**Range**: Last 14 days

**Links**:

- [Error budget Grafana dashboard](https://dashboards.gitlab.net/goto/soAG5AsHg?orgId=1)
- [Sidekiq error and Apdex Grafana dashboard](https://dashboards.gitlab.net/goto/PzkI5AyNg?orgId=1)
- [Rails requests error and Apdex Grafana dashboard](https://dashboards.gitlab.net/goto/ilPn50sNg?orgId=1)
- [Plan stage error budget Kibana dashboard](https://log.gprd.gitlab.net/app/r/s/KXUfR)

#### List of contributions to the budget, in descending order

1. API Apdex

| Endpoint | Urgency | Operations | Apdex | Impact :arrow_up_small: |
| -------- | ------- | ---------- | ----- | ----------------------- |
| graphql:workItemParticipants | default | 154100 | 0.8869 | 17,352 |
| graphql:projectWorkItemAwardEmojis | default | 17188611 | 0.9965 | 51,566 |
| graphql:workItemTreeQuery | default | 216587 | 0.6291 | 80,224 |
| graphql:namespaceWorkItem | default | 1668087 | 0.9406 | 98,250 |
| graphql:workItemNotesByIid | default | 4366358 | 0.9055 | 410,438 |

The highest-impact queries, `graphql:workItemNotesByIid` and `graphql:namespaceWorkItem`, are also the highest contributors to the Project Management API Apdex, so focusing on them will help both budgets.

2. Web Apdex

| Endpoint | Urgency | Operations | Apdex | Impact :arrow_up_small: |
| -------- | ------- | ---------- | ----- | ----------------------- |
| Projects::AutocompleteSourcesController#epics | medium | 9345 | 0.9492 | 93,394,405 |
| Groups::EpicsController#index | default | 12261 | 0.9812 | 122,536,665 |
| Groups::EpicsController#show | default | 53374 | 0.9672 | 533,421,507 |

3. Sidekiq error

| Worker | Job urgency | Operations | Errors | Queue | Impact :arrow_up_small: |
| ------ | ----------- | ---------- | ------ | ----- | ----------------------- |
| DesignManagement::NewVersionWorker | low | 2,836 | 0.18% | memory_bound | 5 |
| Issuable::RelatedLinksCreateWorker | high | 23,941 | 0.05% | urgent_other | 11 |
| WorkItems::RolledupDates::UpdateMilestoneRelatedWorkItemDatesEventHandler | low | 2,100 | 0.57% | default | 12 |
| WorkItems::RolledupDates::UpdateMultipleRolledupDatesWorker | low | 100,789 | 0.07% | default | 70 |

This is low priority, but the biggest contributor to Sidekiq errors is `WorkItems::RolledupDates::UpdateMultipleRolledupDatesWorker`, which raises the following exception:

```
PG::QueryCanceled: ERROR: canceling statement due to statement timeout
CONTEXT: SQL statement "UPDATE issues SET start_date = NEW.start_date, due_date = NEW.due_date WHERE issues.id = NEW.issue_id"
PL/pgSQL function sync_issues_dates_with_work_item_dates_sources() line 3 at SQL statement
```
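
As a rough cross-check of the Sidekiq error table above, the sketch below multiplies each worker's operation count by its error percentage and compares the result with the listed impact. Treating impact as approximately operations × error rate is an assumption inferred from the table's numbers, not a documented definition. Because the published percentages are rounded, an estimate can differ from the listed value by about one job. `WORKERS`, `estimated_failures`, and `compare_with_listed` are illustrative names, not part of any GitLab tooling.

```python
# Rows copied from the Sidekiq error table:
# (worker, operations, error percentage, listed impact).
WORKERS = [
    ("DesignManagement::NewVersionWorker", 2_836, 0.18, 5),
    ("Issuable::RelatedLinksCreateWorker", 23_941, 0.05, 11),
    ("WorkItems::RolledupDates::UpdateMilestoneRelatedWorkItemDatesEventHandler", 2_100, 0.57, 12),
    ("WorkItems::RolledupDates::UpdateMultipleRolledupDatesWorker", 100_789, 0.07, 70),
]


def estimated_failures(operations: int, error_percent: float) -> float:
    """Assumed relationship: failed jobs ~= operations * error rate."""
    return operations * error_percent / 100


def compare_with_listed(rows):
    """Return (worker, estimated failures, listed impact) for each table row."""
    return [
        (worker, estimated_failures(operations, error_percent), listed)
        for worker, operations, error_percent, listed in rows
    ]


if __name__ == "__main__":
    for worker, estimate, listed in compare_with_listed(WORKERS):
        print(f"{worker}: estimated {estimate:.1f}, listed {listed}")
```

Under that assumption, every estimate lands within one job of the listed impact, and `UpdateMultipleRolledupDatesWorker` remains the largest contributor by a wide margin.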
epic