Error Budget Improvement: Investigate slow requests and failures for Product Planning
Summary
We investigated the increase in error budget spend for group::product planning by analyzing usage logs over a 1 month duration. Product Planning has been in exception, with its target set to 99.89%, since May 2024.
Overview of requests
The investigation below uses log data as of 2025-01-15 with a look-back of the past 7 days, i.e. the range 2025-01-08 to 2025-01-15. Since Kibana does not retain logs older than 7 days, the actual numbers will differ if you access the above linked dashboard at a later date. However, as long as the underlying distribution of failing and slow requests stays consistent, the percentages should remain roughly the same.
During this period, group::product planning served about 6,198,216 requests. The following chart shows the distribution across Success, Failure, Invalid, and Other requests:
%%{
init: {
'theme': 'base',
'themeVariables': {
'pie1': '#aed581',
'pie2': '#e0e0e0',
'pie3': '#fff59d',
'pie4': '#e57373'
}
}
}%%
pie showData
title Request Distribution
"Success" : 5422761
"Failures" : 260
"Invalid" : 41268
"Other" : 745604
Failing requests
- A total of 260 requests failed during this time.
- 128 hits point to the groupWorkItems query, i.e. ~50% of all failures.
- 63 hits point to the getWorkItemStateCounts query, i.e. ~25% of all failures.
Slow requests
- We have documented urgency levels that determine the maximum duration (in seconds) within which a request should complete. Urgency can be set individually per endpoint. If no urgency is specified, the default urgency of 1 second applies.
- If a request takes longer than the urgency duration set for its endpoint, it counts against our error budget.
- In this section, we've looked into the 5,422,761 successful requests, along with their urgency levels and the actual time they took to complete.
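The urgency rule described above can be sketched as a small check. This is an illustrative sketch only: the names `URGENCY_TARGETS_S` and `breaches_urgency` are hypothetical, and the thresholds are the ones listed for the medium, default, and low levels in this investigation, not a copy of the actual implementation.

```python
from typing import Optional

# Target durations in seconds per urgency level, as listed in this
# investigation (medium < 0.5s, default < 1s, low < 5s).
URGENCY_TARGETS_S = {"medium": 0.5, "default": 1.0, "low": 5.0}


def breaches_urgency(duration_s: float, urgency: Optional[str] = None) -> bool:
    """Return True if a request took longer than its urgency target.

    A missing urgency falls back to the default level (1 second).
    """
    target = URGENCY_TARGETS_S[urgency or "default"]
    return duration_s > target


if __name__ == "__main__":
    print(breaches_urgency(1.2))            # no urgency set, default target applies
    print(breaches_urgency(0.6, "medium"))  # medium target is 0.5s
    print(breaches_urgency(4.9, "low"))     # low target is 5s
```

Under this sketch, a request with no explicit urgency that takes 1.2 seconds would count against the error budget, while a low-urgency request taking 4.9 seconds would not.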
%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px'}}}%%
graph TD
subgraph Total["Total Successful Requests: 5,422,761"]
subgraph Medium["Medium Urgency (<0.5s)"]
M1["Total Requests: 7,758"]
M2["Breaching Requests: 178"]
end
subgraph Default["Default Urgency (<1s)"]
D1["Total Requests: 2,286,570"]
D2["Breaching Requests: 8,310"]
D3["/api/v4/groups/:id/epics/:epic_iid: 5,433"]
D4["Epic Web UI: 297"]
end
subgraph Low["Low Urgency (<5s)"]
L1["Total Requests: 3,128,046"]
L2["Breaching Requests: 2,181"]
L3["workItemTreeQuery: 444"]
L4["/api/v4/groups/:id/epics: 220"]
L5["groupWorkItems: 136"]
L6["getWorkItems: 103"]
L7["groupEpics: 101"]
end
end
classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px;
classDef medium fill:#e1f5fe,stroke:#0288d1,stroke-width:2px;
classDef defaultUrg fill:#e8f5e9,stroke:#388e3c,stroke-width:2px;
classDef low fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
class Total default;
class Medium,M1,M2 medium;
class Default,D1,D2,D3,D4 defaultUrg;
class Low,L1,L2,L3,L4,L5,L6,L7 low;
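To compare the urgency levels on equal footing, the breach rate per level can be derived from the totals in the diagram above. The snippet below is a minimal sketch; the `levels` mapping and the `breach_rate` helper are illustrative names, and the counts are copied from the diagram.

```python
# (total successful requests, breaching requests) per urgency level,
# copied from the diagram above.
levels = {
    "medium (<0.5s)": (7_758, 178),
    "default (<1s)": (2_286_570, 8_310),
    "low (<5s)": (3_128_046, 2_181),
}


def breach_rate(total: int, breaching: int) -> float:
    """Percentage of requests that exceeded their urgency target."""
    return 100 * breaching / total


for name, (total, breaching) in levels.items():
    print(f"{name:<15} {breach_rate(total, breaching):.2f}%")
```

This reproduces the 0.36% and 0.07% figures quoted for the default and low levels, and gives roughly 2.29% for the medium level, which has far fewer requests overall.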
Medium urgency (<0.5s)
- Of the successful requests, 7,758 are of medium urgency. This is mainly usage of the /autocomplete endpoint.
- Of these, 178 requests take more than 0.5 seconds, breaching the urgency level's target duration.
Default urgency (<1s)
- Of the successful requests, 2,286,570 are of default urgency.
- Of these, 8,310 requests (i.e. 0.36%) take anywhere between 1 and 41 seconds, breaching the target for this level. Here's a distribution snapshot from Kibana:
- Of these hits, 5,433 requests (i.e. ~65%) come from public REST API usage of the endpoint /api/:version/groups/:id/epics/:epic_iid, indicating that the epics being accessed are inherently slow.
- 297 hits are from accessing the Epic web UI on GitLab.
pie title Breaching Requests (Total: 8,310)
"Public REST API" : 5433
"Epic Web UI" : 297
"Other Requests" : 2580
Low urgency (<5s)
- Of the successful requests, 3,128,046 are of low urgency.
- Of these, 2,181 requests (i.e. 0.07%) take anywhere between 5 and 54 seconds, breaching the target for this level. Here's a distribution snapshot from Kibana:
- Of these hits, 444 requests (i.e. ~20%) use workItemTreeQuery, which comes from the Hierarchy Widget.
- 220 hits (i.e. ~10%) use the public REST API endpoint /api/:version/groups/:id/epics.
- 136 hits (i.e. ~6%) use the groupWorkItems query, which is used within the Work Item detail page in places like the Parent dropdown and hierarchy/linked items widget autocompletion.
- 103 hits (i.e. ~5%) use the getWorkItems query, which comes from the Work Items-based Epics list, currently enabled only on internal groups, including gitlab-org & gitlab-com.
- 101 hits (i.e. ~5%) use the groupEpics query, which comes from the Epics list using the legacy Epics GraphQL API.
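The shares in the list above, and the size of the unattributed remainder, can be recomputed with a small helper. This is a sketch under the assumption that every hit not attributed to a named source is grouped as "Other Requests"; the `breakdown` function is a hypothetical name introduced for illustration.

```python
def breakdown(total: int, named: dict[str, int]) -> dict[str, tuple[int, float]]:
    """Map each source to (hits, percent of total), adding an "Other Requests" remainder."""
    other = total - sum(named.values())
    if other < 0:
        raise ValueError("named hits exceed the total")
    rows = {**named, "Other Requests": other}
    return {name: (hits, 100 * hits / total) for name, hits in rows.items()}


# Low-urgency breaching requests, using the counts listed above.
low_urgency = breakdown(
    2_181,
    {
        "workItemTreeQuery": 444,
        "Public REST API": 220,
        "groupWorkItems": 136,
        "getWorkItems": 103,
        "groupEpics": 101,
    },
)

for name, (hits, pct) in low_urgency.items():
    print(f"{name:<18} {hits:>5} {pct:5.1f}%")
```

The same helper applies to the default urgency breakdown: passing a total of 8,310 with 5,433 public REST API hits and 297 Epic web UI hits leaves 2,580 hits unattributed.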
pie title Breaching Requests (Total: 2,181)
"workItemTreeQuery (20%)" : 444
"Public REST API (10%)" : 220
"groupWorkItems (6%)" : 136
"getWorkItems (5%)" : 103
"groupEpics (5%)" : 101
"Other Requests (54%)" : 1177

