Error Budget Improvement: Investigate slow requests and failures for Product Planning

Summary

We investigated the increase in error budget spend for group::product planning by analyzing usage logs over a 1-month period. Product Planning has been in exception, with its target set to 99.89%, since May 2024.

Overview of requests

🔗 Kibana Dashboard (Last 7 Days)

The investigation below uses log data as of 2025-01-15 with a 7-day look-back, i.e. the range 2025-01-08 to 2025-01-15. Kibana does not retain logs older than 7 days, so if you open the dashboard linked above at a later date, the absolute numbers will differ. Provided the underlying distribution of failing and slow requests stays consistent, the percentages should remain similar.

During this period, group::product planning served about 6,198,216 requests. The following chart shows the distribution across Success, Failure, Invalid, and Other requests:

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'pie1': '#aed581',
      'pie2': '#e0e0e0',
      'pie3': '#fff59d',
      'pie4': '#e57373'
    }
  }
}%%
pie showData
    title Request Distribution
    "Success" : 5422761
    "Failures" : 260
    "Invalid" : 41268
    "Other" : 745604
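
For readers who want the chart as percentages, the following is a minimal sketch that converts the four counts into shares. The names `request_counts` and `shares` are illustrative only, and the shares are taken over the sum of the four chart categories rather than the quoted approximate total, an assumption made here so that the percentages add up to 100.

```python
# Counts copied from the "Request Distribution" chart above.
request_counts = {
    "Success": 5_422_761,
    "Failures": 260,
    "Invalid": 41_268,
    "Other": 745_604,
}


def shares(counts):
    """Return each category's share of the summed counts, in percent."""
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.items()}


if __name__ == "__main__":
    for name, pct in shares(request_counts).items():
        print(f"{name:<9} {request_counts[name]:>10,} {pct:8.4f}%")
```

Failures make up only a tiny fraction of the traffic, which is why the rest of this investigation also looks at slow requests.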

Failing requests

A total of 260 requests failed during this period.
128 hits point to the groupWorkItems query, i.e. ~49% of all failures.
63 hits point to the getWorkItemStateCounts query, i.e. ~24% of all failures.

Slow requests

We have documented urgency levels that determine the maximum duration (in seconds) within which a request should complete. Urgency can be set individually for each endpoint. If no urgency is specified, the default urgency of 1 second applies.
- If a request takes longer than the urgency duration set for its endpoint, it counts against our error budget.
In this section, we look at the 5,422,761 successful requests, their urgency levels, and the actual time they took to complete.
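
The rule above can be sketched as a small check. This is only an illustration under stated assumptions: the thresholds are the three mentioned in this section, and `URGENCY_THRESHOLDS_S` and `breaches_urgency` are hypothetical names, not part of any GitLab API.

```python
from typing import Optional

# Thresholds in seconds for the urgency levels covered in this section.
# Hypothetical lookup table for illustration only.
URGENCY_THRESHOLDS_S = {"medium": 0.5, "default": 1.0, "low": 5.0}


def breaches_urgency(duration_s: float, urgency: Optional[str] = None) -> bool:
    """Return True if a request ran longer than its urgency level allows.

    An endpoint without an explicit urgency falls back to "default" (1 second).
    """
    threshold = URGENCY_THRESHOLDS_S[urgency or "default"]
    return duration_s > threshold


if __name__ == "__main__":
    print(breaches_urgency(0.7, "medium"))  # slower than 0.5s
    print(breaches_urgency(0.7))            # within the 1s default
    print(breaches_urgency(5.5, "low"))     # slower than 5s
```

A request that succeeds but exceeds its threshold is still counted as a breach, which is how successful requests end up spending error budget.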

%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px'}}}%%
graph TD
    subgraph Total["Total Successful Requests: 5,422,761"]
        subgraph Medium["Medium Urgency (<0.5s)"]
            M1["Total Requests: 7,758"]
            M2["Breaching Requests: 178"]
        end
        subgraph Default["Default Urgency (<1s)"]
            D1["Total Requests: 2,286,570"]
            D2["Breaching Requests: 8,310"]
            D3["/api/v4/groups/:id/epics/:epic_iid: 5,433"]
            D4["Epic Web UI: 297"]
        end
        subgraph Low["Low Urgency (<5s)"]
            L1["Total Requests: 3,128,046"]
            L2["Breaching Requests: 2,181"]
            L3["workItemTreeQuery: 444"]
            L4["/api/v4/groups/:id/epics: 220"]
            L5["groupWorkItems: 136"]
            L6["getWorkItems: 103"]
            L7["groupEpics: 101"]
        end
    end

    classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px;
    classDef medium fill:#e1f5fe,stroke:#0288d1,stroke-width:2px;
    classDef defaultUrg fill:#e8f5e9,stroke:#388e3c,stroke-width:2px;
    classDef low fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
    
    class Total default;
    class Medium,M1,M2 medium;
    class Default,D1,D2,D3,D4 defaultUrg;
    class Low,L1,L2,L3,L4,L5,L6,L7 low;

Medium urgency (<`0.5s`)

Of the successful requests, 7,758 are of medium urgency, mainly usage of the /autocomplete endpoint.
Of these, 178 requests (i.e. ~2.3%) took more than 0.5 seconds, breaching the urgency level's target duration.
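
To put the three urgency levels side by side, the sketch below computes the breach rate for each from the totals shown in the graph above. The names `levels` and `breach_rate` are illustrative and not taken from any existing tooling.

```python
# (total successful requests, breaching requests) per urgency level,
# copied from the graph in the "Slow requests" section.
levels = {
    "medium (<0.5s)": (7_758, 178),
    "default (<1s)": (2_286_570, 8_310),
    "low (<5s)": (3_128_046, 2_181),
}


def breach_rate(total: int, breaching: int) -> float:
    """Return the percentage of requests that exceeded their target duration."""
    return 100 * breaching / total


if __name__ == "__main__":
    for name, (total, breaching) in levels.items():
        print(f"{name:<15} {breaching:>6,} of {total:>10,} = {breach_rate(total, breaching):.2f}%")
```

By this measure, medium urgency has the highest breach rate even though it contributes the fewest breaching requests in absolute terms.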

Default urgency (<`1s`)

Of the successful requests, 2,286,570 are of default urgency.
Of these, 8,310 requests (i.e. 0.36%) took between 1 and 41 seconds, breaching the target for this level. Here's a distribution snapshot from Kibana:

Of these hits, 5,433 requests (i.e. ~65%) come from public REST API usage of the endpoint /api/:version/groups/:id/epics/:epic_iid, which suggests that the epics being accessed are inherently slow to load.
297 hits come from accessing the Epic web UI on GitLab.

pie title Breaching Requests (Total: 8,310)
    "Public REST API" : 5433
    "Epic Web UI" : 297
    "Other Requests" : 2580

Low urgency (<`5s`)

Of the successful requests, 3,128,046 are of low urgency.
Of these, 2,181 requests (i.e. 0.07%) took between 5 and 54 seconds, breaching the target for this level. Here's a distribution snapshot from Kibana:

Of these hits, 444 requests (i.e. ~20%) use workItemTreeQuery, which comes from the Hierarchy Widget.
220 hits (i.e. ~10%) use the public REST API endpoint /api/:version/groups/:id/epics.
136 hits (i.e. ~6%) use the groupWorkItems query, which is used on the Work Item detail page in places like the Parent dropdown and the hierarchy/linked items widget autocompletion.
103 hits (i.e. ~5%) use the getWorkItems query, which comes from the Work Items-based Epics list, currently enabled only on internal groups, including gitlab-org & gitlab-com.
101 hits (i.e. ~5%) use the groupEpics query, which comes from the Epics list using the legacy Epics GraphQL API.

pie title Breaching Requests (Total: 2,181)
    "workItemTreeQuery (20%)" : 444
    "Public REST API (10%)" : 220
    "groupWorkItems (6%)" : 136
    "getWorkItems (5%)" : 103
    "groupEpics (5%)" : 101
    "Other Requests (54%)" : 1177
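
The "Other Requests" slice in the chart above is the remainder once the named sources are subtracted from the total. A minimal sketch of that calculation follows; `breakdown` and `low_urgency_sources` are hypothetical names used only for illustration.

```python
# Named sources of low urgency breaches, copied from the list above.
low_urgency_sources = {
    "workItemTreeQuery": 444,
    "Public REST API": 220,
    "groupWorkItems": 136,
    "getWorkItems": 103,
    "groupEpics": 101,
}


def breakdown(total, named):
    """Return (name, count, percent) rows, appending the unattributed remainder."""
    remainder = total - sum(named.values())
    if remainder < 0:
        raise ValueError("named counts exceed the total")
    rows = dict(named)
    rows["Other Requests"] = remainder
    return [(name, count, 100 * count / total) for name, count in rows.items()]


if __name__ == "__main__":
    for name, count, pct in breakdown(2_181, low_urgency_sources):
        print(f"{name:<18} {count:>5,} {pct:5.1f}%")
```

The same helper could be applied to the default urgency figures (8,310 total, with 5,433 and 297 as the named sources) to reproduce that chart's remainder.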

Edited Jan 16, 2025 by Kushal Pandya