Plan: Project Management Error Budgets
Problem
We are severely over our error budget and need to get back within our new 99.95% availability and performance SLA.
Context
Scalability review
gitlab-com/gl-infra/scalability#1101 (comment 585346307)
Element | Apdex Success Ratio | Success Ratio |
---|---|---|
{component="sidekiq_execution"} | 0.999562943272945 | 0.98944369784859 |
{component="puma"} | 0.980669484122662 | 0.999999998366837 |
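To see which of the ratios above fall short, they can be compared against the SLA. This is a minimal sketch, assuming the same 0.9995 target applies to both the apdex and the error-rate success ratio; the names `TARGET` and `below_target` are illustrative and not part of any GitLab tooling.

```python
# Success ratios copied from the table above.
TARGET = 0.9995  # assumption: the 99.95% SLA applies to both ratios

ratios = {
    "sidekiq_execution": {"apdex": 0.999562943272945, "success": 0.98944369784859},
    "puma": {"apdex": 0.980669484122662, "success": 0.999999998366837},
}


def below_target(ratios, target=TARGET):
    """Return (component, metric, shortfall) for every ratio under the target."""
    misses = []
    for component, metrics in ratios.items():
        for metric, value in metrics.items():
            if value < target:
                misses.append((component, metric, target - value))
    return misses


for component, metric, shortfall in below_target(ratios):
    print(f"{component} {metric}: {shortfall:.4%} below target")
```

Under that assumption, the two ratios that miss the target are the Sidekiq success ratio and the Puma apdex, which matches the two largest values in the error-type table.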
Element | Value |
---|---|
{component="puma",error_type="error_rate"} | 0.0500713646008383 |
{component="sidekiq_execution",error_type="apdex"} | 339.20501063088886 |
{component="sidekiq_execution",error_type="error_rate"} | 8195.450961067867 |
{component="puma",error_type="apdex"} | 234085.47333695722 |
This table shows the number of requests with an unsatisfactory duration in the past 7 days: https://log.gprd.gitlab.net/goto/2323c5ab128fdfd98f00ae37bd2073b8
json.meta.caller_id.keyword: Descending | Count |
---|---|
GET /api/:version/projects | 2,013,923 |
ProjectsController#show | 181,078 |
GET /api/:version/projects/:id | 172,809 |
RootController#index | 165,929 |
Projects::NotesController#index | 148,166 |
This means that improving the performance of GET /api/:version/projects would have the biggest impact on the error budget.
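As a rough check of that conclusion, the share of each endpoint can be computed from the counts in the table. This sketch only covers the five listed rows, so the share of all slow requests on the platform would be lower; the variable names are illustrative.

```python
# Counts copied from the table above (past 7 days, top five caller_ids only).
slow_requests = {
    "GET /api/:version/projects": 2_013_923,
    "ProjectsController#show": 181_078,
    "GET /api/:version/projects/:id": 172_809,
    "RootController#index": 165_929,
    "Projects::NotesController#index": 148_166,
}

# Total over the listed rows only, not over every endpoint.
listed_total = sum(slow_requests.values())

for caller_id, count in slow_requests.items():
    print(f"{caller_id}: {count / listed_total:.1%} of the listed slow requests")
```

Among the five listed endpoints, GET /api/:version/projects accounts for roughly three quarters of the slow requests.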
Looking into what that endpoint is spending time on, the logs show us these averages (so this hides outliers): https://log.gprd.gitlab.net/goto/1bb3cb0d34dc9ef8378daab5bf8ef296
This shows that the database time alone already pushes us near or over the 1s threshold, so there may be some queries to optimize. This is not entirely surprising, since a lot of data is loaded there. This search reveals a bunch of slow queries originating from this endpoint.
Trying this out in the performance bar also revealed that some of the queries for that endpoint pass in a huge list of ids rather than using a subquery. These come from Projects::BatchCountService, as the comment indicates.
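The difference between the two query shapes can be illustrated with a small, self-contained sketch. It uses Python's `sqlite3` module and a hypothetical two-table schema (`projects`, `issues`) that is not GitLab's actual schema, and it does not reproduce what Projects::BatchCountService does; it only shows an explicit id list versus a subquery returning the same counts.

```python
import sqlite3

# Hypothetical schema for illustration only, not GitLab's real tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY, namespace_id INTEGER);
    CREATE TABLE issues (id INTEGER PRIMARY KEY, project_id INTEGER);
""")
conn.executemany("INSERT INTO projects VALUES (?, ?)",
                 [(i, i % 2) for i in range(1, 7)])
conn.executemany("INSERT INTO issues (project_id) VALUES (?)",
                 [(1,), (2,), (2,), (4,), (4,), (4,), (5,)])

# Shape 1: load the ids into the application, then send them all back
# as a literal list. The statement grows with the number of projects.
ids = [row[0] for row in
       conn.execute("SELECT id FROM projects WHERE namespace_id = ?", (0,))]
placeholders = ",".join("?" * len(ids))
with_id_list = conn.execute(
    f"SELECT project_id, COUNT(*) FROM issues "
    f"WHERE project_id IN ({placeholders}) "
    f"GROUP BY project_id ORDER BY project_id",
    ids,
).fetchall()

# Shape 2: let the database resolve the ids through a subquery.
# The statement size stays constant and only one round trip is needed.
with_subquery = conn.execute(
    "SELECT project_id, COUNT(*) FROM issues "
    "WHERE project_id IN (SELECT id FROM projects WHERE namespace_id = ?) "
    "GROUP BY project_id ORDER BY project_id",
    (0,),
).fetchall()

print(with_id_list)
print(with_subquery)
```

Both shapes return the same rows here. Whether the subquery form is actually faster for the real endpoint would need to be confirmed against the production query plans.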
Another handy tool to dig into is the flamegraph, which I tried for a list of 100 projects. In a quick scroll through, the HasRepository#default_branch call taking a large amount of time is something that caught my eye:
500 summary
Based on the rolling 24 hour period from 2021-05-12 @ ~12:15PM EDT to 2021-05-13 @ ~12:15PM EDT
Path | 500 count |
---|---|
Boards::IssuesController::index | 779 |
GroupsController::issues | 441 |
ProjectsController::show | 422 |
Projects::IssuesController::index | 402 |
GET /api/v4/issues_statistics | 317 |
Boards::IssuesController::update | 210 |
GET /api/v4/projects | 142 |
Projects::IssuesController::update | 125 |
Projects::NotesController::create | 118 |
Projects::IssuesController::show | 96 |
POST /api/v4/projects/:id/issues/:id | 74 |
DashboardsController::issues | 73 |
Projects::NotesController::index | 73 |
PUT /api/v4/projects/:id/issues/:id | 61 |
Projects::DiscussionsController::show | 55 |
Projects::IssuesController::discussions | 48 |
Explore::ProjectsController::index | 38 |
Projects::IssuesController::related_branches | 35 |
GET /api/v4/projects/:id/issues/:id | 34 |
GET /api/v4/projects/:id/issues/:id/related_merge_requests | 34 |
Projects::IssuesController::create | 26 |
Projects::NotesController::update | 23 |
Dashboard::ProjectsController::index | 22 |
RootController::index | 19 |
ProjectsController::transfer | 9 |
POST /api/v4/projects/:id/issues/:id/notes | 9 |
Projects::IssueLinksController::create | 5 |
ProjectsController::create | 3 |
Acceptance
- 99.95% Availability
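
To make the 99.95% target concrete, it can be converted into an allowance of fully unavailable minutes. This is a minimal sketch; the 7-day window mirrors the log queries above, the 28-day window is an assumption, and `allowed_downtime_minutes` is an illustrative helper rather than an existing function.

```python
def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    """Minutes of full unavailability permitted by an availability target."""
    return (1.0 - slo) * window_days * 24 * 60


# Assumption: the budget is evaluated over a fixed window of whole days.
for days in (7, 28):
    minutes = allowed_downtime_minutes(0.9995, days)
    print(f"{days} days at 99.95%: {minutes:.2f} minutes")
```

At 99.95%, this works out to about 5 minutes per 7 days, or about 20 minutes per 28 days.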