Puma Errors: failing requests GET /api/:version/projects/:id/jobs
Summary
Recently, group::pipeline execution opted in to the new SLI durations for endpoints. This triggered an unhealthy drop in error budget from 99.97% to 99.94% (see below).
The GET /api/:version/projects/:id/jobs endpoint is currently the top source of failing requests.
Proposal
Investigate the GET /api/:version/projects/:id/jobs endpoint, review the failures, and determine the best way to fix the issue.
Initial investigation from Kibana
An initial investigation in Kibana shows a PG::QueryCanceled: ERROR: canceling statement due to statement timeout error occurring on the ci_builds table. The SQL is below:
/*application:web,correlation_id:01G22STTS1JN4Z4HDGZG4HQBMA,endpoint_id:GET /api/:version/projects/:id/jobs,db_config_name:ci_replica*/ SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" = $1 AND "ci_builds"."project_id" = $2 AND "ci_builds"."status" IN ($3, $4, $5) ORDER BY id DESC LIMIT $7 OFFSET $6
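The query above pages through results with LIMIT and OFFSET, so for deep pages the database has to walk past every skipped row before returning anything. Keyset pagination, the approach named in the investigation plan, instead resumes from the last id already returned. The following is a minimal sketch of the difference, not the actual fix: it uses an in-memory SQLite table as a hypothetical, heavily simplified stand-in for ci_builds, and the helper names (offset_page, keyset_page), sample data, and status values are assumptions made for illustration.

```python
import sqlite3

# Hypothetical, simplified stand-in for ci_builds: only the columns that
# appear in the query above. The real table is far larger and wider.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ci_builds ("
    "id INTEGER PRIMARY KEY, type TEXT, project_id INTEGER, status TEXT)"
)
statuses = ["success", "failed", "canceled", "running"]  # assumed sample values
conn.executemany(
    "INSERT INTO ci_builds (id, type, project_id, status) VALUES (?, ?, ?, ?)",
    [(i, "Ci::Build", 1, statuses[i % 4]) for i in range(1, 101)],
)

# Same filter shape as the logged query: type, project_id, three statuses.
FILTER = "type = ? AND project_id = ? AND status IN (?, ?, ?)"
PARAMS = ("Ci::Build", 1, "success", "failed", "canceled")


def offset_page(page, per_page):
    """Offset pagination: the database skips (page - 1) * per_page rows first."""
    sql = f"SELECT id FROM ci_builds WHERE {FILTER} ORDER BY id DESC LIMIT ? OFFSET ?"
    args = PARAMS + (per_page, (page - 1) * per_page)
    return [row[0] for row in conn.execute(sql, args)]


def keyset_page(last_seen_id, per_page):
    """Keyset pagination: resume strictly below the last id already returned."""
    if last_seen_id is None:
        sql = f"SELECT id FROM ci_builds WHERE {FILTER} ORDER BY id DESC LIMIT ?"
        args = PARAMS + (per_page,)
    else:
        sql = (
            f"SELECT id FROM ci_builds WHERE {FILTER} AND id < ? "
            "ORDER BY id DESC LIMIT ?"
        )
        args = PARAMS + (last_seen_id, per_page)
    return [row[0] for row in conn.execute(sql, args)]


first = keyset_page(None, 10)
second = keyset_page(first[-1], 10)  # cursor is the last id of the previous page
```

Both helpers return the same pages here. The difference is that the keyset variant lets the database seek directly to the cursor position through an index on id rather than scanning and discarding the skipped rows, which is the cost that grows with OFFSET.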
Investigation Plan
Description | Issue link | Target Milestone | Notes |
---|---|---|---|
Puma Apdex: slow requests Set PATCH /api/:version/jobs/:id/trace urgency to low | #361095 (closed) | %15.0 | workflow::production |
Puma Errors: failing requests GET /api/:version/projects/:id/jobs | | %15.1 | Closed - replaced with #362172 (closed) |
Introduce Keyset pagination for GET /api/:version/projects/:id/jobs API endpoint | #362172 (closed) | TBD | |
Backend: Improve performance of PATCH /api/:version/jobs/:id/trace | #353802 (closed) | TBD | |
Backend: Improve performance of GraphqlController#execute | #361377 | TBD | |
Edited by Mark Nuzzo