2025-10-23: Intermittent 502 errors in /api/v4/internal/allowed
Intermittent 502 errors in /api/v4/internal/allowed (Severity 3, Medium)
Over the past week, I've noticed a steady stream of 502 errors in the Workhorse logs for requests to /api/v4/internal/allowed (logs: https://log.gprd.gitlab.net/app/r/s/jRpWJ). The user agent is GitLab-Shell, so these errors are likely causing intermittent CI clone failures.
The uptick in 502s is likely due to changes to the Workhorse readiness probes made in an attempt to fix inc-3683-jobs-stuck-in-running-indefinitely. It appears that instead of the readiness probes failing, the liveness probes would fail, causing Kubernetes to restart Puma. We are rolling back those changes for now.
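For context on why the probe change mattered: Kubernetes treats the two probe types differently. A failing readiness probe only takes the pod out of Service rotation, while a failing liveness probe makes the kubelet kill and restart the container, which is the restart behavior described above. The Go sketch below is a minimal, hypothetical illustration of an app exposing both endpoints; the /liveness and /readiness paths, port, and the `ready` flag are assumptions for illustration and are not Workhorse's actual probe implementation or configuration.

```go
// Minimal sketch (not Workhorse's actual probe handlers) of the difference
// between the two probe types: a failing readiness probe only removes the pod
// from Service endpoints, while a failing liveness probe causes a restart.
package main

import (
	"net/http"
	"sync/atomic"
)

func main() {
	// Hypothetical flag the app would flip while its upstream (e.g. Puma)
	// is temporarily unable to serve requests.
	var ready atomic.Bool
	ready.Store(true)

	// Liveness: should only fail when the process itself is wedged.
	// A failure here makes the kubelet kill and restart the container.
	http.HandleFunc("/liveness", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: may fail transiently; the pod is just taken out of rotation,
	// with no restart.
	http.HandleFunc("/readiness", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	http.ListenAndServe(":8080", nil)
}
```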
That said, Workhorse was already reporting 502s on this endpoint before those changes, so we may need to look at optimizing the slowest routes. https://gitlab.com/gitlab-org/gitlab/-/issues/578304 is open for the internal Pages route, but the slow queries there are not happening as frequently as I had thought.
This ticket was created by incident.io to track INC-5164.