Systematically reduce HTTP 500 errors
**Details**
We have experienced numerous incidents that went undetected until customers reported them, despite being detectable via HTTP 500 errors. This is because we have a constant background rate of HTTP 500 errors, which makes it difficult to spot new issues and increases in error rates. If we can eliminate this noise floor of HTTP 500 errors, we can expect to detect new issues more quickly.
This issue is to discuss how we approach this problem. A few ideas that have been discussed:
1. Leverage stage group reports to show the top sources of HTTP 500 errors, and set a goal of reducing the top x sources each month.
2. Leverage error budgets.
Each [stage team dashboard](https://dashboards.gitlab.net/dashboards/f/stage-groups/stage-groups) links to Kibana dashboards that show all 500 errors for that team's endpoints and can be sorted/filtered to find the top offenders.
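For illustration, the error-budget idea (option 2) could be sketched roughly as follows. This is a minimal sketch, not our actual implementation; the 99.95% availability target and the request counts in the example are hypothetical assumptions, not real SLO numbers.

```python
# Hypothetical sketch of an error-budget check for HTTP 500 responses.
# The 99.95% SLO target and all counts below are illustrative assumptions.

def error_budget_remaining(total_requests: int,
                           error_5xx: int,
                           slo_target: float = 0.9995) -> float:
    """Return the fraction of the error budget still unspent.

    The budget is the number of failures the SLO permits over the
    window; a negative result means the budget is exhausted, i.e. the
    noise floor of 500s alone is burning through it.
    """
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0
    return 1 - (error_5xx / allowed_failures)

# Example: 10M requests with a 0.05% budget allows ~5,000 failures;
# 2,000 observed 500s leaves roughly 60% of the budget unspent.
print(error_budget_remaining(10_000_000, 2_000))
```

A check like this makes the noise floor visible as a concrete cost: even a "constant" baseline of 500s consumes budget that could otherwise absorb genuine incidents.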
**Example issues we have not detected ahead of time**
1. [Requests to REST API for group creation/update stipulating emails:disabled fail with error 500](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17775/) - 32 hours in production before detection
2. [500 errors on SAML login](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17856) - 4 hours in production before detection
3. [Dependency proxy failing with 500 errors](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17851) - 5 hours in production before detection
<!--
Please do not edit the below
-->
issue