Registry 5xx alerts are numerous with no obvious problem to resolve.
The on-call SREs receive alerts that indicate a high number of 5xx errors from the registry service, but there are no obvious service problems that can be repaired (for example, the registry being down or serving requests slowly). The issue often recovers before any action is taken.
Either our alerting for registry 5xx errors is too stringent, or there is a legitimate issue with the registry that needs to be fixed.
This issue can be closed once one of the following has been done:
- Do we have an SLO that requires registry 5xx errors to stay below a threshold? If so, do our current alerts enforce that level or a different one? Create one or more issues or merge requests to make the needed changes to our alerting.
- If the alerting is correct, what is the underlying issue? Create an issue (or issues) to resolve that problem.
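
To make the two outcomes above concrete, here is a minimal Python sketch that compares an observed 5xx ratio against both an SLO limit and an alert threshold. All names and numbers in it (`SLO_MAX_RATIO`, `ALERT_THRESHOLD`, the request counts) are hypothetical placeholders, not values taken from our actual SLO or alert rules.

```python
def error_ratio(total_requests: int, errors_5xx: int) -> float:
    """Fraction of requests that returned a 5xx status."""
    if total_requests == 0:
        return 0.0
    return errors_5xx / total_requests


def classify(ratio: float, slo_max_ratio: float, alert_threshold: float) -> str:
    """Report whether a ratio fires the alert and whether it breaches the SLO."""
    fires = ratio > alert_threshold
    breaches = ratio > slo_max_ratio
    if fires and not breaches:
        return "alert fires, SLO still met (alert may be too strict)"
    if fires and breaches:
        return "alert fires, SLO breached (investigate the registry)"
    if breaches:
        return "SLO breached without an alert (alert too lax)"
    return "no alert, SLO met"


# Hypothetical numbers, not taken from any real registry SLO or alert rule.
SLO_MAX_RATIO = 0.01     # assumed SLO: at most 1% of requests may be 5xx
ALERT_THRESHOLD = 0.005  # assumed alert: fires above 0.5%

ratio = error_ratio(total_requests=20_000, errors_5xx=150)
print(classify(ratio, SLO_MAX_RATIO, ALERT_THRESHOLD))
```

If most pages fall into the first category, the alerting likely needs to change; if they fall into the second, the underlying registry problem needs its own issue.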