Overwhelming amount of "unexpected end of JSON input" errors
Context
We have recently introduced support for error reporting with Sentry.
The rollout of this feature for GitLab.com happened on Jan 12, 2021 (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11297#note_483587713). Before this, we had no insight into the unexpected errors raised by the registry (unless we wanted to grep over logs). Since then, I have been watching the errors on Sentry to identify patterns and possible improvements.
Problem
I noticed that we have an overwhelming number of "unknown: unknown error: unexpected end of JSON input" errors (sample). More precisely, ~270 of these errors are being raised per hour.
By correlating several of these events with logs (using the recently introduced correlation_id field), we can see that these seem related to manifest pull requests (sample).
Go errors have no stack trace, and "unexpected end of JSON input" is an error from the standard library json package, so the message alone does not tell us where in the registry it is raised. Given that these errors seem to be related to manifest pulls, this is most likely an error when trying to unmarshal the manifest payload (JSON) after retrieving it from the storage backend.
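To illustrate why a corrupted manifest is a plausible source, here is a minimal sketch showing that encoding/json returns exactly this message for both an empty payload and one that is cut off mid-object. The decodeErr helper and the sample payloads are hypothetical and not taken from the registry code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeErr is a hypothetical helper: it tries to unmarshal payload as a
// generic JSON object and returns the error message, or "" on success.
func decodeErr(payload []byte) string {
	var v map[string]any
	if err := json.Unmarshal(payload, &v); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// Assumed examples of corrupted payloads: an empty blob and a
	// manifest truncated in the middle of an object.
	fmt.Println(decodeErr([]byte("")))
	fmt.Println(decodeErr([]byte(`{"schemaVersion": 2, "mediaType":`)))
}
```

Both calls print "unexpected end of JSON input", with nothing in the message pointing back to the call site.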
If my suspicions are correct, this means that the corresponding manifests are corrupted. We're currently serving (successfully) over 670K manifests per hour, so in proportion, the number of these errors is negligible (~0.04%). We haven't changed any of the manifest parsing logic either, so these errors were most likely happening for a long time already. Nevertheless, we should try to identify the root cause.
Proposal
- Wrap the error returned from the JSON unmarshal attempt (source) with "failed to unmarshal manifest payload: ...". This should let us know that the error comes from here, which would mean that the respective manifests are indeed corrupted.
- Deploy the change above to GitLab.com and evaluate the errors on Sentry.
- Raise a followup issue to investigate the corruption of manifests, if this is indeed the root cause. Otherwise, restart the analysis and look for other possible root causes.
- Consider whether the current 500 Internal Server Error is the most appropriate HTTP status code for these errors.
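The wrapping proposed in the first step could look roughly like the sketch below. The manifest type, unmarshalManifest, and isSyntaxError are hypothetical stand-ins, not the registry's actual code. Only the error prefix comes from the proposal. It assumes Go 1.13+ error wrapping with %w, which keeps the underlying *json.SyntaxError reachable through errors.As.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// manifest is a hypothetical, stripped-down stand-in for a manifest type.
type manifest struct {
	SchemaVersion int `json:"schemaVersion"`
}

// unmarshalManifest decodes payload and wraps any decoding error with a
// prefix that identifies the call site in error reports.
func unmarshalManifest(payload []byte) (*manifest, error) {
	var m manifest
	if err := json.Unmarshal(payload, &m); err != nil {
		return nil, fmt.Errorf("failed to unmarshal manifest payload: %w", err)
	}
	return &m, nil
}

// isSyntaxError reports whether err wraps a *json.SyntaxError.
func isSyntaxError(err error) bool {
	var syntaxErr *json.SyntaxError
	return errors.As(err, &syntaxErr)
}

func main() {
	_, err := unmarshalManifest([]byte("")) // simulated corrupted payload
	fmt.Println(err)
	fmt.Println(isSyntaxError(err))
}
```

With an empty payload, this prints "failed to unmarshal manifest payload: unexpected end of JSON input" followed by "true". The prefix makes the origin visible on Sentry, and the wrapped error type remains available if the handler later needs it to choose a different status code.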