Consistent logging in image resizer module (!652) · Merge requests · GitLab.org / gitlab-workhorse

Matthias Käppler requested to merge 320-improve-image-scaler-logs into master Nov 06, 2020

What was broken

Our image scaler logging implementation had multiple flaws or shortcomings:

It wasn't consistently applying labels; depending on how soon we had to log out an error, not all the info that would go into labels was available yet, so we were not applying any labels whatsoever (outside of the default labels applied by the log helpers). This should be largely addressed now by always having a consistent label set, even if some of those might carry zero-values.
Previously, we identified error messages coming from the image scaler by doing a full text search in Kibana for "ImageResizer:" which we thought we had consistently used as a "prefix". This is bad for two reasons: first, this isn't structured logging but relies on message text; second, it wasn't actually always part of the error message (see below). This is now fixed by always adding a subsystem: imageresizer label to all log lines, so that we can see everything coming out of this module by filtering by this label in Kibana.
Any errors that had been logged by the scaler binary itself, and not Workhorse, were previously going straight to the logging sink as strings. This caused two problems: first, the "ImageResizer:" prefix was not part of the error message (now redundant) and second, there was no way for us to apply log labels. These errors are now buffered in Workhorse first and treated like any other error so that they will have all the same labels.
We were not prefixing log labels with anything, thus polluting the global label namespace. Outside of labels that are virtually ubiquitous (e.g. uri or correlation_id) I think we should get into the habit of scoping labels to a context.
We were logging bytes served as bytes_written when instead we should use written_bytes as done elsewhere, since conventionally, units should come last in a label name.
Some error messages were going into json.msg which others were going into json.error. We should make this consistent.

How it's done

I now capture a consistent set of fields into a new resizeOutcome data structure. Its fields will always be part of every log line. It also contains the status field that we already had, which will be used for json.msg in case of successes. It also has a nullable err field which when set will become json.error.

Results are logged when the Inject function returns. This way we have a cleaner separation of collecting contextual data and performing the logging side-effect.

All resizer specific log-fields are prefixed with imageresizer.

Some examples:

Success:

workhorse_1      | time="2020-11-24T14:19:02Z" level=info msg=success correlation_id=PBE3RIQGFo7 duration_s=0.005976283 imageresizer.content_type=image/png imageresizer.original_filesize=5005 imageresizer.status=success imageresizer.target_width=32 method=GET subsystem=imageresizer uri="/uploads/-/system/user/avatar/1/avatar.png?width=32" written_bytes=1449

Success (not modified):

workhorse_1      | time="2020-11-24T14:20:21Z" level=info msg=success-client-cache correlation_id=TTfMkmprWR5 duration_s=0.002134883 imageresizer.content_type=image/png imageresizer.original_filesize=5005 imageresizer.status=success-client-cache imageresizer.target_width=32 method=GET subsystem=imageresizer uri="/uploads/-/system/user/avatar/1/avatar.png?width=32" written_bytes=0

Hard failure:

workhorse_1      | time="2020-11-24T14:20:57Z" level=error msg=error correlation_id=0neQlOsOZs7 duration_s=0.003599552 error="gitlab-resize-image: fatal: decode: png: invalid format: invalid checksum\n" imageresizer.content_type=image/png imageresizer.original_filesize=38577 imageresizer.status=request-failed imageresizer.target_width=64 method=GET subsystem=imageresizer uri="/uploads/-/system/group/avatar/52/EIP-Terraform.png?width=64" written_bytes=0

Fail-over (original served):

(this emits two log lines currently; the values below don't make sense because I "forced" this error by swapping a condition in code)

workhorse_1      | time="2020-11-24T14:23:31Z" level=error msg=error correlation_id=EHhijY96mp9 duration_s=0.001627991 error="5005 bytes exceeds maximum file size of 250000 bytes" imageresizer.content_type=image/png imageresizer.original_filesize=5005 imageresizer.status=unknown imageresizer.target_width=32 method=GET subsystem=imageresizer uri="/uploads/-/system/user/avatar/1/avatar.png?width=32" written_bytes=0
workhorse_1      | time="2020-11-24T14:23:31Z" level=info msg=served-original correlation_id=EHhijY96mp9 duration_s=0.001962027 imageresizer.content_type=image/png imageresizer.original_filesize=5005 imageresizer.status=served-original imageresizer.target_width=32 method=GET subsystem=imageresizer uri="/uploads/-/system/user/avatar/1/avatar.png?width=32" written_bytes=5005

Other notes

I realize that subsystem isn't a label we use elsewhere. I think it is time to introduce something like this (name TBD), since Workhorse is reaching sufficient complexity that we should conceptually break it down into sub-components; I like subsystem because "component" is already in use elsewhere esp. when it comes to observability topics. We could also make it very specific and call it injecter or just module?
I had to extend the logging helpers by two more variants that allow me to pass in custom label sets. This is getting quite messy since the number of options for how to log causes the interface surface area to explode combinatorally. I am not very experienced with Go, but in object oriented languages you address this problem with the builder pattern; in functional programming, you would most likely use curried functions or reader monads. Golang seems to fall somewhere in between these paradigms and maybe just having a bunch of specifically named procedures is the "Go way", but I would appreciate feedback here.

Edited Nov 25, 2020 by Matthias Käppler

Consistent logging in image resizer module

What was broken

How it's done

Other notes

Merge request reports