Logs getting dropped due to mapping conflicts
Extracted from delivery#1612, quoting @skarbek.
Sidekiq
In observing sidekiq, we are rejecting a lot of logs. Multiple causes:
- The
errorfield:
{"type":"mapper_parsing_exception","reason":"object mapping for [json.error] tried to parse field [error] as object, but found a concrete value"}
Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=LOS0ZHgB2SyrKNsSSUq6
- The
exceptionfield:
"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [exception.class]. Existing mapping for [json.exception] must be of type object but found [text]."}
Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=Cxu6ZHgBkEx-pOLLxgK9
- The
userfield:
{"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [user.username]. Existing mapping for [json.user] must be of type object but found [text]."}
Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=SHG8ZHgBcRpzTsORm3sk
- The
exception.backtracefield:
{"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [exception.backtrace]. Existing mapping for [json.exception] must be of type object but found [text]."}
Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=vdTBZHgB28qpl1LPLzIT
- The
exception.messagefield:
{"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [exception.message]. Existing mapping for [json.exception] must be of type object but found [text]."}
Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=uf_CZHgB4_gCPKVmGRSx
The above is all log data we are losing today. For all items related to errors/exceptions, we have a backup method of finding this data via Sentry. But this is made a little more odd in that the exception field is already supposed to be a giant string as defined by our fluentd configuration: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/blob/223ae65c63487eec1adf88aef24ce154fc1e6012/releases/fluentd/defaults.yaml#L247 The user field is an awkward one to be a causation for losing data, however.
We appear to drop anywhere between 20k and 40k log messages per 12 hours.
Rails
Let's observe the rails log index for similar information:
- The
params.valuefield:
{"type":"mapper_parsing_exception","reason":"failed to parse field [json.params.value] of type [text] in document with id 'WJ5dZXgBkEx-pOLLw7JQ'. Preview of field's value: '{size=7352215, oid=35967a21e5d856eaea89d2e5dd55a5e3b5f4e1e4efe3b000ef6d60b31600f1d2}'","caused_by":{"type":"illegal_state_exception","reason":"Can't get text on a START_OBJECT at 1:2335"}
Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=g2FdZXgB28qpl1LP4QGr
- The
protoPayload.request.metadata.labels.appfield:
{"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [app.kubernetes.io/managed-by]. Existing mapping for [json.protoPayload.request.metadata.labels.app] must be of type object but found [text]."}
Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=AJNfZXgBG8Cs66GhYIw8
- The
protoPayload.response.spec.volumesfield:
{"type":"mapper_parsing_exception","reason":"failed to parse field [json.protoPayload.response.spec.volumes] of type [text] in document with id '3athZXgBP57qwIoFJ1My'. Preview of field's value: '{emptyDir={}, name=shared-tmp}'","caused_by":{"type":"illegal_state_exception","reason":"Can't get text on a START_OBJECT at 1:21329"}}
I don't recall us logging this information at all, and suspect we probably shouldn't...
Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=pu5hZXgBa0YDjQVgRyH3
- The
protoPayload.response.metadata.labels.appfield:
{"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [app.kubernetes.io/instance]. Existing mapping for [json.protoPayload.response.metadata.labels.app] must be of type object but found [text]."}
- The
errorfield:
{"type":"mapper_parsing_exception","reason":"object mapping for [json.error] tried to parse field [error] as object, but found a concrete value"}
Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=iKllZXgB2SyrKNsSYLcC
The last one is commonly shared with the sidekiq one as well. Perhaps we need to tweak our fluentd configuration for this? The next 3 are all directly related to kubernetes metadata that gets sent into ES. I suspect the schema is simply incorrect on these items. The first one listed, however, we should dig further into. It was flagged during our investigation into the incident.
Search used to help with investigation into these dropped messages: https://log.gprd.gitlab.net/goto/1722b65a1730039ab4aae063f69c31b3
Heavy filtering on the above errors was used to narrow down the remaining causes.