Logs getting dropped due to mapping conflicts

Extracted from delivery#1612, quoting @skarbek.

Sidekiq

In observing sidekiq, we are rejecting a lot of logs. Multiple causes:

  1. The error field:
{"type":"mapper_parsing_exception","reason":"object mapping for [json.error] tried to parse field [error] as object, but found a concrete value"}

Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=LOS0ZHgB2SyrKNsSSUq6

  1. The exception field:
"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [exception.class]. Existing mapping for [json.exception] must be of type object but found [text]."}

Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=Cxu6ZHgBkEx-pOLLxgK9

  1. The user field:
{"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [user.username]. Existing mapping for [json.user] must be of type object but found [text]."}

Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=SHG8ZHgBcRpzTsORm3sk

  1. The exception.backtrace field:
{"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [exception.backtrace]. Existing mapping for [json.exception] must be of type object but found [text]."}

Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=vdTBZHgB28qpl1LPLzIT

  1. The exception.message field:
{"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [exception.message]. Existing mapping for [json.exception] must be of type object but found [text]."}

Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=uf_CZHgB4_gCPKVmGRSx


The above is all log data we are losing today. For all items related to errors/exceptions, we have a backup method of finding this data via Sentry. But this is made a little more odd in that the exception field is already supposed to be a giant string as defined by our fluentd configuration: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/blob/223ae65c63487eec1adf88aef24ce154fc1e6012/releases/fluentd/defaults.yaml#L247 The user field is an awkward one to be a causation for losing data, however.

We appear to drop anywhere between 20k and 40k log messages per 12 hours.

Rails

Let's observe the rails log index for similar information:

  1. The params.value field:
{"type":"mapper_parsing_exception","reason":"failed to parse field [json.params.value] of type [text] in document with id 'WJ5dZXgBkEx-pOLLw7JQ'. Preview of field's value: '{size=7352215, oid=35967a21e5d856eaea89d2e5dd55a5e3b5f4e1e4efe3b000ef6d60b31600f1d2}'","caused_by":{"type":"illegal_state_exception","reason":"Can't get text on a START_OBJECT at 1:2335"}

Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=g2FdZXgB28qpl1LP4QGr

  1. The protoPayload.request.metadata.labels.app field:
{"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [app.kubernetes.io/managed-by]. Existing mapping for [json.protoPayload.request.metadata.labels.app] must be of type object but found [text]."}

Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=AJNfZXgBG8Cs66GhYIw8

  1. The protoPayload.response.spec.volumes field:
{"type":"mapper_parsing_exception","reason":"failed to parse field [json.protoPayload.response.spec.volumes] of type [text] in document with id '3athZXgBP57qwIoFJ1My'. Preview of field's value: '{emptyDir={}, name=shared-tmp}'","caused_by":{"type":"illegal_state_exception","reason":"Can't get text on a START_OBJECT at 1:21329"}}

I don't recall us logging this information at all, and suspect we probably shouldn't...

Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=pu5hZXgBa0YDjQVgRyH3

  1. The protoPayload.response.metadata.labels.app field:
{"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [app.kubernetes.io/instance]. Existing mapping for [json.protoPayload.response.metadata.labels.app] must be of type object but found [text]."}

Example field: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=0ahiZXgBkEx-pOLLSTLs

  1. The error field:
{"type":"mapper_parsing_exception","reason":"object mapping for [json.error] tried to parse field [error] as object, but found a concrete value"}

Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=iKllZXgB2SyrKNsSYLcC


The last one is commonly shared with the sidekiq one as well. Perhaps we need to tweak our fluentd configuration for this? The next 3 are all directly related to kubernetes metadata that gets sent into ES. I suspect the schema is simply incorrect on these items. The first one listed, however, we should dig further into. It was flagged during our investigation into the incident.

Search used to help with investigation into these dropped messages: https://log.gprd.gitlab.net/goto/1722b65a1730039ab4aae063f69c31b3

Heavy filtering on the above errors was used to narrow down the remaining causes.

Edited by Igor