Logs getting dropped due to mapping conflicts
Extracted from https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1612, quoting @skarbek. ## Sidekiq In observing sidekiq, we are rejecting a lot of logs. Multiple causes: 1. The `error` field: ```json {"type":"mapper_parsing_exception","reason":"object mapping for [json.error] tried to parse field [error] as object, but found a concrete value"} ``` Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=LOS0ZHgB2SyrKNsSSUq6 2. The `exception` field: ```json "type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [exception.class]. Existing mapping for [json.exception] must be of type object but found [text]."} ``` Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=Cxu6ZHgBkEx-pOLLxgK9 3. The `user` field: ```json {"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [user.username]. Existing mapping for [json.user] must be of type object but found [text]."} ``` Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=SHG8ZHgBcRpzTsORm3sk 4. The `exception.backtrace` field: ```json {"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [exception.backtrace]. Existing mapping for [json.exception] must be of type object but found [text]."} ``` Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=vdTBZHgB28qpl1LPLzIT 5. The `exception.message` field: ```json {"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [exception.message]. Existing mapping for [json.exception] must be of type object but found [text]."} ``` Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=uf_CZHgB4_gCPKVmGRSx --- The above is all log data we are losing today. For all items related to errors/exceptions, we have a backup method of finding this data via Sentry. But this is made a little more odd in that the exception field is already supposed to be a giant string as defined by our fluentd configuration: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/blob/223ae65c63487eec1adf88aef24ce154fc1e6012/releases/fluentd/defaults.yaml#L247 The user field is an awkward one to be a causation for losing data, however. We appear to drop anywhere between 20k and 40k log messages per 12 hours. ## Rails Let's observe the rails log index for similar information: 1. The `params.value` field: ```json {"type":"mapper_parsing_exception","reason":"failed to parse field [json.params.value] of type [text] in document with id 'WJ5dZXgBkEx-pOLLw7JQ'. Preview of field's value: '{size=7352215, oid=35967a21e5d856eaea89d2e5dd55a5e3b5f4e1e4efe3b000ef6d60b31600f1d2}'","caused_by":{"type":"illegal_state_exception","reason":"Can't get text on a START_OBJECT at 1:2335"} ``` Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=g2FdZXgB28qpl1LP4QGr 2. The `protoPayload.request.metadata.labels.app` field: ```json {"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [app.kubernetes.io/managed-by]. Existing mapping for [json.protoPayload.request.metadata.labels.app] must be of type object but found [text]."} ``` Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=AJNfZXgBG8Cs66GhYIw8 3. The `protoPayload.response.spec.volumes` field: ```json {"type":"mapper_parsing_exception","reason":"failed to parse field [json.protoPayload.response.spec.volumes] of type [text] in document with id '3athZXgBP57qwIoFJ1My'. Preview of field's value: '{emptyDir={}, name=shared-tmp}'","caused_by":{"type":"illegal_state_exception","reason":"Can't get text on a START_OBJECT at 1:21329"}} ``` I don't recall us logging this information at all, and suspect we probably shouldn't... Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=pu5hZXgBa0YDjQVgRyH3 4. The `protoPayload.response.metadata.labels.app` field: ```json {"type":"mapper_parsing_exception","reason":"Could not dynamically add mapping for field [app.kubernetes.io/instance]. Existing mapping for [json.protoPayload.response.metadata.labels.app] must be of type object but found [text]."} ``` Example field: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=0ahiZXgBkEx-pOLLSTLs 5. The `error` field: ```json {"type":"mapper_parsing_exception","reason":"object mapping for [json.error] tried to parse field [error] as object, but found a concrete value"} ``` Example message: https://log.gprd.gitlab.net/app/discover#/doc/c1ee6e20-fbf1-11ea-af41-ad80f197fa45/pubsub-pubsubbeat-inf-gprd-000039?id=iKllZXgB2SyrKNsSYLcC --- The last one is commonly shared with the sidekiq one as well. Perhaps we need to tweak our fluentd configuration for this? The next 3 are all directly related to kubernetes metadata that gets sent into ES. I suspect the schema is simply incorrect on these items. The first one listed, however, we should dig further into. It was flagged during our investigation into the incident. Search used to help with investigation into these dropped messages: https://log.gprd.gitlab.net/goto/1722b65a1730039ab4aae063f69c31b3 Heavy filtering on the above errors was used to narrow down the remaining causes.
issue