Pipeline step log entries have too many fields
Summary
When producing log entries, we need to be wary of dynamically defined fields, as they can explode the log schema cardinality and create too many fields in Elasticsearch. Because schema size can negatively impact Elasticsearch performance and stability, Elasticsearch caps the maximum number of fields at 1024.
Once this limit is reached, entries that attempt to add more fields may be rejected. If the combined field count of the indices involved in a search exceeds 1024, the search may fail.
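As a rough illustration of why dynamic field names are risky (the step names and key layout below are made up for the example and are not the actual logger output), every distinct step name mints a new Elasticsearch field, and each per-step aggregate multiplies that further:
require 'json'

# Illustrative only: each dynamically named key becomes its own field in
# the Elasticsearch mapping, so the schema grows with every new step name.
step_durations = {
  'build_associations' => 0.012,  # hypothetical step names
  'create_deployments' => 0.148,
}

entry = {}
step_durations.each do |name, duration|
  entry["pipeline_step_#{name}_duration_s"] = duration
end

puts JSON.pretty_generate(entry)
# Two steps already produce two fields; 33 steps produce 33, and emitting
# .avg/.count/.max/.min/.sum per step multiplies that by five.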
See also:
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/10173
- !31910 (merged)
- !32583 (merged)
- !32859 (merged)
Impact
The quality and availability of logs are impacted, which makes it harder for the infrastructure team to sustainably maintain availability for GitLab.com.
Here is an example of a failed search:
{
  "shard": 0,
  "index": "pubsub-rails-inf-gprd-012926",
  "node": "g2aX_u-BSg6-cx1U09F19Q",
  "reason": {
    "type": "query_shard_exception",
    "reason": "failed to create query: field expansion matches too many fields, limit: 1024, got: 1181",
    "index_uuid": "dXd8P6NJSIuyrYC0R4OB1g",
    "index": "pubsub-rails-inf-gprd-012926",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "field expansion matches too many fields, limit: 1024, got: 1181"
    }
  }
}
Recommendation
I ran some analysis on the current mappings:
# dump the field mappings for all pubsub-rails-inf-gprd-* indices
$ curl -s $ELASTICSEARCH_URL/pubsub-rails-inf-gprd-\*/_mappings > mappings.rails.json
# flatten the mapping paths and render the field-name prefixes as a flamegraph
$ cat mappings.rails.json | jq -r '.[].mappings|paths|join(";")|[., 1]|@tsv' | sd -s ';properties;' ';' | grep ';type' | sd '_' ';' | flamegraph.pl --hash > flamegraph.svg
We can see two clusters:
- pipeline_: almost 20% of the total schema size. This includes fields like pipeline_step_gitlab_ci_pipeline_chain_build_associations_duration_s.
- config_: around 7% of the total schema size. This includes fields like config_build_context_duration_s.
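The same prefix clustering can be tallied without flamegraph.pl. Below is a minimal Ruby sketch, assuming mappings.rails.json was produced by the curl command above; grouping on the first underscore-delimited token is a heuristic chosen for this illustration, not part of the original analysis.
require 'json'

# Walk the dumped mappings and collect every leaf field name, then group
# the names by their first underscore-delimited token (e.g. "pipeline").
mappings = JSON.parse(File.read('mappings.rails.json'))

field_names = []
collect = lambda do |properties, prefix|
  properties.each do |name, definition|
    path = prefix.empty? ? name : "#{prefix}.#{name}"
    field_names << path if definition.key?('type')
    collect.call(definition['properties'], path) if definition['properties']
  end
end

mappings.each_value do |index|
  collect.call(index.dig('mappings', 'properties') || {}, '')
end

field_names.group_by { |name| name.split('_').first }
  .sort_by { |_, fields| -fields.size }
  .first(5)
  .each do |prefix, fields|
    share = 100.0 * fields.size / field_names.size
    puts format('%-12s %4d fields (%4.1f%%)', prefix, fields.size, share)
  end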
While these granular breakdowns can be very useful for analysis, they are too fine-grained for Elasticsearch to handle.
We should therefore either make these a little more coarse-grained (to reduce the number of fields), or convert them to a single array field containing key-value pairs (a rough sketch of the latter follows below).
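A rough sketch of the key-value option (the field names below are illustrative, not a final schema): the same per-step durations become a single array field, so the mapping needs only two leaf fields regardless of how many steps exist.
# Illustrative sketch: instead of one field per step name...
wide_entry = {
  'pipeline_step_build_associations_duration_s' => 0.012,
  'pipeline_step_create_deployments_duration_s' => 0.148,
}

# ...emit a single array of key-value pairs. Elasticsearch then only needs
# to map two fields (name and duration_s), no matter how many steps appear.
pairs = wide_entry.map do |field, duration|
  {
    'name' => field.delete_prefix('pipeline_step_').delete_suffix('_duration_s'),
    'duration_s' => duration,
  }
end

narrow_entry = { 'pipeline_steps' => pairs }
# => {"pipeline_steps"=>[{"name"=>"build_associations", "duration_s"=>0.012},
#                        {"name"=>"create_deployments", "duration_s"=>0.148}]}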
Verification
This Kibana search should no longer produce an error.
It's also possible to check the number of fields in the Rails schema:
$ curl -s "$ELASTICSEARCH_URL/$(curl -s $ELASTICSEARCH_URL/pubsub-rails-inf-gprd/_alias | jq -r 'to_entries[]|select(.value.aliases[].is_write_index)|.key')/_mappings" | jq -r '.[]|paths|join(".")' | sd -s '.properties.' '.' | grep '\.type$' | wc -l
1189
This number should be significantly below 1024.
Workaround
This error occurs when searching across all fields (via the search box at the top of Kibana). By restricting the query to specific fields (for example json.class: "Gitlab::Ci::Pipeline::Logger", as used in the tracking snippet below), you can still get responses.
Technical Proposal
There are simple modifications we can make, while we wait for #373143, to reduce the number of fields:
- !103165 (merged) - Remove any fields we think are not needed or that show data not relevant to the performance problem.
  - In https://log.gprd.gitlab.net/goto/5bb29a20-6038-11ed-b0ec-930003e0679c we can clearly see the big hitters from pipeline_step_gitlab_ci_pipeline_chain_*. We can consider logging only the top 10 (instead of the 33 we log today).
- !104593 (merged) - Each pipeline_step_gitlab_ci_pipeline_chain_*_duration_s contains .avg, .count, .max, .min and .sum, but we only measure it once (.count is always 1). This means we could have pipeline_step_gitlab_ci_pipeline_chain_*_duration_s contain just the single measured duration (equivalent to .sum), reducing these attributes by 5x.
  - For example: introduce an observe_once method (see the sketch below).
- !104580 (merged) - For the other, more granular observations we could log only .count and .sum (.avg can be derived from those; .min and .max may not be as critical for now).
With this we may be able to achieve a 70-80% reduction in attributes, while still retaining the data we need to investigate Pipeline creation is slow (&7290).
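A minimal sketch of what the observe_once idea from !104593 and the reduced aggregates from !104580 could look like. The class name, method names, and flattened key format are assumptions made for illustration; this is not the actual Gitlab::Ci::Pipeline::Logger implementation.
# Illustrative sketch only; names and structure are assumptions.
class StepDurationLog
  def initialize
    @single = {}
    @observations = Hash.new { |hash, key| hash[key] = [] }
  end

  # For steps measured exactly once: a single scalar field instead of
  # .avg/.count/.max/.min/.sum (five times fewer attributes per step).
  def observe_once(step, duration_s)
    @single["#{step}_duration_s"] = duration_s
  end

  # For repeated, more granular observations: keep the raw values and
  # export only .count and .sum (.avg can be derived from those later).
  def observe(step, duration_s)
    @observations[step] << duration_s
  end

  def to_h
    repeated = @observations.flat_map do |step, values|
      [["#{step}_duration_s.count", values.size],
       ["#{step}_duration_s.sum", values.sum.round(6)]]
    end.to_h

    @single.merge(repeated)
  end
end

log = StepDurationLog.new
log.observe_once('pipeline_step_gitlab_ci_pipeline_chain_build_associations', 0.012)
log.observe('config_build_context', 0.003)
log.observe('config_build_context', 0.002)
log.to_h
# => {"pipeline_step_gitlab_ci_pipeline_chain_build_associations_duration_s"=>0.012,
#     "config_build_context_duration_s.count"=>2,
#     "config_build_context_duration_s.sum"=>0.005}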
How to track improvements
# dump the content of a document from Kibana into document.json
# query: json.class: "Gitlab::Ci::Pipeline::Logger" AND NOT json.pipeline_source: "unknown"
require 'json'

json = JSON.parse(File.read('document.json'))

# fields introduced by the default JSON logger; we won't touch these
default_fields = 25

# count the number of exported fields; not all of them are added by the
# pipeline logger, but this number works as a baseline before reducing fields
json['fields'].select { |f| f.start_with?('json.') }.count - default_fields
# => 390