Pipeline step log entries have too many fields
Summary
When producing log entries, we need to be wary of dynamically defined fields, as they can explode the log schema cardinality and create too many fields in Elasticsearch. Because schema size can negatively impact Elasticsearch performance and stability, Elasticsearch caps the maximum number of fields at 1024.
Once this limit is reached, entries that attempt to add more fields may be rejected. If the combined field count of the indices involved in a search exceeds 1024, the search may fail.
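As a rough illustration of why dynamic field names are risky (the step names and key layout below are made up for the example and are not the actual logger output), every distinct step name mints a new Elasticsearch field, and each per-step aggregate multiplies that further:
require 'json'

# Illustrative only: each dynamically named key becomes its own field in
# the Elasticsearch mapping, so the schema grows with every new step name.
step_durations = {
  'build_associations' => 0.012,  # hypothetical step names
  'create_deployments' => 0.148,
}

entry = {}
step_durations.each do |name, duration|
  entry["pipeline_step_#{name}_duration_s"] = duration
end

puts JSON.pretty_generate(entry)
# Two steps already produce two fields; 33 steps produce 33, and emitting
# .avg/.count/.max/.min/.sum per step multiplies that by five.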
See also:
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/10173
- !31910 (merged)
- !32583 (merged)
- !32859 (merged)
Impact
The quality and availability of logs are impacted, which makes it harder for the infrastructure team to sustainably maintain availability for GitLab.com.
Here is an example of a failed search:
{
  "shard": 0,
  "index": "pubsub-rails-inf-gprd-012926",
  "node": "g2aX_u-BSg6-cx1U09F19Q",
  "reason": {
    "type": "query_shard_exception",
    "reason": "failed to create query: field expansion matches too many fields, limit: 1024, got: 1181",
    "index_uuid": "dXd8P6NJSIuyrYC0R4OB1g",
    "index": "pubsub-rails-inf-gprd-012926",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "field expansion matches too many fields, limit: 1024, got: 1181"
    }
  }
}
Recommendation
I ran some analysis on the current mappings:
# dump the field mappings for all pubsub-rails-inf-gprd-* indices
$ curl -s $ELASTICSEARCH_URL/pubsub-rails-inf-gprd-\*/_mappings > mappings.rails.json
# flatten the mapping paths and render the field-name prefixes as a flamegraph
$ cat mappings.rails.json | jq -r '.[].mappings|paths|join(";")|[., 1]|@tsv' | sd -s ';properties;' ';' | grep ';type' | sd '_' ';' | flamegraph.pl --hash > flamegraph.svg
We can see two clusters:
- pipeline_: almost 20% of the total schema size. This includes fields like pipeline_step_gitlab_ci_pipeline_chain_build_associations_duration_s.
- config_: around 7% of the total schema size. This includes fields like config_build_context_duration_s.
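The same prefix clustering can be tallied without flamegraph.pl. Below is a minimal Ruby sketch, assuming mappings.rails.json was produced by the curl command above; grouping on the first underscore-delimited token is a heuristic chosen for this illustration, not part of the original analysis.
require 'json'

# Walk the dumped mappings and collect every leaf field name, then group
# the names by their first underscore-delimited token (e.g. "pipeline").
mappings = JSON.parse(File.read('mappings.rails.json'))

field_names = []
collect = lambda do |properties, prefix|
  properties.each do |name, definition|
    path = prefix.empty? ? name : "#{prefix}.#{name}"
    field_names << path if definition.key?('type')
    collect.call(definition['properties'], path) if definition['properties']
  end
end

mappings.each_value do |index|
  collect.call(index.dig('mappings', 'properties') || {}, '')
end

field_names.group_by { |name| name.split('_').first }
  .sort_by { |_, fields| -fields.size }
  .first(5)
  .each do |prefix, fields|
    share = 100.0 * fields.size / field_names.size
    puts format('%-12s %4d fields (%4.1f%%)', prefix, fields.size, share)
  end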
While these granular breakdowns can be very useful for analysis, they are too fine-grained for Elasticsearch to handle.
We should therefore either make these a little more coarse-grained (to reduce the number of fields), or convert them to a single array field containing key-value pairs (a rough sketch of the latter follows below).
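A rough sketch of the key-value option (the field names below are illustrative, not a final schema): the same per-step durations become a single array field, so the mapping needs only two leaf fields regardless of how many steps exist.
# Illustrative sketch: instead of one field per step name...
wide_entry = {
  'pipeline_step_build_associations_duration_s' => 0.012,
  'pipeline_step_create_deployments_duration_s' => 0.148,
}

# ...emit a single array of key-value pairs. Elasticsearch then only needs
# to map two fields (name and duration_s), no matter how many steps appear.
pairs = wide_entry.map do |field, duration|
  {
    'name' => field.delete_prefix('pipeline_step_').delete_suffix('_duration_s'),
    'duration_s' => duration,
  }
end

narrow_entry = { 'pipeline_steps' => pairs }
# => {"pipeline_steps"=>[{"name"=>"build_associations", "duration_s"=>0.012},
#                        {"name"=>"create_deployments", "duration_s"=>0.148}]}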
Verification
This Kibana search should no longer produce an error.
It's also possible to check the number of fields in the Rails schema:
$ curl -s "$ELASTICSEARCH_URL/$(curl -s $ELASTICSEARCH_URL/pubsub-rails-inf-gprd/_alias | jq -r 'to_entries[]|select(.value.aliases[].is_write_index)|.key')/_mappings" | jq -r '.[]|paths|join(".")' | sd -s '.properties.' '.' | grep '\.type$' | wc -l
1189
This number should be significantly below 1024.
Workaround
This error occurs when searching across all fields (via the search box at the top of Kibana). By restricting the query to specific fields (for example json.class: "Gitlab::Ci::Pipeline::Logger", as used in the tracking snippet below), you can still get responses.
Technical Proposal
There are simple modifications we can make, while we wait for #373143, to reduce the number of fields:
- !103165 (merged) - Remove any fields we think are not needed or that show data not relevant to the performance problem.
  - In https://log.gprd.gitlab.net/goto/5bb29a20-6038-11ed-b0ec-930003e0679c we can clearly see the big hitters from pipeline_step_gitlab_ci_pipeline_chain_*. We can consider logging only the top 10 (instead of the 33 we log today).
- !104593 (merged) - Each pipeline_step_gitlab_ci_pipeline_chain_*_duration_s contains .avg, .count, .max, .min and .sum, but we only measure it once (.count is always 1). This means we could have pipeline_step_gitlab_ci_pipeline_chain_*_duration_s contain just the single measured duration (equivalent to .sum), reducing these attributes by 5x.
  - For example: introduce an observe_once method (see the sketch below).
- !104580 (merged) - For the other, more granular observations we could log only .count and .sum (.avg can be derived from those; .min and .max may not be as critical for now).
With this we may be able to achieve a 70-80% reduction in attributes, while still retaining the data we need to investigate Pipeline creation is slow (&7290).
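A minimal sketch of what the observe_once idea from !104593 and the reduced aggregates from !104580 could look like. The class name, method names, and flattened key format are assumptions made for illustration; this is not the actual Gitlab::Ci::Pipeline::Logger implementation.
# Illustrative sketch only; names and structure are assumptions.
class StepDurationLog
  def initialize
    @single = {}
    @observations = Hash.new { |hash, key| hash[key] = [] }
  end

  # For steps measured exactly once: a single scalar field instead of
  # .avg/.count/.max/.min/.sum (five times fewer attributes per step).
  def observe_once(step, duration_s)
    @single["#{step}_duration_s"] = duration_s
  end

  # For repeated, more granular observations: keep the raw values and
  # export only .count and .sum (.avg can be derived from those later).
  def observe(step, duration_s)
    @observations[step] << duration_s
  end

  def to_h
    repeated = @observations.flat_map do |step, values|
      [["#{step}_duration_s.count", values.size],
       ["#{step}_duration_s.sum", values.sum.round(6)]]
    end.to_h

    @single.merge(repeated)
  end
end

log = StepDurationLog.new
log.observe_once('pipeline_step_gitlab_ci_pipeline_chain_build_associations', 0.012)
log.observe('config_build_context', 0.003)
log.observe('config_build_context', 0.002)
log.to_h
# => {"pipeline_step_gitlab_ci_pipeline_chain_build_associations_duration_s"=>0.012,
#     "config_build_context_duration_s.count"=>2,
#     "config_build_context_duration_s.sum"=>0.005}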
How to track improvements
# dump the content of a document from Kibana into document.json
# query: json.class: "Gitlab::Ci::Pipeline::Logger" AND NOT json.pipeline_source: "unknown"
require 'json'

json = JSON.parse(File.read('document.json'))

# fields introduced by the default JSON logger; we won't touch these
default_fields = 25

# count the number of exported fields; not all of them are added by the
# pipeline logger, but this number works as a baseline before reducing fields
json['fields'].select { |f| f.start_with?('json.') }.count - default_fields
# => 390