Evaluate recommendations from Elastic for logging cluster

Email detailing recommendations from Elastic:

It's been a while since our last call. If you remember, the Elastic team went over a couple of the diagnostics that were run on some of the clusters the Reliability team was overseeing.

I thought it would be good to summarize some of the findings and recommendations we went over on our last call. Also, we wanted to check in and see if the GitLab team has had a chance to look at or implement any of the changes we reviewed. Are there still any recurring or persistent issues?

Doing inefficient searches

Aggregating on text fields is not optimal.
Using Field Data is also not recommended.
This is reflected in the circuit breakers.
Recommend adding keyword fields for the values that are important for aggregations.
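As a rough illustration of the keyword-field recommendation, the sketch below builds the JSON bodies for a mapping with a `keyword` sub-field and a terms aggregation that targets it. The field names (`message`, `severity`) and the aggregation name are hypothetical and not taken from our indices; only the body shapes follow the standard Elasticsearch mapping and search formats.

```python
import json


def build_mapping() -> dict:
    """Mapping body: full-text search on `severity`, plus a keyword sub-field.

    Field names are hypothetical placeholders.
    """
    return {
        "mappings": {
            "properties": {
                # Free-text log line: searched, not aggregated.
                "message": {"type": "text"},
                # Multi-field: `severity` is analyzed text, `severity.keyword`
                # is stored as an exact value backed by doc values, so
                # aggregating on it does not need fielddata.
                "severity": {
                    "type": "text",
                    "fields": {
                        "keyword": {"type": "keyword", "ignore_above": 256}
                    },
                },
            }
        }
    }


def build_terms_aggregation(field: str, size: int = 10) -> dict:
    """Search body that returns only bucket counts for `field`."""
    return {
        "size": 0,  # skip hits, return aggregation results only
        "aggs": {"by_value": {"terms": {"field": field, "size": size}}},
    }


mapping = build_mapping()
query = build_terms_aggregation("severity.keyword")

print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```

The point of the sketch is that the aggregation targets `severity.keyword` rather than the analyzed `severity` field, which is what avoids loading fielddata onto the heap.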

The cluster is underpowered, mainly in disk I/O, and needs optimization, a scale-up, or both.

It seems like you are performing flushes. This should be automatic; what’s the reason here?
Recommend increasing the refresh interval to 5 or 10 seconds, rather than the default (1 second), for high-volume ingest.
Recommend more primary shards to distribute load better.
Total shard count looks good.
Check that the snapshot policy is not too aggressive; snapshots also cause disk load.
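A minimal sketch of what the refresh and shard recommendations could look like as request bodies, assuming a composable index template is used for new indices. The index pattern `logs-example-*` and the shard count of 6 are made-up values for illustration, not a sizing proposal. `refresh_interval` is a dynamic setting and can also be changed on an existing index, while `number_of_shards` only takes effect at index creation.

```python
import json


def build_index_template(pattern: str, primary_shards: int,
                         refresh_interval: str = "10s") -> dict:
    """Body for a composable index template applied to newly created indices."""
    return {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                "index": {
                    # Static setting: applies only when the index is created.
                    "number_of_shards": primary_shards,
                    # Fewer refreshes means fewer small segments written to
                    # disk during high-volume ingest.
                    "refresh_interval": refresh_interval,
                }
            }
        },
    }


def build_refresh_update(refresh_interval: str = "10s") -> dict:
    """Body for updating the refresh interval on an existing index."""
    return {"index": {"refresh_interval": refresh_interval}}


# Hypothetical values for illustration only.
template = build_index_template("logs-example-*", primary_shards=6)
update = build_refresh_update("5s")

print(json.dumps(template, indent=2))
print(json.dumps(update, indent=2))
```

The trade-off with a longer refresh interval is that newly indexed log lines take up to that long to become searchable.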

Parent/child mappings are resource intensive. Why are they used for a logging cluster?
CPU load looks good; single digits is good.

This issue is to break down these recommendations into issues for reasonable increments of work. It can be closed when that breakdown is done.

Edited Feb 22, 2023 by Dave Smith