Evaluate the use of force merge in the production logging Elasticsearch cluster
## Summary

<!-- Give context for what problem this issue is trying to prevent from happening again. Provide a brief assessment of the risk (chance and impact) of the problem that this corrective action fixes, to assist with triage and prioritization. -->

We have been seeing performance problems in the production logging cluster recently. One of the recommendations from Elastic Cloud support was to run force merge on the indices. We do not see high resource utilization on the warm nodes, so I suspect the reasoning is to improve the performance of indices between the time they are rolled over and the time they are moved to the warm tier. During this window the indices are still on the hot nodes, and the force merge is presumably expected to reduce the number of segments per shard.

We used to run force merge on the hot tier, and it caused saturation there. I still think we should test this and settle on the best approach. Possible outcomes:

- Run force merge on the warm tier, to improve the performance of indices after they have been moved to the warm nodes.
- Run force merge on the hot tier and resize the hot tier, to improve the performance of indices after they have been rolled over but before they are moved to the warm nodes.

This is not a simple config change that can be reviewed after a couple of days in production. One possible rollout approach:

- Roll it out during low-traffic hours.
- Check that merge tasks are executed on the tier where we want them to run.
- Check thread utilization on that tier.
- If everything looks as desired, wait until high-traffic hours to quantify the impact.

## Related Incident(s)

<!-- Note the originating incident(s) and link known related incidents/other issues.
The relation will happen automatically if you are creating this issue from an incident, if this isn't done already please uncomment the following line: -->

Originating issue(s):

- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6993 (the most recent)

## Desired Outcome/Acceptance Criteria

<!-- How will you know that this issue is complete? If you have any initial thoughts on implementation details (e.g. what to do or not do, gotchas, edge cases etc.), please share them while they are fresh in your mind. -->

We know the best force-merge strategy for the logging cluster.

## Associated Services

<!-- Apply the appropriate services associated with this corrective action if applicable. ~Service::SERVICE_NAME -->

## Corrective Action Issue Checklist

* [x] Link the incident(s) this corrective action arose out of
* [x] Give context for what problem this corrective action is trying to prevent from re-occurring
* [x] Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
* [x] Assign a priority (this will default to 'priority::4')
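If we go with the warm-tier option, one way to express it is through the index lifecycle policy's `forcemerge` action in the `warm` phase. A minimal sketch, assuming a hypothetical policy name `logging-policy`, a `min_age` of one day, and merging down to a single segment (all three are placeholders to be tuned, not decisions this issue has made):

```json
PUT _ilm/policy/logging-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "1d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}
```

Note that which nodes actually run the merge threads depends on where the shards are when the `forcemerge` step executes within the phase, so this is exactly the kind of thing the rollout plan above should verify rather than assume.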
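For the verification steps in the rollout plan, a few requests (Kibana Dev Tools syntax) that could be used to confirm where merges run and what they achieve; `logs-*` is a placeholder index pattern:

```
# Are force-merge tasks currently running, and on which nodes?
GET _tasks?actions=*forcemerge*&detailed=true

# force_merge thread pool activity per node (watch active, queue, rejected)
GET _cat/thread_pool/force_merge?v&h=node_name,active,queue,rejected,completed

# Segment count per shard, to confirm the merge had the intended effect
GET _cat/shards/logs-*?v&h=index,shard,prirep,node,segments.count
```

Cross-referencing the node names in the task and thread-pool output against the hot/warm node roles should answer the "which tier did the merge run on" question directly.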