Evaluate the use of force merge in the production logging Elasticsearch cluster
## Summary

<!-- Give context for what problem this issue is trying to prevent from happening again. Provide a brief assessment of the risk (chance and impact) of the problem that this corrective action fixes, to assist with triage and prioritization. -->

We have been seeing performance problems in the production logging cluster recently. One of the recommendations from Elastic Cloud support was to run force merge on the indices. We do not see high resource utilization on the warm nodes, so I suspect the reasoning is to improve the performance of indices between the time they are rolled over and the time they are moved to the warm tier. During this window the indices are still on the hot nodes, and the force merge is presumably expected to reduce the number of segments per shard.

We used to run force merge on the hot tier, and it caused saturation there. I still think we should test this and settle on the best approach. Possible outcomes:

- Run force merge on the warm tier, to improve the performance of indices after they have been moved to the warm nodes.
- Run force merge on the hot tier and resize the hot tier, to improve the performance of indices after they have been rolled over but before they are moved to the warm nodes.

This is not a simple config change that can be reviewed after a couple of days in production. One possible rollout approach:

- Roll it out during low-traffic hours.
- Check that merge tasks are executed on the tier where we want them to run.
- Check thread utilization on that tier.
- If everything looks as desired, wait until high-traffic hours to quantify the impact.

## Related Incident(s)

<!-- Note the originating incident(s) and link known related incidents/other issues.
The relation will happen automatically if you are creating this issue from an incident, if this isn't done already please uncomment the following line: -->

Originating issue(s):

- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6993 (the most recent)

## Desired Outcome/Acceptance Criteria

<!-- How will you know that this issue is complete? If you have any initial thoughts on implementation details (e.g. what to do or not do, gotchas, edge cases etc.), please share them while they are fresh in your mind. -->

We know the best force-merge strategy for the logging cluster.

## Associated Services

<!-- Apply the appropriate services associated with this corrective action if applicable. ~Service::SERVICE_NAME -->

## Corrective Action Issue Checklist

* [x] Link the incident(s) this corrective action arose out of
* [x] Give context for what problem this corrective action is trying to prevent from re-occurring
* [x] Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
* [x] Assign a priority (this will default to 'priority::4')
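If we go with the warm-tier option, one way to express it is through the index lifecycle policy's `forcemerge` action in the `warm` phase. A minimal sketch, assuming a hypothetical policy name `logging-policy`, a `min_age` of one day, and merging down to a single segment (all three are placeholders to be tuned, not decisions this issue has made):

```json
PUT _ilm/policy/logging-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "1d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}
```

Note that which nodes actually run the merge threads depends on where the shards are when the `forcemerge` step executes within the phase, so this is exactly the kind of thing the rollout plan above should verify rather than assume.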
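For the verification steps in the rollout plan, a few requests (Kibana Dev Tools syntax) that could be used to confirm where merges run and what they achieve; `logs-*` is a placeholder index pattern:

```
# Are force-merge tasks currently running, and on which nodes?
GET _tasks?actions=*forcemerge*&detailed=true

# force_merge thread pool activity per node (watch active, queue, rejected)
GET _cat/thread_pool/force_merge?v&h=node_name,active,queue,rejected,completed

# Segment count per shard, to confirm the merge had the intended effect
GET _cat/shards/logs-*?v&h=index,shard,prirep,node,segments.count
```

Cross-referencing the node names in the task and thread-pool output against the hot/warm node roles should answer the "which tier did the merge run on" question directly.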