Evaluate the use of force merge in the production logging Elasticsearch cluster

Summary

We have recently seen performance problems in the production logging cluster. One of the recommendations from Elastic Cloud support was to run force merge on the indices. Resource utilization on the warm nodes is not high, so I suspect the intent is to improve the performance of indices between the time they are rolled over and the time they are moved to the warm tier. During that window the indices are still on the hot nodes, and the force merge is presumably expected to reduce the number of segments within each shard.

We used to run force merge on the hot tier, and it caused saturation. I think we should test this anyway and work out the best approach.

Possible outcomes:

run force merge on the warm tier to improve the performance of indices after they have been moved to the warm nodes
run force merge on the hot tier, and resize the hot tier, to improve the performance of indices after they have been rolled over but before they are moved to the warm nodes
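
To make the two outcomes concrete, here is a rough sketch of how the ILM policy body could differ between them. It is not our actual policy: the `max_age`, `min_age`, and `set_priority` values are placeholders I made up, and `build_policy` is a hypothetical helper. The one constraint I am relying on is that ILM only accepts `forcemerge` in the hot phase when that phase also contains a `rollover` action.

```python
def build_policy(tier: str, max_num_segments: int = 1) -> dict:
    """Build an ILM policy body with forcemerge in the chosen phase.

    All ages and priorities below are placeholder assumptions,
    not values taken from the production logging cluster.
    """
    if tier not in ("hot", "warm"):
        raise ValueError("tier must be 'hot' or 'warm'")

    phases = {
        "hot": {
            # forcemerge in the hot phase requires rollover in the same phase
            "actions": {"rollover": {"max_age": "1d"}},
        },
        "warm": {
            "min_age": "2d",
            "actions": {"set_priority": {"priority": 50}},
        },
    }
    # Attach the force merge to whichever tier we want to test.
    phases[tier]["actions"]["forcemerge"] = {"max_num_segments": max_num_segments}
    return {"policy": {"phases": phases}}
```

The returned dict is meant to be sent as the body of `PUT _ilm/policy/<policy name>`. Switching between the two outcomes is then a one-argument change, which should make it easy to roll back.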

I don't think this is a simple config change that can just be reviewed after a couple of days in production. One approach I can think of:

roll it out during low-traffic hours
check that merge tasks are executed on the tier where we want them to run
check thread utilization on that tier
if everything looks as expected, potentially wait until high-traffic hours to quantify the impact
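
The two checks above could be scripted roughly as follows. This is only a sketch: it assumes the rows come from `GET _cat/thread_pool/force_merge?format=json&h=node_name,name,active,queue,rejected` (where numeric columns arrive as strings), and that we build the node-to-tier mapping ourselves, for example from `GET _cat/nodes`. The function names and the node names in the usage example are hypothetical.

```python
from collections import defaultdict


def summarize_force_merge(rows, node_tier):
    """Sum force_merge thread pool counters per tier.

    rows: list of dicts with node_name, active, queue, rejected
          (the assumed shape of the _cat/thread_pool JSON output).
    node_tier: dict mapping node name to tier, e.g. "hot" or "warm".
    """
    totals = defaultdict(lambda: {"active": 0, "queue": 0, "rejected": 0})
    for row in rows:
        tier = node_tier.get(row["node_name"], "unknown")
        for key in ("active", "queue", "rejected"):
            totals[tier][key] += int(row[key])
    return dict(totals)


def merges_confined_to(totals, tier):
    """True if no other tier has active or queued force merge work."""
    return all(
        name == tier or (counts["active"] == 0 and counts["queue"] == 0)
        for name, counts in totals.items()
    )
```

For example, with made-up node names:

```python
rows = [
    {"node_name": "instance-01", "active": "0", "queue": "0", "rejected": "0"},
    {"node_name": "instance-02", "active": "1", "queue": "3", "rejected": "0"},
]
tiers = {"instance-01": "hot", "instance-02": "warm"}
totals = summarize_force_merge(rows, tiers)
print(totals["warm"], merges_confined_to(totals, "warm"))
```

A growing `queue` or a non-zero `rejected` count on the target tier would be the signal to stop before high-traffic hours.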

Related Incident(s)

Originating issue(s):

production#6993 (closed) (the most recent)

Desired Outcome/Acceptance Criteria

We know what the best strategy is for force merge for the logging cluster.

Associated Services

Corrective Action Issue Checklist

Link the incident(s) this corrective action arose out of
Give context for what problem this corrective action is trying to prevent from re-occurring
Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
Assign a priority (this will default to 'priority::4')