Non-prod Elastic Cluster indicates healthy, but is terribly slow

Details

  • Point of contact for this request: groupdelivery
  • If a call is needed, what is the proposed date and time of the call: Date and Time
  • Additional call details (format, type of call): additional details

SRE Support Needed

We use the non-prod Elastic cluster to ship log data from our automated tooling. At some point (we don't know exactly when), the tooling started seeing failures when contacting the Elasticsearch cluster to send data. Note that we talk directly to the API; Pub/Sub is not involved. We've seen an increase in timeout errors from the API, and raising the timeout in the library our tooling uses has not helped. Messages do still land in Elasticsearch; our tooling simply throws an error for each message it sends. Example: https://ops.gitlab.net/gitlab-org/release/tools/-/jobs/12906067
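For context, the timeout handling in our tooling looks roughly like the sketch below. This is an illustration, not our actual code: `send_with_retries` and `flaky_send` are hypothetical names, and the point is that raising the timeout alone doesn't help when the cluster is slow, so we also retry with exponential backoff.

```python
import time

def send_with_retries(send_fn, payload, attempts=3, base_delay=0.01, timeout=60):
    # Retry send_fn on timeout, doubling the backoff delay each attempt.
    last_err = None
    for attempt in range(attempts):
        try:
            return send_fn(payload, timeout=timeout)
        except TimeoutError as err:
            last_err = err
            time.sleep(base_delay * (2 ** attempt))
    raise last_err

# Illustrative stub standing in for the real client: it times out
# twice, then succeeds on the third call.
calls = {"n": 0}
def flaky_send(payload, timeout):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("request timed out")
    return {"acknowledged": True, "payload": payload}

result = send_with_retries(flaky_send, {"msg": "log line"})
```

Even with this pattern, each message still hits the timeout path first, which matches the per-message errors we see in the job linked above.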

Looking at the cluster, Elastico reports it as healthy. Browsing Kibana, however, is terribly sluggish, and our own monitoring dashboards show the Elasticsearch cluster at several points of high saturation.
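This gap between "healthy" and "usable" is why the cluster status alone isn't convincing: a green status only means all shards are allocated, while pending cluster tasks and thread-pool rejections (from `_cluster/health` and `_nodes/stats/thread_pool`) show whether the cluster is actually keeping up. A minimal sketch, using fabricated sample values (not real readings from this cluster) and hypothetical thresholds:

```python
import json

# Fabricated sample responses, shaped like _cluster/health and the
# per-node "write" thread pool from _nodes/stats/thread_pool.
health = json.loads(
    '{"status": "green", "number_of_pending_tasks": 412,'
    ' "task_max_waiting_in_queue_millis": 95000}'
)
write_pool = json.loads('{"queue": 180, "rejected": 52341}')

def saturation_warnings(health, write_pool, max_pending=50, max_rejected=0):
    # Flag saturation signals that a green status does not capture.
    warnings = []
    if health["number_of_pending_tasks"] > max_pending:
        warnings.append("pending cluster tasks piling up")
    if write_pool["rejected"] > max_rejected:
        warnings.append("write thread pool rejecting requests")
    return warnings

warnings = saturation_warnings(health, write_pool)
```

With values like these, a cluster can report green while both checks fire, which is consistent with what the dashboards below suggest.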

Our dashboards show a massive ingestion uptick in the early morning from an unknown source. It has had a long-lasting effect throughout the entire day, even after ingestion went back down to normal: https://dashboards.gitlab.net/d/logging-main/logging3a-overview?orgId=1&var-PROMETHEUS_DS=PA258B30F88C30650&var-environment=gstg&from=1708437757058&to=1708457274223

image

The same can be seen from Elastico's dashboard:

image

image

I do not believe this cluster is healthy.

Should groupdelivery move where we send our logs?