Figure out why Elasticsearch went OOM
From https://gitlab.com/gitlab-com/gl-infra/production/issues/1591#note_277584126 and https://gitlab.com/gitlab-org/search-team/team-tasks/issues/8#note_277880394, we believe an Elasticsearch instance crashed after running out of memory.
The failure did not resolve itself and cascaded in a way that made the whole cluster unstable.
So we want to understand:
- What caused the initial failure? Were we sending payloads that were too large? Were we sending too many requests? Something else?
- Why couldn't the system fail over and recover, given that the cluster has redundancy?
- Do we need to resize the nodes, and if so, what is the appropriate size? (A sketch for pulling per-node heap stats follows this list.)
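As a first pass on the sizing question, something along these lines could poll per-node JVM heap usage via the standard `_nodes/stats` API (a minimal sketch; the cluster URL is a placeholder and the 85% warning threshold is an assumption, not a measured limit):

```python
import requests

ES_URL = "http://localhost:9200"  # placeholder; point at the real cluster
WARN_PCT = 85  # assumed warning threshold, not a measured limit

# _nodes/stats/jvm returns per-node JVM stats, including heap usage.
# Sustained heap near the max before the crash would suggest the nodes
# are undersized (or the payloads oversized) for the indexing load.
resp = requests.get(f"{ES_URL}/_nodes/stats/jvm", timeout=10)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    mem = node["jvm"]["mem"]
    pct = mem["heap_used_percent"]
    flag = " <-- high" if pct >= WARN_PCT else ""
    print(f"{node['name']}: heap {pct}% of "
          f"{mem['heap_max_in_bytes'] / 1024**3:.1f} GiB{flag}")
```

Running this on a schedule (or just pulling heap and GC graphs from monitoring) during any re-test would tell us how close to the limit the nodes actually run.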
We may want to involve Elastic support to help figure this out.
We may want to run another live experiment if the existing logs aren't enough to answer the questions above. We could try enabling indexing without enabling searching, then re-enable indexing for the big customer to simulate the same load. Before doing this we need a good handle on what safe levels are for the Sidekiq queues, so that we know we won't OOM Redis (a monitoring sketch follows below). We can also disable indexing as soon as the cluster becomes unstable, which happened quite quickly last time, and that should still give us an opportunity to see what's happening.
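To put a number on "safe levels" during such an experiment, a small watcher like the one below could track queue depth and Redis memory (a minimal sketch: Sidekiq stores each queue as a Redis list under `queue:<name>`, but the queue names, Redis URL, and ceiling here are assumptions to be replaced with our real values):

```python
import redis

REDIS_URL = "redis://localhost:6379/0"  # assumed Sidekiq Redis endpoint
QUEUES = ["elastic_commit_indexer", "elastic_indexer"]  # assumed queue names
MAX_DEPTH = 50_000  # hypothetical ceiling; derive the real one from Redis memory headroom

r = redis.from_url(REDIS_URL)

# Sidekiq keeps each queue as a Redis list, so LLEN = enqueued jobs.
for q in QUEUES:
    depth = r.llen(f"queue:{q}")
    note = "" if depth < MAX_DEPTH else "  <-- over limit, consider pausing indexing"
    print(f"{q}: {depth} jobs{note}")

# Redis memory is what we actually must not exhaust.
mem = r.info("memory")
print(f"Redis used_memory: {mem['used_memory'] / 1024**2:.0f} MiB")
```

Watching how fast queue depth and `used_memory` grow with indexing enabled but searching off would give us the ceiling before we repeat the test for the big customer.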
It would also be great to try to simulate this on staging, but without real usage we may never be able to replicate what actually happened.
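If we do try staging, a crude bulk-indexing load generator would at least let us dial payload size and request rate up until the cluster misbehaves, even though it can't reproduce the real query mix. A sketch using the official `elasticsearch` Python client (the index name, document shape, and batch sizes are all made up for illustration):

```python
import random
import string

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed staging endpoint

def random_doc(size_kb: int) -> dict:
    # Blob-like document; raising size_kb tests the "payloads too large"
    # hypothesis, raising the batch count tests "too many requests".
    body = "".join(random.choices(string.ascii_letters + " ", k=size_kb * 1024))
    return {"_index": "load-test", "_source": {"content": body}}

# Bulk-index in batches, roughly approximating a large indexing burst.
for batch in range(100):
    actions = (random_doc(size_kb=64) for _ in range(500))
    helpers.bulk(es, actions)
    print(f"batch {batch} indexed")
```

Even if this never triggers the same failure, watching heap and GC behavior under it would show whether pure indexing volume can destabilize nodes of the current size.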
Elastic Case
https://support.elastic.co/customers/s/case/5004M00000Yog8qQAB