Investigate pubsubbeat HPA scaling behavior
Some questions to answer:
- are we using the correct metric for scale-ups?
  - should we use `pubsub.googleapis.com/subscription/oldest_unacked_message_age` instead?
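A minimal sketch of what switching the HPA to that external metric could look like, assuming the GCP custom-metrics-stackdriver-adapter is installed; the deployment name, subscription label, and threshold value are illustrative assumptions, not the real config:

```yaml
# Sketch: HPA driven by Pub/Sub oldest-unacked-message-age (seconds of backlog)
# instead of a CPU- or message-count-based metric. All names/values are assumed.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: pubsubbeat-example
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pubsubbeat-example
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          # The stackdriver adapter exposes GCP metrics with '|' separators
          name: pubsub.googleapis.com|subscription|oldest_unacked_message_age
          selector:
            matchLabels:
              resource.labels.subscription_id: pubsubbeat-example-sub
        target:
          type: AverageValue
          averageValue: "300"  # assumed threshold: scale up past 5 min of lag
```

The upside of age over `num_undelivered_messages` is that it directly measures how stale logs are, which is what users of the index actually notice.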
- how do we differentiate the config between beats for indices with different indexing rates?
  - this can be easily done by adjusting the threshold per topic in `gitlab-helmfiles`
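As a hedged illustration of the per-topic threshold idea, a values fragment along these lines could live in `gitlab-helmfiles`; the key names and numbers below are hypothetical, not the actual chart schema:

```yaml
# Hypothetical per-topic HPA thresholds; a high-rate topic scales up sooner
# than a low-rate one. Key names and values are assumptions for illustration.
pubsubbeats:
  topics:
    workhorse:
      hpa:
        targetOldestUnackedMessageAge: 120   # busy topic: scale early
    gitaly:
      hpa:
        targetOldestUnackedMessageAge: 600   # quieter topic: tolerate more lag
```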
- are we using the right requests/limits and the right pubsubbeat config?
  - should we have a smaller number of beats sending bigger batch requests?
  - this can be optimized later, I don't think this is impacting processing bandwidth
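If we do go for fewer beats sending bigger batches, the standard libbeat Elasticsearch output settings would be the knobs to turn; `worker` and `bulk_max_size` are real libbeat options, but the values and host below are assumptions for illustration:

```yaml
# Sketch: fewer, larger bulk requests to ES per beat. Values are illustrative.
output.elasticsearch:
  hosts: ["https://es.example.internal:9200"]  # assumed endpoint
  worker: 2            # fewer concurrent connections per beat
  bulk_max_size: 2048  # bigger batches per bulk request
queue.mem:
  events: 8192         # buffer enough events to fill the larger batches
```

Fewer connections per beat would also reduce the connection-count pressure on ES mentioned below.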
- in case of ES issues (or a burst of traffic), we'll scale to the HPA's max replica count (which will put more pressure on ES), is that ok?
  - yes, we want to process logs at the maximum possible rate. The ES cluster should protect itself by rejecting writes. The additional pressure can come from too many connections, but that's an acceptable overhead; we over-provisioned masters in the past to accommodate this. This is related to the resources/config question above. I think this can be optimized at a later date.
- workhorse and gitaly were saturated again today, but let's hold off on changing the max replicas until we give ES some more power
  - potentially scale down rails after ES is resized