
Configure the Pubsubbeat HPA to use the oldest unacked message in PubSub as a target

Pierre Guinoiseau requested to merge pguinoiseau/pubsubbeat-hpa-tuning into master

See this discussion: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15501#note_885771618

  • The PubSub backlog size is not a good scaling target: it can grow sharply within a few seconds without the logs actually being late (yet), and Pubsubbeat can cope with that. Using this metric can make the HPA scale up too soon. The age of the oldest unacked message should work better: the metrics history shows that it stays under 50 seconds in normal times, so 60 seconds should be a good target.
  • We should use Value rather than AverageValue for external metrics. AverageValue divides the metric value by the number of replicas, which is not what we want here and keeps the HPA from scaling up when it should.
  • The HPA reports an average CPU utilization of ~67% under full load, which looked odd at first. But that is 67% of the CPU limit of 1500m, i.e. roughly 100% of one core, which makes sense because Pubsubbeat is single-threaded (as far as I can tell). The 70% target is therefore never reached. Lowering the target to 60% corresponds to 90% of one core, which should work better.
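
The arithmetic behind the last two points can be sketched in Python. This is a simplified model of the documented HPA scaling formula, not the controller's actual code: it ignores the tolerance band, stabilization windows and min/max replica bounds. The function names are hypothetical, and the sample inputs (10 replicas, an oldest unacked message age of 120 seconds) are illustrative assumptions rather than values from the metrics history. The 1500m base and the 60/67/70% figures come from the description above.

```python
import math


def desired_replicas_value(current_replicas: int, metric_value: float, target_value: float) -> int:
    """Value target: the raw metric is compared to the target and the
    current replica count is scaled by that ratio."""
    return math.ceil(current_replicas * metric_value / target_value)


def desired_replicas_average_value(metric_value: float, target_average_value: float) -> int:
    """AverageValue target: the metric is divided by the replica count before
    being compared to the target, so the replica count cancels out and the
    result is simply metric / target."""
    return math.ceil(metric_value / target_average_value)


def utilization_to_millicores(utilization_percent: int, cpu_base_millicores: int) -> int:
    """Convert a utilization percentage into absolute millicores."""
    return utilization_percent * cpu_base_millicores // 100


if __name__ == "__main__":
    # Illustrative assumption: 10 replicas, oldest unacked message is 120 s old,
    # target is 60 s.
    print("Value:", desired_replicas_value(10, 120, 60))
    print("AverageValue:", desired_replicas_average_value(120, 60))

    # CPU figures from the description: the percentage is taken against 1500m.
    for percent in (67, 70, 60):
        print(f"{percent}% of 1500m =", utilization_to_millicores(percent, 1500), "m")
```

Under these assumptions the Value target asks for more replicas than are currently running, while the AverageValue target yields a count far below the current one, so this metric never drives a scale-up. On the CPU side, 70% of 1500m is above the 1000m that a single-threaded process can consume, whereas 60% is 900m, which is reachable.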
Edited by Pierre Guinoiseau
