Vertical scaling limits for the Thanos receiver component

What does this MR do and why?

This MR implements a VerticalPodAutoscaler object for the thanos-receive statefulset in the thanos namespace. Unlike other Thanos components, the receiver does not support horizontal autoscaling, hence the need for a vertical autoscaler. The limits were chosen based on current usage in IC clusters and on the values previously set for requests and limits.
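As a minimal sketch of the kind of object this MR adds (illustrative only; the actual manifest in the MR may differ in naming and policy):

```yaml
# Sketch of a VerticalPodAutoscaler targeting the thanos-receive
# statefulset (illustrative; not the exact object from this MR).
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: thanos-receive
  namespace: thanos
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: thanos-receive
  updatePolicy:
    updateMode: "Auto"   # VPA evicts pods to apply new requests/limits
```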

The autoscaler's behavior still needs to be studied: the new values allocated to the pod are somewhat difficult to track, because the VPA updates them at a very granular level (as can be seen in the image below).

*(screenshot: VPA updating pod resource values at a very granular level)*

In other cases, the VPA updates the values in a similarly granular way, but still keeps the overall shape of the values that were set:

*(screenshots: VPA value updates in other cases)*

IMPORTANT NOTE: The vertical-pod-autoscaler sets the requests and limits automatically, based on its own analysis over a period of time.

I have also looked for ways to change this behavior, but at the moment there is no configurable argument that lets the Vertical Pod Autoscaler scale the container's requests and limits at a specific pace.
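For reference, the upstream VPA CRD only exposes bounds, not a pacing knob: a resourcePolicy can constrain how far the recommender moves the values, but not how fast. A sketch (the bounds below are illustrative, not the values chosen in this MR):

```yaml
# resourcePolicy can bound VPA recommendations, but there is no field
# to control how fast or how often values change (bounds are illustrative).
spec:
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]
        minAllowed:
          cpu: 500m
          memory: 1Gi
        maxAllowed:
          cpu: "4"
          memory: 16Gi
```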

This MR is also connected to: !6381 and !6377

Related to issue #616

Test coverage

I used a benchmark to test an increase in the limits of the thanos-receive statefulset. What has been observed so far:

  • The VPA modifies the values set for requests and limits only at the container level, so there will always be a difference between the values set on the parent resource (the statefulset in this case) and those on the container. This is useful for the operations team: in case of malfunction, they can simply remove the vertical-pod-autoscaler object and the parent resource's values will be applied instead.
  • An initial set of requests and limits is still needed for the VPA to be able to compute recommendations. The VPA will not add them; it only adjusts them over time, based on usage. The values chosen now were based on the consumption of the thanos-receive statefulset in IC clusters.
  • If you want to replicate the behavior locally, expect an initial wait period before the VPA takes effect, as it has an internal policy not to update pods it considers short-lived.
  • The usage difference in the benchmark needs to be considerable to trigger changes to the requests and limits (during testing, the benchmark deployment's replicaset was scaled from 2 to 20 pods).
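To illustrate the second bullet, the statefulset must still carry an initial resources stanza for the VPA to adjust; a placeholder sketch (the values are illustrative, not the ones chosen in this MR):

```yaml
# Initial requests/limits on the thanos-receive container; the VPA
# adjusts these over time but will not create them (values are placeholders).
containers:
  - name: thanos-receive
    resources:
      requests:
        cpu: "1"
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 8Gi
```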

CI configuration

Below you can choose test deployment variants to run in this MR's CI.

Click to open the CI configuration

Legend:

Icon Meaning Available values
☁️ Infra Provider capd, capo, capm3
🚀 Bootstrap Provider kubeadm (alias kadm), rke2, okd, ck8s
🐧 Node OS ubuntu, suse, na, leapmicro
🛠️ Deployment Options light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging, cilium
🎬 Pipeline Scenarios Available scenario list and description
🟢 Enabled units Any available unit name; by default applies to both the management and workload cluster. Can be prefixed by mgmt: or wkld: to apply only to a specific cluster type
🏗️ Target platform Can be used to select a specific deployment environment (e.g. real-bmh for capm3)
  • 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu

  • 🎬 preview ☁️ capo 🚀 rke2 🐧 suse

  • 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu

  • ☁️ capd 🚀 kadm 🛠️ light-deploy 🐧 ubuntu

  • ☁️ capd 🚀 rke2 🛠️ light-deploy 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 leapmicro

  • ☁️ capo 🚀 kadm 🐧 ubuntu

  • ☁️ capo 🚀 kadm 🐧 ubuntu 🟢 neuvector,mgmt:harbor

  • ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capo 🚀 kadm 🎬 wkld-k8s-upgrade 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🛠️ ha,misc,openbao 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 suse 🎬 upgrade-from-prev-tag

  • ☁️ capm3 🚀 rke2 🐧 suse

  • ☁️ capm3 🚀 kadm 🐧 ubuntu

  • ☁️ capm3 🚀 ck8s 🐧 ubuntu

  • ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 wkld-k8s-upgrade 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 rke2 🛠️ misc,ha 🐧 suse

  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha,misc 🐧 suse

  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 ck8s 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2|okd 🎬 no-update 🐧 ubuntu|na

  • ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-from-release-1.5

  • ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-to-main

Global config for deployment pipelines

  • autorun pipelines
  • allow failure on pipelines
  • record sylvactl events

Notes:

  • Enabling autorun will make deployment pipelines run automatically without human interaction
  • Disabling allow failure will make deployment pipelines mandatory for pipeline success
  • If both autorun and allow failure are disabled, deployment pipelines will need manual triggering but will block the pipeline

Be aware: after a configuration change, the pipeline is not triggered automatically. Please run it manually (by clicking the Run pipeline button in the Pipelines tab) or push new code.

Edited by Tiberiu Mihai
