Vertical Scaling Limits for the Thanos Receive Component
What does this MR do and why?
This MR implements a Vertical Pod Autoscaler (VPA) object for the thanos-receive statefulset in the thanos namespace. Unlike other Thanos components, the receiver does not support horizontal autoscaling, hence the need for a vertical autoscaler. The limits have been chosen based on current usage in IC clusters and on the values previously set for requests and limits.
The behavior of the autoscaler also needs further study, as the new values allocated to the pod are somewhat difficult to handle: the VPA updates them at a very granular level (as can be seen in the image below).
In other cases, the VPA updates the values in a granular fashion while still preserving the overall shape of the values that were set:
IMPORTANT NOTE: The vertical-pod-autoscaler sets the requests and limits automatically, based on its own analysis performed over a period of time.
I have also looked into changing this behavior, but at the moment there is no configurable argument that allows the Vertical Pod Autoscaler to scale the container's requests and limits at a specific pace.
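For illustration, a VPA object targeting this statefulset could look like the sketch below. The container name and the resource bounds are assumptions for the example, not the values shipped in this MR; the bounds only constrain how far the VPA can move the requests, since, as noted above, there is no knob controlling the pace of updates:

```yaml
# Hypothetical sketch of a VerticalPodAutoscaler for thanos-receive.
# minAllowed/maxAllowed values here are illustrative placeholders.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: thanos-receive
  namespace: thanos
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: thanos-receive
  updatePolicy:
    updateMode: Auto        # VPA evicts pods to apply new requests/limits
  resourcePolicy:
    containerPolicies:
      - containerName: thanos-receive   # assumed container name
        minAllowed:
          cpu: 500m
          memory: 1Gi
        maxAllowed:
          cpu: "4"
          memory: 16Gi
```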
This MR is also connected to: !6381 and !6377
Related reference(s)
Related to issue #616
Test coverage
I have used a benchmark to test an increase in the limits of the thanos-receive statefulset. What has been observed so far:
- The modification of the values set for requests and limits happens only at the container level, so there will always be a difference between the values set on the parent resource (the statefulset in this case) and the values on the container. This is useful for the operations team: they can simply remove the vertical-pod-autoscaler object in case of malfunction, and the values from the parent resource will be applied instead.
- An initial set of requests and limits is still needed for the VPA to compute from. The VPA does not add them; it only adjusts them over time, based on usage. The values chosen now are based on the consumption of the thanos-receive statefulset in IC clusters.
- If you want to replicate the behavior locally, expect an initial wait period before the VPA takes effect, as it has an internal policy not to update pods it considers short-lived.
- The difference in usage during the benchmark needs to be considerable in order to trigger modifications of the requests and limits (the benchmark deployment had its replica count scaled from 2 to 20 pods when it was tested).
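To observe the container-vs-parent difference described above, the recommendation, the running pod, and the statefulset spec can be compared directly. The commands below are a sketch (resource and pod names are assumed; adjust to your cluster):

```
# Current VPA recommendation:
kubectl -n thanos get vpa thanos-receive \
  -o jsonpath='{.status.recommendation.containerRecommendations}'

# Values actually applied to a running pod (patched by the VPA):
kubectl -n thanos get pod thanos-receive-0 \
  -o jsonpath='{.spec.containers[0].resources}'

# Values declared on the parent statefulset
# (these are what gets re-applied if the VPA object is removed):
kubectl -n thanos get statefulset thanos-receive \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```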
CI configuration
Below you can choose the test deployment variants to run in this MR's CI.
Click to open the CI configuration
Legend:
| Icon | Meaning | Available values |
|---|---|---|
| ☁️ | Infra Provider | capd, capo, capm3 |
| 🚀 | Bootstrap Provider | kubeadm (alias kadm), rke2, okd, ck8s |
| 🐧 | Node OS | ubuntu, suse, na, leapmicro |
| 🛠️ | Deployment Options | light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging, cilium |
| 🎬 | Pipeline Scenarios | Available scenario list and description |
| 🟢 | Enabled units | Any available unit name; by default applies to both management and workload cluster. Can be prefixed by mgmt: or wkld: to be applied only to a specific cluster type |
| | Target platform | Can be used to select a specific deployment environment (e.g. real-bmh for capm3) |
- 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu
- 🎬 preview ☁️ capo 🚀 rke2 🐧 suse
- 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu
- ☁️ capd 🚀 kadm 🛠️ light-deploy 🐧 ubuntu
- ☁️ capd 🚀 rke2 🛠️ light-deploy 🐧 suse
- ☁️ capo 🚀 rke2 🐧 suse
- ☁️ capo 🚀 rke2 🐧 leapmicro
- ☁️ capo 🚀 kadm 🐧 ubuntu
- ☁️ capo 🚀 kadm 🐧 ubuntu 🟢 neuvector,mgmt:harbor
- ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capo 🚀 kadm 🎬 wkld-k8s-upgrade 🐧 ubuntu
- ☁️ capo 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse
- ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha 🐧 ubuntu
- ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha,misc 🐧 ubuntu
- ☁️ capo 🚀 rke2 🛠️ ha,misc 🐧 ubuntu
- ☁️ capo 🚀 rke2 🛠️ ha,misc,openbao 🐧 suse
- ☁️ capo 🚀 rke2 🐧 suse 🎬 upgrade-from-prev-tag
- ☁️ capm3 🚀 rke2 🐧 suse
- ☁️ capm3 🚀 kadm 🐧 ubuntu
- ☁️ capm3 🚀 ck8s 🐧 ubuntu
- ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha,misc 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 wkld-k8s-upgrade 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 rke2 🛠️ misc,ha 🐧 suse
- ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.5.x 🛠️ ha,misc 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 ck8s 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2|okd 🎬 no-update 🐧 ubuntu|na
- ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-from-release-1.5
- ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-to-main
Global config for deployment pipelines
- autorun pipelines
- allow failure on pipelines
- record sylvactl events
Notes:
- Enabling autorun will make deployment pipelines run automatically without human interaction
- Disabling allow failure will make deployment pipelines mandatory for pipeline success
- If both autorun and allow failure are disabled, deployment pipelines will need manual triggering but will block the pipeline
Be aware: after a configuration change, the pipeline is not triggered automatically.
Please run it manually (by clicking the "Run pipeline" button in the Pipelines tab) or push new code.