Metric Queries need to avoid evicted Pods
Some of the metrics we gather include data from Pods that are in the Evicted state. For example, the Running Pods numbers in these charts are greatly inflated:
https://dashboards.gitlab.net/d/kubernetes-resources-workloads-namespace/kubernetes-compute-resources-namespace-workloads?orgId=1&refresh=10s&var-datasource=Global&var-cluster=gprd-gitlab-gke&var-namespace=gitlab&var-interval=4h&var-type=deployment
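One likely fix is to filter on the Pod phase reported by kube-state-metrics. A minimal sketch of a PromQL query that counts only Running Pods, assuming the standard `kube_pod_status_phase` metric is being scraped (exact label names can differ between kube-state-metrics versions):

```promql
# kube_pod_status_phase is 1 for a Pod's current phase and 0 otherwise,
# so summing the Running series yields the number of Running Pods.
sum(kube_pod_status_phase{namespace="gitlab", phase="Running"})
```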
Example:

```shell
$ kubectl get pods -n gitlab | grep memory-bound | grep Running
gitlab-sidekiq-memory-bound-v1-6b58677c89-5752n 1/1 Running 1 7h42m
gitlab-sidekiq-memory-bound-v1-6b58677c89-9d2ff 1/1 Running 0 21m
gitlab-sidekiq-memory-bound-v1-6b58677c89-d4ghz 1/1 Running 0 95m
gitlab-sidekiq-memory-bound-v1-6b58677c89-p2p6k 1/1 Running 0 3h39m
gitlab-sidekiq-memory-bound-v1-6b58677c89-rgkgz 1/1 Running 0 17m
gitlab-sidekiq-memory-bound-v1-6b58677c89-snc2w 1/1 Running 0 5h2m
```
So at the time of the screen grab and the command above, only 6 of these Pods were actually Running; the other 52 were Evicted.
We need to ensure that this style of gathering metrics is not skewing our metrics in other ways, such as saturation. Let's revisit the metrics we chart in Grafana and figure out whether we can exclude Pods that are not running.
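For resource and saturation panels, one possible approach is to join the container metrics against the Pod phase so that series from non-Running Pods are dropped. A hedged sketch, assuming `kube_pod_status_phase` from kube-state-metrics and a shared `pod` label on both sides (older stacks may use `pod_name` on the cAdvisor metrics instead):

```promql
# Keep only memory series for Pods currently in the Running phase.
sum by (pod) (container_memory_working_set_bytes{namespace="gitlab", container!=""})
  * on (pod) group_left()
  (max by (pod) (kube_pod_status_phase{namespace="gitlab", phase="Running"}) == 1)
```

The `== 1` filter drops series for Pods in any other phase before the join, so stale series from Evicted Pods never reach the panel.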
As a bonus, expose the number of evicted Pods somewhere, and ideally graph this data so we can quickly see the rate of evictions. This will improve our ability to observe the health of our clusters and services over time and provide comparison data during incidents.
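To surface evictions directly, a panel could graph the current count of evicted Pods; changes in that line over time show eviction bursts. A sketch, assuming the `kube_pod_status_reason` metric is available (it exists in recent kube-state-metrics releases; on older versions a different source may be needed):

```promql
# Number of Pods currently marked Evicted; graphed over time this shows
# when evictions accumulate.
sum(kube_pod_status_reason{namespace="gitlab", reason="Evicted"})
```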
