Kubernetes Metric Queries need to avoid evicted Pods
Some of the metrics we gather include data from Pods in state `Evicted`. For example, the `Running Pods` numbers in these charts are greatly inflated: https://dashboards.gitlab.net/d/kubernetes-resources-workloads-namespace/kubernetes-compute-resources-namespace-workloads?orgId=1&refresh=10s&var-datasource=Global&var-cluster=gprd-gitlab-gke&var-namespace=gitlab&var-interval=4h&var-type=deployment

Example: ![image](/uploads/1755d16806c0a63226f5f08b78040e25/image.png)

```
$ kubectl get pods -n gitlab | grep memory-bound | grep Running
gitlab-sidekiq-memory-bound-v1-6b58677c89-5752n   1/1   Running   1   7h42m
gitlab-sidekiq-memory-bound-v1-6b58677c89-9d2ff   1/1   Running   0   21m
gitlab-sidekiq-memory-bound-v1-6b58677c89-d4ghz   1/1   Running   0   95m
gitlab-sidekiq-memory-bound-v1-6b58677c89-p2p6k   1/1   Running   0   3h39m
gitlab-sidekiq-memory-bound-v1-6b58677c89-rgkgz   1/1   Running   0   17m
gitlab-sidekiq-memory-bound-v1-6b58677c89-snc2w   1/1   Running   0   5h2m
```

So at the time the screenshot was taken and the above command was run, only 6 Pods were actually Running; 52 of them were Evicted.

We need to ensure this style of metric gathering is not skewing our metrics in other ways, such as saturation. Let's revisit the metrics we chart in Grafana and figure out whether we can exclude Pods that are not running.

As a bonus, expose the number of evicted Pods somewhere, and perhaps graph that data as well so we can quickly see the rate of evictions. This will improve our ability to observe the health of our clusters and services over time and provide comparison data during incidents.
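One possible approach, sketched below, is to filter or join against `kube_pod_status_phase` from kube-state-metrics. Evicted Pods report `status.phase=Failed`, so restricting queries to `phase="Running"` should drop them. The metric and label names (`kube_pod_status_phase`, `container_memory_working_set_bytes`, the `pod` join label, and the `gitlab` namespace) are assumptions based on standard kube-state-metrics/cAdvisor conventions and would need to be checked against our actual Prometheus setup:

```
# Count only Pods currently in phase Running:
sum(kube_pod_status_phase{namespace="gitlab", phase="Running"})

# Bonus: count of Failed Pods (evicted Pods land in phase Failed),
# chartable as a panel to watch the eviction rate over time:
sum(kube_pod_status_phase{namespace="gitlab", phase="Failed"})

# Restrict a resource metric to running Pods via a vector match on the
# pod label (assumes the container metric carries a matching `pod` label):
sum by (pod) (
  container_memory_working_set_bytes{namespace="gitlab"}
  * on (pod) group_left()
  (kube_pod_status_phase{namespace="gitlab", phase="Running"} == 1)
)
```

The same join pattern could be applied to the saturation queries mentioned above, so any recording rules built on per-Pod metrics only aggregate over live Pods.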