Problems with prometheus-01.us-east1-c.gce.gitlab-runners.gitlab.net at 2018-06-11
Timeline:
- at ~20:07 UTC the first alert is fired in #ci-cd-alerts
- at ~20:41 UTC the first alert is fired in #alerts; PagerDuty is triggered
In DO `nyc1`, GCP `us-east1-c`, and GCP `us-east1-d` we have monitoring clusters prepared to track the usage of resources on the autoscaled machines for CI. Each of the clusters is built from:
- `consul-01`..`consul-03` nodes, which form a Consul cluster,
- a `prometheus-01` node, which tracks the metrics,
- Consul agents and Prometheus exporters on the autoscaled machines.
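
For context, a minimal sketch of what the scrape configuration on `prometheus-01` might look like for such a Consul-backed setup; the job name, the Consul address, and the `node-exporter` service name are illustrative assumptions, not taken from our actual configuration:

```yaml
# Hypothetical sketch of Consul-based service discovery in prometheus.yml;
# job and service names are assumed, not copied from the real config.
scrape_configs:
  - job_name: 'runner-machines'
    consul_sd_configs:
      # Talk to the local Consul agent, which is joined to consul-01..consul-03.
      - server: 'localhost:8500'
        services: ['node-exporter']
    relabel_configs:
      # Use the Consul node name as the instance label instead of IP:port.
      - source_labels: [__meta_consul_node]
        target_label: instance
```

With this kind of setup, every autoscaled machine that registers its exporter service in Consul is picked up by Prometheus automatically, which is why the number of scrape targets follows the number of machines.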
We're using both GCP zones more or less equally (we're not prioritizing either of them in the configuration of the CI infrastructure).
The Prometheus server in `us-east1-d` is working fine, and only `prometheus-01.us-east1-c.gce.gitlab-runners.gitlab.net` seems to have problems.
On the graphs I see an increased number of connections on `prometheus-01.us-east1-c.gce` (in comparison to `prometheus-01.us-east1-d.gce`), which seems to be the source of the increased number of samples and appenders in the Prometheus metrics (what is an appender?), which in turn seems to be the reason for the huge RAM usage spike that OOMs the process.
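
For reference, these are the kinds of queries that can be run against both servers to make the comparison concrete; this assumes Prometheus 2.x and node_exporter's netstat collector, and the exact metric names may differ on the version we actually run:

```
# Samples appended to the TSDB head per second (Prometheus 2.x).
rate(prometheus_tsdb_head_samples_appended_total[5m])

# Number of series currently held in memory in the head block.
prometheus_tsdb_head_series

# Resident memory of the Prometheus process (the OOM candidate).
process_resident_memory_bytes{job="prometheus"}  # the job label is an assumption

# Established TCP connections on the host, from node_exporter.
node_netstat_Tcp_CurrEstab
```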
But looking at the number of machines created for CI in GCP (a cross-check sketch follows this list):
- there is no spike in usage that could explain what's happening; the last 7 days of usage look quite stable (with lower usage over the weekend),
- especially, there is no visible difference between the usage of `us-east1-c` and `us-east1-d` that could explain why only `prometheus-01.us-east1-c.gce` has problems.
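
One way to cross-check the per-zone machine counts independently of the graphs; this assumes the `gcloud` CLI is configured for the right project and that the autoscaled machines share a common name prefix (`runner-` here is a made-up example):

```sh
# Count CI machines per zone; the "runner-" prefix is hypothetical.
for zone in us-east1-c us-east1-d; do
  count=$(gcloud compute instances list \
    --filter="zone:${zone} AND name~^runner-" \
    --format="value(name)" | wc -l)
  echo "${zone}: ${count} machines"
done
```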
Related metrics (comparing `prometheus-01.us-east1-c.gce` and `prometheus-01.us-east1-d.gce`):
(Screenshots attached: three pairs of graphs, each comparing `us-east1-c` with `us-east1-d`.)