Nomad-exporter fails due to allocation metric collision
Info
- nomad-exporter version: 0.3.0
- Nomad version: 0.8.7
- Cluster size: 35 Nomad nodes
- Node resources:
  - 147200 MHz of CPU
  - 63 GiB of Memory
- Job spec: my-heavy-job
  - 25 replicas
  - 4000 MHz CPU
  - 6.14 GiB Memory
Full HTTP response
# curl -v 10.0.0.177:27601/metrics
* Hostname was NOT found in DNS cache
* Trying 10.0.0.177...
* Connected to 10.0.0.177 (10.0.0.177) port 27601 (#0)
> GET /metrics HTTP/1.1
> User-Agent: curl/7.35.0
> Host: 10.0.0.177:27601
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 10 Mar 2020 15:48:12 GMT
< Transfer-Encoding: chunked
<
An error has occurred during metrics collection:
11 error(s) occurred:
* collected metric nomad_allocation_cpu_percent label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:78.77808730129948 > was collected before with the same name and label values
* collected metric nomad_allocation_cpu_throttle_time label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:0 > was collected before with the same name and label values
* collected metric nomad_allocation_memory_rss_bytes label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:4.109185024e+09 > was collected before with the same name and label values
* collected metric nomad_allocation_cpu_ticks label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:1811.8960079298881 > was collected before with the same name and label values
* collected metric nomad_allocation_cpu_user_mode label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:5263.838494219879 > was collected before with the same name and label values
* collected metric nomad_allocation_cpu_system_mode label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:1021.3417973859466 > was collected before with the same name and label values
* collected metric nomad_allocation_memory_rss_required_bytes label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:6.442450944e+09 > was collected before with the same name and label values
* collected metric nomad_allocation_cpu_required label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:4000 > was collected before with the same name and label values
* collected metric nomad_task_cpu_percent label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > label:<name:"task" value:"my-heavy-job" > gauge:<value:78.77808730129948 > was collected before with the same name and label values
* collected metric nomad_task_cpu_total_ticks label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > label:<name:"task" value:"my-heavy-job" > gauge:<value:1811.8960079298881 > was collected before with the same name and label values
* collected metric nomad_task_memory_rss_bytes label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > label:<name:"task" value:"my-heavy-job" > gauge:<value:4.109185024e+09 > was collected before with the same name and label values
* Connection #0 to host 10.0.0.177 left intact
Note: job, node and dc have been sanitised with dummy values.
Context
Here's a scenario similar to https://github.com/pcarranza/nomad-exporter/issues/18, where nomad-exporter scrapes return 500 errors.
The reason is that the exporter is trying to collect metrics from two allocations with the same labels:
alloc = my-heavy-job.my-heavy-job[0]
datacenter = my-dc
group = my-heavy-job
job = my-heavy-job
job_version = 941
node = nomad-node4000-71fj
region = global
I'm not sure about the root cause of this. nomad-exporter checks whether the allocation is running, so Nomad shouldn't be running two allocations with the same name, of the same job_version, on the same node...
Any idea or explanation is welcome @pcarranza @estromboliano
Proposal
Tolerate this kind of collision with a controlled failure instead of total downtime. Perhaps skip the allocation when the exporter tries to collect one that is already registered, which can be checked against its label values.
Take into account that including some kind of alloc_id in the labels would provoke a cardinality explosion, so I would not recommend that approach.