Nomad-exporter fails due to allocation metric collision
Info
- nomad-exporter version: 0.3.0
- Nomad version: 0.8.7
- Cluster size: 35 Nomad nodes
- Node resources:
  - 147200 MHz of CPU
  - 63 GiB of Memory
- Job spec: my-heavy-job
  - 25 replicas
  - 4000 MHz CPU
  - 6.14 GiB Memory
Full HTTP response
# curl -v 10.0.0.177:27601/metrics
* Hostname was NOT found in DNS cache
* Trying 10.0.0.177...
* Connected to 10.0.0.177 (10.0.0.177) port 27601 (#0)
> GET /metrics HTTP/1.1
> User-Agent: curl/7.35.0
> Host: 10.0.0.177:27601
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 10 Mar 2020 15:48:12 GMT
< Transfer-Encoding: chunked
<
An error has occurred during metrics collection:
11 error(s) occurred:
* collected metric nomad_allocation_cpu_percent label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:78.77808730129948 > was collected before with the same name and label values
* collected metric nomad_allocation_cpu_throttle_time label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:0 > was collected before with the same name and label values
* collected metric nomad_allocation_memory_rss_bytes label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:4.109185024e+09 > was collected before with the same name and label values
* collected metric nomad_allocation_cpu_ticks label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:1811.8960079298881 > was collected before with the same name and label values
* collected metric nomad_allocation_cpu_user_mode label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:5263.838494219879 > was collected before with the same name and label values
* collected metric nomad_allocation_cpu_system_mode label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:1021.3417973859466 > was collected before with the same name and label values
* collected metric nomad_allocation_memory_rss_required_bytes label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:6.442450944e+09 > was collected before with the same name and label values
* collected metric nomad_allocation_cpu_required label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > gauge:<value:4000 > was collected before with the same name and label values
* collected metric nomad_task_cpu_percent label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > label:<name:"task" value:"my-heavy-job" > gauge:<value:78.77808730129948 > was collected before with the same name and label values
* collected metric nomad_task_cpu_total_ticks label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > label:<name:"task" value:"my-heavy-job" > gauge:<value:1811.8960079298881 > was collected before with the same name and label values
* collected metric nomad_task_memory_rss_bytes label:<name:"alloc" value:"my-heavy-job.my-heavy-job[0]" > label:<name:"datacenter" value:"my-dc" > label:<name:"group" value:"my-heavy-job" > label:<name:"job" value:"my-heavy-job" > label:<name:"job_version" value:"941" > label:<name:"node" value:"nomad-node4000-71fj" > label:<name:"region" value:"global" > label:<name:"task" value:"my-heavy-job" > gauge:<value:4.109185024e+09 > was collected before with the same name and label values
* Connection #0 to host 10.0.0.177 left intact
Note: job, node and dc have been sanitised with dummy values.
Context
Here's a scenario similar to https://github.com/pcarranza/nomad-exporter/issues/18, where nomad-exporter scrapes return 500 errors.
The reason is that the exporter is trying to collect metrics from two allocations with the same labels:
alloc = my-heavy-job.my-heavy-job[0]
datacenter = my-dc
group = my-heavy-job
job = my-heavy-job
job_version = 941
node = nomad-node4000-71fj
region = global
I'm not sure about the root cause of this. nomad-exporter checks whether the allocation is running, so Nomad shouldn't be running two allocations with the same name, of the same job_version, on the same node...
Any idea or explanation is welcome @pcarranza @estromboliano
Proposal
Tolerate this kind of collision with a controlled failure instead of total downtime. Perhaps skip the allocation when the exporter tries to collect one that is already registered, which can be checked against its label values.
Take into account that including some kind of alloc_id in the labels would provoke a cardinality explosion, so I would not recommend that approach.