Add last update at gauge

Andreas Brandl requested to merge ab/batched-migration-metrics into master

What does this MR do?

We report gauge metrics for batched migrations from the Sidekiq jobs. These jobs execute on different hosts over time, and each of those hosts keeps reporting the latest gauge value it saw for a while after its last job ran.

This leads to confusing situations where different hosts report different values for the same gauge at the same time:

Example: https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=0s&g0.expr=batched_migration_job_batch_size%7Benv%3D%22gprd%22%7D&g0.tab=1

This change adds a Unix timestamp gauge with the same set of labels as the other gauges. We intend to use it to determine which host reported most recently, and therefore which value counts as the "latest" one we should be looking at. The query looks roughly like this (and might become a recording rule):

batched_migration_job_batch_size{}
* on(job, instance, migration_identifier) group_left()
group by (job, instance, migration_identifier) (topk by (migration_identifier) (1, batched_migration_job_last_update_time_seconds{env="gprd", migration_identifier="CopyColumnUsingBackgroundMigrationJob/ci_builds.id"}))
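
To make the selection logic of the query easier to follow, here is a minimal Python sketch of what it is meant to compute. It is an illustration only and not part of this MR: the `Sample` class, the `latest_values` helper, the host names, and the sample values are all hypothetical. It assumes each host exposes one batch-size sample and one last-update timestamp sample with matching labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One gauge sample as scraped from a single host (hypothetical shape)."""
    job: str
    instance: str
    migration_identifier: str
    value: float


def latest_values(batch_sizes, last_updates):
    """Keep only the batch-size samples from the most recently updated host."""
    # Equivalent of: topk by (migration_identifier) (1, last_update_time_seconds)
    freshest = {}
    for s in last_updates:
        current = freshest.get(s.migration_identifier)
        if current is None or s.value > current.value:
            freshest[s.migration_identifier] = s

    # Equivalent of: group by (...) followed by the join on
    # (job, instance, migration_identifier). The grouped series all have
    # value 1, so the multiplication keeps the gauge value unchanged and
    # drops every series without a match.
    keep = {(s.job, s.instance, s.migration_identifier) for s in freshest.values()}
    return [
        s for s in batch_sizes
        if (s.job, s.instance, s.migration_identifier) in keep
    ]


MIGRATION = "CopyColumnUsingBackgroundMigrationJob/ci_builds.id"

# Made-up data: host-a still reports a stale value, host-b ran the latest job.
batch_sizes = [
    Sample("sidekiq", "host-a", MIGRATION, 1000.0),
    Sample("sidekiq", "host-b", MIGRATION, 2500.0),
]
last_updates = [
    Sample("sidekiq", "host-a", MIGRATION, 1_620_000_000.0),
    Sample("sidekiq", "host-b", MIGRATION, 1_620_000_600.0),
]

result = latest_values(batch_sizes, last_updates)
```

With this made-up data, only the sample from `host-b` survives, because that host has the larger last-update timestamp. The stale value from `host-a` is filtered out by the join rather than averaged or maxed away, which is the property we want from the real query.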

This follows a suggestion from @andrewn on how to deal with this situation. An alternative we discussed was moving these metrics over to gitlab-exporter, which can easily report the same gauges with a simple database query. We might end up doing that, but I would like to explore the option at hand first.

Does this MR meet the acceptance criteria?

Conformity

Edited by Andreas Brandl