Gather grid's internal state metrics

Context

An initial set of monitoring information has been discussed on the mailing list. The internal-state metrics identified there are mostly counters and gauges, to be collected periodically and transmitted to one or more external monitoring systems. BuildGrid's internals will need some work to expose them.

Task Description

The initial set of metrics should be:

  • From the REAPI execution service:
    • Total number of clients connected (watching an Operation stream).
    • Number of clients connected by instance (watching an Operation stream).
  • From the RWAPI bots interface:
    • Total number of bots connected (with an active BotSession).
    • Number of connected bots by instance (with an active BotSession).
    • Number of connected bots by BotStatus (with an active BotSession).
  • From the internal scheduler:
    • Total number of jobs currently active.
    • Total number of operations currently active.
    • Number of operations by OperationStage currently active.
    • Total number of leases currently emitted.
    • Number of leases by LeaseState currently emitted.
    • Number of job retries, as a delta.
    • Total number of job retries per error type (to be defined).
    • Overall average queue time for jobs.
    • Average queue time for jobs by priority (to be defined).
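
As a rough illustration of how the scheduler-side metrics above could be held in memory, here is a minimal sketch. The class `SchedulerMetrics` and all of its attribute and method names are hypothetical and are not part of BuildGrid. Stages and states are keyed by plain strings here; a real implementation would more likely key them by the `OperationStage` and `LeaseState` enum values.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List, Optional


@dataclass
class SchedulerMetrics:
    """Hypothetical in-memory snapshot of the scheduler's internal state."""

    active_jobs: int = 0
    # Gauges, keyed by stage/state name (assumption: plain strings).
    operations_by_stage: Counter = field(default_factory=Counter)
    leases_by_state: Counter = field(default_factory=Counter)
    # Monotonic retry counter, plus the value last reported as a delta.
    retries_total: int = 0
    _retries_reported: int = field(default=0, repr=False)
    # Queue times of the jobs observed so far.
    queue_times: List[timedelta] = field(default_factory=list)

    @property
    def active_operations(self) -> int:
        """Total number of operations, summed over all stages."""
        return sum(self.operations_by_stage.values())

    @property
    def active_leases(self) -> int:
        """Total number of leases, summed over all states."""
        return sum(self.leases_by_state.values())

    def take_retries_delta(self) -> int:
        """Return the number of retries since the previous call."""
        delta = self.retries_total - self._retries_reported
        self._retries_reported = self.retries_total
        return delta

    def average_queue_time(self) -> Optional[timedelta]:
        """Overall average queue time, or None if no job was observed."""
        if not self.queue_times:
            return None
        return sum(self.queue_times, timedelta()) / len(self.queue_times)
```

Keeping the per-stage and per-state counts as the source of truth and deriving the totals from them avoids the two drifting apart. The per-error-type and per-priority breakdowns are left out because they are still to be defined.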

Acceptance Criteria

Internal-state metrics are gathered periodically and logged to standard output.
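
One possible shape for this criterion is sketched below, assuming an `asyncio` event loop and the standard `logging` module. The function names, the `gather` callback, and the `key=value` line format are all assumptions made for illustration, not an agreed design.

```python
import asyncio
import logging
import sys
from typing import Callable, Dict, Optional


def make_stdout_logger(name: str = "metrics") -> logging.Logger:
    """Build a logger that writes INFO records to standard output."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        logger.addHandler(logging.StreamHandler(sys.stdout))
    return logger


async def log_metrics_periodically(
    gather: Callable[[], Dict[str, object]],
    interval: float,
    logger: logging.Logger,
    iterations: Optional[int] = None,
) -> None:
    """Call gather() every `interval` seconds and log the result.

    `gather` is a hypothetical callback returning a flat mapping of
    metric names to values. `iterations` bounds the loop (useful in
    tests); None means run until the task is cancelled.
    """
    count = 0
    while iterations is None or count < iterations:
        snapshot = gather()
        # One line per snapshot, with keys sorted for stable output.
        line = " ".join(f"{key}={value}" for key, value in sorted(snapshot.items()))
        logger.info("metrics %s", line)
        count += 1
        await asyncio.sleep(interval)
```

For example, `asyncio.run(log_metrics_periodically(gather, 30.0, make_stdout_logger()))` would log one line every 30 seconds, where `gather` is whatever function assembles the current values.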

Edited by Santiago Gil