Gather grid's internal state metrics
Context
An initial set of monitoring information has been discussed on the mailing list. The internal state metrics identified are mostly counters and gauges to be collected periodically and transmitted to external monitoring system(s). BuildGrid's internals will need some work to produce these.
Task Description
The initial set of metrics should be:
- From the REAPI execution service:
  - Total number of clients connected (watching an Operation stream).
  - Number of clients connected by instance (watching an Operation stream).
- From the RWAPI bots interface:
  - Total number of bots connected (with an active BotSession).
  - Number of connected bots by instance (with an active BotSession).
  - Number of connected bots by BotStatus (with an active BotSession).
- From the internal scheduler:
  - Total number of jobs currently active.
  - Total number of operations currently active.
  - Number of operations by OperationStage currently active.
  - Total number of leases currently emitted.
  - Number of leases by LeaseState currently emitted.
  - Number of job retries, as a delta.
  - Total number of job retries per error type (to be defined).
  - Overall average queue time for jobs.
  - Average queue time for jobs by priorities (to be defined).
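As a rough illustration of how the scheduler metrics above could be held in memory, here is a minimal sketch. The `SchedulerMetrics` class and all of its field and method names are hypothetical and do not come from BuildGrid; stages and states are represented by plain string labels for simplicity. It shows per-label gauges, a retry counter reported as a delta, and an average queue time.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List, Optional


@dataclass
class SchedulerMetrics:
    """Hypothetical container for scheduler metrics (not BuildGrid's API)."""

    # Gauges: current counts keyed by a label (stage or state name).
    operations_by_stage: Counter = field(default_factory=Counter)
    leases_by_state: Counter = field(default_factory=Counter)
    # Counter: monotonically increasing total of job retries.
    retries_total: int = 0
    # Queue times of jobs that have left the queue, used for the average.
    queue_times: List[timedelta] = field(default_factory=list)
    # Value of retries_total at the previous collection, used for the delta.
    _retries_last_reported: int = 0

    def total_operations(self) -> int:
        """Total number of active operations across all stages."""
        return sum(self.operations_by_stage.values())

    def total_leases(self) -> int:
        """Total number of emitted leases across all states."""
        return sum(self.leases_by_state.values())

    def retries_delta(self) -> int:
        """Return retries since the last call, then move the baseline."""
        delta = self.retries_total - self._retries_last_reported
        self._retries_last_reported = self.retries_total
        return delta

    def average_queue_time(self) -> Optional[timedelta]:
        """Overall average queue time, or None if nothing was recorded."""
        if not self.queue_times:
            return None
        return sum(self.queue_times, timedelta()) / len(self.queue_times)


# Example usage with made-up values.
metrics = SchedulerMetrics()
metrics.operations_by_stage["QUEUED"] += 2
metrics.operations_by_stage["EXECUTING"] += 1
metrics.retries_total += 3
metrics.queue_times += [timedelta(seconds=2), timedelta(seconds=4)]
```

Keeping the per-label counts as the source of truth and deriving totals from them avoids the two drifting apart; whether that suits the real scheduler is an open design question.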
Acceptance Criteria
Internal-state metrics are gathered periodically and logged to standard output.
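One possible shape for this criterion is sketched below, assuming a caller-supplied `gather` function that returns a flat mapping of metric names to numbers. The function names, the JSON line format, and the metric name `jobs-active` are all assumptions made for illustration and are not part of BuildGrid.

```python
import json
import sys
import threading
from typing import Callable, Dict, TextIO


def format_metrics(metrics: Dict[str, float]) -> str:
    """Render one snapshot as a single JSON line with sorted keys."""
    return json.dumps(metrics, sort_keys=True)


def log_metrics_periodically(
    gather: Callable[[], Dict[str, float]],
    interval: float,
    stop_event: threading.Event,
    stream: TextIO = sys.stdout,
) -> None:
    """Gather and log a snapshot every `interval` seconds until stopped."""
    while not stop_event.is_set():
        print(format_metrics(gather()), file=stream, flush=True)
        # wait() returns early when the event is set, so shutdown is prompt.
        stop_event.wait(interval)


def start_metrics_logger(
    gather: Callable[[], Dict[str, float]], interval: float
) -> threading.Event:
    """Run the logging loop in a daemon thread; set the event to stop it."""
    stop_event = threading.Event()
    thread = threading.Thread(
        target=log_metrics_periodically,
        args=(gather, interval, stop_event),
        daemon=True,
    )
    thread.start()
    return stop_event
```

A caller would start it with something like `stop = start_metrics_logger(lambda: {"jobs-active": 0}, 30.0)` and call `stop.set()` on shutdown.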
Edited by Santiago Gil