Skip to content

Turn MonitoringBus._streaming_worker into a subprocess

Adam Coldrick requested to merge sotk/monitoring/remove-metrics-coroutine into master

Before raising this MR, consider whether the following are required, and complete if so:

  • Unit tests
  • Metrics
  • Documentation update(s)

If not required, please explain in brief why not.

Description

This method is currently a coroutine which consumes from a Janus queue containing all the metrics that we send to the monitoring bus. The coroutine is run in the main event loop, and therefore competes for processor time with the other coroutines in BuildGrid.

These other coroutines are the log writer, the state metrics generator (in BuildGrids with at least one Scheduler configured), and the coroutines inside the Janus queues used for metrics and logging.

When BuildGrid is producing metrics rapidly, this contention becomes increasingly noticeable, and logs start to exhibit latency.

When BuildGrid produces metrics fast enough, __streaming_worker and the coroutines inside the related Janus queue effectively monopolise the main event loop, leading to a situation where no logs are emitted by BuildGrid.

In this situation, the size of the log message queue grows unbounded. Assuming the load producing the metrics doesn't go away fast, the queue used by __streaming_worker will also grow unbounded. Under high load, this leads to rapid growth in the memory footprint of BuildGrid, and is unrecoverable until the load stops.

The throughput of the Janus queue is also not great, exacerbating this issue by slowing down the rate at which BuildGrid can pull metrics out of the queue.

This MR updates __streaming_worker to be run in a subprocess, rather than as a coroutine. This avoids the event loop contention and fixes the problem of log messages getting backed up and never written when experiencing high load.

A previous version of this MR used a thread rather than a subprocess. That version solved the logging issue, and somewhat improved metric publishing performance, but still didn't solve the issue of metrics getting backlogged at reasonable levels of load.

Switching to a subprocess greatly improves the throughput as it avoids the GIL contention which led to the poor performance in earlier versions, where the metrics thread was competing with all the gRPC handlers for CPU time. This meant that when the metrics writer needed to be faster, it actually had less and less opportunity to run.

There are also some minor fixes to some compiled protos in this MR, to allow them to be pickled and unpickled correctly.

Validation

Run BuildGrid using one of the docker-compose setups, expose it to high load and watch the logging slow down and eventually stop. Then apply this MR and watch the logging not slow down.

Edited by Adam Coldrick

Merge request reports