Make logging_worker and state_monitoring_worker resilient to exceptions (!589) · Merge requests · BuildGrid / buildgrid

Jeremiah Bonney requested to merge jbonney4/state-metric-thread into master Dec 23, 2020

At server startup, we create an asyncio task for logging/metric publishing, as well as one for periodic metrics if configured. These tasks are expected to run continuously until the server exits, but if an exception is raised inside one of them, the task exits silently. This means that log records stop being written, or the periodic metrics stop being published, with no easy way to tell why. I've personally seen this with the periodic metric task, which exited because a database was unavailable for a few moments.

This MR wraps these tasks in try/except Exception blocks, and logs any exceptions (other than the expected asyncio.CancelledError) at a high severity to make them visible.
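
The wrapping described above could look roughly like the following. This is a minimal sketch and not the code from this MR: `resilient_worker`, `step`, `interval`, and `demo` are hypothetical names, and it assumes the try block sits inside the loop so that the worker keeps running after a failure.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def resilient_worker(step, interval=0.01):
    """Call `step` repeatedly, logging unexpected exceptions instead of exiting.

    Hypothetical helper for illustration; not BuildGrid's actual worker.
    """
    while True:
        try:
            await step()
        except asyncio.CancelledError:
            # Expected at shutdown: re-raise so the task really stops.
            # (On Python 3.7 CancelledError subclasses Exception, so this
            # clause must come before the generic handler.)
            raise
        except Exception:
            # Log with traceback at ERROR severity, then keep looping.
            logger.exception("Worker step failed; continuing")
        await asyncio.sleep(interval)


async def demo():
    calls = []

    async def flaky_step():
        # Fails on the first call only, like a briefly unavailable database.
        calls.append(len(calls))
        if len(calls) == 1:
            raise RuntimeError("database unavailable")

    task = asyncio.create_task(resilient_worker(flaky_step))
    await asyncio.sleep(0.1)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(calls), task.cancelled()
```

Running `asyncio.run(demo())` shows the worker surviving the first failure and still shutting down cleanly when cancelled. Whether to keep looping or to log and exit after a failure is a design choice; the sketch keeps looping, on the assumption that transient errors should not stop logging or metrics for the rest of the server's lifetime.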

I tried writing some tests for this, but I had trouble extending the server_instance.py tests to cover it, so I'm putting it up as is.

