
Draft: Make `StatsdMetricPublisher` more fault tolerant

Zehao Chen requested to merge zchen723/metric-fault-tolerance into main

Summary

This MR aims to make StatsdMetricPublisher more fault tolerant.

The issue is that aio-statsd, the dependency of this class, raises ConnectionError even if the statsd port is only temporarily unavailable. Its internal future is then closed, so the client cannot recover from the error on the next retry.

Ultimately, we don't want errors from metric publishing to propagate to the application logic.

Changes

  • Retry metric_publisher.connect if it's closed due to previous errors.
  • Ignore ConnectionError from aio-statsd library. This is safe as long as the application logic doesn't raise this type itself.
  • Unfortunately, retrying connect means the interfaces have to become async, which is not backward compatible, hence the minor version bump.
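
A minimal sketch of the first two changes, for illustration only. FakeStatsdClient is a hypothetical stand-in, not the aio-statsd API, and the names is_closed, connect, counter, FaultTolerantPublisher and publish_count are assumptions rather than the code in this MR.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FakeStatsdClient:
    """Hypothetical stand-in for the statsd client; NOT the aio-statsd API."""

    def __init__(self, fail_first_n_connects: int = 0) -> None:
        self.is_closed = True
        self.sent = []
        self._failures_left = fail_first_n_connects

    async def connect(self) -> None:
        # Simulate a statsd port that is temporarily unavailable.
        if self._failures_left > 0:
            self._failures_left -= 1
            raise ConnectionError("statsd port unavailable")
        self.is_closed = False

    def counter(self, name: str, value: int) -> None:
        if self.is_closed:
            raise ConnectionError("client is closed")
        self.sent.append(f"{name}:{value}|c")


class FaultTolerantPublisher:
    """Reconnects when the client is closed and swallows ConnectionError."""

    def __init__(self, client) -> None:
        self._client = client

    async def publish_count(self, name: str, value: int = 1) -> bool:
        try:
            # Retry connect if a previous error left the client closed.
            if self._client.is_closed:
                await self._client.connect()
            self._client.counter(name, value)
            return True
        except ConnectionError as exc:
            # Drop the metric instead of failing the caller.
            logger.warning("dropping metric %s: %s", name, exc)
            return False
```

Because connect is awaited inside publish_count, the publishing method itself has to be async, which is where the interface change comes from.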

Alternative solution

We could also replace aio-statsd with pystatsd, which seems to be more actively maintained. However, its interfaces are synchronous, so we would have to wrap calls with asyncio.to_thread, which might be a performance concern.
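
As a rough sketch of what that wrapping could look like: FakeSyncStatsdClient and its incr method are hypothetical placeholders for a blocking client, not the pystatsd API. Only asyncio.to_thread (Python 3.9+) is a real call here.

```python
import asyncio
import time


class FakeSyncStatsdClient:
    """Hypothetical blocking client; stands in for a sync statsd library."""

    def __init__(self) -> None:
        self.sent = []

    def incr(self, name: str, value: int = 1) -> None:
        time.sleep(0.01)  # simulate blocking socket I/O
        self.sent.append((name, value))


async def publish_count(client, name: str, value: int = 1) -> None:
    # Run the blocking call in the default thread pool so the event loop
    # stays responsive while the metric is being sent.
    await asyncio.to_thread(client.incr, name, value)


async def main():
    client = FakeSyncStatsdClient()
    await asyncio.gather(*(publish_count(client, "requests") for _ in range(5)))
    return client.sent
```

Every metric call is handed off to a worker thread, so the per-call overhead and the size of the default executor are the likely sources of the performance concern.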

Edited by Zehao Chen
