worker fails to connect to rabbitmq - OSError: [Errno 98] Address already in use
Today we had an outage of production Artemis:
https://status.testing-farm.io/issues/2021-12-21-redhat-ranch-artemis-outage/
Sentry:
That is where I realized it is the same problem I have been hitting on the busier public ranch deployment, so it is time to finally report it.
When the problem hits, the worker fails to connect to RabbitMQ (I think) with this error:
File "/tmp/.cache/pypoetry/virtualenvs/tft-artemis-XM7e6MJt-py3.7/lib/python3.7/site-packages/pika/compat.py", line 242, in _nonblocking_socketpair
lsock.bind((host, 0))
OSError: [Errno 98] Address already in use
After that, no more tasks are dispatched. Once the worker deployment is restarted, things resume nicely.
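For reference, the failing line binds a socket to port 0, which asks the kernel to pick any free ephemeral port. Getting `Address already in use` on such a bind may mean the worker has run out of ephemeral ports, but that is only my guess and is not confirmed by the logs. Below is a minimal sketch of a check that could be run inside the worker container while the problem is happening. The helper name `can_bind_ephemeral_port` is made up for illustration, and the `127.0.0.1` host is my assumption about what pika binds to.

```python
import socket


def can_bind_ephemeral_port(host="127.0.0.1"):
    """Try the same kind of bind as the failing line: (host, 0).

    Returns (True, port) when the kernel hands out an ephemeral port,
    or (False, errno) when the bind fails, e.g. with EADDRINUSE.
    The host default is an assumption, not taken from the worker config.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 means "let the kernel choose a free port".
        sock.bind((host, 0))
        return True, sock.getsockname()[1]
    except OSError as exc:
        return False, exc.errno
    finally:
        # Always release the socket so the check itself does not leak ports.
        sock.close()


if __name__ == "__main__":
    ok, value = can_bind_ephemeral_port()
    if ok:
        print("bind succeeded, kernel assigned port", value)
    else:
        print("bind failed with errno", value)
```

If this check also fails on an affected worker but succeeds right after a restart, that would suggest sockets are piling up in the worker process over time.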
Metrics when it happened:
I copied the full log from the worker; it can be found here (note that it is fairly large): TBD
I am attaching the last 10k lines to this request for reference. log-10k.gz