Worker fails to connect to RabbitMQ - OSError: [Errno 98] Address already in use

Today we had an outage of production Artemis:

https://status.testing-farm.io/issues/2021-12-21-redhat-ranch-artemis-outage/

Sentry:

https://sentry.engineering.redhat.com/baseos/artemis-production/issues/1401472/?query=is%3Aunresolved

While looking into it, I realized this is the same problem I have been hitting on the busier public ranch deployment, so it is time to finally report it.

When the problem hits, the worker fails to connect to RabbitMQ (I think) with the error:

File "/tmp/.cache/pypoetry/virtualenvs/tft-artemis-XM7e6MJt-py3.7/lib/python3.7/site-packages/pika/compat.py", line 242, in _nonblocking_socketpair
    lsock.bind((host, 0))
OSError: [Errno 98] Address already in use

After that, no more tasks are dispatched. Restarting the worker deployment makes everything resume normally.
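
For context, here is a minimal sketch of what the failing frame appears to do, plus a probe that could be run inside an affected worker pod. It assumes pika emulates a socket pair over a loopback TCP connection by binding to port 0, which is what the `lsock.bind((host, 0))` line in the traceback suggests. On Linux, EADDRINUSE on a bind to port 0 means no ephemeral port was free, so one possible cause is sockets leaking in the long-running worker; that is a hypothesis, not something confirmed from the logs. The function names `loopback_socketpair` and `probe_ephemeral_bind` are made up for illustration and are not part of pika or Artemis.

```python
import socket


def loopback_socketpair(host="127.0.0.1"):
    """Emulate socket.socketpair() with a loopback TCP connection.

    Hypothetical re-creation of the pattern seen in the traceback,
    not a copy of pika's code.
    """
    lsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the kernel for any free ephemeral port. This is the
        # kind of call that fails with errno 98 in the traceback.
        lsock.bind((host, 0))
        lsock.listen(1)
        addr, port = lsock.getsockname()[:2]

        csock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            csock.connect((addr, port))
            ssock, _ = lsock.accept()
        except Exception:
            csock.close()
            raise
    finally:
        # The listening socket is only needed to establish the pair.
        lsock.close()
    return ssock, csock


def probe_ephemeral_bind(host="127.0.0.1"):
    """Return None if binding to port 0 works, otherwise the errno."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, 0))
        return None
    except OSError as exc:
        return exc.errno


if __name__ == "__main__":
    server_side, client_side = loopback_socketpair()
    client_side.sendall(b"ping")
    print("received:", server_side.recv(4))
    server_side.close()
    client_side.close()
    print("bind probe errno:", probe_ephemeral_bind())
```

If the probe returns 98 inside a stuck worker while the same probe succeeds in a freshly restarted one, that would support the port exhaustion theory. Comparing the number of open sockets in both pods would be the next thing to check.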

Metrics when it happened:

http://metrics.osci.redhat.com/d/Y-g67NwGz/artemis?viewPanel=6&orgId=1&from=1640093097184&to=1640102265315

I copied the full log from the worker; it can be found here (note that it is fairly large): TBD

I am attaching the last 10k lines to this request for reference: log-10k.gz