Switcher returns "Address already in use" error randomly
Summary
When starting Switcher again within 1 minute of exiting Switcher, following error occurs randomly.
Error 28 fault detected [no subcode]
"Address already in use"
Detail: bind failed in soap_bind()
This bug appears to be related to gsoap server sometime not releasing TCP socket.
I encounter this bug from time to time on sip-server pipeline. SIP tests are run two times sequentially using Switcher 2.1.38 and 3.0.1 docker images. Sometimes the second SIP test fails to bind to TCP socket.
I also encountered this bug trying to script a SIP load test that creates many calls one after the other on sip-server.
This bug also prevent me from doing a SIP load test on sip-server. I script a SIP call generator to generate calls on after the other. Call generator script randomly get stuck on switcher-ctrl waiting after giving up this error :
SOAP 1.1 fault SOAP-ENV:Server[no subcode]
"Cannot assign requested address"
Detail: connect failed in tcp_connect()
I saw this happen randomly after 1 to 61 iterations.
How to reproduce ?
I can reproduce issue with following script on Switcher 2.1.38, 3.0.1 and develop@afa67e4e.
Script starts Switcher, does 15 queries using switcher-ctrl, then stops Switcher. It loops until switcher-ctrl returns connect failed error then quits.
#!/bin/bash
i=0
while true
do
((i++))
echo "***** Test number $i"
switcher &
PID=$!
echo "Started switcher on PID $PID"
sleep 1
echo "Querying using switcher-ctrl -e 15 times"
if [ -n "$(switcher-ctrl -e 2>&1 | grep "connect failed")" ]
then
echo "* Failed to connect at test number $i"
echo "Verifying socket timers at bind fail error"
netstat -nopa | grep 27182
exit 1
fi
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
switcher-ctrl -e
echo "Sending interrupt signal to PID $PID"
kill -INT $PID
sleep 2
echo "Verifying socket timers after successful test"
netstat -nopa | grep 27182
done
It randomly fails, sometimes after 2 iterations, sometimes after more than 60 iterations.
When failing, netstat always shows that TCP socket for port 127.0.0.1:27182 is in TIME_WAIT state with a 60 seconds timewait timer.
***** Test number 32
Started switcher on PID 40285
Error 28 fault detected [no subcode]
"Address already in use"
Detail: bind failed in soap_bind()
Querying using switcher-ctrl -e 15 times
* Failed to connect at test number 32
Verifying socket timers at bind fail error
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 0 0 127.0.0.1:27182 127.0.0.1:52840 TIME_WAIT - timewait (56.92/0/0)
tcp 0 0 127.0.0.1:52696 127.0.0.1:27182 TIME_WAIT - timewait (41.44/0/0)
tcp 0 0 127.0.0.1:52552 127.0.0.1:27182 TIME_WAIT - timewait (25.96/0/0)
tcp 0 0 127.0.0.1:52840 127.0.0.1:27182 TIME_WAIT - timewait (56.92/0/0)
Waiting until TIME_WAIT expires after 60 seconds allows to start again normally.
Expected behavior
I expect to be able to start and stop Switcher as many time as needed without waiting 60 seconds for soap server TCP socket to reset.
What is the frequency of occurrence of this behavior ?
It occurs randomly, sometime right after launching Switcher for the first time, sometime only after many cycles (60+).
Other comment
I tested modifying switcher-ctrl client to close its soap socket before exiting, but it does not help with server's socket getting stuck in TIME_WAIT.
resetting KEEPALIVE and closing soap socket in src/switcher-ctrl.cpp before line 454:
soap_clr_omode(&switcher_control, SOAP_IO_KEEPALIVE);
switcher_control.soap_close_socket();
This appears to be related to SO_LINGER flag in gsoap. Documentation calls to reduce linger_time to 0 as a remedy.
Please be aware that a closed socket is not immediately reused when a client closes the socket after the server closes the socket, which can lead to sockets in TIME_WAIT state, see tuning systems for high connection rates. Remedies may include setting SO_LINGER linger time:
soap->accept_flags |= SO_LINGER; soap->linger_time = 0;
However, setting SO_LINGER timeout to 0 as a workaround appears to be controversial. Some says one should design its client to initiate the connection close.