T47: sftp_write waits infinitely when connection to target server is lost
Description
Originally reported by migration: https://bugs.libssh.org/T47
In blocking mode sftp//write will wait (poll) infinitely, when connection to target server is lost.
To reproduce you need two computers:
- Transfer a large file from computer A to computer B using sftp//write.
- During the transfer unplug the network cable from computer B. -> computer A will infinitely poll
- After 20 minutes or so plug network cable of computer B -> computer A will still poll infinitely!
The infinite poll is cause by some codelines:
int ssh_channel_read(ssh_channel channel, void **dest, uint32_t count, int is_stderr)
{
return ssh_channel_read_timeout(channel, dest, count, is_stderr, -1);
}
if (timeout < 0) {
timeout = SSH_TIMEOUT_DEFAULT;
}
For more information and a patch please see my mailing list post: http://www.libssh.org/archive/libssh/2016-02/0000006.html
Comments:
cortana63 commented on 2018-11-19 19:56:29 UTC:
This is still occurring as of release 0.8.4. This issue is related: https://bugs.libssh.org/T33
The attached patches fix this issue. There were two problems that needed to be addressed.
The first was that there doesn't seem to be any place that SSH_OPTIONS_TIMEOUT is used as a timeout, except for the initial SSH connection. Because of this, the SFTP object was calling ssh_channel_read() which, by default, uses an infinite timeout.
The second issue was that the ssh_poll() functions would return 0 if there was a disconnect. A return of 0 was registered as SSH_AGAIN, forcing the method to be called endlessly.
Although these patches fix the disconnection bug, I believe there might be a larger issue. It does not look like SSH_OPTIONS_TIMEOUT is used at all, besides for the initial SSH connection. It isn't clear to me if this is by design or not. From reading the documentation, my understanding was that once you set this option, then that will be used by every function that has an optional timeout parameter. Secondly, there is no current way to provide the SFTP connection object with some sort of timeout variable to use. In the case of my solution, I used SSH_OPTIONS_TIMEOUT. However, it would make more sense if there was an equivalent SFTP_OPTIONS_TIMEOUT setting that was used for every function call that has an optional timeout parameter.
0002-channel-read-timeout-fix.patch 0002-channel-read-timeout-fix.patch
asn commented on 2018-11-20 07:13:02 UTC:
Hi Sarah,
thank you very much for your contribution.
In the second patch you have the following if-clause:
if (SSH_OPTIONS_TIMEOUT > 0) {
...
}
As SSH_OPTIONS_TIMEOUT is an enum constant this will always evaluate to true. I think what you want to check is:
sftp->session->opts.timeout
However maybe we should handle that in ssh_channel_read() already correctly and always check if a timeout has been set in the options.
cortana63 commented on 2018-11-20 13:43:29 UTC:
Thanks Andreas. I have attached an updated patch for the sftp.c file using sftp->session->opts.timeout instead of SSH_OPTIONS_TIMEOUT:
0002-channel-read-timeout-fix.patch
Whether or not ssh_channel_read() should check the timeout is a good question. Since there is already also a ssh_channel_read_timeout() function, my thinking is that ssh_channel_read() should always use an infinite timeout.
asn commented on 2018-11-20 14:50:23 UTC:
Sarah,
what do you think about:
cortana63 commented on 2018-11-20 16:02:06 UTC:
Thanks for spending time on this Andreas.
The above commit alone does not fix the issue because ssh_handle_packets_termination() is still being called with an infinite timeout from the ssh_channel_read() function. I would still need to use the two patches I provided in conjunction with it.
asn commented on 2018-11-21 08:05:59 UTC:
Could you please resend the first patch with the requested modifications? I will look into the rest then.
asn commented on 2018-11-21 09:25:58 UTC:
Sarah,
I've looked into the issue again. Returning SSH_ERROR in ssh_poll() is incorrect.
If you look at the manpage from poll, if a fd has been disconnected POLLHUP is set on the revents. We correctly check for this in the bsd_poll() impelmentation. However POLLHUP is not checked in ssh_poll_ctx_dopoll().
Could you please take a look at:
https://git.libssh.org/users/asn/libssh.git/log/?h=master-fix
and test it?
cortana63 commented on 2018-11-21 13:50:17 UTC:
Hi Andreas,
I pulled master-fix and tested it with and without the 0002-channel-read-timeout-fix.patch and am still seeing an issue.
It is getting hung on line 202 of poll.c: rc = select (max_fd + 1, &readfds, &writefds, &exceptfds ptv);
Select is returning 0 and ssh_poll_ctx_dopoll() is not detecting a POLLHUP revent so it returns SSH_AGAIN.
asn commented on 2018-11-21 16:17:04 UTC:
I've pushed two more patches with another idea. Maybe add some logging after getsockopt()
cortana63 commented on 2018-11-21 21:01:53 UTC:
Thanks again for working with me on this issue.
I pulled again and tried building but initially ran into a few errors:
- poll.c needs to include ws2tcpip.h to access socklen_t
- The bsd_socket_disconnected function declaration (line 128) is missing missing an 'e'
- line 258 needed a comma
After fixing these issues, I was able to successfully build. However, when I ran it and disconnected the device, I received a seg fault coming from bsd_socket_disconnect. The errno parameter is invalid.
asn commented on 2018-11-22 07:40:53 UTC:
Ah, sorry. You can't use the name errno for a variable. I've fixed it and pushed it to the repo.
asn commented on 2018-11-23 15:28:46 UTC:
Can you please test: https://git.libssh.org/users/asn/libssh.git/log/?h=master-poll
cortana63 commented on 2018-11-26 15:21:03 UTC:
I pulled the latest master-poll and am no longer seeing a seg fault. However, I'm seeing the same infinite poll issue as before. It happens maybe 75% of the time.
asn commented on 2018-11-27 07:28:41 UTC:
Now we need to find out what doesn't produce a POLLHUP ...
asn commented on 2018-11-27 18:56:51 UTC:
Can you try if it works if you handle: WSAECONNREFUSED and WSAETIMEDOUT in bsd_socket_disconnected() or add a breakpoint there and check what the error is that to and what we are missing to handle there?
cortana63 commented on 2018-11-27 19:36:36 UTC:
It looks like bsd_socket_disconnected() isn't even getting called.
https://git.libssh.org/users/asn/libssh.git/tree/src/poll.c?h=master-poll
Lines 235 and 254 are returning false so bsd_socket_disconnected() never gets called.
asn commented on 2018-11-28 07:53:18 UTC:
I pushed more changes
cortana63 commented on 2018-11-28 19:15:35 UTC:
Hi Andreas,
I pulled the branch and tested again but there is still an infinite loop issue.
Line 677-678- of poll.c:
if ((ctx->pollfds[i].revents & POLLHUP) || (ctx->pollfds[i].revents & POLLERR))
This is returning false. ctx->pollfds[i].revents is 0.
asn commented on 2018-11-29 07:35:04 UTC:
I've pushed an update, however you need to check what WSAGetLastError()
is reporting after the recv()
!
cortana63 commented on 2018-11-29 20:09:14 UTC:
I think we are getting closer to fixing this issue. The timeout is correctly being propagated to the winsock select() function. WSAGetLastError() is returning WSAECONNRESET.
However, inside of the loop, line 282 of poll.c, the FD_ISSET() calls are sometimes returning false. Therefore, bsd_socket_compute_revents() sometimes doesn't get called, resulting in the infinite loop again.
asn commented on 2018-11-30 07:33:27 UTC:
Do you know what the reason is why FD_ISSET() is called, maybe add WSAGetLastError() in that case. Does ssh_poll() return 0 in that case?
cortana63 commented on 2018-12-05 13:44:08 UTC:
It's not clear to me why FD_ISSET() is needed. WSAGetLastError() does correctly return WSAECONNRESET.
Check out the couple of code fragments near the bottom. In the loop, instead of using FD_ISSET, can we just call ret = recv(fd, data, 64, MSG_PEEK); Then if the ret is == -1, WSAGetLastError()?
asn commented on 2018-12-05 14:03:39 UTC:
Well FD_ISSET() checks after select() if the fd is part of the set or not. If not we do not need to inspect it further. If we always call recv() then it doesn't make sense to call select() or poll() at all.
You say that one issue is fixed now and another still exists. What is the remaining issue exactly?
cortana63 commented on 2018-12-05 14:06:31 UTC:
The remaining issue is that the only time bsd_socket_compute_revents is called is when FD_ISSET(fds[i].fd, &readfds) returns true. However, with some instances of the application, FD_ISSET(fds[i].fd, &readfds) will consistently return false, resulting in the infinite loop since bsd_socket_compute_revents doesn't get called.
asn commented on 2018-12-10 09:17:12 UTC:
I've pushed some changes to the master-poll branch. This should correctly return errors.
cortana63 commented on 2018-12-11 20:26:00 UTC (Edited):
Hi Andreas, the issue still persists. Looking at this commit: https://git.libssh.org/users/asn/libssh.git/commit/?h=master-poll&id=917ced39e0c8aaae4d2c22bcdbcb767f6355e62d
The if (rc == 0) check is resulting in true. However, I'm not sure if it should return a 0. A return of 0 is handled as an SSH_AGAIN on line 685 of poll.c, causing the endless loop. Can we instead do something like:
// A timeout occurred and a timeout was provided, return an error
if (rc == 0 && timeout > 0)
{
return -1;
}
// A timeout occurred but no timeout was provided so try again
if (rc == 0 && timeout < 0)
{
return 0;
}
asn commented on 2018-12-13 07:56:54 UTC:
I'm sorry, but returning an error if poll() times out is simply wrong.
With poll() or select() you check if there is data on the fd to read or if all data has been written. If it times out it means we did not receive any data from the server. It would be mean we abort the connection if the connection is on idle!
If we did not detect and error we might not asked to check for data on that fd which got disconnected! However it might be that we do not check for an error in any of the other FD_SET() cases and need to do so.
boguslaw commented on 2019-08-09 10:30:14 UTC:
Hi guys How the issue was ended? Is there working fix for unplug calble during file transfer? I will be grateful for any advice :)