Page MenuHomePhabricator

sftp_write waits infinitely when connection to target server is lost
Open, NormalPublic

Description

In blocking mode sftp//write will wait (poll) infinitely, when connection to target server is lost.

To reproduce you need two computers:

  • Transfer a large file from computer A to computer B using sftp//write.
  • During the transfer unplug the network cable from computer B. -> computer A will infinitely poll
  • After 20 minutes or so plug network cable of computer B -> computer A will still poll infinitely!

The infinite poll is cause by some codelines:

int ssh_channel_read(ssh_channel channel, void **dest, uint32_t count, int is_stderr)
{
    return ssh_channel_read_timeout(channel, dest, count, is_stderr, -1);
}
if (timeout < 0) {
      timeout = SSH_TIMEOUT_DEFAULT;
}

For more information and a patch please see my mailing list post:
http://www.libssh.org/archive/libssh/2016-02/0000006.html

Event Timeline

migration created this object with visibility "Restricted Project (Project)".
migration created this object with edit policy "Restricted Project (Project)".
asn updated the task description. (Show Details)Aug 17 2017, 3:10 PM
asn updated the task description. (Show Details)
asn changed the visibility from "Restricted Project (Project)" to "Public (No Login Required)".

This is still occurring as of release 0.8.4. This issue is related: https://bugs.libssh.org/T33

The attached patches fix this issue. There were two problems that needed to be addressed.

The first was that there doesn't seem to be any place that SSH_OPTIONS_TIMEOUT is used as a timeout, except for the initial SSH connection. Because of this, the SFTP object was calling ssh_channel_read() which, by default, uses an infinite timeout.

The second issue was that the ssh_poll() functions would return 0 if there was a disconnect. A return of 0 was registered as SSH_AGAIN, forcing the method to be called endlessly.

Although these patches fix the disconnection bug, I believe there might be a larger issue. It does not look like SSH_OPTIONS_TIMEOUT is used at all, besides for the initial SSH connection. It isn't clear to me if this is by design or not. From reading the documentation, my understanding was that once you set this option, then that will be used by every function that has an optional timeout parameter. Secondly, there is no current way to provide the SFTP connection object with some sort of timeout variable to use. In the case of my solution, I used SSH_OPTIONS_TIMEOUT. However, it would make more sense if there was an equivalent SFTP_OPTIONS_TIMEOUT setting that was used for every function call that has an optional timeout parameter.


asn added a comment.Tue, Nov 20, 8:13 AM

Hi Sarah,

thank you very much for your contribution.

In the second patch you have the following if-clause:

if (SSH_OPTIONS_TIMEOUT > 0) {
    ...
}

As SSH_OPTIONS_TIMEOUT is an enum constant this will always evaluate to true. I think what you want to check is:

sftp->session->opts.timeout

However maybe we should handle that in ssh_channel_read() already correctly and always check if a timeout has been set in the options.

Thanks Andreas. I have attached an updated patch for the sftp.c file using sftp->session->opts.timeout instead of SSH_OPTIONS_TIMEOUT:

Whether or not ssh_channel_read() should check the timeout is a good question. Since there is already also a ssh_channel_read_timeout() function, my thinking is that ssh_channel_read() should always use an infinite timeout.

Thanks for spending time on this Andreas.

The above commit alone does not fix the issue because ssh_handle_packets_termination() is still being called with an infinite timeout from the ssh_channel_read() function. I would still need to use the two patches I provided in conjunction with it.

asn added a comment.Wed, Nov 21, 9:05 AM

Could you please resend the first patch with the requested modifications? I will look into the rest then.

asn added a comment.Wed, Nov 21, 10:25 AM

Sarah,

I've looked into the issue again. Returning SSH_ERROR in ssh_poll() is incorrect.

If you look at the manpage from poll, if a fd has been disconnected POLLHUP is set on the revents. We correctly check for this in the bsd_poll() impelmentation. However POLLHUP is not checked in ssh_poll_ctx_dopoll().

Could you please take a look at:

https://git.libssh.org/users/asn/libssh.git/log/?h=master-fix

and test it?

Hi Andreas,

I pulled master-fix and tested it with and without the 0002-channel-read-timeout-fix.patch and am still seeing an issue.

It is getting hung on line 202 of poll.c:
rc = select (max_fd + 1, &readfds, &writefds, &exceptfds ptv);

Select is returning 0 and ssh_poll_ctx_dopoll() is not detecting a POLLHUP revent so it returns SSH_AGAIN.

asn added a comment.Wed, Nov 21, 5:17 PM

I've pushed two more patches with another idea. Maybe add some logging after getsockopt()

Thanks again for working with me on this issue.

I pulled again and tried building but initially ran into a few errors:

  • poll.c needs to include ws2tcpip.h to access socklen_t
  • The bsd_socket_disconnected function declaration (line 128) is missing missing an 'e'
  • line 258 needed a comma

After fixing these issues, I was able to successfully build. However, when I ran it and disconnected the device, I received a seg fault coming from bsd_socket_disconnect. The errno parameter is invalid.

asn added a comment.Thu, Nov 22, 8:40 AM

Ah, sorry. You can't use the name errno for a variable. I've fixed it and pushed it to the repo.

I pulled the latest master-poll and am no longer seeing a seg fault. However, I'm seeing the same infinite poll issue as before. It happens maybe 75% of the time.

asn added a comment.Tue, Nov 27, 8:28 AM

Now we need to find out what doesn't produce a POLLHUP ...

asn added a comment.Tue, Nov 27, 7:56 PM

Can you try if it works if you handle: WSAECONNREFUSED and WSAETIMEDOUT in bsd_socket_disconnected() or add a breakpoint there and check what the error is that to and what we are missing to handle there?

It looks like bsd_socket_disconnected() isn't even getting called.

https://git.libssh.org/users/asn/libssh.git/tree/src/poll.c?h=master-poll

Lines 235 and 254 are returning false so bsd_socket_disconnected() never gets called.

asn added a comment.Wed, Nov 28, 8:53 AM

I pushed more changes

Hi Andreas,

I pulled the branch and tested again but there is still an infinite loop issue.

Line 677-678- of poll.c:

if ((ctx->pollfds[i].revents & POLLHUP) || (ctx->pollfds[i].revents & POLLERR))

This is returning false. ctx->pollfds[i].revents is 0.

asn added a comment.Thu, Nov 29, 8:35 AM

I've pushed an update, however you need to check what WSAGetLastError() is reporting after the recv()!

I think we are getting closer to fixing this issue. The timeout is correctly being propagated to the winsock select() function. WSAGetLastError() is returning WSAECONNRESET.

However, inside of the loop, line 282 of poll.c, the FD_ISSET() calls are sometimes returning false. Therefore, bsd_socket_compute_revents() sometimes doesn't get called, resulting in the infinite loop again.

asn added a comment.Fri, Nov 30, 8:33 AM

Do you know what the reason is why FD_ISSET() is called, maybe add WSAGetLastError() in that case. Does ssh_poll() return 0 in that case?

It's not clear to me why FD_ISSET() is needed. WSAGetLastError() does correctly return WSAECONNRESET.

https://docs.microsoft.com/en-us/windows/desktop/winsock/error-codes-errno-h-errno-and-wsagetlasterror-2

Check out the couple of code fragments near the bottom. In the loop, instead of using FD_ISSET, can we just call ret = recv(fd, data, 64, MSG_PEEK); Then if the ret is == -1, WSAGetLastError()?

asn added a comment.Wed, Dec 5, 3:03 PM

Well FD_ISSET() checks after select() if the fd is part of the set or not. If not we do not need to inspect it further. If we always call recv() then it doesn't make sense to call select() or poll() at all.

You say that one issue is fixed now and another still exists. What is the remaining issue exactly?

The remaining issue is that the only time bsd_socket_compute_revents is called is when FD_ISSET(fds[i].fd, &readfds) returns true. However, with some instances of the application, FD_ISSET(fds[i].fd, &readfds) will consistently return false, resulting in the infinite loop since bsd_socket_compute_revents doesn't get called.

asn added a comment.Mon, Dec 10, 10:17 AM

I've pushed some changes to the master-poll branch. This should correctly return errors.

cortana63 added a comment.EditedTue, Dec 11, 9:25 PM

Hi Andreas, the issue still persists. Looking at this commit: https://git.libssh.org/users/asn/libssh.git/commit/?h=master-poll&id=917ced39e0c8aaae4d2c22bcdbcb767f6355e62d

The if (rc == 0) check is resulting in true. However, I'm not sure if it should return a 0. A return of 0 is handled as an SSH_AGAIN on line 685 of poll.c, causing the endless loop. Can we instead do something like:

// A timeout occurred and a timeout was provided, return an error
if (rc == 0 && timeout > 0)
{
    return -1;
}
// A timeout occurred but no timeout was provided so try again
if (rc == 0 && timeout < 0)
{
    return  0;
}
asn added a comment.Thu, Dec 13, 8:56 AM

I'm sorry, but returning an error if poll() times out is simply wrong.

With poll() or select() you check if there is data on the fd to read or if all data has been written. If it times out it means we did not receive any data from the server. It would be mean we abort the connection if the connection is on idle!

If we did not detect and error we might not asked to check for data on that fd which got disconnected! However it might be that we do not check for an error in any of the other FD_SET() cases and need to do so.