Page MenuHomePhabricator

Running the sftp_read test for the second time fails
Open, Needs TriagePublic


The sftp_read torture test fails when ran for the second time. After modifying the test this way:

diff --git a/tests/client/torture_sftp_read.c b/tests/client/torture_sftp_read.c
index a9df6abb..8246dbbf 100644
--- a/tests/client/torture_sftp_read.c
+++ b/tests/client/torture_sftp_read.c
@@ -98,6 +98,9 @@ static void torture_sftp_read_blocking(void **state) {
 int torture_run_tests(void) {
     int rc;
     struct CMUnitTest tests[] = {
+        cmocka_unit_test_setup_teardown(torture_sftp_read_blocking,
+                                        session_setup,
+                                        session_teardown),

The second run fails with something like

[2018/11/16 16:33:18.391086, 3] packet_send2:  packet: wrote [len=48,padding_size=11,comp=36,payload=36]
[2018/11/16 16:33:18.391091, 4] ssh_socket_pollcallback:  Poll callback on socket 5 (POLLOUT ), out buffer 0
[2018/11/16 16:33:18.391097, 4] ssh_socket_pollcallback:  sending control flow event
[2018/11/16 16:33:18.391103, 4] ssh_packet_socket_controlflow_callback:  sending channel_write_wontblock callback
[2018/11/16 16:33:18.567133, 4] ssh_socket_pollcallback:  Poll callback on socket 5 (POLLIN ), out buffer 0
[2018/11/16 16:33:18.567169, 1] ssh_socket_exception_callback:  Socket exception callback: 1 (0)
[2018/11/16 16:33:18.567176, 1] ssh_socket_exception_callback:  Socket error: disconnected
Could not run test: s->ssh.session
[   LINE   ] --- /home/jjelen/devel/libssh/tests/client/torture_sftp_read.c:43: error: Failure!Test setup failed
[  ERROR   ] torture_sftp_read_blocking

I am able to reproduce this with revisions down to 094aa5e (but the problem existed probably even earlier). I can not use the same trick to reproduce the bug with the torture_auth test case. But the sftp_fsync works the same way. This could be somehow related to opening channels?

The log looks like the socket for the second invocation is somehow left in nonblocking state and some of the read returns nothing, which is interpreted as a EOF/disconnect, if I read the log correctly.

Event Timeline

After more hours of digging in the code, I isolated this to the two SFTP calls that cause this issue to be reproducible reliably. These are calls to ssh_channel_send_eof() and ssh_channel_free() from sftp_free() called from the cleanup of the first sftp. If I remove these two calls, the testsuite passes reproducibly.

I am still not sure how the cleanup of the first test can affect the invocation of the second test or where the state could be preserved. Any insight would be welcomed.

After getting in a bit deeper, it really looks like the sshd is really closing the socket while doing something with PAM. The last logs from sshd are these in most of the cases:

debug2: input_userauth_request: try method none [preauth]
debug3: user_specific_delay: user specific delay 0.000ms [preauth]
debug3: ensure_minimum_time_since: elapsed 0.532ms, delaying 4.863ms (requested 5.395ms) [preauth]
debug3: mm_request_receive entering
debug3: monitor_read: checking request 100
debug1: PAM: initializing for "alice"

After that the socket is closed, test fails and the test kills the sshd.

Therefore this looks like a bug in OpenSSH (tried with both Fedora and vanila master as for today) or PAM wrapper. Disabling PAM in sshd_config solves the problems. Unfortunately, pam wrapper under openssh privilege separation does not yield useful information and I do not know how to debug this further.

From the server debug log, I observed, that sshd is closing the session and connection upon the closing of the sftp channel (second packet sent during cleanup), while if I do not send these packets, the disconnect is sent by the client. But I did not notice any missing calls to cleanup calls in the OpenSSH.

My proposal at this point would be to allow running OpenSSH without PAM support for most of the tests. The only tests in the current testsuite requiring the PAM support in SSHD are torture_auth_kbdint* tests and they can enable pam by some argument to the setup function.

This comment was removed by asn.