
Fix encryption plugin to handle OpenSSL 3.0 (fixes various YDBTest replication test failures - TLSIOERROR)

Narayanan Iyer requested to merge nars1/YDBEncrypt:openssl3 into master

Background

  • Various replication tests failed when run with TLS enabled on an Ubuntu 22.04 system that had OpenSSL 3.0 installed.

  • An example subtest invocation that failed is -t v62000 -replic -st gtm8121 -env gtm_test_tls=TRUE.

  • The test failed with the following error in the receiver server log.

    %YDB-E-TLSIOERROR, Error during TLS/SSL recv operation
    %YDB-I-TEXT, error:0A000126:SSL routines::unexpected eof while reading
  • Interestingly, the test starts the receiver server 3 times, and it was always the 2nd and 3rd receiver server invocations that failed with the above error.

  • Coincidentally, it was only those 2nd and 3rd receiver server invocations that had the source server connect, replicate for a while and get killed on the source side.

  • Further debugging showed that the error occurs in the receiver server when it detects that the source server has disconnected (because the test crashed the source server using a kill -9).

Issue

  • OpenSSL 3.0 changed how an unexpected EOF error is signalled. The following is from https://www.openssl.org/docs/man3.0/man3/SSL_get_error.html.

    On an unexpected EOF, versions before OpenSSL 3.0 returned SSL_ERROR_SYSCALL, nothing was added to the error stack, and errno was 0. Since OpenSSL 3.0 the returned error is SSL_ERROR_SSL with a meaningful error on the error stack.

Fix

  • gtm_tls_impl.c (the reference encryption plugin implementation) is now fixed to handle this OpenSSL difference in behavior. If the major version is less than 3, we continue to do what we did before. If the major version is greater than or equal to 3, we check if the returned error code is SSL_ERROR_SSL (the code that OpenSSL 3.0 now returns on an unexpected EOF, per the documentation quoted above) and if the actual error number is 0x0A000126. If so, we also signal a connection reset by setting the global variable tls_errno to ECONNRESET.

    As a side note, the error_code variable was previously overloaded to first store the SSL_ERROR_SSL code and later the 0x0A000126 code. Since we need to examine both in the same if check, a new variable error_code2 now stores the 0x0A000126 value (error detail).
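
The check described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in gtm_tls_impl.c: the helper name sketch_tls_errno and the SKETCH_ constants are hypothetical, the two enum values are local stand-ins that mirror SSL_ERROR_SSL and SSL_ERROR_SYSCALL so the sketch compiles without OpenSSL headers, and the pre-3.0 branch is an assumption based on the documentation quoted in the Issue section (SSL_ERROR_SYSCALL with errno 0).

```c
#include <errno.h>

/* Local stand-ins (assumption: they mirror SSL_ERROR_SSL and
 * SSL_ERROR_SYSCALL from <openssl/ssl.h>). */
enum { SKETCH_SSL_ERROR_SSL = 1, SKETCH_SSL_ERROR_SYSCALL = 5 };

/* Error detail seen in the receiver server log:
 * "error:0A000126:SSL routines::unexpected eof while reading" */
#define SKETCH_UNEXPECTED_EOF 0x0A000126UL

/* Hypothetical helper: returns the errno value to report for a failed
 * TLS I/O call, or -1 if the failure is not a disconnect handled here.
 *   openssl_major  major version of the OpenSSL library in use
 *   error_code     generic code, as returned by SSL_get_error()
 *   error_code2    error detail taken from the OpenSSL error stack
 *   sys_errno      errno captured right after the failed call
 */
int sketch_tls_errno(int openssl_major, int error_code,
                     unsigned long error_code2, int sys_errno)
{
    if (SKETCH_SSL_ERROR_SYSCALL == error_code)
    {   /* Assumed pre-existing handling: before OpenSSL 3.0 an unexpected
         * EOF arrived here with errno 0, which is treated as a reset. */
        return (0 == sys_errno) ? ECONNRESET : sys_errno;
    }
    if ((3 <= openssl_major) && (SKETCH_SSL_ERROR_SSL == error_code)
            && (SKETCH_UNEXPECTED_EOF == error_code2))
    {   /* OpenSSL 3.0 and later: same condition, reported differently. */
        return ECONNRESET;
    }
    return -1;
}
```

Keeping the generic code and the error detail in separate parameters reflects the error_code / error_code2 split: both values have to be visible in the same condition.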

Notes

  • It is not clear whether future versions of OpenSSL might change the error code for a connection reset from 0x0A000126 to something else. If they do, we might need a better way to detect this than the current hardcoded check. Something to worry about in the future if the need arises.
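
One possible direction, shown only as a sketch, is to split the packed value into its library and reason fields instead of comparing the whole number. The field layout below is an assumption taken from the OpenSSL 3.x error packing (library number above bit 23, reason code in the low 23 bits), and the values 20 and 294 are believed to correspond to ERR_LIB_SSL and SSL_R_UNEXPECTED_EOF_WHILE_READING. All sketch_ and SKETCH_ names are hypothetical and exist only so the example compiles without OpenSSL headers.

```c
/* Assumed OpenSSL 3.x packing of an error code (see <openssl/err.h>). */
#define SKETCH_ERR_LIB_OFFSET   23
#define SKETCH_ERR_LIB_MASK     0xFFUL
#define SKETCH_ERR_REASON_MASK  0x7FFFFFUL

/* Assumed values of ERR_LIB_SSL and SSL_R_UNEXPECTED_EOF_WHILE_READING. */
#define SKETCH_LIB_SSL          20UL
#define SKETCH_R_UNEXPECTED_EOF 294UL

/* Library field: which OpenSSL sub-library raised the error. */
unsigned long sketch_err_lib(unsigned long code)
{
    return (code >> SKETCH_ERR_LIB_OFFSET) & SKETCH_ERR_LIB_MASK;
}

/* Reason field: the library-specific reason code. */
unsigned long sketch_err_reason(unsigned long code)
{
    return code & SKETCH_ERR_REASON_MASK;
}

/* Returns 1 if the code decodes to "SSL library, unexpected EOF", else 0. */
int sketch_is_unexpected_eof(unsigned long code)
{
    return (SKETCH_LIB_SSL == sketch_err_lib(code))
        && (SKETCH_R_UNEXPECTED_EOF == sketch_err_reason(code));
}
```

In a build against the real headers, the same comparison could presumably be written with ERR_GET_LIB() and ERR_GET_REASON() and the symbolic names, so the check would pick up whatever values the installed headers define. Whether that is robust enough across OpenSSL versions would still need to be verified.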
