states: Use RESYNC to handle more structured reply server bugs
In software, it's always better to be strict in what you produce and
lenient in what you accept. Continuing on from the previous patch,
there are quite a few situations where we are expecting a particular
structured reply, but can still gracefully recover and keep the
connection alive if the server sends us something unexpected (either a
wrong reply type, or a wrong length to a correct reply type). There
are only a few situations left where we really do want a structured
reply to move to DEAD: read() errors on the socket itself, if the
server's payload is so large as to be a denial of service, or if
malloc() fails.
While touching the code, note that once we check that cmd->type is
NBD_CMD_BLOCK_STATUS, we do not also need to check whether
cmd->cb.fn.extent is non-null.
The handling for NBD_REPLY_TYPE_ERROR_* is interesting: since the
error value comes first, even if we don't get a full error packet over
the wire, we can prefer the server's errno over EPROTO. But that
means error messages should not use RESYNC except when we don't get a
full error value. This patch also changes the code to use EPROTO
instead of EINVAL if the server mistakenly sends NBD_SUCCESS as its
error code.
Again, compliant servers will never trip over to the new state; but an
easy way to demonstrate this in action is with a temporary patch to
nbdkit:
| diff --git i/common/protocol/nbd-protocol.h w/common/protocol/nbd-protocol.h
| index e5d6404b..1e05d825 100644
| --- i/common/protocol/nbd-protocol.h
| +++ w/common/protocol/nbd-protocol.h
| @@ -242,7 +242,7 @@ struct nbd_structured_reply_error {
|
| /* Structured reply types. */
| #define NBD_REPLY_TYPE_NONE 0
| -#define NBD_REPLY_TYPE_OFFSET_DATA 1
| +#define NBD_REPLY_TYPE_OFFSET_DATA 11
| #define NBD_REPLY_TYPE_OFFSET_HOLE 2
| #define NBD_REPLY_TYPE_BLOCK_STATUS 5
| #define NBD_REPLY_TYPE_ERROR NBD_REPLY_TYPE_ERR (1)
With the following command line:
$ ./run /path/to/nbdkit -U - memory 1M --run 'nbdsh -u "$uri" -c "\
try:
h.pread(1, 0)
except nbd.Error as ex:
print(ex.string)
h.flush()
print(h.get_size())
"'
pre-patch results show the client hanging up abruptly on the server:
nbd_pread: unknown structured reply type (11)
nbdsh: command line script failed: nbd_flush: invalid state: DEAD: the handle must be connected with the server: Invalid argument
nbdkit: memory.1: error: read request: Connection reset by peer
while post-patch detects the server error, but allows the next command:
nbd_pread: read: command failed: Protocol error
1048576
Reading the LIBNBD_DEBUG=1 log further shows:
libnbd: debug: nbdsh: nbd_pread: unexpected reply type 11 or payload length 9 for cookie 1 and command 0, this is probably a server bug
Message-Id: <20220812020638.856332-5-eblake@redhat.com>
Acked-by:
Richard W.M. Jones <rjones@redhat.com>
Please register or sign in to comment