IO alignment probing delivers incorrect results on Linux when used with e.g. dm-crypt

In file-posix.c:

/* Check if read is allowed with given memory buffer and length.
 *
 * This function is used to check O_DIRECT memory buffer and request alignment.
 */
static bool raw_is_io_aligned(int fd, void *buf, size_t len)
{
    ssize_t ret = pread(fd, buf, len, 0);

    if (ret >= 0) {
        return true;
    }

#ifdef __linux__
    /* The Linux kernel returns EINVAL for misaligned O_DIRECT reads.  Ignore
     * other errors (e.g. real I/O error), which could happen on a failed
     * drive, since we only care about probing alignment.
     */
    if (errno != EINVAL) {
        return true;
    }
#endif

    return false;
}

The comment claims that Linux always returns EINVAL for misaligned O_DIRECT reads. However, for block devices built on top of the Linux kernel's device-mapper infrastructure, this rule is demonstrably false. A trivial example showing its violation is dm-crypt:

In dm-crypt.c, insufficient alignment causes DM_MAPIO_KILL:

	/*
	 * Ensure that bio is a multiple of internal sector encryption size
	 * and is aligned to this size as defined in IO hints.
	 */
	if (unlikely((bio->bi_iter.bi_sector & ((cc->sector_size >> SECTOR_SHIFT) - 1)) != 0))
		return DM_MAPIO_KILL;

	if (unlikely(bio->bi_iter.bi_size & (cc->sector_size - 1)))
		return DM_MAPIO_KILL;

DM_MAPIO_KILL is then unconditionally translated into an IO error (i.e., EIO) in dm-rq.c:

	case DM_MAPIO_KILL:
		/* The target wants to complete the I/O */
		dm_kill_unmapped_request(rq, BLK_STS_IOERR);
		break;
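
To make the consequence concrete, here is a minimal simulation of the probing logic. It does not touch a real device: `sim_read_errno`, `sim_is_io_aligned` and `sim_probe_request_alignment` are hypothetical names, and the probe loop (power-of-two lengths from 512 to 4096) is a simplification of what file-posix.c does, not a copy of it. The only part taken from the code quoted above is the decision rule "anything other than EINVAL counts as aligned".

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an O_DIRECT pread() on a device that only
 * accepts request lengths that are a multiple of `required`.
 * Returns 0 on success, otherwise the errno the device reports for a
 * misaligned request (EINVAL for a plain block device, EIO for the
 * dm-crypt path quoted above). */
static int sim_read_errno(size_t len, size_t required, int misalign_errno)
{
    return (len % required == 0) ? 0 : misalign_errno;
}

/* Same decision rule as raw_is_io_aligned() on Linux: only EINVAL is
 * interpreted as "misaligned"; every other error is treated as aligned. */
static bool sim_is_io_aligned(size_t len, size_t required, int misalign_errno)
{
    int err = sim_read_errno(len, required, misalign_errno);

    return err == 0 || err != EINVAL;
}

/* Simplified probe (an assumption, not QEMU's exact loop): return the
 * smallest power-of-two length between 512 and 4096 that the check
 * accepts, or 0 if none is accepted. */
static size_t sim_probe_request_alignment(size_t required, int misalign_errno)
{
    for (size_t len = 512; len <= 4096; len *= 2) {
        if (sim_is_io_aligned(len, required, misalign_errno)) {
            return len;
        }
    }
    return 0;
}
```

With a simulated device that requires 4096-byte requests, `sim_probe_request_alignment(4096, EINVAL)` settles on 4096, whereas `sim_probe_request_alignment(4096, EIO)` already accepts 512, because the EIO from the very first misaligned read is mistaken for "aligned".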

For the probing of request_alignment (i.e., the alignment of an IO request's length), a blkdebug layer can be used to manually force a specific alignment. For the probing of buf_align, however (i.e., the alignment of an IO memory buffer), no such workaround exists (aside from disabling direct IO entirely, of course).

It seems to me that the Linux-specific code here should just be removed entirely, given that it appears to be merely a performance optimization when working with failed drives, while at the same time causing essentially unsolvable IO errors when dealing with device-mapper devices that enforce an alignment > 512.
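
As a rough sketch of what that removal could look like (not a tested patch; the name `raw_is_io_aligned_strict` is hypothetical and only used here to distinguish it from the existing function):

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical variant of raw_is_io_aligned() without the Linux-specific
 * branch: any failed read counts as "not aligned", regardless of errno.
 * EIO from a device-mapper target is therefore handled the same way as
 * EINVAL from a plain block device. */
static bool raw_is_io_aligned_strict(int fd, void *buf, size_t len)
{
    ssize_t ret = pread(fd, buf, len, 0);

    return ret >= 0;
}
```

The trade-off is the one mentioned above: on a drive that really is failing, every probe read would now be rejected instead of accepted, so probing would fail rather than settle on the smallest candidate.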