      block: check partition alignment · 633395b6
      Partitions that are not aligned to the blocksize of a device may cause
      invalid I/O requests because the block layer cares only about alignment
      within the partition when building requests on partitions.
      Example: a partition that starts at an offset of 512 bytes on a device
      with 4k blocks.
      When reading/writing one 4k block of the partition, this maps to
      reading/writing at an offset of 512 bytes on the device, leading to
      unaligned requests for the device, which in turn may cause unexpected
      behavior of the device driver.
      For DASD devices we have to translate the block number into a cylinder,
      head, record format. Unaligned requests lead to a wrong calculation
      and therefore to misdirected I/O. In a "good" case this leads to I/O
      errors because the underlying hardware detects the wrong addressing.
      In the worst case this might destroy data on the device.
      To prevent partitions that are not aligned to the physical blocksize
      of a device, check for the alignment in blkpg_ioctl().
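      A minimal user-space sketch of such an alignment test follows. The
      function name and the byte-based parameters are assumptions made for
      illustration, not the code added by this commit; the block size is
      assumed to be a non-zero power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when both the start offset and the length of a partition
 * (in bytes) are multiples of the device block size.
 * block_size is assumed to be a non-zero power of two.
 */
bool partition_is_aligned(uint64_t start, uint64_t length,
                          uint32_t block_size)
{
    uint64_t mask = (uint64_t)block_size - 1;

    /* OR-ing the two values lets a single mask test cover both. */
    return ((start | length) & mask) == 0;
}
```

      With 4k blocks, a partition starting at byte 512 fails the test, while
      one starting at byte 4096 with a length of 8192 passes.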
      Signed-off-by: Stefan Haberland <sth@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      block: Update blkdev_dax_capable() for consistency · a8078b1f
      blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
      to remain as a separate interface for checking dax capability of
      a raw block device.
      Rename and relocate blkdev_dax_capable() to keep them maintained
      consistently, and call bdev_direct_access() for the dax capability
      check.
      There is no change in the behavior.
      Link: https://lkml.org/lkml/2016/5/9/950
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      This promise never materialized.  And it is unlikely it ever will.
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion on whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      Let's stop pretending that pages in page cache are special.  They are
      not.
      The changes are pretty straight-forward:
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
       - page_cache_get() -> get_page();
       - page_cache_release() -> put_page();
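      The first two rewrites are safe only because the shift distance is
      zero.  A small user-space sketch of that, assuming PAGE_SHIFT is 12
      (the real value is architecture-specific) and using hypothetical
      helper names:

```c
#include <stdint.h>

/* Illustrative value; the real PAGE_SHIFT depends on the architecture. */
#define PAGE_SHIFT        12
/* Before the patch the page-cache macro was a plain alias. */
#define PAGE_CACHE_SHIFT  PAGE_SHIFT

/* Old spelling: convert a page index into a page-cache index. */
uint64_t index_old(uint64_t index)
{
    return index >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);  /* shift by 0 */
}

/* New spelling after the rewrite: the shift disappears. */
uint64_t index_new(uint64_t index)
{
    return index;
}

/* Byte offset of a page-cache index, old and new spelling. */
uint64_t offset_old(uint64_t index)
{
    return index << PAGE_CACHE_SHIFT;
}

uint64_t offset_new(uint64_t index)
{
    return index << PAGE_SHIFT;
}
```

      Both spellings return the same value for every input, which is why the
      conversion can be done mechanically.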
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header
      files.  I've called spatch for them manually.
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      There are a few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
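      virtual patch

      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E

      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E

      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT

      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE

      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK

      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)

      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)

      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
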
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      block: revert runtime dax control of the raw block device · 9f4736fe
      Dynamically enabling DAX requires that the page cache first be flushed
      and invalidated.  This must occur atomically with the change of DAX mode
      otherwise we confuse the fsync/msync tracking and violate data
      durability guarantees.  Eliminate the possibility of DAX-disabled to
      DAX-enabled transitions for now and revisit this for the next cycle.
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      wrappers for ->i_mutex access · 5955102c
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
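      The naming rule can be sketched in user space as follows.  The spinning
      struct mutex below is a toy stand-in built on C11 atomics, not the
      kernel's mutex, and only three of the wrappers are shown; everything
      here is an illustration of the pattern rather than the kernel code.

```c
#include <stdatomic.h>

/* Toy spinning mutex standing in for the kernel's struct mutex. */
struct mutex {
    atomic_flag held;
};

void mutex_init(struct mutex *m)
{
    atomic_flag_clear(&m->held);
}

void mutex_lock(struct mutex *m)
{
    while (atomic_flag_test_and_set(&m->held))
        ;  /* spin until the flag was previously clear */
}

void mutex_unlock(struct mutex *m)
{
    atomic_flag_clear(&m->held);
}

/* Kernel convention: 1 if the lock was acquired, 0 otherwise. */
int mutex_trylock(struct mutex *m)
{
    return !atomic_flag_test_and_set(&m->held);
}

struct inode {
    struct mutex i_mutex;
};

/* inode_foo(inode) is mutex_foo(&inode->i_mutex). */
void inode_lock(struct inode *inode)
{
    mutex_lock(&inode->i_mutex);
}

void inode_unlock(struct inode *inode)
{
    mutex_unlock(&inode->i_mutex);
}

int inode_trylock(struct inode *inode)
{
    return mutex_trylock(&inode->i_mutex);
}
```

      Callers that go through the wrappers never name ->i_mutex directly, so
      the lock type behind them can change without touching the call sites.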
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      block: enable dax for raw block devices · 5a023cdb
      If an application wants exclusive access to all of the persistent memory
      provided by an NVDIMM namespace it can use this raw-block-dax facility
      to forgo establishing a filesystem.  This capability is targeted
      primarily to hypervisors wanting to provision persistent memory for
      guests.  It can be disabled / enabled dynamically via the new BLKDAXSET
      ioctl.
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      block: add an API for Persistent Reservations · bbd3e064
      This commit adds a driver API and ioctls for controlling Persistent
      Reservations generically at the block layer.  Persistent
      Reservations are supported by SCSI and NVMe and allow controlling who gets
      access to a device in a shared storage setup.
      Note that we add a pr_ops structure to struct block_device_operations
      instead of adding the members directly to avoid bloating all instances
      of devices that will never support Persistent Reservations.
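      The space argument can be illustrated with a simplified user-space
      sketch.  The callback signatures are reduced for illustration and do
      not match the real interface, and demo_pr_reserve() plus the demo_*
      tables are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct block_device;  /* opaque in this sketch */

/* Simplified: the real structure has more callbacks and arguments. */
struct pr_ops {
    int (*pr_register)(struct block_device *bdev,
                       uint64_t old_key, uint64_t new_key);
    int (*pr_reserve)(struct block_device *bdev, uint64_t key);
    int (*pr_release)(struct block_device *bdev, uint64_t key);
};

struct block_device_operations {
    /* ... other callbacks elided ... */
    const struct pr_ops *pr_ops;  /* NULL costs one pointer per ops table */
};

/* Hypothetical dispatcher: reject devices without reservation support. */
int demo_pr_reserve(const struct block_device_operations *ops,
                    struct block_device *bdev, uint64_t key)
{
    if (!ops->pr_ops || !ops->pr_ops->pr_reserve)
        return -EOPNOTSUPP;
    return ops->pr_ops->pr_reserve(bdev, key);
}

/* Demo driver that accepts every reservation. */
static int demo_reserve_ok(struct block_device *bdev, uint64_t key)
{
    (void)bdev;
    (void)key;
    return 0;
}

const struct pr_ops demo_pr_ops = {
    .pr_reserve = demo_reserve_ok,
};

const struct block_device_operations demo_fops = {
    .pr_ops = &demo_pr_ops,
};
```

      A driver without reservation support leaves pr_ops NULL and pays for a
      single pointer instead of one slot per callback.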
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      block: cleanup blkdev_ioctl · d8e4bb81
      Split out helpers for all non-trivial ioctls to make this function simpler,
      and also start passing around a pointer version of the argument, as that's
      what most ioctl handlers actually need.
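      The shape of that cleanup can be sketched in user space like this.  The
      command numbers, the ioctl_dev structure and the helper names are all
      hypothetical, and direct dereferences stand in for the user-copy
      helpers that kernel code would need.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical command numbers for this sketch only. */
#define DEMO_GET_SIZE 1
#define DEMO_SET_RO   2

struct ioctl_dev {
    uint64_t size;
    int read_only;
};

/* Each non-trivial command gets its own helper taking a typed pointer. */
static int demo_get_size(struct ioctl_dev *dev, uint64_t *argp)
{
    *argp = dev->size;
    return 0;
}

static int demo_set_ro(struct ioctl_dev *dev, const int *argp)
{
    dev->read_only = (*argp != 0);
    return 0;
}

/* The dispatcher converts the integer argument to a pointer once. */
int demo_ioctl(struct ioctl_dev *dev, unsigned int cmd, unsigned long arg)
{
    void *argp = (void *)(uintptr_t)arg;

    switch (cmd) {
    case DEMO_GET_SIZE:
        return demo_get_size(dev, argp);
    case DEMO_SET_RO:
        return demo_set_ro(dev, argp);
    default:
        return -ENOTTY;
    }
}
```

      The switch stays a flat list of one-line cases, and the cast from the
      raw argument happens in exactly one place.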
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      block: replace trylock with mutex_lock in blkdev_reread_part() · b04a5636
      The only possible problem of using mutex_lock() instead of trylock
      is deadlock.
      If there aren't any locks held before calling blkdev_reread_part(),
      deadlock can't be caused by this conversion.
      If there are locks held before calling blkdev_reread_part(),
      and if these locks aren't required in the open and close handlers and
      the I/O path, deadlock shouldn't be caused either.
      Both user space's ioctl(BLKRRPART) and md_setup_drive() from
      init/do_mounts_md.c belong to the 1st case, so the conversion is safe
      for the two cases.
      For loop, the previous patches in this patchset have fixed the ABBA lock
      dependency, so the conversion is OK.
      For nbd, tx_lock is held when calling the function:
      	- both open and release won't hold the lock
      	- when blkdev_reread_part() is run, the I/O thread has been stopped
      	already, so tx_lock won't be acquired in the I/O path at that time.
      	- so the conversion won't cause deadlock for nbd
      For dasd, dasd_open(), dasd_release() and the request function don't
      acquire any mutex/semaphore, so the conversion should be safe.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Jarod Wilson <jarod@redhat.com>
      Acked-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      block: export blkdev_reread_part() and __blkdev_reread_part() · be324177
      This patch exports blkdev_reread_part() for block drivers and also
      introduces __blkdev_reread_part().
      For some drivers, such as loop, a reread of partitions can be run
      from the release path, and bd_mutex may already be held prior to
      calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
      __blkdev_reread_part() for use in such cases.
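      The locked/unlocked pairing can be sketched in user space as below.
      The demo_* names, the rescans counter and the atomic_flag lock are
      assumptions for illustration; the "_locked" variant plays the role of
      the double-underscore function.

```c
#include <stdatomic.h>

struct demo_bdev {
    atomic_flag bd_mutex;  /* toy lock standing in for bdev->bd_mutex */
    int rescans;           /* counts partition rescans in this sketch */
};

void demo_bdev_init(struct demo_bdev *bdev)
{
    atomic_flag_clear(&bdev->bd_mutex);
    bdev->rescans = 0;
}

void demo_bdev_lock(struct demo_bdev *bdev)
{
    while (atomic_flag_test_and_set(&bdev->bd_mutex))
        ;  /* spin */
}

void demo_bdev_unlock(struct demo_bdev *bdev)
{
    atomic_flag_clear(&bdev->bd_mutex);
}

/* Caller must already hold bd_mutex (e.g. from a release path). */
int demo_reread_part_locked(struct demo_bdev *bdev)
{
    bdev->rescans++;
    return 0;
}

/* Takes bd_mutex itself, for callers that do not hold it. */
int demo_reread_part(struct demo_bdev *bdev)
{
    int res;

    demo_bdev_lock(bdev);
    res = demo_reread_part_locked(bdev);
    demo_bdev_unlock(bdev);
    return res;
}
```

      Calling the locking variant while the lock is already held would spin
      forever in this sketch, which mirrors the deadlock the split avoids.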
      CC: Christoph Hellwig <hch@lst.de>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Tejun Heo <tj@kernel.org>
      CC: Alexander Viro <viro@zeniv.linux.org.uk>
      CC: Markus Pargmann <mpa@pengutronix.de>
      CC: Stefan Weinhuber <wein@de.ibm.com>
      CC: Stefan Haberland <stefan.haberland@de.ibm.com>
      CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
      CC: Fabian Frederick <fabf@skynet.be>
      CC: Ming Lei <ming.lei@canonical.com>
      CC: David Herrmann <dh.herrmann@gmail.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: nbd-general@lists.sourceforge.net
      CC: linux-s390@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      block: Add discard flag to blkdev_issue_zeroout() function · d93ba7a5
      blkdev_issue_zeroout() will zero a given block range. This is done by
      way of explicit writing, thus provisioning or allocating the blocks on
      disk.
      There are use cases where the desired behavior is to zero the blocks but
      unprovision them if possible. The blocks must deterministically contain
      zeroes when they are subsequently read back.
      This patch adds a flag to blkdev_issue_zeroout() that provides this
      variant. If the discard flag is set and a block device guarantees
      discard_zeroes_data, we will use REQ_DISCARD to clear the block range. If
      the device does not support discard_zeroes_data or if the discard
      request fails, we will fall back first to REQ_WRITE_SAME and then to a
      regular REQ_WRITE.
      Also update the callers of blkdev_issue_zeroout() to reflect the new flag
      and make sb_issue_zeroout() prefer the discard approach.
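      The fallback order described above can be sketched as a small
      user-space model.  The zero_dev structure, its capability flags and
      the function names are hypothetical; no real I/O is issued, the
      function only reports which method would have been used.

```c
#include <errno.h>
#include <stdbool.h>

enum zero_method { ZERO_DISCARD, ZERO_WRITE_SAME, ZERO_WRITE };

/* Hypothetical device model: capabilities plus a simulated failure. */
struct zero_dev {
    bool discard_zeroes_data;  /* discard is guaranteed to read back zeroes */
    bool discard_fails;        /* simulate a failing discard request */
    bool supports_write_same;
};

static int try_discard(const struct zero_dev *dev)
{
    if (!dev->discard_zeroes_data)
        return -EOPNOTSUPP;
    return dev->discard_fails ? -EIO : 0;
}

static int try_write_same(const struct zero_dev *dev)
{
    return dev->supports_write_same ? 0 : -EOPNOTSUPP;
}

/* Pick the first method that works: discard, WRITE SAME, plain write. */
enum zero_method demo_issue_zeroout(const struct zero_dev *dev, bool discard)
{
    if (discard && try_discard(dev) == 0)
        return ZERO_DISCARD;
    if (try_write_same(dev) == 0)
        return ZERO_WRITE_SAME;
    return ZERO_WRITE;  /* explicit zero writes always remain available */
}
```

      Without the flag the discard step is skipped entirely, so existing
      callers that need provisioned blocks keep their behavior.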
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      block, bdi: an active gendisk always has a request_queue associated with it · ff9ea323
      bdev_get_queue() returns the request_queue associated with the
      specified block_device.  blk_get_backing_dev_info() makes use of
      bdev_get_queue() to determine the associated bdi given a block_device.
      All the callers of bdev_get_queue() including
      blk_get_backing_dev_info() assume that bdev_get_queue() may return
      NULL and implement NULL handling; however, bdev_get_queue() requires
      the passed in block_device is opened and attached to its gendisk.
      Because an active gendisk always has a valid request_queue associated
      with it, bdev_get_queue() can never return NULL and neither can
      blk_get_backing_dev_info().
      Make it clear that neither of the two functions can return NULL and
      remove NULL handling from all the callers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      block: add partition resize function to blkpg ioctl · c83f6bf9
      Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
      allows altering the size of an existing partition, even if it is currently
      in use.
      This patch protects hd_struct->nr_sects with a sequence counter because
      one might extend a partition while IO is happening to it, and the update
      of nr_sects can be non-atomic on 32bit machines with a 64bit sector_t.
      This can lead to issues like reading an inconsistent size of a partition.
      A sequence counter has been used so that readers don't have to take the
      bdev mutex lock, as we call sector_in_part() very frequently.
      Now all access to hd_struct->nr_sects should happen using the sequence
      counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
      There is one exception though, set_capacity()/get_capacity(). I think
      theoretically a race should exist there too, but this patch does not
      modify set_capacity()/get_capacity() due to the sheer number of call sites,
      and I am afraid that a change might break something. I have left that as a
      TODO item. We can handle it later if need be. This patch does not introduce
      any new races as such w.r.t. set_capacity()/get_capacity().
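      The read/retry idea behind such helpers can be sketched with C11
      atomics.  This is a user-space illustration, not the kernel's seqcount
      implementation: the 64bit size is split into two 32bit halves to model
      a 32bit machine, a single writer is assumed (serialized elsewhere, as
      by the bdev mutex), and all demo_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

struct demo_part {
    atomic_uint seq;       /* odd while an update is in progress */
    _Atomic uint32_t lo;   /* low half of the sector count */
    _Atomic uint32_t hi;   /* high half of the sector count */
};

void demo_part_init(struct demo_part *p)
{
    atomic_init(&p->seq, 0);
    atomic_init(&p->lo, 0);
    atomic_init(&p->hi, 0);
}

/* Single writer assumed; concurrent writers need external locking. */
void demo_nr_sects_write(struct demo_part *p, uint64_t size)
{
    unsigned int s = atomic_load_explicit(&p->seq, memory_order_relaxed);

    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);  /* odd */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->lo, (uint32_t)size, memory_order_relaxed);
    atomic_store_explicit(&p->hi, (uint32_t)(size >> 32),
                          memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);  /* even */
}

/* Lock-free reader: retry if a write overlapped the two half reads. */
uint64_t demo_nr_sects_read(struct demo_part *p)
{
    unsigned int s1, s2;
    uint32_t lo, hi;

    do {
        s1 = atomic_load_explicit(&p->seq, memory_order_acquire);
        lo = atomic_load_explicit(&p->lo, memory_order_relaxed);
        hi = atomic_load_explicit(&p->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s2 = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((s1 & 1) || s1 != s2);

    return ((uint64_t)hi << 32) | lo;
}
```

      Readers never block the writer; they only pay for a retry when an
      update happens to overlap their read.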
      v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Phillip Susi <psusi@ubuntu.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      vfs: fix up ENOIOCTLCMD error handling · 07d106d0
      We're doing some odd things there, which already messes up various users
      (see the net/socket.c code that this removes), and it was going to add
      yet more crud to the block layer because of the incorrect error code
      translation.
      ENOIOCTLCMD is not an error return that should be returned to user mode
      from the "ioctl()" system call, but it should *not* be translated as
      EINVAL ("Invalid argument").  It should be translated as ENOTTY
      ("Inappropriate ioctl for device").
      That EINVAL confusion has apparently so permeated some code that the
      block layer actually checks for it, which is sad.  We continue to do so
      for now, but add a big comment about how wrong that is, and we should
      remove it entirely eventually.  In the meantime, this tries to keep the
      changes localized to just the EINVAL -> ENOTTY fix, and removing code
      that makes it harder to do the right thing.
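      The intended translation at the system call boundary can be sketched
      as follows.  ENOIOCTLCMD is a kernel-internal code (515) that user
      space headers do not define, so the sketch defines it locally; the
      function name is hypothetical.

```c
#include <errno.h>

/* Kernel-internal code, never meant to be seen by user space. */
#ifndef ENOIOCTLCMD
#define ENOIOCTLCMD 515
#endif

/*
 * Translate an internal ioctl result into what user space should see:
 * "no such ioctl command" becomes ENOTTY, everything else passes through.
 */
long demo_ioctl_result_to_user(long err)
{
    if (err == -ENOIOCTLCMD)
        return -ENOTTY;
    return err;
}
```

      A genuine -EINVAL from a handler still reaches the caller unchanged, so
      "bad argument" and "unknown command" stay distinguishable.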
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Tejun Heo authored
      There are cases where suppressing partition scan is useful - e.g. for
      lo devices and pseudo SATA devices which advertise to be a disk but
      get upset on partition scan (some port multiplier control devices show
      such behavior).
      This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
      regardless of the number of possible partitions.  disk_partitionable()
      is renamed to disk_part_scan_enabled() as suppressing partition scan
      doesn't imply the device can't be partitioned using
      BLKPG_ADD/DEL_PARTITION calls from userland.  show_partition() now
      directly tests disk_max_parts() to maintain backward-compatibility.
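      A simplified user-space sketch of the two tests follows.  The
      demo_gendisk structure, the flag value and the way the maximum number
      of partitions is derived from the minor count are assumptions made for
      illustration.

```c
#include <stdbool.h>

/* Flag value chosen for this sketch only. */
#define GENHD_FL_NO_PART_SCAN 0x200

struct demo_gendisk {
    int minors;          /* number of minors, i.e. possible partitions */
    unsigned int flags;
};

/* Simplified: how many partitions the disk could expose. */
int demo_disk_max_parts(const struct demo_gendisk *disk)
{
    return disk->minors;
}

/* Scanning needs room for partitions and no opt-out flag. */
bool demo_disk_part_scan_enabled(const struct demo_gendisk *disk)
{
    return demo_disk_max_parts(disk) > 1 &&
           !(disk->flags & GENHD_FL_NO_PART_SCAN);
}
```

      A disk with the flag set still reports more than one possible
      partition, so adding partitions from userland remains possible.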
      -v2: Updated to make it clear that only partition scan is suppressed
           not partitioning itself as suggested by Kay Sievers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      * blkdev_get/put() are the primary open and close functions.
      * bd_claim/release() deal with exclusive open.
      * open/close_bdev_exclusive() are combination of open and claim and
        the other way around, respectively.
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      * open_by_devnum() wraps bdget() + blkdev_get().
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access, as an
      in-kernel open + claim sequence can disturb the existing exclusive
      open even before the block layer knows the current open is for another
      exclusive access.  Reorganize the interface such that,
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if @FMODE_EXCL is specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
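      The combined open + claim semantics can be modelled in user space as
      below.  The excl_bdev structure, the DEMO_FMODE_EXCL value and the
      function names are hypothetical, and the model is single-threaded;
      the real functions need locking to make the claim atomic.

```c
#include <errno.h>
#include <stddef.h>

/* Flag value chosen for this sketch only. */
#define DEMO_FMODE_EXCL 0x80

struct excl_bdev {
    int openers;    /* all opens, exclusive or not */
    void *holder;   /* current exclusive holder, NULL if none */
    int holders;    /* number of exclusive claims by that holder */
};

/* Open the device; with DEMO_FMODE_EXCL also claim it for @holder. */
int demo_blkdev_get(struct excl_bdev *bdev, unsigned int mode, void *holder)
{
    if (mode & DEMO_FMODE_EXCL) {
        if (!holder)
            return -EINVAL;
        if (bdev->holder && bdev->holder != holder)
            return -EBUSY;  /* someone else holds it exclusively */
        bdev->holder = holder;
        bdev->holders++;
    }
    bdev->openers++;
    return 0;
}

/* Close the device; drop the claim when the last exclusive open goes. */
void demo_blkdev_put(struct excl_bdev *bdev, unsigned int mode)
{
    if ((mode & DEMO_FMODE_EXCL) && --bdev->holders == 0)
        bdev->holder = NULL;
    bdev->openers--;
}
```

      Because the claim is checked before the open is counted, a failed
      exclusive open leaves the existing holder undisturbed.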
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Brown <neilb@suse.de>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
