1. 06 Apr, 2018 1 commit
  2. 30 Mar, 2018 1 commit
  3. 26 Feb, 2018 3 commits
    • Jan Kara's avatar
      blockdev: Avoid two active bdev inodes for one device · 560e7cb2
      Jan Kara authored
      When blkdev_open() races with device removal and creation it can happen
      that unhashed bdev inode gets associated with newly created gendisk
      CPU0					CPU1
        bdev = bd_acquire()
      					remove device
      					create new device with the same number
          disk = get_gendisk()
            - gets reference to gendisk of the new device
      Now another blkdev_open() will not find original 'bdev' as it got
      unhashed, create a new one and associate it with the same 'disk' at
      which point problems start as we have two independent page caches for
      one device.
      Fix the problem by verifying that the bdev inode didn't get unhashed
      before we acquired gendisk reference. That way we make sure gendisk can
      get associated only with visible bdev inodes.
      Tested-by: default avatarHou Tao <houtao1@huawei.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Jan Kara's avatar
      genhd: Fix use after free in __blkdev_get() · 89736653
      Jan Kara authored
      When two blkdev_open() calls race with device removal and recreation,
      __blkdev_get() can use looked up gendisk after it is freed:
      CPU0				CPU1			CPU2
      blkdev_open()			blkdev_open()
        bdev = bd_acquire(inode);
          - creates and returns new inode
      				  bdev = bd_acquire(inode);
      				    - returns the same inode
        __blkdev_get(devt)		  __blkdev_get(devt)
          disk = get_gendisk(devt);
            - got structure of device going away
      							<finish device removal>
      							<new device gets
      							 created under the same
      							 device number>
      				  disk = get_gendisk(devt);
      				    - got new device structure
      				  if (!bdev->bd_openers) {
      				    does the first open
          if (!bdev->bd_openers)
            - false
          } else {
              - remember this was old device - this was last ref and disk is
                now freed
          disk_unblock_events(disk); -> oops
      Fix the problem by making sure we drop reference to disk in
      __blkdev_get() only after we are really done with it.
      Reported-by: default avatarHou Tao <houtao1@huawei.com>
      Tested-by: default avatarHou Tao <houtao1@huawei.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Jan Kara's avatar
      genhd: Add helper put_disk_and_module() · 9df6c299
      Jan Kara authored
      Add a proper counterpart to get_disk_and_module() -
      put_disk_and_module(). Currently it is opencoded in several places.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  4. 11 Nov, 2017 1 commit
    • Bart Van Assche's avatar
      block, scsi: Make SCSI quiesce and resume work reliably · 3a0a5299
      Bart Van Assche authored
      The contexts from which a SCSI device can be quiesced or resumed are:
      * Writing into /sys/class/scsi_device/*/device/state.
      * SCSI parallel (SPI) domain validation.
      * The SCSI device power management methods. See also scsi_bus_pm_ops.
      It is essential during suspend and resume that neither the filesystem
      state nor the filesystem metadata in RAM changes. This is why while
      the hibernation image is being written or restored that SCSI devices
      are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
      and scsi_device_resume(). In the SDEV_QUIESCE state execution of
      non-preempt requests is deferred. This is realized by returning
      BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
      devices. Avoid that a full queue prevents power management requests
      to be submitted by deferring allocation of non-preempt requests for
      devices in the quiesced state. This patch has been tested by running
      the following commands and by verifying that after each resume the
      fio job was still running:
      for ((i=0; i<10; i++)); do
          cd /sys/block/md0/md &&
          while true; do
            [ "$(<sync_action)" = "idle" ] && echo check > sync_action
            sleep 1
        ) &
        for d in /sys/class/block/sd*[a-z]; do
          hcil=$(readlink "$d/device")
          echo 4 > "$d/queue/nr_requests"
          echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
          fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 \
            --rw=randread --ioengine=libaio --numjobs=4 --iodepth=16       \
            --iodepth_batch=1 --thread --loops=$((2**31)) &
        sleep 1
        echo "$(date) Hibernating ..." >>hibernate-test-log.txt
        systemctl hibernate
        sleep 10
        kill "${pids[@]}"
        echo idle > /sys/block/md0/md/sync_action
        echo "$(date) Done." >>hibernate-test-log.txt
      Reported-by: Oleksandr Natalenko's avatarOleksandr Natalenko <oleksandr@natalenko.name>
      References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348).
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Tested-by: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Tested-by: Oleksandr Natalenko's avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  5. 03 Nov, 2017 1 commit
  6. 13 Oct, 2017 1 commit
  7. 12 Oct, 2017 1 commit
  8. 04 Sep, 2017 1 commit
  9. 23 Aug, 2017 2 commits
  10. 06 Jul, 2017 2 commits
    • Jeff Layton's avatar
      block: convert to errseq_t based writeback error tracking · 372cf243
      Jeff Layton authored
      This is a very minimal conversion to errseq_t based error tracking
      for raw block device access. Just have it use the standard
      file_write_and_wait_range call.
      Note that there are internal callers that call sync_blockdev
      and the like that are not affected by this. They'll continue
      to use the AS_EIO/AS_ENOSPC flags for error reporting like
      they always have for now.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
    • Jeff Layton's avatar
      fs: new infrastructure for writeback error handling and reporting · 5660e13d
      Jeff Layton authored
      Most filesystems currently use mapping_set_error and
      filemap_check_errors for setting and reporting/clearing writeback errors
      at the mapping level. filemap_check_errors is indirectly called from
      most of the filemap_fdatawait_* functions and from
      filemap_write_and_wait*. These functions are called from all sorts of
      contexts to wait on writeback to finish -- e.g. mostly in fsync, but
      also in truncate calls, getattr, etc.
      The non-fsync callers are problematic. We should be reporting writeback
      errors during fsync, but many places spread over the tree clear out
      errors before they can be properly reported, or report errors at
      nonsensical times.
      If I get -EIO on a stat() call, there is no reason for me to assume that
      it is because some previous writeback failed. The fact that it also
      clears out the error such that a subsequent fsync returns 0 is a bug,
      and a nasty one since that's potentially silent data corruption.
      This patch adds a small bit of new infrastructure for setting and
      reporting errors during address_space writeback. While the above was my
      original impetus for adding this, I think it's also the case that
      current fsync semantics are just problematic for userland. Most
      applications that call fsync do so to ensure that the data they wrote
      has hit the backing store.
      In the case where there are multiple writers to the file at the same
      time, this is really hard to determine. The first one to call fsync will
      see any stored error, and the rest get back 0. The processes with open
      fds may not be associated with one another in any way. They could even
      be in different containers, so ensuring coordination between all fsync
      callers is not really an option.
      One way to remedy this would be to track what file descriptor was used
      to dirty the file, but that's rather cumbersome and would likely be
      slow. However, there is a simpler way to improve the semantics here
      without incurring too much overhead.
      This set adds an errseq_t to struct address_space, and a corresponding
      one is added to struct file. Writeback errors are recorded in the
      mapping's errseq_t, and the one in struct file is used as the "since"
      This changes the semantics of the Linux fsync implementation such that
      applications can now use it to determine whether there were any
      writeback errors since fsync(fd) was last called (or since the file was
      opened in the case of fsync having never been called).
      Note that those writeback errors may have occurred when writing data
      that was dirtied via an entirely different fd, but that's the case now
      with the current mapping_set_error/filemap_check_error infrastructure.
      This will at least prevent you from getting a false report of success.
      The new behavior is still consistent with the POSIX spec, and is more
      reliable for application developers. This patch just adds some basic
      infrastructure for doing this, and ensures that the f_wb_err "cursor"
      is properly set when a file is opened. Later patches will change the
      existing code to use this new infrastructure for reporting errors at
      fsync time.
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
  11. 28 Jun, 2017 1 commit
    • Jens Axboe's avatar
      block: provide bio_uninit() free freeing integrity/task associations · 9ae3b3f5
      Jens Axboe authored
      Wen reports significant memory leaks with DIF and O_DIRECT:
      "With nvme devive + T10 enabled, On a system it has 256GB and started
      logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
      it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
      /proc/meminfo | grep SUnreclaim...
      SUnreclaim:      6752128 kB
      SUnreclaim:      6874880 kB
      SUnreclaim:      7238080 kB
      SUnreclaim:     22307264 kB
      SUnreclaim:     22485888 kB
      SUnreclaim:     22720256 kB
      When testcases with T10 enabled call into __blkdev_direct_IO_simple,
      code doesn't free memory allocated by bio_integrity_alloc. The patch
      fixes the issue. HTX has been run with +60 hours without failure."
      Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
      doesn't go through the regular bio free. This means that any ancillary
      data allocated with the bio through the stack is not freed. Hence, we
      can leak the integrity data associated with the bio, if the device is
      using DIF/DIX.
      Fix this by providing a bio_uninit() and export it, so that we can use
      it to free this data. Note that this is a minimal fix for this issue.
      Any current user of bio's that are allocated outside of
      bio_alloc_bioset() suffers from this issue, most notably some drivers.
      We will fix those in a more comprehensive patch for 4.13. This also
      means that the commit marked as being fixed by this isn't the real
      culprit, it's just the most obvious one out there.
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Reported-by: default avatarWen Xiong <wenxiong@linux.vnet.ibm.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  12. 27 Jun, 2017 1 commit
  13. 18 Jun, 2017 1 commit
  14. 09 Jun, 2017 2 commits
  15. 08 May, 2017 1 commit
    • Dan Williams's avatar
      block, dax: move "select DAX" from BLOCK to FS_DAX · ef510424
      Dan Williams authored
      For configurations that do not enable DAX filesystems or drivers, do not
      require the DAX core to be built.
      Given that the 'direct_access' method has been removed from
      'block_device_operations', we can also go ahead and remove the
      block-related dax helper functions from fs/block_dev.c to
      drivers/dax/super.c. This keeps dax details out of the block layer and
      lets the DAX core be built as a module in the FS_DAX=n case.
      Filesystems need to include dax.h to call bdev_dax_supported().
      Cc: linux-xfs@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.com>
      Reported-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  16. 03 May, 2017 1 commit
  17. 01 May, 2017 1 commit
  18. 25 Apr, 2017 2 commits
  19. 21 Apr, 2017 1 commit
    • Ilya Dryomov's avatar
      block: get rid of blk_integrity_revalidate() · 19b7ccf8
      Ilya Dryomov authored
      Commit 25520d55 ("block: Inline blk_integrity in struct gendisk")
      introduced blk_integrity_revalidate(), which seems to assume ownership
      of the stable pages flag and unilaterally clears it if no blk_integrity
      profile is registered:
          if (bi->profile)
                  disk->queue->backing_dev_info->capabilities |=
                  disk->queue->backing_dev_info->capabilities &=
      It's called from revalidate_disk() and rescan_partitions(), making it
      impossible to enable stable pages for drivers that support partitions
      and don't use blk_integrity: while the call in revalidate_disk() can be
      trivially worked around (see zram, which doesn't support partitions and
      hence gets away with zram_revalidate_disk()), rescan_partitions() can
      be triggered from userspace at any time.  This breaks rbd, where the
      ceph messenger is responsible for generating/verifying CRCs.
      Since blk_integrity_{un,}register() "must" be used for (un)registering
      the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
      setting there.  This way drivers that call blk_integrity_register() and
      use integrity infrastructure won't interfere with drivers that don't
      but still want stable pages.
      Fixes: 25520d55 ("block: Inline blk_integrity in struct gendisk")
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 4.4+, needs backporting
      Tested-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  20. 20 Apr, 2017 2 commits
    • Dan Williams's avatar
      dax: introduce dax_direct_access() · b0686260
      Dan Williams authored
      Replace bdev_direct_access() with dax_direct_access() that uses
      dax_device and dax_operations instead of a block_device and
      block_device_operations for dax. Once all consumers of the old api have
      been converted bdev_direct_access() will be deleted.
      Given that block device partitioning decisions can cause dax page
      alignment constraints to be violated this also introduces the
      bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
      to the dax_device and also checks for page alignment.
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Dan Williams's avatar
      block: kill bdev_dax_capable() · d8f07aee
      Dan Williams authored
      This is leftover dead code that has since been replaced by
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
  21. 08 Apr, 2017 2 commits
  22. 23 Mar, 2017 2 commits
  23. 02 Mar, 2017 1 commit
    • Jan Kara's avatar
      block: Initialize bd_bdi on inode initialization · a5a79d00
      Jan Kara authored
      So far we initialized bd_bdi only in bdget(). That is fine for normal
      bdev inodes however for the special case of the root inode of
      blockdev_superblock that function is never called and thus bd_bdi is
      left uninitialized. As a result bdev_evict_inode() may oops doing
      bdi_put(root->bd_bdi) on that inode as can be seen when doing:
      mount -t bdev none /mnt
      Fix the problem by initializing bd_bdi when first allocating the inode
      and then reinitializing bd_bdi in bdev_evict_inode().
      Thanks to syzkaller team for finding the problem.
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Fixes: b1d2dc56 ("block: Make blk_get_backing_dev_info() safe without open bdev")
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  24. 28 Feb, 2017 1 commit
  25. 21 Feb, 2017 1 commit
    • Jan Kara's avatar
      block: Revalidate i_bdev reference in bd_aquire() · cccd9fb9
      Jan Kara authored
      When a device gets removed, block device inode unhashed so that it is not
      used anymore (bdget() will not find it anymore). Later when a new device
      gets created with the same device number, we create new block device
      inode. However there may be file system device inodes whose i_bdev still
      points to the original block device inode and thus we get two active
      block device inodes for the same device. They will share the same
      gendisk so the only visible differences will be that page caches will
      not be coherent and BDIs will be different (the old block device inode
      still points to unregistered BDI).
      Fix the problem by checking in bd_acquire() whether i_bdev still points
      to active block device inode and re-lookup the block device if not. That
      way any open of a block device happening after the old device has been
      removed will get correct block device inode.
      Tested-by: default avatarLekshmi Pillai <lekshmicpillai@in.ibm.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  26. 02 Feb, 2017 2 commits
    • Jan Kara's avatar
      block: Make blk_get_backing_dev_info() safe without open bdev · b1d2dc56
      Jan Kara authored
      Currenly blk_get_backing_dev_info() is not safe to be called when the
      block device is not open as bdev->bd_disk is NULL in that case. However
      inode_to_bdi() uses this function and may be call called from flusher
      worker or other writeback related functions without bdev being open
      which leads to crashes such as:
      [113031.075540] Unable to handle kernel paging request for data at address 0x00000000
      [113031.075614] Faulting instruction address: 0xc0000000003692e0
      0:mon> t
      [c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
      [c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
      [c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
      [c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
      [c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
      [c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
      [c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
      [c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Jan Kara's avatar
      block: Unhash block device inodes on gendisk destruction · f44f1ab5
      Jan Kara authored
      Currently, block device inodes stay around after corresponding gendisk
      hash died until memory reclaim finds them and frees them. Since we will
      make block device inode pin the bdi, we want to free the block device
      inode as soon as the device goes away so that bdi does not stay around
      unnecessarily. Furthermore we need to avoid issues when new device with
      the same major,minor pair gets created since reusing the bdi structure
      would be rather difficult in this case.
      Unhashing block device inode on gendisk destruction nicely deals with
      these problems. Once last block device inode reference is dropped (which
      may be directly in del_gendisk()), the inode gets evicted. Furthermore if
      the major,minor pair gets reallocated, we are guaranteed to get new
      block device inode even if old block device inode is not yet evicted and
      thus we avoid issues with possible reuse of bdi.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  27. 24 Jan, 2017 1 commit
  28. 24 Dec, 2016 1 commit
  29. 22 Dec, 2016 1 commit
  30. 14 Dec, 2016 1 commit