1. 24 Apr, 2015 1 commit
    • direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Jens Axboe authored
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      After:
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      In other setups, Robert Elliott reported seeing good performance
      improvements.
      The more applications accessing the device, the worse it gets.
      Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
      Cc: Andrew Morton <[email protected]>
      Cc: Christoph Hellwig <[email protected]>
      Cc: Theodore Ts'o <[email protected]>
      Cc: Elliott, Robert (Server Storage) <[email protected]>
      Cc: Al Viro <[email protected]>
      Signed-off-by: Jens Axboe <[email protected]>
      Signed-off-by: Al Viro <[email protected]>
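The effect of DIO_SKIP_DIO_COUNT can be pictured with a small user-space model. This is a sketch in Python, not the kernel code: the `Inode` class, the `do_direct_io()` helper, the lock standing in for the shared atomic, and the numeric flag value are all assumptions made for illustration.

```python
import threading

# Assumption: placeholder value; the kernel defines its own constant.
DIO_SKIP_DIO_COUNT = 0x08


class Inode:
    """Toy stand-in for struct inode with a shared in-flight direct I/O counter."""

    def __init__(self):
        self.i_dio_count = 0
        self._lock = threading.Lock()  # models contention on the shared atomic
        self.counter_updates = 0       # how often the shared state was touched

    def dio_begin(self):
        with self._lock:
            self.i_dio_count += 1
            self.counter_updates += 1

    def dio_end(self):
        with self._lock:
            self.i_dio_count -= 1
            self.counter_updates += 1


def do_direct_io(inode, flags, submit):
    """Run one direct I/O, touching i_dio_count only if the caller needs it."""
    count_io = not (flags & DIO_SKIP_DIO_COUNT)
    if count_io:
        inode.dio_begin()   # file systems: protect against concurrent truncate
    try:
        return submit()
    finally:
        if count_io:
            inode.dio_end()
```

In this model a file system caller passes `flags=0` and pays two updates of the shared counter per I/O, while a block device caller passes `DIO_SKIP_DIO_COUNT` and never touches it.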
  2. 15 Apr, 2015 1 commit
  3. 12 Apr, 2015 5 commits
  4. 26 Mar, 2015 1 commit
  5. 20 Jan, 2015 2 commits
  6. 14 Jan, 2015 1 commit
  7. 17 Nov, 2014 1 commit
    • fs: add freeze_super/thaw_super fs hooks · 48b6bca6
      Benjamin Marzinski authored
      Currently, freezing a filesystem involves calling freeze_super, which locks
      sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it
      hard for gfs2 (and potentially other cluster filesystems) to use the vfs
      freezing code to do freezes on all the cluster nodes.
      In order to communicate that a freeze has been requested, and to make sure
      that only one node is trying to freeze at a time, gfs2 uses a glock
      (sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire
      this lock before calling freeze_super. This means that two nodes can
      attempt to freeze the filesystem by both calling freeze_super, acquiring
      the sb->s_umount lock, and then attempting to grab the cluster glock
      sd_freeze_gl. Only one will succeed, and the other will be stuck in
      freeze_super, making it impossible to finish freezing the node.
      To solve this problem, this patch adds the freeze_super and thaw_super
      hooks.  If a filesystem implements these hooks, they are called instead of
      the vfs freeze_super and thaw_super functions. This means that every
      filesystem that implements these hooks must call the vfs freeze_super and
      thaw_super functions itself within the hook function to make use of the vfs
      freezing code.
      Reviewed-by: Jan Kara <[email protected]>
      Signed-off-by: Benjamin Marzinski <[email protected]>
      Signed-off-by: Steven Whitehouse <[email protected]>
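The dispatch rule described in this commit (use the filesystem's own freeze_super hook if present, otherwise the generic path, with the hook responsible for calling the generic code itself) can be sketched as a toy model. The `SuperBlock` class, the `ops` dictionary, and `cluster_freeze_super()` are hypothetical names for illustration only and do not mirror the kernel's data structures.

```python
class SuperBlock:
    """Toy stand-in for struct super_block; ops maps hook names to callables."""

    def __init__(self, ops=None):
        self.ops = ops or {}
        self.frozen = False
        self.log = []


def vfs_freeze_super(sb):
    """Generic freeze path: run the fs-specific freeze_fs hook, then mark frozen."""
    sb.log.append("vfs freeze_super")
    freeze_fs = sb.ops.get("freeze_fs")
    if freeze_fs:
        freeze_fs(sb)
    sb.frozen = True
    return 0


def freeze(sb):
    """Entry point: prefer the filesystem's freeze_super hook when it exists."""
    hook = sb.ops.get("freeze_super")
    if hook:
        return hook(sb)          # the hook replaces the generic call entirely
    return vfs_freeze_super(sb)


def cluster_freeze_super(sb):
    """Hypothetical cluster-fs hook: take the cluster lock first, then reuse VFS code."""
    sb.log.append("acquire cluster lock")
    return vfs_freeze_super(sb)  # the hook must call the generic code itself
```

The point of the ordering is that the cluster-wide lock is taken before the generic freeze code runs, which is the step that had no hook before this patch.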
  8. 31 Oct, 2014 1 commit
    • Return short read or 0 at end of a raw device, not EIO · b2de525f
      David Jeffery authored
      Author: David Jeffery <[email protected]>
      Changes to the basic direct I/O code have broken the raw driver when reading
      to the end of a raw device.  Instead of returning a short read for a read that
      extends partially beyond the device's end or 0 when at the end of the device,
      these reads now return EIO.
      The raw driver needs the same end of device handling as was added for normal
      block devices.  Using blkdev_read_iter, which has the needed size checks,
      prevents the EIO conditions at the end of the device.
      Signed-off-by: David Jeffery <[email protected]>
      Signed-off-by: Al Viro <[email protected]>
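The expected end-of-device behaviour can be stated as a small size check. The following is a minimal sketch of the semantics only (short read when the request crosses the end, 0 at or past the end); `clamp_read()` is a hypothetical helper, not the code in blkdev_read_iter.

```python
def clamp_read(pos, length, dev_size):
    """Return how many bytes a read of `length` at `pos` should transfer
    on a device of `dev_size` bytes."""
    if pos >= dev_size:
        return 0                        # at or past the end: read returns 0
    return min(length, dev_size - pos)  # crossing the end: short read
```

For an 8192-byte device, a 4096-byte read at offset 6144 would return 2048 bytes, and the same read at offset 8192 would return 0 rather than an error.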
  9. 10 Oct, 2014 1 commit
    • block_dev: implement readpages() to optimize sequential read · 447f05bb
      Akinobu Mita authored
      Sequential read from a block device is expected to be as fast as or faster
      than from a file on a filesystem.  But that is not the case, due to the lack
      of an effective readpages() in the address space operations for block devices.
      This implements readpages() operation for block device by using
      mpage_readpages() which can create multipage BIOs instead of BIOs for each
      page and reduce system CPU time consumption.
      Install 1GB of RAM disk storage:
      	# modprobe scsi_debug dev_size_mb=1024 delay=0
      Sequential read from file on a filesystem:
      	# mkfs.ext4 /dev/$DEV
      	# mount /dev/$DEV /mnt
      	# fio --name=t --size=512m --rw=read --filename=/mnt/file
      	  read : io=524288KB, bw=2133.4MB/s, iops=546133, runt=   240msec
      Sequential read from a block device:
      	# fio --name=t --size=512m --rw=read --filename=/dev/$DEV
      (Without this commit)
      	  read : io=524288KB, bw=1700.2MB/s, iops=435455, runt=   301msec
      (With this commit)
      	  read : io=524288KB, bw=2160.4MB/s, iops=553046, runt=   237msec
      Signed-off-by: Akinobu Mita <[email protected]>
      Cc: Jens Axboe <[email protected]>
      Cc: Alexander Viro <[email protected]>
      Cc: Jeff Moyer <[email protected]>
      Signed-off-by: Andrew Morton <[email protected]>
      Signed-off-by: Linus Torvalds <[email protected]>
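The fio figures quoted in this commit can be cross-checked with a little arithmetic. The sketch below assumes fio's default 4 KiB block size (the commands above do not set one) and treats fio's "MB" as 1024 KB; `fio_rates()` is a hypothetical helper. Small differences in the bandwidth figures are expected because the reported runtimes are rounded to whole milliseconds.

```python
BLOCK_KB = 4  # assumption: fio default block size, not stated in the commit


def fio_rates(io_kb, runtime_ms, block_kb=BLOCK_KB):
    """Return (bandwidth in MB/s, IOPS) for io_kb transferred in runtime_ms."""
    seconds = runtime_ms / 1000
    bw_mb_s = io_kb / 1024 / seconds
    iops = (io_kb / block_kb) / seconds
    return bw_mb_s, iops


# Runtimes in milliseconds, taken from the three fio result lines above.
RUNTIMES_MS = {
    "file on ext4": 240,
    "block device, without commit": 301,
    "block device, with commit": 237,
}

# Ratio of runtimes for the block device case, before vs. after the commit.
SPEEDUP = (RUNTIMES_MS["block device, without commit"]
           / RUNTIMES_MS["block device, with commit"])
```

Under these assumptions the block device read finishes roughly 1.27 times faster with the commit, which puts it on par with the read from the ext4 file.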
  10. 08 Sep, 2014 2 commits
    • bdi: reimplement bdev_inode_switch_bdi() · 018a17bd
      Tejun Heo authored
      A block_device may be attached to different gendisks and thus
      different bdis over time.  bdev_inode_switch_bdi() is used to switch
      the associated bdi.  The function assumes that the inode could be
      dirty and transfers it between bdis if so.  This is a bit nasty in
      that it reaches into bdi internals.
      This patch reimplements the function so that it writes out the inode
      if dirty.  This is a lot simpler and can be implemented without
      exposing bdi internals.
      Signed-off-by: Tejun Heo <[email protected]>
      Cc: Alexander Viro <[email protected]>
      Signed-off-by: Jens Axboe <[email protected]>
    • block, bdi: an active gendisk always has a request_queue associated with it · ff9ea323
      Tejun Heo authored
      bdev_get_queue() returns the request_queue associated with the
      specified block_device.  blk_get_backing_dev_info() makes use of
      bdev_get_queue() to determine the associated bdi given a block_device.
      All the callers of bdev_get_queue() including
      blk_get_backing_dev_info() assume that bdev_get_queue() may return
      NULL and implement NULL handling; however, bdev_get_queue() requires
      the passed in block_device is opened and attached to its gendisk.
      Because an active gendisk always has a valid request_queue associated
      with it, bdev_get_queue() can never return NULL and neither can
      blk_get_backing_dev_info().
      Make it clear that neither of the two functions can return NULL and
      remove NULL handling from all the callers.
      Signed-off-by: Tejun Heo <[email protected]>
      Cc: Chris Mason <[email protected]>
      Cc: Dave Chinner <[email protected]>
      Signed-off-by: Jens Axboe <[email protected]>
  11. 12 Jun, 2014 1 commit
    • ->splice_write() via ->write_iter() · 8d020765
      Al Viro authored
      iter_file_splice_write() - a ->splice_write() instance that gathers the
      pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
      it to ->write_iter().  A bunch of simple cases converted to that...
      [AV: fixed the braino spotted by Cyrill]
      Signed-off-by: Al Viro <[email protected]>
  12. 04 Jun, 2014 1 commit
  13. 06 May, 2014 5 commits
  14. 03 Apr, 2014 2 commits
  15. 02 Apr, 2014 1 commit
  16. 04 Sep, 2013 1 commit
  17. 30 Jul, 2013 1 commit
  18. 09 Jul, 2013 1 commit
  19. 03 Jul, 2013 1 commit
  20. 29 Jun, 2013 1 commit
  21. 28 Jun, 2013 1 commit
    • writeback: Fix periodic writeback after fs mount · a5faeaf9
      Jan Kara authored
      Code in blkdev.c moves a device inode to default_backing_dev_info when
      the last reference to the device is put and moves the device inode back
      to its bdi when the first reference is acquired. This includes moving to
      the wb.b_dirty list if the device inode is dirty. However, the code doesn't
      set up a timer to wake the corresponding flusher thread, and while the
      wb.b_dirty list is non-empty, __mark_inode_dirty() will not set it up either.
      Thus periodic writeback is effectively disabled until a sync(2) call, which
      can lead to unexpected data loss in case of a crash or power failure.
      Fix the problem by setting up a timer for periodic writeback in case we
      add the first dirty inode to wb.b_dirty list in bdev_inode_switch_bdi().
      Reported-by: Bert De Jonghe <[email protected]>
      CC: [email protected] # >= 3.0
      Signed-off-by: Jan Kara <[email protected]>
      Signed-off-by: Jens Axboe <[email protected]>
  22. 08 May, 2013 1 commit
  23. 07 May, 2013 2 commits
  24. 01 May, 2013 1 commit
  25. 29 Apr, 2013 1 commit
  26. 01 Apr, 2013 1 commit
    • loop: prevent bdev freeing while device in use · c1681bf8
      Anatol Pomozov authored
      The lifecycle of struct block_device is defined by its inode (see
      fs/block_dev.c): the block_device is allocated the first time we access
      /dev/loopXX and deallocated in bdev_destroy_inode. When we create the
      device with "losetup /dev/loopXX afile", we want the block_device to stay
      alive until we destroy the loop device with "losetup -d".
      But because we do not hold the /dev/loopXX inode, its reference count drops
      to 0 and the inode/bdev can be destroyed at any moment. This usually happens
      under memory pressure or when the user drops the inode cache (as in the test
      below). When loop_clr_fd() later uses the bdev, we get a use-after-free
      error with the following stack trace:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000280
        loop_clr_fd+0x1f8/0x420 [loop]
        lo_ioctl+0x200/0x7e0 [loop]
        lo_compat_ioctl+0x47/0xe0 [loop]
      To prevent use-after-free we need to grab the device in loop_set_fd()
      and put it later in loop_clr_fd().
      The issue is reproducible on current Linus head and v3.3. Here is the test:
        dd if=/dev/zero of=loop.file bs=1M count=1
        while [ true ]; do
          losetup /dev/loop0 loop.file
          echo 2 > /proc/sys/vm/drop_caches
          losetup -d /dev/loop0
        done
      [ Doing bdgrab/bdput in loop_set_fd/loop_clr_fd is safe, because every
        time we call loop_set_fd() we check that loop_device->lo_state is
        Lo_unbound and set it to Lo_bound. If somebody tries to set_fd again,
        they will get EBUSY.  And if we try to loop_clr_fd() on an unbound loop
        device, we'll get ENXIO.
        loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under
        loop_device->lo_ctl_mutex. ]
      Signed-off-by: Anatol Pomozov <[email protected]>
      Cc: Al Viro <[email protected]>
      Signed-off-by: Linus Torvalds <[email protected]>
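The grab-on-bind, put-on-unbind pattern described in this commit can be modelled in a few lines. This is a user-space sketch, not the loop driver: the `BlockDevice` and `LoopDevice` classes and the `shrink_cache()` method (standing in for inode cache reclaim) are hypothetical, and only the state names and error codes come from the commit message.

```python
import errno


class BlockDevice:
    """Toy block_device whose lifetime follows a reference count."""

    def __init__(self):
        self.refs = 0
        self.freed = False

    def grab(self):
        self.refs += 1

    def put(self):
        self.refs -= 1

    def shrink_cache(self):
        """Models cache reclaim: frees the object only when unreferenced."""
        if self.refs == 0:
            self.freed = True


class LoopDevice:
    def __init__(self, bdev):
        self.bdev = bdev
        self.state = "Lo_unbound"

    def set_fd(self):
        if self.state != "Lo_unbound":
            raise OSError(errno.EBUSY, "loop device already bound")
        self.bdev.grab()            # pin the bdev for as long as we are bound
        self.state = "Lo_bound"

    def clr_fd(self):
        if self.state != "Lo_bound":
            raise OSError(errno.ENXIO, "loop device not bound")
        assert not self.bdev.freed  # safe: our reference kept it alive
        self.state = "Lo_unbound"
        self.bdev.put()             # drop the reference after the last use
```

Because binding is rejected unless the state is Lo_unbound and unbinding is rejected unless it is Lo_bound, every grab in this model is matched by exactly one put.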
  27. 23 Feb, 2013 1 commit
  28. 22 Feb, 2013 1 commit