1. 24 Apr, 2015 1 commit
    • direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Jens Axboe authored
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
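
      The effect of the flag can be sketched in user-space C. This is only a
      model, not the kernel code: struct inode_model and model_direct_io() are
      hypothetical names, and the flag value is assumed for illustration.

      ```c
      #include <stdatomic.h>

      /* Assumed flag value for illustration; check the kernel headers for the real one. */
      #define DIO_SKIP_DIO_COUNT 0x08

      /* Hypothetical stand-in for the part of struct inode that matters here. */
      struct inode_model {
          atomic_int i_dio_count;
      };

      /*
       * Model of one direct I/O request. Returns how many atomic operations
       * touched the shared counter, which is the cost the flag avoids.
       */
      static int model_direct_io(struct inode_model *inode, int flags)
      {
          int atomics = 0;

          if (!(flags & DIO_SKIP_DIO_COUNT)) {
              atomic_fetch_add(&inode->i_dio_count, 1);
              atomics++;
          }

          /* ... the I/O would be submitted and completed here ... */

          if (!(flags & DIO_SKIP_DIO_COUNT)) {
              atomic_fetch_sub(&inode->i_dio_count, 1);
              atomics++;
          }
          return atomics;
      }
      ```

      In this model a caller that passes the flag never touches the shared
      counter, so concurrent submitters no longer contend on it.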
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. 15 Apr, 2015 1 commit
  3. 12 Apr, 2015 5 commits
  4. 26 Mar, 2015 1 commit
  5. 20 Jan, 2015 2 commits
  6. 14 Jan, 2015 1 commit
  7. 17 Nov, 2014 1 commit
    • fs: add freeze_super/thaw_super fs hooks · 48b6bca6
      Benjamin Marzinski authored
      Currently, freezing a filesystem involves calling freeze_super, which locks
      sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it
      hard for gfs2 (and potentially other cluster filesystems) to use the vfs
      freezing code to do freezes on all the cluster nodes.
      
      In order to communicate that a freeze has been requested, and to make sure
      that only one node is trying to freeze at a time, gfs2 uses a glock
      (sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire
      this lock before calling freeze_super. This means that two nodes can
      attempt to freeze the filesystem by both calling freeze_super, acquiring
      the sb->s_umount lock, and then attempting to grab the cluster glock
      sd_freeze_gl. Only one will succeed, and the other will be stuck in
      freeze_super, making it impossible to finish freezing the node.
      
      To solve this problem, this patch adds the freeze_super and thaw_super
      hooks.  If a filesystem implements these hooks, they are called instead of
      the vfs freeze_super and thaw_super functions. This means that every
      filesystem that implements these hooks must call the vfs freeze_super and
      thaw_super functions itself within the hook function to make use of the vfs
      freezing code.
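
      The dispatch described above can be sketched as follows. This is a
      user-space model under stated assumptions: every type and function name
      ending in _model, and cluster_freeze_hook(), are hypothetical and only
      illustrate the "hook replaces the vfs call, then calls it itself" pattern.

      ```c
      #include <stddef.h>

      struct super_block_model;

      /* Hypothetical operations table with an optional freeze hook. */
      struct super_ops_model {
          int (*freeze_super)(struct super_block_model *sb);
      };

      struct super_block_model {
          const struct super_ops_model *s_op;
          int frozen;             /* set by the generic freeze path */
          int cluster_lock_held;  /* set by the cluster filesystem hook */
      };

      /* Stand-in for the generic vfs freeze_super(). */
      static int vfs_freeze_model(struct super_block_model *sb)
      {
          sb->frozen = 1;
          return 0;
      }

      /* A cluster filesystem hook: take the cluster lock first, then freeze. */
      static int cluster_freeze_hook(struct super_block_model *sb)
      {
          sb->cluster_lock_held = 1;
          return vfs_freeze_model(sb);
      }

      /* Caller side: use the hook if the filesystem provides one. */
      static int do_freeze_model(struct super_block_model *sb)
      {
          if (sb->s_op && sb->s_op->freeze_super)
              return sb->s_op->freeze_super(sb);
          return vfs_freeze_model(sb);
      }
      ```

      The point of the ordering in cluster_freeze_hook() is that the cluster
      lock is taken before any local freeze state is touched.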
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  8. 31 Oct, 2014 1 commit
    • Return short read or 0 at end of a raw device, not EIO · b2de525f
      David Jeffery authored
      Author: David Jeffery <djeffery@redhat.com>
      Changes to the basic direct I/O code have broken the raw driver when reading
      to the end of a raw device.  Instead of returning a short read for a read that
      extends partially beyond the device's end or 0 when at the end of the device,
      these reads now return EIO.
      
      The raw driver needs the same end of device handling as was added for normal
      block devices.  Using blkdev_read_iter, which has the needed size checks,
      prevents the EIO conditions at the end of the device.
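
      The end-of-device behaviour can be illustrated with a small helper. This
      is a sketch of the size check only, not the code in blkdev_read_iter;
      clamp_read_to_device() is a hypothetical name.

      ```c
      #include <stddef.h>
      #include <stdint.h>

      /*
       * Number of bytes a read of `count` bytes at offset `pos` may transfer
       * on a device of `dev_size` bytes: 0 at or past the end, a short count
       * when the request extends beyond the end, the full count otherwise.
       */
      static size_t clamp_read_to_device(uint64_t dev_size, uint64_t pos,
                                         size_t count)
      {
          uint64_t left;

          if (pos >= dev_size)
              return 0;

          left = dev_size - pos;
          return count < left ? count : (size_t)left;
      }
      ```

      For a 4096-byte device, a 1024-byte read at offset 3584 yields 512 bytes
      and a read at offset 4096 yields 0, rather than an error.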
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  9. 10 Oct, 2014 1 commit
    • block_dev: implement readpages() to optimize sequential read · 447f05bb
      Akinobu Mita authored
      A sequential read from a block device is expected to be as fast as or
      faster than a read from a file on a filesystem.  In practice it is not,
      because the address space operations for block devices lack an effective
      readpages().

      This implements the readpages() operation for block devices using
      mpage_readpages(), which can create multipage BIOs instead of one BIO per
      page and thereby reduces system CPU time consumption.
      
      Install 1GB of RAM disk storage:
      
      	# modprobe scsi_debug dev_size_mb=1024 delay=0
      
      Sequential read from file on a filesystem:
      
      	# mkfs.ext4 /dev/$DEV
      	# mount /dev/$DEV /mnt
      	# fio --name=t --size=512m --rw=read --filename=/mnt/file
      	...
      	  read : io=524288KB, bw=2133.4MB/s, iops=546133, runt=   240msec
      
      Sequential read from a block device:

      	# fio --name=t --size=512m --rw=read --filename=/dev/$DEV
      	...
      (Without this commit)
      	  read : io=524288KB, bw=1700.2MB/s, iops=435455, runt=   301msec
      
      (With this commit)
      	  read : io=524288KB, bw=2160.4MB/s, iops=553046, runt=   237msec
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 08 Sep, 2014 2 commits
    • bdi: reimplement bdev_inode_switch_bdi() · 018a17bd
      Tejun Heo authored
      A block_device may be attached to different gendisks and thus
      different bdis over time.  bdev_inode_switch_bdi() is used to switch
      the associated bdi.  The function assumes that the inode could be
      dirty and transfers it between bdis if so.  This is a bit nasty in
      that it reaches into bdi internals.
      
      This patch reimplements the function so that it writes out the inode
      if dirty.  This is a lot simpler and can be implemented without
      exposing bdi internals.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bdi: an active gendisk always has a request_queue associated with it · ff9ea323
      Tejun Heo authored
      bdev_get_queue() returns the request_queue associated with the
      specified block_device.  blk_get_backing_dev_info() makes use of
      bdev_get_queue() to determine the associated bdi given a block_device.
      
      All the callers of bdev_get_queue() including
      blk_get_backing_dev_info() assume that bdev_get_queue() may return
      NULL and implement NULL handling; however, bdev_get_queue() requires
      that the passed-in block_device is opened and attached to its gendisk.
      Because an active gendisk always has a valid request_queue associated
      with it, bdev_get_queue() can never return NULL and neither can
      blk_get_backing_dev_info().
      
      Make it clear that neither of the two functions can return NULL and
      remove NULL handling from all the callers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 12 Jun, 2014 1 commit
    • ->splice_write() via ->write_iter() · 8d020765
      Al Viro authored
      iter_file_splice_write() - a ->splice_write() instance that gathers the
      pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
      it to ->write_iter().  A bunch of simple cases converted to that...
      
      [AV: fixed the braino spotted by Cyrill]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  12. 04 Jun, 2014 1 commit
  13. 06 May, 2014 5 commits
  14. 03 Apr, 2014 2 commits
  15. 02 Apr, 2014 1 commit
  16. 04 Sep, 2013 1 commit
  17. 30 Jul, 2013 1 commit
    • aio: Kill aio_rw_vect_retry() · 73a7075e
      Kent Overstreet authored
      This code doesn't serve any purpose anymore, since the aio retry
      infrastructure has been removed.
      
      This change should be safe because aio_read/write are also used for
      synchronous IO, and called from do_sync_read()/do_sync_write() - and
      there's no looping done in the sync case (the read and write syscalls).
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
  18. 09 Jul, 2013 1 commit
    • writeback: Do not sort b_io list only because of block device inode · a8855990
      Jan Kara authored
      It is very likely that the block device inode will be part of the BDI dirty
      list as well. However, it doesn't make sense to sort inodes on the b_io list
      just because of this inode (as it contains buffers all over the device
      anyway). So save some CPU cycles, which is valuable since we hold the
      relatively contended wb->list_lock.
      Signed-off-by: Jan Kara <jack@suse.cz>
  19. 03 Jul, 2013 1 commit
    • mm: vmscan: take page buffers dirty and locked state into account · b4597226
      Mel Gorman authored
      Page reclaim keeps track of dirty and under writeback pages and uses it
      to determine if wait_iff_congested() should stall or if kswapd should
      begin writing back pages.  This fails to account for buffer pages that
      can be under writeback but not PageWriteback which is the case for
      filesystems like ext3 ordered mode.  Furthermore, PageDirty buffer pages
      can have all the buffers clean and writepage does no IO so it should not
      be accounted as congested.
      
      This patch adds an address_space operation that filesystems may
      optionally use to check if a page is really dirty or really under
      writeback.  An implementation for buffer_heads is added
      and used for block operations and ext3 in ordered mode.  By default the
      page flags are obeyed.
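
      The kind of check such an operation performs can be modelled in
      user-space C. This is a sketch, not the kernel implementation: all names
      are hypothetical, and treating a locked buffer as "under writeback" is an
      assumption of the model.

      ```c
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical per-buffer state. */
      struct buffer_model {
          bool dirty;
          bool locked;
      };

      /* Hypothetical page: its flags plus optional attached buffers. */
      struct page_model {
          bool page_dirty;
          bool page_writeback;
          const struct buffer_model *bufs; /* NULL when no buffers are attached */
          size_t nr_bufs;
      };

      static void page_check_dirty_writeback_model(const struct page_model *page,
                                                   bool *dirty, bool *writeback)
      {
          size_t i;

          /* Default: obey the page flags. */
          *dirty = page->page_dirty;
          *writeback = page->page_writeback;

          if (page->bufs == NULL || page->nr_bufs == 0)
              return;

          /* With buffers attached, dirtiness comes from the buffers themselves. */
          *dirty = false;
          for (i = 0; i < page->nr_bufs; i++) {
              if (page->bufs[i].locked)
                  *writeback = true;
              if (page->bufs[i].dirty)
                  *dirty = true;
          }
      }
      ```

      In this model a page flagged dirty whose buffers are all clean reports
      not dirty, and a page with a locked buffer reports writeback even if the
      page-level writeback flag is clear.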
      
      Credit goes to Jan Kara for identifying that the page flags alone are
      not sufficient for ext3 and sanity checking a number of ideas on how the
      problem could be addressed.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 29 Jun, 2013 1 commit
  21. 28 Jun, 2013 1 commit
    • writeback: Fix periodic writeback after fs mount · a5faeaf9
      Jan Kara authored
      Code in blkdev.c moves a device inode to default_backing_dev_info when
      the last reference to the device is put and moves the device inode back
      to its bdi when the first reference is acquired. This includes moving to
      wb.b_dirty list if the device inode is dirty. The code however doesn't
      set up a timer to wake the corresponding flusher thread, and while the
      wb.b_dirty list is non-empty __mark_inode_dirty() will not set it up
      either. Thus periodic writeback is effectively disabled until a sync(2)
      call, which can lead to unexpected data loss in case of a crash or power
      failure.
      
      Fix the problem by setting up a timer for periodic writeback in case we
      add the first dirty inode to wb.b_dirty list in bdev_inode_switch_bdi().
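
      The rule the fix enforces can be written as a tiny model. The names
      below are hypothetical and the timer is reduced to a boolean; the only
      point is that arming happens on the empty-to-non-empty transition.

      ```c
      #include <stdbool.h>

      /* Hypothetical model of a writeback context. */
      struct wb_model {
          int nr_dirty;      /* number of inodes on the dirty list */
          bool timer_armed;  /* whether periodic writeback is scheduled */
      };

      /* Add one dirty inode; arm the timer if the list was empty before. */
      static void wb_add_dirty_inode_model(struct wb_model *wb)
      {
          bool was_empty = (wb->nr_dirty == 0);

          wb->nr_dirty++;
          if (was_empty)
              wb->timer_armed = true;
      }
      ```

      Any path that can make the list non-empty has to apply this rule,
      otherwise later additions see a non-empty list and never arm the timer.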
      Reported-by: Bert De Jonghe <Bert.DeJonghe@amplidata.com>
      CC: stable@vger.kernel.org # >= 3.0
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  22. 08 May, 2013 1 commit
  23. 07 May, 2013 2 commits
  24. 01 May, 2013 1 commit
  25. 29 Apr, 2013 1 commit
  26. 01 Apr, 2013 1 commit
    • loop: prevent bdev freeing while device in use · c1681bf8
      Anatol Pomozov authored
      The lifecycle of struct block_device is defined by its inode (see
      fs/block_dev.c): the block_device is allocated the first time we access
      /dev/loopXX and deallocated in bdev_destroy_inode. When we create the
      device with "losetup /dev/loopXX afile", we want that block_device to stay
      alive until we destroy the loop device with "losetup -d".
      
      But because we do not hold a reference to the /dev/loopXX inode, its
      reference count drops to 0 and the inode/bdev can be destroyed at any
      moment. Usually this happens under memory pressure or when the user drops
      the inode cache (as in the test below). When loop_clr_fd() later uses the
      bdev, we get a use-after-free error with the following stack:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000280
        bd_set_size+0x10/0xa0
        loop_clr_fd+0x1f8/0x420 [loop]
        lo_ioctl+0x200/0x7e0 [loop]
        lo_compat_ioctl+0x47/0xe0 [loop]
        compat_blkdev_ioctl+0x341/0x1290
        do_filp_open+0x42/0xa0
        compat_sys_ioctl+0xc1/0xf20
        do_sys_open+0x16e/0x1d0
        sysenter_dispatch+0x7/0x1a
      
      To prevent use-after-free we need to grab the device in loop_set_fd()
      and put it later in loop_clr_fd().
      
      The issue is reproducible on current Linus head and v3.3. Here is the test:
      
        dd if=/dev/zero of=loop.file bs=1M count=1
        while [ true ]; do
          losetup /dev/loop0 loop.file
          echo 2 > /proc/sys/vm/drop_caches
          losetup -d /dev/loop0
        done
      
      [ Doing bdgrab/bdput in loop_set_fd/loop_clr_fd is safe, because every
        time we call loop_set_fd() we check that loop_device->lo_state is
        Lo_unbound and set it to Lo_bound.  If somebody tries to set_fd again,
        it will get EBUSY.  And if we try to loop_clr_fd() on an unbound loop
        device, we'll get ENXIO.
      
        loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under
        loop_device->lo_ctl_mutex. ]
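
      The pairing argument in the note above can be checked with a small
      user-space model. This is a sketch with hypothetical names: the
      reference count is a plain integer standing in for the grab and put
      calls, and locking is left out.

      ```c
      #include <errno.h>
      #include <stddef.h>

      enum lo_state_model { LO_UNBOUND_MODEL, LO_BOUND_MODEL };

      /* Hypothetical block device with a bare reference count. */
      struct bdev_model {
          int refcount;
      };

      struct loop_model {
          enum lo_state_model state;
          struct bdev_model *bdev;
      };

      /* Bind: only allowed when unbound; takes one reference on the bdev. */
      static int loop_set_fd_model(struct loop_model *lo, struct bdev_model *bdev)
      {
          if (lo->state != LO_UNBOUND_MODEL)
              return -EBUSY;

          bdev->refcount++;           /* models grabbing the device */
          lo->bdev = bdev;
          lo->state = LO_BOUND_MODEL;
          return 0;
      }

      /* Unbind: only allowed when bound; drops the reference taken at bind. */
      static int loop_clr_fd_model(struct loop_model *lo)
      {
          if (lo->state != LO_BOUND_MODEL)
              return -ENXIO;

          lo->bdev->refcount--;       /* models putting the device */
          lo->bdev = NULL;
          lo->state = LO_UNBOUND_MODEL;
          return 0;
      }
      ```

      Because the state check rejects a second bind and an unbind without a
      prior bind, each reference taken is dropped exactly once in this model.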
      Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 23 Feb, 2013 1 commit
  28. 22 Feb, 2013 1 commit