1. 21 Oct, 2014 2 commits
    • fs: clarify rate limit suppressed buffer I/O errors · 432f16e6
      Robert Elliott authored
      When quiet_error applies rate limiting to buffer_io_error calls, it is
      unclear what the suppressed callbacks refer to, because the name is so
      generic, particularly if the messages are interleaved with others:
      
      [ 1936.063572] quiet_error: 664293 callbacks suppressed
      [ 1936.065297] Buffer I/O error on dev sdr, logical block 257429952, lost async page write
      [ 1936.067814] Buffer I/O error on dev sdr, logical block 257429953, lost async page write
      
      Also, the function uses printk_ratelimit(), although printk.h includes a
      comment advising "Please don't use... Instead use printk_ratelimited()."
      
      Change buffer_io_error to check the BH_Quiet bit itself, drop the
      printk_ratelimit call, and print using printk_ratelimited.
      
      This makes the messages look like:
      
      [  387.208839] buffer_io_error: 676394 callbacks suppressed
      [  387.210693] Buffer I/O error on dev sdr, logical block 211291776, lost async page write
      [  387.213432] Buffer I/O error on dev sdr, logical block 211291777, lost async page write
      Signed-off-by: Robert Elliott <[email protected]>
      Reviewed-by: Webb Scales <[email protected]>
      Signed-off-by: Jens Axboe <[email protected]>
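A minimal userspace sketch of the behaviour described in 432f16e6: the error routine checks the quiet flag itself and then asks a per-call-site rate limiter whether to print, so the "callbacks suppressed" line carries the name of the function that was actually throttled. The `ratelimit` struct, `ratelimit_ok()`, `buffer_io_error_sketch()` and the explicit `now` tick argument are assumptions made for illustration; this is not the kernel's printk_ratelimited() implementation.

```c
#include <stdio.h>

/* Hypothetical stand-in for a per-call-site rate-limit state. */
struct ratelimit {
    long interval;  /* window length, in caller-defined ticks */
    int burst;      /* messages allowed per window */
    long begin;     /* tick at which the current window started */
    int printed;    /* messages emitted in the current window */
    int missed;     /* messages dropped in the current window */
};

/*
 * Return 1 if the caller may print at tick `now`, 0 if the message is
 * suppressed. When a new window opens, report how many calls were
 * dropped, tagged with the caller's name.
 */
int ratelimit_ok(struct ratelimit *rs, const char *caller, long now)
{
    if (now - rs->begin >= rs->interval) {
        if (rs->missed)
            printf("%s: %d callbacks suppressed\n", caller, rs->missed);
        rs->begin = now;
        rs->printed = 0;
        rs->missed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return 1;
    }
    rs->missed++;
    return 0;
}

/* Skip quiet buffers entirely; otherwise print subject to the limiter. */
int buffer_io_error_sketch(struct ratelimit *rs, int quiet, const char *dev,
                           unsigned long long block, const char *msg, long now)
{
    if (quiet)
        return 0;
    if (!ratelimit_ok(rs, __func__, now))
        return 0;
    printf("Buffer I/O error on dev %s, logical block %llu, %s\n",
           dev, block, msg);
    return 1;
}
```

Because the limiter is consulted from inside the error routine, the suppression notice is prefixed with that routine's name rather than a generic helper's.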
    • fs: merge I/O error prints into one line · b744c2ac
      Robert Elliott authored
      buffer.c uses two printk calls to print these messages:
      [67353.422338] Buffer I/O error on device sdr, logical block 212868488
      [67353.422338] lost page write due to I/O error on sdr
      
      In a busy system, they may be interleaved with other prints,
      losing the context for the second message.  Merge them into
      one line with one printk call so the prints are atomic.
      
      Also, differentiate between async page writes, sync page writes, and
      async page reads.
      
      Also, shorten "device" to "dev" to match the block layer prints:
      [67353.467906] blk_update_request: critical target error, dev sdr, sector 1707107328
      
      Also, use %llu rather than %Lu.
      
      Resulting prints look like:
      [ 1356.437006] blk_update_request: critical target error, dev sdr, sector 1719693992
      [ 1361.383522] quiet_error: 659876 callbacks suppressed
      [ 1361.385816] Buffer I/O error on dev sdr, logical block 256902912, lost async page write
      [ 1361.385819] Buffer I/O error on dev sdr, logical block 256903644, lost async page write
      Signed-off-by: Robert Elliott <[email protected]>
      Reviewed-by: Webb Scales <[email protected]>
      Signed-off-by: Jens Axboe <[email protected]>
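As a rough illustration of the merged print in b744c2ac, the whole message can be built with a single format string so that one call emits one complete line. The function name `format_buffer_io_error()` and its parameters are hypothetical; only the message layout and the use of %llu come from the commit text.

```c
#include <stdio.h>

/*
 * Format the device, block number and failure kind into one line so it
 * can be emitted with a single print call. Returns what snprintf()
 * returns: the length the full message needs, excluding the NUL.
 */
int format_buffer_io_error(char *out, size_t len, const char *dev,
                           unsigned long long block, const char *what)
{
    return snprintf(out, len,
                    "Buffer I/O error on dev %s, logical block %llu, %s",
                    dev, block, what);
}
```

Passing "lost async page write", "lost sync page write" or "async page read" as `what` gives the three variants the commit distinguishes.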
  2. 14 Oct, 2014 1 commit
    • fs: check bh blocknr earlier when searching lru · 9470dd5d
      Zach Brown authored
      It's very common for the buffer heads in the lru to have different block
      numbers.  By comparing the blocknr before the bdev and size we can
      reduce the cost of searching in the very common case where all the
      entries have the same bdev and size.
      
      In quick hot cache cycle counting tests on a single fs workstation this
      cut the cost of a miss by about 20%.
      
      A diff of the disassembly shows the reordering of the bdev and blocknr
      comparisons.  This is in such a tiny loop that skipping one comparison
      is a meaningful portion of the total work being done:
      
           1628:      83 c1 01                add    $0x1,%ecx
           162b:      83 f9 08                cmp    $0x8,%ecx
           162e:      74 60                   je     1690 <__find_get_block+0xa0>
           1630:      89 c8                   mov    %ecx,%eax
           1632:      65 4c 8b 04 c5 00 00    mov    %gs:0x0(,%rax,8),%r8
           1639:      00 00
           163b:      4d 85 c0                test   %r8,%r8
           163e:      4c 89 c3                mov    %r8,%rbx
           1641:      74 e5                   je     1628 <__find_get_block+0x38>
      -    1643:      4d 3b 68 30             cmp    0x30(%r8),%r13
      +    1643:      4d 3b 68 18             cmp    0x18(%r8),%r13
           1647:      75 df                   jne    1628 <__find_get_block+0x38>
      -    1649:      4d 3b 60 18             cmp    0x18(%r8),%r12
      +    1649:      4d 3b 60 30             cmp    0x30(%r8),%r12
           164d:      75 d9                   jne    1628 <__find_get_block+0x38>
           164f:      49 39 50 20             cmp    %rdx,0x20(%r8)
           1653:      75 d3                   jne    1628 <__find_get_block+0x38>
      Signed-off-by: Zach Brown <[email protected]>
      Cc: Al Viro <[email protected]>
      Signed-off-by: Andrew Morton <[email protected]>
      Signed-off-by: Linus Torvalds <[email protected]>
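The reordering in 9470dd5d can be sketched in plain C as below. The struct `bh_sketch` and `lru_lookup()` are hypothetical simplifications, and the array size of 8 is taken only from the `cmp $0x8,%ecx` bound in the disassembly. The point is the comparison order: the field most likely to differ is tested first, so a miss usually costs one comparison per entry.

```c
#include <stddef.h>
#include <stdint.h>

#define LRU_SIZE 8  /* assumption, read off the cmp $0x8 loop bound */

/* Hypothetical, trimmed-down buffer head: only the fields compared. */
struct bh_sketch {
    uint64_t blocknr;
    const void *bdev;
    size_t size;
};

/*
 * Scan the LRU array for a matching entry. blocknr is compared first
 * because entries usually share bdev and size but differ in blocknr,
 * so the && chain short-circuits after one test on most misses.
 */
struct bh_sketch *lru_lookup(struct bh_sketch *lru[LRU_SIZE],
                             const void *bdev, uint64_t blocknr, size_t size)
{
    for (int i = 0; i < LRU_SIZE; i++) {
        struct bh_sketch *bh = lru[i];

        if (bh && bh->blocknr == blocknr &&
            bh->bdev == bdev && bh->size == size)
            return bh;
    }
    return NULL;
}
```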
  3. 10 Oct, 2014 3 commits
  4. 09 Oct, 2014 1 commit
    • fs: make cont_expand_zero interruptible · c2ca0fcd
      Mikulas Patocka authored
      This patch makes it possible to kill a process looping in
      cont_expand_zero. A process may spend a lot of time in this function, so
      it is desirable to be able to kill it.
      
      It happened to me that I wanted to copy a piece of data from the disk to a
      file. By mistake, I used the "seek" parameter to dd instead of "skip". Due
      to the "seek" parameter, dd attempted to extend the file and became stuck
      doing so - the only possibility was to reset the machine or wait many
      hours until the filesystem runs out of space and cont_expand_zero fails.
      We need this patch to be able to terminate the process.
      Signed-off-by: Mikulas Patocka <[email protected]>
      Cc: [email protected]
      Signed-off-by: Al Viro <[email protected]>
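The idea in c2ca0fcd, checking for a pending kill request on every iteration of a long zero-filling loop, can be sketched in userspace as follows. The flag `kill_requested` (which a real program would set from a signal handler), the function `expand_zero_sketch()` and its chunked memset are all assumptions for illustration; the kernel patch operates on page cache pages instead.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for "a fatal signal is pending". */
volatile sig_atomic_t kill_requested;

/*
 * Zero `total` bytes of `buf` in steps of `chunk` bytes (chunk must be
 * non-zero), checking for a kill request before each step. Returns 0 on
 * completion or -EINTR if interrupted; *done reports the bytes zeroed.
 */
int expand_zero_sketch(unsigned char *buf, size_t total, size_t chunk,
                       size_t *done)
{
    size_t off = 0;

    while (off < total) {
        size_t len = total - off < chunk ? total - off : chunk;

        if (kill_requested) {
            *done = off;
            return -EINTR;
        }
        memset(buf + off, 0, len);
        off += len;
    }
    *done = off;
    return 0;
}
```

Without the check, the only ways out of the loop are completion or an error from the underlying operation, which is the situation the commit message describes.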
  5. 02 Oct, 2014 1 commit
    • vfs: fix data corruption when blocksize < pagesize for mmaped data · 90a80202
      Jan Kara authored
      ->page_mkwrite() is used by filesystems to allocate blocks under a page
      which is becoming writeably mmapped in some process' address space. This
      allows a filesystem to return a page fault if there is not enough space
      available, user exceeds quota or similar problem happens, rather than
      silently discarding data later when writepage is called.
      
      However VFS fails to call ->page_mkwrite() in all the cases where
      filesystems need it when blocksize < pagesize. For example when
      blocksize = 1024, pagesize = 4096 the following is problematic:
        ftruncate(fd, 0);
        pwrite(fd, buf, 1024, 0);
        map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
        map[0] = 'a';           /* page_mkwrite() for index 0 is called */
        ftruncate(fd, 10000);   /* or even pwrite(fd, buf, 1, 10000) */
        mremap(map, 1024, 10000, 0);
        map[4095] = 'a';        /* no page_mkwrite() called */
      
      At the moment ->page_mkwrite() is called, filesystem can allocate only
      one block for the page because i_size == 1024. Otherwise it would create
      blocks beyond i_size which is generally undesirable. But later at
      ->writepage() time, we also need to store data at offset 4095 but we
      don't have block allocated for it.
      
      This patch introduces a helper function filesystems can use to have
      ->page_mkwrite() called at all the necessary moments.
      Signed-off-by: Jan Kara <[email protected]>
      Signed-off-by: Theodore Ts'o <[email protected]>
      Cc: [email protected]
  6. 22 Sep, 2014 1 commit
    • Fix nasty 32-bit overflow bug in buffer i/o code. · f2d5a944
      Anton Altaparmakov authored
      On 32-bit architectures, the legacy buffer_head functions are not always
      handling the sector number with the proper 64-bit types, and will thus
      fail on 4TB+ disks.
      
      Any code that uses __getblk() (and thus bread(), breadahead(),
      sb_bread(), sb_breadahead(), sb_getblk()), and calls it using a 64-bit
      block on a 32-bit arch (where "long" is 32-bit) causes an infinite loop
      in __getblk_slow() with an infinite stream of errors logged to dmesg
      like this:
      
        __find_get_block_slow() failed. block=6740375944, b_blocknr=2445408648
        b_state=0x00000020, b_size=512
        device sda1 blocksize: 512
      
      Note how in hex block is 0x191C1F988 and b_blocknr is 0x91C1F988 i.e. the
      top 32-bits are missing (in this case the 0x1 at the top).
      
      This is because grow_dev_page() is broken and has a 32-bit overflow: it
      computes the block number by left-shifting the page index value (a
      pgoff_t, which is just 32 bits on 32-bit architectures), and the top
      bits get lost because the pgoff_t is not type cast to sector_t / 64-bit
      before the shift.
      
      This patch fixes this issue by type casting "index" to sector_t before
      doing the left shift.
      
      Note this is not a theoretical bug but has been seen in the field on a
      4TiB hard drive with logical sector size 512 bytes.
      
      This patch has been verified to fix the infinite loop problem on 3.17-rc5
      kernel using a 4TB disk image mounted using "-o loop".  Without this patch
      doing a "find /nt" where /nt is an NTFS volume causes the infinite loop
      100% reproducibly whilst with the patch it works fine as expected.
      Signed-off-by: Anton Altaparmakov <[email protected]>
      Cc: [email protected]
      Signed-off-by: Linus Torvalds <[email protected]>
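The truncation in f2d5a944 can be reproduced with the numbers from the log above. This sketch assumes 512-byte blocks on 4096-byte pages, so the shift is 3 bits, and uses `uint32_t` to stand in for a 32-bit pgoff_t; the function names are hypothetical.

```c
#include <stdint.h>

#define SIZEBITS 3u  /* assumed: log2(4096 / 512) */

/* Buggy form: the shift happens in 32 bits, so the top bits are lost. */
uint64_t block_from_index_truncated(uint32_t index)
{
    return index << SIZEBITS;
}

/* Fixed form: widen to 64 bits before shifting. */
uint64_t block_from_index_widened(uint32_t index)
{
    return (uint64_t)index << SIZEBITS;
}
```

For page index 842546993 (block 6740375944 divided by 8), the truncated form yields 2445408648 (0x91C1F988) and the widened form yields 6740375944 (0x191C1F988), matching the block and b_blocknr values in the error message.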
  7. 05 Sep, 2014 1 commit
  8. 16 Jul, 2014 1 commit
    • sched: Remove proliferation of wait_on_bit() action functions · 74316201
      NeilBrown authored
      The current "wait_on_bit" interface requires an 'action'
      function to be provided which does the actual waiting.
      There are over 20 such functions, many of them identical.
      Most cases can be satisfied by one of just two functions, one
      which uses io_schedule() and one which just uses schedule().
      
      So:
       Rename wait_on_bit and        wait_on_bit_lock to
              wait_on_bit_action and wait_on_bit_lock_action
       to make it explicit that they need an action function.
      
       Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
       which are *not* given an action function but implicitly use
       a standard one.
       The decision to error-out if a signal is pending is now made
       based on the 'mode' argument rather than being encoded in the action
       function.
      
       All instances of the old wait_on_bit and wait_on_bit_lock which
       can use the new version have been changed accordingly and their
       action functions have been discarded.
       wait_on_bit{,_lock} does not return any specific error code in the
       event of a signal so the caller must check for non-zero and
       substitute their own error code as appropriate.
      
      The wait_on_bit() call in __fscache_wait_on_invalidate() was
      ambiguous as it specified TASK_UNINTERRUPTIBLE but used
      fscache_wait_bit_interruptible as an action function.
      David Howells confirms this should be uniformly
      "uninterruptible"
      
      The main remaining user of wait_on_bit{,_lock}_action is NFS
      which needs to use a freezer-aware schedule() call.
      
      A comment in fs/gfs2/glock.c notes that having multiple 'action'
      functions is useful as they display differently in the 'wchan'
      field of 'ps'. (and /proc/$PID/wchan).
      As the new bit_wait{,_io} functions are tagged "__sched", they
      will not show up at all; something higher in the stack will be
      shown instead.  So the distinction will still be visible, only with
      different function names (gfs2_glock_wait versus gfs2_glock_dq_wait
      in the gfs2/glock.c case).
      
      Since the first version of this patch (against 3.15) two new action
      functions appeared, one in NFS and one in CIFS.  CIFS also now
      uses an action function that makes the same freezer aware
      schedule call as NFS.
      Signed-off-by: NeilBrown <[email protected]>
      Acked-by: David Howells <[email protected]> (fscache, keys)
      Acked-by: Steven Whitehouse <[email protected]> (gfs2)
      Acked-by: Peter Zijlstra <[email protected]>
      Cc: Oleg Nesterov <[email protected]>
      Cc: Steve French <[email protected]>
      Cc: Linus Torvalds <[email protected]>
      Link: http://lkml.kernel.org/r/[email protected]
      Signed-off-by: Ingo Molnar <[email protected]>
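A single-threaded toy model of the calling convention 74316201 introduces: the wait call takes a mode instead of an action function, returns non-zero if a signal cut the wait short, and leaves the choice of error code to the caller. Everything here (`task_sketch`, the MODE_* constants, the function names) is hypothetical, and the "sleep" is simulated by a counter rather than a scheduler.

```c
#include <errno.h>

#define MODE_UNINTERRUPTIBLE 0
#define MODE_INTERRUPTIBLE   1

/* Hypothetical stand-in for the current task and the scheduler. */
struct task_sketch {
    int signal_pending;      /* non-zero if a signal is waiting */
    int sleeps_until_clear;  /* simulated sleeps before the bit clears */
};

/* Standard action: bail out on a signal if the mode allows, else "sleep". */
static int bit_wait_sketch(struct task_sketch *t, unsigned long *word,
                           int bit, int mode)
{
    if (mode == MODE_INTERRUPTIBLE && t->signal_pending)
        return 1;
    if (t->sleeps_until_clear > 1)
        t->sleeps_until_clear--;
    else
        *word &= ~(1UL << bit);  /* the "other side" clears the bit */
    return 0;
}

/* New-style call: no action argument; 0 once clear, non-zero if interrupted. */
int wait_on_bit_sketch(struct task_sketch *t, unsigned long *word,
                       int bit, int mode)
{
    while (*word & (1UL << bit)) {
        int ret = bit_wait_sketch(t, word, bit, mode);

        if (ret)
            return ret;
    }
    return 0;
}

/* The caller maps "non-zero" to whatever error code suits it. */
int wait_for_flag_interruptible(struct task_sketch *t, unsigned long *word,
                                int bit)
{
    return wait_on_bit_sketch(t, word, bit, MODE_INTERRUPTIBLE) ? -EINTR : 0;
}
```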
  9. 04 Jun, 2014 3 commits
    • mm: non-atomically mark page accessed during page cache allocation where possible · 2457aec6
      Mel Gorman authored
      aops->write_begin may allocate a new page and make it visible only to have
      mark_page_accessed called almost immediately after.  Once the page is
      visible the atomic operations are necessary, which is noticeable overhead
      when writing to an in-memory filesystem like tmpfs but should also be
      noticeable with fast storage.  The objective of the patch is to initialise
      the accessed information with non-atomic operations before the page is
      visible.
      
      The bulk of filesystems directly or indirectly use
      grab_cache_page_write_begin or find_or_create_page for the initial
      allocation of a page cache page.  This patch adds an init_page_accessed()
      helper which behaves like the first call to mark_page_accessed() but may
      be called before the page is visible and can be done non-atomically.
      
      The primary APIs of concern in this case are the following and are used
      by most filesystems.
      
      	find_get_page
      	find_lock_page
      	find_or_create_page
      	grab_cache_page_nowait
      	grab_cache_page_write_begin
      
      All of them are very similar in detail, so the patch creates a core helper
      pagecache_get_page() which takes a flags parameter that affects its
      behavior such as whether the page should be marked accessed or not.  The
      old API is preserved but is basically a thin wrapper around this core
      function.
      
      Each of the filesystems are then updated to avoid calling
      mark_page_accessed when it is known that the VM interfaces have already
      done the job.  There is a slight snag in that the timing of the
      mark_page_accessed() has now changed so in rare cases it's possible a page
      gets to the end of the LRU as PageReferenced whereas previously it might
      have been repromoted.  This is expected to be rare but it's worth the
      filesystem people thinking about it in case they see a problem with the
      timing change.  It is also the case that some filesystems may be marking
      pages accessed that previously did not but it makes sense that filesystems
      have consistent behaviour in this regard.
      
      The test case used to evaluate this is a simple dd of a large file done
      multiple times with the file deleted on each iteration.  The size of the
      file is 1/10th physical memory to avoid dirty page balancing.  In the
      async case it will be possible that the workload completes without even
      hitting the disk and will have variable results but highlight the impact
      of mark_page_accessed for async IO.  The sync results are expected to be
      more stable.  The exception is tmpfs where the normal case is for the "IO"
      to not hit the disk.
      
      The test machine was single socket and UMA to avoid any scheduling or NUMA
      artifacts.  Throughput and wall times are presented for sync IO, only wall
      times are shown for async as the granularity reported by dd and the
      variability is unsuitable for comparison.  As async results were variable
      due to writeback timings, I'm only reporting the maximum figures.  The sync
      results were stable enough to make the mean and stddev uninteresting.
      
      The performance results are reported based on a run with no profiling.
      Profile data is based on a separate run with oprofile running.
      
      async dd
                                          3.15.0-rc3            3.15.0-rc3
                                             vanilla           accessed-v2
      ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
      tmpfs   Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
      btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
      ext4    Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
      xfs     Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)
      
      The XFS figure is a bit strange as it managed to avoid a worst case by
      sheer luck but the average figures looked reasonable.
      
              samples percentage
      ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      
      [[email protected]: don't run init_page_accessed() against an uninitialised pointer]
      Signed-off-by: Mel Gorman <[email protected]>
      Cc: Johannes Weiner <[email protected]>
      Cc: Vlastimil Babka <[email protected]>
      Cc: Jan Kara <[email protected]>
      Cc: Michal Hocko <[email protected]>
      Cc: Hugh Dickins <[email protected]>
      Cc: Dave Hansen <[email protected]>
      Cc: Theodore Ts'o <[email protected]>
      Cc: "Paul E. McKenney" <[email protected]>
      Cc: Oleg Nesterov <[email protected]>
      Cc: Rik van Riel <[email protected]>
      Cc: Peter Zijlstra <[email protected]>
      Tested-by: Prabhakar Lad <[email protected]>
      Signed-off-by: Andrew Morton <[email protected]>
      Signed-off-by: Linus Torvalds <[email protected]>
    • fs: buffer: do not use unnecessary atomic operations when discarding buffers · e7470ee8
      Mel Gorman authored
      Discarding buffers uses a bunch of atomic operations for no reason I can
      think of.  Use a cmpxchg loop to clear all the necessary flags.  In most
      (all?) cases this will be a single atomic operation.
      
      [[email protected]: move BUFFER_FLAGS_DISCARD into the .c file]
      Signed-off-by: Mel Gorman <[email protected]>
      Cc: Johannes Weiner <[email protected]>
      Cc: Vlastimil Babka <[email protected]>
      Cc: Jan Kara <[email protected]>
      Cc: Michal Hocko <[email protected]>
      Cc: Hugh Dickins <[email protected]>
      Cc: Dave Hansen <[email protected]>
      Cc: Theodore Ts'o <[email protected]>
      Cc: "Paul E. McKenney" <[email protected]>
      Cc: Oleg Nesterov <[email protected]>
      Cc: Rik van Riel <[email protected]>
      Cc: Peter Zijlstra <[email protected]>
      Signed-off-by: Andrew Morton <[email protected]>
      Signed-off-by: Linus Torvalds <[email protected]>
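The cmpxchg loop mentioned in e7470ee8 can be sketched with C11 atomics as below. The flag names and bit positions are made up for the example and do not match the kernel's BH_* values; `clear_discard_flags()` is likewise a hypothetical name. The loop clears every flag in the mask with one successful compare-and-swap instead of one atomic operation per flag.

```c
#include <stdatomic.h>

/* Illustrative flag bits only; not the kernel's buffer state bits. */
enum {
    F_MAPPED    = 1 << 0,
    F_NEW       = 1 << 1,
    F_REQ       = 1 << 2,
    F_DELAY     = 1 << 3,
    F_UNWRITTEN = 1 << 4,
    F_KEEP      = 1 << 5   /* a flag that must survive the discard */
};

#define FLAGS_DISCARD (F_MAPPED | F_NEW | F_REQ | F_DELAY | F_UNWRITTEN)

/*
 * Clear all FLAGS_DISCARD bits in *state. If another thread changes the
 * word between the load and the exchange, the exchange fails, `old` is
 * refreshed with the current value, and the loop retries. Uncontended,
 * this is a single atomic operation. Returns the value stored.
 */
unsigned long clear_discard_flags(atomic_ulong *state)
{
    unsigned long old = atomic_load(state);

    while (!atomic_compare_exchange_weak(state, &old,
                                         old & ~(unsigned long)FLAGS_DISCARD))
        ;
    return old & ~(unsigned long)FLAGS_DISCARD;
}
```

In portable C a single `atomic_fetch_and()` would do the same job; the loop is kept here only because it mirrors the form the commit message names.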
    • fs/buffer.c: remove block_write_full_page_endio() · 1b938c08
      Matthew Wilcox authored
      The last in-tree caller of block_write_full_page_endio() was removed in
      January 2013.  It's time to remove the EXPORT_SYMBOL, which leaves
      block_write_full_page() as the only caller of
      block_write_full_page_endio(), so inline block_write_full_page_endio()
      into block_write_full_page().
      Signed-off-by: Matthew Wilcox <[email protected]>
      Cc: Hugh Dickins <[email protected]>
      Cc: Dave Chinner <[email protected]>
      Cc: Dheeraj Reddy <[email protected]>
      Signed-off-by: Andrew Morton <[email protected]>
      Signed-off-by: Linus Torvalds <[email protected]>
  10. 18 Apr, 2014 1 commit
  11. 02 Apr, 2014 1 commit
  12. 19 Feb, 2014 1 commit
  13. 06 Feb, 2014 1 commit
  14. 04 Dec, 2013 1 commit
  15. 24 Nov, 2013 1 commit
  16. 17 Oct, 2013 1 commit
    • fs: buffer: move allocation failure loop into the allocator · 84235de3
      Johannes Weiner authored
      Buffer allocation has a very crude indefinite loop around waking the
      flusher threads and performing global NOFS direct reclaim because it can
      not handle allocation failures.
      
      The most immediate problem with this is that the allocation may fail due
      to a memory cgroup limit, where flushers + direct reclaim might not make
      any progress towards resolving the situation at all: unlike the
      global case, a memory cgroup may not have any cache at all, only
      anonymous pages but no swap.  This situation will lead to a reclaim
      livelock with insane IO from waking the flushers and thrashing unrelated
      filesystem cache in a tight loop.
      
      Use __GFP_NOFAIL allocations for buffers for now.  This makes sure that
      any looping happens in the page allocator, which knows how to
      orchestrate kswapd, direct reclaim, and the flushers sensibly.  It also
      allows memory cgroups to detect allocations that can't handle failure
      and will allow them to ultimately bypass the limit if reclaim can not
      make progress.
      Reported-by: azurIt <[email protected]>
      Signed-off-by: Johannes Weiner <[email protected]>
      Cc: Michal Hocko <[email protected]>
      Cc: <[email protected]>
      Signed-off-by: Andrew Morton <[email protected]>
      Signed-off-by: Linus Torvalds <[email protected]>
  17. 03 Jul, 2013 1 commit
  18. 22 May, 2013 1 commit
    • mm: change invalidatepage prototype to accept length · d47992f8
      Lukas Czerner authored
      Currently there is no way to truncate a partial page where the end
      truncate point is not at the end of the page. This is because it was not
      needed and the functionality was enough for the file system truncate
      operation to work properly. However more file systems now support the punch
      hole feature and it can benefit from mm supporting truncating a page just
      up to a certain point.
      
      Specifically, with this functionality truncate_inode_pages_range() can
      be changed so it supports truncating a partial page at the end of the
      range (currently it will BUG_ON() if 'end' is not at the end of the
      page).
      
      This commit changes the invalidatepage() address space operation
      prototype to accept the range to be invalidated and updates all the
      instances of it.

      We also change block_invalidatepage() in the same way and actually
      make use of the new length argument, implementing range invalidation.

      Actual file system implementations will follow, except for the file
      systems where the changes are really simple and should not change the
      behaviour in any way. An implementation of truncate_page_range(), which
      will be able to accept page-unaligned ranges, will follow as well.
      Signed-off-by: Lukas Czerner <[email protected]>
      Cc: Andrew Morton <[email protected]>
      Cc: Hugh Dickins <[email protected]>
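A sketch of what range invalidation with an (offset, length) pair might look like at buffer granularity, for d47992f8. It assumes that a buffer is discarded only when it lies wholly inside the invalidated range, which is a reading of the commit message rather than a quotation of the kernel code. The function `buffers_to_discard()` is hypothetical; it expects a non-zero `buf_size` and at most 32 buffers per page.

```c
/*
 * Return a bitmask with bit i set if the i-th buffer of the page lies
 * entirely within [offset, offset + length). Buffers that are only
 * partly covered are left alone.
 */
unsigned int buffers_to_discard(unsigned int page_size, unsigned int buf_size,
                                unsigned int offset, unsigned int length)
{
    unsigned int stop = offset + length;
    unsigned int mask = 0;
    unsigned int i = 0;

    for (unsigned int cur = 0; cur + buf_size <= page_size;
         cur += buf_size, i++) {
        if (cur + buf_size > stop)
            break;              /* this buffer extends past the range */
        if (offset <= cur)
            mask |= 1u << i;    /* buffer fully inside the range */
    }
    return mask;
}
```

With a 4096-byte page and 1024-byte buffers, invalidating offset 1024, length 2048 selects buffers 1 and 2, while offset 512, length 1024 selects none because no buffer is fully covered.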
  19. 29 Apr, 2013 2 commits
  20. 20 Apr, 2013 1 commit
  21. 23 Mar, 2013 1 commit
  22. 24 Feb, 2013 1 commit
  23. 23 Feb, 2013 1 commit
  24. 22 Feb, 2013 1 commit
    • mm: only enforce stable page writes if the backing device requires it · 1d1d1a76
      Darrick J. Wong authored
      Create a helper function to check if a backing device requires stable
      page writes and, if so, perform the necessary wait.  Then, make it so
      that all points in the memory manager that handle making pages writable
      use the helper function.  This should provide stable page write support
      to most filesystems, while eliminating unnecessary waiting for devices
      that don't require the feature.
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own stable page guarantees or they don't block at all.
      The blocking behavior is back to what it was before 3.0 if you don't
      have a disk requiring stable page writes.
      
      Here's the result of using dbench to test latency on ext2:
      
      3.8.0-rc3:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        109347     0.028    59.817
       ReadX         347180     0.004     3.391
       Flush          15514    29.828   287.283
      
      Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        105556     0.029     4.273
       ReadX         335004     0.005     4.112
       Flush          14982    30.540   298.634
      
      Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, the maximum write latency drops considerably with this
      patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
      similarly, but see the cover letter for those results.
      Signed-off-by: Darrick J. Wong <[email protected]>
      Acked-by: Steven Whitehouse <[email protected]>
      Reviewed-by: Jan Kara <[email protected]>
      Cc: Adrian Hunter <[email protected]>
      Cc: Andy Lutomirski <[email protected]>
      Cc: Artem Bityutskiy <[email protected]>
      Cc: Joel Becker <[email protected]>
      Cc: Mark Fasheh <[email protected]>
      Cc: Jens Axboe <[email protected]>
      Cc: Eric Van Hensbergen <[email protected]>
      Cc: Ron Minnich <[email protected]>
      Cc: Latchesar Ionkov <[email protected]>
      Signed-off-by: Andrew Morton <[email protected]>
      Signed-off-by: Linus Torvalds <[email protected]>
  25. 14 Jan, 2013 3 commits
    • vfs: add missing virtual cache flush after editing partial pages · 6d283dba
      Linus Torvalds authored
      Andrew Morton pointed this out a month ago, and then I completely forgot
      about it.
      
      If we read a partial last page of a block device, we will zero out the
      end of the page, but since that page can then be mapped into user space,
      we should also make sure to flush the cache on architectures that have
      virtual caches.  We have the flush_dcache_page() function for this, so
      use it.
      
      Now, in practice this really never matters, because nobody sane uses
      virtual caches to begin with, and they largely exist on old broken RISC
      architectures.
      
      And even if you did run on one of those obsolete CPU's, the whole "mmap
      and access the last partial page of a block device" behavior probably
      doesn't actually exist.  The normal IO functions (read/write) will never
      see the zeroed-out part of the page that might not be coherent in the
      cache, because they honor the size of the device.
      
      So I'm marking this for stable (3.7 only), but I'm not sure anybody will
      ever care.
      Pointed-out-by: Andrew Morton <[email protected]>
      Cc: [email protected]  # 3.7
      Signed-off-by: Linus Torvalds <[email protected]>
    • block: add block_{touch|dirty}_buffer tracepoint · 5305cb83
      Tejun Heo authored
      The former is triggered from touch_buffer() and the latter from
      mark_buffer_dirty().
      
      This is part of tracepoint additions to improve visibility into
      dirtying / writeback operations for io tracer and userland.
      
      v2: Transformed writeback_dirty_buffer to block_dirty_buffer and made
          it share TP definition with block_touch_buffer.
      Signed-off-by: Tejun Heo <[email protected]>
      Cc: Fengguang Wu <[email protected]>
      Signed-off-by: Jens Axboe <[email protected]>
    • buffer: make touch_buffer() an exported function · f0059afd
      Tejun Heo authored
      We want to add a trace point to touch_buffer() but macros and inline
      functions defined in header files can't have tracing points.  Move
      touch_buffer() to fs/buffer.c and make it a proper function.
      
      The new exported function is also declared inline.  As most uses of
      touch_buffer() are inside buffer.c with nilfs2 as the only other user,
      the effect of this change should be negligible.
      Signed-off-by: Tejun Heo <[email protected]>
      Cc: Steven Rostedt <[email protected]>
      Signed-off-by: Jens Axboe <[email protected]>
  26. 13 Dec, 2012 2 commits
  27. 12 Dec, 2012 1 commit
  28. 05 Dec, 2012 1 commit
  29. 04 Dec, 2012 1 commit
    • vfs: avoid "attempt to access beyond end of device" warnings · 57302e0d
      Linus Torvalds authored
      The block device access simplification that avoided accessing the (racy)
      block size information (commit bbec0270: "blkdev_max_block: make
      private to fs/buffer.c") no longer checks the maximum block size in the
      block mapping path.
      
      That was _almost_ as simple as just removing the code entirely, because
      the readers and writers all check the size of the device anyway, so
      under normal circumstances it "just worked".
      
      However, the block size may be such that the end of the device may
      straddle one single buffer_head.  At which point we may still want to
      access the end of the device, but the buffer we use to access it
      partially extends past the end.
      
      The 'bd_set_size()' function intentionally sets the block size to avoid
      this, but mounting the device - or setting the block size by hand to
      some other value - can modify that block size.
      
      So instead, teach 'submit_bh()' about the special case of the buffer
      head straddling the end of the device, and turn such an access into a
      smaller IO access, avoiding the problem.
      
      This, btw, also means that unlike before, we can now access the whole
      device regardless of device block size setting.  So now, even if the
      device size is only 512-byte aligned, we can read and write even the
      last sector even when having a much bigger block size for accessing the
      rest of the device.
      
      So with this, we could now get rid of the 'bd_set_size()' block size
      code entirely - resulting in faster IO for the common case - but that
      would be a separate patch.
      Reported-and-tested-by: Romain Francoise <[email protected]>
      Reported-and-tested-by: Meelis Roos <[email protected]>
      Reported-by: Tony Luck <[email protected]>
      Signed-off-by: Linus Torvalds <[email protected]>
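The special case in 57302e0d, a buffer that straddles the end of the device being turned into a smaller I/O, comes down to a little sector arithmetic. The sketch below assumes 512-byte sectors; `clamp_io_bytes()` and its parameters are hypothetical names, and requests that start at or beyond the end of the device are returned unchanged on the assumption that another layer rejects them.

```c
#include <stdint.h>

/*
 * Given the device size in bytes, the starting sector of an I/O and its
 * length in bytes, return the number of bytes that fit on the device.
 */
uint32_t clamp_io_bytes(uint64_t dev_bytes, uint64_t sector, uint32_t bytes)
{
    uint64_t maxsector = dev_bytes >> 9;   /* device size in sectors */
    uint64_t remaining;

    if (maxsector == 0 || sector >= maxsector)
        return bytes;                      /* not the straddling case */

    remaining = maxsector - sector;        /* sectors left on the device */
    if ((bytes >> 9) <= remaining)
        return bytes;                      /* fits entirely */

    return (uint32_t)(remaining << 9);     /* shrink to the device end */
}
```

For a 19-sector device (9728 bytes) and a 4096-byte buffer starting at sector 16, only 3 sectors remain, so the I/O shrinks to 1536 bytes; the same buffer at sector 8 fits and is left at 4096 bytes.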
  30. 30 Nov, 2012 1 commit
    • blkdev_max_block: make private to fs/buffer.c · bbec0270
      Linus Torvalds authored
      We really don't want to look at the block size for the raw block device
      accesses in fs/block-dev.c, because it may be changing from under us.
      So get rid of the max_block logic entirely, since the caller should
      already have done it anyway.
      
      That leaves the only user of this function in fs/buffer.c, so move the
      whole function there and make it static.
      Signed-off-by: Linus Torvalds <[email protected]>
  31. 29 Nov, 2012 1 commit
    • fs/buffer.c: make block-size be per-page and protected by the page lock · 45bce8f3
      Linus Torvalds authored
      This makes the buffer size handling be a per-page thing, which allows us
      to not have to worry about locking too much when changing the buffer
      size.  If a page doesn't have buffers, we still need to read the block
      size from the inode, but we can do that with ACCESS_ONCE(), so that even
      if the size is changing, we get a consistent value.
      
      This doesn't convert all functions - many of the buffer functions are
      used purely by filesystems, which in turn results in the buffer size
      being fixed at mount-time.  So they don't have the same consistency
      issues that the raw device access can have.
      Signed-off-by: Linus Torvalds <[email protected]>