1. 05 Apr, 2019 22 commits
    • Chao Yu's avatar
      f2fs: fix to initialize variable to avoid UBSAN/smatch warning · 283a29de
      Chao Yu authored
      [ Upstream commit f9aa52a8 ]
      
      As Dan Carpenter as below:
      
      The patch df634f444ee9: "f2fs: use rb_*_cached friends" from Oct 4,
      2018, leads to the following static checker warning:
      
      	fs/f2fs/extent_cache.c:606 f2fs_update_extent_tree_range()
      	error: uninitialized symbol 'leftmost'.
      
      And also Eric Biggers, and Kyungtae Kim reported, there is an UBSAN
      warning described as below:
      
      We report a bug in linux-4.20.2: "UBSAN: Undefined behaviour in
      fs/f2fs/extent_cache.c"
      
      kernel config: https://kt0755.github.io/etc/config_v4.20_stable
      repro: https://kt0755.github.io/etc/repro.4a3e7.c (f2fs is mounted on
      /mnt/f2fs/)
      
      This arose in f2fs_update_extent_tree_range (fs/f2fs/extent_cache.c:605).
      It seems that, for some reason, its last argument became "24"
      although that was supposed to be bool type.
      
      =========================================
      UBSAN: Undefined behaviour in fs/f2fs/extent_cache.c:605:4
      load of value 24 is not a valid value for type '_Bool'
      CPU: 0 PID: 6774 Comm: syz-executor5 Not tainted 4.20.2 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0xb1/0x118 lib/dump_stack.c:113
       ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
       __ubsan_handle_load_invalid_value+0x17a/0x1be lib/ubsan.c:457
       f2fs_update_extent_tree_range+0x1d4a/0x1d50 fs/f2fs/extent_cache.c:605
       f2fs_update_extent_cache+0x2b6/0x350 fs/f2fs/extent_cache.c:804
       f2fs_update_data_blkaddr+0x61/0x70 fs/f2fs/data.c:656
       f2fs_outplace_write_data+0x1d6/0x4b0 fs/f2fs/segment.c:3140
       f2fs_convert_inline_page+0x86d/0x2060 fs/f2fs/inline.c:163
       f2fs_convert_inline_inode+0x6b5/0xad0 fs/f2fs/inline.c:208
       f2fs_preallocate_blocks+0x78b/0xb00 fs/f2fs/data.c:982
       f2fs_file_write_iter+0x31b/0xf40 fs/f2fs/file.c:3062
       call_write_iter include/linux/fs.h:1857 [inline]
       new_sync_write fs/read_write.c:474 [inline]
       __vfs_write+0x538/0x6e0 fs/read_write.c:487
       vfs_write+0x1b3/0x520 fs/read_write.c:549
       ksys_write+0xde/0x1c0 fs/read_write.c:598
       __do_sys_write fs/read_write.c:610 [inline]
       __se_sys_write fs/read_write.c:607 [inline]
       __x64_sys_write+0x7e/0xc0 fs/read_write.c:607
       do_syscall_64+0xbe/0x4f0 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4497b9
      Code: e8 8c 9f 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48
      89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
      01 f0 ff ff 0f 83 9b 6b fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f1ea15edc68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00007f1ea15ee6cc RCX: 00000000004497b9
      RDX: 0000000000001000 RSI: 0000000020000140 RDI: 0000000000000013
      RBP: 000000000071bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 000000000000bb50 R14: 00000000006f4bf0 R15: 00007f1ea15ee700
      =========================================
      
      As I checked, this uninitialized variable won't cause extent cache
      corruption, but in order to avoid such kind of warning of both UBSAN
      and smatch, fix to initialize related variable.
      Reported-by: 's avatarDan Carpenter <dan.carpenter@oracle.com>
      Reported-by: 's avatarEric Biggers <ebiggers@google.com>
      Reported-by: 's avatarKyungtae Kim <kt0755@gmail.com>
      Signed-off-by: 's avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      283a29de
    • Sheng Yong's avatar
      f2fs: UBSAN: set boolean value iostat_enable correctly · 3d1a731b
      Sheng Yong authored
      [ Upstream commit ac929858 ]
      
      When setting /sys/fs/f2fs/<DEV>/iostat_enable with non-bool value, UBSAN
      reports the following warning.
      
      [ 7562.295484] ================================================================================
      [ 7562.296531] UBSAN: Undefined behaviour in fs/f2fs/f2fs.h:2776:10
      [ 7562.297651] load of value 64 is not a valid value for type '_Bool'
      [ 7562.298642] CPU: 1 PID: 7487 Comm: dd Not tainted 4.20.0-rc4+ #79
      [ 7562.298653] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [ 7562.298662] Call Trace:
      [ 7562.298760]  dump_stack+0x46/0x5b
      [ 7562.298811]  ubsan_epilogue+0x9/0x40
      [ 7562.298830]  __ubsan_handle_load_invalid_value+0x72/0x90
      [ 7562.298863]  f2fs_file_write_iter+0x29f/0x3f0
      [ 7562.298905]  __vfs_write+0x115/0x160
      [ 7562.298922]  vfs_write+0xa7/0x190
      [ 7562.298934]  ksys_write+0x50/0xc0
      [ 7562.298973]  do_syscall_64+0x4a/0xe0
      [ 7562.298992]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 7562.299001] RIP: 0033:0x7fa45ec19c00
      [ 7562.299004] Code: 73 01 c3 48 8b 0d 88 92 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d dd eb 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ce 8f 01 00 48 89 04 24
      [ 7562.299044] RSP: 002b:00007ffca52b49e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [ 7562.299052] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa45ec19c00
      [ 7562.299059] RDX: 0000000000000400 RSI: 000000000093f000 RDI: 0000000000000001
      [ 7562.299065] RBP: 000000000093f000 R08: 0000000000000004 R09: 0000000000000000
      [ 7562.299071] R10: 00007ffca52b47b0 R11: 0000000000000246 R12: 0000000000000400
      [ 7562.299077] R13: 000000000093f000 R14: 000000000093f400 R15: 0000000000000000
      [ 7562.299091] ================================================================================
      
      So, if iostat_enable is enabled, set its value as true.
      Signed-off-by: 's avatarSheng Yong <shengyong1@huawei.com>
      Reviewed-by: 's avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      3d1a731b
    • Theodore Ts'o's avatar
      jbd2: fix race when writing superblock · 34cbc429
      Theodore Ts'o authored
      [ Upstream commit 538bcaa6 ]
      
      The jbd2 superblock is lockless now, so there is probably a race
      condition between writing it so disk and modifing contents of it, which
      may lead to checksum error. The following race is the one case that we
      have captured.
      
      jbd2                                fsstress
      jbd2_journal_commit_transaction
       jbd2_journal_update_sb_log_tail
        jbd2_write_superblock
         jbd2_superblock_csum_set         jbd2_journal_revoke
                                           jbd2_journal_set_features(revork)
                                           modify superblock
         submit_bh(checksum incorrect)
      
      Fix this by locking the buffer head before modifing it.  We always
      write the jbd2 superblock after we modify it, so this just means
      calling the lock_buffer() a little earlier.
      
      This checksum corruption problem can be reproduced by xfstests
      generic/475.
      Reported-by: 's avatarzhangyi (F) <yi.zhang@huawei.com>
      Suggested-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o's avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      34cbc429
    • Aurelien Jarno's avatar
      vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1 · e6eef524
      Aurelien Jarno authored
      [ Upstream commit cc4b1242 ]
      
      The preadv2 and pwritev2 syscalls are supposed to emulate the readv and
      writev syscalls when offset == -1. Therefore the compat code should
      check for offset before calling do_compat_preadv64 and
      do_compat_pwritev64. This is the case for the preadv2 and pwritev2
      syscalls, but handling of offset == -1 is missing in their 64-bit
      equivalent.
      
      This patch fixes that, calling do_compat_readv and do_compat_writev when
      offset == -1. This fixes the following glibc tests on x32:
       - misc/tst-preadvwritev2
       - misc/tst-preadvwritev64v2
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: 's avatarAurelien Jarno <aurelien@aurel32.net>
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      e6eef524
    • Josef Bacik's avatar
      btrfs: don't enospc all tickets on flush failure · 895927dc
      Josef Bacik authored
      [ Upstream commit f91587e4 ]
      
      With the introduction of the per-inode block_rsv it became possible to
      have really really large reservation requests made because of data
      fragmentation.  Since the ticket stuff assumed that we'd always have
      relatively small reservation requests it just killed all tickets if we
      were unable to satisfy the current request.
      
      However, this is generally not the case anymore.  So fix this logic to
      instead see if we had a ticket that we were able to give some
      reservation to, and if we were continue the flushing loop again.
      
      Likewise we make the tickets use the space_info_add_old_bytes() method
      of returning what reservation they did receive in hopes that it could
      satisfy reservations down the line.
      Reviewed-by: 's avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: 's avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      895927dc
    • Qu Wenruo's avatar
      btrfs: qgroup: Make qgroup async transaction commit more aggressive · 8f6019b4
      Qu Wenruo authored
      [ Upstream commit f5fef459 ]
      
      [BUG]
      Btrfs qgroup will still hit EDQUOT under the following case:
      
        $ dev=/dev/test/test
        $ mnt=/mnt/btrfs
        $ umount $mnt &> /dev/null
        $ umount $dev &> /dev/null
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt -o nospace_cache
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ fallocate -l 900M $mnt/subv/padding
        $ sync
      
        $ rm $mnt/subv/padding
      
        # Hit EDQUOT
        $ xfs_io -f -c "pwrite 0 512M" $mnt/subv/real_file
      
      [CAUSE]
      Since commit a514d638 ("btrfs: qgroup: Commit transaction in advance
      to reduce early EDQUOT"), btrfs is not forced to commit transaction to
      reclaim more quota space.
      
      Instead, we just check pertrans metadata reservation against some
      threshold and try to do asynchronously transaction commit.
      
      However in above case, the pertrans metadata reservation is pretty small
      thus it will never trigger asynchronous transaction commit.
      
      [FIX]
      Instead of only accounting pertrans metadata reservation, we calculate
      how much free space we have, and if there isn't much free space left,
      commit transaction asynchronously to try to free some space.
      
      This may slow down the fs when we have less than 32M free qgroup space,
      but should reduce a lot of false EDQUOT, so the cost should be
      acceptable.
      Signed-off-by: 's avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      8f6019b4
    • Josef Bacik's avatar
      btrfs: save drop_progress if we drop refs at all · 29b55af8
      Josef Bacik authored
      [ Upstream commit aea6f028 ]
      
      Previously we only updated the drop_progress key if we were in the
      DROP_REFERENCE stage of snapshot deletion.  This is because the
      UPDATE_BACKREF stage checks the flags of the blocks it's converting to
      FULL_BACKREF, so if we go over a block we processed before it doesn't
      matter, we just don't do anything.
      
      The problem is in do_walk_down() we will go ahead and drop the roots
      reference to any blocks that we know we won't need to walk into.
      
      Given subvolume A and snapshot B.  The root of B points to all of the
      nodes that belong to A, so all of those nodes have a refcnt > 1.  If B
      did not modify those blocks it'll hit this condition in do_walk_down
      
      if (!wc->update_ref ||
          generation <= root->root_key.offset)
      	goto skip;
      
      and in "goto skip" we simply do a btrfs_free_extent() for that bytenr
      that we point at.
      
      Now assume we modified some data in B, and then took a snapshot of B and
      call it C.  C points to all the nodes in B, making every node the root
      of B points to have a refcnt > 1.  This assumes the root level is 2 or
      higher.
      
      We delete snapshot B, which does the above work in do_walk_down,
      free'ing our ref for nodes we share with A that we didn't modify.  Now
      we hit a node we _did_ modify, thus we own.  We need to walk down into
      this node and we set wc->stage == UPDATE_BACKREF.  We walk down to level
      0 which we also own because we modified data.  We can't walk any further
      down and thus now need to walk up and start the next part of the
      deletion.  Now walk_up_proc is supposed to put us back into
      DROP_REFERENCE, but there's an exception to this
      
      if (level < wc->shared_level)
      	goto out;
      
      we are at level == 0, and our shared_level == 1.  We skip out of this
      one and go up to level 1.  Since path->slots[1] < nritems we
      path->slots[1]++ and break out of walk_up_tree to stop our transaction
      and loop back around.  Now in btrfs_drop_snapshot we have this snippet
      
      if (wc->stage == DROP_REFERENCE) {
      	level = wc->level;
      	btrfs_node_key(path->nodes[level],
      		       &root_item->drop_progress,
      		       path->slots[level]);
      	root_item->drop_level = level;
      }
      
      our stage == UPDATE_BACKREF still, so we don't update the drop_progress
      key.  This is a problem because we would have done btrfs_free_extent()
      for the nodes leading up to our current position.  If we crash or
      unmount here and go to remount we'll start over where we were before and
      try to free our ref for blocks we've already freed, and thus abort()
      out.
      
      Fix this by keeping track of the last place we dropped a reference for
      our block in do_walk_down.  Then if wc->stage == UPDATE_BACKREF we know
      we'll start over from a place we meant to, and otherwise things continue
      to work as they did before.
      
      I have a complicated reproducer for this problem, without this patch
      we'll fail to fsck the fs when replaying the log writes log.  With this
      patch we can replay the whole log without any fsck or mount failures.
      
      The steps to reproduce this easily are sort of tricky, I had to add a
      couple of debug patches to the kernel in order to make it easy,
      basically I just needed to make sure we did actually commit the
      transaction every time we finished a walk_down_tree/walk_up_tree combo.
      
      The reproducer:
      
      1) Creates a base subvolume.
      2) Creates 100k files in the subvolume.
      3) Snapshots the base subvolume (snap1).
      4) Touches files 5000-6000 in snap1.
      5) Snapshots snap1 (snap2).
      6) Deletes snap1.
      
      I do this with dm-log-writes, and then replay to every FUA in the log
      and fsck the fs.
      Reviewed-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: 's avatarJosef Bacik <josef@toxicpanda.com>
      [ copy reproducer steps ]
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      29b55af8
    • Carlos Maiolino's avatar
      fs: fix guard_bio_eod to check for real EOD errors · a1d9d214
      Carlos Maiolino authored
      [ Upstream commit dce30ca9 ]
      
      guard_bio_eod() can truncate a segment in bio to allow it to do IO on
      odd last sectors of a device.
      
      It already checks if the IO starts past EOD, but it does not consider
      the possibility of an IO request starting within device boundaries can
      contain more than one segment past EOD.
      
      In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will
      underflow bvec->bv_len.
      
      Fix this by checking if truncated_bytes is lower than PAGE_SIZE.
      
      This situation has been found on filesystems such as isofs and vfat,
      which doesn't check the device size before mount, if the device is
      smaller than the filesystem itself, a readahead on such filesystem,
      which spans EOD, can trigger this situation, leading a call to
      zero_user() with a wrong size possibly corrupting memory.
      
      I didn't see any crash, or didn't let the system run long enough to
      check if memory corruption will be hit somewhere, but adding
      instrumentation to guard_bio_end() to check truncated_bytes size, was
      enough to see the error.
      
      The following script can trigger the error.
      
      MNT=/mnt
      IMG=./DISK.img
      DEV=/dev/loop0
      
      mkfs.vfat $IMG
      mount $IMG $MNT
      cp -R /etc $MNT &> /dev/null
      umount $MNT
      
      losetup -D
      
      losetup --find --show --sizelimit 16247280 $IMG
      mount $DEV $MNT
      
      find $MNT -type f -exec cat {} + >/dev/null
      
      Kudos to Eric Sandeen for coming up with the reproducer above
      Reviewed-by: 's avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: Carlos Maiolino's avatarCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: 's avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      a1d9d214
    • Eric Whitney's avatar
      ext4: fix bigalloc cluster freeing when hole punching under load · dee200ab
      Eric Whitney authored
      [ Upstream commit 7bd75230 ]
      
      Ext4 may not free clusters correctly when punching holes in bigalloc
      file systems under high load conditions.  If it's not possible to
      extend and restart the journal in ext4_ext_rm_leaf() when preparing to
      remove blocks from a punched region, a retry of the entire punch
      operation is triggered in ext4_ext_remove_space().  This causes a
      partial cluster to be set to the first cluster in the extent found to
      the right of the punched region.  However, if the punch operation
      prior to the retry had made enough progress to delete one or more
      extents and a partial cluster candidate for freeing had already been
      recorded, the retry would overwrite the partial cluster.  The loss of
      this information makes it impossible to correctly free the original
      partial cluster in all cases.
      
      This bug can cause generic/476 to fail when run as part of
      xfstests-bld's bigalloc and bigalloc_1k test cases.  The failure is
      reported when e2fsck detects bad iblocks counts greater than expected
      in units of whole clusters and also detects a number of negative block
      bitmap differences equal to the iblocks discrepancy in cluster units.
      Signed-off-by: 's avatarEric Whitney <enwlinux@gmail.com>
      Signed-off-by: Theodore Ts'o's avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      dee200ab
    • luojiajun's avatar
      jbd2: fix invalid descriptor block checksum · 1d62e75a
      luojiajun authored
      [ Upstream commit 6e876c3d ]
      
      In jbd2_journal_commit_transaction(), if we are in abort mode,
      we may flush the buffer without setting descriptor block checksum
      by goto start_journal_io. Then fs is mounted,
      jbd2_descriptor_block_csum_verify() failed.
      
      [  271.379811] EXT4-fs (vdd): shut down requested (2)
      [  271.381827] Aborting journal on device vdd-8.
      [  271.597136] JBD2: Invalid checksum recovering block 22199 in log
      [  271.598023] JBD2: recovery failed
      [  271.598484] EXT4-fs (vdd): error loading journal
      
      Fix this problem by keep setting descriptor block checksum if the
      descriptor buffer is not NULL.
      
      This checksum problem can be reproduced by xfstests generic/388.
      Signed-off-by: 's avatarluojiajun <luojiajun3@huawei.com>
      Signed-off-by: Theodore Ts'o's avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: 's avatarJan Kara <jack@suse.cz>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      1d62e75a
    • Yao Liu's avatar
      cifs: Fix NULL pointer dereference of devname · d6dd8042
      Yao Liu authored
      [ Upstream commit 68e2672f ]
      
      There is a NULL pointer dereference of devname in strspn()
      
      The oops looks something like:
      
        CIFS: Attempting to mount (null)
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
        ...
        RIP: 0010:strspn+0x0/0x50
        ...
        Call Trace:
         ? cifs_parse_mount_options+0x222/0x1710 [cifs]
         ? cifs_get_volume_info+0x2f/0x80 [cifs]
         cifs_setup_volume_info+0x20/0x190 [cifs]
         cifs_get_volume_info+0x50/0x80 [cifs]
         cifs_smb3_do_mount+0x59/0x630 [cifs]
         ? ida_alloc_range+0x34b/0x3d0
         cifs_do_mount+0x11/0x20 [cifs]
         mount_fs+0x52/0x170
         vfs_kern_mount+0x6b/0x170
         do_mount+0x216/0xdc0
         ksys_mount+0x83/0xd0
         __x64_sys_mount+0x25/0x30
         do_syscall_64+0x65/0x220
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix this by adding a NULL check on devname in cifs_parse_devname()
      Signed-off-by: 's avatarYao Liu <yotta.liu@ucloud.cn>
      Signed-off-by: 's avatarSteve French <stfrench@microsoft.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      d6dd8042
    • Namjae Jeon's avatar
      cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED · bcb99efa
      Namjae Jeon authored
      [ Upstream commit 969ae8e8 ]
      
      Old windows version or Netapp SMB server will return
      NT_STATUS_NOT_SUPPORTED since they do not allow or implement
      FSCTL_VALIDATE_NEGOTIATE_INFO. The client should accept the response
      provided it's properly signed.
      
      See
      https://blogs.msdn.microsoft.com/openspecification/2012/06/28/smb3-secure-dialect-negotiation/
      
      and
      
      MS-SMB2 validate negotiate response processing:
      https://msdn.microsoft.com/en-us/library/hh880630.aspx
      
      Samba client had already handled it.
      https://bugzilla.samba.org/attachment.cgi?id=13285&action=editSigned-off-by: 's avatarNamjae Jeon <linkinjeon@gmail.com>
      Signed-off-by: 's avatarSteve French <stfrench@microsoft.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      bcb99efa
    • Chao Yu's avatar
      f2fs: fix to check inline_xattr_size boundary correctly · 88596e78
      Chao Yu authored
      [ Upstream commit 500e0b28 ]
      
      We use below condition to check inline_xattr_size boundary:
      
      	if (!F2FS_OPTION(sbi).inline_xattr_size ||
      		F2FS_OPTION(sbi).inline_xattr_size >=
      				DEF_ADDRS_PER_INODE -
      				F2FS_TOTAL_EXTRA_ATTR_SIZE -
      				DEF_INLINE_RESERVED_SIZE -
      				DEF_MIN_INLINE_SIZE)
      
      There is there problems in that check:
      - we should allow inline_xattr_size equaling to min size of inline
      {data,dentry} area.
      - F2FS_TOTAL_EXTRA_ATTR_SIZE and inline_xattr_size are based on
      different size unit, previous one is 4 bytes, latter one is 1 bytes.
      - DEF_MIN_INLINE_SIZE only indicate min size of inline data area,
      however, we need to consider min size of inline dentry area as well,
      minimal inline dentry should at least contain two entries: '.' and
      '..', so that min inline_dentry size is 40 bytes.
      
      .bitmap		1 * 1 = 1
      .reserved	1 * 1 = 1
      .dentry		11 * 2 = 22
      .filename	8 * 2 = 16
      total		40
      Signed-off-by: 's avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      88596e78
    • Louis Taylor's avatar
      cifs: use correct format characters · f2e34b4f
      Louis Taylor authored
      [ Upstream commit 259594be ]
      
      When compiling with -Wformat, clang emits the following warnings:
      
      fs/cifs/smb1ops.c:312:20: warning: format specifies type 'unsigned
      short' but the argument has type 'unsigned int' [-Wformat]
                               tgt_total_cnt, total_in_tgt);
                                              ^~~~~~~~~~~~
      
      fs/cifs/cifs_dfs_ref.c:289:4: warning: format specifies type 'short'
      but the argument has type 'int' [-Wformat]
                       ref->flags, ref->server_type);
                       ^~~~~~~~~~
      
      fs/cifs/cifs_dfs_ref.c:289:16: warning: format specifies type 'short'
      but the argument has type 'int' [-Wformat]
                       ref->flags, ref->server_type);
                                   ^~~~~~~~~~~~~~~~
      
      fs/cifs/cifs_dfs_ref.c:291:4: warning: format specifies type 'short'
      but the argument has type 'int' [-Wformat]
                       ref->ref_flag, ref->path_consumed);
                       ^~~~~~~~~~~~~
      
      fs/cifs/cifs_dfs_ref.c:291:19: warning: format specifies type 'short'
      but the argument has type 'int' [-Wformat]
                       ref->ref_flag, ref->path_consumed);
                                      ^~~~~~~~~~~~~~~~~~
      The types of these arguments are unconditionally defined, so this patch
      updates the format character to the correct ones for ints and unsigned
      ints.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/378Signed-off-by: Louis Taylor's avatarLouis Taylor <louis@kragniz.eu>
      Signed-off-by: 's avatarSteve French <stfrench@microsoft.com>
      Reviewed-by: 's avatarNick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      f2e34b4f
    • Shuriyc Chu's avatar
      fs/file.c: initialize init_files.resize_wait · 0326696a
      Shuriyc Chu authored
      [ Upstream commit 5704a068 ]
      
      (Taken from https://bugzilla.kernel.org/show_bug.cgi?id=200647)
      
      'get_unused_fd_flags' in kthread cause kernel crash.  It works fine on
      4.1, but causes crash after get 64 fds.  It also cause crash on
      ubuntu1404/1604/1804, centos7.5, and the crash messages are almost the
      same.
      
      The crash message on centos7.5 shows below:
      
        start fd 61
        start fd 62
        start fd 63
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: __wake_up_common+0x2e/0x90
        PGD 0
        Oops: 0000 [#1] SMP
        Modules linked in: test(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter devlink sunrpc kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 joydev ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_scsi virtio_console virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel drm ata_piix serio_raw libata virtio_pci virtio_ring i2c_core
         virtio floppy dm_mirror dm_region_hash dm_log dm_mod
        CPU: 2 PID: 1820 Comm: test_fd Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.3.3.el7.x86_64 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
        task: ffff8e92b9431fa0 ti: ffff8e94247a0000 task.ti: ffff8e94247a0000
        RIP: 0010:__wake_up_common+0x2e/0x90
        RSP: 0018:ffff8e94247a2d18  EFLAGS: 00010086
        RAX: 0000000000000000 RBX: ffffffff9d09daa0 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff9d09daa0
        RBP: ffff8e94247a2d50 R08: 0000000000000000 R09: ffff8e92b95dfda8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9d09daa8
        R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
        FS:  0000000000000000(0000) GS:ffff8e9434e80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000017c686000 CR4: 00000000000207e0
        Call Trace:
          __wake_up+0x39/0x50
          expand_files+0x131/0x250
          __alloc_fd+0x47/0x170
          get_unused_fd_flags+0x30/0x40
          test_fd+0x12a/0x1c0 [test]
          kthread+0xd1/0xe0
          ret_from_fork_nospec_begin+0x21/0x21
        Code: 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 49 89 fc 49 83 c4 08 53 48 83 ec 10 48 8b 47 08 89 55 cc 4c 89 45 d0 <48> 8b 08 49 39 c4 48 8d 78 e8 4c 8d 69 e8 75 08 eb 3b 4c 89 ef
        RIP   __wake_up_common+0x2e/0x90
         RSP <ffff8e94247a2d18>
        CR2: 0000000000000000
      
      This issue exists since CentOS 7.5 3.10.0-862 and CentOS 7.4
      (3.10.0-693.21.1 ) is ok.  Root cause: the item 'resize_wait' is not
      initialized before being used.
      Reported-by: 's avatarRichard Zhang <zhang.zijian@h3c.com>
      Reviewed-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      0326696a
    • zhengliang's avatar
      f2fs: fix to data block override node segment by mistake · 902507da
      zhengliang authored
      [ Upstream commit a0770e13 ]
      
      v4: Rearrange the previous three versions.
      
      The following scenario could lead to data block override by mistake.
      
      TASK A            |  TASK kworker                                            |     TASK B                                            |       TASK C
                        |                                                          |                                                       |
      open              |                                                          |                                                       |
      write             |                                                          |                                                       |
      close             |                                                          |                                                       |
                        |  f2fs_write_data_pages                                   |                                                       |
                        |    f2fs_write_cache_pages                                |                                                       |
                        |      f2fs_outplace_write_data                            |                                                       |
                        |        f2fs_allocate_data_block (get block in seg S,     |                                                       |
                        |                                  S is full, and only     |                                                       |
                        |                                  have this valid data    |                                                       |
                        |                                  block)                  |                                                       |
                        |          allocate_segment                                |                                                       |
                        |          locate_dirty_segment (mark S as PRE)            |                                                       |
                        |        f2fs_submit_page_write (submit but is not         |                                                       |
                        |                                written on dev)           |                                                       |
      unlink            |                                                          |                                                       |
       iput_final       |                                                          |                                                       |
        f2fs_drop_inode |                                                          |                                                       |
          f2fs_truncate |                                                          |                                                       |
       (not evict)      |                                                          |                                                       |
                        |                                                          | write_checkpoint                                      |
                        |                                                          |  flush merged bio but not wait file data writeback    |
                        |                                                          |  set_prefree_as_free (mark S as FREE)                 |
                        |                                                          |                                                       | update NODE/DATA
                        |                                                          |                                                       | allocate_segment (select S)
                        |     writeback done                                       |                                                       |
      
      So we need to guarantee io complete before truncate inode in f2fs_drop_inode.
      Reviewed-by: 's avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: 's avatarZheng Liang <zhengliang6@huawei.com>
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      902507da
    • Sahitya Tummala's avatar
      f2fs: do not use mutex lock in atomic context · 36672151
      Sahitya Tummala authored
      [ Upstream commit 9083977d ]
      
      Fix below warning coming because of using mutex lock in atomic context.
      
      BUG: sleeping function called from invalid context at kernel/locking/mutex.c:98
      in_atomic(): 1, irqs_disabled(): 0, pid: 585, name: sh
      Preemption disabled at: __radix_tree_preload+0x28/0x130
      Call trace:
       dump_backtrace+0x0/0x2b4
       show_stack+0x20/0x28
       dump_stack+0xa8/0xe0
       ___might_sleep+0x144/0x194
       __might_sleep+0x58/0x8c
       mutex_lock+0x2c/0x48
       f2fs_trace_pid+0x88/0x14c
       f2fs_set_node_page_dirty+0xd0/0x184
      
      Do not use f2fs_radix_tree_insert() to avoid doing cond_resched() with
      spin_lock() acquired.
      Signed-off-by: 's avatarSahitya Tummala <stummala@codeaurora.org>
      Reviewed-by: 's avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      36672151
    • Jia Guo's avatar
      ocfs2: fix a panic problem caused by o2cb_ctl · e92a6db0
      Jia Guo authored
      [ Upstream commit cc725ef3 ]
      
      In the process of creating a node, it will cause NULL pointer
      dereference in kernel if o2cb_ctl failed in the interval (mkdir,
      o2cb_set_node_attribute(node_num)] in function o2cb_add_node.
      
      The node num is initialized to 0 in function o2nm_node_group_make_item,
      o2nm_node_group_drop_item will mistake the node number 0 for a valid
      node number when we delete the node before the node number is set
      correctly.  If the local node number of the current host happens to be
      0, cluster->cl_local_node will be set to O2NM_INVALID_NODE_NUM while
      o2hb_thread still running.  The panic stack is generated as follows:
      
        o2hb_thread
            \-o2hb_do_disk_heartbeat
                \-o2hb_check_own_slot
                    |-slot = &reg->hr_slots[o2nm_this_node()];
                    //o2nm_this_node() return O2NM_INVALID_NODE_NUM
      
      We need to check whether the node number is set when we delete the node.
      
      Link: http://lkml.kernel.org/r/133d8045-72cc-863e-8eae-5013f9f6bc51@huawei.comSigned-off-by: 's avatarJia Guo <guojia12@huawei.com>
      Reviewed-by: 's avatarJoseph Qi <jiangqi903@gmail.com>
      Acked-by: JunPiao's avatarJun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      e92a6db0
    • Chao Yu's avatar
      f2fs: fix to avoid deadlock in f2fs_read_inline_dir() · a98984da
      Chao Yu authored
      [ Upstream commit aadcef64 ]
      
      As Jiqun Li reported in bugzilla:
      
      https://bugzilla.kernel.org/show_bug.cgi?id=202883
      
      sometimes, dead lock when make system call SYS_getdents64 with fsync() is
      called by another process.
      
      monkey running on android9.0
      
      1.  task 9785 held sbi->cp_rwsem and waiting lock_page()
      2.  task 10349 held mm_sem and waiting sbi->cp_rwsem
      3. task 9709 held lock_page() and waiting mm_sem
      
      so this is a dead lock scenario.
      
      task stack is show by crash tools as following
      
      crash_arm64> bt ffffffc03c354080
      PID: 9785   TASK: ffffffc03c354080  CPU: 1   COMMAND: "RxIoScheduler-3"
      >> #7 [ffffffc01b50fac0] __lock_page at ffffff80081b11e8
      
      crash-arm64> bt 10349
      PID: 10349  TASK: ffffffc018b83080  CPU: 1   COMMAND: "BUGLY_ASYNC_UPL"
      >> #3 [ffffffc01f8cfa40] rwsem_down_read_failed at ffffff8008a93afc
           PC: 00000033  LR: 00000000  SP: 00000000  PSTATE: ffffffffffffffff
      
      crash-arm64> bt 9709
      PID: 9709   TASK: ffffffc03e7f3080  CPU: 1   COMMAND: "IntentService[A"
      >> #3 [ffffffc001e67850] rwsem_down_read_failed at ffffff8008a93afc
      >> #8 [ffffffc001e67b80] el1_ia at ffffff8008084fc4
           PC: ffffff8008274114  [compat_filldir64+120]
           LR: ffffff80083584d4  [f2fs_fill_dentries+448]
           SP: ffffffc001e67b80  PSTATE: 80400145
          X29: ffffffc001e67b80  X28: 0000000000000000  X27: 000000000000001a
          X26: 00000000000093d7  X25: ffffffc070d52480  X24: 0000000000000008
          X23: 0000000000000028  X22: 00000000d43dfd60  X21: ffffffc001e67e90
          X20: 0000000000000011  X19: ffffff80093a4000  X18: 0000000000000000
          X17: 0000000000000000  X16: 0000000000000000  X15: 0000000000000000
          X14: ffffffffffffffff  X13: 0000000000000008  X12: 0101010101010101
          X11: 7f7f7f7f7f7f7f7f  X10: 6a6a6a6a6a6a6a6a   X9: 7f7f7f7f7f7f7f7f
           X8: 0000000080808000   X7: ffffff800827409c   X6: 0000000080808000
           X5: 0000000000000008   X4: 00000000000093d7   X3: 000000000000001a
           X2: 0000000000000011   X1: ffffffc070d52480   X0: 0000000000800238
      >> #9 [ffffffc001e67be0] f2fs_fill_dentries at ffffff80083584d0
           PC: 0000003c  LR: 00000000  SP: 00000000  PSTATE: 000000d9
          X12: f48a02ff X11: d4678960 X10: d43dfc00  X9: d4678ae4
           X8: 00000058  X7: d4678994  X6: d43de800  X5: 000000d9
           X4: d43dfc0c  X3: d43dfc10  X2: d46799c8  X1: 00000000
           X0: 00001068
      
      Below potential deadlock will happen between three threads:
      Thread A		Thread B		Thread C
      - f2fs_do_sync_file
       - f2fs_write_checkpoint
        - down_write(&sbi->node_change) -- 1)
      			- do_page_fault
      			 - down_write(&mm->mmap_sem) -- 2)
      			  - do_wp_page
      			   - f2fs_vm_page_mkwrite
      						- getdents64
      						 - f2fs_read_inline_dir
      						  - lock_page -- 3)
        - f2fs_sync_node_pages
         - lock_page -- 3)
      			    - __do_map_lock
      			     - down_read(&sbi->node_change) -- 1)
      						  - f2fs_fill_dentries
      						   - dir_emit
      						    - compat_filldir64
      						     - do_page_fault
      						      - down_read(&mm->mmap_sem) -- 2)
      
      Since f2fs_readdir is protected by inode.i_rwsem, there should not be
      any updates in inode page, we're safe to lookup dents in inode page
      without its lock held, so taking off the lock to improve concurrency
      of readdir and avoid potential deadlock.
      Reported-by: 's avatarJiqun Li <jiqun.li@unisoc.com>
      Signed-off-by: 's avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      a98984da
    • Chao Yu's avatar
      f2fs: fix to adapt small inline xattr space in __find_inline_xattr() · 8d661a66
      Chao Yu authored
      [ Upstream commit 2c28aba8 ]
      
      With below testcase, we will fail to find existed xattr entry:
      
      1. mkfs.f2fs -O extra_attr -O flexible_inline_xattr /dev/zram0
      2. mount -t f2fs -o inline_xattr_size=1 /dev/zram0 /mnt/f2fs/
      3. touch /mnt/f2fs/file
      4. setfattr -n "user.name" -v 0 /mnt/f2fs/file
      5. getfattr -n "user.name" /mnt/f2fs/file
      
      /mnt/f2fs/file: user.name: No such attribute
      
      The reason is for inode which has very small inline xattr size,
      __find_inline_xattr() will fail to traverse any entry due to first
      entry may not be loaded from xattr node yet, later, we may skip to
      check entire xattr datas in __find_xattr(), result in such wrong
      condition.
      
      This patch adds condition to check such case to avoid this issue.
      Signed-off-by: 's avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      8d661a66
    • aaptel's avatar
      CIFS: fix POSIX lock leak and invalid ptr deref · fae38f28
      aaptel authored
      [ Upstream commit bc31d0cd ]
      
      We have a customer reporting crashes in lock_get_status() with many
      "Leaked POSIX lock" messages preceeding the crash.
      
       Leaked POSIX lock on dev=0x0:0x56 ...
       Leaked POSIX lock on dev=0x0:0x56 ...
       Leaked POSIX lock on dev=0x0:0x56 ...
       Leaked POSIX lock on dev=0x0:0x53 ...
       Leaked POSIX lock on dev=0x0:0x53 ...
       Leaked POSIX lock on dev=0x0:0x53 ...
       Leaked POSIX lock on dev=0x0:0x53 ...
       POSIX: fl_owner=ffff8900e7b79380 fl_flags=0x1 fl_type=0x1 fl_pid=20709
       Leaked POSIX lock on dev=0x0:0x4b ino...
       Leaked locks on dev=0x0:0x4b ino=0xf911400000029:
       POSIX: fl_owner=ffff89f41c870e00 fl_flags=0x1 fl_type=0x1 fl_pid=19592
       stack segment: 0000 [#1] SMP
       Modules linked in: binfmt_misc msr tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag rpcsec_gss_krb5 arc4 ecb auth_rpcgss nfsv4 md4 nfs nls_utf8 lockd grace cifs sunrpc ccm dns_resolver fscache af_packet iscsi_ibft iscsi_boot_sysfs vmw_vsock_vmci_transport vsock xfs libcrc32c sb_edac edac_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drbg ansi_cprng vmw_balloon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr vmxnet3 i2c_piix4 vmw_vmci shpchp fjes processor button ac btrfs xor raid6_pq sr_mod cdrom ata_generic sd_mod ata_piix vmwgfx crc32c_intel drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm serio_raw ahci libahci drm libata vmw_pvscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
      
       Supported: Yes
       CPU: 6 PID: 28250 Comm: lsof Not tainted 4.4.156-94.64-default #1
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
       task: ffff88a345f28740 ti: ffff88c74005c000 task.ti: ffff88c74005c000
       RIP: 0010:[<ffffffff8125dcab>]  [<ffffffff8125dcab>] lock_get_status+0x9b/0x3b0
       RSP: 0018:ffff88c74005fd90  EFLAGS: 00010202
       RAX: ffff89bde83e20ae RBX: ffff89e870003d18 RCX: 0000000049534f50
       RDX: ffffffff81a3541f RSI: ffffffff81a3544e RDI: ffff89bde83e20ae
       RBP: 0026252423222120 R08: 0000000020584953 R09: 000000000000ffff
       R10: 0000000000000000 R11: ffff88c74005fc70 R12: ffff89e5ca7b1340
       R13: 00000000000050e5 R14: ffff89e870003d30 R15: ffff89e5ca7b1340
       FS:  00007fafd64be800(0000) GS:ffff89f41fd00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000001c80018 CR3: 000000a522048000 CR4: 0000000000360670
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Stack:
        0000000000000208 ffffffff81a3d6b6 ffff89e870003d30 ffff89e870003d18
        ffff89e5ca7b1340 ffff89f41738d7c0 ffff89e870003d30 ffff89e5ca7b1340
        ffffffff8125e08f 0000000000000000 ffff89bc22b67d00 ffff88c74005ff28
       Call Trace:
        [<ffffffff8125e08f>] locks_show+0x2f/0x70
        [<ffffffff81230ad1>] seq_read+0x251/0x3a0
        [<ffffffff81275bbc>] proc_reg_read+0x3c/0x70
        [<ffffffff8120e456>] __vfs_read+0x26/0x140
        [<ffffffff8120e9da>] vfs_read+0x7a/0x120
        [<ffffffff8120faf2>] SyS_read+0x42/0xa0
        [<ffffffff8161cbc3>] entry_SYSCALL_64_fastpath+0x1e/0xb7
      
      When Linux closes a FD (close(), close-on-exec, dup2(), ...) it calls
      filp_close() which also removes all posix locks.
      
      The lock struct is initialized like so in filp_close() and passed
      down to cifs
      
      	...
              lock.fl_type = F_UNLCK;
              lock.fl_flags = FL_POSIX | FL_CLOSE;
              lock.fl_start = 0;
              lock.fl_end = OFFSET_MAX;
      	...
      
      Note the FL_CLOSE flag, which hints the VFS code that this unlocking
      is done for closing the fd.
      
      filp_close()
        locks_remove_posix(filp, id);
          vfs_lock_file(filp, F_SETLK, &lock, NULL);
            return filp->f_op->lock(filp, cmd, fl) => cifs_lock()
              rc = cifs_setlk(file, flock, type, wait_flag, posix_lck, lock, unlock, xid);
                rc = server->ops->mand_unlock_range(cfile, flock, xid);
                if (flock->fl_flags & FL_POSIX && !rc)
                        rc = locks_lock_file_wait(file, flock)
      
      Notice how we don't call locks_lock_file_wait() which does the
      generic VFS lock/unlock/wait work on the inode if rc != 0.
      
      If we are closing the handle, the SMB server is supposed to remove any
      locks associated with it. Similarly, cifs.ko frees and wakes up any
      lock and lock waiter when closing the file:
      
      cifs_close()
        cifsFileInfo_put(file->private_data)
      	/*
      	 * Delete any outstanding lock records. We'll lose them when the file
      	 * is closed anyway.
      	 */
      	down_write(&cifsi->lock_sem);
      	list_for_each_entry_safe(li, tmp, &cifs_file->llist->locks, llist) {
      		list_del(&li->llist);
      		cifs_del_lock_waiters(li);
      		kfree(li);
      	}
      	list_del(&cifs_file->llist->llist);
      	kfree(cifs_file->llist);
      	up_write(&cifsi->lock_sem);
      
      So we can safely ignore unlocking failures in cifs_lock() if they
      happen with the FL_CLOSE flag hint set as both the server and the
      client take care of it during the actual closing.
      
      This is not a proper fix for the unlocking failure but it's safe and
      it seems to prevent the lock leakages and crashes the customer
      experiences.
      Signed-off-by: aaptel's avatarAurelien Aptel <aaptel@suse.com>
      Signed-off-by: 's avatarNeilBrown <neil@brown.name>
      Signed-off-by: 's avatarSteve French <stfrench@microsoft.com>
      Acked-by: 's avatarPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: 's avatarSasha Levin <sashal@kernel.org>
      fae38f28
    • zhangyi (F)'s avatar
      ext4: cleanup bh release code in ext4_ind_remove_space() · dc2b4d4a
      zhangyi (F) authored
      commit 5e86bdda upstream.
      
      Currently, we are releasing the indirect buffer where we are done with
      it in ext4_ind_remove_space(), so we can see the brelse() and
      BUFFER_TRACE() everywhere.  It seems fragile and hard to read, and we
      may probably forget to release the buffer some day.  This patch cleans
      up the code by putting of the code which releases the buffers to the
      end of the function.
      Signed-off-by: 's avatarzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Theodore Ts'o's avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: 's avatarJan Kara <jack@suse.cz>
      Cc: Jari Ruusu <jari.ruusu@gmail.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dc2b4d4a
  2. 03 Apr, 2019 14 commits
    • YueHaibing's avatar
      fs/proc/proc_sysctl.c: fix NULL pointer dereference in put_links · 79d8bdf3
      YueHaibing authored
      commit 23da9588 upstream.
      
      Syzkaller reports:
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN PTI
      CPU: 1 PID: 5373 Comm: syz-executor.0 Not tainted 5.0.0-rc8+ #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:put_links+0x101/0x440 fs/proc/proc_sysctl.c:1599
      Code: 00 0f 85 3a 03 00 00 48 8b 43 38 48 89 44 24 20 48 83 c0 38 48 89 c2 48 89 44 24 28 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 fe 02 00 00 48 8b 74 24 20 48 c7 c7 60 2a 9d 91
      RSP: 0018:ffff8881d828f238 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: ffff8881e01b1140 RCX: ffffffff8ee98267
      RDX: 0000000000000007 RSI: ffffc90001479000 RDI: ffff8881e01b1178
      RBP: dffffc0000000000 R08: ffffed103ee27259 R09: ffffed103ee27259
      R10: 0000000000000001 R11: ffffed103ee27258 R12: fffffffffffffff4
      R13: 0000000000000006 R14: ffff8881f59838c0 R15: dffffc0000000000
      FS:  00007f072254f700(0000) GS:ffff8881f7100000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fff8b286668 CR3: 00000001f0542002 CR4: 00000000007606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       drop_sysctl_table+0x152/0x9f0 fs/proc/proc_sysctl.c:1629
       get_subdir fs/proc/proc_sysctl.c:1022 [inline]
       __register_sysctl_table+0xd65/0x1090 fs/proc/proc_sysctl.c:1335
       br_netfilter_init+0xbc/0x1000 [br_netfilter]
       do_one_initcall+0xfa/0x5ca init/main.c:887
       do_init_module+0x204/0x5f6 kernel/module.c:3460
       load_module+0x66b2/0x8570 kernel/module.c:3808
       __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
       do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x462e99
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f072254ec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
      RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
      RDX: 0000000000000000 RSI: 0000000020000280 RDI: 0000000000000003
      RBP: 00007f072254ec70 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f072254f6bc
      R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004
      Modules linked in: br_netfilter(+) dvb_usb_dibusb_mc_common dib3000mc dibx000_common dvb_usb_dibusb_common dvb_usb_dw2102 dvb_usb classmate_laptop palmas_regulator cn videobuf2_v4l2 v4l2_common snd_soc_bd28623 mptbase snd_usb_usx2y snd_usbmidi_lib snd_rawmidi wmi libnvdimm lockd sunrpc grace rc_kworld_pc150u rc_core rtc_da9063 sha1_ssse3 i2c_cros_ec_tunnel adxl34x_spi adxl34x nfnetlink lib80211 i5500_temp dvb_as102 dvb_core videobuf2_common videodev media videobuf2_vmalloc videobuf2_memops udc_core lnbp22 leds_lp3952 hid_roccat_ryos s1d13xxxfb mtd vport_geneve openvswitch nf_conncount nf_nat_ipv6 nsh geneve udp_tunnel ip6_udp_tunnel snd_soc_mt6351 sis_agp phylink snd_soc_adau1761_spi snd_soc_adau1761 snd_soc_adau17x1 snd_soc_core snd_pcm_dmaengine ac97_bus snd_compress snd_soc_adau_utils snd_soc_sigmadsp_regmap snd_soc_sigmadsp raid_class hid_roccat_konepure hid_roccat_common hid_roccat c2port_duramar2150 core mdio_bcm_unimac iptable_security iptable_raw iptable_mangle
       iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim devlink vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel joydev mousedev ide_pci_generic piix aesni_intel aes_x86_64 ide_core crypto_simd atkbd cryptd glue_helper serio_raw ata_generic pata_acpi i2c_piix4 floppy sch_fq_codel ip_tables x_tables ipv6 [last unloaded: lm73]
      Dumping ftrace buffer:
         (ftrace buffer empty)
      ---[ end trace 770020de38961fd0 ]---
      
      A new dir entry can be created in get_subdir and its 'header->parent' is
      set to NULL.  Only after insert_header success, it will be set to 'dir',
      otherwise 'header->parent' is set to NULL and drop_sysctl_table is called.
      However in err handling path of get_subdir, drop_sysctl_table also be
      called on 'new->header' regardless its value of parent pointer.  Then
      put_links is called, which triggers NULL-ptr deref when access member of
      header->parent.
      
      In fact we have multiple error paths which call drop_sysctl_table() there,
      upon failure on insert_links() we also call drop_sysctl_table().And even
      in the successful case on __register_sysctl_table() we still always call
      drop_sysctl_table().This patch fix it.
      
      Link: http://lkml.kernel.org/r/20190314085527.13244-1-yuehaibing@huawei.com
      Fixes: 0e47c99d ("sysctl: Replace root_list with links between sysctl_table_sets")
      Signed-off-by: 's avatarYueHaibing <yuehaibing@huawei.com>
      Reported-by: 's avatarHulk Robot <hulkci@huawei.com>
      Acked-by: 's avatarLuis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>    [3.4+]
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      79d8bdf3
    • Darrick J. Wong's avatar
      ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock · 310891a8
      Darrick J. Wong authored
      commit e6a9467e upstream.
      
      ocfs2_reflink_inodes_lock() can swap the inode1/inode2 variables so that
      we always grab cluster locks in order of increasing inode number.
      
      Unfortunately, we forget to swap the inode record buffer head pointers
      when we've done this, which leads to incorrect bookkeepping when we're
      trying to make the two inodes have the same refcount tree.
      
      This has the effect of causing filesystem shutdowns if you're trying to
      reflink data from inode 100 into inode 97, where inode 100 already has a
      refcount tree attached and inode 97 doesn't.  The reflink code decides
      to copy the refcount tree pointer from 100 to 97, but uses inode 97's
      inode record to open the tree root (which it doesn't have) and blows up.
      This issue causes filesystem shutdowns and metadata corruption!
      
      Link: http://lkml.kernel.org/r/20190312214910.GK20533@magnolia
      Fixes: 29ac8e85 ("ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features")
      Signed-off-by: 's avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: 's avatarJoseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      310891a8
    • Tetsuo Handa's avatar
      fs/open.c: allow opening only regular files during execve() · f2391e67
      Tetsuo Handa authored
      commit 73601ea5 upstream.
      
      syzbot is hitting lockdep warning [1] due to trying to open a fifo
      during an execve() operation.  But we don't need to open non regular
      files during an execve() operation, for all files which we will need are
      the executable file itself and the interpreter programs like /bin/sh and
      ld-linux.so.2 .
      
      Since the manpage for execve(2) says that execve() returns EACCES when
      the file or a script interpreter is not a regular file, and the manpage
      for uselib(2) says that uselib() can return EACCES, and we use
      FMODE_EXEC when opening for execve()/uselib(), we can bail out if a non
      regular file is requested with FMODE_EXEC set.
      
      Since this deadlock followed by khungtaskd warnings is trivially
      reproducible by a local unprivileged user, and syzbot's frequent crash
      due to this deadlock defers finding other bugs, let's workaround this
      deadlock until we get a chance to find a better solution.
      
      [1] https://syzkaller.appspot.com/bug?id=b5095bfec44ec84213bac54742a82483aad578ce
      
      Link: http://lkml.kernel.org/r/1552044017-7890-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jpReported-by: 's avatarsyzbot <syzbot+e93a80c1bb7c5c56e522461c149f8bf55eab1b2b@syzkaller.appspotmail.com>
      Fixes: 8924feff ("splice: lift pipe_lock out of splice_to_pipe()")
      Signed-off-by: 's avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: 's avatarKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: <stable@vger.kernel.org>	[4.9+]
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f2391e67
    • Olga Kornievskaia's avatar
      NFSv4.1 don't free interrupted slot on open · 01105243
      Olga Kornievskaia authored
      commit 0cb98abb upstream.
      
      Allow the async rpc task for finish and update the open state if needed,
      then free the slot. Otherwise, the async rpc unable to decode the reply.
      Signed-off-by: 's avatarOlga Kornievskaia <kolga@netapp.com>
      Fixes: ae55e59d ("pnfs: Don't release the sequence slot...")
      Cc: stable@vger.kernel.org # v4.18+
      Signed-off-by: 's avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      01105243
    • NeilBrown's avatar
      NFS: fix mount/umount race in nlmclnt. · e37c15d7
      NeilBrown authored
      commit 4a9be28c upstream.
      
      If the last NFSv3 unmount from a given host races with a mount from the
      same host, we can destroy an nlm_host that is still in use.
      
      Specifically nlmclnt_lookup_host() can increment h_count on
      an nlm_host that nlmclnt_release_host() has just successfully called
      refcount_dec_and_test() on.
      Once nlmclnt_lookup_host() drops the mutex, nlm_destroy_host_lock()
      will be called to destroy the nlmclnt which is now in use again.
      
      The cause of the problem is that the dec_and_test happens outside the
      locked region.  This is easily fixed by using
      refcount_dec_and_mutex_lock().
      
      Fixes: 8ea6ecc8 ("lockd: Create client-side nlm_host cache")
      Cc: stable@vger.kernel.org (v2.6.38+)
      Signed-off-by: 's avatarNeilBrown <neilb@suse.com>
      Signed-off-by: 's avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e37c15d7
    • Catalin Marinas's avatar
      NFS: Fix nfs4_lock_state refcounting in nfs4_alloc_{lock,unlock}data() · 7a4cdaf9
      Catalin Marinas authored
      commit 3028efe0 upstream.
      
      Commit 7b587e1a ("NFS: use locks_copy_lock() to copy locks.")
      changed the lock copying from memcpy() to the dedicated
      locks_copy_lock() function. The latter correctly increments the
      nfs4_lock_state.ls_count via nfs4_fl_copy_lock(), however, this refcount
      has already been incremented in the nfs4_alloc_{lock,unlock}data().
      Kmemleak subsequently reports an unreferenced nfs4_lock_state object as
      below (arm64 platform):
      
      unreferenced object 0xffff8000fce0b000 (size 256):
        comm "systemd-sysuser", pid 1608, jiffies 4294892825 (age 32.348s)
        hex dump (first 32 bytes):
          20 57 4c fb 00 80 ff ff 20 57 4c fb 00 80 ff ff   WL..... WL.....
          00 57 4c fb 00 80 ff ff 01 00 00 00 00 00 00 00  .WL.............
        backtrace:
          [<000000000d15010d>] kmem_cache_alloc+0x178/0x208
          [<00000000d7c1d264>] nfs4_set_lock_state+0x124/0x1f0
          [<000000009c867628>] nfs4_proc_lock+0x90/0x478
          [<000000001686bd74>] do_setlk+0x64/0xe8
          [<00000000e01500d4>] nfs_lock+0xe8/0x1f0
          [<000000004f387d8d>] vfs_lock_file+0x18/0x40
          [<00000000656ab79b>] do_lock_file_wait+0x68/0xf8
          [<00000000f17c4a4b>] fcntl_setlk+0x224/0x280
          [<0000000052a242c6>] do_fcntl+0x418/0x730
          [<000000004f47291a>] __arm64_sys_fcntl+0x84/0xd0
          [<00000000d6856e01>] el0_svc_common+0x80/0xf0
          [<000000009c4bd1df>] el0_svc_handler+0x2c/0x80
          [<00000000b1a0d479>] el0_svc+0x8/0xc
          [<0000000056c62a0f>] 0xffffffffffffffff
      
      This patch removes the original refcount_inc(&lsp->ls_count) that was
      paired with the memcpy() lock copying.
      
      Fixes: 7b587e1a ("NFS: use locks_copy_lock() to copy locks.")
      Cc: <stable@vger.kernel.org> # 5.0.x-
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: 's avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: 's avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7a4cdaf9
    • Jeff Layton's avatar
      locks: wake any locks blocked on request before deadlock check · 7bcb0024
      Jeff Layton authored
      commit 945ab8f6 upstream.
      
      Andreas reported that he was seeing the tdbtorture test fail in some
      cases with -EDEADLCK when it wasn't before. Some debugging showed that
      deadlock detection was sometimes discovering the caller's lock request
      itself in a dependency chain.
      
      While we remove the request from the blocked_lock_hash prior to
      reattempting to acquire it, any locks that are blocked on that request
      will still be present in the hash and will still have their fl_blocker
      pointer set to the current request.
      
      This causes posix_locks_deadlock to find a deadlock dependency chain
      when it shouldn't, as a lock request cannot block itself.
      
      We are going to end up waking all of those blocked locks anyway when we
      go to reinsert the request back into the blocked_lock_hash, so just do
      it prior to checking for deadlocks. This ensures that any lock blocked
      on the current request will no longer be part of any blocked request
      chain.
      
      URL: https://bugzilla.kernel.org/show_bug.cgi?id=202975
      Fixes: 5946c431 ("fs/locks: allow a lock request to block other requests.")
      Cc: stable@vger.kernel.org
      Reported-by: 's avatarAndreas Schneider <asn@redhat.com>
      Signed-off-by: 's avatarNeil Brown <neilb@suse.com>
      Signed-off-by: 's avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7bcb0024
    • Filipe Manana's avatar
      Btrfs: fix assertion failure on fsync with NO_HOLES enabled · 3ba84d2d
      Filipe Manana authored
      commit 0ccc3876 upstream.
      
      Back in commit a89ca6f2 ("Btrfs: fix fsync after truncate when
      no_holes feature is enabled") I added an assertion that is triggered when
      an inline extent is found to assert that the length of the (uncompressed)
      data the extent represents is the same as the i_size of the inode, since
      that is true most of the time I couldn't find or didn't remembered about
      any exception at that time. Later on the assertion was expanded twice to
      deal with a case of a compressed inline extent representing a range that
      matches the sector size followed by an expanding truncate, and another
      case where fallocate can update the i_size of the inode without adding
      or updating existing extents (if the fallocate range falls entirely within
      the first block of the file). These two expansion/fixes of the assertion
      were done by commit 7ed586d0 ("Btrfs: fix assertion on fsync of
      regular file when using no-holes feature") and commit 6399fb5a
      ("Btrfs: fix assertion failure during fsync in no-holes mode").
      These however missed the case where an falloc expands the i_size of an
      inode to exactly the sector size and inline extent exists, for example:
      
       $ mkfs.btrfs -f -O no-holes /dev/sdc
       $ mount /dev/sdc /mnt
      
       $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar
       wrote 1096/1096 bytes at offset 0
       1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec)
      
       $ xfs_io -c "falloc 1096 3000" /mnt/foobar
       $ xfs_io -c "fsync" /mnt/foobar
       Segmentation fault
      
       $ dmesg
       [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727
       [701253.602962] ------------[ cut here ]------------
       [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533!
       [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G        W         5.0.0-rc8-btrfs-next-45 #1
       [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs]
       (...)
       [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286
       [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000
       [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868
       [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000
       [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18
       [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80
       [701253.607769] FS:  00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000
       [701253.608163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0
       [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [701253.609608] Call Trace:
       [701253.609994]  btrfs_log_inode+0xdfb/0xe40 [btrfs]
       [701253.610383]  btrfs_log_inode_parent+0x2be/0xa60 [btrfs]
       [701253.610770]  ? do_raw_spin_unlock+0x49/0xc0
       [701253.611150]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [701253.611537]  btrfs_sync_file+0x3b2/0x440 [btrfs]
       [701253.612010]  ? do_sysinfo+0xb0/0xf0
       [701253.612552]  do_fsync+0x38/0x60
       [701253.612988]  __x64_sys_fsync+0x10/0x20
       [701253.613360]  do_syscall_64+0x60/0x1b0
       [701253.613733]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [701253.614103] RIP: 0033:0x7f14da4e66d0
       (...)
       [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0
       [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003
       [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010
       [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260
       [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10
       (...)
       [701253.619941] ---[ end trace e088d74f132b6da5 ]---
      
      Updating the assertion again to allow for this particular case would result
      in a meaningless assertion, plus there is currently no risk of logging
      content that would result in any corruption after a log replay if the size
      of the data encoded in an inline extent is greater than the inode's i_size
      (which is not currently possibe either with or without compression),
      therefore just remove the assertion.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ba84d2d
    • Nikolay Borisov's avatar
      btrfs: Avoid possible qgroup_rsv_size overflow in btrfs_calculate_inode_block_rsv_size · 84104398
      Nikolay Borisov authored
      commit 139a5617 upstream.
      
      qgroup_rsv_size is calculated as the product of
      outstanding_extent * fs_info->nodesize. The product is calculated with
      32 bit precision since both variables are defined as u32. Yet
      qgroup_rsv_size expects a 64 bit result.
      
      Avoid possible multiplication overflow by casting outstanding_extent to
      u64. Such overflow would in the worst case (64K nodesize) require more
      than 65536 extents, which is quite large and i'ts not likely that it
      would happen in practice.
      
      Fixes-coverity-id: 1435101
      Fixes: ff6bc37e ("btrfs: qgroup: Use independent and accurate per inode qgroup rsv")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: 's avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: 's avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      84104398
    • Nikolay Borisov's avatar
      btrfs: Fix bound checking in qgroup_trace_new_subtree_blocks · e3a60563
      Nikolay Borisov authored
      commit 7ff2c2a1 upstream.
      
      If 'cur_level' is 7  then the bound checking at the top of the function
      will actually pass. Later on, it's possible to dereference
      ds_path->nodes[cur_level+1] which will be an out of bounds.
      
      The correct check will be cur_level >= BTRFS_MAX_LEVEL - 1 .
      
      Fixes-coverty-id: 1440918
      Fixes-coverty-id: 1440911
      Fixes: ea49f3e7 ("btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree")
      CC: stable@vger.kernel.org # 4.20+
      Reviewed-by: 's avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: 's avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e3a60563
    • Andrea Righi's avatar
      btrfs: raid56: properly unmap parity page in finish_parity_scrub() · 4a0584a2
      Andrea Righi authored
      commit 3897b6f0 upstream.
      
      Parity page is incorrectly unmapped in finish_parity_scrub(), triggering
      a reference counter bug on i386, i.e.:
      
       [ 157.662401] kernel BUG at mm/highmem.c:349!
       [ 157.666725] invalid opcode: 0000 [#1] SMP PTI
      
      The reason is that kunmap(p_page) was completely left out, so we never
      did an unmap for the p_page and the loop unmapping the rbio page was
      iterating over the wrong number of stripes: unmapping should be done
      with nr_data instead of rbio->real_stripes.
      
      Test case to reproduce the bug:
      
       - create a raid5 btrfs filesystem:
         # mkfs.btrfs -m raid5 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde
      
       - mount it:
         # mount /dev/sdb /mnt
      
       - run btrfs scrub in a loop:
         # while :; do btrfs scrub start -BR /mnt; done
      
      BugLink: https://bugs.launchpad.net/bugs/1812845
      Fixes: 5a6ac9ea ("Btrfs, raid56: support parity scrub on raid56")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: 's avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: 's avatarAndrea Righi <andrea.righi@canonical.com>
      Reviewed-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a0584a2
    • David Sterba's avatar
      btrfs: don't report readahead errors and don't update statistics · da2dea63
      David Sterba authored
      commit 0cc068e6 upstream.
      
      As readahead is an optimization, all errors are usually filtered out,
      but still properly handled when the real read call is done. The commit
      5e9d3982 ("btrfs: readpages() should submit IO as read-ahead") added
      REQ_RAHEAD to readpages() because that's only used for readahead
      (despite what one would expect from the callback name).
      
      This causes a flood of messages and inflated read error stats, so skip
      reporting in case it's readahead.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202403Reported-by: 's avatarLimeTech <tomm@lime-technology.com>
      Fixes: 5e9d3982 ("btrfs: readpages() should submit IO as read-ahead")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      da2dea63
    • Josef Bacik's avatar
      btrfs: remove WARN_ON in log_dir_items · 70c88bf9
      Josef Bacik authored
      commit 2cc83342 upstream.
      
      When Filipe added the recursive directory logging stuff in
      2f2ff0ee ("Btrfs: fix metadata inconsistencies after directory
      fsync") he specifically didn't take the directory i_mutex for the
      children directories that we need to log because of lockdep.  This is
      generally fine, but can lead to this WARN_ON() tripping if we happen to
      run delayed deletion's in between our first search and our second search
      of dir_item/dir_indexes for this directory.  We expect this to happen,
      so the WARN_ON() isn't necessary.  Drop the WARN_ON() and add a comment
      so we know why this case can happen.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: 's avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      70c88bf9
    • Filipe Manana's avatar
      Btrfs: fix incorrect file size after shrinking truncate and fsync · ab0600d4
      Filipe Manana authored
      commit bf504110 upstream.
      
      If we do a shrinking truncate against an inode which is already present
      in the respective log tree and then rename it, as part of logging the new
      name we end up logging an inode item that reflects the old size of the
      file (the one which we previously logged) and not the new smaller size.
      The decision to preserve the size previously logged was added by commit
      1a4bcf47 ("Btrfs: fix fsync data loss after adding hard link to
      inode") in order to avoid data loss after replaying the log. However that
      decision is only needed for the case the logged inode size is smaller then
      the current size of the inode, as explained in that commit's change log.
      If the current size of the inode is smaller then the previously logged
      size, we know a shrinking truncate happened and therefore need to use
      that smaller size.
      
      Example to trigger the problem:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
        $ xfs_io -c "fsync" /mnt/foo
        $ xfs_io -c "truncate 3000" /mnt/foo
      
        $ mv /mnt/foo /mnt/bar
        $ xfs_io -c "fsync" /mnt/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/bar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0008000
      
      Once we rename the file, we log its name (and inode item), and because
      the inode was already logged before in the current transaction, we log it
      with a size of 8000 bytes because that is the size we previously logged
      (with the first fsync). As part of the rename, besides logging the inode,
      we do also sync the log, which is done since commit d4682ba0
      ("Btrfs: sync log after logging new name"), so the next fsync against our
      inode is effectively a no-op, since no new changes happened since the
      rename operation. Even if did not sync the log during the rename
      operation, the same problem (fize size of 8000 bytes instead of 3000
      bytes) would be visible after replaying the log if the log ended up
      getting synced to disk through some other means, such as for example by
      fsyncing some other modified file. In the example above the fsync after
      the rename operation is there just because not every filesystem may
      guarantee logging/journalling the inode (and syncing the log/journal)
      during the rename operation, for example it is needed for f2fs, but not
      for ext4 and xfs.
      
      Fix this scenario by, when logging a new name (which is triggered by
      rename and link operations), using the current size of the inode instead
      of the previously logged inode size.
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: 's avatarSeulbae Kim <seulbae@gatech.edu>
      Signed-off-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: 's avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ab0600d4
  3. 27 Mar, 2019 4 commits
    • Linus Torvalds's avatar
      aio: simplify - and fix - fget/fput for io_submit() · a179695e
      Linus Torvalds authored
      commit 84c4e1f8 upstream.
      
      Al Viro root-caused a race where the IOCB_CMD_POLL handling of
      fget/fput() could cause us to access the file pointer after it had
      already been freed:
      
       "In more details - normally IOCB_CMD_POLL handling looks so:
      
         1) io_submit(2) allocates aio_kiocb instance and passes it to
            aio_poll()
      
         2) aio_poll() resolves the descriptor to struct file by req->file =
            fget(iocb->aio_fildes)
      
         3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
            aio_kiocb to 2 (bumps by 1, that is).
      
         4) aio_poll() calls vfs_poll(). After sanity checks (basically,
            "poll_wait() had been called and only once") it locks the queue.
            That's what the extra reference to iocb had been for - we know we
            can safely access it.
      
         5) With queue locked, we check if ->woken has already been set to
            true (by aio_poll_wake()) and, if it had been, we unlock the
            queue, drop a reference to aio_kiocb and bugger off - at that
            point it's a responsibility to aio_poll_wake() and the stuff
            called/scheduled by it. That code will drop the reference to file
            in req->file, along with the other reference to our aio_kiocb.
      
         6) otherwise, we see whether we need to wait. If we do, we unlock the
            queue, drop one reference to aio_kiocb and go away - eventual
            wakeup (or cancel) will deal with the reference to file and with
            the other reference to aio_kiocb
      
         7) otherwise we remove ourselves from waitqueue (still under the
            queue lock), so that wakeup won't get us. No async activity will
            be happening, so we can safely drop req->file and iocb ourselves.
      
        If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
        won't get freed under us, so we can do all the checks and locking
        safely. And we don't touch ->file if we detect that case.
      
        However, vfs_poll() most certainly *does* touch the file it had been
        given. So wakeup coming while we are still in ->poll() might end up
        doing fput() on that file. That case is not too rare, and usually we
        are saved by the still present reference from descriptor table - that
        fput() is not the final one.
      
        But if another thread closes that descriptor right after our fget()
        and wakeup does happen before ->poll() returns, we are in trouble -
        final fput() done while we are in the middle of a method:
      
      Al also wrote a patch to take an extra reference to the file descriptor
      to fix this, but I instead suggested we just streamline the whole file
      pointer handling by submit_io() so that the generic aio submission code
      simply keeps the file pointer around until the aio has completed.
      
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Acked-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a179695e
    • Chao Yu's avatar
      f2fs: fix to avoid deadlock of atomic file operations · 1c0fc5e9
      Chao Yu authored
      commit 48432984 upstream.
      
      Thread A				Thread B
      - __fput
       - f2fs_release_file
        - drop_inmem_pages
         - mutex_lock(&fi->inmem_lock)
         - __revoke_inmem_pages
          - lock_page(page)
      					- open
      					- f2fs_setattr
      					- truncate_setsize
      					 - truncate_inode_pages_range
      					  - lock_page(page)
      					  - truncate_cleanup_page
      					   - f2fs_invalidate_page
      					    - drop_inmem_page
      					    - mutex_lock(&fi->inmem_lock);
      
      We may encounter above ABBA deadlock as reported by Kyungtae Kim:
      
      I'm reporting a bug in linux-4.17.19: "INFO: task hung in
      drop_inmem_page" (no reproducer)
      
      I think this might be somehow related to the following:
      https://groups.google.com/forum/#!searchin/syzkaller-bugs/INFO$3A$20task$20hung$20in$20%7Csort:date/syzkaller-bugs/c6soBTrdaIo/AjAzPeIzCgAJ
      
      =========================================
      INFO: task syz-executor7:10822 blocked for more than 120 seconds.
            Not tainted 4.17.19 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor7   D27024 10822   6346 0x00000004
      Call Trace:
       context_switch kernel/sched/core.c:2867 [inline]
       __schedule+0x721/0x1e60 kernel/sched/core.c:3515
       schedule+0x88/0x1c0 kernel/sched/core.c:3559
       schedule_preempt_disabled+0x18/0x30 kernel/sched/core.c:3617
       __mutex_lock_common kernel/locking/mutex.c:833 [inline]
       __mutex_lock+0x5bd/0x1410 kernel/locking/mutex.c:893
       mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:908
       drop_inmem_page+0xcb/0x810 fs/f2fs/segment.c:327
       f2fs_invalidate_page+0x337/0x5e0 fs/f2fs/data.c:2401
       do_invalidatepage mm/truncate.c:165 [inline]
       truncate_cleanup_page+0x261/0x330 mm/truncate.c:187
       truncate_inode_pages_range+0x552/0x1610 mm/truncate.c:367
       truncate_inode_pages mm/truncate.c:478 [inline]
       truncate_pagecache+0x6d/0x90 mm/truncate.c:801
       truncate_setsize+0x81/0xa0 mm/truncate.c:826
       f2fs_setattr+0x44f/0x1270 fs/f2fs/file.c:781
       notify_change+0xa62/0xe80 fs/attr.c:313
       do_truncate+0x12e/0x1e0 fs/open.c:63
       do_last fs/namei.c:2955 [inline]
       path_openat+0x2042/0x29f0 fs/namei.c:3505
       do_filp_open+0x1bd/0x2c0 fs/namei.c:3540
       do_sys_open+0x35e/0x4e0 fs/open.c:1101
       __do_sys_open fs/open.c:1119 [inline]
       __se_sys_open fs/open.c:1114 [inline]
       __x64_sys_open+0x89/0xc0 fs/open.c:1114
       do_syscall_64+0xc4/0x4e0 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4497b9
      RSP: 002b:00007f734e459c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
      RAX: ffffffffffffffda RBX: 00007f734e45a6cc RCX: 00000000004497b9
      RDX: 0000000000000104 RSI: 00000000000a8280 RDI: 0000000020000080
      RBP: 000000000071bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 0000000000007230 R14: 00000000006f02d0 R15: 00007f734e45a700
      INFO: task syz-executor7:10858 blocked for more than 120 seconds.
            Not tainted 4.17.19 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor7   D28880 10858   6346 0x00000004
      Call Trace:
       context_switch kernel/sched/core.c:2867 [inline]
       __schedule+0x721/0x1e60 kernel/sched/core.c:3515
       schedule+0x88/0x1c0 kernel/sched/core.c:3559
       __rwsem_down_write_failed_common kernel/locking/rwsem-xadd.c:565 [inline]
       rwsem_down_write_failed+0x5e6/0xc90 kernel/locking/rwsem-xadd.c:594
       call_rwsem_down_write_failed+0x17/0x30 arch/x86/lib/rwsem.S:117
       __down_write arch/x86/include/asm/rwsem.h:142 [inline]
       down_write+0x58/0xa0 kernel/locking/rwsem.c:72
       inode_lock include/linux/fs.h:713 [inline]
       do_truncate+0x120/0x1e0 fs/open.c:61
       do_last fs/namei.c:2955 [inline]
       path_openat+0x2042/0x29f0 fs/namei.c:3505
       do_filp_open+0x1bd/0x2c0 fs/namei.c:3540
       do_sys_open+0x35e/0x4e0 fs/open.c:1101
       __do_sys_open fs/open.c:1119 [inline]
       __se_sys_open fs/open.c:1114 [inline]
       __x64_sys_open+0x89/0xc0 fs/open.c:1114
       do_syscall_64+0xc4/0x4e0 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4497b9
      RSP: 002b:00007f734e3b4c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
      RAX: ffffffffffffffda RBX: 00007f734e3b56cc RCX: 00000000004497b9
      RDX: 0000000000000104 RSI: 00000000000a8280 RDI: 0000000020000080
      RBP: 000000000071c238 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 0000000000007230 R14: 00000000006f02d0 R15: 00007f734e3b5700
      INFO: task syz-executor5:10829 blocked for more than 120 seconds.
            Not tainted 4.17.19 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor5   D28760 10829   6308 0x80000002
      Call Trace:
       context_switch kernel/sched/core.c:2867 [inline]
       __schedule+0x721/0x1e60 kernel/sched/core.c:3515
       schedule+0x88/0x1c0 kernel/sched/core.c:3559
       io_schedule+0x21/0x80 kernel/sched/core.c:5179
       wait_on_page_bit_common mm/filemap.c:1100 [inline]
       __lock_page+0x2b5/0x390 mm/filemap.c:1273
       lock_page include/linux/pagemap.h:483 [inline]
       __revoke_inmem_pages+0xb35/0x11c0 fs/f2fs/segment.c:231
       drop_inmem_pages+0xa3/0x3e0 fs/f2fs/segment.c:306
       f2fs_release_file+0x2c7/0x330 fs/f2fs/file.c:1556
       __fput+0x2c7/0x780 fs/file_table.c:209
       ____fput+0x1a/0x20 fs/file_table.c:243
       task_work_run+0x151/0x1d0 kernel/task_work.c:113
       exit_task_work include/linux/task_work.h:22 [inline]
       do_exit+0x8ba/0x30a0 kernel/exit.c:865
       do_group_exit+0x13b/0x3a0 kernel/exit.c:968
       get_signal+0x6bb/0x1650 kernel/signal.c:2482
       do_signal+0x84/0x1b70 arch/x86/kernel/signal.c:810
       exit_to_usermode_loop+0x155/0x190 arch/x86/entry/common.c:162
       prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
       syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
       do_syscall_64+0x445/0x4e0 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4497b9
      RSP: 002b:00007f1c68e74ce8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
      RAX: fffffffffffffe00 RBX: 000000000071bf80 RCX: 00000000004497b9
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000071bf80
      RBP: 000000000071bf80 R08: 0000000000000000 R09: 000000000071bf58
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000000 R14: 00007f1c68e759c0 R15: 00007f1c68e75700
      
      This patch tries to use trylock_page to mitigate such deadlock condition
      for fix.
      Signed-off-by: 's avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: 's avatarJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c0fc5e9
    • zhangyi (F)'s avatar
      ext4: brelse all indirect buffer in ext4_ind_remove_space() · c29313c0
      zhangyi (F) authored
      commit 674a2b27 upstream.
      
      All indirect buffers get by ext4_find_shared() should be released no
      mater the branch should be freed or not. But now, we forget to release
      the lower depth indirect buffers when removing space from the same
      higher depth indirect block. It will lead to buffer leak and futher
      more, it may lead to quota information corruption when using old quota,
      consider the following case.
      
       - Create and mount an empty ext4 filesystem without extent and quota
         features,
       - quotacheck and enable the user & group quota,
       - Create some files and write some data to them, and then punch hole
         to some files of them, it may trigger the buffer leak problem
         mentioned above.
       - Disable quota and run quotacheck again, it will create two new
         aquota files and write the checked quota information to them, which
         probably may reuse the freed indirect block(the buffer and page
         cache was not freed) as data block.
       - Enable quota again, it will invoke
         vfs_load_quota_inode()->invalidate_bdev() to try to clean unused
         buffers and pagecache. Unfortunately, because of the buffer of quota
         data block is still referenced, quota code cannot read the up to date
         quota info from the device and lead to quota information corruption.
      
      This problem can be reproduced by xfstests generic/231 on ext3 file
      system or ext4 file system without extent and quota features.
      
      This patch fix this problem by releasing the missing indirect buffers,
      in ext4_ind_remove_space().
      Reported-by: 's avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: 's avatarzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Theodore Ts'o's avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: 's avatarJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c29313c0
    • Lukas Czerner's avatar
      ext4: fix data corruption caused by unaligned direct AIO · f1902fd0
      Lukas Czerner authored
      commit 372a03e0 upstream.
      
      Ext4 needs to serialize unaligned direct AIO because the zeroing of
      partial blocks of two competing unaligned AIOs can result in data
      corruption.
      
      However it decides not to serialize if the potentially unaligned aio is
      past i_size with the rationale that no pending writes are possible past
      i_size. Unfortunately if the i_size is not block aligned and the second
      unaligned write lands past i_size, but still into the same block, it has
      the potential of corrupting the previous unaligned write to the same
      block.
      
      This is (very simplified) reproducer from Frank
      
          // 41472 = (10 * 4096) + 512
          // 37376 = 41472 - 4096
      
          ftruncate(fd, 41472);
          io_prep_pwrite(iocbs[0], fd, buf[0], 4096, 37376);
          io_prep_pwrite(iocbs[1], fd, buf[1], 4096, 41472);
      
          io_submit(io_ctx, 1, &iocbs[1]);
          io_submit(io_ctx, 1, &iocbs[2]);
      
          io_getevents(io_ctx, 2, 2, events, NULL);
      
      Without this patch the 512B range from 40960 up to the start of the
      second unaligned write (41472) is going to be zeroed overwriting the data
      written by the first write. This is a data corruption.
      
      00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      00009200  30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30
      *
      0000a000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      0000a200  31 31 31 31 31 31 31 31  31 31 31 31 31 31 31 31
      
      With this patch the data corruption is avoided because we will recognize
      the unaligned_aio and wait for the unwritten extent conversion.
      
      00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
      *
      00009200  30 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30
      *
      0000a200  31 31 31 31 31 31 31 31  31 31 31 31 31 31 31 31
      *
      0000b200
      Reported-by: 's avatarFrank Sorenson <fsorenso@redhat.com>
      Signed-off-by: 's avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o's avatarTheodore Ts'o <tytso@mit.edu>
      Fixes: e9e3bcec ("ext4: serialize unaligned asynchronous DIO")
      Cc: stable@vger.kernel.org
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f1902fd0