1. 12 Feb, 2019 1 commit
    • RDMA/core: Sync unregistration with netlink commands · f9acb020
      Parav Pandit authored
      [ Upstream commit 01b67117 ]
      
      When an RDMA device is being removed, a netlink request for resource info
      can race with device removal, as below:
      
            CPU-0                                  CPU-1
          --------                               --------
          rdma_nl_rcv_msg()
             nldev_res_get_cq_dumpit()
                mutex_lock(device_lock);
                get device reference
                mutex_unlock(device_lock);        [..]
                                                  ib_unregister_device()
                                                  /* Valid reference to
                                                   * device->dev exists.
                                                   */
                                                   ib_dealloc_device()
      
                [..]
                provider->fill_res_entry();
      
      Even though the device object is not freed, fill_res_entry() can get called
      on a device which doesn't have a driver anymore. The kernel core device
      reference count is not sufficient, as it only keeps the structure valid and
      doesn't guarantee that the driver is still loaded.
      
      A similar race can occur between device renaming and device removal, where
      device_rename() tries to rename an unregistered device. While this is fine
      for devices of a class which is not net namespace aware, it is incorrect
      for the net namespace aware class coming in a subsequent series.  If a
      class is net namespace aware, the call trace in [1] below is observed in
      this situation.
      
      Therefore, to avoid the race, keep a reference count and let device
      unregistration wait until all netlink users drop the reference.
      
      [1] Call trace:
      kernfs: ns required in 'infiniband' for 'mlx5_0'
      WARNING: CPU: 18 PID: 44270 at fs/kernfs/dir.c:842 kernfs_find_ns+0x104/0x120
      libahci i2c_core mlxfw libata dca [last unloaded: devlink]
      RIP: 0010:kernfs_find_ns+0x104/0x120
      Call Trace:
      kernfs_find_and_get_ns+0x2e/0x50
      sysfs_rename_link_ns+0x40/0xb0
      device_rename+0xb2/0xf0
      ib_device_rename+0xb3/0x100 [ib_core]
      nldev_set_doit+0x165/0x190 [ib_core]
      rdma_nl_rcv_msg+0x249/0x250 [ib_core]
      ? netlink_deliver_tap+0x8f/0x3e0
      rdma_nl_rcv+0xd6/0x120 [ib_core]
      netlink_unicast+0x17c/0x230
      netlink_sendmsg+0x2f0/0x3e0
      sock_sendmsg+0x30/0x40
      __sys_sendto+0xdc/0x160
      
      Fixes: da5c8507 ("RDMA/nldev: add driver-specific resource tracking")
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
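The fix described in the commit above (keep a usage count and let unregistration wait until it drains) can be sketched in userspace C. This is a minimal illustration, not the kernel patch: struct demo_device and the demo_device_* functions are hypothetical names, and a pthread mutex and condition variable are assumed stand-ins for whatever primitives the kernel code uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a registered device. */
struct demo_device {
    pthread_mutex_t lock;
    pthread_cond_t  idle;       /* signalled when 'users' drops to zero */
    unsigned int    users;      /* command handlers currently in flight */
    bool            registered; /* cleared once unregistration begins   */
};

void demo_device_init(struct demo_device *dev)
{
    pthread_mutex_init(&dev->lock, NULL);
    pthread_cond_init(&dev->idle, NULL);
    dev->users = 0;
    dev->registered = true;
}

/* Take a usage reference; fails once unregistration has started. */
bool demo_device_try_get(struct demo_device *dev)
{
    bool ok;

    pthread_mutex_lock(&dev->lock);
    ok = dev->registered;
    if (ok)
        dev->users++;
    pthread_mutex_unlock(&dev->lock);
    return ok;
}

/* Drop a usage reference and wake a waiting unregister call if last. */
void demo_device_put(struct demo_device *dev)
{
    pthread_mutex_lock(&dev->lock);
    assert(dev->users > 0);
    if (--dev->users == 0)
        pthread_cond_broadcast(&dev->idle);
    pthread_mutex_unlock(&dev->lock);
}

/* Refuse new users, then block until all existing users are gone. */
void demo_device_unregister(struct demo_device *dev)
{
    pthread_mutex_lock(&dev->lock);
    dev->registered = false;
    while (dev->users > 0)
        pthread_cond_wait(&dev->idle, &dev->lock);
    pthread_mutex_unlock(&dev->lock);
}
```

In this sketch a command handler would call demo_device_try_get() before invoking any driver callback and demo_device_put() afterwards, so demo_device_unregister() returns only when no handler can still reach the driver.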
  2. 06 Feb, 2019 2 commits
    • IB/uverbs: Fix OOPs in uverbs_user_mmap_disassociate · dfccdac6
      Yishai Hadas authored
      commit 7b21b69a upstream.
      
      The vma->vm_mm can become impossible to get before rdma_umap_close() is
      called. In this case we must not try to get an mm that is already
      undergoing process exit, and there is no need to wait for anything, as the
      VMA will be destroyed by another thread soon and is already effectively
      'unreachable' by userspace.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
       PGD 800000012bc50067 P4D 800000012bc50067 PUD 129db5067 PMD 0
       Oops: 0000 [#1] SMP PTI
       CPU: 1 PID: 2050 Comm: bash Tainted: G        W  OE 4.20.0-rc6+ #3
       Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
       RIP: 0010:__rb_erase_color+0xb9/0x280
       Code: 84 17 01 00 00 48 3b 68 10 0f 84 15 01 00 00 48 89
                     58 08 48 89 de 48 89 ef 4c 89 e3 e8 90 84 22 00 e9 60 ff ff ff 48 8b 5d
                     10 <f6> 03 01 0f 84 9c 00 00 00 48 8b 43 10 48 85 c0 74 09 f6 00 01 0f
       RSP: 0018:ffffbecfc090bab8 EFLAGS: 00010246
       RAX: ffff97616346cf30 RBX: 0000000000000000 RCX: 0000000000000101
       RDX: 0000000000000000 RSI: ffff97623b6ca828 RDI: ffff97621ef10828
       RBP: ffff97621ef10828 R08: ffff97621ef10828 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff97623b6ca838
       R13: ffffffffbb3fef50 R14: ffff97623b6ca828 R15: 0000000000000000
       FS:  00007f7a5c31d740(0000) GS:ffff97623bb00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 000000011255a000 CR4: 00000000000006e0
       Call Trace:
        unlink_file_vma+0x3b/0x50
        free_pgtables+0xa1/0x110
        exit_mmap+0xca/0x1a0
        ? mlx5_ib_dealloc_pd+0x28/0x30 [mlx5_ib]
        mmput+0x54/0x140
        uverbs_user_mmap_disassociate+0xcc/0x160 [ib_uverbs]
        uverbs_destroy_ufile_hw+0xf7/0x120 [ib_uverbs]
        ib_uverbs_remove_one+0xea/0x240 [ib_uverbs]
        ib_unregister_device+0xfb/0x200 [ib_core]
        mlx5_ib_remove+0x51/0xe0 [mlx5_ib]
        mlx5_remove_device+0xc1/0xd0 [mlx5_core]
        mlx5_unregister_device+0x3d/0xb0 [mlx5_core]
        remove_one+0x2a/0x90 [mlx5_core]
        pci_device_remove+0x3b/0xc0
        device_release_driver_internal+0x16d/0x240
        unbind_store+0xb2/0x100
        kernfs_fop_write+0x102/0x180
        __vfs_write+0x36/0x1a0
        ? __alloc_fd+0xa9/0x170
        ? set_close_on_exec+0x49/0x70
        vfs_write+0xad/0x1a0
        ksys_write+0x52/0xc0
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Cc: <stable@vger.kernel.org> # 4.19
      Fixes: 5f9794dc ("RDMA/ucontext: Add a core API for mmaping driver IO memory")
      Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
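The rule stated in the commit above (do not take a reference on an mm that is already exiting) is an instance of the "get unless zero" reference pattern. The sketch below shows that pattern with C11 atomics. struct demo_mm, demo_get_not_zero() and demo_put() are hypothetical names for illustration; the kernel has its own helper for this, mmget_not_zero(), and the commit message does not say which call the patch uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object whose user count may hit zero. */
struct demo_mm {
    atomic_int users;
};

/* Take a reference only if the count has not already dropped to zero. */
bool demo_get_not_zero(struct demo_mm *mm)
{
    int old = atomic_load(&mm->users);

    while (old != 0) {
        /* On failure 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
            return true;
    }
    return false; /* object is being torn down; caller must skip it */
}

/* Drop a reference previously obtained. */
void demo_put(struct demo_mm *mm)
{
    int prev = atomic_fetch_sub(&mm->users, 1);

    assert(prev > 0);
}
```

A caller that gets false back simply skips the object instead of waiting, which matches the message's point that there is nothing left to wait for once teardown has begun.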
    • IB/uverbs: Fix OOPs upon device disassociation · bc755c6a
      Yishai Hadas authored
      commit 425784aa upstream.
      
      The async_file might be freed before the disassociation has ended,
      causing QP shutdown to hit a use-after-free on it.
      
      Since uverbs_destroy_ufile_hw() is not a fence, it returns if a
      disassociation is ongoing in another thread. It has to be written this way
      to avoid deadlock. However, this means that the ufile FD close cannot
      destroy anything that may still be used by an active kref, such as the
      async_file.
      
      To fix that, move the kref_put() into ib_uverbs_release_file().
      
       BUG: unable to handle kernel paging request at ffffffffba682787
       PGD bc80e067 P4D bc80e067 PUD bc80f063 PMD 1313df163 PTE 80000000bc682061
       Oops: 0003 [#1] SMP PTI
       CPU: 1 PID: 32410 Comm: bash Tainted: G           OE 4.20.0-rc6+ #3
       Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
       RIP: 0010:__pv_queued_spin_lock_slowpath+0x1b3/0x2a0
       Code: 98 83 e2 60 49 89 df 48 8b 04 c5 80 18 72 ba 48 8d
      		ba 80 32 02 00 ba 00 80 00 00 4c 8d 65 14 41 bd 01 00 00 00 48 01 c7 85
      		d2 <48> 89 2f 48 89 fb 74 14 8b 45 08 85 c0 75 42 84 d2 74 6b f3 90 83
       RSP: 0018:ffffc1bbc064fb58 EFLAGS: 00010006
       RAX: ffffffffba65f4e7 RBX: ffff9f209c656c00 RCX: 0000000000000001
       RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffffffffba682787
       RBP: ffff9f217bb23280 R08: 0000000000000001 R09: 0000000000000000
       R10: ffff9f209d2c7800 R11: ffffffffffffffe8 R12: ffff9f217bb23294
       R13: 0000000000000001 R14: 0000000000000000 R15: ffff9f209c656c00
       FS:  00007fac55aad740(0000) GS:ffff9f217bb00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffffffba682787 CR3: 000000012f8e0000 CR4: 00000000000006e0
       Call Trace:
        _raw_spin_lock_irq+0x27/0x30
        ib_uverbs_release_uevent+0x1e/0xa0 [ib_uverbs]
        uverbs_free_qp+0x7e/0x90 [ib_uverbs]
        destroy_hw_idr_uobject+0x1c/0x50 [ib_uverbs]
        uverbs_destroy_uobject+0x2e/0x180 [ib_uverbs]
        __uverbs_cleanup_ufile+0x73/0x90 [ib_uverbs]
        uverbs_destroy_ufile_hw+0x5d/0x120 [ib_uverbs]
        ib_uverbs_remove_one+0xea/0x240 [ib_uverbs]
        ib_unregister_device+0xfb/0x200 [ib_core]
        mlx5_ib_remove+0x51/0xe0 [mlx5_ib]
        mlx5_remove_device+0xc1/0xd0 [mlx5_core]
        mlx5_unregister_device+0x3d/0xb0 [mlx5_core]
        remove_one+0x2a/0x90 [mlx5_core]
        pci_device_remove+0x3b/0xc0
        device_release_driver_internal+0x16d/0x240
        unbind_store+0xb2/0x100
        kernfs_fop_write+0x102/0x180
        __vfs_write+0x36/0x1a0
        ? __alloc_fd+0xa9/0x170
        ? set_close_on_exec+0x49/0x70
        vfs_write+0xad/0x1a0
        ksys_write+0x52/0xc0
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fac551aac60
      
      Cc: <stable@vger.kernel.org> # 4.2
      Fixes: 036b1063 ("IB/uverbs: Enable device removal when there are active user space applications")
      Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 22 Jan, 2019 1 commit
  4. 13 Jan, 2019 1 commit
    • RDMA/iwcm: Don't copy past the end of dev_name() string · de3b4f54
      Steve Wise authored
      commit d53ec8af upstream.
      
      We now use dev_name(&ib_device->dev) instead of ib_device->name in iwpm
      messages.  The name field in struct device is a const char *, whereas
      ib_device->name is a char array of size IB_DEVICE_NAME_MAX that is
      pre-initialized to zeros.
      
      Since iw_cm_map() was using memcpy() to copy in the device name, and
      copying IWPM_DEVNAME_SIZE bytes, it ends up copying past the end of the
      source device name string and copying random bytes.  This results in iwpmd
      failing the REGISTER_PID request from iwcm.  Thus port mapping is broken.
      
      Validate the device and interface names, and use strncpy() to initialize
      the entire message field.
      
      Fixes: 896de009 ("RDMA/core: Use dev_name instead of ibdev->name")
      Cc: stable@vger.kernel.org
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
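The difference between the two copies in the commit above can be shown in plain C: memcpy() of a fixed size reads past a short source string, while strncpy() stops reading at the terminating NUL and zero-fills the rest of the destination. In this sketch demo_fill_devname() is a hypothetical helper, and DEMO_DEVNAME_SIZE is an assumed stand-in for IWPM_DEVNAME_SIZE whose value is chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed field size for this example only. */
#define DEMO_DEVNAME_SIZE 32

/*
 * Copy 'name' into a fixed-size message field.
 * Returns 0 on success, -1 if the name is missing, empty or too long.
 */
int demo_fill_devname(char dst[DEMO_DEVNAME_SIZE], const char *name)
{
    size_t len;

    if (name == NULL)
        return -1;
    len = strlen(name);
    if (len == 0 || len >= DEMO_DEVNAME_SIZE)
        return -1;

    /*
     * strncpy() reads the source only up to its NUL terminator and pads
     * the remainder of dst with zero bytes, so the whole field is
     * initialized without touching memory beyond the source string.
     */
    strncpy(dst, name, DEMO_DEVNAME_SIZE);
    return 0;
}
```

Validating the length first also guarantees that the field stays NUL-terminated, which strncpy() alone does not ensure when the source fills the whole buffer.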
  5. 12 Dec, 2018 1 commit
  6. 26 Nov, 2018 1 commit
  7. 21 Nov, 2018 1 commit
    • RDMA/core: Add GIDs while changing MAC addr only for registered ndev · d52ef88a
      Parav Pandit authored
      Currently, when the MAC address is changed, GID entries are removed and
      added to reflect the new MAC address and new default GID entries,
      regardless of the netdev reg_state.
      
      When a bonding device is used and the underlying PCI device is removed,
      several netdevice events are generated. Two events of interest are the
      CHANGEADDR and UNREGISTER events on the lower (slave) netdevice of the
      bond netdevice.
      
      Sometimes the CHANGEADDR event is generated when the netdev state is
      UNREGISTERING (after the UNREGISTER event is generated). In this scenario,
      GID entries for default GIDs are added and never deleted, because GID
      entries are deleted only when the netdev state is < UNREGISTERED.
      
      This leads to a non-zero reference count on the netdevice, which causes
      the PCI device unbind operation to get stuck.
      
      To avoid this, when changing the MAC address, add GID entries only if the
      netdev is in the REGISTERED state.
      
      Fixes: 03db3a2d ("IB/core: Add RoCE GID table management")
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Reviewed-by: Mark Bloch <markb@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
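The rule in the commit above can be modelled with a small state check. This is only a toy model under stated assumptions: the enum, struct demo_netdev and the demo_* functions are hypothetical, and each set of default GIDs is assumed to hold exactly one reference on the netdev.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical registration states, ordered as in the commit message. */
enum demo_reg_state {
    DEMO_UNINITIALIZED,
    DEMO_REGISTERED,
    DEMO_UNREGISTERING,
    DEMO_UNREGISTERED
};

struct demo_netdev {
    enum demo_reg_state reg_state;
    bool has_default_gids; /* default GID entries currently present */
    int  refcount;         /* references held on the netdev         */
};

/* Remove the default GIDs and drop the reference they held. */
void demo_del_default_gids(struct demo_netdev *nd)
{
    if (nd->has_default_gids) {
        assert(nd->refcount > 0);
        nd->has_default_gids = false;
        nd->refcount--;
    }
}

/* Add the default GIDs, which take a reference on the netdev. */
void demo_add_default_gids(struct demo_netdev *nd)
{
    if (!nd->has_default_gids) {
        nd->has_default_gids = true;
        nd->refcount++;
    }
}

/* MAC address change: always drop old GIDs, re-add only if REGISTERED. */
void demo_handle_changeaddr(struct demo_netdev *nd)
{
    demo_del_default_gids(nd);
    if (nd->reg_state == DEMO_REGISTERED)
        demo_add_default_gids(nd);
}
```

With the state check in place, an address change that arrives while the device is UNREGISTERING leaves no GID entry behind, so no stray reference keeps the netdev alive.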
  8. 17 Oct, 2018 4 commits
  9. 16 Oct, 2018 14 commits
  10. 11 Oct, 2018 2 commits
  11. 05 Oct, 2018 3 commits
  12. 04 Oct, 2018 1 commit
  13. 03 Oct, 2018 3 commits
  14. 01 Oct, 2018 3 commits
  15. 27 Sep, 2018 1 commit
    • RDMA/core: Acquire and release mmap_sem on page range · 3994586f
      Parav Pandit authored
      Currently mmap_sem is read locked while pinning the memory.  In a
      multi-threaded application, holding mmap_sem creates contention with
      other threads that might be registering memory, creating QPs or simply
      calling mmap(), as such operations also need to hold the mmap_sem write
      lock.
      
      None of these operations can make forward progress until the memory pin
      operation has completed.  It becomes worse if the memory is unpinned
      and/or the memory registration is large (in the GB range).
      
      Therefore, instead of holding mmap_sem for too long (for the whole region
      pinning), acquire and release the lock for every few pages.  For example,
      on x86 with a 4K page size, acquire and release mmap_sem for every 2 MB
      memory chunk.
      
      This allows competing threads that might wish to hold mmap_sem for a
      shorter duration to make progress.
      
      When memory registration latency is measured using [1] for memory sizes
      ranging from 4K to 48GB, <= 1% or 0.5% degradation is noticed. In many
      runs no difference is seen other than run-to-run variance.
      
      In other targeted tests of users with large memory, the desired
      improvements are seen due to reduced contention on mmap_sem.
      
      [1] https://github.com/paravmellanox/rtool
      
      $ rdma_resource_lat -c 1 -s 48G -a -u L -i 500 -A
      
      It registers pinned memory from 4K to 48GB size with 500 iterations for
      each memory size.
      
      $ rdma_resource_lat -c 1 -s 12G -a -u L -i 500 -t 4
      
      4 competing threads pin memory, each of 12GB size with 500 iterations.
      
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  16. 26 Sep, 2018 1 commit