1. 11 Feb, 2020 40 commits
    • Merge branch 'futex-5.5' into pf-5.5 · 5b966c77
      Oleksandr Natalenko authored
    • futex: Implement mechanism to wait on any of several futexes · b860d513
      Gabriel Krisman Bertazi authored
      This is a new futex operation, called FUTEX_WAIT_MULTIPLE, which allows
      a thread to wait on several futexes at the same time, and be awoken by
      any of them.  In a sense, it implements one of the features that was
      supported by polling on the old FUTEX_FD interface.
      My use case for this operation lies in Wine, where we want to implement
      a similar interface available in Windows, used mainly for event
      handling.  The wine folks have an implementation that uses eventfd, but
      it suffers from FD exhaustion (I was told they have applications that go
      to the order of multi-million FDs), and higher CPU utilization.
      In time, we are also proposing modifications to glibc and libpthread to
      make this feature available for Linux native multithreaded applications
      using libpthread, which can benefit from the behavior of waiting on any
      of a group of futexes.
      In particular, using futexes in our Wine use case reduced the CPU
      utilization by 4% for the game Beat Saber and by 1.5% for the game
      Shadow of Tomb Raider, both running over Proton (a wine based solution
      for Windows emulation), when compared to the eventfd interface. This
      implementation also doesn't rely on file descriptors, so it doesn't risk
      overflowing the resource.
      Technically, the existing FUTEX_WAIT implementation can be easily
      reworked by using do_futex_wait_multiple with a count of one, and I
      have a patch showing how it works.  I'm not proposing it, since
      futex is such tricky code that I'd be more comfortable having
      FUTEX_WAIT_MULTIPLE running upstream for a couple of development cycles,
      before considering modifying FUTEX_WAIT.
      From an implementation perspective, the futex list is passed as an array
      of (pointer,value,bitset) to the kernel, which will enqueue all of them
      and sleep if none was already triggered. It returns a hint of which
      futex caused the wake up event to userspace, but the hint doesn't
      guarantee that it is the only futex triggered.  Before calling the syscall
      again, userspace should traverse the list, trying to re-acquire any of
      the other futexes, to prevent an immediate -EWOULDBLOCK return code from
      the kernel.
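      As a sketch of the userspace side (the struct layout and helper names below are illustrative assumptions for this sketch, not the actual uapi), one entry of the (pointer, value, bitset) array and the recommended re-scan before retrying the syscall could look like:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout of one entry of the futex array handed to the
 * kernel: a (pointer, value, bitset) triple.  Struct and field names
 * are assumptions, not the real ABI. */
struct futex_wait_entry {
        uint32_t *uaddr;  /* futex word to wait on */
        uint32_t val;     /* value the caller expects at *uaddr */
        uint32_t bitset;  /* wake bitset for this futex */
};

/* Before re-entering the kernel, scan the list for any futex whose
 * current value no longer matches the expected one; returns the index
 * of the first such futex, or -1 if all still match (so sleeping again
 * will not immediately fail with -EWOULDBLOCK). */
static int futex_list_rescan(struct futex_wait_entry *list, size_t count)
{
        for (size_t i = 0; i < count; i++)
                if (__atomic_load_n(list[i].uaddr, __ATOMIC_ACQUIRE) != list[i].val)
                        return (int)i;
        return -1;
}
```

      Userspace would try to re-acquire the futex at the returned index before sleeping again, mirroring the retry protocol described above.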
      This was tested using three mechanisms:
      1) By reimplementing FUTEX_WAIT in terms of FUTEX_WAIT_MULTIPLE and
      running the unmodified tools/testing/selftests/futex and a full Linux
      distro on top of this kernel.
      2) By an example code that exercises the FUTEX_WAIT_MULTIPLE path on a
      multi-threaded, event-handling setup.
      3) By running the Wine fsync implementation and executing multi-threaded
      applications, in particular the modern games mentioned above, on top of
      this implementation.
      Signed-off-by: Zebediah Figura <[email protected]>
      Signed-off-by: Steven Noonan <[email protected]>
      Signed-off-by: Pierre-Loup A. Griffais <[email protected]>
      Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
    • futex: Split key setup from key queue locking and read · 5e1061c6
      Gabriel Krisman Bertazi authored
      Split the futex key setup from the queue locking and key reading.  This
      is useful to support the setup of multiple keys at the same time, like
      what is done in futex_requeue() and what will be done for the
      FUTEX_WAIT_MULTIPLE command.
      Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
    • Merge branch 'fixes-5.5' into pf-5.5 · 1af9dbe9
      Oleksandr Natalenko authored
    • pipe: use exclusive waits when reading or writing · 4bc2e12a
      Linus Torvalds authored
      This makes the pipe code use separate wait-queues and exclusive waiting
      for readers and writers, avoiding a nasty thundering herd problem when
      there are lots of readers waiting for data on a pipe (or, less commonly,
      lots of writers waiting for a pipe to have space).
      While this isn't a common occurrence in the traditional "use a pipe as a
      data transport" case, where you typically only have a single reader and
      a single writer process, there is one common special case: using a pipe
      as a source of "locking tokens" rather than for data communication.
      In particular, the GNU make jobserver code ends up using a pipe as a way
      to limit parallelism, where each job consumes a token by reading a byte
      from the jobserver pipe, and releases the token by writing a byte back
      to the pipe.
      This pattern is fairly traditional on Unix, and works very well, but
      will waste a lot of time waking up a lot of processes when only a single
      reader needs to be woken up when a writer releases a new token.
      A simplified test-case of just this pipe interaction is to create 64
      processes, and then pass a single token around between them (this
      test-case also intentionally passes another token that gets ignored to
      test the "wake up next" logic too, in case anybody wonders about it):
          #include <unistd.h>

          int main(int argc, char **argv)
          {
                  int fd[2], counters[2];

                  pipe(fd);
                  counters[0] = 0;
                  counters[1] = -1;
                  write(fd[1], counters, sizeof(counters));

                  /* 64 processes */
                  fork(); fork(); fork(); fork(); fork(); fork();

                  do {
                          int i;
                          read(fd[0], &i, sizeof(i));
                          if (i < 0)
                                  continue;
                          counters[0] = i+1;
                          write(fd[1], counters, (1+(i & 1)) * sizeof(int));
                  } while (counters[0] < 1000000);

                  return 0;
          }
      and in a perfect world, passing that token around should only cause one
      context switch per transfer, when the writer of a token causes a
      directed wakeup of just a single reader.
      But with the "writer wakes all readers" model we traditionally had, on
      my test box the above case causes more than an order of magnitude more
      scheduling: instead of the expected ~1M context switches, "perf stat" shows:
              231,852.37 msec task-clock                #   15.857 CPUs utilized
              11,250,961      context-switches          #    0.049 M/sec
                 616,304      cpu-migrations            #    0.003 M/sec
                   1,648      page-faults               #    0.007 K/sec
       1,097,903,998,514      cycles                    #    4.735 GHz
         120,781,778,352      instructions              #    0.11  insn per cycle
          27,997,056,043      branches                  #  120.754 M/sec
             283,581,233      branch-misses             #    1.01% of all branches
            14.621273891 seconds time elapsed
             0.018243000 seconds user
             3.611468000 seconds sys
      before this commit.
      After this commit, I get
                5,229.55 msec task-clock                #    3.072 CPUs utilized
               1,212,233      context-switches          #    0.232 M/sec
                 103,951      cpu-migrations            #    0.020 M/sec
                   1,328      page-faults               #    0.254 K/sec
          21,307,456,166      cycles                    #    4.074 GHz
          12,947,819,999      instructions              #    0.61  insn per cycle
           2,881,985,678      branches                  #  551.096 M/sec
              64,267,015      branch-misses             #    2.23% of all branches
             1.702148350 seconds time elapsed
             0.004868000 seconds user
             0.110786000 seconds sys
      instead. Much better.
      [ Note! This kernel improvement seems to be very good at triggering a
        race condition in the make jobserver (in GNU make 4.2.1) for me. It's
        a long known bug that was fixed back in June 2017 by GNU make commit
        b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
        avoid hangs.").
        But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
        so a number of distributions may still have the buggy version. Some
        have backported the fix to their 4.2.1 release, though, and even
        without the fix it's quite timing-dependent whether the bug actually
        is hit. ]
      Josh Triplett says:
       "I've been hammering on your pipe fix patch (switching to exclusive
        wait queues) for a month or so, on several different systems, and I've
        run into no issues with it. The patch *substantially* improves
        parallel build times on large (~100 CPU) systems, both with parallel
        make and with other things that use make's pipe-based jobserver.
        All current distributions (including stable and long-term stable
        distributions) have versions of GNU make that no longer have the
        jobserver bug"
      Tested-by: Josh Triplett <[email protected]>
      Signed-off-by: Linus Torvalds <[email protected]>
    • Merge branch 'fixes-5.5' into pf-5.5 · f252b759
      Oleksandr Natalenko authored
    • drm/i915/execlists: Always force a context reload when rewinding RING_TAIL · 3c7191ea
      Chris Wilson authored
      If we rewind the RING_TAIL on a context, due to a preemption event, we
      must force the context restore for the RING_TAIL update to be properly
      handled. Rather than note which preemption events may cause us to rewind
      the tail, compare the new request's tail with the previously submitted
      RING_TAIL, as it turns out that timeslicing was causing unexpected rewinds:
         <idle>-0       0d.s2 1280851190us : __execlists_submission_tasklet: 0000:00:02.0 rcs0: expired last=130:4698, prio=3, hint=3
         <idle>-0       0d.s2 1280851192us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence 66:119966, current 119964
         <idle>-0       0d.s2 1280851195us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence 130:4698, current 4695
         <idle>-0       0d.s2 1280851198us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence 130:4696, current 4695
      ^----  Note we unwind 2 requests from the same context
         <idle>-0       0d.s2 1280851208us : __i915_request_submit: 0000:00:02.0 rcs0: fence 130:4696, current 4695
         <idle>-0       0d.s2 1280851213us : __i915_request_submit: 0000:00:02.0 rcs0: fence 134:1508, current 1506
      ^---- But to apply the new timeslice, we have to replay the first request
            before the new client can start -- the unexpected RING_TAIL rewind
         <idle>-0       0d.s2 1280851219us : trace_ports: 0000:00:02.0 rcs0: submit { 130:4696*, 134:1508 }
       synmark2-5425    2..s. 1280851239us : process_csb: 0000:00:02.0 rcs0: cs-irq head=5, tail=0
       synmark2-5425    2..s. 1280851240us : process_csb: 0000:00:02.0 rcs0: csb[0]: status=0x00008002:0x00000000
      ^---- Preemption event for the ELSP update; note the lite-restore
       synmark2-5425    2..s. 1280851243us : trace_ports: 0000:00:02.0 rcs0: preempted { 130:4698, 66:119966 }
       synmark2-5425    2..s. 1280851246us : trace_ports: 0000:00:02.0 rcs0: promote { 130:4696*, 134:1508 }
       synmark2-5425    2.... 1280851462us : __i915_request_commit: 0000:00:02.0 rcs0: fence 130:4700, current 4695
       synmark2-5425    2.... 1280852111us : __i915_request_commit: 0000:00:02.0 rcs0: fence 130:4702, current 4695
       synmark2-5425    2.Ns1 1280852296us : process_csb: 0000:00:02.0 rcs0: cs-irq head=0, tail=2
       synmark2-5425    2.Ns1 1280852297us : process_csb: 0000:00:02.0 rcs0: csb[1]: status=0x00000814:0x00000000
       synmark2-5425    2.Ns1 1280852299us : trace_ports: 0000:00:02.0 rcs0: completed { 130:4696!, 134:1508 }
       synmark2-5425    2.Ns1 1280852301us : process_csb: 0000:00:02.0 rcs0: csb[2]: status=0x00000818:0x00000040
       synmark2-5425    2.Ns1 1280852302us : trace_ports: 0000:00:02.0 rcs0: completed { 134:1508, 0:0 }
       synmark2-5425    2.Ns1 1280852313us : process_csb: process_csb:2336 GEM_BUG_ON(!i915_request_completed(*execlists->active) && !reset_in_progress(execlists))
      Fixes: 8ee36e04 ("drm/i915/execlists: Minimalistic timeslicing")
      References: 82c69bf5 ("drm/i915/gt: Detect if we miss WaIdleLiteRestore")
      Signed-off-by: Chris Wilson <[email protected]>
      Cc: Mika Kuoppala <[email protected]>
      Reviewed-by: Mika Kuoppala <[email protected]>
      Cc: <[email protected]> # v5.4+
      Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    • drm: Remove PageReserved manipulation from drm_pci_alloc · ab44b7a3
      Chris Wilson authored
      drm_pci_alloc/drm_pci_free are very thin wrappers around the core dma
      facilities, and we have no special reason within the drm layer to behave
      differently. In particular, since
      commit de09d31d
      Author: Kirill A. Shutemov <[email protected]>
      Date:   Fri Jan 15 16:51:42 2016 -0800
          page-flags: define PG_reserved behavior on compound pages
          As far as I can see there's no users of PG_reserved on compound pages.
          Let's use PF_NO_COMPOUND here.
      it has been illegal to combine GFP_COMP with SetPageReserved, so let's
      stop doing both and leave the dma layer to its own devices.
      Reported-by: Taketo Kabe
      Bug: https://gitlab.freedesktop.org/drm/intel/issues/1027
      Fixes: de09d31d ("page-flags: define PG_reserved behavior on compound pages")
      Signed-off-by: Chris Wilson <[email protected]>
      Cc: <[email protected]> # v4.5+
      Reviewed-by: Alex Deucher <[email protected]>
      Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    • drm/i915: Wean off drm_pci_alloc/drm_pci_free · 7bac9de8
      Chris Wilson authored
      drm_pci_alloc and drm_pci_free are just very thin wrappers around
      dma_alloc_coherent, with a note that we should be removing them.
      Furthermore since
      commit de09d31d
      Author: Kirill A. Shutemov <[email protected]>
      Date:   Fri Jan 15 16:51:42 2016 -0800
          page-flags: define PG_reserved behavior on compound pages
          As far as I can see there's no users of PG_reserved on compound pages.
          Let's use PF_NO_COMPOUND here.
      drm_pci_alloc has been declared broken since it mixes GFP_COMP and
      SetPageReserved. Avoid this conflict by weaning ourselves off using the
      abstraction and using the dma functions directly.
      Reported-by: Taketo Kabe
      Closes: https://gitlab.freedesktop.org/drm/intel/issues/1027
      Fixes: de09d31d ("page-flags: define PG_reserved behavior on compound pages")
      Signed-off-by: Chris Wilson <[email protected]>
      Cc: <[email protected]> # v4.5+
      Reviewed-by: Daniel Vetter <[email protected]>
      Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
    • ALSA: hda - Fix DP-MST support for NVIDIA codecs · fc9c5535
      Nikhil Mahale authored
      If dyn_pcm_assign is set, different jack objects are being created
      for pcm and pins.
      If dyn_pcm_assign is set, generic_hdmi_build_jack() calls into
      add_hdmi_jack_kctl() to create and track separate jack object for
      pcm. Like sync_eld_via_acomp(), hdmi_present_sense_via_verbs() also
      needs to report status changes of the pcm jack.
      Rename pin_idx_to_jack() to pin_idx_to_pcm_jack(). Update
      hdmi_present_sense_via_verbs() to report plug state of pcm jack
      object. Unlike sync_eld_via_acomp(), for !acomp drivers the pcm
      jack's plug state must be consistent with plug state
      of pin's jack.
      Fixes: 5398e94f ("ALSA: hda - Add DP-MST support for NVIDIA codecs")
      Signed-off-by: Nikhil Mahale <[email protected]>
      Reviewed-by: Kai Vehmanen <[email protected]>
    • Revert "ALSA: hda - Fix DP-MST support for NVIDIA codecs" · 118c8001
      Steven Barrett authored
      Prepare for updated patch: https://patchwork.kernel.org/patch/11364379/
      This reverts commit f9a12e98cf2455f1f45cfb8e7108d0e25febc405.
    • Linux 5.5.3 · deff2fcb
      Greg Kroah-Hartman authored
    • compat: ARM64: always include asm-generic/compat.h · 94ab9535
      Arnd Bergmann authored
      commit 556d687a upstream.
      In order to use compat_* type definitions in device drivers
      outside of CONFIG_COMPAT, move the inclusion of asm-generic/compat.h
      ahead of the #ifdef.
      All other architectures already do this.
      Acked-by: Will Deacon <[email protected]>
      Reviewed-by: Ben Hutchings <[email protected]>
      Signed-off-by: Arnd Bergmann <[email protected]>
      Signed-off-by: Greg Kroah-Hartman <[email protected]>
    • powerpc/kuap: Fix set direction in allow/prevent_user_access() · 68e0a154
      Christophe Leroy authored
      [ Upstream commit 1d8f739b ]
      __builtin_constant_p() always returns 0 for pointers, so on RADIX
      we always end up opening both directions (by writing 0 in SPR29):
        0000000000000170 <._copy_to_user>:
         1b0:	4c 00 01 2c 	isync
         1b4:	39 20 00 00 	li      r9,0
         1b8:	7d 3d 03 a6 	mtspr   29,r9
         1bc:	4c 00 01 2c 	isync
         1c0:	48 00 00 01 	bl      1c0 <._copy_to_user+0x50>
        			1c0: R_PPC64_REL24	.__copy_tofrom_user
        0000000000000220 <._copy_from_user>:
         2ac:	4c 00 01 2c 	isync
         2b0:	39 20 00 00 	li      r9,0
         2b4:	7d 3d 03 a6 	mtspr   29,r9
         2b8:	4c 00 01 2c 	isync
         2bc:	7f c5 f3 78 	mr      r5,r30
         2c0:	7f 83 e3 78 	mr      r3,r28
         2c4:	48 00 00 01 	bl      2c4 <._copy_from_user+0xa4>
        			2c4: R_PPC64_REL24	.__copy_tofrom_user
      Use an explicit parameter for direction selection, so that GCC
      is able to see it is a constant:
        00000000000001b0 <._copy_to_user>:
         1f0:	4c 00 01 2c 	isync
         1f4:	3d 20 40 00 	lis     r9,16384
         1f8:	79 29 07 c6 	rldicr  r9,r9,32,31
         1fc:	7d 3d 03 a6 	mtspr   29,r9
         200:	4c 00 01 2c 	isync
         204:	48 00 00 01 	bl      204 <._copy_to_user+0x54>
        			204: R_PPC64_REL24	.__copy_tofrom_user
        0000000000000260 <._copy_from_user>:
         2ec:	4c 00 01 2c 	isync
         2f0:	39 20 ff ff 	li      r9,-1
         2f4:	79 29 00 04 	rldicr  r9,r9,0,0
         2f8:	7d 3d 03 a6 	mtspr   29,r9
         2fc:	4c 00 01 2c 	isync
         300:	7f c5 f3 78 	mr      r5,r30
         304:	7f 83 e3 78 	mr      r3,r28
         308:	48 00 00 01 	bl      308 <._copy_from_user+0xa8>
        			308: R_PPC64_REL24	.__copy_tofrom_user
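      The underlying gotcha is easy to reproduce in plain userspace C: __builtin_constant_p() evaluates to 0 for a pointer parameter even when every caller passes a well-known address, which is why the old inline direction check could never resolve at compile time. A minimal illustration (userspace, not the kernel code):

```c
/* noinline keeps the argument an opaque parameter, as it is for the
 * KUAP allow/prevent helpers described above. */
__attribute__((noinline))
static int is_known_constant(const void *p)
{
        /* GCC folds this to 0 for pointer parameters */
        return __builtin_constant_p(p);
}
```

      In contrast, __builtin_constant_p(42) in a caller folds to 1; passing the direction as an integer constant is what lets GCC emit the dedicated read/write masks visible in the second disassembly.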
      Signed-off-by: Christophe Leroy <[email protected]>
      [mpe: Spell out the directions, s/KUAP_R/KUAP_READ/ etc.]
      Signed-off-by: Michael Ellerman <[email protected]>
      Link: https://lore.kernel.org/r/f4e88ec49[email protected]c-s.fr
      Signed-off-by: Sasha Levin <[email protected]>
    • crypto: atmel-tdes - Map driver data flags to Mode Register · 80c090cc
      Tudor Ambarus authored
      [ Upstream commit 848572f8 ]
      Simplifies the configuration of the TDES IP.
      Signed-off-by: Tudor Ambarus <[email protected]>
      Signed-off-by: Herbert Xu <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • crypto: atmel-aes - Fix CTR counter overflow when multiple fragments · 51887844
      Tudor Ambarus authored
      [ Upstream commit 3907ccfa ]
      The CTR transfer works in fragments of data of at most 1 MByte because
      of the 16-bit CTR counter embedded in the IP. Fix the CTR counter
      overflow handling for messages larger than 1 MByte.
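      The 1 MByte bound follows from the counter width: AES-CTR advances one count per 16-byte block, and a 16-bit counter wraps after 2^16 blocks, i.e. 65536 * 16 bytes = 1 MiB. A small sketch of the fragment-size arithmetic (illustrative, not the driver's code):

```c
#include <stddef.h>
#include <stdint.h>

enum { AES_BLOCK_BYTES = 16 };

/* Given the current value of the IP's 16-bit CTR counter, return how
 * many bytes can still be processed before the counter would wrap and
 * a new fragment must be started. */
static size_t ctr_bytes_before_wrap(uint16_t ctr)
{
        uint32_t blocks_left = 0x10000u - ctr;
        return (size_t)blocks_left * AES_BLOCK_BYTES;
}
```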
      Reported-by: default avatarDan Carpenter <[email protected]>
      Fixes: 781a08d9 ("crypto: atmel-aes - Fix counter overflow in CTR mode")
      Signed-off-by: Tudor Ambarus <[email protected]>
      Signed-off-by: Herbert Xu <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • crypto: atmel-aes - Fix saving of IV for CTR mode · abd0966e
      Tudor Ambarus authored
      [ Upstream commit 371731ec ]
      The req->iv of the skcipher_request is expected to contain the
      last used IV. Update the req->iv for CTR mode.
      Fixes: bd3c7b5c ("crypto: atmel - add Atmel AES driver")
      Signed-off-by: Tudor Ambarus <[email protected]>
      Signed-off-by: Herbert Xu <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • crypto: atmel-{aes,tdes} - Do not save IV for ECB mode · 4c61ade8
      Tudor Ambarus authored
      [ Upstream commit c65d1237 ]
      ECB mode does not use an IV.
      Signed-off-by: Tudor Ambarus <[email protected]>
      Signed-off-by: Herbert Xu <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • IB/core: Fix build failure without hugepages · ab45cc58
      Arnd Bergmann authored
      [ Upstream commit 74f75cda ]
      HPAGE_SHIFT is only defined on architectures that support hugepages:
      drivers/infiniband/core/umem_odp.c: In function 'ib_umem_odp_get':
      drivers/infiniband/core/umem_odp.c:245:26: error: 'HPAGE_SHIFT' undeclared (first use in this function); did you mean 'PAGE_SHIFT'?
      Enclose this in an #ifdef.
      Fixes: 9ff1b646 ("IB/core: Fix ODP with IB_ACCESS_HUGETLB handling")
      Link: https://lore.kernel.org/r/[email protected]
      Signed-off-by: Arnd Bergmann <[email protected]>
      Reviewed-by: Jason Gunthorpe <[email protected]>
      Signed-off-by: Jason Gunthorpe <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • rxrpc: Fix service call disconnection · e5327291
      David Howells authored
      [ Upstream commit b39a934e ]
      The recent patch that substituted a flag on an rxrpc_call for the
      connection pointer being NULL as an indication that a call was disconnected
      puts the set_bit in the wrong place for service calls.  This is only a
      problem if a call is implicitly terminated by a new call coming in on the
      same connection channel instead of a terminating ACK packet.
      In such a case, rxrpc_input_implicit_end_call() calls
      __rxrpc_disconnect_call(), which is now (incorrectly) setting the
      disconnection bit, meaning that when rxrpc_release_call() is later called,
      it doesn't call rxrpc_disconnect_call() and so the call isn't removed from
      the peer's error distribution list and the list gets corrupted.
      KASAN finds the issue as an access after release on a call, but the
      position at which it occurs is confusing as it appears to be related to a
      different call (the call site is where the latter call is being removed
      from the error distribution list and either the next or pprev pointer
      points to a previously released call).
      Fix this by moving the setting of the flag from __rxrpc_disconnect_call()
      to rxrpc_disconnect_call() in the same place that the connection pointer
      was being cleared.
      Fixes: 5273a191 ("rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect")
      Signed-off-by: David Howells <[email protected]>
      Signed-off-by: David S. Miller <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • KVM: Play nice with read-only memslots when querying host page size · 5ec60785
      Sean Christopherson authored
      [ Upstream commit 42cde48b ]
      Avoid the "writable" check in __gfn_to_hva_many(), which will always fail
      on read-only memslots due to gfn_to_hva() assuming writes.  Functionally,
      this allows x86 to create large mappings for read-only memslots that
      are backed by HugeTLB mappings.
      Note, the changelog for commit 05da4558 ("KVM: MMU: large page
      support") states "If the largepage contains write-protected pages, a
      large pte is not used.", but "write-protected" refers to pages that are
      temporarily read-only, e.g. read-only memslots didn't even exist at the time.
      Fixes: 4d8b81ab ("KVM: introduce readonly memslot")
      Cc: [email protected]
      Signed-off-by: Sean Christopherson <[email protected]>
      [Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo]
      Signed-off-by: Paolo Bonzini <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • KVM: Use vcpu-specific gva->hva translation when querying host page size · a0df80d6
      Sean Christopherson authored
      [ Upstream commit f9b84e19 ]
      Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
      correct set of memslots is used when handling x86 page faults in SMM.
      Fixes: 54bf36aa ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
      Cc: [email protected]
      Signed-off-by: Sean Christopherson <[email protected]>
      Signed-off-by: Paolo Bonzini <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • KVM: nVMX: vmread should not set rflags to specify success in case of #PF · 20f5d7ed
      Miaohe Lin authored
      [ Upstream commit a4d956b9 ]
      In case writing to the vmread destination operand results in a #PF, vmread
      should not call nested_vmx_succeed() to set rflags to indicate success,
      similar to what is done in VMPTRST (see handle_vmptrst()).
      Reviewed-by: Liran Alon <[email protected]>
      Signed-off-by: Miaohe Lin <[email protected]>
      Cc: [email protected]
      Reviewed-by: Sean Christopherson <[email protected]>
      Signed-off-by: Paolo Bonzini <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • KVM: x86: Protect exit_reason from being used in Spectre-v1/L1TF attacks · 6f8f35ba
      Marios Pomonis authored
      [ Upstream commit c926f2f7 ]
      This fixes a Spectre-v1/L1TF vulnerability in vmx_handle_exit().
      While exit_reason is set by the hardware and therefore should not be
      attacker-influenced, an unknown exit_reason could potentially be used to
      perform such an attack.
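      The usual mitigation for this class of issue is to clamp the index before it is used to select a handler, in the spirit of the kernel's array_index_nospec(). Below is a simplified userspace illustration of the masking idea, not the hardened kernel macro:

```c
#include <stddef.h>

/* Mask-based bounds clamp: returns index when index < size, and 0
 * otherwise, so a mispredicted bounds check cannot steer a later
 * table lookup out of range.  Deliberately simplified relative to
 * the kernel's array_index_mask_nospec(). */
static size_t clamp_index(size_t index, size_t size)
{
        size_t mask = 0 - (size_t)(index < size); /* all-ones iff in range */
        return index & mask;
}
```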
      Fixes: 55d2375e ("KVM: nVMX: Move nested code to dedicated files")
      Signed-off-by: Marios Pomonis <[email protected]>
      Signed-off-by: Nick Finco <[email protected]>
      Suggested-by: Sean Christopherson <[email protected]>
      Reviewed-by: Andrew Honig <[email protected]>
      Cc: [email protected]
      Signed-off-by: Paolo Bonzini <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • io_uring: prevent potential eventfd recursion on poll · e458a195
      Jens Axboe authored
      [ Upstream commit f0b493e6 ]
      If we have nested or circular eventfd wakeups, then we can deadlock if
      we run them inline from our poll waitqueue wakeup handler. It's also
      possible to have very long chains of notifications, to the extent where
      we could risk blowing the stack.
      Check the eventfd recursion count before calling eventfd_signal(). If
      it's non-zero, then punt the signaling to async context. This is always
      safe, as it takes us out-of-line in terms of stack and locking context.
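      The shape of the guard can be sketched with a plain recursion counter (names here are illustrative; in the kernel the check consults the task's eventfd signal state, and "punting" means queueing the signal to async context):

```c
#include <stdbool.h>

/* Illustrative per-context recursion depth counter. */
static int eventfd_signal_depth;

/* True when it is safe to signal the eventfd inline; false tells the
 * caller to defer the wakeup instead of recursing into itself. */
static bool eventfd_signal_allowed(void)
{
        return eventfd_signal_depth == 0;
}

static void eventfd_signal_enter(void) { eventfd_signal_depth++; }
static void eventfd_signal_exit(void)  { eventfd_signal_depth--; }
```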
      Cc: [email protected] # 5.1+
      Signed-off-by: Jens Axboe <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • io_uring: enable option to only trigger eventfd for async completions · e70e2046
      Sasha Levin authored
      [ Upstream commit f2842ab5 ]
      If an application is using eventfd notifications with poll to know when
      new SQEs can be issued, it's expecting the following read/writes to
      complete inline. And with that, it knows that there are events available,
      and doesn't want spurious wakeups on the eventfd for those requests.
      This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like
      IORING_REGISTER_EVENTFD, except it only triggers notifications for events
      that happen from async completions (IRQ, or io-wq worker completions).
      Any completions inline from the submission itself will not trigger
      event notifications.
      Suggested-by: Mark Papadakis <[email protected]>
      Signed-off-by: Jens Axboe <[email protected]>
      Signed-off-by: Sasha Levin <[email protected]>
    • drm/dp_mst: Remove VCPI while disabling topology mgr · c0e6f4d4
      Wayne Lin authored
      [ Upstream commit 64e62bdf ]
      This patch addresses an issue observed when hotplugging DP daisy-chain
      monitors.
      src-mstb-mstb-sst -> src (unplug) mstb-mstb-sst -> src-mstb-mstb-sst
      (plug in again)
      Once a DP MST capable device is unplugged, the driver calls
      drm_dp_mst_topology_mgr_set_mst() to disable MST. In this function,
      it cleans data of topology manager while disabling mst_state. However,
      it doesn't clean up the proposed_vcpis of topology manager.
      If proposed_vcpi is not reset, then once MST daisy-chain monitors are
      plugged in again later, the code will fail at port validation while trying to
      allocate payloads.
      When the MST capable device is plugged in again and tries to allocate
      payloads by calling drm_dp_update_payload_part1(), this
      function will iterate over all proposed virtual channels to see if
      any proposed VCPI's num_slots is greater than 0. If any proposed
      VCPI's num_slots is greater than 0 and the port which the
      specific virtual channel directed to is not in the topology, code then
      fails at the port validation. Since there are stale VCPI allocations
      from the previous topology enablement in proposed_vcpi[], code will fail
      at port validation and return -EINVAL.
      Clean up the data of stale proposed_vcpi[] and reset mgr->proposed_vcpis
      to NULL while disabling mst in drm_dp_mst_topology_mgr_set_mst().
      Changes since v1:
      *Add on more details in commit message to describe the issue which the
      patch is trying to fix
      Signed-off-by: Wayne Lin <[email protected]>
      [added cc to stable]
      Signed-off-by: Lyude Paul <[email protected]>
      Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
      Cc: <[email protected]> # v3.17+
      Signed-off-by: default avatarSasha Levin <[email protected]>
    • Song Liu's avatar
      perf/cgroups: Install cgroup events to correct cpuctx · 77ee5b32
      Song Liu authored
      commit 07c59729 upstream.
cgroup events are always installed in the cpuctx. However, when the event is
not installed via IPI, list_update_cgroup_event() adds it to the cpuctx of the
current CPU, which triggers list corruption:
        [] list_add double add: new=ffff888ff7cf0db0, prev=ffff888ff7ce82f0, next=ffff888ff7cf0db0.
      To reproduce this, we can simply run:
        # perf stat -e cs -a &
        # perf stat -e cs -G anycgroup
      Fix this by installing it to cpuctx that contains event->ctx, and the
      proper cgrp_cpuctx_list.
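The fix can be illustrated with a toy userspace model; all names below are hypothetical, not the kernel's perf internals:

```c
/* Toy model of per-CPU contexts (not the kernel API): a cgroup event
 * must land in the cpuctx of the CPU owning its event->ctx, not the
 * cpuctx of whichever CPU happens to run the install path. */
#define NR_TOY_CPUS 4

struct toy_cpuctx {
    int cpu;
};

struct toy_cpuctx toy_per_cpu_ctx[NR_TOY_CPUS] = { {0}, {1}, {2}, {3} };

struct toy_event {
    int ctx_cpu; /* CPU of the context the event belongs to */
};

/* Buggy variant: indexes by the current CPU, which corrupts the list
 * when the install is not delivered via IPI to the target CPU. */
struct toy_cpuctx *install_buggy(struct toy_event *ev, int current_cpu)
{
    (void)ev;
    return &toy_per_cpu_ctx[current_cpu];
}

/* Fixed variant, following the commit: derive the cpuctx from the CPU
 * that owns event->ctx, independent of where the code runs. */
struct toy_cpuctx *install_fixed(struct toy_event *ev, int current_cpu)
{
    (void)current_cpu;
    return &toy_per_cpu_ctx[ev->ctx_cpu];
}
```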
      Fixes: db0503e4 ("perf/core: Optimize perf_install_in_event()")
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
    • Song Liu's avatar
      perf/core: Fix mlock accounting in perf_mmap() · 3a53ef49
      Song Liu authored
      commit 00346155 upstream.
      Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
      a perf ring buffer may lead to an integer underflow in locked memory
accounting. This may lead to undesired behaviors, such as failures in
      BPF map creation.
      Address this by adjusting the accounting logic to take into account the
      possibility that the amount of already locked memory may exceed the
      current limit.
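The underflow hazard and the clamp the commit describes can be sketched as follows; the function name is illustrative, not the kernel's actual accounting helper:

```c
#include <stdint.h>

/* Toy model of the accounting hazard (names are illustrative): pages
 * already charged against the mlock limit may exceed a limit that was
 * lowered in the meantime. Computing limit - locked in unsigned
 * arithmetic would then underflow to a huge value; clamp it instead. */
uint64_t mlock_headroom(uint64_t limit, uint64_t locked)
{
    if (locked >= limit)
        return 0; /* already over the (new, lower) limit: no headroom */
    return limit - locked;
}
```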
      Fixes: c4b75479 ("perf/core: Make the mlock accounting simple again")
Suggested-by: Alexander Shishkin <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: <[email protected]>
Acked-by: Alexander Shishkin <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
    • Konstantin Khlebnikov's avatar
      clocksource: Prevent double add_timer_on() for watchdog_timer · 5afe1951
      Konstantin Khlebnikov authored
      commit febac332 upstream.
      Kernel crashes inside QEMU/KVM are observed:
        kernel BUG at kernel/time/timer.c:1154!
        BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().
      At the same time another cpu got:
  general protection fault: 0000 [#1] SMP PTI of poison pointer 0xdead000000000200 in:
        __hlist_del at include/linux/list.h:681
        (inlined by) detach_timer at kernel/time/timer.c:818
        (inlined by) expire_timers at kernel/time/timer.c:1355
        (inlined by) __run_timers at kernel/time/timer.c:1686
        (inlined by) run_timer_softirq at kernel/time/timer.c:1699
      Unfortunately kernel logs are badly scrambled, stacktraces are lost.
Printing the timer->function before the BUG_ON() pointed to
clocksource_watchdog().
      The execution of clocksource_watchdog() can race with a sequence of
      clocksource_stop_watchdog() .. clocksource_start_watchdog():
       detach_timer(timer, true);
        timer->entry.pprev = NULL;
      					clocksource_watchdog_kthread() or
      					spin_lock_irqsave(&watchdog_lock, flags);
      					 watchdog_running = 0;
      					spin_unlock_irqrestore(&watchdog_lock, flags);
      					spin_lock_irqsave(&watchdog_lock, flags);
      					 add_timer_on(&watchdog_timer, ...);
      					 watchdog_running = 1;
      					spin_unlock_irqrestore(&watchdog_lock, flags);
        add_timer_on(&watchdog_timer, ...);
         BUG_ON(timer_pending(timer) || !timer->function);
          timer_pending() -> true
I.e. inside clocksource_watchdog() the watchdog_timer could already be armed.
      Check timer_pending() before calling add_timer_on(). This is sufficient as
      all operations are synchronized by watchdog_lock.
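The guard the commit adds can be modeled in a few lines of userspace C; the toy_* names stand in for the real kernel timer API:

```c
#include <stdbool.h>

/* Minimal userspace stand-in for the kernel timer; 'pending' models
 * timer_pending(), which inspects timer->entry.pprev. */
struct toy_timer {
    bool pending;
    int armed_count;
};

bool toy_timer_pending(const struct toy_timer *t)
{
    return t->pending;
}

/* The real add_timer_on() contains BUG_ON(timer_pending(timer)),
 * which is what fired in the reported crash. */
void toy_add_timer_on(struct toy_timer *t)
{
    t->pending = true;
    t->armed_count++;
}

/* The fix: under watchdog_lock (not modeled here) the re-arm path
 * skips add_timer_on() when a concurrent start already queued the
 * timer, so double-arming cannot happen. */
void watchdog_rearm(struct toy_timer *t)
{
    if (!toy_timer_pending(t))
        toy_add_timer_on(t);
}
```

Because every arm/disarm in the real code runs under watchdog_lock, this single pending check is enough to close the race.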
      Fixes: 75c5158f ("timekeeping: Update clocksource with stop_machine")
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
    • Thomas Gleixner's avatar
      x86/apic/msi: Plug non-maskable MSI affinity race · 38253ee1
      Thomas Gleixner authored
      commit 6f1a4891 upstream.
      Evan tracked down a subtle race between the update of the MSI message and
      the device raising an interrupt internally on PCI devices which do not
      support MSI masking. The update of the MSI message is non-atomic and
consists of either 2 or 3 sequential 32bit wide writes to the PCI config
space:
         - Write address low 32bits
         - Write address high 32bits (If supported by device)
         - Write data
      When an interrupt is migrated then both address and data might change, so
the kernel attempts to mask the MSI interrupt first. But MSI masking is
optional, so there exist devices which do not provide it. That means that
if the device raises an interrupt internally between the writes, an MSI
message built from half-updated state is sent.
      On x86 this can lead to spurious interrupts on the wrong interrupt
      vector when the affinity setting changes both address and data. As a
      consequence the device interrupt can be lost causing the device to
      become stuck or malfunctioning.
Evan tried to handle that by disabling MSI across an MSI message
      update. That's not feasible because disabling MSI has issues on its own:
       If MSI is disabled the PCI device is routing an interrupt to the legacy
       INTx mechanism. The INTx delivery can be disabled, but the disablement is
       not working on all devices.
       Some devices lose interrupts when both MSI and INTx delivery are disabled.
      Another way to solve this would be to enforce the allocation of the same
      vector on all CPUs in the system for this kind of screwed devices. That
      could be done, but it would bring back the vector space exhaustion problems
      which got solved a few years ago.
      Fortunately the high address (if supported by the device) is only relevant
      when X2APIC is enabled which implies interrupt remapping. In the interrupt
      remapping case the affinity setting is happening at the interrupt remapping
unit and the PCI MSI message is programmed only once when the PCI device is
initialized.
      That makes it possible to solve it with a two step update:
        1) Target the MSI msg to the new vector on the current target CPU
        2) Target the MSI msg to the new vector on the new target CPU
      In both cases writing the MSI message is only changing a single 32bit word
      which prevents the issue of inconsistency.
      After writing the final destination it is necessary to check whether the
      device issued an interrupt while the intermediate state #1 (new vector,
      current CPU) was in effect.
      This is possible because the affinity change is always happening on the
      current target CPU. The code runs with interrupts disabled, so the
      interrupt can be detected by checking the IRR of the local APIC. If the
      vector is pending in the IRR then the interrupt is retriggered on the new
      target CPU by sending an IPI for the associated vector on the target CPU.
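The two-step update can be sketched as a toy userspace model; the field layout and encodings below are illustrative, not the real x86 MSI format:

```c
#include <stdint.h>

/* Toy MSI message: the device latches each 32-bit word independently,
 * so a single word store is atomic but a multi-word update is not.
 * Field layout and encodings here are illustrative, not the real
 * x86 MSI format. */
struct toy_msi_msg {
    uint32_t addr_lo; /* selects the destination CPU in this model */
    uint32_t data;    /* selects the vector in this model */
};

void toy_msi_write(struct toy_msi_msg *m, uint32_t cpu, uint32_t vector)
{
    m->addr_lo = cpu;
    m->data = vector;
}

/* Two-step retarget per the commit: each step changes only a single
 * 32-bit word, so an interrupt raised mid-update always uses a valid
 * (current CPU, new vector) or (new CPU, new vector) combination. */
void toy_msi_retarget(struct toy_msi_msg *m, uint32_t new_cpu,
                      uint32_t new_vector)
{
    uint32_t cur_cpu = m->addr_lo;

    toy_msi_write(m, cur_cpu, new_vector); /* step 1: new vector, current CPU */
    toy_msi_write(m, new_cpu, new_vector); /* step 2: new vector, new CPU */
    /* The real code then checks the local APIC IRR for new_vector and
     * retriggers via IPI if the device fired during step 1. */
}
```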
This can cause spurious interrupts on both the local and the new target
CPU:
       1) If the new vector is not in use on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then interrupt entry code will
          ignore that spurious interrupt. The vector is marked so that the
   'No irq handler for vector' warning is suppressed once.
2) If the new vector is in use already on the local CPU then the IRR check
   might see a pending interrupt from the device which is using this
   vector. The IPI to the new target CPU will then invoke the handler of
   the device, which got the affinity change, even if that device did not
   issue an interrupt.
       3) If the new vector is in use already on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then the handler of the device which
          uses that vector on the local CPU will be invoked.
This can expose issues in device driver interrupt handlers which are not
prepared to handle a spurious interrupt correctly. This is not a regression,
it's just
      exposing something which was already broken as spurious interrupts can
      happen for a lot of reasons and all driver handlers need to be able to deal
      with them.
Reported-by: Evan Green <[email protected]>
Debugged-by: Evan Green <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Evan Green <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
    • aaptel's avatar
      cifs: fix mode bits from dir listing when mounted with modefromsid · 14272cb0
      aaptel authored
      commit e3e056c3 upstream.
      When mounting with -o modefromsid, the mode bits are stored in an
      ACE. Directory enumeration (e.g. ls -l /mnt) triggers an SMB Query Dir
      which does not include ACEs in its response. The mode bits in this
      case are silently set to a default value of 755 instead.
      This patch marks the dentry created during the directory enumeration
      as needing re-evaluation (i.e. additional Query Info with ACEs) so
      that the mode bits can be properly extracted.
      Quick repro:
      $ mount.cifs //win19.test/data /mnt -o ...,modefromsid
      $ touch /mnt/foo && chmod 751 /mnt/foo
      $ stat /mnt/foo
        # reports 751 (OK)
      $ sleep 2
  # dentries older than 1s (by default) get invalidated
      $ ls -l /mnt
        # since dentry invalid, ls does a Query Dir
        # and reports foo as 755 (WRONG)
Signed-off-by: Aurelien Aptel <[email protected]>
Signed-off-by: Steve French <[email protected]>
CC: Stable <[email protected]>
Reviewed-by: Pavel Shilovsky <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
    • Ronnie Sahlberg's avatar
      cifs: fail i/o on soft mounts if sessionsetup errors out · 7b520269
      Ronnie Sahlberg authored
      commit b0dd940e upstream.
      RHBZ: 1579050
      If we have a soft mount we should fail commands for session-setup
      failures (such as the password having changed/ account being deleted/ ...)
      and return an error back to the application.
Signed-off-by: Ronnie Sahlberg <[email protected]>
Signed-off-by: Steve French <[email protected]>
CC: Stable <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
    • Tariq Toukan's avatar
      net/mlx5e: TX, Error completion is for last WQE in batch · 36549c86
      Tariq Toukan authored
      [ Upstream commit b57e66ad ]
      For a cyclic work queue, when not requesting a completion per WQE,
      a single CQE might indicate the completion of several WQEs.
      However, in case some WQE in the batch causes an error, then an error
      completion is issued, breaking the batch, and pointing to the offending
      WQE in the wqe_counter field.
      Hence, WQE-specific error CQE handling (like printing, breaking, etc...)
      should be performed only for the last WQE in batch.
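The rule can be expressed as a small predicate; the toy_* names are hypothetical, not the mlx5 driver's API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the rule the commit states (names are
 * hypothetical, not the mlx5 API): one error CQE covers a batch of
 * WQEs, and its wqe_counter field names the offending, last WQE. */
struct toy_cqe {
    uint16_t wqe_counter;
    bool is_error;
};

/* WQE-specific error handling (dumping descriptors, breaking the
 * batch, ...) must run only for the WQE the error CQE points at. */
bool should_handle_error_cqe(const struct toy_cqe *cqe, uint16_t wqe_index)
{
    return cqe->is_error && cqe->wqe_counter == wqe_index;
}
```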
      Fixes: 130c7b46 ("net/mlx5e: TX, Dump WQs wqe descriptors on CQE with error events")
      Fixes: fd9b4be8 ("net/mlx5e: RX, Support multiple outstanding UMR posts")
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Aya Levin <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
    • Heiner Kallweit's avatar
      r8169: fix performance regression related to PCIe max read request size · 1cb84bea
      Heiner Kallweit authored
      [ Upstream commit 21b5f672 ]
      It turned out that on low performance systems the original change can
      cause lower tx performance. On a N3450-based mini-PC tx performance
      in iperf3 was reduced from 950Mbps to ~900Mbps. Therefore effectively
      revert the original change, just use pcie_set_readrq() now instead of
      changing the PCIe capability register directly.
      Fixes: 2df49d36 ("r8169: remove fiddling with the PCIe max read request size")
Signed-off-by: Heiner Kallweit <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
    • Tariq Toukan's avatar
      net/mlx5: Deprecate usage of generic TLS HW capability bit · e24b21a5
      Tariq Toukan authored
      [ Upstream commit 61c00cca ]
      Deprecate the generic TLS cap bit, use the new TX-specific
      TLS cap bit instead.
      Fixes: a12ff35e ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
Signed-off-by: Tariq Toukan <[email protected]>
Reviewed-by: Eran Ben Elisha <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
    • Maor Gottlieb's avatar
      net/mlx5: Fix deadlock in fs_core · 3afd6c54
      Maor Gottlieb authored
      [ Upstream commit c1948390 ]
      free_match_list could be called when the flow table is already
      locked. We need to pass this notation to tree_put_node.
It fixes the following lockdep warning:
      [ 1797.268537] ============================================
      [ 1797.276837] WARNING: possible recursive locking detected
      [ 1797.285101] 5.5.0-rc5+ #10 Not tainted
      [ 1797.291641] --------------------------------------------
      [ 1797.299917] handler10/9296 is trying to acquire lock:
      [ 1797.307885] ffff889ad399a0a0 (&node->lock){++++}, at:
      tree_put_node+0x1d5/0x210 [mlx5_core]
      [ 1797.319694]
      [ 1797.319694] but task is already holding lock:
      [ 1797.330904] ffff889ad399a0a0 (&node->lock){++++}, at:
      nested_down_write_ref_node.part.33+0x1a/0x60 [mlx5_core]
      [ 1797.344707]
      [ 1797.344707] other info that might help us debug this:
      [ 1797.356952]  Possible unsafe locking scenario:
      [ 1797.356952]
      [ 1797.368333]        CPU0
      [ 1797.373357]        ----
      [ 1797.378364]   lock(&node->lock);
      [ 1797.384222]   lock(&node->lock);
      [ 1797.390031]
      [ 1797.390031]  *** DEADLOCK ***
      [ 1797.390031]
      [ 1797.403003]  May be due to missing lock nesting notation
      [ 1797.403003]
      [ 1797.414691] 3 locks held by handler10/9296:
      [ 1797.421465]  #0: ffff889cf2c5a110 (&block->cb_lock){++++}, at:
      [ 1797.432810]  #1: ffff88a030081490 (&comp->sem){++++}, at:
      mlx5_devcom_get_peer_data+0x4c/0xb0 [mlx5_core]
      [ 1797.445829]  #2: ffff889ad399a0a0 (&node->lock){++++}, at:
      nested_down_write_ref_node.part.33+0x1a/0x60 [mlx5_core]
      [ 1797.459913]
      [ 1797.459913] stack backtrace:
      [ 1797.469436] CPU: 1 PID: 9296 Comm: handler10 Kdump: loaded Not
      tainted 5.5.0-rc5+ #10
      [ 1797.480643] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS
      2.4.3 01/17/2017
      [ 1797.491480] Call Trace:
      [ 1797.496701]  dump_stack+0x96/0xe0
      [ 1797.502864]  __lock_acquire.cold.63+0xf8/0x212
      [ 1797.510301]  ? lockdep_hardirqs_on+0x250/0x250
      [ 1797.517701]  ? mark_held_locks+0x55/0xa0
      [ 1797.524547]  ? quarantine_put+0xb7/0x160
      [ 1797.531422]  ? lockdep_hardirqs_on+0x17d/0x250
      [ 1797.538913]  lock_acquire+0xd6/0x1f0
      [ 1797.545529]  ? tree_put_node+0x1d5/0x210 [mlx5_core]
      [ 1797.553701]  down_write+0x94/0x140
      [ 1797.560206]  ? tree_put_node+0x1d5/0x210 [mlx5_core]
      [ 1797.568464]  ? down_write_killable_nested+0x170/0x170
      [ 1797.576925]  ? del_hw_flow_group+0xde/0x1f0 [mlx5_core]
      [ 1797.585629]  tree_put_node+0x1d5/0x210 [mlx5_core]
      [ 1797.593891]  ? free_match_list.part.25+0x147/0x170 [mlx5_core]
      [ 1797.603389]  free_match_list.part.25+0xe0/0x170 [mlx5_core]
      [ 1797.612654]  _mlx5_add_flow_rules+0x17e2/0x20b0 [mlx5_core]
      [ 1797.621838]  ? lock_acquire+0xd6/0x1f0
      [ 1797.629028]  ? esw_get_prio_table+0xb0/0x3e0 [mlx5_core]
      [ 1797.637981]  ? alloc_insert_flow_group+0x420/0x420 [mlx5_core]
      [ 1797.647459]  ? try_to_wake_up+0x4c7/0xc70
      [ 1797.654881]  ? lock_downgrade+0x350/0x350
      [ 1797.662271]  ? __mutex_unlock_slowpath+0xb1/0x3f0
      [ 1797.670396]  ? find_held_lock+0xac/0xd0
      [ 1797.677540]  ? mlx5_add_flow_rules+0xdc/0x360 [mlx5_core]
      [ 1797.686467]  mlx5_add_flow_rules+0xdc/0x360 [mlx5_core]
      [ 1797.695134]  ? _mlx5_add_flow_rules+0x20b0/0x20b0 [mlx5_core]
      [ 1797.704270]  ? irq_exit+0xa5/0x170
      [ 1797.710764]  ? retint_kernel+0x10/0x10
      [ 1797.717698]  ? mlx5_eswitch_set_rule_source_port.isra.9+0x122/0x230
      [ 1797.728708]  mlx5_eswitch_add_offloaded_rule+0x465/0x6d0 [mlx5_core]
      [ 1797.738713]  ? mlx5_eswitch_get_prio_range+0x30/0x30 [mlx5_core]
      [ 1797.748384]  ? mlx5_fc_stats_work+0x670/0x670 [mlx5_core]
      [ 1797.757400]  mlx5e_tc_offload_fdb_rules.isra.27+0x24/0x90 [mlx5_core]
      [ 1797.767665]  mlx5e_tc_add_fdb_flow+0xaf8/0xd40 [mlx5_core]
      [ 1797.776886]  ? mlx5e_encap_put+0xd0/0xd0 [mlx5_core]
      [ 1797.785562]  ? mlx5e_alloc_flow.isra.43+0x18c/0x1c0 [mlx5_core]
      [ 1797.795353]  __mlx5e_add_fdb_flow+0x2e2/0x440 [mlx5_core]
      [ 1797.804558]  ? mlx5e_tc_update_neigh_used_value+0x8c0/0x8c0
      [ 1797.815093]  ? wait_for_completion+0x260/0x260
      [ 1797.823272]  mlx5e_configure_flower+0xe94/0x1620 [mlx5_core]
      [ 1797.832792]  ? __mlx5e_add_fdb_flow+0x440/0x440 [mlx5_core]
      [ 1797.842096]  ? down_read+0x11a/0x2e0
      [ 1797.849090]  ? down_write+0x140/0x140
      [ 1797.856142]  ? mlx5e_rep_indr_setup_block_cb+0xc0/0xc0 [mlx5_core]
      [ 1797.866027]  tc_setup_cb_add+0x11a/0x250
      [ 1797.873339]  fl_hw_replace_filter+0x25e/0x320 [cls_flower]
      [ 1797.882385]  ? fl_hw_destroy_filter+0x1c0/0x1c0 [cls_flower]
      [ 1797.891607]  fl_change+0x1d54/0x1fb6 [cls_flower]
      [ 1797.899772]  ? __rhashtable_insert_fast.constprop.50+0x9f0/0x9f0
      [ 1797.910728]  ? lock_downgrade+0x350/0x350
      [ 1797.918187]  ? __radix_tree_lookup+0xa5/0x130
      [ 1797.926046]  ? fl_set_key+0x1590/0x1590 [cls_flower]
      [ 1797.934611]  ? __rhashtable_insert_fast.constprop.50+0x9f0/0x9f0
      [ 1797.945673]  tc_new_tfilter+0xcd1/0x1240
      [ 1797.953138]  ? tc_del_tfilter+0xb10/0xb10
      [ 1797.960688]  ? avc_has_perm_noaudit+0x92/0x320
      [ 1797.968721]  ? avc_has_perm_noaudit+0x1df/0x320
      [ 1797.976816]  ? avc_has_extended_perms+0x990/0x990
      [ 1797.985090]  ? mark_lock+0xaa/0x9e0
      [ 1797.991988]  ? match_held_lock+0x1b/0x240
      [ 1797.999457]  ? match_held_lock+0x1b/0x240
      [ 1798.006859]  ? find_held_lock+0xac/0xd0
      [ 1798.014045]  ? symbol_put_addr+0x40/0x40
      [ 1798.021317]  ? rcu_read_lock_sched_held+0xd0/0xd0
      [ 1798.029460]  ? tc_del_tfilter+0xb10/0xb10
      [ 1798.036810]  rtnetlink_rcv_msg+0x4d5/0x620
      [ 1798.044236]  ? rtnl_bridge_getlink+0x460/0x460
      [ 1798.052034]  ? lockdep_hardirqs_on+0x250/0x250
      [ 1798.059837]  ? match_held_lock+0x1b/0x240
      [ 1798.067146]  ? find_held_lock+0xac/0xd0
      [ 1798.074246]  netlink_rcv_skb+0xc6/0x1f0
      [ 1798.081339]  ? rtnl_bridge_getlink+0x460/0x460
      [ 1798.089104]  ? netlink_ack+0x440/0x440
      [ 1798.096061]  netlink_unicast+0x2d4/0x3b0
      [ 1798.103189]  ? netlink_attachskb+0x3f0/0x3f0
      [ 1798.110724]  ? _copy_from_iter_full+0xda/0x370
      [ 1798.118415]  netlink_sendmsg+0x3ba/0x6a0
      [ 1798.125478]  ? netlink_unicast+0x3b0/0x3b0
      [ 1798.132705]  ? netlink_unicast+0x3b0/0x3b0
      [ 1798.139880]  sock_sendmsg+0x94/0xa0
      [ 1798.146332]  ____sys_sendmsg+0x36c/0x3f0
      [ 1798.153251]  ? copy_msghdr_from_user+0x165/0x230
      [ 1798.160941]  ? kernel_sendmsg+0x30/0x30
      [ 1798.167738]  ___sys_sendmsg+0xeb/0x150
      [ 1798.174411]  ? sendmsg_copy_msghdr+0x30/0x30
      [ 1798.181649]  ? lock_downgrade+0x350/0x350
      [ 1798.188559]  ? rcu_read_lock_sched_held+0xd0/0xd0
      [ 1798.196239]  ? __fget+0x21d/0x320
      [ 1798.202335]  ? do_dup2+0x2a0/0x2a0
      [ 1798.208499]  ? lock_downgrade+0x350/0x350
      [ 1798.215366]  ? __fget_light+0xd6/0xf0
      [ 1798.221808]  ? syscall_trace_enter+0x369/0x5d0
      [ 1798.229112]  __sys_sendmsg+0xd3/0x160
      [ 1798.235511]  ? __sys_sendmsg_sock+0x60/0x60
      [ 1798.242478]  ? syscall_trace_enter+0x233/0x5d0
      [ 1798.249721]  ? syscall_slow_exit_work+0x280/0x280
      [ 1798.257211]  ? do_syscall_64+0x1e/0x2e0
      [ 1798.263680]  do_syscall_64+0x72/0x2e0
      [ 1798.269950]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      Fixes: bd71b08e ("net/mlx5: Support multiple updates of steering rules in parallel")
Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Alaa Hleihel <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>