• Douglas Anderson's avatar
    kgdb: Don't round up a CPU that failed rounding up before · 87b09592
    Douglas Anderson authored
    If we're using the default implementation of kgdb_roundup_cpus() that
    uses smp_call_function_single_async() we can end up hanging
    kgdb_roundup_cpus() if we try to round up a CPU that failed to round
    up before.
    Specifically smp_call_function_single_async() will try to wait on the
    csd lock for the CPU that we're trying to round up.  If the previous
    round up never finished then that lock could still be held and we'll
    just sit there hanging.
    There's not a lot of use trying to round up a CPU that failed to round
    up before.  Let's keep a flag that indicates whether the CPU started
    but didn't finish to round up before.  If we see that flag set then
    we'll skip the next round up.
    In general we have a few goals here:
    - We never want to end up calling smp_call_function_single_async()
      when the csd is still locked.  This is accomplished because
      flush_smp_call_function_queue() unlocks the csd _before_ invoking
      the callback.  That means that when kgdb_nmicallback() runs we know
      for sure the the csd is no longer locked.  Thus when we set
      "rounding_up = false" we know for sure that the csd is unlocked.
    - If there are no timeouts rounding up we should never skip a round
    NOTE #1: In general trying to continue running after failing to round
    up CPUs doesn't appear to be supported in the debugger.  When I
    simulate this I find that kdb reports "Catastrophic error detected"
    when I try to continue.  I can overrule and continue anyway, but it
    should be noted that we may be entering the land of dragons here.
    Possibly the "Catastrophic error detected" was added _because_ of the
    future failure to round up, but even so this is an area of the code
    that hasn't been strongly tested.
    NOTE #2: I did a bit of testing before and after this change.  I
    introduced a 10 second hang in the kernel while holding a spinlock
    that I could invoke on a certain CPU with 'taskset -c 3 cat /sys/...".
    Before this change if I did:
    - Invoke hang
    - Enter debugger
    - g (which warns about Catastrophic error, g again to go anyway)
    - g
    - Enter debugger
    ...I'd hang the rest of the 10 seconds without getting a debugger
    prompt.  After this change I end up in the debugger the 2nd time after
    only 1 second with the standard warning about 'Timed out waiting for
    secondary CPUs.'
    I'll also note that once the CPU finished waiting I could actually
    debug it (aka "btc" worked)
    I won't promise that everything works perfectly if the errant CPU
    comes back at just the wrong time (like as we're entering or exiting
    the debugger) but it certainly seems like an improvement.
    NOTE #3: setting 'kgdb_info[cpu].rounding_up = false' is in
    kgdb_nmicallback() instead of kgdb_call_nmi_hook() because some
    implementations override kgdb_call_nmi_hook().  It shouldn't hurt to
    have it in kgdb_nmicallback() in any case.
    NOTE #4: this logic is really only needed because there is no API call
    like "smp_try_call_function_single_async()" or "smp_csd_is_locked()".
    If such an API existed then we'd use it instead, but it seemed a bit
    much to add an API like this just for kgdb.
    Signed-off-by: default avatarDouglas Anderson <dianders@chromium.org>
    Acked-by: default avatarDaniel Thompson <daniel.thompson@linaro.org>
    Signed-off-by: default avatarDaniel Thompson <daniel.thompson@linaro.org>
debug_core.h 2.47 KB