Skip to content

Delay showing errors when MR Notes polling fails

Thomas Randolph requested to merge tor/feature/delay-polling-errors into master

What does this MR do?

For #292008 (closed)

This MR adds a small "delay" (really more of a debounce) to showing the polling errors that may occur on the Merge Request page. It also retries the first and second failures (for a total of three tries, including two retries).

It uses the new recurrence utility (linked below in the Where are we? table) to store a count of the number of times this particular request has failed.
When it fails twice (and only twice), the current error message that shows for every failure is displayed.

If the network request succeeds, the count is reset to 0.

Why?

The theory is that most of the instances of the polling errors we see are caused by things like:

  • the tab being idle and thus the JS thread being suspended in the middle of a request,
  • navigating away from the page in the middle of a request (and thus it is killed mid-flight, causing a network error), or
  • transient 500 errors from the API.

In all of these cases, it's most likely that the next request - which is scheduled for 6 seconds later by the polling logic - will succeed (or, in the case of navigating away, simply won't happen).
Because of this, we don't need to pester the user about it until we're sure it's not transient (e.g., it happens twice in a row).

Concerns

This actually introduces two retries to a network request failure that previously halted permanently on the first failure.

This is probably fine as it shows the failure message the same amount (just once), and the retries give the opportunity for the polling to

  • Succeed and not halt
  • Fail, display the error, but recover and continue polling (and also automatically clear the displayed error).

The expectation is that errors with notes polling are transient and these recovery safeguards will solve this error display for 80% (or more) of users.

However, if the error is not transient and the result of back end problems, this does introduce 3 retries which could compound a problem by unnecessarily hitting the API. At the moment, however, I don't think it's possible to determine what's transient or not without retrying, so this is probably the best solution in terms of user experience.

Testing

After checking out this branch, load a Merge Request.

Once the page has finished and it begins polling for notes, modify app/controllers/concerns/notes_actions.rb on line 39 with:

    abort("ಠ_ಠ")

(Look of Disapproval error message optional).

The context of that change should be something like:

    ::Gitlab::EtagCaching::Middleware.skip!(response) if notes.present?

    abort("ಠ_ಠ")

    render json: meta.merge(notes: notes)
  end

(screenshot of changes in my editor below)

Once you save this change, the next notes poll will fail (it will also take an EXTREMELY long time to complete, have patience!).

If you're quick, you can comment that line out again and save it before the next poll completes:

    # abort("ಠ_ಠ")

This will allow the next poll to succeed, demonstrating that the error message never appears if the notes failure is transient.

Allow it to fail twice in a row and you'll see the original error message displayed.

Error Clearing

This change also introduces a small QoL improvement for when errors are shown. The polling will retry a second time (for a total of three tries), and - if the third try is successful - it will clear the existing error message display.
To test this change, allow the polling to fail twice, but then comment out the abort() call, allowing the third try to succeed. The displayed error message should clear automatically.

Where are we?

MR Description
!60536 (merged) Add a small utility that adds a stateful memory bank for tracking repeated occurrences of [thing]. This is a small, generic utility that's not specific to error handling, it just remembers how many times you tell it something happened, and it will perform your requested behaviors for certain counts.
!60537 (merged) Make Flash messages programmatically dismissable
We're here 👉🏻 When the Notes app is polling for new notes and that request fails, wait to display the error until we're sure the failure isn't transient (it happens twice in a row)

Screenshots (strongly suggested)

I haven't included any screenshots, because this change isn't changing any visuals. It only changes when they appear, which is hard to visually show.

Please note the rest of the description, and the Testing section for descriptions of how these scenarios play out.

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team
Edited by Thomas Randolph

Merge request reports