
Improve fail-open strategy for Arkose to prevent single point of failure

We should review how we can remove Arkose as a single point of failure. In particular:

Problem

  • During the incident, Arkose's token verification endpoint returned false negatives: users solved the challenge but still could not proceed. Because Arkose challenges are rendered during phone and credit card verification, the phone verification request was never even sent to Telesign.
  • We can disable Arkose using application settings, but doing so also disables phone and credit card verification.

Current implementation

Existing docs: https://internal.gitlab.com/handbook/engineering/identity-verification/#arkose-integration

When we fail-open, we assume the user is "low risk" and let them proceed with Identity Verification requiring only email verification. This currently happens in two cases:

  1. Fail-open when Arkose is down according to their status endpoint
  2. Fail-open when Arkose token verification request fails for unknown reasons
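
The two cases above can be sketched as a single decision function. This is only an illustration, not the actual implementation: `RiskBand`, `VerificationError`, `resolve_risk_band`, and both callables are hypothetical names standing in for the status endpoint check and the token verification request.

```python
from enum import Enum


class RiskBand(Enum):
    # Hypothetical band names, used for illustration only.
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


class VerificationError(Exception):
    """Hypothetical: the token verification request failed for unknown reasons."""


def resolve_risk_band(status_is_operational, verify_token):
    """Return the risk band to apply, failing open to LOW.

    status_is_operational: callable returning bool (stand-in for the
        Arkose status endpoint check).
    verify_token: callable returning a RiskBand, or raising
        VerificationError when the request fails.
    """
    if not status_is_operational():
        # Case 1: Arkose reports itself as down.
        return RiskBand.LOW
    try:
        return verify_token()
    except VerificationError:
        # Case 2: the verification request failed for unknown reasons.
        return RiskBand.LOW
```

Neither branch covers the incident scenario: a verification endpoint that responds normally but returns false negatives never takes a fail-open path.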

Proposal

  1. Fail-open when there have been no successful Arkose token verifications, signups, or other signals in the last X minutes/hours
    • Consistent with the current fail-open implementation, assume all users that go through Identity Verification are low risk (or just disable Arkose, which results in the same behavior?)
    • After the X minutes/hours window, (auto) revert to the normal behavior
  2. Optional
    1. Disabling Arkose should not also disable phone verification and credit card verification
    2. Introduce a "default" arkose_risk_band setting (can be defaulted to 'Unavailable') to allow selecting the default behavior. This requires (1) so that when the "default" arkose_risk_band is set to, for example, 'Medium', phone verification is operational even without Arkose.
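
A minimal sketch of item 1, assuming the signal is the timestamp of the most recent successful token verification. `FailOpenGate`, `detection_window`, and `fail_open_duration` are hypothetical names. The proposal uses "X" for both durations; they are kept separate here so they can be tuned independently.

```python
import time


class FailOpenGate:
    """Fail open when no success has been seen for `detection_window` seconds,
    then auto-revert after `fail_open_duration` seconds."""

    def __init__(self, detection_window, fail_open_duration, clock=time.monotonic):
        self.detection_window = detection_window
        self.fail_open_duration = fail_open_duration
        self._clock = clock
        # Assumption: treat start-up as the baseline so a fresh process
        # does not fail open immediately.
        self._last_success = clock()
        self._fail_open_until = None

    def record_success(self):
        """Call whenever a token verification (or other signal) succeeds."""
        self._last_success = self._clock()

    def fail_open(self):
        """Return True while users should be treated as low risk."""
        now = self._clock()
        if self._fail_open_until is not None:
            if now < self._fail_open_until:
                return True
            # Window elapsed: revert to normal and restart detection from now.
            self._fail_open_until = None
            self._last_success = now
            return False
        if now - self._last_success >= self.detection_window:
            self._fail_open_until = now + self.fail_open_duration
            return True
        return False
```

While the gate is open no Arkose successes are recorded, so the revert is driven by elapsed time rather than by a new success. If Arkose is still failing after the revert, the gate reopens once another `detection_window` passes with no success. Low-traffic periods would also trip the gate, so the window would need to be sized with that in mind.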

Implementation plan

WIP

Edited by Eugie Limpin