Fix invalid cached values for `BasePolicy#admin?` [RUN AS-IF-FOSS]
What does this MR do?
Fixes: #332983 (closed)
This MR ensures that the GitLab Rails application is compatible with upstream changes in gitlab-org/declarative-policy>.
Doing so uncovered a very subtle bug involving a number of interacting components, which should be fixed, regardless of any other changes.
What went wrong?
When the new changes from https://gitlab.com/gitlab-org/declarative-policy/-/merge_requests/14 were integrated, failures were observed when evaluating the
admin
condition, and only this condition.
Diagnosing the causes of this took a long time, and the search went down many blind alleys, because under normal circumstances this only occurred once every hour (but every time in CI).
What was happening was:
- Upon sessionless login (for all api and GraphQL requests), in the warden callback chain
the method
User#update_tracked_fields!(request)
is called. - As part of this method, we call
Users::UpdateService
to update the user state. - (Note: This call is throttled, and is only executed once every hour).
- During
Users::UpdateService#execute
, the permission:update_name
is checked (see:Users::UpdateService#name_updatable?
). - This permission check evaluates the following rule in the
EE::UserPolicy
:rule { updating_name_disabled_for_users & ~admin }.prevent :update_name
. - The admin condition is defined in the base policy, where it calls
Gitlab::Auth::CurrentUserMode.new(@user).admin_mode?
. - When this is evaluated, the return value is cached under two keys (both in the
RequestStore
).- The
CurrentUserMode
caches it underadmin_mode_rs_key
- The policy framework caches it under a condition key against the admin condition (something like:
"/db/condition/Base/admin/user:5"
)
- The
- However, during the execution of this method,
CurrentUserMode#admin_mode?
never returns true for sessionless requests. The reason for this that during the warden callback chain the session bypass has not been set yet (this is tested atCurrentUserMode#session_with_admin_mode?
).
As a result, if we test the admin
condition for the authenticating user during the warden callback chain, then we will erroneously, and in two separate locations, cache the wrong state. When we try and read it again, it will incorrectly return false
, even if the value of bypass_session_admin_id
has changed.
This represents a logic error that could have been triggered at any time if the admin
condition was evaluated before the bypass admin ID was set.
Up until now, this never happened, because the rule to check this was defined as rule { updating_name_disabled_for_users & ~admin }.prevent :update_name
. An upstream change in declarative-policy> optimizes AND
and OR
nodes of rules (such as a & b
). These have always been executed in declaration order.
Now however, following a change to the way the optimizer works, the library is able to re-order these rules depending on the known cost of evaluation.
admin
is a very cheap condition - it is given a score of 0
, which is equivalent to a cached value, and guarantees very eager evaluation. It requires no I/O to compute. The declarative-policy> library now chooses to evaluate ~admin
before updating_name_disabled_for_users
(which is correct, since the latter requires DB access).
This was not detected before, since to trigger this same bug in current code, we would need to run admin checks in sessionless API calls on instances where the updating_name_disabled_for_users
setting is true. But only once per-hour! A regression test is added for that case (see: 8010d177)
The fix
There are a couple of fixes here: One involves cache invalidation (with domain knowledge) and the other requires discarding the cached values we don't want (which is more fragile, but works in this case).
We clear the erroneous state whenever we set or unset the bypass session admin ID. This means that whenever the computation might change, we allow it to be re-computed.
In the case of the declarative-policy> ability condition cache, we deliberately discard the cached value for the admin
condition during execution of the warden hooks.
Additionally, since this was masked by the throttling of the method call, this throttling is abstracted and made more obvious, and a spec initializer is added to clear any throttles during test execution. This increases redis calls during test runs, but it should mean that things like this will be easier to detect.
Potential future work
See: gitlab-org/declarative-policy!24 and gitlab-org/declarative-policy!25
Upstream work in the DP library adds a more robust mechanism for cache invalidation. We will still need
Ability.forgetting
to capture the cache-keys we wish to discard, but will have a much more robust way of
invalidating dirty state.
Does this MR meet the acceptance criteria?
Conformity
-
I have included a changelog entry, or it's not needed. (Does this MR need a changelog?) -
I have added/updated documentation, or it's not needed. (Is documentation required?) -
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides. -
This change is backwards compatible across updates, or this does not apply.
Availability and Testing
-
I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)