Document why "percentage of time" rollout is discouraged and why "percentage of actors" should be used instead. For percentage of time to be used correctly, we need to ensure that the feature works correctly between feature checks. If we call Feature.enabled? multiple times in the code, "% of time" has a good chance of introducing inconsistent behaviour. #331627 (closed)
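To illustrate the inconsistency (a simplified stand-in, not GitLab's actual gate implementation): a "percentage of time" gate draws fresh randomness on every check, so two checks guarding related code paths can disagree within a single request.

```ruby
# Simplified stand-in for a "percentage of time" gate: every call is an
# independent random draw, unrelated to any actor or request.
def randomly_enabled?(percentage)
  rand(100) < percentage
end

# With a 50% rollout these two checks are independent coin flips, so the
# two guarded branches can diverge within the same request.
first_check  = randomly_enabled?(50)
second_check = randomly_enabled?(50)
# first_check and second_check are not guaranteed to be equal.
```

Only the degenerate percentages (0 and 100) are guaranteed to be consistent across checks.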
When /chatops run feature set my_feature_flag N is used, it should return an error saying that percentage of time is discouraged, pointing to the doc page describing the problem.
However, we could allow users to force setting the feature flag if they are aware of the risks and are sure that the behaviour (such as A/B testing) is implemented correctly: /chatops run feature set my_feature_flag N --random (or Actor?). gitlab-com/chatops#107 (closed)
Longer Term
Ensure feature flags are always used with a context so that we can (1) ensure that Feature.enabled? is consistently used with the expected actor and (2) have ChatOps validate the command against the YAML definition. E.g. if the feature flag has a global scope, we can't use % of actors.
From @mbobin => Would it be possible to cache the FF in the request store to have it return the same value every time in a request/job?
From @grzesiek => We should document this / add a Rubocop cop / something.
From @fabiopitino => Maybe that's not difficult to add and cascade down to the objects. E.g. the feature flag could be defined and memoized in Chain::Command, etc.
From @ayufan => I think we should not do percentage of time rollout
I was thinking of adding this, but I never got to this point. The current YAMLs are prepared to have this defined. In my original evaluation it was named thing: #221042
I think actor rollouts are also prone to errors when there is more than one feature flag check, because the change could occur between the checks. This could only happen when the flag is flipped, but it's quite likely since we have a lot of builds happening at the same time. Here is a PoC for caching feature flag results into RequestStore: !58892 (diffs)
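A minimal sketch of the caching idea (the real PoC is in !58892; here a plain Hash stands in for a RequestStore-like store that is reset on every request): the first check evaluates the flag, and later checks in the same request return the memoized result even if the flag flips in between.

```ruby
# Hypothetical per-request memoization of feature flag checks.
# `store` stands in for a RequestStore-like hash reset on each request.
class RequestScopedFlags
  def initialize(store = {})
    @store = store
  end

  # Evaluates the block only on the first check of `flag_name`;
  # subsequent checks in the same request return the cached result.
  def enabled?(flag_name)
    return @store[flag_name] if @store.key?(flag_name)

    @store[flag_name] = yield
  end
end

flags = RequestScopedFlags.new
flags.enabled?(:my_feature_flag) { true }   # evaluated: true
flags.enabled?(:my_feature_flag) { false }  # cached: still true
```

Note the `key?` check rather than `||=`, so a memoized `false` is not re-evaluated.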
Interesting. I'm unsure if this would happen though, as we have ProcessCache.
If we would like to avoid that, we could add additional CacheStorage to Flipper, which could use RequestStore, in a similar way as we do Process and Rails.cache ones.
```ruby
# Thread-local L1 cache: use a short timeout since we don't have a
# way to expire this cache all at once
flipper_adapter = Flipper::Adapters::ActiveSupportCacheStore.new(
  redis_cache_adapter,
  l1_cache_backend,
  expires_in: 1.minute
)

def l1_cache_backend
  Gitlab::ProcessMemoryCache.cache_backend
end
```
@tkuah I wonder if we actually need to retain "percentage of time", this will always be error prone and inaccurate, see how it is implemented in Flipper.
But we can turn almost anything into an actor through FeatureGate:
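For example (a hedged sketch, not GitLab's exact FeatureGate module): Flipper only requires that the object respond to flipper_id, so a small wrapper can turn an arbitrary stable identifier into an actor.

```ruby
# Hypothetical wrapper that turns any stable identifier into a Flipper
# actor. Flipper gates on the value returned by #flipper_id.
class CustomActor
  attr_reader :flipper_id

  def initialize(id)
    @flipper_id = "CustomActor:#{id}"
  end
end

actor = CustomActor.new("pipeline-1234")
actor.flipper_id  # => "CustomActor:pipeline-1234"
# Feature.enabled?(:my_feature_flag, actor) would then gate
# deterministically on this id.
```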
Yeah, percentage of time is really for specialised cases (such as A/B testing).
Suggestion: We should encourage percentage of actors both in docs, and in chatops.
Bonus if we can stop using percentage of time.
On trying to make percentage of time consistently return the same value per request: I don't know if that's worth it if we very seldom use it. Also, it really changes the semantics from "percentage of time" to "percentage of requests". /cc @fabiopitino @mbobin
@grzesiek @tkuah I would be in favour of discouraging the use of "percentage of time" through the ChatOps command for the time being.
When /chatops run feature set my_feature_flag N is used, it should return an error saying that percentage of time is discouraged, pointing to a doc page describing the problem. For percentage of time to be used correctly, we need to ensure that the feature works correctly between feature checks. If we call Feature.enabled? multiple times in the code, "% of time" has a good chance of introducing inconsistencies in the behavior.
However, we could allow users to force setting the feature flag if they are aware of the risks and are sure that the behavior (such as A/B testing) is implemented correctly: /chatops run feature set my_feature_flag N --force.
In the meantime we can decide whether to use a Request or correlation_id instead of the global scope.
I think #221042 would still be needed since it could allow the ChatOps command to validate whether the right actor is being used, as well as enabling consistent usage in development (e.g. ensuring we always use Feature.enabled? with the expected actor). Depending on whether we decide to use Request/correlation_id as the default scope instead of nothing, we could ensure that is defined in the YAML definition too.
But we can turn almost anything into an actor through FeatureGate:
@grzesiek I think this would be very powerful. I'm not sure how we would enable that via /chatops though. Today we only allow group, project, user and runner as actors because they include FeatureGate module.
Are the /chatops run feature commands implemented in a way that the actor (e.g. --project=gitlab-org/gitlab) is expected to be an ActiveRecord model? If we start passing in an object of MyCustomActor, I think it may be possible to use "% of actors" but not to enable the flag for a single actor instance through ChatOps.
I'm happy with the current proposal. It's worth noting that Gitaly doesn't have the ability to toggle feature flags per actor in a lot of cases: gitaly#3386 (closed) I added this in a limited form in !53387 (merged), but it didn't work for a case @pks-t was working on the other day. It might be worth getting feedback from the Gitaly team here too.
I see above we have an idea about making more things actors, but I think that should probably be out of scope here as we also need to handle that in chatops (cc @qmnguyen0711 as you were talking about making worker classes actors in another issue).
The Percentage of Time gate is totally fine as long as it's used in the right situation. In fact, the same strategy exists in other FF tools too, such as gradualRolloutRandom in Unleash. Those docs emphasize that the strategy is for load testing, like the Gitaly case described above, and NOT for what's visible to users.
I've been feeling that the chatops command (/chatops run feature set my_feature_flag N) is a bit too ambiguous. We should always specify the gate type, whether it's --actor or --random. It's a bit weird that --random is the default even though it's the rarer case.
Percentage of time rollout is not a good idea if what you want is to make sure a feature is consistently on or off for users. In that case, percentage of actors rollout is a better method.
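To illustrate why percentage of actors is consistent (a sketch modelled on Flipper's approach; Flipper's exact formula may differ): the decision is a deterministic hash of the feature name and actor id into a bucket, so a given actor always gets the same answer no matter how many times the flag is checked.

```ruby
require 'zlib'

# Deterministic "percentage of actors" check: hash the feature name plus
# the actor's flipper_id into a 0..99 bucket and compare it against the
# rollout percentage. The same actor always lands in the same bucket.
def actor_enabled?(feature_name, flipper_id, percentage)
  Zlib.crc32("#{feature_name}#{flipper_id}") % 100 < percentage
end

# Repeated checks for the same actor always agree:
a = actor_enabled?("my_feature_flag", "User:42", 50)
b = actor_enabled?("my_feature_flag", "User:42", 50)
a == b  # => true
```

This is also why raising the percentage only adds actors to the enabled set for a given bucket ordering, rather than reshuffling who has the feature.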
I was surprised to hear it worked this way, i.e. being truly random per invocation of Feature.enabled?. My understanding was that when not specifying an actor, the actor would be the "current anonymous session", which should be unique across all feature toggle checks. I feel like there should always be an implied actor, since where there is a request, there is an actor.
My 2 cents on caching: we should also consider that a request may continue as one or more Sidekiq jobs. In that case, the feature flag may be evaluated again.
I've been feeling that the chatops command (/chatops run feature set my_feature_flag N) is a bit too ambiguous. We should always specify the gate type, whether it's --actor or --random. It's a bit weird that --random is the default even though it's the rarer case.
I agree. Maybe we can return an error when the command is ambiguous and require an explicit --random or --actor flag rather than relying on the default behaviour.
We should document this / add a Rubocop cop / something.
Could we have a cop to catch cases where the flag is scoped to the instance?
There are legitimate cases where the flag is only used once or a "percentage of time" rollout is not used. But the cop could call attention to the danger of an instance-scoped flag when paired with "% of time" rollout: the developer and reviewer could then examine whether a scope other than instance is possible/desired, and disable the RuboCop rule if it's a legitimate case.
This could be in addition to the current proposal, with the advantage that it gets the developer thinking about this potential issue earlier before the relevant code is merged to master.
This allows you to ensure that a given customer always gets the same flag variant, regardless of how the state of the customer may change. It provides consistent flag variations across scenarios, e.g. even if a user goes from a non-logged-in guest state to a logged-in known-user state.
I would propose we take a step back and look at the problem holistically from that perspective, rather than getting stuck within the limitations of our current framework+Flipper and what it provides (or doesn't).