Inconsistent feature flag rollouts without an actor
Background
When rolling out the feature flag ci_workflow_rules_variables
(#300997 (closed)) by 10%, we got an incident "2021-04-07: CI variables are not being passed to jobs in some cases" (gitlab-com/gl-infra/production#4155 (closed)).
We could not reproduce it on the staging
env, but we got some error messages in the production
env from users, (including from us)
related: gitlab-com/gl-infra/delivery#1677 (moved)
Problem
- We check the feature flag in 7 different places.
- The feature flag rollout was 10%, not scoped to any project or user. (No actor)
- Some
Feature.enabled?(...)
checks returntrue
, some returnfalse
based on a "chance".
Easy reproduction for the specific incident:
- The FF returns false when having job_variables.
- The FF returns true when calling recalculate_yaml_variables!.
Proposal (iterations)
Short Term
- Document why "percentage of time" rollout is discouraged and we should use "percentage of actors" instead. For percentage of time to be used correctly we need to ensure that the feature works correctly in between feature checks. If we use
Feature.enabled?
multiple times in the code, the "% of time" has good chance of introducing inconsistent behaviour.➡ #331627 (closed) - When
/chatops run feature set my_feature_flag N
is used it should return an error saying the percentage of time is discouraged, pointing to the doc page describing the problem. However, we could allow users to force setting the feature flag if they are aware of the risks and are sure that the behaviour (such as A/B testing) is implemented correctly:/chatops run feature set my_feature_flag N --random
(or Actor?).➡ gitlab-com/chatops#107 (closed)
Longer Term
- Implement feature flags should always be used with a context so that we could (1) ensure that
Feature.enabled?
is consistently used with the expected actor and (2) have ChatOps validating the command against the YAML definition. E.g. if the feature flag has a global scope we can't use % of actors. - Expand on @grzesiek idea of using a default actor such as Request or correlation_id as global scope so that we could use percentage of actors safely when we can't scope a feature flag to any official actors (namespace/user/project/runner).
➡ #331630
Other proposals raised during the incident investigation:
- From @mbobin =>
Would it be possible to cache the FF in the request store to have it return the same value every time in a request/job?
- From @grzesiek =>
We should document this / add a Rubocop cop / something.
- From @fabiopitino =>
Maybe that’s not difficult to add and cascade it down to the objects. E.g. the feature flag could be defined and memoized in Chain::Command , etc.
- From @ayufan =>
I think we should not do percentage of time rollout
Edited by Cheryl Li