Improve documentation on feature flag process/rollout
Right now, documentation on feature flags is spread out over three documents:
- https://docs.gitlab.com/ee/development/feature_flags.html
- https://docs.gitlab.com/ee/development/rolling_out_changes_using_feature_flags.html
- https://gitlab.com/gitlab-org/gitlab-ce/blob/master/PROCESS.md#feature-flags
I think it makes sense to have separate documentation on developing using feature flags, and the surrounding process, but that boundary is not currently clearly enforced, and the documents duplicate information in some places, and contradict each other in others.
This issue focuses on the process/rollout documentation, the questions it would ideally answer, and where those answers are currently inconsistent or lacking entirely. Improvements to the feature flag development documentation are discussed in https://gitlab.com/gitlab-org/gitlab-ce/issues/59706.
Should feature flags be on by default, or off?
https://docs.gitlab.com/ee/development/feature_flags.html implies feature flags should usually be off by default:
In the rare case that you need the feature flag to be on automatically, use
default_enabled: true
when checking...
https://docs.gitlab.com/ee/development/rolling_out_changes_using_feature_flags.html does too:
We don’t need to revert a release, and because feature flags are disabled by default we don’t need to revert and pick any Git commits.
It also explains why we prefer this, and recommends an incremental rollout:
However, in general we recommend rolling out changes incrementally, instead of enabling them for everybody right away. We also recommend you to not enable a feature before the code is being deployed. This allows you to separate rolling out a feature from a deploy, making it easier to measure the impact of both separately.
Between every step you’ll want to wait a little while and monitor the appropriate graphs
And it implies that until a feature flag is removed, the change is not actually available to all users:
Once a change is deemed stable, submit a new merge request to remove the feature flag. This ensures the change is available to all users and self-hosted instances.
https://gitlab.com/gitlab-org/gitlab-ce/blob/master/PROCESS.md#feature-flags tells a different story:
In most other cases, the feature flag can be enabled by default.
At the same time, it implies that until a feature flag is removed, the change is not actually available to all users:
In order to build the final package and present the feature for self-hosted customers, the feature flag should be removed.
This inconsistency should of course be cleared up.
It would also be useful to explicitly cover the pros and cons of default on vs off, to help people make an informed decision. @engwan listed some of them in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/25575#note_145442452.
I think the timing in the month should also be a factor in the decision. The documentation currently implies that features can be merged all the way up to the 19th, as long as they have a feature flag, even if it is on by default:
Merge requests that make changes hidden behind a feature flag, or remove an existing feature flag because a feature is deemed stable, may be merged (and picked into the stable branches) up to the 19th of the month.
For GitLab.com, that's not a problem, because we can always turn off the flag if problems arise, but there would be no time to disable the flag, fix the bug, or remove the offending code before the package is shipped to customers, which means we lose the value of the 2 week gap between the feature freeze and the release date to iron out regressions in new features before customers get their hands on them. That close to the release date, I think we should only pick features with flags that are off by default.
When a feature is deemed stable, should the flag be removed immediately?
The documents are consistent on this one. https://docs.gitlab.com/ee/development/rolling_out_changes_using_feature_flags.html#the-cost-of-feature-flags says:
Once a change is deemed stable, submit a new merge request to remove the feature flag. This ensures the change is available to all users and self-hosted instances.
And:
In most cases this will translate to a feature (with a feature flag) being shipped in RC1, followed by the feature flag being removed in RC2.
https://gitlab.com/gitlab-org/gitlab-ce/blob/master/PROCESS.md#feature-flags says:
In order to build the final package and present the feature for self-hosted customers, the feature flag should be removed.
This last one is a little confusing, because assuming the feature flag is on by default (as recommended by this same document), it would already be available to self-hosted customers, but either way the recommendation is to remove the feature flag entirely as soon as possible, ideally by picking it into the same release.
This can be problematic though. In cases where a feature flags is used to not just hide a new feature, but to switch between old and new implementations or behaviors, removing the feature flag can involve removing a lot of old code, which makes the MR harder to prepare and unfit for being picked. In cases like this, I think it's acceptable to just flip the default of the flag from off to on once it's stable (assuming it wasn't on by default before already), and to wait to actually remove the flag and the old code in the next release.
How do we document features behind feature flags?
None of the documents currently answer this question, but https://docs.gitlab.com/ee/development/feature_flags.html does say the following about the changelog entry:
Features that are developed and are intended to be merged behind a feature flag should not include a changelog entry. The entry should be added in the merge request removing the feature flags.
I don't think we should wait to document a feature until the feature flag is removed, though, since it could already be enabled on GitLab.com, or even to self-hosted customers if the flag was on by default, and I think we'd want to have documentation on any "live" features. Of course, as we're finding bugs and fixing them, we can also turn the feature off again, so the documentation would need to reflect the potentially intermittent availability of the feature somehow.
A related question is whether we should give instructions on how to enable or disable a feature flagged feature, if it's disabled by default and the customer would like to "beta test" it, or if it's enabled by default and proves to be buggy.
How do we communicate the state of a feature flag to stakeholders (product, support)?
The lack of a defined process here has already led to some problems; in one example, a PM was under the impression a feature flag was going to be enabled by default by the time of the release and announced the feature in the release blogpost, while in fact the developers involved had run into some critical bugs after enabling the flag on .com, and fixing them had taken more time than expected leading to the flag still being disabled by default once the release came around. The PM was aware of these bugs, but hadn't realized engineering was treating them as blocking, and that they couldn't be fixed in time for the release.
The Support team has also expressed that they would like to have more visibility into feature flags state changes:
Cynthia: the crux of the matter is when the feature flag is turned on for this feature. If nothing else, the support team would really appreciate a ping when it's turned on. On Slack, it's
@support-dotcom
or on GitLab,@dotcom-support
One option would be to always leave the main feature issue open until the feature is deemed stable and the flag is removed (or enabled by default if removing it in the same release is not practical), and to communicate the state changes and blocking bugs there, pinging the PM and Support team as appropriate. Another option would be to have a dedicated "Enable feature flag X by default" issue where these blockers are tracked, like https://gitlab.com/gitlab-org/gitlab-ee/issues/10684.
Who's responsibility is managing the flag?
The documents imply this is the developer, but this isn't as explicit as it can be. @victorwu has made an attempt at clarifying this in gitlab-com/www-gitlab-com!20654 (closed), but ideally this would be in one of the existing documents rather than a new one.
In what Slack channel should feature flag chatops commands be sent?
https://docs.gitlab.com/ee/development/rolling_out_changes_using_feature_flags.html talks about chatops, but doesn't recommend a channel. I've seen people use #production
as well as their own group channel (#g_<group>
). I don't have a preference, but I think we should make a recommendation sine this is a question everyone reading these instructions for the first time will have.
How does one get access to chatops?
https://docs.gitlab.com/ee/development/rolling_out_changes_using_feature_flags.html says the following, but doesn't actually say what one can do to fix this:
If you get an error “Whoops! This action is not allowed. This incident will be reported.” that means your Slack account is not allowed to change feature flags.
We should definitely clarify that, but I think it's also worth considering giving every engineer chatops access by default as part of their onboarding, just like we do with dev and staging access.
Other concerns
- https://docs.gitlab.com/ee/development/rolling_out_changes_using_feature_flags.html contains an interesting story on "The cost of feature flags", but I don't think this is relevant anymore at this point, at least not to people who are just trying to find out how to work with feature flags.
- https://gitlab.com/gitlab-org/gitlab-ce/blob/master/PROCESS.md#feature-flags is under "Feature freeze on the 7th for the release on the 22nd", which makes the recommendations appear to relate only to the feature freeze, rather than to feature flags in general.