Limiting the impact of new features via feature flags

We currently use feature flags (via Flipper) in a few places so that we can quickly toggle some behavior on or off.

As we move toward the goal of more frequent deploys to GitLab.com, it's important that anything introducing a significant regression can be quickly disabled in order to limit its impact.

To that end, we likely want to begin shipping more things behind feature flags. For now I want to keep the scope of this discussion fairly limited, so I'm starting by seeking input from @marin @meks @dhavens @rymai @DouweM @smcgivern @yorickpeterse.

Current considerations:

Backend
- How do we ensure that the changes to code required by the feature being added behave in a backwards-compatible way so that everything still works when we have to turn the feature flag off?
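
One way to picture the backwards-compatibility concern is a minimal sketch in which the old code path stays untouched and remains the default. `FeatureFlags` is a hypothetical in-memory stand-in for a flag store such as Flipper, and `RunnerPlatforms` with its platform list is invented purely for illustration:

```ruby
# Hypothetical in-memory stand-in for a flag store such as Flipper.
module FeatureFlags
  @enabled = {}

  def self.enable(name)
    @enabled[name] = true
  end

  def self.disable(name)
    @enabled[name] = false
  end

  # Unknown flags are treated as disabled, so the old behavior is the default.
  def self.enabled?(name)
    @enabled.fetch(name, false)
  end
end

# Hypothetical service: the legacy path is left intact, and the new
# behavior is only layered on top while the flag is enabled.
class RunnerPlatforms
  LEGACY = %w[linux macos].freeze # illustrative data, not from the discussion

  def self.supported
    return LEGACY unless FeatureFlags.enabled?(:windows_runners)

    LEGACY + %w[windows]
  end
end
```

Because the flag check wraps only the new behavior, disabling the flag returns callers to exactly the code that ran before the feature was merged.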
Database
- Is our current migration strategy capable of handling a case where something shipped behind a feature flag can perform some migrations, and then still have the old behavior work (for at least one release) if we have to disable the feature?
Frontend
- To be determined! I'm sure there are some.
Testing
- In discussion with @meks, it seems like we'd have to perform testing twice: once with the feature enabled, and once with it disabled, in order to ensure that the enabled feature doesn't negatively interact with other parts of the application in unexpected ways.
  - I'm already concerned about what this will do to our test run times. Doesn't this effectively double them each time we add a feature flag?
  - How do we handle cases where multiple features behind flags can interact with each other? It doesn't seem feasible to run tests with Feature A and Feature C enabled but Feature B disabled, then Feature A and Feature B enabled, then all enabled, etc.
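
To make the combinatorial concern concrete, here is a small sketch that counts the test configurations for three flags. The flag names are placeholders, and `reduced_matrix` is only one possible compromise (all off, all on, each flag alone), not something proposed in this discussion:

```ruby
FLAGS = %i[feature_a feature_b feature_c].freeze

# Every on/off combination: 2**n configurations for n flags.
def all_combinations(flags)
  [true, false].repeated_permutation(flags.size).map do |states|
    flags.zip(states).to_h
  end
end

# A cheaper, less thorough matrix: all flags off, all flags on,
# then each flag enabled on its own (at most n + 2 configurations).
def reduced_matrix(flags)
  all_off = flags.to_h { |flag| [flag, false] }
  all_on  = flags.to_h { |flag| [flag, true] }
  singles = flags.map { |flag| all_off.merge(flag => true) }

  [all_off, all_on, *singles].uniq
end

all_combinations(FLAGS).size # => 8
reduced_matrix(FLAGS).size   # => 5
```

The reduced matrix grows linearly rather than exponentially, but it would not catch an interaction that only appears when exactly two of the three flags are enabled.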
Process
- How do we ensure that feature flags get cleaned up in a timely manner? These flags should be short-lived, ideally ~~one minor release~~ one or two release candidates.
- How do we enable people to iterate on their work that's behind a feature flag, keeping in mind our current release cadence?
  - ~~We might allow more things to go into patch releases (and post-freeze RCs without an exception request) if it's a change to something that's behind a flag.~~ Features that are behind a feature flag can be merged at any point, though very large changes might still be rejected close to the 22nd. We may also stop merging changes on, for example, the 20th, to give some time for preparing the final release.
- Do we need to modify our current process so that shipping something behind a feature flag does not mean it's done? The flag's removal needs to be treated like a P1 issue in the following release.
  - Upon merging, create a follow-up issue assigned to the same milestone and to the MR author. They should then make sure the feature is stable, then remove the feature flag.

2018-09-04 Update

Testbed MR: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/4213

Proposed workflow (as of September 14th)

Billy creates a merge request adding support for Windows CI runners. Let's assume for a moment this only requires changes in CE, and not the runner code. Billy has made sure the feature is only available when the windows_runners feature is enabled (in both frontend and backend code). On the 10th the MR's tests pass, all review comments have been addressed, and the changes are good to go. Alice, a colleague of Billy, is satisfied and merges the changes into master. Because the changes are hidden behind a feature flag, Alice can add the appropriate "Pick into X" label, and an additional label called "uses feature flag" (name up for discussion).

José, this release's release manager, sees the MR has the appropriate labels and picks it straight into the stable branch, because no exception request is necessary. The changes end up going into RC2.

On the 12th, Billy concludes the feature is stable (after rolling it out incrementally) and submits an MR to remove the feature flag. Alice (or somebody else for that matter) reviews it again, and follows the same procedure as before: assign the "Pick into X" label, and the "uses feature flag" label. José again picks the changes into the stable branch, and as of RC3 the feature flag is gone, and the feature is enabled for everybody.

By the time the new .0 stable version is released on the 22nd, the feature is available for everybody, including on-premises.
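
The incremental rollout mentioned in the workflow above could look roughly like the following sketch. Flipper offers percentage-based gates of its own; this is not its implementation, just an illustration of the idea. The helper `rolled_out?` and the use of CRC32 for bucketing are assumptions made for this example:

```ruby
require "zlib"

# Hypothetical helper: deterministically maps an actor (for example a
# project or user ID) to a bucket from 0 to 99, and enables the feature
# for actors whose bucket falls below the rollout percentage.
def rolled_out?(flag_name, actor_id, percentage)
  bucket = Zlib.crc32("#{flag_name}:#{actor_id}") % 100
  bucket < percentage
end

# Raising the percentage only ever adds actors, so anyone who already
# has the feature keeps it as the rollout widens.
early  = (1..1000).select { |id| rolled_out?(:windows_runners, id, 10) }
wider  = (1..1000).select { |id| rolled_out?(:windows_runners, id, 50) }
stable = (early - wider).empty?
```

Including the flag name in the hashed key means different flags select different subsets of actors, so the same early adopters are not hit by every new feature at once.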

Edited Sep 14, 2018 by Yorick Peterse