<!-- Design Documents often contain forward-looking statements -->
<!-- vale gitlab.FutureTense = NO -->
<!-- This renders the design document header on the detail page, so don't remove it-->
{{<design-document-header>}}
## Summary
Feature flags are a powerful tool that provides SREs and developers with a way
to rapidly control the behavior of the application.
They are also commonly used in the industry to increase personalization and as go-to-market tools, for example [LaunchDarkly](https://launchdarkly.com/).
Currently, with a single production instance, setting a feature flag is an
operation performed with a single ChatOps command, and the change is
immediately rolled out to production with an API call.
With the introduction of multiple cells, having engineers set feature flags in
each cell individually will be unmanageable.
Reference the previous iteration of this ADR:
[Cells: Feature Flags](../feature_flags.md)
## Motivation
With the change of Cells milestones and their iterations, the premise that we
will manage feature flags only on the legacy cell and not on any of the cells
is no longer true.
Feature flags are commonly used to introduce features that
are considered risky, or features that are not considered production ready.
In addition, feature flags are regularly used to mitigate incidents.
Feature flags will need
to be available for use on Cells before the first cell has a customer.
### Goals
This document describes at a high-level how feature flags will be set and rolled
out to Cells.
Implementation details are left out of this document. In some cases, options for
implementing the design are mentioned.
- Describe the interface for engineers and SREs to set feature flags in Cells.
- Define rollout strategies for feature flags in Cells.
### Non-Goals
Other aspects of feature flags, though they might need to be revisited in preparation
for Cells, will not be discussed here.
- Visualizing current feature flag state.
- Feature flag lifecycle changes, such as limiting the number of feature flags or reducing long-lived feature flags.
| Tissue can trigger API calls to each cell itself. | Tissue can trigger an Instrumentor pipeline to set feature flags. |
| --- | --- |
| Tissue will need access to admin tokens for every cell. | Instrumentor has access to the toolbox pod for each cell, and the feature flag can be set using that. |
| An admin token can be added to Vault on cell creation, as part of the bootstrapping procedure. | It will involve booting a rails console, which will slow down the process. |
### Storing feature flag configuration
Feature flag configuration currently lives in the database of each GitLab instance.
We should treat feature flag configuration as infrastructure-as-code, so that changes
can be easily tracked and managed. With Cells, we will need feature flag configuration
to live in a file somewhere.
- Feature flag configuration can be stored in tenant models, or in a separate file
per ring or per cell.
- Changes to the tenant model are usually applied by Instrumentor.
- Keeping this information outside the tenant model could result in information
about the cell being stored in two locations. However, we could link the other
file to the tenant model.
- One factor to consider is what happens when a cell is moved.
  Information in the tenant model will move along with the cell, but feature
  flag configuration stored outside the cell will need to be moved separately.
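As an illustration, a per-ring configuration file could look like the following sketch. The file path, schema, and flag names here are all hypothetical, not a decided format:

```yaml
# rings/ring-1/feature_flags.yml (hypothetical path and schema)
feature_flags:
  - name: new_diffs_rendering
    enabled: true
  - name: ci_partitioning_rollout
    percentage_of_actors: 25
```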
### Caching current feature flag state
One of the most common operations performed with feature flags is to query the
gate values of a feature flag. Querying every cell to get this information
every time an engineer requests the current state of a feature flag will be slow
and resource intensive.
Tissue can cache persisted feature flags and their gate values per cell.
At regular intervals, Tissue can pull all feature flags and their gate values from each
cell. This can also be done after a deployment to a ring.
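The caching loop above could be sketched roughly as follows. The class, the `persisted_flags` client method, and the cache shape are illustrative assumptions, not Tissue's actual API:

```ruby
# Hypothetical sketch: Tissue maintaining a per-cell cache of persisted
# feature flags and their gate values, refreshed on a schedule or after
# a deployment to a ring.
class FeatureFlagCache
  def initialize(cells)
    @cells = cells # cell_id => client responding to #persisted_flags
    @cache = {}    # cell_id => { flag_name => gate_values }
  end

  # Pull all persisted flags and gate values from every cell.
  def refresh!
    @cells.each do |cell_id, client|
      @cache[cell_id] = client.persisted_flags
    end
  end

  # Answer "what is the state of this flag in each cell?" from the cache,
  # without querying any cell.
  def state(flag_name)
    @cache.transform_values { |flags| flags[flag_name] }
  end
end
```

An engineer's query then reads from the cache, so its latency is independent of the number of cells; only `refresh!` fans out to every cell.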
## Alternative Solutions
### Third party feature flag service
A feature flag service will need to be able to mimic our ring
architecture, ideally without needing the application to be aware of its
cell ID.
We will need to self-host a third-party service, or, if using a SaaS service,
it will need strong guarantees on availability, since we will be using the
service for incident mitigation.
Switching to a third-party service is potentially a large undertaking and not something
that can be achieved in the short or medium term. Adding another unknown, in the
form of a third-party feature flag service, to a Cells migration that already contains
many unknowns doesn't seem to be the best path forward. We could revisit this
conversation at a later stage.
We can also consider using GitLab's own Feature Flag product, which is based on Unleash.
However, this would require extensive development to make the feature set useful
for us, and might make the product too specific to Cells and GitLab.com.
### Pull model instead of push model
We can have cells poll the feature flag service (Tissue) at regular intervals to pull any
feature flag changes. This might require the application to know its cell ID, which can be
included in the request so that the feature flag service can identify the requester
and respond appropriately.
This can be more resource intensive for the feature flag service. Depending on the frequency
of polling, it can also add time and unpredictability to the propagation of feature flag
changes to cells. Increasing the frequency also increases the resource usage for
the feature flag service.
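A minimal sketch of the pull model follows. The `changes_for` interface, the change-set cursor, and the payload shape are assumptions made for illustration; they are not an existing Tissue endpoint:

```ruby
# Hypothetical sketch: a cell polling the feature flag service (Tissue)
# for changes addressed to it, identified by its cell ID.
class FeatureFlagPoller
  attr_reader :applied

  def initialize(cell_id:, service:, interval: 60)
    @cell_id = cell_id
    @service = service   # responds to #changes_for(cell_id, since:)
    @interval = interval # seconds between polls; drives propagation delay
    @applied = {}        # flag name => gate values applied locally
    @cursor = 0          # last change-set seen, so polls are incremental
  end

  # One polling iteration: fetch changes since the last cursor and apply
  # them locally. Returns the number of changes applied.
  def poll_once
    changes, @cursor = @service.changes_for(@cell_id, since: @cursor)
    changes.each { |change| @applied[change[:name]] = change[:gates] }
    changes.size
  end
end
```

The `interval` value is the trade-off named above: a smaller interval means faster propagation but more load on the feature flag service.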
### Replicated feature flag database tables
This alternative solution is based on two assumptions regarding the Cells architecture:
- [We'll use a replication mechanism at the database-level for all Cells to have access to the same data](https://gitlab.com/gitlab-org/gitlab/-/issues/477608) (e.g. all Cells should have the same `plans` and `plan_limits` data, etc.)
- All feature flag actors (user, project, group, organization) will be unique across the cluster, so it's not a problem for all Cells to share feature flags data.
Overview of the solution:
1. Mark `feature_gates` and `features` tables as `gitlab_main_clusterwide` instead of `gitlab_main_cell`, so that all cells have all the feature flags data at all times.
1. Introduce a new `Gitlab::Cell` actor.
1. In `Feature.enabled?`, check if the feature flag is enabled for the current cell as a fallback (that will allow the enablement of a feature flag for a whole cell).
1. In ChatOps, introduce a new `--cell <cell-id>` or `--cell <cell-name>` option.
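The cell fallback in step 3 could look roughly like the sketch below. `Gitlab::Cell` does not exist yet, and the in-memory gate store stands in for the real `features`/`feature_gates` tables, so every name and interface here is an assumption:

```ruby
# Hypothetical sketch of a Gitlab::Cell actor and a Feature.enabled?
# fallback that allows enabling a flag for a whole cell.
module Gitlab
  class Cell
    def self.current = new(ENV.fetch("CELL_ID", "legacy"))

    attr_reader :id

    def initialize(id)
      @id = id
    end

    # Flipper-style actor identifier, namespaced per cell.
    def flipper_id = "Cell:#{id}"
  end
end

module Feature
  @gates = Hash.new { |hash, key| hash[key] = [] } # flag name => actor ids

  def self.enable(name, actor) = @gates[name] << actor.flipper_id

  # Check the given actor first, then fall back to the current cell.
  def self.enabled?(name, actor = nil)
    actor_ids = @gates[name]
    return true if actor && actor_ids.include?(actor.flipper_id)

    actor_ids.include?(Gitlab::Cell.current.flipper_id)
  end
end
```

Because the gate data is replicated cluster-wide under this alternative, the fallback check reads the same shared tables on every cell; no per-cell push is needed.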
**Pros:**
- For the ChatOps `--production` flag, feature flags continue to be set on the "legacy cell" and the feature flags data is replicated to all cells at the database-level (no need for ChatOps to have "the ability to talk to some service to gather requisite information").
- No specific setup when a new cell is introduced in the cluster since the feature flags data is replicated to all cells at the database-level.
**Cons:**
- Supporting "percentage of time/actors" per cell would require a bit more changes, but wouldn't be impossible.
- No support for enabling feature flags per deployment ring, but this could be supported in ChatOps with `--ring <ring-id>` and a mapping of `ring => [list of cells]`, so that the `--ring <ring-id>` option would be transformed into several `--cell <cell-id>` calls.