Feature flags implementation
As discussed in the rollout issue Rollout Plan for Error Tracking backed by Click... (#1728 - closed), we need a way of feature-gating the error tracking ingestion API to better control testing and rollout.
While we have strict feature flag controls within the Rails application, there are no such controls on the opstrace side, and we should look into methods to protect our infrastructure against traffic spikes during the initial rollout phases.
We should have feature flags available to all our containers so we can gradually roll out new features, as GitLab generally does (see https://about.gitlab.com/handbook/product-development-flow/feature-flag-lifecycle/). The key difference here is that we have a distributed system by design, not primarily a single application, so we must handle the propagation of feature flags throughout the system.
We are now starting to introduce basic feature flags (FFs) as part of !1863 (merged) and !1867 (merged).
GOUI also seems to have a FF implementation (opstrace-ui#71), but it's not clear how much we will need to interact with it.
Goals:
- Short-term:
  - Feature flags can be enabled at Cluster (i.e. global) level per-instance and at a GitLabNamespace (i.e. GitLab group) level.
  - Feature flag definitions are predefined as part of the GOB distribution (i.e. baked into main scheduler image).
  - Easily defined in development, through some script or other means.
  - Allow us to deliver off-by-default features across multiple smaller MRs.
  - Allow us to revert experimental production changes without rolling back changes/revert MRs.
- Long-term:
  - Ability to set through GitLab ChatOps.
  - More advanced rollout features (e.g. percentage of users).
  - Allow us to follow the GitLab FF dev process.
GitLab Feature Flags
Read more about FF usage here.
GitLab uses the flipper library. Feature flag definitions are stored on disk, in https://gitlab.com/gitlab-org/gitlab/-/tree/master/config/feature_flags.
Advanced Features
At the moment, we probably only need Cluster and GitLabNamespace level FFs that are a simple boolean state, but most libraries provide a standard set of features:
- Percentage rollout, based on deterministic hash thresholds of particular IDs, or a random distribution.
- Target User IDs.
- Multiple custom rules.
It's also very useful to have FF state in metrics/traces/logs. At some point we'd want to include their state in telemetry so we can see usage in certain paths or see quickly if an error is related to a feature being enabled.
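As an illustration of the percentage rollout item above, here is a minimal Go sketch of a deterministic hash threshold. The function name `inRollout`, the choice of FNV-1a, and the 100-bucket scheme are assumptions made for this example; none of them is something the proposal or any particular library prescribes.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// inRollout reports whether id falls inside the first `percent` of 100
// hash buckets for the given flag. The flag name is mixed into the hash so
// that different flags select different subsets of IDs.
func inRollout(flag, id string, percent uint32) bool {
	if percent >= 100 {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(flag))
	h.Write([]byte{0}) // separator, so ("ab", "c") and ("a", "bc") hash differently
	h.Write([]byte(id))
	return h.Sum32()%100 < percent
}

func main() {
	// The same ID always gets the same answer for a given flag and percentage.
	for _, id := range []string{"group-1", "group-2", "group-3"} {
		fmt.Println(id, inRollout("my_feature_flag", id, 50))
	}
}
```

Because the bucket of an ID never changes, raising the percentage only ever adds IDs to the enabled set; nobody who already has the feature loses it mid-rollout.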
Technical Proposal
FF Definition
We should follow the same FF structure as GitLab uses, even if we don't use all the fields initially:
| Field | Required | Description |
|---|---|---|
| name | yes | Name of the feature flag. |
| type | yes | Type of feature flag. |
| default_enabled | yes | The default state of the feature flag. |
| introduced_by_url | no | The URL to the merge request that introduced the feature flag. |
| rollout_issue_url | no | The URL to the issue covering the feature flag rollout. |
| milestone | no | Milestone in which the feature flag was created. |
| group | no | The group that owns the feature flag. |
Initially, an empty group implies observability. Eventually the group may be split into other functional areas.
We may not need a type enum as used in GitLab for a while (development, ops or experiment). We could assume development here until we have a valid use case.
Feature flag YAML files can be embedded centrally in a shared Go lib that is included in each of our built components. This way, feature flag checks can be validated against the known list, emitting a warning if a flag is used that has not been predefined.
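The validation described above could look roughly like the following sketch. The `Definition` struct mirrors the fields in the table; the `definitions` map and the `lookup` function are hypothetical names for this example. In a real build the map would presumably be decoded from the embedded YAML files (for instance with Go's `embed` package), but it is hard-coded here so the example stays self-contained.

```go
package main

import (
	"fmt"
	"log"
)

// Definition mirrors the fields of a GitLab-style feature flag definition.
type Definition struct {
	Name            string
	Type            string
	DefaultEnabled  bool
	IntroducedByURL string
	RolloutIssueURL string
	Milestone       string
	Group           string
}

// definitions stands in for the list decoded from the embedded YAML files.
// The single entry is an example only.
var definitions = map[string]Definition{
	"my_feature_flag": {Name: "my_feature_flag", Type: "development"},
}

// lookup returns the definition for name and logs a warning when the flag
// has not been predefined.
func lookup(name string) (Definition, bool) {
	def, ok := definitions[name]
	if !ok {
		log.Printf("warning: feature flag %q is not defined", name)
	}
	return def, ok
}

func main() {
	_, known := lookup("my_feature_flag")
	_, unknown := lookup("unknown_flag") // logs a warning
	fmt.Println(known, unknown)          // prints: true false
}
```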
FF Toggles
At the moment we need toggles at the Cluster and GitLabNamespace (aka GitLab group) level.
We can use these CRs as they are not "owned" by another controller, so we can freely modify them via the API without them being overwritten later.
Cluster is currently created in Terraform. We could enable cluster-level flags here, or have a patch operation so they can be defined both in Terraform and on the fly.
GitLabNamespace is created by the gatekeeper when provisioning. Flags can only be enabled once these CRs have been created.
Initially we can have a map in each CR that represents the explicit toggling of FFs, like:
spec:
  features:
    my_feature_flag: {}
The presence of a feature flag key in this map indicates that the FF is enabled. The flag must already have a definition, against which its existence is validated.
The Golang map can be map[string]Flag or something like this, where the Flag type is initially empty. Eventually this type could contain flag-specific rules, like a user ID or rollout percentage strategy. This will allow the API to be forwards compatible.
One problem with this approach is propagating FFs to other resources, such as tenant-operator Groups and Tenants, which in turn create other resources. We can have the FFs propagated into these CRDs from GitLabNamespace. This leads to some repetition and duplication, but is simpler than replacing a shared ConfigMap.
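One way to picture this propagation is a helper that builds the feature map for a child resource from its parents. The sketch below assumes that the effective set for a Group or Tenant is the union of the Cluster-level and GitLabNamespace-level toggles; the proposal does not spell that rule out, and `mergeFeatures` is a hypothetical name.

```go
package main

import "fmt"

// Flag is a placeholder for per-flag options; it is empty for now.
type Flag struct{}

// mergeFeatures returns a new map holding every flag from the given sources.
// Later sources win when the same flag appears more than once. The inputs
// are never modified, so a child CR does not alias its parent's map.
func mergeFeatures(sources ...map[string]Flag) map[string]Flag {
	merged := make(map[string]Flag)
	for _, src := range sources {
		for name, flag := range src {
			merged[name] = flag
		}
	}
	return merged
}

func main() {
	cluster := map[string]Flag{"global_flag": {}}
	namespace := map[string]Flag{"my_feature_flag": {}}

	// Hypothetical: the feature set copied into a Group or Tenant spec.
	child := mergeFeatures(cluster, namespace)
	fmt.Println(len(child)) // prints: 2
}
```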
Propagation
As discussed above, propagating the FFs into other components is a key design challenge.
When we control third-party resources, we can at best decide how to configure them based on our FFs, if at all necessary. We should always attempt to do this before forking a component to add extra functionality.
For our own non-operator containers, e.g. error tracking, we should provide a simple way of deserializing the enabled FFs from the environment. This could be an environment variable format or config file format.
Remember, in the case of our own Go applications, we will include the FF package that embeds all FF definitions.
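On the producing side, an operator could render the enabled flags into environment variables for the containers it manages. The following is only a sketch of one possible format: the `FF_` prefix follows the idea floated in the Go API section, while the upper-casing of names, the JSON-encoded values and the name `toEnv` are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// Flag is a placeholder for per-flag options; it is empty for now.
type Flag struct{}

// envPrefix marks environment variables that carry feature flag state.
const envPrefix = "FF_"

// toEnv renders each enabled flag as "FF_<UPPER_NAME>=<json options>".
// Entries are sorted so that the generated pod spec is stable across
// reconciles. Flag names are assumed to be lowercase snake_case.
func toEnv(features map[string]Flag) ([]string, error) {
	env := make([]string, 0, len(features))
	for name, flag := range features {
		value, err := json.Marshal(flag)
		if err != nil {
			return nil, fmt.Errorf("encoding flag %q: %w", name, err)
		}
		env = append(env, envPrefix+strings.ToUpper(name)+"="+string(value))
	}
	sort.Strings(env)
	return env, nil
}

func main() {
	env, err := toEnv(map[string]Flag{"my_feature_flag": {}})
	if err != nil {
		panic(err)
	}
	fmt.Println(env) // prints: [FF_MY_FEATURE_FLAG={}]
}
```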
Go API
We might consider a Features type definition, with a useful API similar to that of flipper:
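type Flag struct{} // initially empty; may later hold flag-specific rules
type Features map[string]Flag

// Enabled reports whether the named flag has been toggled on.
func (f Features) Enabled(name string) bool {
	// TODO: check name is valid and in definitions before checking existence
	_, ok := f[name]
	return ok
}

// NewFromEnv loads toggled features from the environment.
func NewFromEnv() Features {
	// perhaps variable names prefixed with "FF_";
	// env values are serialized flag options
	panic("not implemented")
}

// NewFromFile loads toggled features from a YAML file.
func NewFromFile(path string) Features {
	// just yaml decode the file
	panic("not implemented")
}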
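For illustration, the environment loader could be filled in along these lines. This is a sketch under assumptions rather than a settled format: it expects variables of the form `FF_<UPPER_NAME>=<json options>`, treats an empty value as "enabled with no options", and leaves out validation against the flag definitions. `parseEnv` is a hypothetical helper that takes the `KEY=VALUE` entries as an argument (in practice, the result of `os.Environ()`), which keeps it easy to test.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Flag is a placeholder for per-flag options; it is empty for now.
type Flag struct{}

// Features maps the names of enabled flags to their options.
type Features map[string]Flag

// Enabled reports whether the named flag has been toggled on.
func (f Features) Enabled(name string) bool {
	_, ok := f[name]
	return ok
}

// parseEnv builds Features from "KEY=VALUE" entries, keeping only keys that
// start with "FF_". Values hold JSON-encoded flag options.
func parseEnv(environ []string) (Features, error) {
	features := Features{}
	for _, entry := range environ {
		parts := strings.SplitN(entry, "=", 2)
		if len(parts) != 2 || !strings.HasPrefix(parts[0], "FF_") {
			continue
		}
		var flag Flag
		if parts[1] != "" {
			if err := json.Unmarshal([]byte(parts[1]), &flag); err != nil {
				return nil, fmt.Errorf("decoding %s: %w", parts[0], err)
			}
		}
		name := strings.ToLower(strings.TrimPrefix(parts[0], "FF_"))
		features[name] = flag
	}
	return features, nil
}

func main() {
	// A real NewFromEnv would pass os.Environ() here.
	features, err := parseEnv([]string{"PATH=/bin", "FF_MY_FEATURE_FLAG={}"})
	if err != nil {
		panic(err)
	}
	fmt.Println(features.Enabled("my_feature_flag"), features.Enabled("other_flag")) // prints: true false
}
```

Returning an error for a malformed value, instead of silently skipping it, is a design choice made for this sketch; a component that should keep running with a bad flag value might prefer to log a warning and continue.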