Investigate handling Application Settings "sync" at the Terraform level

Based on a discussion that happened in Slack on 2024-11-08.

Sam:

Because settings may be something that is a rollout concern and could even be API/feature flag driven in some guise

Steve:

This is one of my problems with the current application settings: they are either UI- or API-based, not configuration as code. If we really want a rollout strategy, I think we'll need to manage these as configuration as code, which would then let us apply the ring-based rollout strategy you mentioned.

Configuration as code is achievable without too many changes to the application if the GitLab Terraform provider adds support for it and the settings are exposed via the API.

Sam:

Yeah - I think that we should aim to eventually roll these things into the "feature flags" conversation and larger plan which we can iterate towards. There are likely a lot of shared problems, particularly with things like visualisation and understanding state that may make sense to share across solutions.

Alessio:

I think we have to think very carefully about Terraform, especially how to structure the state file (or files). If we use a single one, we create a shared lock that will slow down the rollout. A boring solution could be to use the toolbox to set the desired value using a Rails console or a dedicated Rake task.

Or we can use a state file for each tenant to avoid locking.

Steve:

I think if we go the Terraform route, we might not need to replicate application settings at all, since they would be just another config we set in Instrumentor.

Rémy:

That's a good point and idea!

The only thing is that not all application settings are exposed/updatable through the API at the moment (but I think it's fine to add them). Also, in terms of authentication/authorization, I think it would require an admin token unless we add mTLS support.
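
As a rough illustration of the API path discussed here, the sketch below builds (but does not send) the request a deployment tool could issue per Cell. It assumes the existing `PUT /api/v4/application/settings` endpoint with an admin token in the `PRIVATE-TOKEN` header and a JSON body. The helper name, the Cell URL, and the token value are hypothetical, and `signup_enabled` is used only as an example of a setting that is already exposed.

```python
import json
import urllib.request


def build_settings_request(base_url, admin_token, settings):
    """Build (without sending) a PUT request that updates application settings.

    Hypothetical helper: a Terraform provider or Instrumentor would do the
    equivalent of this for each Cell.
    """
    url = base_url.rstrip("/") + "/api/v4/application/settings"
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            # Admin-level token; an assumption until mTLS or another mechanism exists.
            "PRIVATE-TOKEN": admin_token,
            "Content-Type": "application/json",
        },
    )


# Example usage with placeholder values (nothing is sent over the network).
request = build_settings_request(
    "https://cell-1.example.com/", "placeholder-admin-token", {"signup_enabled": False}
)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`. Settings that are not yet exposed through the API would simply be rejected or ignored until support is added.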

Thong:

This will require:

  1. No one is allowed to update application settings in the UI.
  2. No code can use an application setting until it has been "synced" by Terraform.

Pros

  • No need to introduce an external "synchronization" mechanism that would go through the Topology Service: everything would be handled centrally by the deployment tool (Instrumentor?)
    • That would also reduce the number of points of failure and remove this responsibility from the Topology Service.
  • Conceptually, it makes sense to use Infrastructure as Code and consider Application Settings important enough that we want to manage them through IaC and avoid Admin UI usage for that.
    • This would also remove the need for the decrypt/re-encrypt chain (decrypt with the leader Cell key and encrypt for transit, then decrypt from transit and encrypt with the follower Cell key), since updates would go through the API and decryption/encryption would be handled transparently by the application on each Cell.
  • The application doesn't need to know that some of its settings are enforced/synced between Cells (unless we need to prevent updates through the UI, see the drawbacks below).
  • SREs would no longer have to open change management issues like gitlab-com/gl-infra/production#1385 (closed).
  • It would allow ring propagation of admin settings: for example, a setting change would be rolled out progressively instead of being applied everywhere in one go.
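
The ring propagation mentioned in the last point could look roughly like the sketch below. It assumes Cells are grouped into ordered rings, and that the `apply` and `healthy` callbacks stand in for whatever the deployment tool would actually do (for example a per-Cell Terraform apply and a post-change check). All names here are hypothetical.

```python
from typing import Callable, List, Tuple


def rollout_by_ring(
    rings: List[List[str]],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Apply a setting change ring by ring, halting at the first unhealthy ring.

    Returns the Cells that received the change and whether the rollout halted.
    """
    applied: List[str] = []
    for ring in rings:
        for cell in ring:
            apply(cell)
            applied.append(cell)
        # Only move on to the next ring once every Cell in this ring is healthy.
        if not all(healthy(cell) for cell in ring):
            return applied, True
    return applied, False


# Example: the second ring contains an unhealthy Cell, so the third ring is never touched.
changed: List[str] = []
applied, halted = rollout_by_ring(
    [["cell-0"], ["cell-1", "cell-2"], ["cell-3"]],
    apply=changed.append,
    healthy=lambda cell: cell != "cell-2",
)
```

With one state file per tenant, each `apply` call would only take that tenant's lock, which is what keeps a ring from serializing behind a single shared state.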

Cons

  • We'd probably need to prevent updates through the UI, or at least add a disclaimer that some settings are handled/enforced through IaC. That also means we would still need to keep the list of "cluster-level" attributes that are enforced through Terraform (unless we follow the proposal just below).
    • We could also just follow the spirit of eventual consistency and regularly run checks to ensure settings aren't drifting from the expected state?
  • In terms of authentication/authorization, we'd need to use an Admin token per Cell (unless we use mTLS?). Note that an external synchronization solution would have the same requirement.
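
The eventual-consistency option above (regularly checking that settings have not drifted) can be sketched as a comparison between the expected values and what a Cell reports. This is a minimal sketch: it assumes both sides are available as plain dictionaries, for example the IaC-managed values and the parsed response of the application settings API. The function name and the sample settings are illustrative only.

```python
from typing import Any, Dict, Tuple

_MISSING = object()  # Marks a setting the Cell did not report at all.


def find_drift(
    desired: Dict[str, Any], actual: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (expected, found)} for every enforced setting that differs.

    Only keys present in `desired` are checked, so settings that are not
    managed through IaC can still be changed freely without being flagged.
    """
    drift: Dict[str, Tuple[Any, Any]] = {}
    for name, expected in desired.items():
        found = actual.get(name, _MISSING)
        if found is _MISSING:
            drift[name] = (expected, None)
        elif found != expected:
            drift[name] = (expected, found)
    return drift


# Example: one enforced setting was changed by hand, one unmanaged setting is ignored.
drift = find_drift(
    desired={"signup_enabled": False, "max_attachment_size": 100},
    actual={"signup_enabled": True, "max_attachment_size": 100, "home_page_url": ""},
)
```

A scheduled job could run this per Cell and either alert on a non-empty result or re-apply the expected values.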