Design: Approach for iterative tables classification between clusterwide and pod local databases

Discussion

Kamil: Everything will move to gitlab_main_pod or gitlab_main_cluster. The difference from what is being described is that gitlab_main won't exist in the end. We want to start converting tables one at a time to an explicit _pod or _cluster schema, and anything without _pod or _cluster is still "undefined".

If a query uses gitlab_main and gitlab_main_cluster, it is not a violation yet, because gitlab_main is still undefined. If your query uses gitlab_main_cluster and gitlab_main_pod, this is clearly not allowed. You can assign the users table to gitlab_main_cluster and the namespaces table to gitlab_main_pod. Either of them can then be joined to gitlab_main or to tables in its own schema, but joining gitlab_main_cluster to gitlab_main_pod is never allowed. You can target a single table at a time.
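The rule above can be sketched in a few lines of Ruby. This is only an illustration, not the actual tooling: the constant `TABLE_SCHEMAS` and the helpers `schema_for` and `cluster_pod_violation?` are hypothetical names, and any table without an explicit assignment is assumed to fall back to the still-undefined gitlab_main.

```ruby
# Hypothetical classification: only the two tables named in the discussion
# are assigned; everything else is still "undefined" (gitlab_main).
TABLE_SCHEMAS = {
  'users' => 'gitlab_main_cluster',
  'namespaces' => 'gitlab_main_pod'
}.freeze

# Look up the schema for a table, defaulting to gitlab_main.
def schema_for(table)
  TABLE_SCHEMAS.fetch(table, 'gitlab_main')
end

# A query is a violation only when it mixes cluster and pod tables.
# Mixing either of them with gitlab_main is still allowed.
def cluster_pod_violation?(tables)
  schemas = tables.map { |table| schema_for(table) }.uniq
  schemas.include?('gitlab_main_cluster') && schemas.include?('gitlab_main_pod')
end

puts cluster_pod_violation?(%w[users personal_access_tokens]) # cluster + undefined
puts cluster_pod_violation?(%w[namespaces projects])          # pod + undefined
puts cluster_pod_violation?(%w[users namespaces])             # cluster + pod
```

Only the last call mixes the two explicit schemas, so it is the only one reported as a violation.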

If we start with users and namespaces this should give a smaller list of violations to fix straight away. For example we don't need to tackle interaction between users and personal_access_tokens. If we record all tables being touched in the workflow of group creation then we could focus only on these tables and this should shrink the problem space. This should make it easier but we'll need to try it out to be sure.
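As a rough sketch of how the problem space could be narrowed, the snippet below keeps only the violations whose tables all belong to a recorded workflow. Everything here is hypothetical: the table list for group creation, the sample violations, and the helper `violations_in_scope` are made up for illustration and do not come from real recorded data.

```ruby
# Hypothetical list of tables touched while creating a group.
GROUP_CREATION_TABLES = %w[users namespaces routes members].freeze

# Hypothetical recorded violations, each listing the tables in one query.
RECORDED_VIOLATIONS = [
  %w[users namespaces],
  %w[users personal_access_tokens],
  %w[users members]
].freeze

# Keep only violations that involve nothing outside the focus tables.
def violations_in_scope(violations, focus_tables)
  violations.select { |tables| (tables - focus_tables).empty? }
end

in_scope = violations_in_scope(RECORDED_VIOLATIONS, GROUP_CREATION_TABLES)
in_scope.each { |tables| puts tables.join(', ') }
```

With this sample data, the violation involving personal_access_tokens drops out because that table is not part of the assumed workflow.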

SELECT * FROM namespaces INNER JOIN projects .... => gitlab_main_pod,gitlab_main => the only relevant information is that this is gitlab_main_pod, as this is the most specific context you are working in: tables_to_schemas(['namespaces','projects']) => ['gitlab_main_pod']. This works in most places. The exception is the cross-modification check, which builds up an array of schemas over time and needs to evict gitlab_main when it sees a more specific gitlab_main_... schema before detecting violations. That still needs to be added to !108462 (0d566e7e).
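A minimal sketch of the eviction idea, assuming a simple in-memory mapping. The `tables_to_schemas` defined here is a standalone stand-in that mimics the behaviour described above, not the method in the GitLab codebase. `SCHEMA_BY_TABLE`, `evict_undefined`, and `SchemaAccumulator` are hypothetical names, and treating "more than one schema left" as a violation is a simplifying assumption.

```ruby
UNDEFINED_SCHEMA = 'gitlab_main'

# Hypothetical classification; unlisted tables are still gitlab_main.
SCHEMA_BY_TABLE = {
  'users' => 'gitlab_main_cluster',
  'namespaces' => 'gitlab_main_pod'
}.freeze

# Drop gitlab_main once a more specific gitlab_main_* schema is present.
def evict_undefined(schemas)
  specific = schemas.any? { |s| s.start_with?("#{UNDEFINED_SCHEMA}_") }
  (specific ? schemas - [UNDEFINED_SCHEMA] : schemas).uniq
end

# Stand-in for the behaviour described in the discussion.
def tables_to_schemas(tables)
  evict_undefined(tables.map { |t| SCHEMA_BY_TABLE.fetch(t, UNDEFINED_SCHEMA) })
end

# Accumulates schemas across several statements, as the cross-modification
# check does, evicting gitlab_main before looking for violations.
class SchemaAccumulator
  attr_reader :schemas

  def initialize
    @schemas = []
  end

  def record(tables)
    @schemas = evict_undefined(@schemas + tables_to_schemas(tables))
  end

  # Assumption: more than one remaining schema counts as a violation.
  def violation?
    @schemas.size > 1
  end
end

p tables_to_schemas(%w[namespaces projects])
```

Running eviction on the accumulated array, and not only per statement, is what lets an earlier gitlab_main entry disappear once a later statement touches a _pod or _cluster table.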

Conclusion

Some conclusions from this discussion (from @DylanGriffith):

1. The choice of new schema name, gitlab_main_cluster or gitlab_cluster, is mostly just down to naming, considering how we're implementing it.
2. This approach does require adapting our schema violation tooling, but we'd likely need updates no matter what we choose. It seems like the approach described as tables_to_schemas(['namespaces','projects']) => ['gitlab_main_pod'] will simplify the implementation. Kamil will continue this and get it working in !108462 (closed).
3. We likely won't have much to fix in the application_settings decomposition, and it's mostly done in this MR anyway. Things will only start to get more complicated when we mark users: gitlab_main_cluster, namespaces: gitlab_main_pod, as this will reveal a bunch of violations. We plan to cascade this further only to tables that are written during the creation of a new namespace. Along with the trick in (2) above, this will hopefully narrow our focus to a smaller set of tables for now.
4. application_settings has already raised a few important issues to be resolved by other teams, and we'd like to start engaging these teams early to brainstorm ideas on how to proceed.
5. Development on the "user creates namespace" workflow should be concurrent with working with other teams to evaluate the user-facing impacts of the problems we discover along the way, as this helps derisk some major unknowns.
6. We don't want to fan out tasks to too many teams initially, until we have some clear documented patterns for how specific problems can be solved (a template of sorts).

Recording

From this recorded discussion https://www.youtube.com/watch?v=NLZvSZpb3fQ