_gitlab_session=cell2-5cf865011acc4f85f7d4842bfbb94ecf # `cell2` is the prefix for this cookie
We will use this prefix to be able to route a request via the HTTP router. We need to configure the existing cells to specify the sessionCookieTokenPrefix
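For illustration, a minimal sketch of the chart values involved, assuming the key path named above (the exact surrounding values structure may differ per environment):

```yaml
# Helm values sketch: the prefix baked into new session cookies.
# "cell2" must match what the HTTP Router rule looks for.
global:
  rails:
    sessionStore:
      sessionCookieTokenPrefix: "cell2"
```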
Action Items
Update Instrumentor to support SessionCookieTokenPrefix for GCP (merged)
@andrewn @ktchernov I would like your opinion on this (and from the rest of the Foundations team!): how should we go about handling this specific configuration field, which we'll need to enable for every cell that we create? Ideally, this should be done automatically.
Tenant Model: Would this fit in the tenant model? My concern with setting this in the tenant model is that we'll overload it with too many configuration fields, so maybe we just have cells: true in the tenant model and it configures this automatically?
Reference Architecture: Another option, which I'm not sure is viable, is to use the reference architecture: we'd specify a reference architecture for cells, which sets this field. But I'm not sure whether this overloads the reference architecture term, since that is mostly about how "big/small" the deployment of the GitLab instance should be rather than how to configure it.
Tenant Model: Would this fit in the tenant model? My concern with setting this in the tenant model is that we'll overload it with too many configuration fields, so maybe we just have cells: true in the tenant model and it configures this automatically?
This would be my preference. Wherever possible, we should favour convention-over-configuration, avoiding exposing unnecessary configuration through the tenant model when it doesn't add value.
In this case, having cell: true is useful information and will help clearly define a tenant as being a Cell.
For backwards compatibility, I think that the default should be cell: false, meaning that the existing Tenant Models are all cell: false without having to set this explicitly.
Reference Architecture: Another option, which I'm not sure is viable, is to use the reference architecture: we'd specify a reference architecture for cells, which sets this field. But I'm not sure whether this overloads the reference architecture term, since that is mostly about how "big/small" the deployment of the GitLab instance should be rather than how to configure it.
To me, this feels like information hiding when we actually want to make it clear on the model whether or not a tenant is a cell.
Additionally, in future, it's likely that there will be additional fields which only apply to cells, so having a cell field will allow JSON-Schema conditionals to be applied for the Cell-only fields.
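For illustration, a hypothetical JSON-Schema fragment (shown as YAML) gating a Cell-only field on `cell: true`; the field names beyond `cell` are made up:

```yaml
# If cell is true, require the Cell-only configuration field(s).
if:
  properties:
    cell:
      const: true
then:
  required:
    - sessionCookieTokenPrefix  # hypothetical Cell-only field
```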
This would be my preference. Wherever possible, we should favour convention-over-configuration, avoiding exposing unnecessary configuration through the tenant model when it doesn't add value.
In this case, having cell: true is useful information and will help clearly define a tenant as being a Cell.
For backwards compatibility, I think that the default should be cell: false, meaning that the existing Tenant Models are all cell: false without having to set this explicitly.
I like this approach; in that case, would it be beneficial to update the description of this issue to say as much?
We need to decide on a prefix for the session so that we can then configure the router and Topology Service to know which prefix to look at and which origin to route the request to, so the following configurations need to be in sync (a sketch follows the list):
HTTP Router: a rule set that matches on the same value as global.rails.sessionStore.sessionCookieTokenPrefix
Topology Service: to start off with, we'll use a key/value config mapping global.rails.sessionStore.sessionCookieTokenPrefix to the origin IP/DNS (the cell)
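As a sketch of what has to stay in sync, assuming hypothetical field names (the real config.toml schema may differ):

```toml
# Topology Service key/value sketch: prefix -> origin.
# session_prefix must equal global.rails.sessionStore.sessionCookieTokenPrefix
# for the same cell, and the HTTP Router rule must match on the same value.
[[cells]]
session_prefix = "cell2"
address        = "cell2.gitlab.com"  # origin IP/DNS of the cell
```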
I think we have the following options:
Create the prefix as cell-$TENANTID-: looking at the Cell we've currently provisioned, this would be _gitlab_session=cell-c01j2gdw0zfdafxr6-xxxxx
Pro: Consistent identifier of the Cell
Con: We'll only know the prefix at the provisioning stage
I don't think we will use $TENANTID as this would require adding this identifier to Topology Service.
@ayufan do you have a suggestion for an identifier other than $TENANTID? Or are you suggesting Option 2 for the prefix?
My preference would be consistency utilising the $TENANTID where possible, so it's one consistent ID used when referencing resources related to a specific cell.
I don't think we will use $TENANTID as this would require adding this identifier to Topology Service.
@ayufan whether or not the topology service knows about the $TENANT_ID, it will need to know about the prefix. Having the two be the same value seems to be the best option and would also help with diagnostics and debugging in future, since we don't need to do any additional lookup to obtain the TENANT_ID.
@andrewn @tkhandelwal3 This identifier will also filter into (personal access) tokens, so it has to be truly immutable for a long time. This alone is a long string to embed. It is less of a concern for _gitlab_session.
OK. So the value is generated and managed by us. Can we then use the next sequential number?
@ayufan what system is going to own the sequential number? From an infra side, we try to avoid sequential numbers because we face problems with naming conventions across multiple systems, and there isn't a sequential-number increment like in a Postgres database (since we don't provision infrastructure stored in a relational database), which is why we opt for a randomly generated string (this is also how Dedicated works).
Depending on sequential numbers in infra is also hard when there are "gaps" in the sequence. For example, if we have 1..100, some engineers might be tempted to loop over that index, but if we then remove cell 55 that loop would start failing. Using a random string forces us to treat cells as cattle, so we filter by labels instead (example).
Also, it depends on what owns the ID, either the topology service or the system provisioning cells, because I assume there needs to be some ordering in the beginning.
@ayufan what system is going to own the sequential number? From an infra side, we try to avoid sequential numbers because we face problems with naming conventions across multiple systems, and there isn't a sequential-number increment like in a Postgres database (since we don't provision infrastructure stored in a relational database), which is why we opt for a randomly generated string (this is also how Dedicated works).
This is unavoidable anyway due to the database ID ranges, which to some extent will be sequential.
We will need a process anyway to assign the next database ID range as defined by Topology Service. Allocating the next database ID range and the Cell ID could simply originate from first configuring Topology Service and indicating the Cell provisioning state. Cells added to config.toml are never removed, but rather marked as decommissioned.
So, everything would originate in the Topology Service deployer.
We will need a process anyway to assign the next database ID range as defined by Topology Service. Allocating the next database ID range and the Cell ID could simply originate from first configuring Topology Service and indicating the Cell provisioning state.
Just to make sure my understanding is correct: we are expecting Topology Service to give cells their identifiers?
Apparently not according to the 2nd document you linked.
The currently agreed upon naming convention for cells already makes use of ULID (this is what the $TENANT_ID we mentioned above is). Perhaps we could exploit that to compute the ranges? Since the document talks about doing that in the future anyway:
To allow us to switch to a variant of ULID ID allocation in future without interfering with the existing IDs.
The currently agreed upon naming convention for cells already makes use of ULID (this is what the $TENANT_ID we mentioned above is). Perhaps we could exploit that to compute the ranges? Since the document talks about doing that in the future anyway:
We need to re-evaluate this then.
Perhaps we could exploit that to compute the ranges?
Changing to ULID is not feasible now, since it would create a big discrepancy between the first cell and additional cells that we are not ready for.
Can you explain more on why there is a need to re-evaluate?
Changing to ULID is not feasible now, since it would create a big discrepancy between the first cell and additional cells that we are not ready for.
The first dev and prod cells have already been provisioned using a ULID for the $TENANTID, as per the naming convention for cells. For my own understanding, can you also explain where the discrepancy will come from, between the first cell and the additional cells?
Can you explain more on why there is a need to re-evaluate?
We have a desire to keep a minimal number of distinct identifiers. So far we were inclined to use the Cell ID as an integer due to database sequences, etc. The TENANTID introduces another model. This is why we need to re-evaluate the solution, as the integer Cell ID will still not go away. We need to define exactly how the process of provisioning a cell will happen: an order of operations.
We are conflating two different ULIDs here. One ULID is used by the TENANTID; the second, ULID-alike one is for having unique database sequences across the cluster. We are not going to implement ULID for unique database sequences, as the database on GitLab.com is not ready for this model.
The problematic part here is how the TENANTID is generated. We cannot use this information to guide database sequence generation. So, we need to figure out a way to move forward and figure out whether we can have a single ID or whether we are going to have different IDs.
I'm pretty sure that we cannot use the TENANTID when generating personal access tokens (and the like), as this identifier is too long compared to the existing generated identifiers. If I'm correct, the random part of a PAT is 16 characters.
So far we were inclined to use the Cell ID as an integer due to database sequences, etc. The TENANTID introduces another model. This is why we need to re-evaluate the solution, as the integer Cell ID will still not go away. We need to define exactly how the process of provisioning a cell will happen: an order of operations.
The current format of the TENANTID is what it is because it's used as the basis for naming all of the underlying infrastructure for a single tenant (cell) - GCP project, service accounts, etc. The underlying systems have various constraints on what constitutes an appropriate name, like it can't be more than 30 characters long, can't be shorter than 3 characters, and must start with a letter. We also need enough flexibility so that future use cases for naming (such as additional GCP projects for runners for a cell, etc.) can be accommodated.
I personally am not against rethinking the naming convention since it's still early enough that we can change it easily, but I'm just wary of twisting ourselves into knots trying to solve too many problems at once with one identifier to rule them all, and ultimately ending up with something that does not make things easy to identify.
So, we need to figure out a way to move forward and figure out whether we can have a single ID or whether we are going to have different IDs.
I was under the impression that there would be some service (Topology Service, maybe?) that would register a new cell as online when it completes provisioning, and provide the mapping between tenant ID and database sequence. This is actually the first time I'm hearing of the idea to derive a database sequence directly from the tenant ID.
I'm pretty sure that we cannot use the TENANTID when generating personal access tokens (and the like), as this identifier is too long compared to the existing generated identifiers. If I'm correct, the random part of a PAT is 16 characters.
We could try hashing the tenant ID and truncating the result. We would need to do this at some point anyway if the tenant ID was just an integer, because as soon as we reach cell 10000 we would not be able to just stick the tenant ID into the PAT (if that's what you're thinking). Worst comes to worst we could ask for the format of PATs to be changed to accommodate our use cases.
I was under the impression that there would be some service (Topology Service, maybe?) that would register a new cell as online when it completes provisioning, and provide the mapping between tenant ID and database sequence. This is actually the first time I'm hearing of the idea to derive a database sequence directly from the tenant ID.
No. Cells are a static list. We don't want to implement dynamic functions, as this introduces a bunch of complexity and security challenges.
I think the process of adding a Cell will look more or less like this (a config sketch follows the list):
As part of the Topology Service Deployer, a new Cell is added:
Admin runs a script or process to add a new cell under a new address.
The script claims the next ID, assigns the next DB range, provisions a TLS certificate that the Cell can use, and generates identifiers. It sets the initial state of the cell to state = "provisioning".
Admin creates an MR; the MR is merged and the change to config.toml is deployed.
The information generated by the deployer is put into the Tenant provisioner: IDs, names, certificates.
A new Cell is provisioned with all information in place.
As part of provisioning, a Cell will perform a configuration validation against Topology Service. It will use the TLS certificate, configure database ranges, and ensure that all identifiers are in sync.
Once the Cell is provisioned, we will change to state = "deployed".
Once claimed, a Cell will never be removed from config.toml; instead we will mark it as state = "decommissioned". This way we will have a single source of truth.
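A sketch of the resulting config.toml entry under the flow above; all field names are illustrative, not the actual Topology Service schema:

```toml
# One entry per claimed Cell; entries are never removed, only re-stated.
[[cells]]
id      = 100
address = "cell-100.gitlab.com"
state   = "provisioning"  # later "deployed"; "decommissioned" instead of removal

[cells.sequence_range]
from = 100_000_000  # database ID range claimed for this cell
to   = 199_999_999
```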
Some other remarks:
The Cell ID can be ensured by step 1 and by the fact that Cells will never be removed from config.toml.
The generated TENANTID can be a second-degree identifier that encodes all information and can easily be consumed by admins.
ULID does not provide any value, since the timestamp part is more or less useless for us.
We should consider having the TENANTID contain as much info about the Cell as possible: c<Cell-ID>-gcp-uswest-<random-string>. This way we could later easily find all requests matching particular criteria by using c*-*-uswest-*.
We could try hashing the tenant ID and truncating the result
This is usually a bad idea. We can do it the other way: hash the Cell ID with a salt to generate a more complex random string. Doing it in the other direction is super likely to generate collisions.
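A minimal sketch of that direction in Go: derive the random-looking part from the short Cell ID plus a secret salt (HMAC here; the 16-character budget and the names are assumptions):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cellSuffix derives a stable, random-looking suffix from a short Cell ID
// and a secret salt. Going this direction (small ID -> long string) avoids
// the collision risk of truncating a hash of a long tenant ID.
func cellSuffix(cellID int, salt []byte) string {
	mac := hmac.New(sha256.New, salt)
	fmt.Fprintf(mac, "cell-%d", cellID)
	return hex.EncodeToString(mac.Sum(nil))[:16] // assumed 16-char budget
}

func main() {
	fmt.Println("c100-" + cellSuffix(100, []byte("example-salt")))
}
```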
It seems we have multiple ID options to choose from, and I agree with the sentiment of not trying to twist ourselves into keeping one ID to rule them all.
Coming back to the original question for this thread given all of the above: What prefix to specify?
From what I see, we have two options:
Cell ID generated as part of Topology Service Deployer.
Tenant ID.
Setting aside what the $TENANTID looks like long term (as that feels like a separate discussion), say we were to use the Cell ID coming from the Topology Service Deployer as the prefix to move forward.
What does that look like in practice?
Is there a timeline for getting that Cell ID generation in place?
Cell ID is a stable identifier that never changes for a particular set of data. It is a unique number. Cell ID 100 is the same forever, regardless of which region it is deployed in.
Tenant ID is a derivative of the Cell ID that represents a particular deployment in a given place, be it GCP or a region.
Each Cell ID can have many Tenant IDs, as for a particular Cell ID we might have many Geo replicas that might become authoritative at some point; they represent the same set of data, but in different places.
The Tenant ID is a structured identifier that extends on top of the Cell ID; it contains information describing what and where, plus some randomness to ensure uniqueness. It could be of the form proposed above, e.g. c100-gcp-uswest-12356789afcdef.
The Topology Service holds information about cell_id and tenant_id. Each Tenant gets its own set of TLS certificates. Every Tenant for a given Cell ID generates the same cookies and tokens in the same format.
The Topology Service is aware of all Tenants holding the data of a given Cell ID and can instruct HTTP Router to balance some load accordingly.
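To restate the model as a sketch (Go, with assumed field names): one Cell ID identifies a data set, and each Cell can have several Tenant deployments of that data set:

```go
package main

import "fmt"

// Tenant is one deployment of a Cell's data set in a particular place.
type Tenant struct {
	ID     string // derived from the Cell ID, e.g. "c100-gcp-uswest-12356789afcdef"
	Region string
}

// Cell is the stable data set; its integer ID never changes.
type Cell struct {
	ID      int
	Tenants []Tenant // e.g. Geo replicas of the same data in different regions
}

func main() {
	c := Cell{ID: 100, Tenants: []Tenant{
		{ID: "c100-abcdef", Region: "us-west1"},
		{ID: "c100-fedcba", Region: "europe-west4"},
	}}
	fmt.Printf("cell %d has %d deployments\n", c.ID, len(c.Tenants))
}
```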
I have concerns about a Git-managed .toml file being used as a state-tracking single source of truth for our infra, but those aside (since I don't want to derail the conversation again)...
Each Cell ID can have many Tenant IDs, as for a particular Cell ID we might have many Geo replicas that might become authoritative at some point; they represent the same set of data, but in different places.
Wouldn't the tenant itself be responsible for managing and routing traffic between Geo replicas? I note that Dedicated already supports Geo (but doesn't use it by default).
Also, I (and I'm sure many others) was working under the assumption that 1 Cell == 1 Dedicated tenant. It seems likely that we'd need to make many changes to the underlying tooling to support any other model, since each tenant believes they are a standalone GitLab instance.
Tenant ID is a derivative of the Cell ID that represents a particular deployment in a given place, be it GCP or a region.
We explicitly ruled this out for the naming convention given the strict length limits we're working with. For the us-* regions that will be fine, but the more verbosely-named regions like europe-west4 (12 chars) or northamerica-northeast2 (23 chars) will seriously limit our ability to use the identifier anywhere else. -gcp- would be redundant as well since there are no plans for AWS cells. We are planning to surface pertinent information, such as region, as metadata elsewhere.
What if we were to say that CELLID == TENANTID (assuming that we can continue with the 1 Cell == 1 tenant model), and we start at c00001? The initial 0s are padding to meet the 6 character limit, which we can forgo for cell 100 onward. This seems like the simplest option.
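A quick sketch of the proposed padding, assuming zero-padding to five digits (so cell 1 becomes c00001):

```go
package main

import "fmt"

func main() {
	// Zero-pad the integer cell ID to five digits, per the proposal above.
	fmt.Printf("c%05d\n", 1)   // c00001
	fmt.Printf("c%05d\n", 123) // c00123
}
```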
I have concerns about a Git-managed .toml file being used as a state-tracking single source of truth for our infra, but those aside (since I don't want to derail the conversation again)...
Cells are not meant to be dynamically provisioned and decommissioned. This process is too heavy; there's no easy and fast way to move data out. They are meant to be stable entities that live over long periods of time and might be decommissioned at some point once all information is migrated away. This is contrary to solving the local scaling problem, where you dynamically scale a fleet of nodes to serve a particular Cell, like Sidekiq nodes.
Also, I (and I'm sure many others) was working under the assumption that 1 Cell == 1 Dedicated tenant. It seems likely that we'd need to make many changes to the underlying tooling to support any other model, since each tenant believes they are a standalone GitLab instance.
In principle this is a correct assumption and a good initial simplification.
However, I'm sure that we will face the DR problem and a requirement that some Cells be Geo-replicated. This could also be a special perk of the architecture for our Enterprise customers: we put you (the customer) on this Cell that is replicated across the Atlantic, and we always serve data from the region closest to you.
-gcp- would be redundant as well since there are no plans for AWS cells
I would never rule this out. It is redundant now, but architecturally Cells can work multi-cloud with limited egress impact. I'm not tied to gcp either, unless other folks are.
We explicitly ruled this out for the naming convention given the strict length limits we're working with. For the us-* regions that will be fine, but the more verbosely-named regions like europe-west4 (12 chars) or northamerica-northeast2 (23 chars) will seriously limit our ability to use the identifier anywhere else.
Where is the 16 character limit coming from?
we start at c00001
In general we were thinking that at most we would have 1024 cells. So, 4 digits gives us 9999 cells. To be on the safe side we can allocate, as you propose, 99999.
I have concerns about a Git-managed .toml file being used as a state-tracking single source of truth for our infra, but those aside (since I don't want to derail the conversation again)...
This is likely to change in the far future with the introduction of the database to the Topology Service; we won't add this right now, for iteration purposes.
Tenant ID is derivative of Cell ID that represents a particular deployment in a given place, be it GCP or region.
This has already been pointed out, but a tenant will/can run in 2 regions in one go: Geo is baked into a tenant, so if we enable Geo we'll be running in 2 regions.
Also, I (and I'm sure many others) was working under the assumption that 1 Cell == 1 Dedicated tenant.
I'm working under the same assumption here, and I think we should stick with that; having 2 tenant models trying to "serve" the same subset of user traffic will lead to confusion, split brain, and conflicts.
If we need to "fix a mistake" it's better to provision a new Cell and register that new cell with topology service.
Pro:
Easy mapping between application logs and infrastructure
Infrastructure and application are tightly coupled (still can't decide if this is a pro or con)
Con:
Length might still be too long for a PAT, unless we extend the length of PATs, which can be considered a breaking change since we have scanners looking for specific lengths with token prefixes.
"Expose" infrastructure information to the user, making it easier to reverse engineer. We use the tenant id as a fingerprint of all infrastructure, exposing this to the user seems dangerous because they can try and use that information to extract information around of infrastructure.
Biggest con: right now we are going to use the Cell ID as an identifier for routing. However, what happens when we migrate one of the organizations to another cell? We said that at the router layer we'll do full token checks so we route to the new cell, but then we'll no longer have cell_id == tenant_id; we'll have a potential 1-to-many mapping.
In #25621 (comment 2011549850) we discussed having a new field in the tenant model called cells: true; what if instead we have cell_id: [0-9]+? When cell_id is present, we know this is going to be a cell, so it will act as if we had set cells: true.
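A hypothetical tenant model fragment for this option (only the cell_id field comes from the proposal; any surrounding fields would be whatever the tenant model already has):

```yaml
# Option 2 sketch: the presence of cell_id marks the tenant as a cell.
cell_id: 4
```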
Pros
Addresses all the cons from option 1
Decoupling infrastructure and application. In Gitaly we created a naming convention where we coupled the name of the Gitaly storage in the application with the name of the node in GCP, so we'd have an easy mapping. This didn't turn out as expected: for DR reasons we have to provision new nodes when we fail over, and they can't have the same name, which resulted in a weirder naming convention appending b, c to the end of the node name, so we lost the 1:1 mapping. The problem here is that the storage name in the application has a different lifecycle than the infrastructure. A Cell and a tenant right now seem to have the same lifecycle.
Different IDs for different uses; using the same ID will end up making us jump through hoops like strange naming conventions, application support, and such.
Cons
Split in IDs: which one to use when? We could end up using cell_id as an identifier for some things and tenant_id for other things in infrastructure, which could end up being a problem. For example, does the domain end up following tenant_id or cell_id?
Provisioning and bootstrapping of a cell becomes a chicken-and-egg problem: do we create tenant_id and cell_id at the same time, and which one do we create first?
We lose the 1:1 mapping from logs if, inside of the Rails application, we just log cell_id; but we still have the cell_id inside of the tenant model, so we can still find our way to it.
We can easily address this problem by configuring a tenant_id that we add to logs in the Rails application (or inject via the hostname), since we'll know which pod/cluster will serve the request anyway.
@ayufan for my sake I want to try and have more concrete examples of the prefix for sessions and tokens:
Some of the current tokens on GitLab.com (invalid tokens):
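For illustration, GitLab tokens follow documented prefixes; the values below are made up and invalid:

```
glpat-XXXXXXXXXXXXXXXXXXXX  # personal access token
gldt-XXXXXXXXXXXXXXXXXXXX   # deploy token
glrt-XXXXXXXXXXXXXXXXXXXX   # runner authentication token
```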
Decoupling infrastructure and application. In Gitaly we created a naming convention where we coupled the name of the Gitaly storage in the application with the name of the node in GCP, so we'd have an easy mapping. This didn't turn out as expected: for DR reasons we have to provision new nodes when we fail over, and they can't have the same name, which resulted in a weirder naming convention appending b, c to the end of the node name, so we lost the 1:1 mapping. The problem here is that the storage name in the application has a different lifecycle than the infrastructure. A Cell and a tenant right now seem to have the same lifecycle.
I'm working under the same assumption here, and I think we should stick with that; having 2 tenant models trying to "serve" the same subset of user traffic will lead to confusion, split brain, and conflicts.
It is funny that you write this; those two cases seem to contradict each other. Cell ID is meant to describe a data set; Tenant ID is meant to describe a deployment. Why do we equate the two?
If we assume that the Tenant ID is always a derivative of the Cell ID, there's no confusion there. The confusion arises when the Tenant ID is a completely random string. In practice it is a common pattern to have one ID be a derivative of another, like in Kubernetes pods: replica-set-name = <deployment-name>-<random-string>.
By that pattern we can have c100-abcdef and c100-fedcba that represent the same data set that is tightly coupled (with Geo) but different deployments, even in different regions.
It is funny that you write this; those two cases seem to contradict each other. Cell ID is meant to describe a data set; Tenant ID is meant to describe a deployment. Why do we bundle the two together?
@ayufan I think what I mean by cell here is from an infrastructure perspective, not from the application perspective.
Let's explore a DR case where we have to recreate the Cell from backup:
@ayufan If we do this, the DR recovery will be done within the tenant, not by provisioning a brand new cell, for speed reasons. It would take a while to create DNS records, GKE clusters, and brand-new Postgres clusters; it's faster to do backup recovery here because recovering from a disk snapshot is much faster than creating everything from scratch.
We will have 2 DR strategies in Cells:
Geo replication: Hot replication, fallback via DNS records
Backup: In-place disk snapshot recovery for storage systems (gitaly + postgres)
It also depends on what kind of DR we are talking about, since there are a few scenarios: zonal outage, regional outage, or an application logic bug self-deleting everything.
I think what I mean by cell here is from an infrastructure perspective, not from the application perspective.
I think we should resolve this confusion and talk about a tenant as a deployment of a cell, but not use "cell" interchangeably. If we need to, we should refer to a "Tenant Cell" from the infra perspective to represent a particular deployment of a Cell.
If we do this, the DR recovery will be done within the tenant, not by provisioning a brand new cell, for speed reasons
Correct:
If we restore snapshots, it is likely going to be within a Tenant Cell.
If we use Geo, it is across two Tenant Cells, but both reference the same dataset as defined by the Cell ID.
If we use Geo, it is across two Tenant Cells, but both reference the same dataset as defined by the Cell ID.
@ayufan Geo is done within a tenant, so a single tenant will be deployed in 2 regions and then we'll fail over to the appropriate region. We won't have 2 tenants for multi-region; we'll have a single tenant with a primary region and a secondary region.
_gitlab_session=000015cf865011acc4f85f7d4842bfbb94ecf # where `00001` is the prefix for this cookie
@donnaalexandra I don't think we need the prefix at all; we can go with either:
cell-1-xxx
cell1xxx
We add this cell_id to the tenant model as suggested in Option 2 of this comment.
Agreed.
One piece of feedback we got from @ayufan regarding our naming convention was that the random string is hard to identify and doesn't mean anything. Maybe we should try to follow something similar to what other infra tools do:
Docker uses container names, or container names + a random string
Kubernetes uses deployment name + random string
Google Cloud does the same with project name + random string
I don't think we need the prefix at all; we can go with either:
cell-1-xxx
cell1xxx
@sxuereb sorry if it was not clear, I meant that the number prefix for the _gitlab_session comes from the cell ID. I added the "padding" as per @ayeung's comment above:
The initial 0s are padding to meet the 6 character limit, which we can forgo for cell 100 onward. This seems like the simplest option.
But we can probably forgo the padding if the cell_id != tenant_id
One feedback we got from @ayufan is regarding our naming convention his feedback was that the random string is hard to identify and doesn't mean anything.
Should we separate the naming convention for tenants into a separate thread or issue?
I've assigned myself to this Issue and am just diving in on pushing it forward.
Thanks to @sxuereb for helping me get my feet under me. I think to make this happen we'll need to modify some of the charts that are generated in Instrumentor, as well as how they are received by GET.
I'm brand new to GitLab and Dedicated's tooling, so if anyone wants to throw more directional info at me to help me hone in and unblock this Issue faster, I'd appreciate the guidance.
I haven't been directly involved with Cells for a quarter, but I would encourage you to go through the sandbox setup and the rest of the Dedicated onboarding steps (minus the team-specific things, of course) before trying to understand the nitty-gritty of how Instrumentor, Amp, and GET fit together; otherwise you will get lost very quickly.
@a_richter @sxuereb I realize we should use cell2 for the Dev/Prod Pre-QA cell. cell1 should be reserved for the legacy cell. I have updated the issue description.
If it's the expected result, can we close this issue and open a follow-up to configure cellsprod similarly once team Delivery-Deployments is ready to upgrade Instrumentor?
If I understand it right, the session currently being generated is $CELL_ID-XXXX, which doesn't match the HTTP Router rule that expects cell-$CELL_ID-XXX. @bmarjanovic @godfat-gitlab is that correct?
I think we have two options:
Update the routing rule: this has drawbacks because $CELL_ID-XXX is harder to match with a regex, more prone to false matches, and less clear.
Update Instrumentor's Rails configuration to set cell-4-XXXX, making it clearer as a session prefix.
Update the routing rule: this has drawbacks because $CELL_ID-XXX is harder to match with a regex, more prone to false matches, and less clear.
Since the session id is a hex string containing only [0-9a-f], it doesn't include -; as long as the cell id is always at the beginning, separated by -, there should not be false matches.
However, I do think it's less clear, and I had been under the impression that we'd use cell-$CELL_ID-XXXX, so I would prefer we stick to that.
That said, I think we can change the HTTP router as well, if that's easier to change.
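For concreteness, a sketch of the two candidate match rules as Go regexps (the actual router rule syntax is assumed):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Bare form: relies on the hex body never containing '-', so the first
	// '-' unambiguously terminates the cell id.
	bare := regexp.MustCompile(`^([0-9]+)-[0-9a-f]+$`)
	// Prefixed form: more explicit and harder to false-match.
	prefixed := regexp.MustCompile(`^cell-([0-9]+)-[0-9a-f]+$`)

	fmt.Println(bare.FindStringSubmatch("4-5cf865011acc4f85")[1])          // 4
	fmt.Println(prefixed.FindStringSubmatch("cell-4-5cf865011acc4f85")[1]) // 4
}
```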