Update Cell Database Sequence ID Docs

Commit 172c7141 authored 3 weeks ago by Prabakaran Murugesan, committed by Steve Xuereb 3 weeks ago
--- a/content/handbook/engineering/architecture/design-documents/cells/_index.md
+++ b/content/handbook/engineering/architecture/design-documents/cells/_index.md
@@ -44,7 +44,6 @@ This section links all different technical proposals that are being evaluated.
  - Planned: Indexing Service
 - [Mutual authentication between Cell services](mutual_authentication_between_cell_services.md)
 - [Feature Flags](./infrastructure/feature_flags.md) - ([Previous iteration](feature_flags.md))
-- [Cluster wide unique sequences](unique_sequences.md)
 - [Cells: Infrastructure](./infrastructure/_index.md)
 - [Organization migration](migration.md)
 - [Routable Tokens](routable_tokens.md)
@@ -194,4 +193,4 @@ The Tenant Scale team sees an opportunity to use GitLab Dedicated as a base for
 - [Database group investigation](../../../infrastructure-platforms/data-access/database-framework/doc/root-namespace-sharding/)
 - [Shopify Pods architecture](https://shopify.engineering/a-pods-architecture-to-allow-shopify-to-scale)
 - [Opstrace architecture](https://gitlab.com/gitlab-org/opstrace/opstrace/-/blob/main/docs/architecture/overview.md)
-- [Adding Diagrams to this blueprint](diagrams/index.md)
+- [Adding Diagrams to this blueprint](diagrams/_index.md)
--- a/content/handbook/engineering/architecture/design-documents/cells/decisions/008_database_sequences.md
+++ b/content/handbook/engineering/architecture/design-documents/cells/decisions/008_database_sequences.md
@@ -13,28 +13,15 @@ and different solutions were discussed in <https://gitlab.com/gitlab-org/core-pl

 ## Decision

-All Cells will have bigint IDs on creation. While provisioning, each of them will get a
-large range of sequences to use from the [Topology Service](../topology_service.md).
-On decommissioning the cell, these ranges will be
-returned back to the topology service. If the returned range is large enough for another cell, it could be handed out to
-them so that the short-lived cells won't exhaust large parts of the key range.
+All cells will have bigint IDs on creation. While provisioning, each of them will get a
+range of sequences to use from the [Topology Service](../topology_service.md). This range is used to set
+`minval`, `maxval` for all existing and newly created sequence IDs.

-We will update the Legacy Cell's sequence to have a `maxval`, it will be a minimum possible range to make sure it
-won't collide with any Cells.
-
-## Consequences
-
-The above decision will support till [Cells 1.5](../iterations/cells-1.5.md) but not [Cells 2.0](../iterations/cells-2.0.md).
-
-To support Cells 2.0 (i.e: allow moving organizations from
-Cells to the Legacy Cell), we need all integer IDs in the Legacy Cell to be converted to `bigint`. Which is an
-ongoing effort as part of [core-platform-section/data-stores/-/issues/111](https://gitlab.com/gitlab-org/core-platform-section/data-stores/-/issues/111)
-and it is estimated to take around 12 months.
+The Topology Service computes the sequence range using the logic described [here](../topology_service.md#logic-to-compute-the-range).

 ## Alternatives

-In addition to the [earliest proposal](../rejected/impacted_features/database_sequences.md), we evaluated
-below solutions before making the final decision.
+Below are the different solutions considered for this problem.

 - [Solution 1: Global Service to claim sequences](https://gitlab.com/gitlab-org/core-platform-section/data-stores/-/issues/102#note_1853252715)
 - [Solution 2: Converting all int IDs to bigint to generate uniq IDs](https://gitlab.com/gitlab-org/core-platform-section/data-stores/-/issues/102#note_1853260434)

--- a/content/handbook/engineering/architecture/design-documents/cells/rejected/impacted_features/database_sequences.md
+++ b/content/handbook/engineering/architecture/design-documents/cells/rejected/impacted_features/database_sequences.md
----
-stage: enablement
-group: Tenant Scale
-title: 'Cells: Database Sequences'
-status: rejected
-toc_hide: true
----
-
-{{< design-document-header >}}
-
-_This was surpassed by the [Cells: Unique sequences](../../unique_sequences.md) blueprint._
-
-{{% alert %}}
-This document is a work-in-progress and represents a very early state of the
-Cells design. Significant aspects are not documented, though we expect to add
-them in the future. This is one possible architecture for Cells, and we intend to
-contrast this with alternatives before deciding which approach to implement.
-This documentation will be kept even if we decide not to implement this so that
-we can document the reasons for not choosing this approach.
-{{% /alert %}}
-
-GitLab today ensures that every database row create has a unique ID, allowing to access a merge request, CI Job or Project by a known global ID.
-Cells will use many distinct and not connected databases, each of them having a separate ID for most entities.
-At a minimum, any ID referenced between a Cell and the shared schema will need to be unique across the cluster to avoid ambiguous references.
-Further to required global IDs, it might also be desirable to retain globally unique IDs for all database rows to allow migrating resources between Cells in the future.
-
-## 1. Definition
-
-## 2. Data flow
-
-## 3. Proposal
-
-These are some preliminary ideas how we can retain unique IDs across the system.
-
-### 3.1. UUID
-
-Instead of using incremental sequences, use UUID (128 bit) that is stored in the database.
-
-- This might break existing IDs and requires adding a UUID column for all existing tables.
-- This makes all indexes larger as it requires storing 128 bit instead of 32/64 bit in index.
-
-### 3.2. Use Cell index encoded in ID
-
-Because a significant number of tables already use 64 bit ID numbers we could use MSB to encode the Cell ID:
-
-- This might limit the amount of Cells that can be enabled in a system, as we might decide to only allocate 1024 possible Cell numbers.
-- This would make it possible to migrate IDs between Cells, because even if an entity from Cell 1 is migrated to Cell 100 this ID would still be unique.
-- If resources are migrated the ID itself will not be enough to decode the Cell number and we would need a lookup table.
-- This requires updating all IDs to 32 bits.
-
-### 3.3. Allocate sequence ranges from central place
-
-Each Cell might receive its own range of sequences as they are consumed from a centrally managed place.
-Once a Cell consumes all IDs assigned for a given table it would be replenished and a next range would be allocated.
-Ranges would be tracked to provide a faster lookup table if a random access pattern is required.
-
-- This might make IDs migratable between Cells, because even if an entity from Cell 1 is migrated to Cell 100 this ID would still be unique.
-- If resources are migrated the ID itself will not be enough to decode the Cell number and we would need a much more robust lookup table as we could be breaking previously assigned sequence ranges.
-- This does not require updating all IDs to 64 bits.
-- This adds some performance penalty to all `INSERT` statements in Postgres or at least from Rails as we need to check for the sequence number and potentially wait for our range to be refreshed from the ID server.
-- The available range will need to be stored and incremented in a centralized place so that concurrent transactions cannot possibly get the same value.
-
-### 3.4. Define only some tables to require unique IDs
-
-Maybe it is acceptable only for some tables to have a globally unique IDs. It could be Projects, Groups and other top-level entities.
-All other tables like `merge_requests` would only offer a Cell-local ID, but when referenced outside it would rather use an IID (an ID that is monotonic in context of a given resource, like a Project).
-
-- This makes the ID 10000 for `merge_requests` be present on all Cells, which might be sometimes confusing regarding the uniqueness of the resource.
-- This might make random access by ID (if ever needed) impossible without using a composite key, like: `project_id+merge_request_id`.
-- This would require us to implement a transformation/generation of new ID if we need to migrate records to another Cell. This can lead to very difficult migration processes when these IDs are also used as foreign keys for other records being migrated.
-- If IDs need to change when moving between Cells this means that any links to records by ID would no longer work even if those links included the `project_id`.
-- If we plan to allow these IDs to not be unique and change the unique constraint to be based on a composite key then we'd need to update all foreign key references to be based on the composite key.
-
-## 4. Evaluation
-
-## 4.1. Pros
-
-## 4.2. Cons
--- a/content/handbook/engineering/architecture/design-documents/cells/topology_service.md
+++ b/content/handbook/engineering/architecture/design-documents/cells/topology_service.md
@@ -134,24 +134,27 @@ Topology Service will make sure that the given range is not overlapping with oth
 #### Logic to compute the range

 ```mermaid
-graph TD
+flowchart TD
  A[64 bits] --> |1 bit - MSB| B[Sign]
  A -->|6 bits| C[Reserved]
-  A -->|16 bits| D[CellID]
-  A -->|41 bits| E[Sequence]
+  A -->|57 bits| D[Sequence]
+  D --> E{Legacy Cell?}
+  E --> |Yes|F[min: 1, max: 10^12 - 1]
+  E --> |"No (new cells)"| G{'QA' bucket?}
+  G --> |Yes| H[min: currentMaxId + 1, max: min + 10^9 - 1]
+  G --> |No| I[min: currentMaxId + 1, max: min + 10^11 - 1]
 ```

 - **Sign**: Always 0 for positive numbers.
 - **Reserved**: Currently always `0`, reserved for 2 purposes.
  1. To increase the number of cells, if needed.
  1. To allow us to switch to a variant of ULID ID allocation in future without interfering with the existing IDs. Since
-   ULID based ID allocator will have the `timestamp` value in the  most significant bits,
+   ULID based ID allocator will have the `timestamp` value in the most significant bits,
   reserving only one bit would have been sufficient but
   more bits are reserved to have the sequence bits at minimum.
-- **CellID**: A unique auto-incrementing [unique identifier for a Cell](decisions/012_cell_unique_identifier.md) starting with `1`, can support up to 65,535 Cell IDs.
-- **Sequence**: The sequence that will be used for each table in the database.
-  41 bits can support ~2 trillion IDs (2199,023,255,551) per cell (per sequence).
-  At the time of writing, the largest ID is 11,098,430,930 (primary key of `security_findings` table), so it's 200 times the current largest ID, which is sufficient.
+- **Sequence**:
+  - Legacy cell gets the first trillion IDs. QA cells get 1 billion IDs and other new cells get 100 billion IDs each.
+  - Assuming all the new cells created are non-QA and excluding the legacy cell, this will support 1,441,141 cells (using 57 bits).

 Example `config.toml` of Topology Service:

@@ -159,28 +162,60 @@ Example `config.toml` of Topology Service:
 [[cells]]
 id = 1
 address = "legacy.gitlab.com"
-sequence_range = [0, 2199023255551]
+sequence_range = [1, 999999999999] # 1 trillion
+buckets = ["paid", "free"]
+status = "active"

 [[cells]]
 id = 2
 address = "cell-2-example.gitlab.com"
-sequence_range = [2199023255552, 4398046511103]
+sequence_range = [1000000000000, 1099999999999] # 100 billion
+buckets = ["paid", "free"]
+status = "active"
+
+[[cells]]
+id = 3
+address = "cells-3-test.gitlab.com"
+sequence_range = [1100000000000, 1100999999999] # 1 billion
+buckets = ["QA"]
+status = "active"
+
+[[cells]]
+id = 4
+address = "cells-4-example.gitlab.com"
+sequence_range = [1101000000000, 1200999999999] # 100 billion
+buckets = ["free"]
+status = "active"
 ```

-Calculation for `id = 1`:
+- Status:
+  - ready: Cell is not yet ready to accept traffic, but we hold a slot.
+  - online: Cell is accepting traffic and is part of cluster discovery.
+  - offline: Cell is valid but not accepting traffic and is still part of cluster discovery.
+  - removed: Cell is removed and will never be active again.
+
+Once a cell is `removed`, we will update its `sequence_range` with the _maxval_ consumed by the cell,
+so that if a normal cell is removed (decommissioned), new QA cells can get IDs from its unused IDs (if more than 1 billion remain).
+
+##### Sequence Saturation
+
+At the time of writing the largest ID in the legacy cell was ~11 billion (PK of `security_findings` table), so
+the legacy cell and new non-QA cells will have sufficient IDs to grow within their sequence_range.
+
+QA cells might need more IDs as they are given 1 billion IDs. Cells sequence data are monitored regularly,
+and TS can provide an additional 1 billion IDs (from currentMaxId) to the cell, if their consumption is over 99%.

-- Sequences per cell: `2^41 -> 2199023255552`
-- Sequence `min`: `(CellId - 1) * SequencesPerCell` -> `(1 - 1) * 2199023255552` -> `0`
-- Sequence `max`: `(CellId * SequencesPerCell) - 1` -> `(1 * 2199023255552) - 1` -> `2199023255551`
+[Issues#517296](https://gitlab.com/gitlab-org/gitlab/-/issues/517296) handles this.

-Calculation for `id = 2`:
+NOTE:

-- Sequences per cell: `2^41 -> 2199023255552`
-- Sequence `min`: `(CellId - 1) * SequencesPerCell` -> `(2 - 1) * 2199023255552` -> `2199023255552`
-- Sequence `max`: `(CellId * SequencesPerCell) - 1` -> `(2 * 2199023255552) - 1` -> `4398046511103`
+- The above decision supports up to [Cells 1.5](iterations/cells-1.5.md) but not [Cells 2.0](iterations/cells-2.0.md).
+  - To support Cells 2.0 (that is, to allow moving organizations from
+  Cells to the Legacy Cell), all integer IDs in the Legacy Cell must be converted to `bigint`.
+  This is an ongoing effort as part of [core-platform-section/data-stores/-/issues/111](https://gitlab.com/gitlab-org/core-platform-section/data-stores/-/issues/111)
+  and is estimated to take around 12 months.

-More details on the decision taken and other solutions evaluated can be found [here](decisions/008_database_sequences.md)
-and the reasoning behind choosing the logic to generate sequence ranges can be found [here](https://gitlab.com/gitlab-org/gitlab/-/issues/465809).
+More details on the decision taken and other solutions evaluated can be found [here](decisions/008_database_sequences.md).

 ```proto
 // sequence_request.proto
@@ -711,7 +746,7 @@ Citations:

 1. Google (n.d.). Using private service connect with cloudrun services. Google Cloud. Retrieved Nov 11, 2024, from <https://cloud.google.com/vpc/docs/private-service-connect>
 1. Google (n.d.). How multi-region with cloud spanner works. Google Cloud. Retrieved Nov 11, 2024,<https://cloud.google.com/blog/topics/developers-practitioners/demystifying-cloud-spanner-multi-region-configurations>
-1. [ADR for private service connect](..q/decisions/004_vpc_subnet_design/)
+1. [ADR for private service connect](decisions/004_vpc_subnet_design.md)

 ### Performance


--- a/content/handbook/engineering/architecture/design-documents/cells/unique_sequences.md
+++ b/content/handbook/engineering/architecture/design-documents/cells/unique_sequences.md
----
-stage: core platform
-group: database
-title: 'Cells: Unique sequences'
-status: accepted
-toc_hide: true
----
-
-GitLab today ensures that every database row create has a unique ID, allowing to access a merge request, CI Job or Project by a known global ID.
-Cells will use many distinct and not connected databases, each of them having a separate ID for most entities.
-
-At a minimum, any ID referenced between a Cell and the shared schema will need to be unique across the cluster to avoid ambiguous references.
-Further to required global IDs, it might also be desirable to retain globally unique IDs for all database rows to allow moving organizations between Cells.
-
-## 1. Goal
-
-Is to have non-overlapping sequences across the cluster, so that there will not be a problem while moving organizations between cells.
-
-## 2. Decision
-
-Cells will have bigint IDs while provisioning and each cell will reach out to the Topology Service to get
-the sequence range, TS will ensure that the sequence ranges are not colliding with other cells.
-
-The range got from the SequenceService will be used to set `maxval` and `minval` for all existing ID sequences and any
-newly created IDs.
-
-Logic to compute to the sequence range and the interactions between cells and the topology service can be found [here](topology_service.md#workflow).
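
Review note: the range logic this commit adds to `topology_service.md` (the flowchart and the example `config.toml`) can be sanity-checked with a small sketch. This is not the Topology Service implementation; the function and constant names below are hypothetical, and the only inputs taken from the diff are the range sizes (first trillion IDs for the legacy cell, 10^9 per QA cell, 10^11 per other new cell) and the 57 sequence bits.

```python
# Illustrative sketch only. Names are hypothetical, not the Topology Service API.
LEGACY_MAX = 10**12 - 1        # legacy cell: min 1, max 10^12 - 1
QA_RANGE_SIZE = 10**9          # cells in the "QA" bucket
DEFAULT_RANGE_SIZE = 10**11    # all other new cells
MAX_ID = 2**57 - 1             # largest ID representable in 57 sequence bits


def next_range(current_max_id: int, buckets: list[str]) -> tuple[int, int]:
    """Return (min, max) for a new cell, starting right after current_max_id."""
    size = QA_RANGE_SIZE if "QA" in buckets else DEFAULT_RANGE_SIZE
    lo = current_max_id + 1
    hi = lo + size - 1
    if hi > MAX_ID:
        raise ValueError("sequence space exhausted")
    return lo, hi


def allocate(new_cell_buckets: list[list[str]]) -> list[tuple[int, int]]:
    """Allocate ranges for the legacy cell followed by each new cell, in order."""
    ranges = [(1, LEGACY_MAX)]
    current_max = LEGACY_MAX
    for buckets in new_cell_buckets:
        lo, hi = next_range(current_max, buckets)
        ranges.append((lo, hi))
        current_max = hi
    return ranges


# Cells 2, 3 and 4 from the example config.toml in this diff.
ranges = allocate([["paid", "free"], ["QA"], ["free"]])

# Number of non-QA cells that fit after the legacy range within 57 bits.
max_non_qa_cells = (MAX_ID - LEGACY_MAX) // DEFAULT_RANGE_SIZE
```

Under these assumptions the sketch reproduces the four `sequence_range` values in the example config and the 1,441,141 cell figure quoted in the **Sequence** bullet.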