Commit b23f6a4e authored by Sangwoo Han, committed by Tarun Khandelwal

Adds Path-Based Routing to HTTP Router's design doc

parent e383fabe
+37 −219
@@ -187,7 +187,7 @@ The Routing Service implements the following design guidelines:

1. Simple:
   - Routing service does not buffer requests.
   - Routing service can only proxy to a single Cell based on request headers.
   - Routing service can only proxy to a single Cell based on the incoming request.
1. Stateless:
   - Routing service does not have permanent storage.
   - Routing service uses multi-level cache: in-memory, external shared cache.
@@ -199,9 +199,7 @@ The Routing Service implements the following design guidelines:
   - Routing service is configured with a static list of Cells.
   - Routing service configuration is applied as part of service deployment.
1. Rule-based:
   - Routing rules are a static JSON file that is part of routing service.
   - Configured rules need to be compatible with all versions of GitLab running in a cluster.
   - Rules can match on any criteria: a header, the content of a header, or the route path.
   - The routing rules are configurable.
1. Agnostic:
   - Routing service is not aware of high-level concepts like organizations.
   - The classification is done per the specification provided in the rules, to find the classification key.
@@ -236,11 +234,14 @@ graph TD;

### Routing rules

- The router applies a hierarchy of rules to route a request.
  - Incoming requests are first matched against path patterns configured in the router. Claims are extracted from the paths (for example, `/:NAMESPACE_PATH`) and used to determine the target Cell through the [Topology Service](topology_service.md).
  - The router falls back to token-based routing if the claim in the path is not found, or if the path does not contain a claim (for example, `/api/graphql` and `cable`).
- The routing rules describe how to decode the request, find the classification key, and make the routing decision.
- The routing rules are static and defined ahead of time as part of HTTP Router deployment.
- The routing rules are defined as a JSON document describing in-order a sequence of operation.
- Apart from the Path routing rule, other routing rules are defined as a JSON document describing in-order a sequence of operations.
- The routing rules might be compiled to application code to provide a much faster execution scheme.
- Each routing rule is described by the `cookies`, `headers`, `path`, `method`, and `action`.
- Each routing rule is described by the `cookies`, `headers`, `method`, and `action`.
- The `action` can be `classify` as a way to indicate that the Topology Service should be used
  to perform dynamic classification.
- The `action` can be `proxy` as a way to indicate to perform passthrough to the fixed
@@ -269,14 +270,11 @@ The routing rules JSON structure describes all matchers:
                    "match_regex": "<regex_match>"
                },
            },
            "path": {
                "match_regex": "<regex_match>"
            },
            "method": ["<list_of_accepted_methods>"],

            "action": "classify",
            "classify": {
                "type": "session_prefix|project_path|...",
                "type": "session_prefix|...",
                "value": "string_build_from_regex_matchers"
            },

@@ -322,25 +320,6 @@ Example of the routing rules that makes routing decision based session cookie, a
}
```

Example of routing rules published by all Cells that make a routing decision based on the path:

```json
{
    "rules": [
        {
            "path": {
                "match_regex": "^/api/v4/projects/(?<project_id_or_path_encoded>[^/]+)(/.*)?$"
            },
            "action": "classify",
            "classify": {
                "type": "project_id_or_path",
                "value": "${project_id_or_path_encoded}"
            }
        }
    ]
}
```
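As a rough illustration of how a path rule like the one above could be evaluated, the following Python sketch matches the request path against `match_regex` and substitutes the named capture into the `classify.value` template. The helper names `to_python_regex` and `apply_path_rule` are hypothetical and not part of the HTTP Router. Python spells named groups `(?P<name>...)`, so the sketch converts the `(?<name>...)` syntax used in the rule before compiling.

```python
import re

# Rule copied from the JSON example above.
RULE = {
    "path": {
        "match_regex": r"^/api/v4/projects/(?<project_id_or_path_encoded>[^/]+)(/.*)?$"
    },
    "action": "classify",
    "classify": {
        "type": "project_id_or_path",
        "value": "${project_id_or_path_encoded}",
    },
}


def to_python_regex(pattern):
    """Rewrite (?<name>...) named groups as (?P<name>...), leaving lookbehinds alone."""
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


def apply_path_rule(rule, path):
    """Return the classification request for `path`, or None if the rule does not match."""
    match = re.match(to_python_regex(rule["path"]["match_regex"]), path)
    if match is None:
        return None
    value = rule["classify"]["value"]
    # Replace every ${name} placeholder with the text captured by that group.
    for name, captured in match.groupdict().items():
        value = value.replace("${" + name + "}", captured or "")
    return {"type": rule["classify"]["type"], "value": value}


print(apply_path_rule(RULE, "/api/v4/projects/1000/issues"))
print(apply_path_rule(RULE, "/users/sign_in"))
```

The sketch leaves the captured value URL-encoded, because the rule names it `project_id_or_path_encoded`. Whether decoding happens in the router or in the Topology Service is not stated here.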

### Classification

The classification is implemented by [the Classify Service of the Topology Service](topology_service.md#classify-service).
@@ -359,57 +338,40 @@ The classification is implemented by [the Classify Service of the Topology Servi
  wipe a particular type of cache on edge.
- The cache is controlled by Topology Service, but the HTTP Router might force some response into the cache.

For the above example:
#### Path-Based Routing example

1. The router sees request to `/api/v4/projects/1000/issues`.
1. It selects the above `rule` for this request, which requests `classify` for `project_id_or_path_encoded`.
1. It decodes `project_id_or_path_encoded` to be `1000`.
1. It checks the cache for a Cell associated with `project_id_or_path_encoded=1000`.
1. It sends the request to `/api/v1/classify` (`type=project_id_or_path`, `value=1000`) if no Cell was found in the cache.
1. Topology Service responds with the Cell holding the given project, and also all other equivalent classification keys
   for the resource that should be put in the cache.
1. Routing Service caches the result for the duration specified in the configuration or in the response.
For a request to `/o/swh/dashboard/groups`:

```json
# POST /api/v1/classify
## Request:
{
    "type": "project_id_or_path",
    "value": 1000
}

## Response:
{
    "action": "proxy",
    "proxy": {
        "address": "cell1.gitlab.com"
    },
    "other_classifications": [ // list of all equivalent keys that should be put in the cache
        { "type": "session_prefix", "value": "cell1" },
        { "type": "project_full_path", "value": "gitlab-org/gitlab" },
        { "type": "project_full_path", "value": "gitlab-org/gitlab" },
        { "type": "namespace_full_path", "value": "gitlab-org" }
    ]
}
```mermaid
sequenceDiagram
    participant user as User
    participant router as Router
    participant ts as Topology Service
    participant cell_2 as Cell 2
    user->>router: GET /o/swh/dashboard/groups
    router->>router: Router matches /o/:ORGANIZATION_PATH<br/>Extract ORGANIZATION_PATH: "swh"
    router->>+ts: Classify({bucket: {type: "ORGANIZATION_PATH", value: "swh"}})
    ts->>-router: Proxy(address="cell-2.gitlab.com")
    router->>cell_2: /o/swh/dashboard/groups
    cell_2->>user: <h1>...
```
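The path-based flow in the diagram, together with the fallback to token-based routing, might be sketched as follows. This is a minimal sketch under stated assumptions: the pattern list, the prefix-matching behavior, and the names `match_pattern` and `routing_decision` are hypothetical, since the actual pattern syntax and matching semantics of the router are not specified here.

```python
# Hypothetical pattern list; segments starting with ":" are claims to extract.
PATTERNS = ["/o/:ORGANIZATION_PATH"]


def match_pattern(pattern, path):
    """Return the claims extracted from `path` if it starts with `pattern`, else None."""
    pattern_parts = pattern.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(path_parts) < len(pattern_parts):
        return None
    claims = {}
    for expected, actual in zip(pattern_parts, path_parts):
        if expected.startswith(":"):
            if not actual:
                return None
            claims[expected[1:]] = actual
        elif expected != actual:
            return None
    return claims


def routing_decision(path):
    """Prefer a claim found in the path; otherwise signal the token-based fallback."""
    for pattern in PATTERNS:
        claims = match_pattern(pattern, path)
        if claims:
            name, value = next(iter(claims.items()))
            return {"strategy": "path", "type": name, "value": value}
    return {"strategy": "token"}


print(routing_decision("/o/swh/dashboard/groups"))
print(routing_decision("/api/graphql"))
```

In this sketch, a request such as `/api/graphql` carries no claim, so the decision falls through to the token-based rules.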

The following code represents a negative response when a classification key was not found:
#### JSON rule classification example

```json
# POST /api/v4/internal/cells/classify
## Request:
{
    "type": "project_id_or_path",
    "value": 1000
}
For a request to `/api/v4/job/allowed_agents` with a `JOB-TOKEN` header encoding `cell-2`:

## Response:
{
    "action": "reject",
    "reject": {
        "http_status": 404
    }
}
```mermaid
sequenceDiagram
    participant client as Client
    participant router as Router
    participant ts as Topology Service
    participant cell_2 as Cell 2
    client->>router: GET /api/v4/job/allowed_agents<br/>JOB-TOKEN with cell-2 encoded
    router->>router: Does not contain a claim<br/>Falls back to token based routing<br/>Extracts the cell from the header
    router->>+ts: Classify({session_prefix: "cell-2"})
    ts->>-router: Proxy(address="cell-2.gitlab.com")
    router->>cell_2: /api/v4/job/allowed_agents
    cell_2->>client: <h1>...
```

### Configuration
@@ -482,150 +444,6 @@ Note: It is important for this rollout strategy to follow the timeline. You will
   `25`, `50`, `75`, `100` percents. Keep `CHANGE_LOCK_OVERRIDE` and `OVERRIDE_LAST_PERCENTAGE` set to `true` through entire rollout cycle.
1. Once 100% of traffic is rolled out, open an MR on the [deploy-worker.sh](https://gitlab.com/gitlab-com/gl-infra/cells/http-router-deployer/-/blob/main/scripts/deploy-worker.sh) script to set the value back to the full sequence `"5 25 50 75 100"`. Example: `ROLLOUT_PERCENTAGES="5 25 50 75 100"`. Remove the `OVERRIDE_LAST_PERCENTAGE` and `CHANGE_LOCK_OVERRIDE` environment variables in [`.gitlab-ci.yml`](https://gitlab.com/gitlab-com/gl-infra/cells/http-router-deployer/-/blob/main/.gitlab-ci.yml).

## Request flows

1. There are two Cells.
1. `gitlab-org` is a top-level namespace and lives in `Cell US0` in the `GitLab.com Public` organization.
1. `my-company` is a top-level namespace and lives in `Cell EU0` in the `my-organization` organization.

### Router configured to perform the following routing

1. The Cell US0 supports all other public-facing projects.
1. The Cell EU0 is configured to generate all secrets and session cookies with a prefix like `cell_eu0_`.
   1. The Personal Access Token is scoped to Organization, and because the Organization is part only of a single Cell,
      the PATs generated are prefixed with Cell identifier.
   1. The Session Cookie encodes Organization in-use, and because the Organization is part only of a single Cell,
      the session cookie generated is prefixed with Cell identifier.
1. The Cell EU0 allows only private organizations, groups, and projects.
1. The Cell US0 is a target Cell for all requests unless explicitly prefixed.

Router rules:

```json
{
    "rules": [
        {
            "cookies": {
                "_gitlab_session": {
                    "regex_match": "^(?<cell_name>cell.*:)"
                }
            },
            "action": "classify",
            "classify": {
                "type": "session_prefix",
                "value": "${cell_name}"
            }
        },
        {
            "headers": {
                "GITLAB_TOKEN": {
                    "regex_match": "^(?<cell_name>cell.*-)"
                }
            },
            "action": "classify",
            "classify": {
                "type": "token_prefix",
                "value": "${cell_name}"
            }
        },
        {
            "action": "classify",
            "classify": {
                "type": "first_cell",
            }
        }
    ]
}
```
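The in-order evaluation of these rules could look roughly like the Python sketch below: the first rule whose cookie and header matchers all succeed wins, and the last rule, which has no matchers, acts as the default. The function names are hypothetical. The regular expressions are also an assumption: they are written in Python syntax and chosen to match the `cell_eu0_...` values used in the diagrams of this section, rather than copied from the JSON above.

```python
import re

# Assumed transcription of the rules; regexes adapted to the cell_eu0_ prefix format.
RULES = [
    {
        "cookies": {"_gitlab_session": r"^(?P<cell_name>cell_[a-z0-9]+)_"},
        "classify": {"type": "session_prefix", "value": "${cell_name}"},
    },
    {
        "headers": {"GITLAB_TOKEN": r"^(?P<cell_name>cell_[a-z0-9]+)_"},
        "classify": {"type": "token_prefix", "value": "${cell_name}"},
    },
    {"classify": {"type": "first_cell"}},
]


def match_fields(matchers, actual):
    """Return named captures if every matcher succeeds, else None."""
    captures = {}
    for name, pattern in matchers.items():
        match = re.search(pattern, actual.get(name, ""))
        if match is None:
            return None
        captures.update(match.groupdict())
    return captures


def first_matching_classification(rules, cookies, headers):
    """Walk the rules in order and build the classify request of the first match."""
    for rule in rules:
        captures = {}
        matched = True
        for field, actual in (("cookies", cookies), ("headers", headers)):
            found = match_fields(rule.get(field, {}), actual)
            if found is None:
                matched = False
                break
            captures.update(found)
        if not matched:
            continue
        classification = dict(rule["classify"])
        if "value" in classification:
            for name, captured in captures.items():
                placeholder = "${" + name + "}"
                classification["value"] = classification["value"].replace(placeholder, captured)
        return classification
    return None


print(first_matching_classification(RULES, {"_gitlab_session": "cell_eu0_uwwz7rdavil9"}, {}))
print(first_matching_classification(RULES, {}, {}))
```

Because the final rule has neither `cookies` nor `headers`, it matches every request, which corresponds to Cell US0 being the target for all requests that are not explicitly prefixed.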

#### Goes to `/my-company/my-project` while logged in to Cell EU0

1. Because the user switched the Organization to `my-company`, their session cookie is prefixed with `cell_eu0_`.
1. The user sends a request to `/my-company/my-project`, and because the cookie is prefixed with `cell_eu0_`, it is directed to Cell EU0.
1. `Cell EU0` returns the correct response.

```mermaid
sequenceDiagram
    participant user as User
    participant router as Router
    participant cache as Cache
    participant ts as Topology Service
    participant cell_eu0 as Cell EU0
    participant cell_eu1 as Cell EU1
    user->>router: GET /my-company/my-project<br/>_gitlab_session=cell_eu0_uwwz7rdavil9
    router->>+cache: GetClassify(type=session_prefix, value=cell_eu0)
    cache->>-router: NotFound
    router->>+ts: Classify(type=session_prefix, value=cell_eu0)
    ts->>-router: Proxy(address="cell-eu0.gitlab.com")
    router->>cache: Cache(type=session_prefix, value=cell_eu0) = Proxy(address="cell-eu0.gitlab.com")
    router->>cell_eu0: GET /my-company/my-project
    cell_eu0->>user: <h1>My Project...
```
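The cache-then-classify sequence in the diagram above can be approximated with a small Python sketch. It assumes a single in-memory cache with a fixed time-to-live, whereas the design describes a multi-level cache. `ClassifyCache`, `classify_with_cache`, and the `topology_classify` callback are hypothetical names standing in for the real cache and the Topology Service client.

```python
import time


class ClassifyCache:
    """In-memory cache keyed by (type, value) with a fixed time-to-live."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.entries = {}

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is None:
            return None
        decision, expires_at = entry
        if now >= expires_at:
            # Expired entries are dropped so the next lookup asks the Topology Service.
            del self.entries[key]
            return None
        return decision

    def put(self, key, decision, now):
        self.entries[key] = (decision, now + self.ttl_seconds)


def classify_with_cache(cache, topology_classify, kind, value, now=None):
    """Return the cached routing decision, calling the Topology Service only on a miss."""
    now = time.monotonic() if now is None else now
    key = (kind, value)
    decision = cache.get(key, now)
    if decision is None:
        decision = topology_classify(kind, value)
        cache.put(key, decision, now)
    return decision


def fake_topology_classify(kind, value):
    # Stand-in for the Topology Service call; always proxies to Cell EU0.
    return {"action": "proxy", "proxy": {"address": "cell-eu0.gitlab.com"}}


cache = ClassifyCache(ttl_seconds=60)
print(classify_with_cache(cache, fake_topology_classify, "session_prefix", "cell_eu0"))
```

A second call with the same `session_prefix` inside the time-to-live window is served from the cache without contacting the Topology Service.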

#### Goes to `/my-company/my-project` while not logged in

1. User visits `/my-company/my-project`, and because they do not have a session cookie, the request is forwarded to `Cell US0`.
1. User signs in.
1. GitLab sees that the user's default organization is `my-company`, so it assigns a session cookie prefixed with `cell_eu0_` to indicate that
   the user is meant to interact with `my-company`.
1. User sends a request to `/my-company/my-project` again, now with the session cookie that proxies to `Cell EU0`.
1. `Cell EU0` returns the correct response.

NOTE:
The `cache` is intentionally skipped here to reduce diagram complexity.

```mermaid
sequenceDiagram
    participant user as User
    participant router as Router
    participant ts as Topology Service
    participant cell_us0 as Cell US0
    participant cell_eu0 as Cell EU0
    user->>router: GET /my-company/my-project
    router->>ts: Classify(type=first_cell)
    ts->>router: Proxy(address="cell-us0.gitlab.com")
    router->>cell_us0: GET /my-company/my-project
    cell_us0->>user: HTTP 302 /users/sign_in?redirect=/my-company/my-project
    user->>router: GET /users/sign_in?redirect=/my-company/my-project
    router->>cell_us0: GET /users/sign_in?redirect=/my-company/my-project
    cell_us0-->>user: <h1>Sign in...
    user->>router: POST /users/sign_in?redirect=/my-company/my-project
    router->>cell_us0: POST /users/sign_in?redirect=/my-company/my-project
    cell_us0->>user: HTTP 302 /my-company/my-project<br/>_gitlab_session=cell_eu0_uwwz7rdavil9
    user->>router: GET /my-company/my-project<br/>_gitlab_session=cell_eu0_uwwz7rdavil9
    router->>ts: Classify(type=session_prefix, value=cell_eu0)
    ts->>router: Proxy(address="cell-eu0.gitlab.com")
    router->>cell_eu0: GET /my-company/my-project<br/>_gitlab_session=cell_eu0_uwwz7rdavil9
    cell_eu0->>user: <h1>My Project...
```

#### Goes to `/gitlab-org/gitlab` after last step

User visits `/gitlab-org/gitlab`, and because they have a session cookie, the request is forwarded to `Cell EU0`.
There is no need to ask the Topology Service, because the classification for the session cookie prefix is already cached.

```mermaid
sequenceDiagram
    participant user as User
    participant router as Router
    participant cache as Cache
    participant ts as Topology Service
    participant cell_eu0 as Cell EU0
    participant cell_eu1 as Cell EU1
    user->>router: GET /my-company/my-project<br/>_gitlab_session=cell_eu0_uwwz7rdavil9
    router->>+cache: GetClassify(type=session_prefix, value=cell_eu0)
    cache->>-router: Proxy(address="cell-eu0.gitlab.com")
    router->>cell_eu0: GET /my-company/my-project
    cell_eu0->>user: <h1>My Project...
```

### Performance and reliability considerations

- It is expected that there will be a penalty when learning a new classification key. However,
  the multi-layer cache should provide a very high cache hit ratio,
  due to the low cardinality of classification keys. Each classification key is effectively mapped
  to a resource (organization, group, or project), and there is a finite number of those.

## Alternatives

### Buffering requests