Validate on GDK that a logged in user on the 2nd cell have their requests routed to the correct cell. Note: As we do not have login flow yet, we may need to a implement a special rule to force requests to route to the special cell given a special cookie. So that we can get to the login page for Cell 2.
What does classify mean in this context? It looks like it'll POST to Topology service, and the service will simply say proxy it. My assumption is that classification is the way for the http router to ask the topology service which cell it should fetch from. Do we support proxying more than one cells yet?
for (construleofrules){returnapplyRule(modifiedRequest,env,ctx,rule);}
If so does that mean having src/rules/passthrough.json can already proxy all the requests? Or did I misunderstand how TypeScript works here, that it'll not iterate through all the rules?
I didn't read the type carefully. Now I see this is configured via GITLAB_RULES_CONFIG, so we'll only pick one rules set in a JSON file. We set to passthrough so only src/rules/passthrough.json is used.
Now I think that the above loop should be updated because otherwise we can't have more than one rules. match is where conditional rules are introduced. It looks like it's partially implemented in gitlab-org/cells/http-router@92e89766
What does classify mean in this context? It looks like it'll POST to Topology service, and the service will simply say proxy it. My assumption is that classification is the way for the http router to ask the topology service which cell it should fetch from. Do we support proxying more than one cells yet?
We're not applying all rules, currently we'll apply the first and only matching rule.
What is src/rules/firstcell.json rule used for?
firstcell is a routing rule performing dynamic classification to route to the FirstCell defined by Topology Service. Requires GITLAB_TOPOLOGY_SERVICE_URL to be configured.
There are also some docs here that might be helpful.
@OmarQunsulGitlab Thanks a lot! The context helped a bunch. I did read the doc but didn't read into my mind. Now with better context in mind I understand much more while re-reading it.
@bmarjanovic Yes, thank you for the specific sections! Re-reading them closer now, I am a bit surprised to see this line:
Sends the request to /api/v1/classify (type=project_id_or_path, value=1000) if no Cells was found in cache.
Below are not relevant to this issue, and I think it must have been discussed before:
I was only thinking about project paths before, and since project paths need to be unique globally (across cells), it makes sense to have topology service to oversee this. I completely forgot the API part.
This sounds like it means we need to make project id globally unique across cells as well, and essentially any resources that we use a global id to look up via API, which doesn't feel like a good design to me, especially if we try to make this id serial and incremental.
I understand this is how GitLab is working right now, given that having a single database, a serial incremental id isn't really an issue. But do we have any plans to change this somehow? I am asking this because if we can scope everything under a group or a project, then it shouldn't be an issue if the id collides between cells.
That said, maybe if we can make sure the ids are all still unique across cells, we can more easily move organization between cells without rewriting the ids. That said, a serial is still not needed, maybe we can utilize UUID or another kind of random id maybe? This way we wouldn't need to lock on incrementing the id either.
firstcell is a routing rule performing dynamic classification to route to the FirstCell defined by Topology Service. Requires GITLAB_TOPOLOGY_SERVICE_URL to be configured.
Thanks! I think my main question is why do we need to know the first cell, and have a way to talk to it? Is it a kind of leader in the cells?
I checked my local GDK closer, and launched cell-2 to verify my understanding. It looks like:
We're actually treating the top level GitLab instance as cell-1 (listens on 3333), and ignore gitlab-cells/cell-1 (listens on 12001).
It doesn't look like we support more than one cell yet (which is my assumption as well)
So I guess maybe firstcell is just a test or experiment?
regex_match will be used to match the cookie value. Is it ok to rename at this stage? I think the name is a bit confusing. I'll probably call it value_match or value_regex_match, or match_value.
@godfat-gitlab yes, the FirstCell is a starting point, and AFAIK GDK should support multiple cells.
regex_match will be used to match the cookie value. Is it ok to rename at this stage? I think the name is a bit confusing. I'll probably call it value_match or value_regex_match, or match_value.
@bmarjanovic GDK supports managing multiple cells. I am now running 3 GitLab instances, one from top-level, and 2 from cells. However running gdk reconfigure doesn't make http router nor topology service aware of the 2 GitLab instances from cells, so they're effectively idling.
My understanding is that of course that is the case, since we're now implementing session prefix, and that's one of the prerequisites to utilize more than one cells after all.
This sounds like it means we need to make project id globally unique across cells as well, and essentially any resources that we use a global id to look up via API, which doesn't feel like a good design to me, especially if we try to make this id serial and incremental.
In initial iteration we do very simple design. We assign a big range of sequences for a Cell, that is guaranteed to be monotonic within this Cell.
Yes, we want all IDs to be unique across cluster so we can easily move them around and to not rewrite the IDs. The problem with IDs is that they are often exposed to user, over API, so if we were meant to rewrite them, we are likely to brake some access patterns that users do.
Correct, we should update the config.toml inside the topology-service to be aware of the multiple Cells. You can take a look into this, to get better understanding.
@ayufan This is so small that it's not readable by itself It's ok, if I need I can find the source of the diagram.
At the time of writing, the largest ID is 11,098,430,930 (primary key of security_findings table), so it’s 200 times the current largest ID, which should be (more than) sufficient.
Very interesting. How about jobs though? I would guess it doesn't really fit. That said, for API or web UI, it's scoped to the project: https://docs.gitlab.com/ee/api/jobs.html#get-a-single-job so maybe it doesn't matter to collide globally after all, unless, maybe we won't be able to move organizations into the same cell if jobs ids are not globally unique? Or could we make jobs use composite primary key, with the old primary id and the project id? That should be unique globally.
Correct, we should update the config.toml inside the topology-service to be aware of the multiple Cells. You can take a look into this, to get better understanding.
That design is protective in a sense that don't have full understanding if the composite IDs, or keys collisions can introduce-unsolvable-problem/or-significant-breakage-to-application to us. We do not know that. The reason for this is that, we can rather guarantee cluster-wide uniqueness, and if this becomes limiting factor we can relax this case-by-case.
I'm simply worried about a system where we deliberately introduce key collisions, and cannot authoritatively make decision which record is valid.
Second reason is that moving organizations between Cells becomes significantly harder if we allow collisions and potentially a big security risk:
all IDs needs to be remapped
this brings a lot complexity on Org Mover to rewrite all places (including ones that do not have FKs/LFKs)
us missing some rewrite can introduce nasty cross-linking of data
makes it significantly harder to perform data validation, or make Org Mover process retryable/interruptable
Also: adding composite Primary Key in all places has been a nightmare when looking at the CI partitioning effort Edit: hard and long effort.
I would guess only jobs id could be exhausted though.
I'm not afraid of it, since:
Cells are meant to be 1%-10% of the current Cell 1.
We can always change and allocate higher bits to give more headroom. We intentionally do not allocate all range to give us a way to change to ULID or other format if this is required later.
@sxuereb in this case, only about dev environment (GDK)
I am mentioning this, because the matching rules will be the same in all environments, so just to make sure that also on GDK we generate the correct prefixes as well. With the same format.
Do we need a new Classify type, or is it OK to re-use SessionPrefix ?
SessionPrefix would work since the logic for checking value of the cookie with a specified name is the same. (Provided none of the rules above this rule are matched)
@godfat-gitlab Potentially, this is something you can look at.
If you need help, or want to ask questions, @bmarjanovic and @manojmj can help. I think there previously was a draft MR, but we also need to validate the routing actually does happen to the 2nd (local GDK) cell.
It's not really a draft MR, we introduced Session prefixed based routing in the same MR that introduced multiple types of routing rules (gitlab-org/cells/http-router!122 (merged)), but it was removed within the same MR as we didn't want to ship Session prefix based routing at that point in time. You can see the commit that removed that code associated to Session prefix based routing here: gitlab-org/cells/http-router@92e89766
I introduced a new abstraction Action so that we can follow object-oriented programming patterns, instead of switch case all over the places: gitlab-org/cells/http-router!218 (merged)
This should make adding features much more easier.
Next I would want to change src/rules/index.ts so that we instantiate the right instance from the very beginning, either ClassifyRule or ProxyRule, so that we can move the only switch case there.
If we don't do this, we might be forced to use another switch case in findMatchingRule which is not ideal.
Whoops! Thanks for bringing that up. It looks like it would introduce quite some conflicts. Conceptually it could be resolved, but it'll require some changes.
And another follow up to construct the concrete object for Rule (it was named Action but renamed to Rule in this merge request): gitlab-org/cells/http-router!230 (merged)
After above are all merged, we can look into adding match method into the Rule object. All the prerequisites are used to make sure we only need a single switch case to construct the concrete objects, and the rest can just follow object-oriented programming that the rule object itself can decide what it should do.
When this rule is matched, the router will return its version number. We also need to make sure that Rails developers are aware that this route is claimed by http-router so it will never reach to Rails when http-router is on the front.
It's on top of gitlab-org/cells/http-router!232 (merged) and still a draft that I just made it compile without errors. I should be able to complete it tomorrow and ask for review.
After gitlab-org/cells/http-router!232 (merged) adding a new version action/rule should be very simple. We just need to create a new VersionRule class and override match and fetch
I believe there are more state leaks we need to fix, and I would like to get gitlab-org/cells/http-router!233 (merged) merged first before fixing other tests and adding other tests, as I don't want to block gitlab-org/cells/http-router!223 (merged) any longer and would like to iterate with shipping this new feature first before I get OOO (out of office) starting this Friday.
Not confident that I can complete all of above before going on vacation (I also have other things to do ), but I hope I could at least reach to getting gitlab-org/cells/http-router!233 (merged) merged.
I think it's clear that I can't get all of above done before I go on vacation, but I would hope I could at least get gitlab-org/cells/http-router!233 (merged) merged.
@tkuah I am now trying to write tests for gitlab-org/cells/http-router!233 (merged) and thinking that if force-cell should take priority over _gitlab_session. That is, when force-cell is presented, we classify for that, and ignore _gitlab_session.
Does that make sense? If so, we should probably reverse the order in JSON so that we always match from top to bottom.
@godfat-gitlab Interesting question. I was thinking force-cell is lower - it's only a means to allow someone to get to the login page on the cell.
There may be situations where we want it to have the highest priority, for debugging ? I think that's more debug-cell, or we just use the internal DNS for that cell instead.
@tkuah If it's lower, how do people switch to the other cell once they previously landed on another cell? Do we clear _gitlab_session when logging out, and don't write to it if the user is not logged in?
I am imagining that we'll use force-cell where appropriate, and once landed on a particular cell, the application will remove it, so next it'll follow with _gitlab_session, logging in or not.
I am imagining that we'll use force-cell where appropriate, and once landed on a particular cell, the application will remove it, so next it'll follow with _gitlab_session, logging in or not.
it('should return cached response if available',async ()=>{constcachesMock={default:{match:vi.fn(),},};caches.default.match=cachesMock.default.match;constcachedResponse=newResponse(JSON.stringify({action:'PROXY',proxy:{address:'cached.example.com'}}),{headers:{'Content-Type':'application/json'},});cachesMock.default.match.mockResolvedValue(cachedResponse);fetchMock.get('http://cached.example.com').intercept({path:'/',headers:{'x-forwarded-host':'example.com'}}).reply(200,'response from cell1');constcachesDefaultMatchSpy=vi.spyOn(caches.default,'match');constcachesDefaultPutSpy=vi.spyOn(caches.default,'put');constresponse=awaitworkerFetchURL('http://example.com')expect(awaitresponse.text()).toBe('response from cell1');expect(cachesDefaultMatchSpy).toHaveBeenCalledTimes(1);expect(cachesDefaultPutSpy).not.toBeCalled();});
If I run it with my other tests, it'll show this error which I have a hard time understanding what it means:
FAIL test/index.spec.ts > when rules are configured > with `session_prefix` rule > when force-cell cookie is matched > when value matched the format > sets correct headersTypeError: Body has already been used. It can only be used once. Use tee() first if you need to read it twice. ❯ Module.classifyFetch src/topology_service/classify.ts:25:19 23| } 24| 25| return response.json() as Promise<ClassifyResponse>; | ^ 26| } 27| ❯ ClassifyRule.fetch src/rules/classify.ts:21:22 ❯ workerFetchRequest test/index.spec.ts:62:20 ❯ test/index.spec.ts:389:28
@bmarjanovic@godfat-gitlab Can you please make a demo video about this ? Being able to route to a 2nd cell on GDK is amazing, and it is worth sharing this exciting achievement.
@bmarjanovic Yeah, I have no calls tomorrow. How would you like to proceed? I was thinking something like we did in the call, running here and there, and record it.
Amazing @godfat-gitlab, I'll schedule a call for tomorrow then.
The idea is to enable multiple Cells on the GDK and apply session_prefix rules. Based on the cookie values, we can then demonstrate how the new rule functions. WDYT?
For the http router, GITLAB_TOPOLOGY_SERVICE_URL is set to http://localhost:9096 which is the address of the topology service, and GITLAB_RULES_CONFIG is set to session_prefix.
I didn't set GITLAB_PROXY_HOST and I changed the last rule in src/rules/session_prefix.json from proxy to classify to FIRST_CELL so that I don't need to set a proxy
With all the above setting, when I visit http://localhost:3000 which is the address of the http router, I can see that the first cell is visited by looking at gdk cells cell-1 tail workhorse.
When I add a cookie force-cell=cell-2, I can see that the other cell is visited instead by looking at gdk cells cell-2 tail workhorse. When I remove it, I can see it goes back to cell-1.
Next I probably need to configure actual session prefix in cell-1 and cell-2 to verify looking at session prefix does work.
Really cool, I could reproduce this locally! If this can help, here is a more complete list of config to set:
gdk cells cell-1 config set gitlab.rails.session_store.session_cookie_token_prefix cell-1-gdk cells cell-1 config set unique_cookie_key_postfix falsegdk cells cell-1 reconfiguregdk cells cell-1 restartgdk cells cell-2 config set gitlab.rails.session_store.session_cookie_token_prefix cell-2-gdk cells cell-2 config set unique_cookie_key_postfix falsegdk cells cell-2 reconfiguregdk cells cell-2 restart# allow to use the default '3000' port for the HTTP router (change 3001 to something else if you're already using 3001)gdk config set port 3001gdk config set gitlab_http_router.port 3000gdk reconfiguregdk restart
# allow to use the default '3000' port for the HTTP router (change 3001 to something else if you're already using 3001)gdk config set port 3001gdk config set gitlab_http_router.port 3000