Fix database load balancer for newly created OauthAccessToken (!211007)

What does this MR do and why?

On GitLab.com we have an architecture where there is a primary database and multiple read-only replicas. The replicas have some small amount of replication lag. We can send read-only queries to the replicas to reduce traffic to the primary and we do this whenever it is safe.

Today our database load balancing logic relies on a "sticking" object. Usually, when a user performs a write to the database, we "stick" that user to the primary. This generally works for users authenticated with cookies, because subsequent requests include the user_id and we can look up exactly what their "sticking" state is.
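
The sticking behaviour described above can be sketched roughly as follows. This is a minimal illustration and not GitLab's actual implementation: `StickingStore` and `choose_host` are hypothetical names, the store is an in-memory Hash standing in for Redis, and LSNs are modelled as plain integers rather than real PostgreSQL WAL positions.

```ruby
# Hypothetical in-memory stand-in for the sticking store (not GitLab's classes).
class StickingStore
  def initialize
    @write_locations = {}
  end

  # Record that `namespace/id` performed a write at WAL position `lsn`.
  def stick(namespace, id, lsn)
    @write_locations[key(namespace, id)] = lsn
  end

  # Returns the recorded write position, or nil if nothing is stuck.
  def write_location(namespace, id)
    @write_locations[key(namespace, id)]
  end

  def unstick(namespace, id)
    @write_locations.delete(key(namespace, id))
  end

  private

  def key(namespace, id)
    "#{namespace}/#{id}"
  end
end

# Decide where a read may go. LSNs are plain integers in this sketch.
def choose_host(store, namespace, id, replica_lsn)
  stuck_at = store.write_location(namespace, id)
  return :replica if stuck_at.nil?           # no recent write: a replica is safe
  return :primary if replica_lsn < stuck_at  # replica has not replayed the write yet

  store.unstick(namespace, id)               # replica caught up: sticking no longer needed
  :replica
end
```

The lookup only works when the request already carries an identifier to look up (here `:user` and an id), which is the assumption that breaks for a freshly created OAuth token.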

But it does not work for newly generated OAuth tokens. As discovered in #579054 (comment 2857344904), we don't actually know which user the OAuth token belongs to until we load the OAuth token from the database. If we load the token from a replica that has not yet replicated it, the token appears not to exist.

Our load balancer already has to deal with similar edge cases elsewhere, as in !136422 (merged). The way we do this is by introducing a new sticking object. In this case we take a secure non-reversible hash of the OAuth token and store that as the sticking key in Redis. When a later request arrives with an OAuth token, we look it up in Redis to see whether it has a sticking session. If so, we use the primary (or a replica that has caught up to this specific sticking LSN position).
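
As a rough sketch of the idea, assuming SHA-256 as the hash (the description only says "secure non-reversible hash") and using a plain Hash in place of Redis, the token-based lookup might look like the following. The helper names `oauth_sticking_key`, `stick_token` and `host_for_token` are hypothetical and are not taken from the MR.

```ruby
require 'digest'

# Hypothetical helper: derive a sticking key from the raw OAuth token.
# SHA-256 is an assumption made for this sketch.
def oauth_sticking_key(raw_token)
  "oauth_token/#{Digest::SHA256.hexdigest(raw_token)}"
end

# `store` is a plain Hash standing in for Redis.
# Called right after the token is created on the primary.
def stick_token(store, raw_token, lsn)
  store[oauth_sticking_key(raw_token)] = lsn
end

# Called at the start of a request, before the token row (and therefore
# the user) has been loaded from any database.
def host_for_token(store, raw_token, replica_lsn)
  stuck_at = store[oauth_sticking_key(raw_token)]
  return :replica if stuck_at.nil? || replica_lsn >= stuck_at

  :primary
end
```

Because the key is derived from the token value itself, the lookup needs no database read, and the raw token is never written to the store.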

References

Screenshots or screen recordings

How to set up and validate locally

1. Set up your GDK with a replica by following https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md
2. Configure 1 minute of replication lag as described in https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md#simulating-replication-delay
3. Update your `config/database.yml` so that only the replica database is in the replica pools (i.e. remove the primary). The `load_balancing` section for both `main` and `ci` should look like:
   ```
   load_balancing:
     hosts:
       - /Users/<myuser>/workspace/gitlab-development-kit/postgresql-replica
   ```
4. Run `gdk restart`
5. Use Duo Agentic Chat

Without this fix:

1. It always fails
2. Look at the workhorse logs and observe the 401s
3. Look at the Duo Workflow Service logs and observe the 401s

With this fix:

It works
You can also see the sticking sessions in gdk redis-cli like:

```
redis /Users/dylangriffith/workspace/gdk/redis/redis.socket> keys database-load-balancing/*
1) "database-load-balancing/write-location/main/oauth_token/fe7bb732cb70f7c21a6bb8b169c2ff78d0676123b1cbf69868cac9449151b873"
2) "database-load-balancing/write-location/main/user/1"
3) "database-load-balancing/write-location/main/oauth_token/d12a9284a7865e617630a7d246371d12d5ebcf99665965950076dfe4e12a2821"
```
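
When checking these keys during validation, it can help to split them into their parts. The sketch below assumes the layout `database-load-balancing/write-location/<database>/<namespace>/<id>`, which is inferred only from the example output above; `parse_sticking_key` is a hypothetical helper, not part of GitLab.

```ruby
# Split a sticking key of the shape shown above into its parts.
# The layout is inferred from the example output, not from GitLab's source.
def parse_sticking_key(key)
  prefix, kind, database, namespace, id = key.split('/', 5)
  return nil unless prefix == 'database-load-balancing' && kind == 'write-location'
  return nil if id.nil?

  { database: database, namespace: namespace, id: id }
end

keys = [
  'database-load-balancing/write-location/main/oauth_token/fe7bb732cb70f7c21a6bb8b169c2ff78d0676123b1cbf69868cac9449151b873',
  'database-load-balancing/write-location/main/user/1'
]

# Count sticking sessions per namespace.
counts = keys.map { |k| parse_sticking_key(k) }.compact.group_by { |p| p[:namespace] }
             .transform_values(&:size)
```

Seeing at least one `oauth_token` entry after using Duo Agentic Chat suggests that the new sticking object is being written.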

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #579054

Edited Oct 31, 2025 by Dylan Griffith
