Fix database load balancer for newly created OauthAccessToken

What does this MR do and why?

On GitLab.com we have an architecture where there is a primary database and multiple read-only replicas. The replicas have some small amount of replication lag. We can send read-only queries to the replicas to reduce traffic to the primary and we do this whenever it is safe.

Today our database load balancing logic relies on a "sticking" object, such as a user. Usually, when a user performs a write to the database, we "stick" that user to the primary. This generally works for users that are authenticated with cookies because subsequent requests include the user_id and we can look up exactly what their "sticking" state is.

But it does not work for newly generated OAuth tokens. As discovered in #579054 (comment 2857344904), we don't know which user an OAuth token belongs to until we load the token from the database. If we load the token from a replica, it might not exist there yet because of replication lag.
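The failure mode can be illustrated with a small Python simulation. This is a hypothetical sketch, not GitLab code: `Primary`, `LaggingReplica`, and `authenticate` are invented names, and in-memory dictionaries stand in for the databases.

```python
class Primary:
    """Hypothetical stand-in for the primary database."""

    def __init__(self):
        self.tokens = {}  # raw token -> user_id

    def create_token(self, token, user_id):
        self.tokens[token] = user_id


class LaggingReplica:
    """Hypothetical read-only replica that only sees writes after replay()."""

    def __init__(self, primary):
        self._primary = primary
        self.tokens = {}

    def replay(self):
        # Simulates the replica catching up with the primary.
        self.tokens = dict(self._primary.tokens)


def authenticate(db, token):
    """Return (status, user_id); the user is unknown until the token row is read."""
    user_id = db.tokens.get(token)
    if user_id is None:
        return (401, None)
    return (200, user_id)


primary = Primary()
replica = LaggingReplica(primary)
primary.create_token("new-token", 1)

before_replay = authenticate(replica, "new-token")  # token not replicated yet
on_primary = authenticate(primary, "new-token")     # primary always has it
replica.replay()
after_replay = authenticate(replica, "new-token")   # replica has caught up
```

Because the user is only known after the token row is read, there is no user-based sticking state to consult before choosing where to read from.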

Our load balancer already has to handle similar edge cases elsewhere, as in !136422 (merged). We handle this one the same way, by introducing a new sticking object. In this case we take a secure non-reversible hash of the OAuth token and store that as the sticking key in Redis. When a later request arrives with an OAuth token, we look it up in Redis to see whether it has a sticking session. If so, we use the primary (or a replica that has caught up to the LSN position recorded for this sticking session).
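A minimal Python sketch of the idea follows. It is illustrative only and is not the code in this MR, which lives in the GitLab Rails application. The assumptions are: SHA-256 as the hash (this MR does not name the algorithm, although the 64-hex-character keys shown in the validation steps are consistent with it), an in-memory dictionary standing in for Redis, a key prefix copied from the `redis-cli` output in the validation steps, and hypothetical names `sticking_key`, `StickingStore`, and `choose_host`.

```python
import hashlib

# Prefix copied from the redis-cli output in the validation steps.
KEY_PREFIX = "database-load-balancing/write-location/main/oauth_token/"


def sticking_key(raw_token: str) -> str:
    """Build the sticking key from a one-way hash, so the raw token is never stored."""
    digest = hashlib.sha256(raw_token.encode("utf-8")).hexdigest()  # SHA-256 is an assumption
    return KEY_PREFIX + digest


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' into a comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


class StickingStore:
    """In-memory stand-in for Redis (assumption for this sketch)."""

    def __init__(self):
        self._data = {}

    def stick(self, raw_token: str, primary_write_lsn: str) -> None:
        # Record the primary's write location at the time the token was created.
        self._data[sticking_key(raw_token)] = primary_write_lsn

    def write_location(self, raw_token: str):
        return self._data.get(sticking_key(raw_token))


def choose_host(store: StickingStore, raw_token: str, replica_replay_lsn: str) -> str:
    """Return 'replica' when it is safe to read there, otherwise 'primary'."""
    stuck_lsn = store.write_location(raw_token)
    if stuck_lsn is None:
        return "replica"  # no sticking session for this token
    if parse_lsn(replica_replay_lsn) >= parse_lsn(stuck_lsn):
        return "replica"  # replica has replayed past the recorded write
    return "primary"


store = StickingStore()
store.stick("example-token", "0/3000060")
lagging = choose_host(store, "example-token", "0/3000000")
caught_up = choose_host(store, "example-token", "0/3000060")
unknown = choose_host(store, "some-other-token", "0/1")
```

Hashing the token before using it as a key means the lookup needs no database read and no credential is stored in Redis in recoverable form.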

How to set up and validate locally

  1. Set up your GDK with a replica as described in https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md
  2. Configure 1 minute of replication lag with https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing.md#simulating-replication-delay
  3. Update your config/database.yml so that the replica pools contain only the replica database (i.e. remove the primary), so that load_balancing for both main and ci looks like:
    load_balancing:
      hosts:
        - /Users/<myuser>/workspace/gitlab-development-kit/postgresql-replica
  4. gdk restart
  5. Use Duo Agentic Chat
  6. Without this fix:
    1. It always fails
    2. Look at workhorse logs and observe the 401s
    3. Look at Duo Workflow Service logs and observe the 401s
  7. With this fix:
    1. It works
    2. You can also see the sticking sessions in gdk redis-cli like:
    redis /Users/dylangriffith/workspace/gdk/redis/redis.socket> keys database-load-balancing/*
    1) "database-load-balancing/write-location/main/oauth_token/fe7bb732cb70f7c21a6bb8b169c2ff78d0676123b1cbf69868cac9449151b873"
    2) "database-load-balancing/write-location/main/user/1"
    3) "database-load-balancing/write-location/main/oauth_token/d12a9284a7865e617630a7d246371d12d5ebcf99665965950076dfe4e12a2821"

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #579054

Edited by Dylan Griffith
