Skip to content

Fix the bug that DB multi-request stickiness doesn't stick as expected

What does this MR do?

Fix the bug mentioned in gitlab-com/gl-infra/scalability#1006 (closed). In summary, Our DB Load Balancing layer has a special sticking mechanism spreading across requests. This mechanism ensures the read consistency caused by replicas' replication lag. This mechanism forces the next requests stick to the primary by storing the last write location of postgres into a redis key determined by a pair [namespace, id]. This is how to register a sticky object:

        def current_job
          id = params[:id]

          if id
            ::Gitlab::Database::LoadBalancing::RackMiddleware
              .stick_or_unstick(env, :build, id)
          end

          super
        end

Our current implementation is storing just one latest namespace and id:

        def self.stick_or_unstick(env, namespace, id)
          return unless LoadBalancing.enable?

          Sticking.unstick_or_continue_sticking(namespace, id)

          env[STICK_OBJECT] = [namespace, id]
        end

That makes multi-request stickiness doesn't stick as expected if an endpoint touches multiple objects. This MR is to store all namespaces and ids ever recorded into a set. The next requests iterate and check for all the write locations instead of the latest one. In detail:

  • After a request is done, the write locations are written into multiple keys determined by recorded namespaces and ids. As the write location is fetched from Postgres at this stage, all of the locations are all equal or greater than the write location when the actual writes are performed, hence if the following requests depend on any recorded locations, the consistency is guaranteed.
  • Before a request, the middleware checks if any of the replica catches up with the recorded write location from the watching stick objects. Therefore some cases to consider:
    • If the request doesn't watch any recorded stick object, it means the request is irrelevant to the previous ones, we can use any replica.
    • If the request watches just a subset of the recorded stick objects, the loop in unstick_or_continue_sticking forces the session to stick to the primary if any of them doesn't catch up. It may be defensive, but safe for the consistency.

In conclusion, the read consistency is strengthen after this change.

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by Quang-Minh Nguyen

Merge request reports