Add circuit breaker logic to control the number of outbound requests to Topology Service

What does this MR do and why?

This MR adds circuit breaker logic so that, at any given time, Rails has no more than N active requests to the Topology Service (TS).

The concurrent request limit is configurable as an application setting and can be raised or lowered based on the Cell size and the number of connections available in the Cell's connection pool.

The primary reason for the circuit breaker: every request to claim a resource is made inside a DB transaction, so a bug in the TS or the claiming framework could otherwise tie up the application's entire connection pool and choke the application.

I set DEFAULT_LIMIT to 200, which is well under what we can currently support (290): gitlab-com/gl-infra/tenant-scale/cells-infrastructure/team#488 (comment 2773726223)
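
For illustration only, the fail-fast behaviour described above can be sketched in plain Ruby. This is not the code in this MR: the class name ConcurrencyLimiter, its LimitExceeded error, and the in-process Mutex counter are all assumptions made for the example, and a real limiter shared across Rails processes would need a shared counter rather than process-local state.

```ruby
# Hypothetical in-process limiter; the names here are illustrative and not
# taken from the MR.
class ConcurrencyLimiter
  LimitExceeded = Class.new(StandardError)

  def initialize(limit)
    @limit = limit
    @active = 0
    @mutex = Mutex.new
  end

  # Runs the block only if fewer than `limit` calls are already in flight.
  # Otherwise it raises immediately instead of queueing, so the caller
  # (and the DB transaction it sits in) is released quickly.
  def call
    acquire!
    begin
      yield
    ensure
      release
    end
  end

  # Number of calls currently in flight.
  def active
    @mutex.synchronize { @active }
  end

  private

  def acquire!
    @mutex.synchronize do
      raise LimitExceeded, "concurrency limit (#{@limit}) exceeded" if @active >= @limit

      @active += 1
    end
  end

  def release
    @mutex.synchronize { @active -= 1 }
  end
end
```

With a limit of 1, a second call made while the first is still running is rejected, which is the same situation the local validation steps provoke by lowering the setting to 1.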

References

How to set up and validate locally

  • Configure GDK as a Cell
  • In the Rails console, enable the claiming and TS limit feature flags:
Feature.enable(:cells_unique_claims)

Feature.enable(:topology_service_concurrency_limit)
  • Set the concurrency limit to 1: Gitlab::CurrentSettings.update!(topology_service_concurrency_limit: 1) (the local TS responds very quickly, so the limit must be at its minimum value to reproduce the error).
  • Use the k6 script below to generate load:

k6 run ./k6.js

Warning

The script creates a lot of groups in the GDK.

k6.js
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '10s', target: 20 }, // Ramp up to 20 concurrent users
    { duration: '20s', target: 20 }, // Stay at 20 concurrent users
    { duration: '5s', target: 0 },   // Ramp down to 0
  ],
};

export default function () {
  const token = 'YOUR_API_TOKEN'; // token needs the api scope to be allowed to create groups
  const groupName = `test-group-${Date.now()}-${Math.random()}`;

  const payload = JSON.stringify({
    name: groupName,
    path: groupName,
    visibility: 'private',
    description: 'Test group for concurrency limit'
  });

  const params = {
    headers: {
      'PRIVATE-TOKEN': token,
      'Content-Type': 'application/json',
    },
  };

  const response = http.post('http://gdk.test:3000/api/v4/groups', payload, params);

  check(response, {
    'status is 201 or 500': (r) => r.status === 201 || r.status === 500,
    'status is 201 (success)': (r) => r.status === 201,
    'status is 500 (rejected)': (r) => r.status === 500,
  });
}
  • Verify that the logs are flooded with GRPC::ResourceExhausted errors:
rg ResourceExhausted log/development.log
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Tarun Khandelwal
