Add circuit breaker logic to control the number of outbound requests to Topology Service
What does this MR do and why?
This MR implements circuit breaker logic to ensure that, at any given time, Rails has no more than N active requests to the Topology Service (TS).
The concurrent request limit is configurable as an application setting and can be increased or decreased based on the Cell size and the number of pooled connections available in the Cell.
The primary reason for the circuit breaker is that every request to claim a resource is made inside a DB transaction. We want to make sure that a bug in the TS or the claiming framework cannot use up all the connections in the application's connection pool and cause the application to choke.
I set DEFAULT_LIMIT to 200, which is well under what we can currently support (290): gitlab-com/gl-infra/tenant-scale/cells-infrastructure/team#488 (comment 2773726223)
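To illustrate the fail-fast behavior described above, here is a minimal Ruby sketch of a concurrency limiter. It is not the code in this MR: the class name `ConcurrencyLimiter`, the `with_slot` method, and the in-process Mutex-guarded counter are all assumptions made for illustration. The actual implementation may track active requests differently, for example across processes.

```ruby
# Hypothetical sketch: names and structure are illustrative, not taken from the MR.
class ConcurrencyLimiter
  LimitExceeded = Class.new(StandardError)

  DEFAULT_LIMIT = 200 # mirrors the default mentioned in this description

  attr_reader :limit

  def initialize(limit: DEFAULT_LIMIT)
    @limit = limit
    @active = 0
    @mutex = Mutex.new
  end

  # Number of calls currently in flight.
  def active
    @mutex.synchronize { @active }
  end

  # Runs the block only if fewer than `limit` calls are in flight.
  # Otherwise it raises immediately instead of queueing, so a caller
  # that holds a DB transaction is released quickly.
  def with_slot
    acquire!
    begin
      yield
    ensure
      release
    end
  end

  private

  def acquire!
    @mutex.synchronize do
      raise LimitExceeded, "concurrency limit (#{@limit}) exceeded" if @active >= @limit

      @active += 1
    end
  end

  def release
    @mutex.synchronize { @active -= 1 }
  end
end
```

A caller would wrap each outbound call, for example `limiter.with_slot { client.begin_update(request) }`, where `client` and `begin_update` are placeholders. Rejecting instead of waiting is the point of the design: a waiting request would keep its DB connection checked out, which is the situation the limit is meant to prevent.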
References
- gitlab-com/gl-infra/tenant-scale/cells-infrastructure/team#517
- https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/cells/decisions/021_claims_in_database_transaction/
- gitlab-com/gl-infra/tenant-scale/cells-infrastructure/team#488 (comment 2773726223)
How to set up and validate locally
- Configure GDK as a Cell
- In the Rails console, enable the claiming and TS concurrency limit feature flags:
Feature.enable(:cells_unique_claims)
Feature.enable(:topology_service_concurrency_limit)
- Set the concurrency limit to 1. Because the local TS responds very quickly, the minimum value is needed to reproduce the error:
Gitlab::CurrentSettings.update!(topology_service_concurrency_limit: 1)
- Use the k6 script below to generate load:
k6 run ./k6.js
Warning
The script creates a lot of groups in the GDK.
k6.js
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '10s', target: 20 }, // Ramp up to 20 concurrent users
    { duration: '20s', target: 20 }, // Stay at 20 concurrent users
    { duration: '5s', target: 0 }, // Ramp down to 0
  ],
};

export default function () {
  const token = 'YOUR_API_TOKEN'; // Needs the api scope to allow creating groups
  const groupName = `test-group-${Date.now()}-${Math.random()}`;
  const payload = JSON.stringify({
    name: groupName,
    path: groupName,
    visibility: 'private',
    description: 'Test group for concurrency limit',
  });
  const params = {
    headers: {
      'PRIVATE-TOKEN': token,
      'Content-Type': 'application/json',
    },
  };

  const response = http.post('http://gdk.test:3000/api/v4/groups', payload, params);

  // A 500 response is expected when the concurrency limit rejects the request
  check(response, {
    'status is 201 or 500': (r) => r.status === 201 || r.status === 500,
    'status is 201 (success)': (r) => r.status === 201,
    'status is 500 (rejected)': (r) => r.status === 500,
  });
}
- Verify that the logs are flooded with the GRPC::ResourceExhausted error:
cat log/development.log | rg ResourceExhausted
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):
GRPC::ResourceExhausted (8:Topology Service concurrency limit exceeded for gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate):
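If a count per RPC is more useful than scrolling through the raw output, the log lines above can be tallied with a small Ruby helper. This is a sketch only: the `count_rejections` helper and the `REJECTION` pattern are hypothetical and assume the log format shown above.

```ruby
# Hypothetical helper: tallies concurrency-limit rejections per RPC method.
# The pattern assumes the log line format shown in this description.
REJECTION = /GRPC::ResourceExhausted \(8:Topology Service concurrency limit exceeded for ([^)]+)\)/

def count_rejections(lines)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    # scrub replaces invalid bytes so a stray binary log line cannot break matching
    match = REJECTION.match(line.scrub)
    counts[match[1]] += 1 if match
  end
end
```

For example, `count_rejections(File.foreach('log/development.log'))` returns a hash that maps each rejected RPC (such as `gitlab.cells.topology_service.claims.v1.ClaimService/BeginUpdate`) to the number of times it was rejected.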
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.