Implement Sidekiq queue re-routing in the application

What does this MR do?

Part of gitlab-com/gl-infra/scalability#1016 (closed). This MR implements the ability to re-route a worker to a desired queue based on a set of routing rules configured in gitlab.yml. Here is an example of such routing rules:

  sidekiq:
    log_format: json # (default is the original format)
    # An array of tuples indicating the rules for re-routing a worker to a
    # desirable queue before scheduling. For example:
    routing_rules:
      - ["resource_boundary=cpu", "cpu_boundary"]
      - ["feature_category=pages", null]
      - ["*", "default"]

The routing rule set is an array of tuples, each pairing a queue selector with its corresponding queue. Rules are evaluated from first to last, and as soon as we find a match for a given worker we stop processing for that worker (first match wins). If the worker doesn't match any rule, or the matching rule's queue is null, it falls back to the queue name generated from the worker name.
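The evaluation order described above can be sketched as follows. This is a minimal illustration, not the code in this MR: `WorkerRouterSketch`, `route`, and `match?` are hypothetical names, and the selector semantics (`&` as AND, `,` as OR over values, `!=` as negation) are assumptions about the existing queue selector syntax.

```ruby
# Hypothetical sketch of first-match-wins routing; names are illustrative.
class WorkerRouterSketch
  # rules: array of [selector, queue] pairs, as in the routing_rules example.
  def initialize(rules)
    @rules = rules
  end

  # attributes: hash such as { "resource_boundary" => "cpu" }.
  # generated_name: the queue name derived from the worker name (the fallback).
  def route(attributes, generated_name)
    @rules.each do |selector, queue|
      next unless match?(selector, attributes)

      # First match wins; a null (nil) queue keeps the generated name.
      return queue || generated_name
    end

    generated_name # no rule matched
  end

  private

  # Assumed semantics: every "&"-separated predicate must hold, and a
  # predicate holds when any of its comma-separated values is present.
  def match?(selector, attributes)
    return true if selector == "*"

    selector.split("&").all? do |predicate|
      key, operator, values = predicate.partition(/!?=/)
      actual = Array(attributes[key]).map(&:to_s)
      hit = (values.split(",") & actual).any?
      operator == "=" ? hit : !hit
    end
  end
end

router = WorkerRouterSketch.new([
  ["resource_boundary=cpu", "cpu_boundary"],
  ["feature_category=pages", nil],
  ["*", "default"]
])

router.route({ "resource_boundary" => "cpu" }, "example")      # => "cpu_boundary"
router.route({ "feature_category" => "pages" }, "pages_domain") # => "pages_domain"
router.route({ "feature_category" => "other" }, "example")      # => "default"
```

Because evaluation stops at the first match, a worker that is both CPU-bound and in the pages category goes to `cpu_boundary` here; reordering the rules would change that.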

The solution follows a simple approach: implement a worker router to match a worker to a queue, and set a worker's queue with sidekiq_options queue: queue_name. As the queue of a worker depends on its attributes, the queue of a worker is re-computed and updated when:

  • ApplicationWorker is included.
  • A class inherits from a worker that includes ApplicationWorker.
  • A worker's attribute changes. When declaring a typical worker, the worker attributes change around 5-7 times, which leads to the same number of redundant re-computations.

Queue computation occurs only when a worker class is first loaded. In most use cases there are only a handful of routing rules, so the redundancy is acceptable. Other alternatives are mentioned in gitlab-com/gl-infra/scalability#1016 (comment 554450076).
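The three triggers listed above can be illustrated with a self-contained stand-in. Nothing below is the real ApplicationWorker: `RoutableWorkerSketch`, `ROUTER`, `set_queue`, and the `recomputations` counter are hypothetical, the generated queue name is simplified, and a real worker would call `sidekiq_options queue: queue_name` where this sketch assigns `@queue`.

```ruby
# Hypothetical stand-in for the recompute-on-change behaviour.
module RoutableWorkerSketch
  # Toy router: CPU-bound workers go to "cpu_boundary", others keep their name.
  ROUTER = lambda do |worker|
    cpu_bound = worker.attributes[:resource_boundary] == :cpu
    cpu_bound ? "cpu_boundary" : worker.generated_queue_name
  end

  def self.included(base)
    base.extend(ClassMethods)
    base.set_queue # trigger 1: the module is included
  end

  module ClassMethods
    attr_reader :queue, :recomputations

    def inherited(subclass)
      super
      subclass.set_queue # trigger 2: a class inherits from a worker
    end

    # Subclasses start from a copy of their parent's attributes.
    def attributes
      @attributes ||= superclass.respond_to?(:attributes) ? superclass.attributes.dup : {}
    end

    def resource_boundary(value)
      attributes[:resource_boundary] = value
      set_queue # trigger 3: an attribute changes
    end

    # Simplified: drop a trailing "Worker" and snake_case the class name.
    def generated_queue_name
      name.to_s.sub(/Worker\z/, "").gsub(/([a-z])([A-Z])/, '\1_\2').downcase
    end

    def set_queue
      @recomputations = (@recomputations || 0) + 1
      @queue = ROUTER.call(self)
    end
  end
end

class ExampleWorker
  include RoutableWorkerSketch # queue computed from the worker name
  resource_boundary :cpu       # queue recomputed after the attribute change
end

class ChildWorker < ExampleWorker # queue computed again on inheritance
end
```

In this sketch `ExampleWorker` computes its queue twice (once on include, once on the attribute change), which mirrors the redundant work described above: each additional attribute declaration adds one more recomputation.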

Screenshots

I put a small debugging line in a Sidekiq client middleware to print out the scheduling queue of a job in the following scenarios:
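For readers who want to reproduce this, a debugging middleware along these lines would do. It is a sketch, not the line actually used for the screenshots: the class name `QueueDebugMiddleware` and the message format are hypothetical, while the `call(worker_class, job, queue, redis_pool)` shape follows Sidekiq's client middleware interface.

```ruby
# Hypothetical debugging middleware; name and output format are illustrative.
class QueueDebugMiddleware
  def initialize(output = $stdout)
    @output = output
  end

  # A Sidekiq client middleware receives the worker class (or its name), the
  # job payload hash, the target queue name, and the Redis pool. It must yield
  # so the rest of the chain runs and the job is pushed.
  def call(worker_class, _job, queue, _redis_pool)
    @output.puts("[routing] #{worker_class} -> #{queue}")
    yield
  end
end

# Registration, assuming the sidekiq gem is loaded:
#
#   Sidekiq.configure_client do |config|
#     config.client_middleware { |chain| chain.add QueueDebugMiddleware }
#   end
```

The output stream is injectable only so the sketch can be exercised without Sidekiq; with the default it prints to standard output.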

Case 1: No configuration set. This is the state after this MR is merged. All of the workers should be routed to the queues generated by worker names.

Screen_Shot_2021-04-22_at_13.18.16

Case 2: The rules reflect the production sharding structure, but every queue is null. All of the workers should still be routed to the queues generated from their worker names.

  sidekiq:
    routing_rules:
      - ["resource_boundary=memory", null]
      - ["feature_category=database&urgency=throttled", null]
      - ["feature_category=gitaly&urgency=throttled", null]
      - ["feature_category=global_search&urgency=throttled", null]
      - ["resource_boundary=cpu&urgency=default,low", null]
      - ["resource_boundary=cpu&urgency=high&tags!=requires_disk_io", null]
      - ["resource_boundary!=cpu&urgency=high", null]
      - ["*", null]

Screen_Shot_2021-04-22_at_13.11.54

Case 3: The fallback wildcard * is mapped to the default queue. This is one of the exit criteria of gitlab-com/gl-infra&447 (closed). All workers in catchall should use the default queue; the others should be routed to the queues generated from their worker names.

  sidekiq:
    routing_rules:
      - ["resource_boundary=memory", null]
      - ["feature_category=database&urgency=throttled", null]
      - ["feature_category=gitaly&urgency=throttled", null]
      - ["feature_category=global_search&urgency=throttled", null]
      - ["resource_boundary=cpu&urgency=default,low", null]
      - ["resource_boundary=cpu&urgency=high&tags!=requires_disk_io", null]
      - ["resource_boundary!=cpu&urgency=high", null]
      - ["*", 'default']

Screen_Shot_2021-04-22_at_13.10.46

Case 4: Gradual rollout of some of the shards. Shards that have been rolled out use their mapped queue; the others still use the queue name generated from the worker name.

  sidekiq:
    routing_rules:
      - ["resource_boundary=memory", 'memory-bound']
      - ["feature_category=database&urgency=throttled", 'database-throttled']
      - ["feature_category=gitaly&urgency=throttled", 'gitaly-throttled']
      - ["feature_category=global_search&urgency=throttled", 'elasticsearch']
      - ["resource_boundary=cpu&urgency=default,low", null]
      - ["resource_boundary=cpu&urgency=high&tags!=requires_disk_io", null]
      - ["resource_boundary!=cpu&urgency=high", null]
      - ["*", 'default']

Screen_Shot_2021-04-22_at_13.09.09
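To make the partial rollout concrete, the Case 4 rules can be loaded from YAML and applied to a few sample attribute sets. This is a sketch under assumptions: `selector_match?` and `queue_for` are hypothetical helpers, the sample attributes are invented for illustration, and the selector semantics (`&` as AND, `,` as OR over values, `!=` as negation) are assumed rather than taken from this MR.

```ruby
require "yaml"

# The Case 4 rules as a YAML loader sees them: null becomes nil.
CASE_4_RULES = YAML.safe_load(<<~RULES).dig("sidekiq", "routing_rules")
  sidekiq:
    routing_rules:
      - ["resource_boundary=memory", 'memory-bound']
      - ["feature_category=database&urgency=throttled", 'database-throttled']
      - ["feature_category=gitaly&urgency=throttled", 'gitaly-throttled']
      - ["feature_category=global_search&urgency=throttled", 'elasticsearch']
      - ["resource_boundary=cpu&urgency=default,low", null]
      - ["resource_boundary=cpu&urgency=high&tags!=requires_disk_io", null]
      - ["resource_boundary!=cpu&urgency=high", null]
      - ["*", 'default']
RULES

# Assumed selector semantics; see the note above.
def selector_match?(selector, attrs)
  return true if selector == "*"

  selector.split("&").all? do |predicate|
    key, operator, values = predicate.partition(/!?=/)
    hit = (values.split(",") & Array(attrs[key]).map(&:to_s)).any?
    operator == "=" ? hit : !hit
  end
end

# Returns the mapped queue of the first matching rule, or the generated
# name when that rule's queue is nil.
def queue_for(attrs, generated_name)
  _selector, queue = CASE_4_RULES.find { |selector, _| selector_match?(selector, attrs) }
  queue || generated_name
end

queue_for({ "resource_boundary" => "memory", "urgency" => "low" }, "sample_a")
# => "memory-bound" (shard already rolled out)
queue_for({ "resource_boundary" => "cpu", "urgency" => "high", "tags" => [] }, "sample_b")
# => "sample_b" (shard not rolled out yet, keeps the generated name)
queue_for({ "resource_boundary" => "unknown", "urgency" => "low" }, "sample_c")
# => "default" (catchall)
```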

Case 5: All of the shards are updated to use their mapped queue names.

  sidekiq:
    routing_rules:
      - ["resource_boundary=memory", 'memory-bound']
      - ["feature_category=database&urgency=throttled", 'database-throttled']
      - ["feature_category=gitaly&urgency=throttled", 'gitaly-throttled']
      - ["feature_category=global_search&urgency=throttled", 'elasticsearch']
      - ["resource_boundary=cpu&urgency=default,low", 'low-urgency-cpu-bound']
      - ["resource_boundary=cpu&urgency=high&tags!=requires_disk_io", 'urgent-cpu-bound']
      - ["resource_boundary!=cpu&urgency=high", 'urgent-other']
      - ["*", 'default']

Screen_Shot_2021-04-22_at_12.58.18

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by Quang-Minh Nguyen