Handle config concurrency deadlock with warnings and documentation

What does this MR do?

Closes: Long Polling from GitLab Runners not correctly ... (gitlab#331460 - closed)

This "resolves" the issue through warnings and documentation. The real solution would be re-engineering the job handling and requesting from the ground up. I'll create a new issue for that but don't see it happening any time soon. This is a pretty niche configuration permutation and docs + warnings should solve most situations.

Summary

Implements a long polling deadlock detection and warning system for GitLab Runner to prevent job processing delays when GitLab CI long polling is enabled.

Problem

GitLab Runner can experience deadlock scenarios where workers get stuck in long polling requests (matching the GitLab Workhorse -apiCiLongPollingDuration setting, default 50s), causing job delays. This happens when:

  • Workers are blocked waiting for jobs that can't be processed due to configuration constraints
  • Long polling timeout prevents timely job processing for other runners (see the sketch after this list)
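
The following is a minimal Go sketch of that blocking behavior, not the Runner's actual worker code: the requestJob helper and the jobs endpoint URL are illustrative, and the timeout mirrors the Workhorse long polling window described above.

// Minimal sketch: a job request parked in a long poll blocks its worker slot.
// requestJob and the endpoint are illustrative, not GitLab Runner internals.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// requestJob models one job request against a GitLab instance with long
// polling enabled: when no job is available, Workhorse can hold the
// connection open for up to apiCiLongPollingDuration (default 50s).
func requestJob(ctx context.Context, jobsEndpoint string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, jobsEndpoint, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req) // may block for the full long-poll window
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Println("job request returned:", resp.Status)
	return nil
}

func main() {
	// With request_concurrency = 1 there is only one such request in flight
	// per runner entry; while it is parked in the long poll, the worker slot
	// it occupies cannot pick up jobs for any other runner entry.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	_ = requestJob(ctx, "https://gitlab.example.com/api/v4/jobs/request")
}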

Solution

Added proactive deadlock detection that analyzes runner configuration and warns about problematic scenarios with tailored solutions.

Changes

Core Implementation

  • Detection scenarios:
    1. Worker starvation: concurrent < number of runners
    2. Request bottleneck: Runners with request_concurrency=1 blocking workers
    3. Build limit saturation: Low limits (≤2) + request_concurrency=1
  • Tailored solutions: Dynamic solution generation based on detected issues
  • Memory efficient: Uses sync.Once to prevent warning spam (see the sketch below)
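
A minimal Go sketch of this detection approach follows; the config and runnerConfig types, field names, and the detectIssues/warnLongPollingDeadlock helpers are illustrative and do not match the Runner's actual config structs.

// Minimal sketch of the detection approach; types and names are illustrative.
package deadlock

import (
	"fmt"
	"strings"
	"sync"
)

type runnerConfig struct {
	Name               string
	Limit              int
	RequestConcurrency int
}

type config struct {
	Concurrent int
	Runners    []runnerConfig
}

// detectIssues checks the three scenarios listed above and returns one
// human-readable issue per detected problem.
func detectIssues(cfg config) []string {
	var issues []string

	// Scenario 1: worker starvation - fewer workers than runner entries.
	if cfg.Concurrent < len(cfg.Runners) {
		issues = append(issues, fmt.Sprintf(
			"Worker starvation: 'concurrent' (%d) is less than number of runners (%d)",
			cfg.Concurrent, len(cfg.Runners)))
	}

	// Scenarios 2 and 3: request bottleneck and build limit saturation.
	bottlenecked, saturated := 0, 0
	for _, r := range cfg.Runners {
		if r.RequestConcurrency <= 1 { // unset defaults to 1
			bottlenecked++
			if r.Limit > 0 && r.Limit <= 2 {
				saturated++
			}
		}
	}
	if bottlenecked > 0 {
		issues = append(issues, fmt.Sprintf(
			"Request bottleneck: %d runners have request_concurrency=1, which can block workers during long polling",
			bottlenecked))
	}
	if saturated > 0 {
		issues = append(issues, fmt.Sprintf(
			"Build limit saturation: %d runners combine limit<=2 with request_concurrency=1",
			saturated))
	}
	return issues
}

var warnOnce sync.Once // the warning is emitted at most once per process

func warnLongPollingDeadlock(cfg config) {
	issues := detectIssues(cfg)
	if len(issues) == 0 {
		return // healthy configuration, stay quiet
	}
	warnOnce.Do(func() {
		fmt.Printf("CONFIGURATION: Long polling deadlock risk detected.\nIssues found:\n  - %s\n",
			strings.Join(issues, "\n  - "))
	})
}

In this sketch, warnLongPollingDeadlock would be called right after the configuration is loaded or reloaded, so the analysis runs once per config rather than per job request.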

Comprehensive Testing

  • Tests all scenarios: Worker starvation, request bottleneck, build limit saturation
  • Tests healthy configurations: Verifies no false positives
  • Tests multiple concurrent scenarios: Complex configuration validation (see the table-driven sketch below)
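
A table-driven sketch of those tests is shown below, reusing the illustrative config types and detectIssues helper from the detection sketch above; it is not the MR's actual test code.

// Table-driven test sketch reusing the illustrative types from the detection
// sketch above; expectations count how many distinct issues are reported.
package deadlock

import "testing"

func TestDetectIssues(t *testing.T) {
	tests := []struct {
		name string
		cfg  config
		want int // number of distinct issues expected
	}{
		{
			name: "worker starvation",
			cfg: config{Concurrent: 2, Runners: []runnerConfig{
				{RequestConcurrency: 3}, {RequestConcurrency: 3}, {RequestConcurrency: 3},
			}},
			want: 1,
		},
		{
			name: "request bottleneck",
			cfg: config{Concurrent: 4, Runners: []runnerConfig{
				{Limit: 10}, {Limit: 8}, {Limit: 5}, // request_concurrency left at default 1
			}},
			want: 1,
		},
		{
			name: "build limit saturation",
			cfg: config{Concurrent: 4, Runners: []runnerConfig{
				{Limit: 2, RequestConcurrency: 1}, {Limit: 1, RequestConcurrency: 1},
			}},
			want: 2, // bottleneck plus saturation
		},
		{
			name: "healthy configuration",
			cfg: config{Concurrent: 6, Runners: []runnerConfig{
				{RequestConcurrency: 3, Limit: 10}, {RequestConcurrency: 2, Limit: 5},
			}},
			want: 0, // no false positives
		},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			if got := detectIssues(tc.cfg); len(got) != tc.want {
				t.Errorf("detectIssues() reported %d issues (%v), want %d", len(got), got, tc.want)
			}
		})
	}
}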

Documentation

  • docs/configuration/advanced-configuration.md: Complete deadlock section with:
    • Detailed explanation of GitLab Workhorse long polling
    • Configuration examples for all problematic scenarios
    • Step-by-step solutions and best practices
  • docs/faq/_index.md: Troubleshooting section for job delays with symptoms and solutions

Configuration Examples

Scenario 1: Worker Starvation

# concurrent = 2 with 3 runners: only 2 workers are available for 3 runners
concurrent = 2

[[runners]]
name = "worker-starvation-1"
url = "https://gitlab.example.com" 
token = "glrt-EXAMPLE_TOKEN_1"
executor = "shell"
request_concurrency = 3

[[runners]]
name = "worker-starvation-2" 
url = "https://gitlab.example.com"
token = "glrt-EXAMPLE_TOKEN_2"
executor = "shell"
request_concurrency = 3

[[runners]]
name = "worker-starvation-3"
url = "https://gitlab.example.com"
token = "glrt-EXAMPLE_TOKEN_3" 
executor = "shell"
request_concurrency = 3

Scenario 2: Request Bottleneck

# All runners have request_concurrency=1 (default)
concurrent = 4

[[runners]]
name = "bottleneck-runner-1"
url = "https://gitlab.example.com"
token = "glrt-EXAMPLE_TOKEN_1"
executor = "shell"
limit = 10
# request_concurrency = 1 (default) - THIS IS THE PROBLEM

[[runners]]
name = "bottleneck-runner-2"
url = "https://gitlab.example.com" 
token = "glrt-EXAMPLE_TOKEN_2"
executor = "shell"
limit = 8
# request_concurrency = 1 (default) - THIS IS THE PROBLEM

[[runners]]
name = "bottleneck-runner-3"
url = "https://gitlab.example.com"
token = "glrt-EXAMPLE_TOKEN_3"
executor = "shell"
limit = 5
# request_concurrency = 1 (default) - THIS IS THE PROBLEM

Scenario 3: Build Limit Saturation

# Low limit settings (≤2) combined with request_concurrency=1
concurrent = 4

[[runners]]
name = "limited-runner-1"
url = "https://gitlab.example.com"
token = "glrt-EXAMPLE_TOKEN_1" 
executor = "shell"
limit = 2
request_concurrency = 1

[[runners]]
name = "limited-runner-2"
url = "https://gitlab.example.com"
token = "glrt-EXAMPLE_TOKEN_2"
executor = "shell" 
limit = 1
request_concurrency = 1

[[runners]]
name = "limited-runner-3"
url = "https://gitlab.example.com"
token = "glrt-EXAMPLE_TOKEN_3"
executor = "shell"
limit = 2
request_concurrency = 1

Healthy Configuration (No Warnings)

# concurrent >= number of runners, request_concurrency > 1, adequate limits
concurrent = 6

[[runners]]
name = "healthy-runner-1"
url = "https://gitlab.example.com"
token = "glrt-EXAMPLE_TOKEN_1"
executor = "shell"
request_concurrency = 3
limit = 10

[[runners]]
name = "healthy-runner-2"
url = "https://gitlab.example.com"
token = "glrt-EXAMPLE_TOKEN_2"
executor = "shell"
request_concurrency = 2
limit = 5

[[runners]]
name = "healthy-runner-3"
url = "https://gitlab.example.com"
token = "glrt-EXAMPLE_TOKEN_3"
executor = "shell"
request_concurrency = 4
limit = 8

Warning Message Format

CONFIGURATION: Long polling deadlock risk detected.
Issues found:
  - Worker starvation: 'concurrent' setting (2) is less than number of runners (3)
  - Request bottleneck: 2 runners have request_concurrency=1, which can block workers during long polling
This can cause job delays matching your GitLab instance long polling timeout.
Recommended solutions:
  1. Increase 'concurrent' to at least 4 (current: 2)  
  2. Increase 'request_concurrency' to 2-4 for 2 runners currently using request_concurrency=1
This message will not be printed again until the GitLab Runner process is restarted.
See documentation: https://docs.gitlab.com/runner/configuration/advanced-configuration.html#long-polling-deadlock

Why was this MR needed?

What's the best way to test this MR?

What are the relevant issue numbers?

#331460
