Lightz-AIO | Improve rules splitting
Background
In https://gitlab.com/gitlab-org/security-products/oxeye/product/lightz-aio/-/merge_requests/260, we implemented an algorithm to balance the scanning workload based on previously cached execution statistics. The initial plan, outlined in issue #520449 (Lightz-AIO | Implement rules splitting based on..., closed), relied on a typical greedy algorithm, which in theory should yield near-optimal results.
Problem
In practice, however, workload balancing does not appear to depend solely on the scanning time of each rule. Even when the second run distributes rules across CPU cores based on the profiled durations from a previous run, we observe uneven workload distribution. This suggests that hidden overheads are introduced when the rule groupings change, resulting in suboptimal balancing.
To mitigate this, a workaround has been implemented: a fixed overhead of 1 second per rule is empirically added during allocation.
See lightz.go:L352–357.
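For context, the allocation strategy with the 1s workaround can be sketched as a longest-processing-time greedy: sort rules by profiled duration and always hand the next rule to the least-loaded core, counting each rule's cost as its duration plus the fixed overhead. This is a minimal illustration, not the actual code in lightz.go; the function names and the `perRuleOverhead` constant are assumptions for the sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// perRuleOverhead is the empirically chosen fixed cost added to every
// rule during allocation (hypothetical constant for this sketch; the
// real value lives in lightz.go:L352–357).
const perRuleOverhead = 1.0 // seconds

// splitRules assigns rules (profiled durations in seconds) to cores
// using longest-processing-time greedy: sort descending, then always
// give the next rule to the currently least-loaded core, where a
// rule's cost is its duration plus the fixed per-rule overhead.
func splitRules(durations []float64, cores int) [][]float64 {
	sorted := append([]float64(nil), durations...)
	sort.Sort(sort.Reverse(sort.Float64Slice(sorted)))

	groups := make([][]float64, cores)
	loads := make([]float64, cores)
	for _, d := range sorted {
		// Pick the core with the smallest accumulated load so far.
		min := 0
		for i := 1; i < cores; i++ {
			if loads[i] < loads[min] {
				min = i
			}
		}
		groups[min] = append(groups[min], d)
		loads[min] += d + perRuleOverhead // overhead-adjusted cost
	}
	return groups
}

func main() {
	// 1 rule at 100s plus 999 rules at 0.1s, as in the illustration below.
	durations := []float64{100}
	for i := 0; i < 999; i++ {
		durations = append(durations, 0.1)
	}
	groups := splitRules(durations, 2)
	for i, g := range groups {
		var total float64
		for _, d := range g {
			total += d + perRuleOverhead
		}
		fmt.Printf("core %d: %d rules, %.1fs\n", i+1, len(g), total)
	}
}
```

With the overhead folded into each rule's cost, the greedy assignment lands the two cores within roughly one small-rule cost of each other, instead of the extreme skew the unadjusted durations produce.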
Illustration
Consider the case of 1000 rules and 2 CPU cores:
- 1 rule takes 100s
- 999 rules take 0.1s each
- Hidden overhead is estimated at 1s per rule
Without overhead adjustment (naive greedy):
- Core 1: 1 rule → 100 + 1 = 101s
- Core 2: 999 rules → (0.1 + 1) * 999 = 1098.9s
- Result: extremely imbalanced workload
With overhead-adjusted suboptimal algorithm:
- Core 1: 100s rule + 499 other rules → (100 + 1) + 499 * (0.1 + 1) = 649.9s
- Core 2: remaining 500 rules → 500 * (0.1 + 1) = 550s
- Result: not perfectly balanced, but much better than the naive greedy strategy
The issue is that we do not yet understand the source or pattern of the hidden overhead, and therefore cannot model it precisely.
Previous discussion
Objective
Investigate the source of the hidden overhead that invalidates the assumptions of the greedy rule-splitting algorithm and explore more accurate workload distribution strategies.