Speed: Process work items in parallel (WebRunnerMachine)
Problem
The WebRunnerCheckStrategy
generates IWebRunnerWorkItem
s which are currently processed in series. To improve speed and allow taking advantage of a larger testing machine with multiple CPUs, the work items should be processed in parallel up to some reasonable max parallelism based on CPU/Core counts.
Proposal
Use the .NET Dataflow objects to process work items generated by WebRunnerCheckStrategy
in parallel. The existing 344864-threaded-runner
branch will be used as a reference only. The refactoring of WebRunnerCheckStrategy
should provide a much better foundation than existed in the prior working branch for threading.
NOTE: Please keep the 344864-threaded-runner
until @mikeeddington returns. Some hotpath methods were optimized in this branch.
One issue found in the existing branch was the minimum workers count on ThreadPool
should not match the vCPU, it should be higher by at least 1. On single vCPU runners, such as the GitLab shared runners, increasing the worker thread count improved performance.
To determine the correct levels for ThreadPool
MinThreads WorkerThreads
and also MaxDegreeOfParallelism
for the TransformBlock
. It's recommended to build a 4 vCPU runner in GPC to use with a custom label, and also use the shared runners with a single vCPU. Do not expose any new variables for configuring parallelism at this time.
- At start of job run each operation record to generate work items
- Work items added to
TransformBlock
for processing - Automatically check CPU/core count and update max parallelization and task limits accordingly
- Override task minimums to prevent major slow down on 1 vCPU runners
- Work Items
- Active checks
- Mutational checks
- Parameter + Check
- Documentation update for how to increase testing speed using a larger instance
Implementation
-
Remove benchmark stage from gitlab ci file -
Verify no use of custom runners left, move everything to gitlab-org,high-cpu
-
If the CPU count is 1, and the job time is > 10min, call out our parallel documentation on the console -
Getting occasional failures on ShouldReturnRequestCountAndSumResponseSize test- https://gitlab.com/gitlab-org/security-products/analyzers/api-fuzzing-src/-/jobs/2810807453
- Does this test really work? What about when we test path parameters?
- #371086 (closed)
-
tests_int_config_fuzzing-headers-quick
- failing due to two missing response analysis with "Internal server error" response (500)
- Need to make these two vulnerabilities optional in nature, so changes to the checkreportfaultcounts.sh needed
-
Rename tuning variable API_SECURITY_WORKER_MAX_THREADS
toDAST_API_TUNE_MAX_CONCURRENCY
-
Using a Dataflow
block from the Task Parallel Library, parallelize the work items produced by the strategy. -
Automatically check CPU/core count and update max parallelization and task limits accordingly -
Good enough is - Shared runner (1 vCPU) is not allot slower. And multi-cpu gets allot faster. -
Allow override via configuration as a potential workaround
-
-
Fix any concurrency issues caused by parallelization of the work items -
Wait for record to complete before starting other work items - Prevent issues with Resources, EndPoints in terms of adding PathTrie's and recorded operations
-
WebRunnerCheckStrategy.PerformMutationCheckWork -- Re-check configurations, I think we are safe, but.. -
Create a new instance of check for each work item to avoid concurrency issues -
Fix broken tests that mock Repository.LoadCheck causing them to never return
- To be safe we could request a new instance of a check in the strategy work item. This would also make checks safer to implement. Maybe this is follow on work?
- Goes away with this
FrameworkDebugModeCheck
-- private FrameworkChecks - Goes away with this
FuzzingCheck
-- LOTS OF STUFF HERE!! All sorts of private storage that changes between uses - Any issues with Assertion/Check initialization goes away
-
-
PathTrie -- Used by EndPoint, has Dictionary - Mikes branch commit: 8c1d027836fc57a537e2ee993732fe41f4371961
-
WebApiService -- Has several Dictionary types that will be used cross-thread - Mikes branch commit: 8c1d027836fc57a537e2ee993732fe41f4371961
-
Resource -- List -> ConcurrentBag - Mikes branch commit: 8c1d027836fc57a537e2ee993732fe41f4371961
-
FaultCollector -- Add a new FaultDetectedThreadSafe method with locking - Mikes branch commit: 8c1d027836fc57a537e2ee993732fe41f4371961
-
BlindInjectionAssertion
-- associated checks are also still stateful.-
OsCommandInjectionCheck
-
ShellShockCheck
-
SqlInjectionCheck
-
-
ResponseAnalysisAssertion
-
Cache needs locking -
_entries needs a solution, changes based on config
-
-
Repository.FindOrCreateFinding
-- Needs lock to avoid duplicate findings -
XmlExternalEntityCheck
-- Cached XML document on local private variable-
Put a fix in place -
Have herb review fix for perf
-
-
CustomBodyContainsAssertionConfig
-
IsMatch needs locking on setting _searchStringRegex? -
Can we cache the regex and still work w/concurrency
-
-
CustomParameterContainsAssertionConfig
-
IsMatch needs locking on setting _matchRegex? -
Can we cache the regex and still work w/concurrency
-
-
LogAnalysisAssertionConfigEntry
-
Do we need locking in Regex property? Mike: No we do not.
-
-
ResponseAnalysisAssertionConfigEntry
-
Do we need locking in Regex property? Mike: No we do not.
-
-
SessionCookieCheckConfigEntry
-
Do we need locking in Regex property? Mike: No we do not.
-
-
InsecureHttpMethodsCheck
-
Has configuration dependent internal state
-
-
TlsConfigurationCheck
-- Caches configuration, if we test two hosts at once, could collide -
XmlExternalEntityCheck
-- Looks like it has some internal state -
RegexCache -- Can we move to a concurrent dictionary and remove locking? Lots of calls to this class. -
Issues exposed through test failures
-
-
Perform concurrency audit -
RegexCache -
Assertion initializationNot needed if we use a new check instance in WebRunnerCheckStrategy -
Check initializationNot needed if we use a new check instance in WebRunnerCheckStrategy-
Make sureInitialize
is only called once
-
-
Assertions - Check assertion and also configurations, look for private variables in configs -
Checks - Check main check and also configurations, look for private variables in configs -
Review check configurations -
Checks include base classes
-
-
Fault reporting flow (just lock all of it) -
Send request networking flow -
Review Mike's old work branch to identify other concurrency issues -
If check/assert configurations can change by route, that's a concurrency problem because each work item could be a different route operating in parallel. - Checks and Assertions are
InstancePerDependency
for autofac. Always new instances. - Check instances created by Repository always have a unique instance
- It looks like we are safe here, but...
- To be safe we could request a new instance of a check in the strategy work item
- Checks and Assertions are
-
-
Update tests and add any new tests needed. -
Set max concurrent requests in integration tests and confirm that multiple requests are made to the target concurrently -
Set max concurrent requests in worker-entry tests and confirm that scanner receives the setting
-
-
HttpConnectionSettings - This can change with the route (external proxy)! Will need a unique instance per-operation to use. -
Optimizations -
Benchmark if making _faultCollector.FaultDetectedThreadSafe call a background task improves perf- This is a can of worms. The variables we pass in change if tasked off. -
Modify Repository.LoadCheck to not get new instances of all checks, just the one check#371088 -
Http11Connection -- Could remove locking if we used thread safe bool pattern already in use#371089 -
Request.WriteHeaders -- Use stringbuffers append over string interpolation
-
-
Documentation#371085 (closed)
David's transition notes as of 7/17:
-
Configuration via environment variable is implemented and tested - During testing it was found that setting the value to 0 causes worker-entry to not serialize the value in the create session request (presumably because it is a default value, but I was not able to track down specifically where this determination was made). This effectively makes 0 equal to not setting it at all, which causes the DOP setting to be calculated based on CPUs. That seemed like a reasonable meaning for 0 - particularly because actually setting MaxDop to 0 in the Dataflow blocks disables them - so it is documented and tested that way. Although it cannot be set from worker-entry, if a 0 does show up in the runner options (because .NET integration tests do not strip the 0 value for example), it is explicitly recognized as being equivalent to null. See this code comment.
- Setting the value to -1 means "Unbounded" to Dataflow. While initially it seemed like it might be a good idea to support this value, integration testing with unbounded DOP showed that it can easily overwhelm the target. Since it seemed likely to cause more problems than it solves, I decided not to support it. See this code comment and this one.
- Initially I named this
MAX_CONCURRENT_REQUESTS
, trying to tie it to something the user would understand. However, while I think it is accurate for now (though hard to prove), I realized that might be a stronger commitment than we are willing to make. So I changed it toMAX_CONCURRENCY
as recommended in the task list.
- Testing
- A few tests are not passing:
-
tests_int_config_fuzzing-headers-quick
andtests_int_config_fuzzing-quick
need to have expectations updated. StatusCode-based vulnerabilities coming from our Flask target are now unreliable, since they depend on the order of the incoming requests. The solution (for now) is to ignore status code vulnerabilities in our integration tests. -
ShouldFindVulnerabilityRemoteFileInclusionInJsonBodyWithEmbeddedXml
integration test is failing intermittently becauseXmlExternalEntityCheck
is stateful and not safe for concurrent execution1. - There is a
Debug.Assert
inBlindInjectionAssertion
that is causing multiple integration tests to fail becauseBlindInjectionAssertion
is stateful and not safe for concurrent execution1. TheDebug.Assert
is currently commented out to prevent it from masking other issues.
-
-
New integration tests specifically for concurrent execution have been added - These tests use a
RunnerOptions
property which is not intended to be exposed to consumers which waits to run checks until after all recording is done. We could make this the default behavior instead, which would simplify the code slightly; but since we previously talked about keeping the current behavior, I made it an option for now. See this code comment. - These tests are sending more requests than necessary, and could be faster if the requests could be reduced. See this code comment.
- These tests use a
-
Worker-entry tests for configuration of concurrent execution have been added - Testing that parallel execution is actually happening at this level would be quite difficult, as well as redundant with the integration tests. So these tests do as little as possible and focus on confirming that the setting is passed correctly from worker-entry to the scanner.
- A few tests are not passing:
-
The spike branch created a MutationContext which could hold the state from these checks in order to make them safe for concurrent execution. However that solution caused large changes throughout the entire architecture. Instead we discussed creating concurrency-safe caches inside the checks, keyed to each work item, to store the needed state.
↩ ↩ 2