# Snip navigation paths to improve crawl performance
## Problem

The DAST browser-based analyzer takes a long time to crawl applications. For example, the DAST benchmark from 1 November 2023 shows the following for a scan of DVWA:
| Analyzer | Crawl time | Crawl coverage |
|---|---|---|
| Proxy-based | 17m 25s | 84.91% |
| Browser-based | 1h 27s | 84.91% |
A recent scan of the OWASP Benchmark took 27 hours, even though it is a relatively simple application. It's hard to justify using browser-based DAST when it takes so much longer to scan than the proxy-based (ZAP) analyzer.
## Why the DAST browser-based analyzer is slow

### Summary

The DAST browser-based analyzer is slow to crawl because it spends so much of its time re-crawling pages it has already crawled. While this approach gives high confidence that each page can be navigated to again, it is not a good use of crawl time.
### Example

The DAST browser-based analyzer crawls using navigation paths. For every discovered action to crawl next, the entire navigation path is followed in a new browser. For example:

- Browser *i* is crawling the user home page, `https://site.com/user`.
  - It got there by following the path `LoadURL[https://site.com] -> LeftClick[.navbar] -> LeftClick[.user]`.
  - It finds two new items to crawl on the user home page: clicking on the user activity (`.activity`), and clicking on the user's home location (`.location`).
  - The browser closes.
- Browser *ii* opens, and crawls the activity page.
  - It follows the entire navigation path: `LoadURL[https://site.com] -> LeftClick[.navbar] -> LeftClick[.user] -> LeftClick[.activity]`.
  - It finds no new items to crawl.
  - The browser closes.
- Browser *iii* opens, and crawls the user's home location page.
  - It follows the entire navigation path: `LoadURL[https://site.com] -> LeftClick[.navbar] -> LeftClick[.user] -> LeftClick[.location]`.
  - It finds no new items to crawl.
  - The browser closes.
Note that the opening and closing of browsers is not the problem. This has been measured in real scans and shown to be inconsequential to the overall scan time.
## Proposal

### Summary

The browser-based DAST analyzer should "snip" navigation paths that contain full-page loads, replacing the most recent full-page load navigation with a `LoadURL`. Only `GET` request methods should be snipped; form posts should not.

Navigations are determined to be full-page loads when the Chrome DevTools Protocol `Network.responseReceived` event's resource type is `Document`.
### Example

Navigation paths can be shortened when they contain a full-page load. For example, take the path:

`LoadURL[https://site.com] -> LeftClick[.navbar] -> LeftClick[.user] -> LeftClick[.activity]`

If `LeftClick[.user]` is a full-page load that resulted in the browser having the URL `https://site.com/user`, then path snipping turns the above path into:

`LoadURL[https://site.com/user] -> LeftClick[.activity]`
### Challenges

- We need to make sure we're not losing crawl coverage.
- When paths are compared to see if they have already been seen, the snipped paths are what should be compared.
### Future work

- We could potentially stop creating new navigation crawl entries for HTML `focus`, `mouse in/out/over`, or `mouse wheel` events.
## Implementation plan

- Rename `browserk.BrowserkEvt.Nav` to `browserk.BrowserkEvt.Path`.
- Change the `BrowserkEvt.Nav` type from `[]*Navigation` to `browserk.Path`. Use `.Values()` on the `Path` type to convert to `[]*Navigation` where necessary.
- Create a new type, `OptimizeCrawlPathService`. It should contain one method, `Optimize`, which has one `browserk.Path` parameter and two return values, `browserk.Path, error`.
- Using dependency injection, inject `OptimizeCrawlPathService` into `CrawlNavigationPathJob`.
- Prior to iterating through navigations (`for i, nav := range crawlEvnt.Nav {`) in the `CrawlNavigationPathJob`, call `job.optimizePathSvc.Optimize(crawlEvnt.Path)`. Iterate through the resulting path instead of `crawlEvnt.Nav`.
- Implement `navigationResult.ContainsLoadRequest`. A navigation result is a load request when the `NavigationResult.LoadRequestID` is equal to `NavigationResult.Messages[0].Request.RequestID`.
- Implement `OptimizeCrawlPathService.Optimize`:
  - It needs access to `store.NavigationResult`.
  - Loop backwards through the navigations, from the end of the path to the start.
  - If the current navigation's state is something other than `NavVisited`, include the navigation in the resulting path (`continue`).
  - Load the navigation result for the navigation using `navigationResultStore.FindForNavigationID`.
  - If the navigation result contains a load request:
    - Copy the current navigation into the result path.
    - Update its action to `browser.NewLoadURLAction`, where the URL is the navigation result's load request URL.
    - Return the optimized path, with the current navigation as its start.
    - Log on debug using the `LogBrowser` that the path has been optimized.
  - If no navigations with load requests are encountered, return the original path as the optimized path.
- Allow a feature flag to turn off this behaviour. This can be done by injecting config into the `OptimizeCrawlPathService` and using `cfg.FeatureFlags.IsEnabledOrDefault(xxx)`.
- Use the end-to-end test `test_pancakes` to manually verify this is working. Prior to this work, searching for `navigation executed.*LoadURL` shows that the home page is loaded 11 times. It is expected that there will still be 11 `LoadURL` actions; however, some URLs should change to `/pancakes` and `/my-pancakes`.