# Snip navigation paths to improve crawl performance
## Problem

The DAST browser-based analyzer takes a long time to crawl applications. For example, the DAST benchmark from 1 November 2023 shows the following for a scan of DVWA:
| Analyzer | Crawl time | Crawl coverage |
|---|---|---|
| Proxy-based | 17m 25s | 84.91% |
| Browser-based | 1h 27s | 84.91% |
A recent scan of the OWASP Benchmark took 27 hours, even though it is a relatively simple application. It's hard to justify using browser-based DAST when it takes so much longer to scan than the proxy-based (ZAP) analyzer.
## Why the DAST browser-based analyzer is slow

### Summary

The DAST browser-based analyzer is slow to crawl because it spends so much of its time re-crawling pages it has already crawled. While this approach gives high confidence that each page can be navigated to again, it is not a good use of crawl time.
### Example

The DAST browser-based analyzer crawls using navigation paths. For every discovered action to crawl next, the entire navigation path is followed in a new browser. For example:

- Browser *i* is crawling the user home page, `https://site.com/user`.
  - It got there by following the path `LoadURL[https://site.com] -> LeftClick[.navbar] -> LeftClick[.user]`.
  - It finds two new items to crawl on the user home page: clicking on the user activity (`.activity`), and clicking on the user's home location (`.location`).
  - The browser closes.
- Browser *ii* opens, and crawls the activity page.
  - It follows the entire navigation path: `LoadURL[https://site.com] -> LeftClick[.navbar] -> LeftClick[.user] -> LeftClick[.activity]`.
  - It finds no new items to crawl.
  - The browser closes.
- Browser *iii* opens, and crawls the user's home location page.
  - It follows the entire navigation path: `LoadURL[https://site.com] -> LeftClick[.navbar] -> LeftClick[.user] -> LeftClick[.location]`.
  - It finds no new items to crawl.
  - The browser closes.
Note that the opening and closing of browsers is not the problem. This has been measured in real scans and shown to be inconsequential to the overall scan time.
## Proposal

### Summary

The browser-based DAST analyzer should "snip" navigation paths that contain full-page loads, replacing the most recent full-page load navigation with a `LoadURL`. Only `GET` request methods should be snipped; form posts should not.

Navigations are determined to be full-page loads when the Chrome DevTools Protocol `Network.responseReceived` event's resource type is `Document`.
### Example

Navigation paths can be shortened when they contain a full-page load. For example, take the path:

`LoadURL[https://site.com] -> LeftClick[.navbar] -> LeftClick[.user] -> LeftClick[.activity]`

If `LeftClick[.user]` is a full-page load that resulted in the browser having the URL `https://site.com/user`, then path snipping turns the above path into:

`LoadURL[https://site.com/user] -> LeftClick[.activity]`
### Challenges

- We need to make sure we're not losing crawl coverage.
- When paths are compared to see if they have already been seen, the snipped paths are what should be compared.
### Future work

- We could potentially stop creating new navigation crawl entries for HTML `focus`, `mouse in/out/over`, or `mouse wheel` events.
## Implementation plan

- Rename `browserk.BrowserkEvt.Nav` to `browserk.BrowserkEvt.Path`.
- Change the `BrowserkEvt.Nav` type from `[]*Navigation` to `browserk.Path`. Use `.Values()` on the `Path` type to convert to `[]*Navigation` where necessary.
- Create a new type, `OptimizeCrawlPathService`. It should contain one method, `Optimize`, which has one `browserk.Path` parameter and two return values, `browserk.Path, error`.
- Using dependency injection, inject `OptimizeCrawlPathService` into `CrawlNavigationPathJob`.
- Prior to iterating through navigations (`for i, nav := range crawlEvnt.Nav {`) in the `CrawlNavigationPathJob`, call `job.optimizePathSvc.Optimize(crawlEvnt.Path)`. Iterate through the resulting path instead of `crawlEvnt.Nav`.
- Implement `navigationResult.ContainsLoadRequest`. A navigation result is a load request when the `NavigationResult.LoadRequestID` is equal to `NavigationResult.Messages[0].Request.RequestID`.
- Implement `OptimizeCrawlPathService.Optimize`:
  - It needs access to `store.NavigationResult`.
  - Loop backwards through the navigations, from the end of the path to the start.
  - If the current navigation's state is something other than `NavVisited`, include the navigation in the resulting path (`continue`).
  - Load the navigation result for the navigation using `navigationResultStore.FindForNavigationID`.
  - If the navigation result contains a load request:
    - Copy the current navigation into the result path.
    - Update its action to `browser.NewLoadURLAction`, where the URL is the navigation result's load request URL.
    - Return the optimized path, with the current navigation as its start.
    - Log on debug using the `LogBrowser` that the path has been optimized.
  - If no navigations with load requests are encountered, return the original path as the optimized path.
- Allow a feature flag to turn off this behaviour. This can be done by injecting config into the `OptimizeCrawlPathService` and using `cfg.FeatureFlags.IsEnabledOrDefault(xxx)`.
- Use the end-to-end test `test_pancakes` to manually verify this is working. Prior to this work, searching for `navigation executed.*LoadURL` shows that the home page is loaded 11 times. It is expected that there will still be 11 `LoadURL` actions; however, some URLs should change to `/pancakes` and `/my-pancakes`.