# DAST browser-based crawler should use sitemap.xml to find URLs
## Problem
The `sitemap.xml` file lists the URLs on a website, typically for use by search engines. Browser-based DAST can harness this by crawling the URLs in the file, leading to better coverage of the target site.
Example sitemap entry:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
Spec: https://sitemaps.org/protocol.html
## Implementation Plan
### Assumptions
- The sitemap doesn't need to be parsed fully as XML; we just need the `<loc>...</loc>` URLs (see the sketch after this list). Sitemaps can be large (max 50K URLs).
- The sitemap is available at the location `/sitemap.xml` and it is in the format outlined by the protocol. It is not a sitemap index, which is typically located at `/sitemap_index.xml`.
- DAST will crawl the target URL, followed by all URLs listed in the sitemap, followed by new navigations found using the normal breadth-first crawl strategy.
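As a rough illustration of the first assumption, here is a minimal sketch (not the analyzer's actual code) of pulling only the `<loc>` values out of a sitemap body using the `xmlquery` library that the plan below proposes. The package and function names are hypothetical.

```go
package sitemap

import (
	"strings"

	"github.com/antchfx/xmlquery"
)

// extractLocs returns the text of every <loc> element in a sitemap body.
// A parse error means the body is not XML at all; callers decide how to warn.
func extractLocs(body string) ([]string, error) {
	doc, err := xmlquery.Parse(strings.NewReader(body))
	if err != nil {
		return nil, err
	}

	// Match by local name so the default sitemap namespace
	// (http://www.sitemaps.org/schemas/sitemap/0.9) does not get in the way.
	nodes := xmlquery.Find(doc, "//*[local-name()='loc']")

	urls := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if u := strings.TrimSpace(n.InnerText()); u != "" {
			urls = append(urls, u)
		}
	}
	return urls, nil
}
```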
### Handling Invalid Sitemaps
- If the sitemap is not available, i.e. the server responds with a status code >= 400 and < 600 (all four cases here are sketched in code after this list):
  - Log a warning and continue the scan.
- If the sitemap contains `<sitemapindex>`:
  - Log a warning: "URLs not parsed from sitemapindex file".
- If the sitemap contains invalid XML:
  - Log a warning: "URLs not parsed from invalid sitemap".
- If the sitemap contains no URLs, or no URLs are found after parsing:
  - Log a warning: "no URLs found in the sitemap".
### Plan
- Create a feature flag. Only create the sitemap `LoadURL` navigation if the feature flag is enabled, and only parse sitemap navigation results for new `LoadURL` navigations when the feature flag is enabled.
- Sitemaps are located at `/sitemap.xml` according to the protocol, so the URL for the sitemap will be `$DAST_WEBSITE/sitemap.xml`.
  - Be careful of situations where `DAST_WEBSITE` ends in a slash, i.e. don't look for the sitemap at `$DAST_WEBSITE//sitemap.xml` (a trailing-slash-safe join is sketched after this list).
- In `initializers.go#addNavigation`, insert a `LoadURL` navigation for `$DAST_WEBSITE/sitemap.xml` after the `LoadURL` navigation for `$DAST_WEBSITE`.
- Create a service to post-process navigation results. The service should return new navigations that are found in the navigation result (see the second sketch after this list).
  - Refactor `BrowserkCrawler.Process` `newNavigationFinder` to be an implementation of the post-processor.
- Create an implementation that creates `LoadURL` navigations for each URL in the sitemap.
  - Use `xmlquery.Parse` to parse the XML. An error result would indicate that the response does not contain XML. Reference: https://gitlab.com/-/snippets/3683911#L41. If the response is valid XML but not a valid sitemap, the XPath query would return an empty list of nodes.
- Ignore Chrome's interactions with the sitemap page; Chrome seems to be clicking a button on the sitemap that folds/unfolds an XML tag (example: https://gitlab.com/gitlab-org/security-products/analyzers/browserker/-/merge_requests/1333#note_1891766108).
- Add unit tests for the sitemap parser.
- Add integration and E2E tests to make sure sitemaps are automatically parsed during a crawl.
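For the `$DAST_WEBSITE/sitemap.xml` construction, a tiny sketch of a trailing-slash-safe join (still within the hypothetical package above, so `strings` is already imported); `sitemapURL` is an assumed helper name, not an existing function in the analyzer.

```go
// sitemapURL builds $DAST_WEBSITE/sitemap.xml without producing a double
// slash when DAST_WEBSITE ends in "/".
func sitemapURL(dastWebsite string) string {
	return strings.TrimRight(dastWebsite, "/") + "/sitemap.xml"
}
```

With this, both `http://example.com` and `http://example.com/` resolve to `http://example.com/sitemap.xml`.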
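And a rough shape for the post-processing service plus a sitemap implementation of it, reusing the `extractLocs` sketch from the Assumptions section. `Navigation`, `NavigationResult`, `PostProcessor`, and `sitemapPostProcessor` are stand-in names, not browserker's real types, so read this as a sketch of the idea rather than the actual API.

```go
// Placeholder types standing in for browserker's real navigation structures.
type Navigation struct {
	Action string // e.g. "LoadURL"
	URL    string
}

type NavigationResult struct {
	URL  string
	Body string
}

// PostProcessor returns new navigations discovered in a navigation result;
// the refactored newNavigationFinder and the sitemap parser would both implement it.
type PostProcessor interface {
	Process(result *NavigationResult) []Navigation
}

// sitemapPostProcessor turns every URL found in a sitemap body into a LoadURL
// navigation. A real implementation would only run for the sitemap navigation
// and only when the feature flag is enabled.
type sitemapPostProcessor struct{}

func (sitemapPostProcessor) Process(result *NavigationResult) []Navigation {
	urls, err := extractLocs(result.Body)
	if err != nil {
		return nil
	}
	navs := make([]Navigation, 0, len(urls))
	for _, u := range urls {
		navs = append(navs, Navigation{Action: "LoadURL", URL: u})
	}
	return navs
}
```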
## Out of scope
It is expected that loading a URL in a sitemap will not be considered a duplicate of clicking a link found on a page with the same URL. Please create a new `maintenance::performance` issue for this.
The new issue will also need to update the documentation to mention support for sitemaps once the feature flag has been enabled.
- Example:
  - Site that has a sitemap linking to `/page/1`, and the home page links to `/page/1`. Verify that when the crawler attempts to follow the home page link, it bypasses it because it has already been followed as part of the sitemap.
  - What happens if you click on a div that results in the page attempting to load `/page/1`? Can we detect it as a duplicate?