DAST browser-based crawler should use sitemap.xml to find URLs

Problem

A sitemap.xml file lists the URLs of a website, typically for use by search engines. Browser-based DAST can harness this by crawling the URLs listed in the file, leading to better coverage of the target site.

Example sitemap entry:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>

Spec: https://sitemaps.org/protocol.html

Implementation Plan

Assumptions
  1. The sitemap doesn't need to be fully parsed as XML; we only need the URLs inside the <loc>...</loc> elements. Sitemaps can be large (the protocol allows up to 50,000 URLs per file).
  2. The sitemap is available at /sitemap.xml and is in the format outlined by the protocol. It is not a sitemap index, which is typically located at /sitemap_index.xml.
  3. DAST will crawl the target URL, followed by all URLs listed in the sitemap, followed by new navigations found using the normal breadth-first crawl strategy.
Handling Invalid Sitemaps
  • If the sitemap is not available, i.e. the server responds with a status code >= 400 and < 600:
    • Log a warning and continue the scan
  • If the sitemap contains <sitemapindex>
    • Log a warning "URLs not parsed from sitemapindex file"
  • If the sitemap contains invalid XML
    • Log a warning "URLs not parsed from invalid sitemap"
  • If the sitemap contains no <url> entries, or no URLs are found after parsing:
    • Log a warning "no URLs found in the sitemap"
Plan
  • Create a feature flag. Only create the sitemap LoadURL navigation when the flag is enabled, and only parse sitemap navigation results for new LoadURL navigations when the flag is enabled.
  • Sitemaps are located at /sitemap.xml according to the protocol, so the URL for the sitemap will be $DAST_WEBSITE/sitemap.xml.
    • Be careful of situations where DAST_WEBSITE ends in a slash, i.e. don't look for the sitemap at $DAST_WEBSITE//sitemap.xml (a URL-joining sketch follows this list).
  • In initializers.go#addNavigation, insert a LoadURL navigation for $DAST_WEBSITE/sitemap.xml after the LoadURL navigation for $DAST_WEBSITE.
  • Create a service to post-process navigation results. The service should return new navigations that are found in the navigation result.
    • Refactor newNavigationFinder in BrowserkCrawler.Process to be an implementation of the post-processor.
    • Create an implementation that creates LoadURL navigations for each URL in the sitemap.
      • Use xmlquery.Parse to parse the XML. An error result would indicate that the response does not contain XML. Reference: https://gitlab.com/-/snippets/3683911#L41. If the response is valid XML but not a valid sitemap, the XPath query would return an empty list of nodes (see the parseSitemap sketch under Handling Invalid Sitemaps above).
  • Ignore Chrome's interactions with the sitemap page; Chrome appears to click a button on the sitemap page that folds/unfolds an XML tag (example: https://gitlab.com/gitlab-org/security-products/analyzers/browserker/-/merge_requests/1333#note_1891766108).
  • Add unit tests for the sitemap parser.
  • Add integration and E2E tests to make sure sitemaps are automatically parsed during a crawl.
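A minimal sketch of the slash-safe URL construction mentioned in the list above, assuming Go 1.19+ for url.JoinPath; sitemapURL is an illustrative helper name, not an existing function:

package sitemap

import "net/url"

// sitemapURL appends sitemap.xml to the target with exactly one
// separating slash, whether or not the target ends in one.
func sitemapURL(target string) (string, error) {
    // url.JoinPath cleans duplicate slashes in the joined path.
    return url.JoinPath(target, "sitemap.xml")
}

Both sitemapURL("http://example.com") and sitemapURL("http://example.com/") yield http://example.com/sitemap.xml.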
Out of scope

It is expected that loading a URL from the sitemap will not be considered a duplicate of clicking a link on a page that points to the same URL. Please create a new maintenance::performance issue for this.

The new issue will also need to update the documentation to mention sitemap support once the feature flag has been enabled.

  • Example:
    • A site whose sitemap links to /page/1 and whose home page also links to /page/1. Verify that when the crawler attempts to follow the home page link, it skips it because the URL has already been crawled as part of the sitemap.
    • What happens if clicking a div causes the page to attempt to load /page/1? Can we detect that as a duplicate? (An illustrative normalization sketch follows this list.)
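For illustration only, since the real design belongs to the follow-up issue: one way to detect such duplicates is to key navigations on a normalized URL. dedupKey and its normalization rules below are hypothetical.

package sitemap

import (
    "net/url"
    "strings"
)

// dedupKey reduces a URL to a comparable key: lowercase host and no
// fragment. A sitemap entry and a clicked link that both resolve to
// /page/1 would then collide on the same key.
func dedupKey(raw string) (string, error) {
    u, err := url.Parse(raw)
    if err != nil {
        return "", err
    }
    u.Host = strings.ToLower(u.Host)
    u.Fragment = ""
    return u.String(), nil
}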