DAST browser-based crawler should use sitemap.xml to find URLs

Problem

A sitemap.xml file lists the URLs of a website, typically for use by search engines. Browser-based DAST can harness this by crawling the URLs listed in the file, leading to better coverage of the target site.

Example sitemap entry:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>

Spec: https://sitemaps.org/protocol.html

Implementation Plan

Assumptions
  1. The sitemap doesn't need to be fully parsed as XML; we only need the URLs inside the <loc>...</loc> elements. Sitemaps can be large (the protocol allows up to 50,000 URLs per file).
  2. The sitemap is available at /sitemap.xml and is in the format outlined by the protocol. It is not a sitemap index, which is typically located at /sitemap_index.xml.
  3. DAST will crawl the target URL, followed by all URLs listed in the sitemap, followed by new navigations found using the normal breadth-first crawl strategy.
Handling Invalid Sitemaps
  • If the sitemap is not available, i.e. the server responds with a status code >= 400 and < 600:
    • Log a warning and continue the scan
  • If the sitemap contains <sitemapindex>
    • Log a warning "URLs not parsed from sitemapindex file"
  • If the sitemap contains invalid XML
    • Log a warning "URLs not parsed from invalid sitemap"
  • If the sitemap contains no <url> entries, or no URLs are found after parsing:
    • Log a warning "no URLs found in the sitemap"
Plan
  • Create a feature flag. Only create the sitemap LoadURL navigation when the flag is enabled, and only parse sitemap navigation results for new LoadURL navigations when the flag is enabled.
  • Sitemaps are located at /sitemap.xml according to the protocol, so the URL for the sitemap will be $DAST_WEBSITE/sitemap.xml.
    • Be careful of situations where DAST_WEBSITE ends in a slash, i.e. don't look for the sitemap at $DAST_WEBSITE//sitemap.xml (a URL-joining sketch follows this list).
  • In initializers.go#addNavigation, insert a LoadURL navigation for $DAST_WEBSITE/sitemap.xml after the LoadURL navigation for $DAST_WEBSITE.
  • Create a service to post-process navigation results. The service should return new navigations that are found in the navigation result.
    • Refactor newNavigationFinder in BrowserkCrawler.Process to be an implementation of the post-processor.
    • Create an implementation that creates LoadURL navigations for each URL in the sitemap.
      • Use xmlquery.Parse to parse the XML. An error result would indicate that the response does not contain XML. Reference: https://gitlab.com/-/snippets/3683911#L41. If the response is valid XML but not a valid sitemap, the XPath query would return an empty list of nodes (see the parseSitemap sketch under Handling Invalid Sitemaps above).
  • Ignore Chrome's interactions with the sitemap page; Chrome appears to click a button on the sitemap page that folds/unfolds an XML tag (example: https://gitlab.com/gitlab-org/security-products/analyzers/browserker/-/merge_requests/1333#note_1891766108).
  • Add unit tests for the sitemap parser.
  • Add integration and E2E tests to make sure sitemaps are automatically parsed during a crawl.
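A minimal sketch of the slash-safe URL construction mentioned in the list above, assuming Go 1.19+ for url.JoinPath; sitemapURL is an illustrative helper name, not an existing function:

package sitemap

import "net/url"

// sitemapURL appends sitemap.xml to the target with exactly one
// separating slash, whether or not the target ends in one.
func sitemapURL(target string) (string, error) {
    // url.JoinPath cleans duplicate slashes in the joined path.
    return url.JoinPath(target, "sitemap.xml")
}

Both sitemapURL("http://example.com") and sitemapURL("http://example.com/") yield http://example.com/sitemap.xml.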
Out of scope

It is expected that loading a URL from the sitemap will not be considered a duplicate of clicking a link on a page that points to the same URL. Please create a new maintenance::performance issue for this.

The new issue will also need to update the documentation to mention sitemap support once the feature flag has been enabled.

  • Example:
    • A site whose sitemap links to /page/1 and whose home page also links to /page/1. Verify that when the crawler attempts to follow the home page link, it skips it because the URL has already been crawled as part of the sitemap.
    • What happens if clicking a div causes the page to attempt to load /page/1? Can we detect that as a duplicate? (An illustrative normalization sketch follows this list.)
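For illustration only, since the real design belongs to the follow-up issue: one way to detect such duplicates is to key navigations on a normalized URL. dedupKey and its normalization rules below are hypothetical.

package sitemap

import (
    "net/url"
    "strings"
)

// dedupKey reduces a URL to a comparable key: lowercase host and no
// fragment. A sitemap entry and a clicked link that both resolve to
// /page/1 would then collide on the same key.
func dedupKey(raw string) (string, error) {
    u, err := url.Parse(raw)
    if err != nil {
        return "", err
    }
    u.Host = strings.ToLower(u.Host)
    u.Fragment = ""
    return u.String(), nil
}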