DAST Crawler Improvements
### Scope

To the degree that a human user can load a target site in their browser, authenticate if necessary, and then browse and otherwise interact with the site, we want the Browser-Based DAST scanner to be able to do the same.

We cannot cover every possible circumstance; websites and browsers are far too complex, and there will always be a long tail of edge cases. This epic covers the scenarios that we believe are important enough to enough customers that we want to proactively ensure they work, but where they are currently known not to work or to be unstable, or where it is not known whether they work.

#### Out of Scope

This epic is focused on the ability of the crawler to interact with a target site. The following areas are out of scope:

- Performance: This epic will focus on behavioral changes, not on improving the performance of the crawler. For any changes we make, a baseline goal will be to not make performance worse. However, we may decide on a case-by-case basis that the behavioral gains are worth sacrificing performance for.
  Related epics: [DAST Crawler Performance Improvements](https://gitlab.com/groups/gitlab-org/-/epics/12194#top)
- Authentication improvements: We want the authentication methods that we currently support to work as broadly as possible. But supporting new methods of authentication (such as [MFA](https://gitlab.com/groups/gitlab-org/-/epics/13633) or [user-scripted flows](https://gitlab.com/gitlab-org/gitlab/-/issues/508553)) is out of scope.
  Related epics: [DAST authentication improvements](https://gitlab.com/groups/gitlab-org/-/epics/8094#top)
- Communication: Improving logs, error messages, or documentation independent of behavioral changes (e.g. "this is expected behavior but it confuses users, we should communicate it better") is out of scope.
  Related epics: [Decrease need for support escalation for DAST](https://gitlab.com/groups/gitlab-org/-/epics/11230#top), [DAST default logging improvements](https://gitlab.com/groups/gitlab-org/-/epics/15051#top)
- Non-crawler behavior: Problems caused by elements of the scanner outside the crawler itself, such as the template (https://gitlab.com/gitlab-org/gitlab/-/issues/429823) or On-Demand configuration (https://gitlab.com/gitlab-org/gitlab/-/issues/502259), are out of scope.

### Roadmap Items

| Roadmap Issue | Deliverable | Milestone | Status |
| ------------- | ----------- | --------- | ------ |
| https://gitlab.com/gitlab-org/gitlab/-/issues/514728+s | ~Deliverable | %"18.5" | |
| https://gitlab.com/groups/gitlab-org/-/epics/16580+s | ~Deliverable | %"18.5" | |
| https://gitlab.com/groups/gitlab-org/-/epics/16003+s | ~Deliverable | %"18.5" | |
| https://gitlab.com/groups/gitlab-org/-/epics/16579+s | ~Deliverable | %"18.5" | |
| https://gitlab.com/gitlab-org/gitlab/-/issues/514729+s | ~Deliverable | %"18.6" | |
| https://gitlab.com/groups/gitlab-org/-/epics/19087+s | ~Deliverable | %"18.6" | |
| https://gitlab.com/gitlab-org/gitlab/-/issues/566243+s | ~Deliverable | %"18.7" | |
| https://gitlab.com/groups/gitlab-org/-/epics/19264+s | ~Deliverable | %"18.7" | |
| https://gitlab.com/gitlab-org/gitlab/-/issues/434083+s | ~Stretch | | |
| https://gitlab.com/gitlab-org/gitlab/-/issues/554125+s | ~Stretch | | |
| https://gitlab.com/groups/gitlab-org/-/epics/19086+s | Out of Scope | | |

### Analysis

Challenges that we currently face with the crawler fall into roughly four buckets:

- Site instability caused by the crawler interfering with the operation of the browser (https://gitlab.com/gitlab-org/gitlab/-/issues/478596)
- Site uses functionality not recognized or supported by the crawler (https://gitlab.com/gitlab-org/gitlab/-/issues/482769)
- Crawler incorrectly interprets data from the browser (https://gitlab.com/gitlab-org/gitlab/-/issues/480909)
- Crawler is unable to consistently navigate the site, i.e. follow the same path through the site repeatedly as it probes for new navigations (https://gitlab.com/groups/gitlab-org/-/epics/16243)

### Improvements

#### Crawler interfering with the operation of the browser

[DAST Crawler: Minimize disruption to the browser](https://gitlab.com/groups/gitlab-org/-/epics/16579#top) - There are a number of places where the crawler can do a better job of allowing the browser and the crawl to continue in the face of errors or unexpected states.

#### Site uses functionality not recognized or supported by the crawler

Since a number of these problems have been caused by the crawler's lack of awareness of certain features, let's see if we can proactively identify similar unaccounted-for features that we haven't dealt with yet.

- Get up to date
  - [DAST w3c specifications gap analysis](https://gitlab.com/groups/gitlab-org/-/epics/16002#top)
  - [DevTools protocol gap analysis](https://gitlab.com/gitlab-org/gitlab/-/issues/514728#top)
- Stay up to date
  - [Document process for reviewing DevTools Protocol changes during Chromium upgrade](https://gitlab.com/gitlab-org/gitlab/-/issues/514729#top)

#### Crawler incorrectly interprets data from the browser

Ideally, any logic that we have for dealing with anticipated behavior from the browser should be accompanied by tests which show the browser exhibiting that behavior and the crawler handling it correctly. In the case of https://gitlab.com/gitlab-org/gitlab/-/issues/480909 there were no such tests (there were only unit tests that _assumed_ that the browser would behave in a certain way, which was not accurate). A full test gap analysis of all of the crawler logic that interacts with the browser would be a very large undertaking.

#### Crawler is unable to consistently navigate the site

[Improve DAST's ability to consistently detect elements previously observed during crawl](https://gitlab.com/groups/gitlab-org/-/epics/16580#top "Improve DAST's ability to consistently detect elements previously observed during crawl") - The crawler currently uses an attribute-matching heuristic to identify the same element on subsequent page visits. That heuristic can be improved, and there are also other heuristics that we can try.
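
To make the idea of an attribute-matching heuristic concrete, here is a minimal sketch of one possible shape for it. It is not the crawler's actual implementation: `ElementSnapshot`, `ATTRIBUTE_WEIGHTS`, `similarity`, `find_same_element`, the weights, and the threshold are all hypothetical names and values chosen for illustration. The sketch scores a candidate element by the weighted share of previously observed attributes that it reproduces exactly, and accepts the best candidate only if it clears a threshold.

```python
from dataclasses import dataclass, field


@dataclass
class ElementSnapshot:
    """Hypothetical record of an element observed on a page visit."""
    tag: str
    attributes: dict = field(default_factory=dict)


# Illustrative weights (assumed values): attributes that tend to be stable
# across visits count more than ones that often change, such as class.
ATTRIBUTE_WEIGHTS = {"id": 3.0, "name": 2.0, "href": 2.0, "type": 1.0, "class": 1.0}
DEFAULT_WEIGHT = 0.5


def similarity(seen: ElementSnapshot, candidate: ElementSnapshot) -> float:
    """Score in [0, 1]: weighted share of seen attributes the candidate matches."""
    if seen.tag != candidate.tag:
        return 0.0
    if not seen.attributes:
        return 1.0  # only the tag is known, and it matches
    total = 0.0
    matched = 0.0
    for name, value in seen.attributes.items():
        weight = ATTRIBUTE_WEIGHTS.get(name, DEFAULT_WEIGHT)
        total += weight
        if candidate.attributes.get(name) == value:
            matched += weight
    return matched / total


def find_same_element(seen, candidates, threshold=0.6):
    """Return the best-scoring candidate, or None if none reaches the threshold."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = similarity(seen, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


if __name__ == "__main__":
    seen = ElementSnapshot("button", {"id": "submit", "class": "btn primary", "type": "submit"})
    revisit = [
        # Same button, but its class changed between visits.
        ElementSnapshot("button", {"id": "submit", "class": "btn primary loading", "type": "submit"}),
        # A different button that happens to share class and type.
        ElementSnapshot("button", {"id": "cancel", "class": "btn primary", "type": "submit"}),
    ]
    print(find_same_element(seen, revisit))
```

A weighted score like this tolerates small changes (a toggled class) while still rejecting look-alike elements, but it is only one option; positional or structural signals would be examples of the "other heuristics" mentioned above.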