
Spike: Experiment for the page similarity algorithm to be used in DAST

Experiment with techniques for computing page similarity scores that DAST's crawler can use to deduplicate pages.

Open Questions

  1. Should pages be compared based on content, DOM structure, visual appearance, or some combination of these?
  2. Do we need a discrete answer (pages are similar or not similar) or a continuous score (pages are similar with a score of n)?
  3. One page might need to be compared with numerous other pages. How can we do this without a significant increase in memory and CPU usage?
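
As a starting point for the experiment, here is a minimal sketch of one candidate technique, a SimHash fingerprint over DOM structure. It is an illustration under stated assumptions, not a chosen design: the function names, the 3-tag shingle size, the 64-bit fingerprint width, and the 0.9 threshold are all hypothetical. It touches each question above: it compares DOM structure only (question 1), it yields a continuous score that can be thresholded into a discrete answer (question 2), and it stores one fixed-size integer per page so that each comparison is an XOR and a bit count (question 3).

```python
import hashlib
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the sequence of opening tag names, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_shingles(html, k=3):
    """Return the set of k-grams over the page's opening-tag sequence."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    tags = collector.tags
    if len(tags) < k:
        return {" ".join(tags)} if tags else set()
    return {" ".join(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def simhash(features, bits=64):
    """Fold a set of string features into a single `bits`-wide fingerprint."""
    counts = [0] * bits
    for feature in features:
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(fp_a, fp_b, bits=64):
    """Continuous score in [0, 1]: 1 minus the normalized Hamming distance."""
    distance = bin(fp_a ^ fp_b).count("1")
    return 1.0 - distance / bits


def is_duplicate(fp_a, fp_b, threshold=0.9, bits=64):
    """Discrete answer derived from the continuous score (threshold is an assumption)."""
    return similarity(fp_a, fp_b, bits) >= threshold


# Two pages with the same structure but different text, and one different page.
page_a = "<html><body><div><h1>Cart</h1><ul><li>Apple</li><li>Pear</li></ul></div></body></html>"
page_b = "<html><body><div><h1>Cart</h1><ul><li>Milk</li><li>Bread</li></ul></div></body></html>"
page_c = "<html><body><form><input><input><button>Log in</button></form></body></html>"

fp_a = simhash(tag_shingles(page_a))
fp_b = simhash(tag_shingles(page_b))
fp_c = simhash(tag_shingles(page_c))
```

Because `page_a` and `page_b` differ only in text, they produce the same tag shingles and therefore the same fingerprint. Whether ignoring content in this way is desirable is exactly what question 1 asks. Comparing one fingerprint against every stored fingerprint is still linear in the number of pages, so the spike may also want to evaluate indexing schemes that avoid a full scan.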