
Using --robots=off / --no-robots downloads the robots.txt file and scans it for sitemaps

Archit Pandey requested to merge archit-p/wget2:robot-fix into master

Hello maintainers!

This merge request closes #456 (closed)

Description of files changed:

  1. src/wget.c (add_url_to_queue, add_url): robots.txt is now downloaded whenever the config.recursive option is set, without checking the config.robots option.
  2. src/wget.c (add_url): the config.robots option is checked when marking disallowed URLs as not-to-follow (see the sketch after this list).
  3. tests/test-robots-off.c: most of the file is borrowed from tests/test-robots.c. It tests that robots.txt is downloaded even with robots=off, and that the disallowed URLs are not respected.
  4. tests/test-iri-percent.c: changing the robots=off behavior broke this test case, since it did not expect robots.txt to be downloaded. Adding robots.txt to the expected files makes it pass again.
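To make the intent of items 1 and 2 concrete, here is a minimal, self-contained sketch of the decision logic. It is not the actual src/wget.c code: the struct and the functions fetch_robots_txt() and should_skip_url() are hypothetical stand-ins for the checks inside add_url_to_queue() and add_url().

```c
/* Illustrative sketch only -- not the real src/wget.c implementation. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct config {
	bool recursive; /* --recursive */
	bool robots;    /* --robots / --no-robots */
};

/* robots.txt is fetched whenever we recurse, so sitemaps can be scanned,
 * regardless of whether --robots is on or off */
static bool fetch_robots_txt(const struct config *cfg)
{
	return cfg->recursive;
}

/* Disallow rules are only honored when --robots is enabled */
static bool should_skip_url(const struct config *cfg, const char *url,
		const char *disallowed_prefix)
{
	return cfg->robots && strncmp(url, disallowed_prefix, strlen(disallowed_prefix)) == 0;
}

int main(void)
{
	struct config cfg = { .recursive = true, .robots = false }; /* -r --no-robots */

	printf("fetch robots.txt: %s\n", fetch_robots_txt(&cfg) ? "yes" : "no");
	printf("skip /secret/page.html: %s\n",
			should_skip_url(&cfg, "/secret/page.html", "/secret") ? "yes" : "no");
	return 0;
}
```

With -r --no-robots, the sketch still fetches robots.txt (so sitemaps can be scanned) but does not skip disallowed URLs, which is the behavior tests/test-robots-off.c verifies.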

I have done a clean install after making these changes and run make check: 66/69 test cases pass, 3/69 are skipped.

This was a fairly quick fix, so there may be better ways to achieve the same result. Please point out any gaps in this approach or suggest improvements.

Thanks!

Approver's checklist:

  • The author has submitted the FSF Copyright Assignment and is listed in AUTHORS
  • There is a test suite reasonably covering new functionality or modifications
  • Function naming, parameters, return values, types, etc., are consistent with existing code
  • This feature/change has adequate documentation added (if appropriate)
  • No obvious mistakes / misspelling in the code
Edited by Tim Rühsen
