
Using --robots=off / --no-robots downloads the robots.txt file and scans it for sitemaps

Archit Pandey requested to merge archit-p/wget2:robot-fix into master

Hello maintainers!

This merge request closes #456 (closed)

Description of files changed:

  1. src/wget.c (add_url_to_queue, add_url): robots.txt is now downloaded whenever the config.recursive option is set, without checking the config.robots option.
  2. src/wget.c (add_url): the config.robots option is checked when marking disallowed URLs as not-to-follow (see the sketch after this list).
  3. tests/test-robots-off.c: most of the file is borrowed from tests/test-robots.c. It tests that robots.txt is downloaded even with robots=off, and that the disallowed URLs are not respected.
  4. tests/test-iri-percent.c: changing the robots=off behavior broke this test case, since it did not expect robots.txt to be downloaded. Adding robots.txt to the expected files makes it pass again.
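To make the intent of items 1 and 2 concrete, here is a minimal, self-contained sketch of the decision logic. It is not the actual src/wget.c code: the struct and the functions fetch_robots_txt() and should_skip_url() are hypothetical stand-ins for the checks inside add_url_to_queue() and add_url().

```c
/* Illustrative sketch only -- not the real src/wget.c implementation. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct config {
	bool recursive; /* --recursive */
	bool robots;    /* --robots / --no-robots */
};

/* robots.txt is fetched whenever we recurse, so sitemaps can be scanned,
 * regardless of whether --robots is on or off */
static bool fetch_robots_txt(const struct config *cfg)
{
	return cfg->recursive;
}

/* Disallow rules are only honored when --robots is enabled */
static bool should_skip_url(const struct config *cfg, const char *url,
		const char *disallowed_prefix)
{
	return cfg->robots && strncmp(url, disallowed_prefix, strlen(disallowed_prefix)) == 0;
}

int main(void)
{
	struct config cfg = { .recursive = true, .robots = false }; /* -r --no-robots */

	printf("fetch robots.txt: %s\n", fetch_robots_txt(&cfg) ? "yes" : "no");
	printf("skip /secret/page.html: %s\n",
			should_skip_url(&cfg, "/secret/page.html", "/secret") ? "yes" : "no");
	return 0;
}
```

With -r --no-robots, the sketch still fetches robots.txt (so sitemaps can be scanned) but does not skip disallowed URLs, which is the behavior tests/test-robots-off.c verifies.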

I have done a clean install after making these changes and run make check: 66/69 test cases pass, 3/69 are skipped.

This was a fairly quick fix, so there may be better ways to achieve the same result. Please point out any gaps in this approach or suggest improvements.

Thanks!

Approver's checklist:

  • The author has submitted the FSF Copyright Assignment and is listed in AUTHORS
  • There is a test suite reasonably covering new functionality or modifications
  • Function naming, parameters, return values, types, etc., are consistent with existing code
  • This feature/change has adequate documentation added (if appropriate)
  • No obvious mistakes / misspelling in the code
Edited by Tim Rühsen
