Using --robots=off / --no-robots downloads the robots.txt file and scans it for sitemaps
Hello maintainers!
This merge request closes #456.
Description of files changed:
- `src/wget.c` (`add_url_to_queue`, `add_url`): robots.txt is downloaded when the `config.recursive` option is set, without checking the `config.robots` option.
- `src/wget.c` (`add_url`): the `config.robots` option is checked when updating the URLs not to follow (see the logic sketch after this list).
- `tests/test-robots-off.c`: most of the file was borrowed from `tests/test-robots.c`. It tests that robots.txt is downloaded even with `robots=off`, and that the disallowed URLs are not respected (see the test sketch below).
- `tests/test-iri-percent.c`: changing the `robots=off` behavior broke the `test-iri-percent` test case, since it did not expect robots.txt to be downloaded. Adding robots.txt to the expected files makes it pass again.
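To make the control flow concrete, here is a minimal sketch of how the `config.recursive` / `config.robots` checks interact. The helper names (`fetch_robots_txt`, `is_disallowed_by_robots`, `queue_url`) are hypothetical placeholders for the surrounding wget2 code, not the real functions in `src/wget.c`; only the two `config` checks reflect the change described above.

```c
/* Illustrative sketch only, not the actual wget2 code.
 * fetch_robots_txt(), is_disallowed_by_robots() and queue_url() are
 * hypothetical placeholders for the real helpers in src/wget.c. */
static void add_url_sketch(const char *url)
{
	/* robots.txt is fetched whenever recursion is enabled, regardless
	 * of --robots / --no-robots, so that the Sitemap: entries listed
	 * in it can still be discovered. */
	if (config.recursive)
		fetch_robots_txt(url);

	/* The Disallow: rules are only honored when config.robots is set;
	 * with --robots=off the URL is queued even if robots.txt
	 * disallows it. */
	if (config.robots && is_disallowed_by_robots(url))
		return; /* do not follow this URL */

	queue_url(url);
}
```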
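And here is a rough sketch of what `tests/test-robots-off.c` checks, modeled on the structure of the existing tests built around `tests/libtest.h`. The page contents and the `secret/` path are made up for illustration and do not match the actual test file.

```c
#include <config.h>
#include <stdlib.h> // exit()
#include "libtest.h"

int main(void)
{
	wget_test_url_t urls[] = {
		{	.name = "/index.html",
			.code = "200 Dontcare",
			.body = "<html><body><a href=\"secret/page.html\">link</a></body></html>",
			.headers = { "Content-Type: text/html", },
		},
		{	.name = "/robots.txt",
			.code = "200 Dontcare",
			.body = "User-agent: *\nDisallow: /secret/\n",
			.headers = { "Content-Type: text/plain", },
		},
		{	.name = "/secret/page.html",
			.code = "200 Dontcare",
			.body = "<html><body>disallowed, but fetched anyway</body></html>",
			.headers = { "Content-Type: text/html", },
		},
	};

	wget_test_start_server(
		WGET_TEST_RESPONSE_URLS, &urls, countof(urls),
		0);

	// With --robots=off, robots.txt is still downloaded (it may list
	// sitemaps), but its Disallow rules are ignored, so the disallowed
	// page is expected on disk as well.
	wget_test(
		WGET_TEST_OPTIONS, "-r -nH --robots=off",
		WGET_TEST_REQUEST_URL, "index.html",
		WGET_TEST_EXPECTED_ERROR_CODE, 0,
		WGET_TEST_EXPECTED_FILES, &(wget_test_file_t []) {
			{ "index.html", urls[0].body },
			{ "robots.txt", urls[1].body },
			{ "secret/page.html", urls[2].body },
			{	NULL } },
		0);

	exit(EXIT_SUCCESS);
}
```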
I have done a clean install after making these changes. I also ran `make check`: 66/69 test cases pass and 3/69 are skipped.
This was a fairly quick fix, so there may be better ways to achieve the same result. Please point out any gaps in this approach, or suggest how it could be improved.
Thanks!
Approver's checklist:
- The author has submitted the FSF Copyright Assignment and is listed in AUTHORS
- There is a test suite reasonably covering new functionality or modifications
- Function naming, parameters, return values, types, etc., are consistent with existing code
- This feature/change has adequate documentation added (if appropriate)
- No obvious mistakes / misspellings in the code