Incompatible Behavior: -p (--page-requisites) and -np (--no-parent)
Wget1 and Wget2 behave differently when:
- A page requisite (image, CSS, etc.) and the original page are on the same host (share the same domain)
- A page requisite exists outside the directory that contains the original page (HTML file)
This behavior especially affects recursive downloading. For instance, on a website (http://example.com/) with the following files:
/style.css: global style for a website
/category/index.html: local page index (refers to /style.css and links to /category/page.html)
/category/page.html: local page (but refers
wget -r -l 0 -p -np http://example.com/category/index.html downloads all three files, but
wget2 -r -l 0 -p -np http://example.com/category/index.html doesn't download the global style.css. This is a simple example, but the website I want to crawl is far more complex (which makes --reject-regex nearly unusable).
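To make the difference concrete, here is a small Python model of the behavior described above. It is only a sketch of what I observe, not code from either program: the function names and the `exempt_requisites` switch are hypothetical, and the model only covers same-host URLs.

```python
from urllib.parse import urlsplit
import posixpath


def directory_of(url):
    """Return the directory part of the URL path, always ending in '/'."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        return path
    return posixpath.dirname(path).rstrip("/") + "/"


def allowed_by_no_parent(start_url, candidate_url, is_requisite, exempt_requisites):
    """Model of the -np filter for a single candidate URL.

    exempt_requisites=True  -> page requisites bypass -np (what I observe with wget)
    exempt_requisites=False -> page requisites are filtered too (what I observe with wget2)
    """
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    if start.netloc.lower() != cand.netloc.lower():
        # -np only constrains same-host URLs; host spanning is a separate
        # check that this sketch does not model.
        return True
    if is_requisite and exempt_requisites:
        return True
    return directory_of(candidate_url).startswith(directory_of(start_url))


START = "http://example.com/category/index.html"
# (URL, is it a page requisite?)
LINKS = [
    ("http://example.com/style.css", True),
    ("http://example.com/category/page.html", False),
]

for label, exempt in (("wget1-like", True), ("wget2-like", False)):
    fetched = [u for u, req in LINKS if allowed_by_no_parent(START, u, req, exempt)]
    print(label, fetched)
```

In this model the only difference between the two programs is whether a link marked as a page requisite skips the parent-directory check.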
While this behavior is consistent in some way (it works just like --span-hosts), not being able to retrieve page requisites in a recursive download is undesirable for me (and in general).
I think it can be resolved by using link_inline somehow, but I'm not sure:
- Whether using link_inline can fix the issue
- Whether changing Wget2's behavior to match Wget1's is a good idea (is there any better behavior than Wget1's [and current Wget2's]? Can we have a command-line option?)
...partly because I first saw the source code of wget (1 and 2) today.
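For illustration, this is roughly what I mean by an "inline" flag on a link. It is a Python sketch under my own assumptions, not the wget or wget2 parser: I assume link_inline marks links that are embedded resources of the page (stylesheets, images, scripts) as opposed to navigational links, and the `LinkCollector` class and its tag table are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute_url, inline) pairs from an HTML document."""

    # Tags whose "src" attribute is treated as an inline page requisite
    # in this sketch (an assumption, not the list either program uses).
    INLINE_SRC_TAGS = {"img", "script"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("href"):
            # <link> is a requisite only when it points at a stylesheet.
            rel = (attrs.get("rel") or "").lower().split()
            self._add(attrs["href"], "stylesheet" in rel)
        elif tag == "a" and attrs.get("href"):
            # Ordinary hyperlinks are navigational, never inline.
            self._add(attrs["href"], False)
        elif tag in self.INLINE_SRC_TAGS and attrs.get("src"):
            self._add(attrs["src"], True)

    def _add(self, href, inline):
        self.links.append((urljoin(self.base_url, href), inline))


PAGE = '<link rel="stylesheet" href="/style.css"><a href="page.html">next</a>'
collector = LinkCollector("http://example.com/category/index.html")
collector.feed(PAGE)
print(collector.links)
```

If each queued URL carried such a flag, the -np check could skip URLs whose flag is set whenever -p is given, and a command-line option could toggle that exemption. Whether the existing link_inline field in Wget2 is available at the point where -np is applied is exactly what I'm unsure about.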