Restricting domains with host-spanning does not work
I'm running this command in the hope of crawling subdomains under kedo.gov.cn:
wget2 -r -w 8 --filter-mime-type="text/html" -a wget_log -H -D kedo.gov.cn http://www.kedo.gov.cn
If my assumptions are correct, when combined, -H enables host-spanning and -D restricts the domains. However, after a minute of operation, I end up with the following folder structure:
.
├── story.kedo.gov.cn
│   ├── index.html
│   ├── stories
│   │   └── kxr
│   │       └── index.html
│   └── story
│       └── legend
│           └── classics
│               └── index.html
├── wget_log
├── www.kedo.gov.cn
│   └── index.html
└── www.kepuchina.cn
    ├── index.html
    └── public
        └── 201710
            └── t20171031_253123.shtml
While the www.kedo.gov.cn and story.kedo.gov.cn folders and their contents are desirable, the www.kepuchina.cn folder is not. It should clearly be excluded by -D. I'm familiar with these two flags from the original wget documentation and have used them in the past.
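To make the expectation concrete, here is a minimal sketch of the matching I assume -D performs: a host is followed only if it equals the given domain or ends in a dot plus that domain. The function name in_domain is hypothetical and this is not taken from wget2's source; it only illustrates the behavior I expected.

```shell
# Hypothetical helper, not part of wget or wget2.
# in_domain HOST DOMAIN: succeed if HOST equals DOMAIN or is a subdomain of it.
in_domain() {
  case "$1" in
    "$2"|*."$2") return 0 ;;  # exact match, or any prefix followed by ".DOMAIN"
    *) return 1 ;;
  esac
}

# Classify the three hosts that showed up in the crawl output.
for host in www.kedo.gov.cn story.kedo.gov.cn www.kepuchina.cn; do
  if in_domain "$host" kedo.gov.cn; then
    echo "keep $host"
  else
    echo "skip $host"
  fi
done
```

Under that assumption, the two kedo.gov.cn hosts would be kept and www.kepuchina.cn would be skipped, which is not what the crawl produced.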
How do I get wget2 to honor -D?
Edit: I've also tried omitting the space between -D and kedo.gov.cn (as given in an example in the old wget docs), using the long form --domains, passing *.kedo.gov, wrapping the domain name in quotes, etc. No success.