`wget2 -rc` doesn't parse local file foo.shtml
For `a.html`:
$ wget2 -rc --stats-site localhost
[1] Downloading 'http://localhost/robots.txt' ...
Saving 'localhost//robots.txt'
HTTP response 200 OK [http://localhost/robots.txt]
[1] Downloading 'localhost' ...
Saving 'localhost//index.html'
HTTP response 200 OK [localhost]
URI content encoding = 'CP1252' (default, encoding not specified)
[0] Downloading 'http://localhost/a.html' ...
Saving 'localhost//a.html'
HTTP response 200 OK [http://localhost/a.html]
URI content encoding = 'CP1252' (default, encoding not specified)
[1] Downloading 'http://localhost/b.html' ...
Saving 'localhost//b.html'
HTTP response 200 OK [http://localhost/b.html]
URI content encoding = 'CP1252' (default, encoding not specified)
Downloaded: 4 files, 296 bytes, 0 redirects, 0 errors
Site Statistics:
http://localhost:
Status No. of docs
200 4
http://localhost/robots.txt 0 (identity) : 0 (decompressed)
localhost 108 (gzip) : 136 (decompressed)
http://localhost/a.html 105 (gzip) : 129 (decompressed)
http://localhost/b.html 83 (gzip) : 98 (decompressed)
http://localhost:
localhost
|--http://localhost/robots.txt
|--http://localhost/a.html
| |--http://localhost/b.html
The above run shows the site structure.
$ la localhost/
total 28K
7231393 -rw-r--r-- 1 rootmonk rootmonk 129 Sep 1 21:30 a.html
7231394 -rw-r--r-- 1 rootmonk rootmonk 98 Sep 1 21:50 b.html
7231392 -rw-r--r-- 1 rootmonk rootmonk 136 Sep 1 21:52 index.html
7231391 -rw-r--r-- 1 rootmonk rootmonk 0 Sep 1 21:49 robots.txt
$ rm -f localhost/b.html
$ la localhost/
total 20K
7231393 -rw-r--r-- 1 rootmonk rootmonk 129 Sep 1 21:30 a.html
7231392 -rw-r--r-- 1 rootmonk rootmonk 136 Sep 1 21:52 index.html
7231391 -rw-r--r-- 1 rootmonk rootmonk 0 Sep 1 21:49 robots.txt
`b.html` deleted.
$ wget2 -rc --stats-site localhost
[0] Downloading 'http://localhost/robots.txt' ...
Saving 'localhost//robots.txt'
HTTP response 200 OK [http://localhost/robots.txt]
[0] Downloading 'localhost' ...
HTTP ERROR response 416 Requested Range Not Satisfiable [localhost]
URI content encoding = 'iso-8859-1' (set by server response)
[1] Downloading 'http://localhost/a.html' ...
HTTP ERROR response 416 Requested Range Not Satisfiable [http://localhost/a.html]
URI content encoding = 'iso-8859-1' (set by server response)
[0] Downloading 'http://localhost/b.html' ...
Saving 'localhost//b.html'
HTTP response 200 OK [http://localhost/b.html]
URI content encoding = 'CP1252' (default, encoding not specified)
Downloaded: 2 files, 1.62K bytes, 0 redirects, 2 errors
Site Statistics:
http://localhost:
Status No. of docs
200 2
http://localhost/robots.txt 0 (identity) : 0 (decompressed)
http://localhost/b.html 83 (gzip) : 98 (decompressed)
416 2
localhost 790 (identity) : 389 (decompressed)
http://localhost/a.html 790 (identity) : 389 (decompressed)
http://localhost:
localhost
|--http://localhost/robots.txt
|--http://localhost/a.html
| |--http://localhost/b.html
$ la localhost/
total 28K
7231393 -rw-r--r-- 1 rootmonk rootmonk 129 Sep 1 21:30 a.html
7231394 -rw-r--r-- 1 rootmonk rootmonk 98 Sep 1 21:50 b.html
7231392 -rw-r--r-- 1 rootmonk rootmonk 136 Sep 1 21:52 index.html
7231391 -rw-r--r-- 1 rootmonk rootmonk 0 Sep 1 21:49 robots.txt
`b.html` reappears thanks to `-c`.
Now, for a.shtml
$ wget2 -rc --stats-site localhost
[0] Downloading 'http://localhost/robots.txt' ...
Saving 'localhost//robots.txt'
HTTP response 200 OK [http://localhost/robots.txt]
[0] Downloading 'localhost' ...
Saving 'localhost//index.html'
HTTP response 200 OK [localhost]
URI content encoding = 'CP1252' (default, encoding not specified)
[0] Downloading 'http://localhost/a.shtml' ...
Saving 'localhost//a.shtml'
HTTP response 200 OK [http://localhost/a.shtml]
URI content encoding = 'CP1252' (default, encoding not specified)
[1] Downloading 'http://localhost/b.html' ...
Saving 'localhost//b.html'
HTTP response 200 OK [http://localhost/b.html]
URI content encoding = 'CP1252' (default, encoding not specified)
Downloaded: 4 files, 299 bytes, 0 redirects, 0 errors
Site Statistics:
http://localhost:
Status No. of docs
200 4
http://localhost/robots.txt 0 (identity) : 0 (decompressed)
localhost 111 (gzip) : 138 (decompressed)
http://localhost/a.shtml 105 (gzip) : 129 (decompressed)
http://localhost/b.html 83 (gzip) : 98 (decompressed)
http://localhost:
localhost
|--http://localhost/robots.txt
|--http://localhost/a.shtml
| |--http://localhost/b.html
Same site structure. Also, wow! `b.html` got downloaded. That's because we decide whether to parse the file (`a.shtml`) from its Content-Type and not its extension.
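To illustrate what I mean by deciding from the Content-Type, here is a minimal sketch in Python. It is not wget2's actual code: the names `PARSEABLE_TYPES` and `should_parse_response` are made up, and the set of parseable media types is an assumption. The log above does not show the header either, so `text/html` for `a.shtml` is also assumed.

```python
# Hypothetical sketch of a Content-Type based parse decision.
# Not taken from wget2; names and the type list are assumptions.
PARSEABLE_TYPES = {"text/html", "application/xhtml+xml"}


def should_parse_response(content_type_header):
    """Return True if a response with this Content-Type would be scanned for links."""
    if not content_type_header:
        return False
    # Drop parameters such as "; charset=iso-8859-1" and normalise case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in PARSEABLE_TYPES


# Assumed header for a.shtml; the file name plays no role in the decision.
print(should_parse_response("text/html; charset=iso-8859-1"))
print(should_parse_response("text/plain"))
```

With a check like this, `a.shtml` is parsed on a fresh download regardless of its extension, which matches the run above.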
$ la localhost/
total 28K
7231393 -rw-r--r-- 1 rootmonk rootmonk 129 Sep 1 21:30 a.shtml
7231394 -rw-r--r-- 1 rootmonk rootmonk 98 Sep 1 21:50 b.html
7231392 -rw-r--r-- 1 rootmonk rootmonk 138 Sep 1 23:57 index.html
7231391 -rw-r--r-- 1 rootmonk rootmonk 0 Sep 1 21:49 robots.txt
$ rm -f localhost/b.html
$ la localhost/
total 20K
7231393 -rw-r--r-- 1 rootmonk rootmonk 129 Sep 1 21:30 a.shtml
7231392 -rw-r--r-- 1 rootmonk rootmonk 138 Sep 1 23:57 index.html
7231391 -rw-r--r-- 1 rootmonk rootmonk 0 Sep 1 21:49 robots.txt
`b.html` deleted.
$ command wget2 -rc --stats-site localhost
[0] Downloading 'http://localhost/robots.txt' ...
Saving 'localhost//robots.txt'
HTTP response 200 OK [http://localhost/robots.txt]
[0] Downloading 'localhost' ...
HTTP ERROR response 416 Requested Range Not Satisfiable [localhost]
URI content encoding = 'iso-8859-1' (set by server response)
[0] Downloading 'http://localhost/a.shtml' ...
HTTP ERROR response 416 Requested Range Not Satisfiable [http://localhost/a.shtml]
Downloaded: 1 files, 1.54K bytes, 0 redirects, 2 errors
Site Statistics:
http://localhost:
Status No. of docs
200 1
http://localhost/robots.txt 0 (identity) : 0 (decompressed)
416 2
localhost 790 (identity) : 389 (decompressed)
http://localhost/a.shtml 790 (identity) : 389 (decompressed)
http://localhost:
localhost
|--http://localhost/robots.txt
|--http://localhost/a.shtml
$ la localhost/
total 20K
7231393 -rw-r--r-- 1 rootmonk rootmonk 129 Sep 1 21:30 a.shtml
7231392 -rw-r--r-- 1 rootmonk rootmonk 138 Sep 1 23:57 index.html
7231391 -rw-r--r-- 1 rootmonk rootmonk 0 Sep 1 21:49 robots.txt
No `b.html` :(
That's because (and please correct me if I'm wrong) we decide whether to parse the local (already downloaded) file from its extension and not from `resp->content_type`.
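A minimal sketch of the behaviour I suspect, in Python rather than wget2's C. It is only a guess at the logic, not the real implementation: `PARSEABLE_EXTENSIONS` and `should_parse_local_file` are hypothetical names, and the extension list is an assumption.

```python
import os

# Hypothetical sketch of an extension-based parse decision for a file
# that is already on disk (e.g. after a 416 response with -c).
# The extension list is an assumption, not taken from wget2.
PARSEABLE_EXTENSIONS = {".html", ".htm"}


def should_parse_local_file(path):
    """Return True if the local file would be scanned for links, judging by extension only."""
    extension = os.path.splitext(path)[1].lower()
    return extension in PARSEABLE_EXTENSIONS


# The two local files from the runs above.
for name in ("localhost/a.html", "localhost/a.shtml"):
    print(name, should_parse_local_file(name))
```

If the 416 path works like this, `a.html` is parsed and `b.html` is fetched again, while `a.shtml` is skipped and `b.html` never reappears, which is what the two runs show.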
Edited by Avinash Sonawane