Recursive download does not work for files larger than 10MiB
Script to generate test data: generate.sh
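The original generate.sh was attached to this report; the following is only a hypothetical reconstruction that reproduces the failure conditions described below (index.html links to test1.html, test1.html is larger than wget2's 10MiB in-memory limit and is the only page that links to test2.html). The 11MiB padding size is an assumption, chosen just above the limit:

```shell
#!/bin/sh
# Sketch of a generate.sh-style script (hypothetical, not the attached original).
set -e
mkdir -p htdocs
printf '<html><body><a href="test1.html">test1</a></body></html>\n' > htdocs/index.html
{
  printf '<html><body><a href="test2.html">test2</a>\n'
  # Pad test1.html past wget2's 10MiB in-memory parsing limit.
  dd if=/dev/zero bs=1M count=11 2>/dev/null | tr '\0' 'x'
  printf '\n</body></html>\n'
} > htdocs/test1.html
printf '<html><body>leaf page</body></html>\n' > htdocs/test2.html
```

Any static file server can then serve the directory on port 8080, e.g. `python3 -m http.server 8080 --directory htdocs` (the server choice is an assumption; the report does not say which one was used).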
test2.html never gets downloaded:
$ wget2 --recursive localhost:8080
[1] Downloading 'http://localhost:8080/robots.txt' ...
HTTP ERROR response 404 Not Found [http://localhost:8080/robots.txt]
[1] Downloading 'localhost:8080' ...
Saving 'localhost//index.html'
HTTP response 200 OK [localhost:8080]
URI content encoding = 'CP1252' (default, encoding not specified)
[0] Downloading 'http://localhost:8080/test1.html' ...
Saving 'localhost//test1.html'
HTTP response 200 OK [http://localhost:8080/test1.html]
URI content encoding = 'CP1252' (default, encoding not specified)
Downloaded: 2 files, 66.76M bytes, 0 redirects, 1 errors
$ find localhost -print
localhost
localhost/index.html
localhost/test1.html
Looking at the code, wget2 parses each downloaded file in memory, and only after it has been downloaded completely. wget2 imposes a 10MiB limit on content kept in memory, but there is no fallback for files larger than that: they are never parsed, so the links they contain are never followed.
Possible solutions:
- Parse files as they are downloaded, postponing any actions that require the file to have been downloaded successfully.
- Save files larger than 10MiB on disk and then parse them.
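As a rough illustration of the first option: links can in principle be extracted from a stream with bounded memory, without ever buffering the whole body. This is a conceptual shell sketch, not wget2 code (wget2 would need the equivalent inside its HTML parser rather than a regex; `extract_links` is a hypothetical helper name, and `grep -o` assumes GNU grep):

```shell
# extract_links: read HTML on stdin, print href targets one per line.
# Memory use is bounded by the longest input line, not the document size,
# so an arbitrarily large page can stream through.
extract_links() {
  grep -o 'href="[^"]*"' | sed 's/^href="//;s/"$//'
}

# Example: a page with padding streams through; only the links are kept.
printf '<html><a href="test2.html">t</a>%s</html>\n' \
  "$(head -c 1000 /dev/zero | tr '\0' x)" | extract_links
```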
Both require a significant amount of changes, as wget2's HTML parser currently requires the entire file to be loaded into memory, even though it should be possible to avoid that. So I thought I'd discuss this here first before working on it.