Recursive download does not work for files larger than 10MiB
Script to generate test data: generate.sh
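For reference, a script roughly along these lines reproduces the layout implied by the output below. This is an approximation, not the attached generate.sh itself: the padding size and the `python3 -m http.server` serving step are my assumptions.

```sh
#!/bin/sh
# Approximate reproducer: index.html links to test1.html (well over 10MiB),
# which in turn links to test2.html. Sizes and server command are assumptions.
set -e
mkdir -p srv && cd srv

printf '<html><body><a href="test1.html">test1</a></body></html>\n' > index.html

{
  printf '<html><body><a href="test2.html">test2</a><p>\n'
  head -c 64M /dev/zero | tr '\0' 'x'   # pad far past the 10MiB limit
  printf '\n</p></body></html>\n'
} > test1.html

printf '<html><body>the end</body></html>\n' > test2.html

# Serve the directory on port 8080 (any static file server will do).
python3 -m http.server 8080
```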
test2.html never gets downloaded:
```
$ wget2 --recursive localhost:8080

Downloading 'http://localhost:8080/robots.txt' ...
HTTP ERROR response 404 Not Found [http://localhost:8080/robots.txt]

Downloading 'localhost:8080' ...
Saving 'localhost//index.html'
HTTP response 200 OK [localhost:8080]
URI content encoding = 'CP1252' (default, encoding not specified)

Downloading 'http://localhost:8080/test1.html' ...
Saving 'localhost//test1.html'
HTTP response 200 OK [http://localhost:8080/test1.html]
URI content encoding = 'CP1252' (default, encoding not specified)
Downloaded: 2 files, 66.76M bytes, 0 redirects, 1 errors

$ find localhost -print
localhost
localhost/index.html
localhost/test1.html
```
I see that wget2 parses each downloaded file for links in memory, and only after the file has been downloaded completely. wget2 imposes a 10MiB limit on the content it keeps in memory, but there is no fallback for files larger than that: they are simply never scanned for links, which is why test2.html is never found and downloaded in the example above. Two approaches come to mind:
- Parse files as they are being downloaded (streaming), postponing any actions that require the download to have completed successfully.
- Save files larger than 10MiB to disk and parse them from there.
Both require a significant amount of changes, because wget2's HTML parser currently insists on having the entire file in memory, even though it should be possible to avoid that. So I thought I'd discuss this here first before working on a patch.
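For anyone hitting this in the meantime, a rough manual workaround might look like the sketch below. It is untested and assumes wget2 accepts the wget-compatible `-O`, `--input-file` and `--base` options; the crude href regex is only good enough for simple pages like the test data.

```sh
# Fetch the oversized page on its own, extract its links with a crude regex,
# and hand them back to wget2 for a second pass.
wget2 -O test1.html http://localhost:8080/test1.html
grep -Eo 'href="[^"]*"' test1.html | sed -e 's/^href="//' -e 's/"$//' > links.txt
wget2 --base=http://localhost:8080/ --input-file=links.txt
```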