wget2 issues (https://gitlab.com/gnuwget/wget2/-/issues)

Issue #617: wget2 segmentation fault in wget_hpkp_db_check_pubkey
Fazal Majid, 2022-12-02T11:03:22Z, https://gitlab.com/gnuwget/wget2/-/issues/617

I see this behavior on both Alpine Linux and macOS Monterey:
```
zulfiqar ~/build>wget2 -c http://ftp.gnome.org/pub/gnome/sources/glib/2.74/glib-2.74.2.tar.xz
[0] Downloading 'http://ftp.gnome.org/pub/gnome/sources/glib/2.74/glib-2.74.2.tar.xz' ...
HTTP response 302 Found [http://ftp.gnome.org/pub/gnome/sources/glib/2.74/glib-2.74.2.tar.xz]
Adding URL: https://ftp.gnome.org/pub/gnome/sources/glib/2.74/glib-2.74.2.tar.xz
[0] Downloading 'https://ftp.gnome.org/pub/gnome/sources/glib/2.74/glib-2.74.2.tar.xz' ...
HTTP response 301 Moved Permanently [https://ftp.gnome.org/pub/gnome/sources/glib/2.74/glib-2.74.2.tar.xz]
Adding URL: https://download.gnome.org/sources/glib/2.74/glib-2.74.2.tar.xz
[0] Downloading 'https://download.gnome.org/sources/glib/2.74/glib-2.74.2.tar.xz' ...
HTTP response 302 Found [https://download.gnome.org/sources/glib/2.74/glib-2.74.2.tar.xz]
Adding URL: https://ftp2.nluug.nl/windowing/gnome/sources/glib/2.74/glib-2.74.2.tar.xz
Adding URL: https://fr.rpmfind.net/linux/gnome.org/sources/glib/2.74/glib-2.74.2.tar.xz
[1] Downloading 'https://fr.rpmfind.net/linux/gnome.org/sources/glib/2.74/glib-2.74.2.tar.xz' ...
Saving 'glib-2.74.2.tar.xz'
Segmentation fault (core dumped)
zulfiqar ~/build>gdb /usr/local/bin/wget2 core
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/bin/wget2...
(No debugging symbols found in /usr/local/bin/wget2)
[New LWP 27187]
[New LWP 27188]
[New LWP 27186]
Core was generated by `wget2 -c http://ftp.gnome.org/pub/gnome/sources/glib/2.74/glib-2.74.2.tar.xz'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f23e69421e3 in wget_hpkp_db_check_pubkey ()
from /usr/local/lib/libwget.so.1
[Current thread is 1 (LWP 27187)]
(gdb) bt
#0 0x00007f23e69421e3 in wget_hpkp_db_check_pubkey ()
from /usr/local/lib/libwget.so.1
#1 0x00007f23e6957267 in openssl_revocation_check_fn ()
from /usr/local/lib/libwget.so.1
#2 0x00007f23e641faca in internal_verify ()
from /usr/local/ssl/lib/libcrypto.so.3
#3 0x00007f23e64211c1 in verify_chain ()
from /usr/local/ssl/lib/libcrypto.so.3
#4 0x00007f23e6421e3c in X509_verify_cert ()
from /usr/local/ssl/lib/libcrypto.so.3
#5 0x00007f23e638d996 in ocsp_verify_signer ()
from /usr/local/ssl/lib/libcrypto.so.3
#6 0x00007f23e638e09c in OCSP_basic_verify ()
from /usr/local/ssl/lib/libcrypto.so.3
#7 0x00007f23e6956af9 in check_ocsp_response ()
from /usr/local/lib/libwget.so.1
#8 0x00007f23e6957734 in openssl_revocation_check_fn ()
from /usr/local/lib/libwget.so.1
#9 0x00007f23e641faca in internal_verify ()
from /usr/local/ssl/lib/libcrypto.so.3
#10 0x00007f23e64211c1 in verify_chain ()
from /usr/local/ssl/lib/libcrypto.so.3
#11 0x00007f23e6421e3c in X509_verify_cert ()
from /usr/local/ssl/lib/libcrypto.so.3
#12 0x00007f23e65eb5f0 in ssl_verify_cert_chain ()
from /usr/local/ssl/lib/libssl.so.3
#13 0x00007f23e662acee in tls_post_process_server_certificate ()
from /usr/local/ssl/lib/libssl.so.3
#14 0x00007f23e6627185 in state_machine () from /usr/local/ssl/lib/libssl.so.3
#15 0x00007f23e6958601 in wget_ssl_open () from /usr/local/lib/libwget.so.1
#16 0x00007f23e694d6b2 in wget_tcp_connect () from /usr/local/lib/libwget.so.1
--Type <RET> for more, q to quit, c to continue without paging--
#17 0x00007f23e6946134 in wget_http_open () from /usr/local/lib/libwget.so.1
#18 0x0000559bf5b1577c in ?? ()
#19 0x0000559bf5b1da45 in ?? ()
#20 0x00007f23e6a0508b in ?? () from /lib/ld-musl-x86_64.so.1
#21 0x0000000000000000 in ?? ()
```

Issue #616: "https_proxy=http://local-squid-instance:3128/ wget2 https://google.com" crashes wget2
Askar Safin, 2022-11-19T18:03:40Z, https://gitlab.com/gnuwget/wget2/-/issues/616

Steps to reproduce:
Install local squid proxy in its (Debian) default configuration. Then type:
```bash
https_proxy=http://localhost:3128/ wget2 https://google.com
```
wget2 prints nothing and returns exit code 5, i.e. the above command causes wget2 to crash without any output.
Full reproducing steps and "--debug" output here: https://builds.sr.ht/~safinaskar/job/877314 (click "view manifest »" to view full commented script).
My wget2 version is 1.99.1.

Issue #615: feature request: no_proxy cidr
Mathieu CARBONNEAUX, 2023-05-22T17:55:55Z, https://gitlab.com/gnuwget/wget2/-/issues/615

It would be very useful to have CIDR support here, so that direct IP addresses within a given range (CIDR) can be excluded from going through the proxy.
Go supports this, for example: https://github.com/golang/net/commit/c21de06aaf072cea07f3a65d6970e5c7d8b6cd6d
Example:
```
export no_proxy = "10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"
```
That way, when running `wget http://10.10.10.1`, wget would go directly to the address 10.10.10.1 and not through the proxy.
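For illustration, matching an IPv4 address against such a no_proxy CIDR entry comes down to comparing the address and the network under the prefix mask. A minimal sketch, assuming IPv4 only; the helper name below is invented for this example and is not wget2 code:

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical helper: return 1 if `ip` (dotted quad) falls inside `net`/`prefix`,
 * e.g. no_proxy_cidr_match("10.10.10.1", "10.0.0.0", 8) == 1. Illustration only. */
static int no_proxy_cidr_match(const char *ip, const char *net, unsigned prefix)
{
	struct in_addr a, n;

	if (prefix > 32 || inet_pton(AF_INET, ip, &a) != 1 || inet_pton(AF_INET, net, &n) != 1)
		return 0;

	/* build the network-order prefix mask and compare both addresses under it */
	uint32_t mask = prefix ? htonl(~0u << (32 - prefix)) : 0;
	return (a.s_addr & mask) == (n.s_addr & mask);
}
```

IPv6 entries would need the analogous 128-bit prefix comparison.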

Issue #614: Bug: OpenSSL does not safely share some user data among threads
Ander Juaristi, 2022-10-08T23:20:43Z, https://gitlab.com/gnuwget/wget2/-/issues/614

The `vflags` structure is being shared among threads in a completely unsafe manner.
This is because it is being set as "user data" globally such that all threads share the same pointer (`_ctx` is a global variable).
```
store = SSL_CTX_get_cert_store(_ctx);
if (!store) {
}
if (!X509_STORE_set_ex_data(store, store_userdata_idx, (void *) vflags)) {
	retval = WGET_E_UNKNOWN;
	goto bail;
}
```
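For what it's worth, a common OpenSSL pattern for this kind of per-connection state is to hang the data off the `SSL` handle with `SSL_set_ex_data()` and recover it inside the verification callback, instead of storing one shared pointer on the global `X509_STORE`. A sketch only, assuming each thread drives its own `SSL` object and an `SSL_CTX_set_verify()`-style callback; `vflags_idx` and the callback name are placeholders, not wget2 identifiers:

```c
#include <openssl/ssl.h>
#include <openssl/x509_vfy.h>

/* Hypothetical ex_data index, obtained once via SSL_get_ex_new_index() and used
 * with SSL_set_ex_data(ssl, vflags_idx, vflags) when each connection is set up. */
static int vflags_idx = -1;

static int verify_cb(int preverify_ok, X509_STORE_CTX *store_ctx)
{
	/* OpenSSL stores the SSL handle of the connection being verified in the
	 * X509_STORE_CTX ex_data at a well-known index, so the callback can look
	 * up per-connection data instead of dereferencing a shared pointer. */
	SSL *ssl = X509_STORE_CTX_get_ex_data(store_ctx,
			SSL_get_ex_data_X509_STORE_CTX_idx());
	void *vflags = ssl ? SSL_get_ex_data(ssl, vflags_idx) : NULL;

	(void) vflags; /* ... consult the per-connection vflags here ... */
	return preverify_ok;
}
```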

Issue #613: Support srcdoc HTML element
Tim Rühsen, 2022-09-04T17:47:42Z, https://gitlab.com/gnuwget/wget2/-/issues/613

**From the wget mailing list (3.9.2022):**
I tried searching the mailing list, bug tracker and source code for srcdoc support,
seems it is missing. If I missed something, please don’t hesitate to point it here.
I’m using GNU Wget 1.21.3, trying to archive some websites for posterity.
The only missing feature for my case is this one.
For example, suppose the HTML file at https://example.com/subfolder/about.html contains:
```
<!DOCTYPE html>
<iframe srcdoc="
<img src=relative.jpg>
<img src="/absolute.jpg">
"></iframe>
```
The expected behavior is that it selects the following for download:
https://example.com/subfolder/relative.jpg
https://example.com/absolute.jpg
documentation:
https://html.spec.whatwg.org/multipage/iframe-embed-object.html#attr-iframe-srcdoc
Browser support:
https://caniuse.com/?search=srcdoc
Edge case (or absence of one):
It seems that, due to some past oversight, the iframe inherits the parent's base URL:
https://github.com/whatwg/html/issues/8105
Due to backwards compatibility this is not expected to change much.

Issue #612: Failed to link libwget for C++ modules
Michael Lee, 2022-08-14T13:09:59Z, https://gitlab.com/gnuwget/wget2/-/issues/612

On some platforms that have a C++ compiler, dynamic linking against `libwget.so` is reported to fail.
For example, the following program just performs a simple `wget_http_get` API invocation:
```cpp
// tst_wget201_httpget.cc
#include <iostream>
#include <wget.h>
#define URL "https://www.w3.org/"
int main(void)
{
	using namespace std;
	wget_http_response* resp;
	resp = wget_http_get(WGET_HTTP_URL, URL, 0);
	if (resp) { // buffers are always 0 terminated
		cout << "Response from " << URL << endl;
		cout << "HTTP resp code: " << resp->code << endl;
	}
	wget_http_free_response(&resp);
	return 0;
}
```
Compiled with
```bash
g++ tst_wget201_httpget.cc -I$MYPREFIX/include -L$MYPREFIX/lib -lwget -Wl,-rpath,$MYPREFIX/lib
```
Compiler complains:
```bash
include/wget.h:334:47: error: expected ‘,’ or ‘...’ before ‘src’
334 | wget_memtohex(const unsigned char * restrict src, size_t src_len, char * restrict dst, size_t dst_size);
| ^~~
include/wget.h:356:54: error: expected ‘,’ or ‘...’ before ‘fmt’
356 | wget_vpopenf(const char *type, const char *restrict fmt, va_list args) WGET_GCC_PRINTF_FORMAT(2,0);
| ^~~
(... other similar error)
```
And this is true for both version 2.0.1 and the latest snapshot.
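The errors point at the C99 `restrict` qualifier in the libwget prototypes; `restrict` is not a keyword in C++, so g++ rejects those parameter declarations. As a consumer-side workaround (a sketch only, assuming g++ or clang++; this is not an official libwget fix), `restrict` can be mapped to the compilers' `__restrict` extension before the header is included:

```cpp
// Workaround sketch for compiling the test program above as C++ (hypothetical,
// not an official libwget fix): C99 'restrict' is not a C++ keyword, so map it
// to the GNU/Clang '__restrict' extension (or define it to nothing) before
// pulling in wget.h.
#ifdef __cplusplus
#  define restrict __restrict
#endif
#include <wget.h>
```

A more durable fix would presumably live in `wget.h` itself, guarding `restrict` for C++ consumers.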

Issue #611: --stats-site=csv:file: `,` character messes up the result
Tim Rühsen, 2022-07-31T09:37:07Z, https://gitlab.com/gnuwget/wget2/-/issues/611

We need classical CSV character escaping.
And/or a JSON output (https://datatracker.ietf.org/doc/html/rfc8259).
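For reference, classical CSV escaping (RFC 4180 style) quotes any field that contains a comma, double quote, CR, or LF, and doubles embedded quotes. A minimal sketch of such an escaper, illustrative only and not wget2 code (the function name is invented):

```c
#include <stdio.h>
#include <string.h>

/* Print one CSV field with RFC 4180-style quoting (illustration only). */
static void csv_print_field(FILE *out, const char *field)
{
	if (strpbrk(field, ",\"\r\n")) {
		fputc('"', out);
		for (const char *p = field; *p; p++) {
			if (*p == '"')
				fputc('"', out); /* double any embedded quote */
			fputc(*p, out);
		}
		fputc('"', out);
	} else {
		fputs(field, out);
	}
}
```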

Issue #610: Add a test for long filename handling
Tim Rühsen, 2022-07-30T17:40:15Z, https://gitlab.com/gnuwget/wget2/-/issues/610

From IRC/matrix:
```
Hi! wget seems to have a bug. When downloading a file with a 240char long filename it renames the file to a 236char long one and drops the extension. This is despite the fact that ext4 supports 240char long filenames.
In addition to that if downloading this file with --mirror option the filename is truncated even more, because wget incorrectly assumes the path counts to the filesystem filename char limit.
This is very sad as I need to download a few dozen thousands of small files and the filenames have to be exactly the same. Is there a way to prevent wget from renaming files or prevent it from downloading files with filenames too long or deal with this somehow else?
```

Issue #609: HTTP Response 0 flooding.
marcel dope, 2022-08-01T21:10:31Z, https://gitlab.com/gnuwget/wget2/-/issues/609

wget2 v2.0.1 on OpenSUSE Tumbleweed.
Trivia:
wget1 has a bug where it truncates filenames even though they aren't longer than the 255-char limit imposed by the filesystem. Downloading a file with a 240-char-long name truncated it to 236 chars, and downloading the same file in mirror mode (recursive plus creating directories mirroring the whole path) truncated it to 207 chars. The latter is because wget1 erroneously counts the path toward the filename char limit.
It's of utmost importance to me that the mirror I create be size, metadata and filename equal to the remote copy.
I tried the wget2 command shared below and it doesn't seem to have this bug, and it's magnitudes better in every aspect, so thank you for this.
BTW I've done extensive testing of wget2's behavior and found these differences from the documentation / expected behavior:
- `-R "index.html*"` - this option is ignored, indexes are downloaded anyway
- from the manpage of `--force-progress`: `This option will also force the progress bar to be printed to stderr when used alongside the --output-file option.` - this doesn't work, no progress bar
- `--progress=bar` - if specified nothing will be saved with -o output or if stdout is redirected to a file
- `--stats-all` - not implemented
- `--stats-site=csv:file` - `,` character messes up the result
- `--tries` - this option is ignored, with `-t inf` a re-run of wget2 I've done to fetch a few missing files resulted in timeout errors, despite the fact that the run took only a few minutes
**The problem:**
I use this command to mirror a website containing 10k small files: `wget2 -rNl inf -np --no-if-modified-since --retry-connrefused --waitretry=3600 --retry-on-http-error=*,\!404 --https-enforce=hard -R "index.html*" --fsync-policy=on --random-wait --max-threads=1 -t inf --backups=99 -w 1 URL`. It works perfectly for a while and then I'm flooded with
```
[0] Checking 'URL1' ...
HTTP response 0 [URL1]
[0] Downloading 'URL1' ...
HTTP response 0 [URL1]
```
or
```
[0] Downloading 'URL2' ...
HTTP response 0 [URL2]
```
The `--stats-site` output is
```
Status ms Size URL
0 0 0 FILE1
```
or
```
Status ms Size URL
0 1 0 FILE2
```
for each try.
The tries seem to happen very fast (it could very well be 1000 tries per second) judging by the output and output's filesize increase.
Aborting the wget2 process and restarting it results in HTTP responses 200 again (and eventually HTTP responses 0 again). This suggests to me that wget2 may be flooding the server/cloudflare/my openwrt router with hundreds of requests each second sabotaging the mirroring process when it could simply wait a second/minute/hour and it would get a 200 response. I wish the `--waitretry=3600` and -w 1 I specified would apply here. It's likely that raising wait time between each try when I get 0 response would fix this, enabling the mirror to succeed. Unfortunately currently I have to re-parse and check (size, timestamp) all the 10k files again to continue mirroring. This highly increases the load on the server.
Questions:
1. Why do I get HTTP Response 0 in the first place? What does it mean?
2. Is this a wget2 bug, openwrt misfeature or a real cloudflare/server response?
Note that I don't think I ever got a HTTP Response 0 when testing mirroring the same website with wget1.
Suggestions:
Make `--retry-connrefused --retry-on-http-error --random-wait` all work for every request made by wget2, not just the downloads, and make the wait time for all of these rise from the `-w` number up to the `--waitretry` number. I suggest `--waitretry` shouldn't increase the value by just 1 second each try, but by multiplying the value by 2, e.g. 1st try = 1s, 2nd try = 2s, 3rd try = 4s.
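A doubling schedule like that is plain exponential backoff capped at a maximum; a minimal sketch of the proposed behavior (a hypothetical helper, not wget2 code):

```c
/* Hypothetical helper illustrating the proposed schedule: 1s, 2s, 4s, ...
 * doubling after every failed try, capped at the --waitretry maximum. */
int next_retry_wait(int current_wait, int waitretry_max)
{
	int next = current_wait > 0 ? current_wait * 2 : 1;
	return next > waitretry_max ? waitretry_max : next;
}
```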
wget2 could also postpone retries optionally or by default, ie. if you get a timeout, wait the retry wait time and try another file only to get back to the previous one once x time passes or other downloads finish.
Sort the `--stats-site` output in ascending/descending order based on HTTP code, that would greatly simplify verifying what errors occured during the operation if high number of files is involved.
Update:
I re-read the docs and did more testing. The `--stats-site` manpage entry describes `Status` as "HTTP response code or 0 if not applicable", but I get mixed signals from wget2. When I output these stats in CSV format, the Status column contains 0 values, so judging by the manpage it means it's not an HTTP response at all, but some unspecced internal wget2 thing?! But then, wget2's stdout clearly says that it got an HTTP response 0. I wonder what it all means and suspect one of these outputs is lying. If this indeed isn't a real HTTP response, but a wget2 internal thing, then I don't think `--retry-on-http-error` should apply to it as it does now.
I seem to get counter-intuitive results from my testing. Speeding up the mirroring process (ie. lowering/removing -w and specifying more threads) seems to lead to success or at least getting the 0 responses after downloading significantly more of the website despite much more load on everything involved. It's as if a long runtime triggered some wget2 bug. But... when I specify a high -w, no random wait and only 1 thread, the mirroring seems to succeed too.
`--tries` doesn't work - this option is ignored; with `-t inf`, a re-run of wget2 that I did to fetch a few missing files resulted in timeout errors, despite the fact that the run took only a few minutes.

Issue #608: Incorrect libwget version (2.1.0) in 2.0.1 and master?
Bernard Cafarelli, 2022-07-19T17:04:26Z, https://gitlab.com/gnuwget/wget2/-/issues/608

For 2.0.1 (and still valid in current master), the libwget version is indicated as 2.1.0 since https://gitlab.com/gnuwget/wget2/-/commit/3d25515253c26f0510a1a10882a668ca81fcc4d6
Checking the changelog, I suppose this is a typo, but I preferred to double-check here (this discrepancy was spotted in a Gentoo QA run).
Downstream bug: https://bugs.gentoo.org/858575

Issue #607: Check if robots.txt parser complies with standardized format
Tim Rühsen, 2022-09-20T09:39:16Z, https://gitlab.com/gnuwget/wget2/-/issues/607

At https://github.com/rockdaboot/wget2/issues/263 we came up with the idea of checking the existing parser against the proposed robots.txt standard.
Some people at Google created an RFC draft. I guess it won't change much until it is finalized, so we can create a new parser.
See https://datatracker.ietf.org/doc/draft-koster-rep/
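For reference, the central precedence rule in that draft (since published as RFC 9309) is that among all Allow/Disallow rules whose path matches the requested path, the longest (most specific) match wins, with Allow winning ties. A minimal sketch of that rule, ignoring the `*` and `$` wildcards, with made-up types and names rather than wget2's actual parser:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative types and names, not wget2's parser. */
struct rep_rule { const char *path; int allow; };

/* Longest-match precedence: among rules whose path is a prefix of the request
 * path, the longest one wins; an Allow rule wins a tie. Wildcards are ignored. */
static int rep_allowed(const struct rep_rule *rules, size_t n, const char *path)
{
	size_t best_len = 0;
	int allowed = 1; /* no matching rule: crawling is allowed */

	for (size_t i = 0; i < n; i++) {
		size_t len = strlen(rules[i].path);
		if (strncmp(path, rules[i].path, len) != 0)
			continue;
		if (len > best_len || (len == best_len && rules[i].allow)) {
			best_len = len;
			allowed = rules[i].allow;
		}
	}
	return allowed;
}
```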

Issue #606: wget2 with --no-clobber truncates existing files
Marco Marsala, 2022-06-29T05:38:32Z, https://gitlab.com/gnuwget/wget2/-/issues/606

Consider this working example:
```
# wget2 --no-clobber "running-gnu-parallel-while-files-exist" -O "/root/stackoverflow.html" --base "https://stackoverflow.com/questions/19571996/"
[0] Downloading 'https://stackoverflow.com/questions/19571996/running-gnu-parallel-while-files-exist' ...
File '/root/stackoverflow.html' already there; not retrieving.
HTTP response 200 OK [https://stackoverflow.com/questions/19571996/running-gnu-parallel-while-files-exist]
```
If /root/stackoverflow.html already exists, it gets truncated.
It doesn't happen with wget 1.
My wget version is 2.0.0:
```
# wget2 --version
GNU Wget2 2.0.0 - multithreaded metalink/file/website downloader
+digest +https +ssl/gnutls +ipv6 +iri +large-file +nls -ntlm -opie -psl -hsts
+iconv +idn +zlib -lzma -brotlidec -zstd -bzip2 -lzip -http2 -gpgme
```

Issue #605: Multithreading Not Working
Anthony Villasenor, 2022-06-27T21:16:50Z, https://gitlab.com/gnuwget/wget2/-/issues/605

I want to download `abc.com/1.pdf, abc.com/2.pdf, ..., abc.com/999.pdf` using multithreading. I ran this on PowerShell but the brace expansion did not work as expected.
`.\wget2 --max-threads=10 "https://abc.com/{1..999}.pdf"`
Instead of the expected outcome, wget2 tries to get a single file at the URL `https://abc.com/{1..999}.pdf` and fails. How can I fix this issue?

Issue #604: How to perform wget -L in wget2
abdulbadii1, 2022-06-25T16:46:00Z, https://gitlab.com/gnuwget/wget2/-/issues/604

How do I perform `wget -L`, i.e. follow relative links only, in wget2? And please list the wget options that are not in wget2.

Issue #603: What is wget2_noinstall
abdulbadii1, 2022-06-10T16:10:45Z, https://gitlab.com/gnuwget/wget2/-/issues/603

Why is there a wget2_noinstall, and how does it differ from wget2 in the build result?

Issue #602: Option -nd i.e. no directory in wget equivalence
abdulbadii1, 2022-06-04T14:04:30Z, https://gitlab.com/gnuwget/wget2/-/issues/602

What would be the equivalent of wget's option -nd, i.e. no directories, now in wget2?

Issue #601: URLs containing '?' don't work in browser
Simone Dotto, 2022-05-15T19:02:40Z, https://gitlab.com/gnuwget/wget2/-/issues/601

In brief: when a URL contains a '?' it doesn't work in the browser, but if I manually escape it to '%3F' then it does.
The command I'm running is the following:
```
wget2 --page-requisites --convert-links --wait=3 "$page"
```
The question is: shouldn't --convert-links take care of that problem, i.e. escaping? Is there an alternative way to solve this?
I don't want to open a useless issue, but I can't find any mention of escaping URLs in the manpage; escaping is only mentioned in regard to filenames.
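For context on what the conversion step has to do here: a page saved locally as `page?id=3` must be referenced as `page%3Fid=3` in the converted HTML, otherwise the browser treats everything after the '?' as a query string rather than as part of the file name. A minimal sketch of that escaping, illustrative only and not wget2's --convert-links code (the function name is invented):

```c
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of `link` with every '?' percent-encoded as
 * "%3F", so a browser loads the local file instead of splitting off a query. */
static char *escape_question_marks(const char *link)
{
	size_t extra = 0;
	for (const char *p = link; *p; p++)
		if (*p == '?')
			extra += 2; /* "?" (1 byte) becomes "%3F" (3 bytes) */

	char *out = malloc(strlen(link) + extra + 1);
	if (!out)
		return NULL;

	char *q = out;
	for (const char *p = link; *p; p++) {
		if (*p == '?') {
			memcpy(q, "%3F", 3);
			q += 3;
		} else {
			*q++ = *p;
		}
	}
	*q = 0;
	return out;
}
```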

Issue #600: Did not fall back to ipv4
sezanzeb, 2022-05-21T08:50:15Z, https://gitlab.com/gnuwget/wget2/-/issues/600

Hi, this is the original post: https://unix.stackexchange.com/questions/701115/cant-wget-my-server-but-curl-works/701130?noredirect=1#comment1325736_701130
Someone asked me to open an issue here, since wget2 didn't fall back to IPv4.
I think the server I tested this on probably still has this issue; I'll make sure the configuration is not changed too soon, so that this stays reproducible for me. I'd prefer not to share the server's IP though; please let me know if you need me to run anything to test stuff.

Issue #599: wget_buffer_init: seeding initial data
Avinash Sonawane, 2022-04-28T01:37:46Z, https://gitlab.com/gnuwget/wget2/-/issues/599

Hello!
I'm looking closely at `wget_buffer_init()` and I have a few queries where the docs and code don't match.
Before we begin, I've made this small obvious refactoring change which doesn't change the existing behavior at all:
```
@@ -155,7 +155,6 @@ int wget_buffer_init(wget_buffer *buf, char *data, size_t size)
if (data && likely(size)) {
buf->size = size - 1;
buf->data = data;
- *buf->data = 0; // always 0 terminate data to allow string functions
buf->release_data = 0;
} else {
if (!size)
@@ -165,10 +164,10 @@ int wget_buffer_init(wget_buffer *buf, char *data, size_t size)
buf->error = 1;
return WGET_E_MEMORY;
}
- *buf->data = 0; // always 0 terminate data to allow string functions
buf->release_data = 1;
}
+ *buf->data = 0; // always 0 terminate data to allow string functions
buf->error = 0;
buf->release_buf = 0;
buf->length = 0;
```
Okay, so the docs say "You may provide some `data` to fill the buffer with it." and "If an existing buffer is provided in `buf`, it will be initialized with the provided `data`"
1. But we don't seem to be doing that. We're always emptying the data. `*buf->data = 0;` and `buf->length = 0;`
2. If we do want to fill the buffer with the provided `data` then:
1. we should ask the user to zero-terminate the data before calling `wget_buffer_init()`
2. we'll be needing the length of the `data` along with the `size` as `length of data <= size of data - 1`. We can calculate length using `strlen()` in `wget_buffer_init()` (preferred) or can get it from the user via an extra parameter.
3. What should happen if there is valid `0`-terminated `data` but passed `size` is `0`?
I think, in this case we should allocate the appropriate amount of memory for `buf->data` and copy the contents of `data` in it. WDYT?
4. Why do we set `buf->size` to one less than the actual buffer size? Why not set `buf->size` to the actual buffer size?

Issue #598: URL parser does unwanted transformations of URL
saur0n, 2022-09-04T16:45:48Z, https://gitlab.com/gnuwget/wget2/-/issues/598

When parsing a URL, in addition to the transformations specified by the standard (https://www.ietf.org/rfc/rfc2396.txt), the IRI parser attempts to transform XML entities like "&quot;" and "&amp;". This is not correct, because XML and HTML entities are part of the XML/HTML standards, not part of the URL standard.
From my point of view, the [part of the code that handles HTML entities](https://gitlab.com/gnuwget/wget2/-/blob/master/libwget/iri.c#L295) should be removed from the URL parser, because such unwanted behaviour might be a source of vulnerabilities in different software systems.
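To make the concern concrete (a constructed example, not taken from the report): a raw URL may legitimately contain the byte sequence `&amp;` as data, for instance a query parameter literally named `amp;page`, or a link that was already HTML-escaped upstream. If the URL parser itself rewrites entities, the request no longer addresses the same resource:

```c
#include <stdio.h>
#include <string.h>

/* Constructed example (not wget2 code): entity decoding inside the URL parser
 * silently changes which resource is requested. */
int main(void)
{
	const char *as_given   = "https://example.com/api?q=1&amp;page=2"; /* parameter "amp;page" */
	const char *as_decoded = "https://example.com/api?q=1&page=2";     /* parameter "page" */

	printf("URL as given:          %s\n", as_given);
	printf("after entity decoding: %s\n", as_decoded);
	printf("same resource? %s\n", strcmp(as_given, as_decoded) == 0 ? "yes" : "no");
	return 0;
}
```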