Possibility to use several accept-regex or reject-regex
TL;DR: I think it would be nice to be able to specify several --reject-regex options (so that a URL is tested against each regex), to allow complex website mirroring without writing one very complicated regex.
One of the things I use wget/wget2 for is downloading whole websites off the Internet, so that I have them offline in case a site goes down.
If a site consists only of static pages, this is an easy task; however, that is rarely the case, since websites and blogs usually have backend code.
Take phpBB as an example: when I archive a phpBB website, I have to prevent the software from fetching pages like ucp.php and every URL containing &p= or an anchor (#), since they may confuse the software and sometimes even cause infinite loops.
Therefore, it would be great if it were possible to specify several --reject-regex options, so that we could write one regex per "condition" instead of a single regex, which becomes very cumbersome when many conditions are involved (my phpBB filters would include 26 conditions, for example...) and is even impossible in some cases.
Then, when wget2 tests a URL, it would check the URL against every --reject-regex in turn. If any one of them matches, the URL is dismissed.
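
To make the proposed behaviour concrete, here is a minimal sketch in Python of "reject if any regex matches". It is only an illustration, not wget2 code: the function names, the example.org URLs, and the choice of an unanchored search are my assumptions; only ucp.php, &p= and # come from the phpBB example above.

```python
import re

# One pattern per "condition". Only these three come from the phpBB example;
# a real filter list would be longer.
REJECT_PATTERNS = [
    r"ucp\.php",  # user control panel pages
    r"&p=",       # per-post links
    r"#",         # URLs carrying an anchor
]


def compile_patterns(patterns):
    """Compile each pattern once so it can be reused for every URL."""
    return [re.compile(p) for p in patterns]


def is_rejected(url, compiled):
    """Return True as soon as any reject pattern is found in the URL.

    Assumption: patterns are searched anywhere in the URL (unanchored).
    """
    return any(rx.search(url) for rx in compiled)


compiled = compile_patterns(REJECT_PATTERNS)

# Hypothetical URLs, for illustration only.
for url in (
    "https://forum.example.org/ucp.php?mode=login",
    "https://forum.example.org/viewtopic.php?t=12&p=345",
    "https://forum.example.org/viewtopic.php?t=12#p345",
    "https://forum.example.org/viewtopic.php?t=12",
):
    print(url, "->", "rejected" if is_rejected(url, compiled) else "kept")
```

Each condition stays a short, readable pattern, and adding a 27th one means appending a line rather than editing one long alternation.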
PS/FYI: In the past, I used HTTrack to mirror websites for offline use, but I stopped because it has had issues for years, like this one, that prevent me from making a proper mirror and are unlikely to be addressed (development looks inactive as of now).