Unskip regex rules in scripts that prepare static DNR rules
Background
When developing the scripts that download filter lists, convert the filters to DNR rules and prepare them for inclusion in the extension's manifest file, we faced a challenge with invalid rules. Regular expression rules were particularly problematic because we couldn't find a way to validate them easily. Even syntactically correct regex rules can be rejected by DNR, either because they use features that RE2 does not support or because they exceed a memory limit set by Chrome. Here is an overview of all filters that were not passing validation at the time. The only reliable way to validate these rules is to give them to the DNR API as dynamic rules and check the API response. However, the scripts do not have access to a browser context and therefore cannot validate them. And if even a single invalid rule slips through and makes it into the manifest file, the extension will fail to install.
Since this problem was more complex than expected and we were on a tight deadline, we decided in adblockpluscore#431 (comment 952834309) to skip regex rules completely: all of them are ignored and no regex rule is output in the final manifest fragment.
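For reference, the browser-side check mentioned above could look roughly like the sketch below. It relies on chrome.declarativeNetRequest.isRegexSupported, which reports whether a regex is accepted and, if not, whether the reason is "syntaxError" or "memoryLimitExceeded". The interface and function names (RegexChecker, explainRejection, checkRegexes) are hypothetical, and the API is injected as a parameter so that the sketch can also run with a stub outside the browser.

```typescript
// Subset of chrome.declarativeNetRequest used here, injected so that a stub
// can stand in for the real API outside an extension.
interface RegexCheckResult {
  isSupported: boolean;
  reason?: string; // Chrome reports "syntaxError" or "memoryLimitExceeded"
}

interface RegexChecker {
  isRegexSupported(options: {
    regex: string;
    isCaseSensitive?: boolean;
  }): Promise<RegexCheckResult>;
}

// Translate the API verdict into a human-readable message.
function explainRejection(result: RegexCheckResult): string {
  if (result.isSupported) {
    return "supported";
  }
  switch (result.reason) {
    case "syntaxError":
      return "rejected: syntax not supported by RE2";
    case "memoryLimitExceeded":
      return "rejected: compiled regex exceeds Chrome's memory limit";
    default:
      return "rejected: unknown reason";
  }
}

// Ask the browser about each regex and collect one verdict per pattern.
async function checkRegexes(
  api: RegexChecker,
  regexes: string[]
): Promise<Map<string, string>> {
  const verdicts = new Map<string, string>();
  for (const regex of regexes) {
    const result = await api.isRegexSupported({ regex });
    verdicts.set(regex, explainRejection(result));
  }
  return verdicts;
}
```

Inside an extension that has access to the DNR API, chrome.declarativeNetRequest itself could be passed as the api argument. The scripts have no such object available, which is exactly the gap the options below try to close.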
Use case
We want regular expression filters back. Even though they represent only a small percentage of the total number of filters, they are more common in some lists, such as EL Russian and EL French, where they are used to block ads on a large number of domains.
What to change
Option 1: Delegate the rule validation to FLOPS
The FLOPS team takes care of fetching all filter list files from their repositories, putting them together using templates, transforming and checking filters when necessary, and outputting the final lists to the servers from which we download them. The process for generating Mv3 lists is already split from Mv2 lists and we already download them from a separate path. FLOPS could add a validation step to the pipeline that converts rules from ABP to DNR and feeds them to a dummy extension that checks each rule against the DNR API. Rules that do not pass the validation are removed and don't make it into the final filter list file. In addition, this team has closer contact with filter list authors and could set up notifications when invalid filters are detected, so authors can fix them quickly.
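The removal and notification part of such a validation step could be sketched as follows. This is only an illustration under assumptions: the verdicts are taken to come from the dummy extension in the same order as the rules, and all type and function names (ConvertedRule, Verdict, applyVerdicts, formatReport) are hypothetical rather than part of any existing FLOPS tooling.

```typescript
// A converted rule together with the ABP filter it came from (hypothetical shape).
interface ConvertedRule {
  filterText: string; // original ABP filter, kept so that authors can be notified
  regexFilter: string; // regex handed to the DNR API
}

// Verdict returned by the dummy extension for one rule.
interface Verdict {
  isSupported: boolean;
  reason?: string;
}

interface ValidationOutcome {
  keptFilters: string[];
  rejected: { filterText: string; reason: string }[];
}

// Pair each rule with the verdict at the same index; keep the filters that
// passed and record the rest with the reason for the rejection.
function applyVerdicts(rules: ConvertedRule[], verdicts: Verdict[]): ValidationOutcome {
  if (rules.length !== verdicts.length) {
    throw new Error("rules and verdicts must have the same length");
  }
  const outcome: ValidationOutcome = { keptFilters: [], rejected: [] };
  rules.forEach((rule, index) => {
    const verdict = verdicts[index];
    if (verdict.isSupported) {
      outcome.keptFilters.push(rule.filterText);
    } else {
      outcome.rejected.push({
        filterText: rule.filterText,
        reason: verdict.reason ?? "unknown",
      });
    }
  });
  return outcome;
}

// Plain-text report that could be sent to filter list authors.
function formatReport(outcome: ValidationOutcome): string {
  return outcome.rejected
    .map((entry) => entry.reason + ": " + entry.filterText)
    .join("\n");
}
```

Only keptFilters would be written to the final filter list file, while the report gives authors the filter text and the rejection reason for each removed filter.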
Option 2: A web app to validate rules
A suggestion that came from the Chrome representative was to create a small application that loads a dummy extension into a Chrome instance and exposes a REST API with an endpoint that accepts rules, validates them and sends the browser's response back. This endpoint could then be called by the scripts to validate all regex rules. If this option is chosen, it would be nice to make the application public so that the content blocker community can also benefit from it.
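From the scripts' side, calling such an endpoint could look like the sketch below. No such service exists yet, so the wire format, the endpoint and every name here (RegexRule, ValidateResponse, PostJson, buildRequest, keepSupported, validateRemotely) are assumptions made purely for illustration. The HTTP call is injected as a function so that the sketch does not depend on a particular client.

```typescript
// Hypothetical wire format for the validation endpoint.
interface RegexRule {
  id: number;
  regexFilter: string;
}

interface ValidateResponse {
  results: { id: number; isSupported: boolean; reason?: string }[];
}

// Minimal "POST a JSON body, get the response body back" signature, so that
// any HTTP client (or a stub) can be plugged in.
type PostJson = (url: string, body: string) => Promise<string>;

// Serialize the rules into the request body.
function buildRequest(rules: RegexRule[]): string {
  return JSON.stringify({ rules });
}

// Keep only the rules that the service explicitly marked as supported.
// Rules missing from the response are dropped as well, because a single
// invalid rule in the manifest would break the installation.
function keepSupported(rules: RegexRule[], responseBody: string): RegexRule[] {
  const response = JSON.parse(responseBody) as ValidateResponse;
  const supportedIds = new Set<number>();
  for (const result of response.results) {
    if (result.isSupported) {
      supportedIds.add(result.id);
    }
  }
  return rules.filter((rule) => supportedIds.has(rule.id));
}

// Full round trip: send the rules, then filter them by the service's answer.
async function validateRemotely(
  post: PostJson,
  endpoint: string,
  rules: RegexRule[]
): Promise<RegexRule[]> {
  const responseBody = await post(endpoint, buildRequest(rules));
  return keepSupported(rules, responseBody);
}
```

Treating an unanswered rule as invalid is a deliberate choice in this sketch: losing one filter is cheaper than shipping a manifest that fails to install.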
Option 3: Validate rules with an RE2 lib
Another possibility is to bring in an RE2 library and try to validate rules on our own, mimicking the same constraints that Chrome has (e.g. max 2KB per rule). This approach has the disadvantage of complicating the build process: it increases build time, adds a platform-specific dependency, and could break in the future if Chrome changes its constraints.
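As a much weaker but dependency-free complement to this option, the scripts could at least reject patterns that contain constructs RE2 is known not to support, namely lookahead, lookbehind and backreferences. The sketch below is a heuristic pre-check only, not a replacement for an RE2 library: it does not parse the full regex grammar and it says nothing about Chrome's memory limit. The function name findUnsupportedRe2Construct is hypothetical.

```typescript
// Scan a pattern for constructs that RE2 rejects. Returns the name of the
// first construct found, or null if none of the checked constructs appear.
// A null result does NOT mean that Chrome will accept the pattern.
function findUnsupportedRe2Construct(pattern: string): string | null {
  let inCharacterClass = false;
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern.charAt(i);
    if (ch === "\\") {
      const next = pattern.charAt(i + 1);
      // \1 to \9 outside a character class is a backreference.
      if (!inCharacterClass && next >= "1" && next <= "9") {
        return "backreference";
      }
      i++; // skip the escaped character
      continue;
    }
    if (inCharacterClass) {
      if (ch === "]") {
        inCharacterClass = false;
      }
      continue;
    }
    if (ch === "[") {
      inCharacterClass = true;
      continue;
    }
    if (ch === "(" && pattern.charAt(i + 1) === "?") {
      const first = pattern.charAt(i + 2);
      const second = pattern.charAt(i + 3);
      if (first === "=" || first === "!") {
        return "lookahead";
      }
      if (first === "<" && (second === "=" || second === "!")) {
        return "lookbehind";
      }
    }
  }
  return null;
}
```

A check like this could only ever be a first filter in front of one of the other options, since the memory limit cannot be estimated without actually compiling the regex.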
Option 4: Load regex rules dynamically instead of statically
This is the approach chosen by uBlock Origin, as described here. Instead of trying to validate rules outside the browser context so that they can be included as static rules in the manifest file, one could wait and only add the rules at run time as dynamic rules. That way we avoid the risk of breaking the installation, and rules can be validated using the DNR API. The obvious downside is that it consumes scarce dynamic rules and brings us closer to the limit of 5K.
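A rough sketch of what this could look like at run time is shown below. It is not how uBlock Origin implements it. It uses the isRegexSupported, getDynamicRules and updateDynamicRules methods of chrome.declarativeNetRequest, injected through a hypothetical DnrApi interface so that a stub can replace them. The limit is the 5K figure from the text, and the sketch assumes that only the existing dynamic rules count against it and that rule IDs do not collide with rules already registered.

```typescript
// Simplified DNR rule shape; only the fields needed here.
interface DnrRule {
  id: number;
  priority?: number;
  action: { type: string };
  condition: { regexFilter?: string; urlFilter?: string };
}

// Subset of chrome.declarativeNetRequest used by this sketch.
interface DnrApi {
  isRegexSupported(options: { regex: string }): Promise<{ isSupported: boolean; reason?: string }>;
  getDynamicRules(): Promise<DnrRule[]>;
  updateDynamicRules(options: { addRules?: DnrRule[]; removeRuleIds?: number[] }): Promise<void>;
}

// Limit taken from the text above (5K dynamic rules); an assumption here.
const DYNAMIC_RULE_LIMIT = 5000;

// Keep only as many candidates as there are free slots left.
function takeWithinBudget(
  candidates: DnrRule[],
  usedSlots: number,
  limit: number = DYNAMIC_RULE_LIMIT
): DnrRule[] {
  const freeSlots = Math.max(0, limit - usedSlots);
  return candidates.slice(0, freeSlots);
}

// Validate regex rules in the browser, then register those that pass and fit.
// Returns the number of rules that were actually added.
async function addRegexRulesDynamically(api: DnrApi, regexRules: DnrRule[]): Promise<number> {
  const supported: DnrRule[] = [];
  for (const rule of regexRules) {
    const regex = rule.condition.regexFilter;
    if (regex === undefined) {
      continue; // not a regex rule, nothing to validate here
    }
    const verdict = await api.isRegexSupported({ regex });
    if (verdict.isSupported) {
      supported.push(rule);
    }
  }
  const existing = await api.getDynamicRules();
  const toAdd = takeWithinBudget(supported, existing.length);
  if (toAdd.length > 0) {
    await api.updateDynamicRules({ addRules: toAdd });
  }
  return toAdd.length;
}
```

The budget check makes the trade-off explicit: every regex rule added this way takes a slot that would otherwise be available for other dynamic rules.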