Allow two-character keywords in matcher
Background
Currently we use the regular expression [a-z0-9%]{3,} for extracting keywords from URLs, so a "keyword" must be at least three characters long. There is no fundamental reason for this limitation; the limiting factors are (1) "bad" keywords (#30 (closed)) and (2) initialization time. Even without filtering out any bad keywords, we still get significantly better performance by allowing two-character keywords.
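To illustrate the effect of the length limit, here is a minimal sketch that applies the current and the two-character variant of the expression to a lowercased URL. The helper name keywordCandidates is hypothetical and is not taken from lib/matcher.js; only the character class and the length limits come from this issue.

```javascript
// The character class from this issue with both length limits. The global
// flag is added so that String.prototype.match() returns every match.
const currentKeywordRegExp = /[a-z0-9%]{3,}/g;
const twoCharKeywordRegExp = /[a-z0-9%]{2,}/g;

// Hypothetical helper (not part of lib/matcher.js): returns all keyword
// candidates found in a URL.
function keywordCandidates(url, regExp) {
  return url.toLowerCase().match(regExp) || [];
}

const sampleUrl = "https://example.com/foo.js";
console.log(keywordCandidates(sampleUrl, currentKeywordRegExp));
console.log(keywordCandidates(sampleUrl, twoCharKeywordRegExp));
```

With the three-character limit, the trailing js in the URL is not a candidate; with a two-character limit it is.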
The cost is in initialization time, which we can measure by counting the total number of keyword candidates in all the blocking and whitelist filters.
Here is some data:
- Default configuration of Adblock Plus (mainly EasyList plus Acceptable Ads):
  - Excluding two-character keywords: ~133K keyword candidates
  - Including two-character keywords: ~151K keyword candidates
- Default configuration plus EasyPrivacy and Fanboy's Social Blocking List:
  - Excluding two-character keywords: ~182K keyword candidates
  - Including two-character keywords: ~207K keyword candidates
- Default configuration plus EasyPrivacy and Fanboy's Social Blocking List, plus EasyList Germany and EasyList China:
  - Excluding two-character keywords: ~222K keyword candidates
  - Including two-character keywords: ~256K keyword candidates
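For a sense of scale, the relative increase can be worked out from the figures above. This is a small sketch using only the approximate counts listed here (in thousands); the variable and function names are made up for illustration, and the results are only as precise as the rounded inputs.

```javascript
// Approximate keyword candidate counts from the list above, in thousands.
const measurements = [
  {config: "Default", excluding: 133, including: 151},
  {config: "Default + EasyPrivacy + Fanboy's Social", excluding: 182, including: 207},
  {config: "Previous + EasyList Germany + EasyList China", excluding: 222, including: 256}
];

// Relative growth in keyword candidates, in percent.
function relativeIncrease(excluding, including) {
  return (including - excluding) / excluding * 100;
}

for (const {config, excluding, including} of measurements) {
  const percent = relativeIncrease(excluding, including).toFixed(1);
  console.log(`${config}: +${percent}%`);
}
```

This works out to roughly 13.5%, 13.7% and 15.3% more keyword candidates for the three configurations.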
Allowing two-character keywords obviously increases the number of keyword candidates, but the practical cost is small relative to the performance benefit, especially on desktop platforms. In benchmarks, this change improves request blocking performance by ~7.5%. We are also looking to make more general improvements to initialization time, which should offset this cost.
I would therefore like to propose that we change the regular expression to [a-z0-9%]{2,} or [a-z0-9%][a-z0-9%]+ instead. If #30 (closed) is accepted, we may also want to add js to the static list of bad keywords.
What to change
Change the regular expression for matching a keyword in lib/matcher.js from [a-z0-9%]{3,} to either [a-z0-9%]{2,} or [a-z0-9%][a-z0-9%]+ instead.
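The two proposed spellings should match exactly the same substrings, so the choice between them is stylistic. A quick sketch to check this on a few sample strings follows; the sample URLs and the helper sameMatches are illustrative assumptions, not part of the code base.

```javascript
// The two alternative spellings proposed in this issue, with the global flag
// added so that match() returns every match.
const variantA = /[a-z0-9%]{2,}/g;
const variantB = /[a-z0-9%][a-z0-9%]+/g;

// Hypothetical helper: true if both variants yield the same list of matches.
function sameMatches(text) {
  const a = text.match(variantA) || [];
  const b = text.match(variantB) || [];
  return a.length === b.length && a.every((match, i) => match === b[i]);
}

// Made-up sample inputs, including strings with no match at all.
const samples = [
  "https://example.com/foo.js",
  "https://example.com/a/bc/def?x=1&yz=%20",
  "a",
  ""
];

for (const sample of samples) {
  console.log(JSON.stringify(sample), sameMatches(sample));
}
```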
Hints for testers
Make sure request blocking is working correctly in general.
Unsubscribe from all lists and add only two filters, /foo.js| and /bar.js|, and nothing else. The requests for both <script src="https://example.com/foo.js"> and <script src="https://example.com/bar.js"> should be blocked.
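The expected outcome of the second check can be written down as a small sketch. The function matchesEndAnchoredFilter is a hypothetical, heavily simplified stand-in for the real matcher: it only handles a literal pattern with an optional trailing | (the end-of-address anchor) and is meant to document what testers should observe, not how lib/matcher.js works.

```javascript
// Hypothetical, simplified filter check (NOT the real matcher): a trailing
// "|" anchors the literal pattern to the end of the URL; otherwise the
// pattern may appear anywhere in the URL.
function matchesEndAnchoredFilter(filterText, url) {
  if (filterText.endsWith("|")) {
    return url.endsWith(filterText.slice(0, -1));
  }
  return url.includes(filterText);
}

// The two filters and the two requests from the hint above.
const testFilters = ["/foo.js|", "/bar.js|"];
const testUrls = ["https://example.com/foo.js", "https://example.com/bar.js"];

// A request counts as blocked if any of the filters matches it.
function isBlocked(url) {
  return testFilters.some(filter => matchesEndAnchoredFilter(filter, url));
}

for (const url of testUrls) {
  console.log(url, isBlocked(url) ? "blocked" : "not blocked");
}
```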