Readability seems to trip over `class="sidebar-visible"`
This one is a bit puzzling. The following URL turned up in one of my aggregator feeds: https://sdiehl.github.io/zero-to-qed/01_introduction.html; it is apparently a website generated with [mdBook](https://github.com/rust-lang/mdBook). Unfortunately, article-scraper panics on it.
Manually minimising it yields the following snippet:
```html
<!DOCTYPE HTML>
<html class="sidebar-visible">
<body>
</body>
</html>
```
<details>
<summary>Output of crash as of 48bbd21b20e6b00d3bcc03ef5a0750229902fb4a</summary>
```
RUST_BACKTRACE=1 cargo run -- --debug ftr --html ./foo.html
warning: unused import: `command`
--> article_scraper_cli/src/args.rs:1:12
|
1 | use clap::{command, Parser, Subcommand};
| ^^^^^^^
|
= note: `#[warn(unused_imports)]` (part of `#[warn(unused)]`) on by default
warning: `article_scraper_cli` (bin "article_scraper_cli") generated 1 warning (run `cargo fix --bin "article_scraper_cli" -p article_scraper_cli` to apply 1 suggestion)
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.18s
Running `target/debug/article_scraper_cli --debug ftr --html ../full-text-rs/rss-examples/foo.html`
14:03:53 [WARN] No config found for url 'http://fakehost/test/base/'
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[@property="og:title"]/@content' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//title' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dc:title')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dcterm:title')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'og:title')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'weibo:article:title')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'weibo:webpage:title')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'twitter:title')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//author' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dc:creator')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dcterm:creator')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[@property="article:published_time"]/@content' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'twitter:image')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'og:image')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@property, 'twitter:image')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@property, 'og:image')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link[contains(@rel, 'image_src')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link[@rel='image_src']' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//h1' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//h2' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//font' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//table' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class, 'google-dfp-ad-wrapper')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class, 'sharedaddy') or contains(@id, 'sharedaddy')][not(ancestor::*[contains(@class, 'sharedaddy') or contains(@id, 'sharedaddy')])]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class, 'i-amphtml-replaced-content') or contains(@id, 'i-amphtml-replaced-content')][not(ancestor::*[contains(@class, 'i-amphtml-replaced-content') or contains(@id, 'i-amphtml-replaced-content')])]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[contains(@src,'doubleclick.net')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//noscript' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//noscript' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//picture' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//figure' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe[contains(@src, 'youtube.com')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a[@onclick]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[@decoding]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[@loading]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class,' entry-unrelated ') or contains(@class,' instapaper_ignore ')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@style,'display:none')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@style,'display: none')]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[@style]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//form' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//input' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//textarea' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//select' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//button' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//comment()' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//script' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//style' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a[not(node())]' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[@type='text/css']' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//object' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//embed' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//footer' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//aside' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//object' yielded no results
14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe' yielded no results
thread 'main' (118374) panicked at article_scraper/src/full_text_parser/readability/mod.rs:364:60:
doc should have root
stack backtrace:
0: __rustc::rust_begin_unwind
at /rustc/ded5c06cf21d2b93bffd5d884aa6e96934ee4234/library/std/src/panicking.rs:698:5
1: core::panicking::panic_fmt
at /rustc/ded5c06cf21d2b93bffd5d884aa6e96934ee4234/library/core/src/panicking.rs:80:14
2: core::panicking::panic_display
at /rustc/ded5c06cf21d2b93bffd5d884aa6e96934ee4234/library/core/src/panicking.rs:264:5
3: core::option::expect_failed
at /rustc/ded5c06cf21d2b93bffd5d884aa6e96934ee4234/library/core/src/option.rs:2183:5
4: core::option::Option<T>::expect
at /nix/store/inqfckqlhvwlg88c81r4x0lncjlk1h22-rust-default-1.92.0/lib/rustlib/src/rust/library/core/src/option.rs:970:21
5: article_scraper::full_text_parser::readability::Readability::extract_body::{{closure}}
at ./article_scraper/src/full_text_parser/readability/mod.rs:364:60
6: core::option::Option<T>::unwrap_or_else
at /nix/store/inqfckqlhvwlg88c81r4x0lncjlk1h22-rust-default-1.92.0/lib/rustlib/src/rust/library/core/src/option.rs:1066:21
7: article_scraper::full_text_parser::readability::Readability::extract_body
at ./article_scraper/src/full_text_parser/readability/mod.rs:361:69
8: article_scraper::full_text_parser::FullTextParser::parse_page
at ./article_scraper/src/full_text_parser/mod.rs:243:33
9: article_scraper::full_text_parser::FullTextParser::parse_offline
at ./article_scraper/src/full_text_parser/mod.rs:131:18
10: article_scraper_cli::extract_ftr::{{closure}}
at ./article_scraper_cli/src/main.rs:122:42
11: article_scraper_cli::main::{{closure}}
at ./article_scraper_cli/src/main.rs:48:75
12: <core::pin::Pin<P> as core::future::future::Future>::poll
at /nix/store/inqfckqlhvwlg88c81r4x0lncjlk1h22-rust-default-1.92.0/lib/rustlib/src/rust/library/core/src/future/future.rs:133:9
13: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/park.rs:284:71
14: tokio::task::coop::with_budget
at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/task/coop/mod.rs:167:5
15: tokio::task::coop::budget
at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/task/coop/mod.rs:133:5
16: tokio::runtime::park::CachedParkThread::block_on
at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/park.rs:284:31
17: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/context/blocking.rs:66:14
18: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/scheduler/multi_thread/mod.rs:87:22
19: tokio::runtime::context::runtime::enter_runtime
at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/context/runtime.rs:65:16
20: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/scheduler/multi_thread/mod.rs:86:9
21: tokio::runtime::runtime::Runtime::block_on_inner
at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/runtime.rs:370:50
22: tokio::runtime::runtime::Runtime::block_on
at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/runtime.rs:340:18
23: article_scraper_cli::main
at ./article_scraper_cli/src/main.rs:33:24
24: core::ops::function::FnOnce::call_once
at /nix/store/inqfckqlhvwlg88c81r4x0lncjlk1h22-rust-default-1.92.0/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
</details>
What is puzzling here is that this does not happen with other classes. For instance, the following works just fine:
```html
<!DOCTYPE HTML>
<html class="navy">
<body>
</body>
</html>
```
<details>
<summary>Output of successful parse</summary>
```
RUST_BACKTRACE=1 cargo run -- --debug ftr --html ./foo.html
warning: unused import: `command`
--> article_scraper_cli/src/args.rs:1:12
|
1 | use clap::{command, Parser, Subcommand};
| ^^^^^^^
|
= note: `#[warn(unused_imports)]` (part of `#[warn(unused)]`) on by default
warning: `article_scraper_cli` (bin "article_scraper_cli") generated 1 warning (run `cargo fix --bin "article_scraper_cli" -p article_scraper_cli` to apply 1 suggestion)
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.18s
Running `target/debug/article_scraper_cli --debug ftr --html ../full-text-rs/rss-examples/foo.html`
14:08:46 [WARN] No config found for url 'http://fakehost/test/base/'
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[@property="og:title"]/@content' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//title' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dc:title')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dcterm:title')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'og:title')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'weibo:article:title')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'weibo:webpage:title')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'twitter:title')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//author' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dc:creator')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dcterm:creator')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[@property="article:published_time"]/@content' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'twitter:image')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'og:image')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@property, 'twitter:image')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@property, 'og:image')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link[contains(@rel, 'image_src')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link[@rel='image_src']' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//h1' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//h2' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//font' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//table' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class, 'google-dfp-ad-wrapper')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class, 'sharedaddy') or contains(@id, 'sharedaddy')][not(ancestor::*[contains(@class, 'sharedaddy') or contains(@id, 'sharedaddy')])]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class, 'i-amphtml-replaced-content') or contains(@id, 'i-amphtml-replaced-content')][not(ancestor::*[contains(@class, 'i-amphtml-replaced-content') or contains(@id, 'i-amphtml-replaced-content')])]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[contains(@src,'doubleclick.net')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//noscript' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//noscript' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//picture' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//figure' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe[contains(@src, 'youtube.com')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a[@onclick]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[@decoding]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[@loading]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class,' entry-unrelated ') or contains(@class,' instapaper_ignore ')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@style,'display:none')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@style,'display: none')]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[@style]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//form' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//input' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//textarea' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//select' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//button' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//comment()' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//script' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//style' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a[not(node())]' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[@type='text/css']' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//object' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//embed' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//footer' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//aside' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//object' yielded no results
14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe' yielded no results
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: initialize node DIV : 5
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Looking at sibling node: DIV (None) with score 5
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Appending node: DIV (None)
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: initialize node DIV : 5
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Looking at sibling node: DIV (None) with score 5
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Appending node: DIV (None)
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: initialize node DIV : 5
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Looking at sibling node: DIV (None) with score 5
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Appending node: DIV (None)
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: initialize node DIV : 5
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Looking at sibling node: DIV (None) with score 5
14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Appending node: DIV (None)
14:08:46 [INFO] successfully written result to "result.html"
```
</details>
So it seems that maybe explicit rules for `sidebar-visible` exist? Removing `sidebar-visible` from the original page leads to an overall successful parse.
Judging from a quick grep across the code base, the match seems to be sensitive to `sidebar`, which is part of both `UNLIELY_CANDIDATES regex` and `NEGATIVE regex`. A quick test across HTML classes:
- `sidebar`: crash
- `skyscraper` (also in both regexes): crash
- `yom-remote` (only in `UNLIELY_CANDIDATES regex`): crash
- `widget` (only in `NEGATIVE regex`): works
So apparently some postprocessing tied to `UNLIELY_CANDIDATES regex` causes the root element to no longer be considered, which would match the `doc should have root` panic in the backtrace above. That is where my debugging has stopped for now.
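
If that hypothesis holds, a guard along the following lines might avoid the panic. This is only a std-only sketch of the idea and not the actual article_scraper code: `UNLIKELY_KEYWORDS`, `matches_unlikely` and `should_strip` are hypothetical names, the keyword list contains just the three classes tested above rather than the real regex, and exempting `html` and `body` is my assumption about what a fix could look like.

```rust
/// Hypothetical subset of keywords taken from the classes tested in this report.
/// The real implementation uses a regex; plain substring matching stands in for it here.
const UNLIKELY_KEYWORDS: [&str; 3] = ["sidebar", "skyscraper", "yom-remote"];

/// Returns true if the class or id contains one of the keywords (case-insensitive).
fn matches_unlikely(class: &str, id: &str) -> bool {
    let haystack = format!("{} {}", class, id).to_lowercase();
    UNLIKELY_KEYWORDS.iter().any(|kw| haystack.contains(*kw))
}

/// Decides whether a node may be stripped as an unlikely candidate.
/// Assumption: `html` and `body` are always kept, so the document never loses its root.
fn should_strip(tag: &str, class: &str, id: &str) -> bool {
    let tag = tag.to_ascii_lowercase();
    if tag == "html" || tag == "body" {
        return false;
    }
    matches_unlikely(class, id)
}
```

With such a guard, `should_strip("html", "sidebar-visible", "")` is false while a `div` carrying the same class would still be stripped, so the filtering of real sidebars stays unchanged.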
As always: Thank you for taking the time to look at this bug report and for all your work on article_scraper, and Merry Christmas :)
Best regards, Simon