Readability seems to trip over `class="sidebar-visible"` (#9) · Issues · news-flash / article_scraper · GitLab

Readability seems to trip over `class="sidebar-visible"`

This one is a bit puzzling. The following URL turned up in one of my aggregator feeds: https://sdiehl.github.io/zero-to-qed/01_introduction.html, apparently a website generated with [mdBook](https://github.com/rust-lang/mdBook). Unfortunately, article-scraper panics on it. Manually minimising the page yields the following snippet:

```html
<!DOCTYPE HTML>
<html class="sidebar-visible">
<body>
</body>
</html>
```

<details>
<summary>Output of crash as of 48bbd21b20e6b00d3bcc03ef5a0750229902fb4a</summary>

```
RUST_BACKTRACE=1 cargo run -- --debug ftr --html ./foo.html warning: unused import: `command` --> article_scraper_cli/src/args.rs:1:12 | 1 | use clap::{command, Parser, Subcommand}; | ^^^^^^^ | = note: `#[warn(unused_imports)]` (part of `#[warn(unused)]`) on by default warning: `article_scraper_cli` (bin "article_scraper_cli") generated 1 warning (run `cargo fix --bin "article_scraper_cli" -p article_scraper_cli` to apply 1 suggestion) Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.18s Running `target/debug/article_scraper_cli --debug ftr --html ../full-text-rs/rss-examples/foo.html` 14:03:53 [WARN] No config found for url 'http://fakehost/test/base/' 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[@property="og:title"]/@content' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//title' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dc:title')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dcterm:title')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'og:title')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'weibo:article:title')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 
'weibo:webpage:title')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'twitter:title')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//author' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dc:creator')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dcterm:creator')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[@property="article:published_time"]/@content' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'twitter:image')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'og:image')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@property, 'twitter:image')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@property, 'og:image')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link[contains(@rel, 'image_src')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link[@rel='image_src']' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//h1' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//h2' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//font' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//table' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class, 'google-dfp-ad-wrapper')]' yielded no results 14:03:53 [DEBUG] (1) 
article_scraper::util: Evaluation of xpath '//*[contains(@class, 'sharedaddy') or contains(@id, 'sharedaddy')][not(ancestor::*[contains(@class, 'sharedaddy') or contains(@id, 'sharedaddy')])]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class, 'i-amphtml-replaced-content') or contains(@id, 'i-amphtml-replaced-content')][not(ancestor::*[contains(@class, 'i-amphtml-replaced-content') or contains(@id, 'i-amphtml-replaced-content')])]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[contains(@src,'doubleclick.net')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//noscript' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//noscript' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//picture' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//figure' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe[contains(@src, 'youtube.com')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a[@onclick]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[@decoding]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[@loading]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class,' entry-unrelated ') or contains(@class,' instapaper_ignore ')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@style,'display:none')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath 
'//*[contains(@style,'display: none')]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[@style]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//form' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//input' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//textarea' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//select' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//button' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//comment()' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//script' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//style' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a[not(node())]' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[@type='text/css']' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//object' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//embed' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//footer' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//aside' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: 
Evaluation of xpath '//object' yielded no results 14:03:53 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe' yielded no results thread 'main' (118374) panicked at article_scraper/src/full_text_parser/readability/mod.rs:364:60: doc should have root stack backtrace: 0: __rustc::rust_begin_unwind at /rustc/ded5c06cf21d2b93bffd5d884aa6e96934ee4234/library/std/src/panicking.rs:698:5 1: core::panicking::panic_fmt at /rustc/ded5c06cf21d2b93bffd5d884aa6e96934ee4234/library/core/src/panicking.rs:80:14 2: core::panicking::panic_display at /rustc/ded5c06cf21d2b93bffd5d884aa6e96934ee4234/library/core/src/panicking.rs:264:5 3: core::option::expect_failed at /rustc/ded5c06cf21d2b93bffd5d884aa6e96934ee4234/library/core/src/option.rs:2183:5 4: core::option::Option<T>::expect at /nix/store/inqfckqlhvwlg88c81r4x0lncjlk1h22-rust-default-1.92.0/lib/rustlib/src/rust/library/core/src/option.rs:970:21 5: article_scraper::full_text_parser::readability::Readability::extract_body::{{closure}} at ./article_scraper/src/full_text_parser/readability/mod.rs:364:60 6: core::option::Option<T>::unwrap_or_else at /nix/store/inqfckqlhvwlg88c81r4x0lncjlk1h22-rust-default-1.92.0/lib/rustlib/src/rust/library/core/src/option.rs:1066:21 7: article_scraper::full_text_parser::readability::Readability::extract_body at ./article_scraper/src/full_text_parser/readability/mod.rs:361:69 8: article_scraper::full_text_parser::FullTextParser::parse_page at ./article_scraper/src/full_text_parser/mod.rs:243:33 9: article_scraper::full_text_parser::FullTextParser::parse_offline at ./article_scraper/src/full_text_parser/mod.rs:131:18 10: article_scraper_cli::extract_ftr::{{closure}} at ./article_scraper_cli/src/main.rs:122:42 11: article_scraper_cli::main::{{closure}} at ./article_scraper_cli/src/main.rs:48:75 12: <core::pin::Pin<P> as core::future::future::Future>::poll at /nix/store/inqfckqlhvwlg88c81r4x0lncjlk1h22-rust-default-1.92.0/lib/rustlib/src/rust/library/core/src/future/future.rs:133:9 13: 
tokio::runtime::park::CachedParkThread::block_on::{{closure}} at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/park.rs:284:71 14: tokio::task::coop::with_budget at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/task/coop/mod.rs:167:5 15: tokio::task::coop::budget at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/task/coop/mod.rs:133:5 16: tokio::runtime::park::CachedParkThread::block_on at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/park.rs:284:31 17: tokio::runtime::context::blocking::BlockingRegionGuard::block_on at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/context/blocking.rs:66:14 18: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}} at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/scheduler/multi_thread/mod.rs:87:22 19: tokio::runtime::context::runtime::enter_runtime at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/context/runtime.rs:65:16 20: tokio::runtime::scheduler::multi_thread::MultiThread::block_on at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/scheduler/multi_thread/mod.rs:86:9 21: tokio::runtime::runtime::Runtime::block_on_inner at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/runtime.rs:370:50 22: tokio::runtime::runtime::Runtime::block_on at /home/noctux/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.44.2/src/runtime/runtime.rs:340:18 23: article_scraper_cli::main at ./article_scraper_cli/src/main.rs:33:24 24: core::ops::function::FnOnce::call_once at /nix/store/inqfckqlhvwlg88c81r4x0lncjlk1h22-rust-default-1.92.0/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5 note: Some details are omitted, run with 
`RUST_BACKTRACE=full` for a verbose backtrace.
```
</details>

What is puzzling here is that this does not happen with other classes. For instance, the following works just fine:

```html
<!DOCTYPE HTML>
<html class="navy">
<body>
</body>
</html>
```

<details>
<summary>Output of successful parse</summary>

```
RUST_BACKTRACE=1 cargo run -- --debug ftr --html ./foo.html warning: unused import: `command` --> article_scraper_cli/src/args.rs:1:12 | 1 | use clap::{command, Parser, Subcommand}; | ^^^^^^^ | = note: `#[warn(unused_imports)]` (part of `#[warn(unused)]`) on by default warning: `article_scraper_cli` (bin "article_scraper_cli") generated 1 warning (run `cargo fix --bin "article_scraper_cli" -p article_scraper_cli` to apply 1 suggestion) Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.18s Running `target/debug/article_scraper_cli --debug ftr --html ../full-text-rs/rss-examples/foo.html` 14:08:46 [WARN] No config found for url 'http://fakehost/test/base/' 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[@property="og:title"]/@content' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//title' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dc:title')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dcterm:title')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'og:title')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'weibo:article:title')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'weibo:webpage:title')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'twitter:title')]' yielded no results 14:08:46 [DEBUG] (1) 
article_scraper::util: Evaluation of xpath '//author' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dc:creator')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'dcterm:creator')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[@property="article:published_time"]/@content' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'twitter:image')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@name, 'og:image')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@property, 'twitter:image')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//meta[contains(@property, 'og:image')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link[contains(@rel, 'image_src')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link[@rel='image_src']' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//h1' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//h2' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//font' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//table' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class, 'google-dfp-ad-wrapper')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class, 'sharedaddy') or contains(@id, 'sharedaddy')][not(ancestor::*[contains(@class, 'sharedaddy') or contains(@id, 'sharedaddy')])]' 
yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class, 'i-amphtml-replaced-content') or contains(@id, 'i-amphtml-replaced-content')][not(ancestor::*[contains(@class, 'i-amphtml-replaced-content') or contains(@id, 'i-amphtml-replaced-content')])]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[contains(@src,'doubleclick.net')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//noscript' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//noscript' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//picture' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//figure' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe[contains(@src, 'youtube.com')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a[@onclick]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[@decoding]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img[@loading]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@class,' entry-unrelated ') or contains(@class,' instapaper_ignore ')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@style,'display:none')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[contains(@style,'display: none')]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[@style]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of 
xpath '//form' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//input' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//textarea' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//select' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//button' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//comment()' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//script' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//style' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a[not(node())]' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//*[@type='text/css']' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//object' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//embed' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//footer' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//link' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//aside' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//img' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//a' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//object' yielded no results 14:08:46 [DEBUG] (1) article_scraper::util: Evaluation of xpath '//iframe' yielded no results 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: 
initialize node DIV : 5 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Looking at sibling node: DIV (None) with score 5 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Appending node: DIV (None) 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: initialize node DIV : 5 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Looking at sibling node: DIV (None) with score 5 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Appending node: DIV (None) 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: initialize node DIV : 5 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Looking at sibling node: DIV (None) with score 5 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Appending node: DIV (None) 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: initialize node DIV : 5 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Looking at sibling node: DIV (None) with score 5 14:08:46 [DEBUG] (1) article_scraper::full_text_parser::readability: Appending node: DIV (None) 14:08:46 [INFO] successfully written result to "result.html"
```
</details>

So it seems that explicit rules for `sidebar-visible` might exist? Removing `sidebar-visible` from the original page leads to an overall successful parse. Judging from a quick grep across the code base, the match seems to be sensitive to `sidebar`, which is part of both the `UNLIELY_CANDIDATES` regex and the `NEGATIVE` regex. A quick test across HTML classes on the `<html>` element:

- `sidebar`: crash
- `skyscraper` (also in both regexes): crash
- `yom-remote` (only in the `UNLIELY_CANDIDATES` regex): crash
- `widget` (only in the `NEGATIVE` regex): works

So apparently some postprocessing of `UNLIELY_CANDIDATES` regex matches causes the root element to no longer be considered, which would fit the `doc should have root` panic at `readability/mod.rs:364`. That is where my current debugging has stopped for now.
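To make the suspicion concrete, here is a minimal std-only sketch of what I think might be going on. It is a toy model and not article_scraper's actual code: the keyword lists are just the classes I tested above, plain substring matching stands in for the real regexes, and `strip_unlikely`, `root_or_error` and the `protect_root` flag are hypothetical names made up for illustration.

```rust
// Toy model of the suspected failure; NOT article_scraper's actual code.
// The keyword lists are hypothetical subsets: only the classes tested above.
const UNLIKELY_KEYWORDS: &[&str] = &["sidebar", "skyscraper", "yom-remote"];
const NEGATIVE_KEYWORDS: &[&str] = &["sidebar", "skyscraper", "widget"];

/// Case-insensitive substring match, standing in for a regex alternation.
fn matches_any(class: &str, keywords: &[&str]) -> bool {
    let class = class.to_lowercase();
    keywords.iter().any(|kw| class.contains(kw))
}

/// Models an "unlikely candidate" pass applied to a single element.
/// Returns `None` if the element would be stripped from the document.
/// `protect_root` is a hypothetical guard that never strips `<html>`.
fn strip_unlikely<'a>(tag: &'a str, class: &str, protect_root: bool) -> Option<&'a str> {
    let is_protected = tag.eq_ignore_ascii_case("body")
        || (protect_root && tag.eq_ignore_ascii_case("html"));
    if !is_protected && matches_any(class, UNLIKELY_KEYWORDS) {
        None
    } else {
        Some(tag)
    }
}

/// Models the root lookup, but reports a missing root as an error
/// instead of panicking via `expect`.
fn root_or_error(root: Option<&str>) -> Result<&str, String> {
    root.ok_or_else(|| "doc should have root".to_string())
}

fn main() {
    let classes = ["sidebar-visible", "sidebar", "skyscraper", "yom-remote", "widget", "navy"];
    for class in classes {
        let unguarded = root_or_error(strip_unlikely("html", class, false));
        let guarded = root_or_error(strip_unlikely("html", class, true));
        println!(
            "{:<16} negative={:<5} unguarded={:?} guarded={:?}",
            class,
            matches_any(class, NEGATIVE_KEYWORDS),
            unguarded,
            guarded
        );
    }
}
```

In this model, only classes hitting the unlikely-candidates list lose their root, which lines up with the crash/works pattern above. Whether the real fix should be a guard on the root element or a proper error return instead of `expect`, I cannot judge.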
As always: Thank you for taking the time to look at this bug report and for all your work on article_scraper, and Merry Christmas :)

Best regards,
Simon
