Lots of tags have [text] as their name.
I scraped the site https://genius.com/The-decemberists-the-crane-wife-1-2-and-3-lyrics
I am using it as an example to test my plain-text parsing code, which tries to extract just the readable text from a page. My approach was to walk all nodes recursively and throw out any whose name contains "script". However, I kept getting JavaScript in my output. To see what was happening, I printed node.name() and node.text() for every leaf node.
// Brings the soup extension traits (name, text, children, recursive) into scope.
use soup::prelude::*;

pub fn bs_plain_text(html: &str) -> String {
    let s = soup::Soup::new(html);
    let mut output = String::new();
    for node in s.recursive(true) {
        // Only look at leaf nodes, and skip any whose own name contains "script".
        if node.children().count() == 0 && !node.name().contains("script") {
            let line = format!("\nName: {}, Text: {}", node.name(), node.text());
            output.push_str(&line);
        }
    }
    output
}
When I run this code and dump the output string to a text file, I see a lot of lines like this:
Name: [text], Text: window['Genius.ads'] = window['Genius.ads'] || [];
Name: [text], Text:
!function(){if('PerformanceLongTaskTiming' in window){var g=window.__tti={e:[]};
g.o=new PerformanceObserver(function(l){g.e=g.e.concat(l.getEntries())});
g.o.observe({entryTypes:['longtask']})}}();
From the docs, it looks like .name() is meant to return the tag name, but that does not seem to be what I get here: most of the leaf nodes report [text] as their name when parsing this document. Is there something other than .name() that I should be using to get the tag?
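To make my guess about the cause concrete, here is a toy model written with the standard library only. It assumes that text content lives in separate leaf nodes that report a placeholder name such as [text], and that the script element is their parent rather than the leaf itself. The `Node` type and the `naive_text`, `visible_text`, and `sample_tree` functions are hypothetical names of mine, not part of the soup crate, so this is a sketch of the idea rather than a fix for my real code.

```rust
// Toy model only: none of these names come from the soup crate.
#[derive(Debug)]
enum Node {
    // An element such as <p> or <script>, with child nodes.
    Element { name: String, children: Vec<Node> },
    // A leaf holding raw text; it has no tag name of its own.
    Text(String),
}

impl Node {
    // Mirrors what I am seeing: text leaves report a placeholder name.
    fn name(&self) -> &str {
        match self {
            Node::Element { name, .. } => name.as_str(),
            Node::Text(_) => "[text]",
        }
    }
}

// Mimics my current approach: a leaf is dropped only if its own name
// contains "script". Text leaves are named "[text]", so none are dropped.
fn naive_text(node: &Node, out: &mut Vec<String>) {
    match node {
        Node::Text(t) => {
            if !node.name().contains("script") {
                out.push(t.clone());
            }
        }
        Node::Element { children, .. } => {
            for child in children {
                naive_text(child, out);
            }
        }
    }
}

// Alternative: skip the whole subtree as soon as a <script> element is reached.
fn collect_text(node: &Node, out: &mut Vec<String>) {
    match node {
        Node::Text(t) => {
            let trimmed = t.trim();
            if !trimmed.is_empty() {
                out.push(trimmed.to_string());
            }
        }
        Node::Element { name, children } => {
            if name.as_str() == "script" {
                return;
            }
            for child in children {
                collect_text(child, out);
            }
        }
    }
}

fn visible_text(root: &Node) -> String {
    let mut parts = Vec::new();
    collect_text(root, &mut parts);
    parts.join("\n")
}

// A tiny stand-in document: <body><p>Hello</p><script>...</script></body>
fn sample_tree() -> Node {
    Node::Element {
        name: "body".to_string(),
        children: vec![
            Node::Element {
                name: "p".to_string(),
                children: vec![Node::Text("Hello".to_string())],
            },
            Node::Element {
                name: "script".to_string(),
                children: vec![Node::Text(
                    "window['Genius.ads'] = window['Genius.ads'] || [];".to_string(),
                )],
            },
        ],
    }
}
```

In this model, `naive_text` on `sample_tree()` keeps both text leaves, including the JavaScript, while `visible_text` keeps only the paragraph text. If the real parser behaves the same way, I would need to check a text node's parent (or skip script subtrees) instead of checking the leaf's own name.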