Oga skips tags but Nokogiri has no problem
This may be related to issue #188. I hit the problem while parsing a utility company's website. To ease debugging and remove proprietary information, I reduced the page to a minimal test script:
require 'oga'
require 'nokogiri'
text = <<-HTML
<html>
<head>
</head>
<body>
<table>
<thead><th>Hello</th><th>Hello 2</th></thead>
<tr><td>World</td><td>World 2</td></tr>
</table>
</body>
</html>
HTML
table_elem = Oga.parse_html(text).at_css('table')
puts 'Oga - children of table element:'
table_elem.children.each { |elem| p elem }
puts
table_elem = Nokogiri::HTML(text).at_css('table')
puts 'Nokogiri - children of table element:'
table_elem.children.each { |elem| p elem }
The output:
Oga - children of table element:
Text("\n ")
Element(name: "thead")
Nokogiri - children of table element:
#<Nokogiri::XML::Text:0x3fcc8c8803a4 "\n ">
#<Nokogiri::XML::Element:0x3fcc8c88028c name="thead" children=[#<Nokogiri::XML::Element:0x3fcc8c880098 name="th" children=[#<Nokogiri::XML::Text:0x3fcc8c4d7e78 "Hello">]>, #<Nokogiri::XML::Element:0x3fcc8c4d7c20 name="th" children=[#<Nokogiri::XML::Text:0x3fcc8c4d79a0 "Hello 2">]>]>
#<Nokogiri::XML::Text:0x3fcc8c4d7554 "\n ">
#<Nokogiri::XML::Element:0x3fcc8c4d743c name="tr" children=[#<Nokogiri::XML::Element:0x3fcc8c4d71d0 name="td" children=[#<Nokogiri::XML::Text:0x3fcc8c4d6f78 "World">]>, #<Nokogiri::XML::Element:0x3fcc8c4d6b18 name="td" children=[#<Nokogiri::XML::Text:0x3fcc8c4d67a8 "World 2">]>]>
#<Nokogiri::XML::Text:0x3fcc8c4d63e8 "\n ">
Of course, the HTML is not valid: the th cells in the thead must be wrapped in a tr element. Even so, Nokogiri returns both element children of the table (thead and tr), while Oga returns only the first (thead); the tr does not appear among the table's children at all.
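One possible follow-up check is to run both parsers on a corrected copy of the markup, with the th cells wrapped in a tr, to see whether the missing row is tied to the invalid thead. The sketch below is only a suggestion, not something I have run against Oga's internals. `VALID_TABLE_HTML` and `element_child_names` are hypothetical names introduced here, not part of either library. The helper assumes element nodes are instances of a class whose name ends in `::Element` (the Nokogiri output above shows `Nokogiri::XML::Element`; `Oga::XML::Element` is assumed).

```ruby
# Corrected fixture (hypothetical name): same table, but the <th> cells are
# wrapped in a <tr>, as valid HTML requires.
VALID_TABLE_HTML = <<-HTML
<html>
<head>
</head>
<body>
<table>
<thead><tr><th>Hello</th><th>Hello 2</th></tr></thead>
<tr><td>World</td><td>World 2</td></tr>
</table>
</body>
</html>
HTML

# Hypothetical helper: names of the element children of a parsed node,
# skipping text nodes. Assumption: element classes are named "...::Element"
# in both libraries.
def element_child_names(node)
  node.children
      .select { |child| child.class.name.to_s.end_with?('::Element') }
      .map(&:name)
end

# Compare the two parsers on the corrected markup. The comparison is skipped
# with a warning if either gem is not installed.
begin
  require 'oga'
  require 'nokogiri'

  oga_table = Oga.parse_html(VALID_TABLE_HTML).at_css('table')
  noko_table = Nokogiri::HTML(VALID_TABLE_HTML).at_css('table')

  puts "Oga:      #{element_child_names(oga_table).inspect}"
  puts "Nokogiri: #{element_child_names(noko_table).inspect}"
rescue LoadError => e
  warn "Skipping comparison: #{e.message}"
end
```

If both lists match on the corrected markup, that would suggest the problem is specific to how Oga recovers from th cells placed directly inside thead.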