My test html file is here: http://pastebin.com/L88nYbQY
As you can see, there are some unclosed input tags and some self-closing ones.
This causes the following code to return everything from the opening #qcbody div to the end of the file, ignoring the closing div tag.
require 'nokogiri'
f = File.open('t.html', 'r')
@doc = Nokogiri::XML(f)
@doc.at_css('#qcbody').to_html
I’m sure people have gotten around this problem in a variety of ways. How would you do it?
Give this a try:
In IRB:
The difference between using Nokogiri::XML and Nokogiri::HTML is the leniency when parsing the document. XML is required to be well-formed and correct, and some XML parsers will reject an XML file that doesn’t meet the standard. Nokogiri allows us to set how picky it is. (And in the case of XML, you can look at the errors array after parsing to see if there is a problem.)
For HTML, Nokogiri relaxes the parser so there’s a better chance of handling real-world HTML. I’ve seen it handle some really ugly markup and keep on going when lesser parsers blew their lunch. If you look at Nokogiri::HTML.parse, it has options = XML::ParseOptions::DEFAULT_HTML defined, which are the relaxed settings. You can override that if you want to make sure the HTML conforms.