When parsing HTML document, how Nokogiri handle <br> tags? Suppose we have document that looks like this one:
<div>
Hi <br>
How are you? <br>
</div>
Do Nokogiri know that <br> tags are something special not just regular XML tags and ignore them when parsing node feed? I think Nokogiri is that smart, but I want to make sure before I accept this project involving scraping site written as HTML4. You know what I mean (How are you? is not a content of the first <br> as it would be in XML).
You must parse this fragment using the HTML parser, as obviously this is not valid XML. When using the HTML one, Nokogiri then behaves as you’d expect it:
prints
Mechanize is based on Nokogiri for doing web scraping, so it is quite appropriate for the task.