I have sample HTML that I have marked up with some special tags for use by a different program; an example of the HTML is below. Note the <START:organization>..<END> elements.
<html>
<head/>
<body>
<ul>
<li> <START:organization> Advanced Integrated Pest Management <END> </li>
<li> <START:organization> American Bakers Association <END> </li>
</ul>
</body>
</html>
I wanted to use Nokogiri to preprocess the HTML to easily remove irrelevant tags like <script>. I created the following extension to the Nokogiri Document class:
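require "nokogiri"

module Nokogiri
  module HTML
    class Document
      # Strip <script> elements, then serialize the document back to HTML.
      def prepare_html
        xpath("//script").remove
        # remove_new_lines is not a core String method; it is assumed to be
        # a custom String extension defined elsewhere.
        to_html.remove_new_lines
      end
    end
  end
end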
The problem is that Nokogiri is changing the <START:organization> element to <organization>.
Is there any way to preserve the HTML so that my custom markup tags are maintained?
Like the other two said, if your input is neither standard XML nor standard HTML, you can't really expect a parser designed for those formats to work.
Nevertheless, you could do one of the following:
case
using (curious what it is) to follow standards
parser for the DSL you are using
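To give an idea of what a small parser for this markup could look like, here is a minimal sketch using StringScanner from Ruby's standard library. The method name `parse_annotations` is made up for illustration, and the sketch assumes that markers never nest and that each `<END>` closes the most recent `<START:...>`. It does not touch Nokogiri at all; it only splits the text into plain and annotated segments.

```ruby
require "strscan"

# Hypothetical helper: splits the input into [type, text] pairs.
# type is nil for plain text, or the annotation name (e.g. "organization")
# for a span wrapped in <START:name> ... <END>.
# Assumes markers are not nested.
def parse_annotations(input)
  scanner = StringScanner.new(input)
  segments = []
  until scanner.eos?
    if scanner.scan(/<START:(\w+)>/)
      type = scanner[1]
      # Read everything up to and including the closing <END> marker.
      body = scanner.scan_until(/<END>/)
      raise ArgumentError, "unterminated <START:#{type}>" unless body
      segments << [type, body.sub(/<END>\z/, "").strip]
    else
      # Consume plain text up to the next start marker or the end of input.
      segments << [nil, scanner.scan(/.+?(?=<START:\w+>|\z)/m)]
    end
  end
  segments
end
```

With the segments separated like this, the plain parts could be cleaned up on their own and the annotated spans picked out with something like `parse_annotations(html).select { |type, _| type }`.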