I’m trying to convert HTML pages to Markdown using the reverse-markdown Ruby gem. Unfortunately it fails with:
/usr/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:95:in `rescue in parse': #<REXML::ParseException: Missing end tag for 'img' (got "td") (REXML::ParseException)
The source contains some IMG, INPUT, etc. tags which end with > instead of />.
I’ve tried the tidy_ffi gem:
doc = Nokogiri::HTML(TidyFFI::Tidy.new(Nokogiri::HTML(page).to_html,
  :numeric_entities => 1,
  :output_html => 1,
  :merge_divs => 0,
  :merge_spans => 0,
  :join_styles => 0,
  :clean => 1,
  :indent => 1,
  :wrap => 0,
  :drop_empty_paras => 0,
  :literal_attributes => 1).clean)
but that made no difference. Any suggestions?
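Reverse-markdown assumes its input is well-formed XHTML, such as a Markdown processor would produce. As the stack trace shows, it parses with REXML, which rejects tags like IMG that are never closed. If your HTML isn't well-formed, you may want to try the html2markdown gem. It parses using Nokogiri and is likely more robust (disclaimer: I have not used it).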
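
To see why the parse fails, here is a minimal sketch that uses only REXML. It reproduces the error on an unclosed IMG tag and then shows a crude stopgap that self-closes a few void elements before parsing. The helper names (well_formed?, self_close_void_tags), the tag list, and the sample markup are my own assumptions for illustration and not part of reverse-markdown. The regex is naive: it will break on attribute values containing ">", and it does not fix other kinds of malformed markup.

```ruby
require 'rexml/document'

# Hypothetical helper: true when REXML accepts the markup as well-formed XML.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

# Assumed (incomplete) list of void elements that HTML leaves unclosed.
VOID_TAGS = %w[img input br hr meta link].freeze
VOID_TAG_PATTERN = /<(#{VOID_TAGS.join('|')})\b([^>]*?)\s*\/?>/i

# Hypothetical stopgap: rewrite <img ...> as <img ... /> so an XML parser
# sees a closed element. Not a general HTML repair.
def self_close_void_tags(html)
  html.gsub(VOID_TAG_PATTERN) { "<#{$1}#{$2} />" }
end

broken = '<table><tr><td><img src="a.png"><input type="text"></td></tr></table>'
fixed  = self_close_void_tags(broken)

puts well_formed?(broken)
puts well_formed?(fixed)
```

This should print false for the original markup and true once the void tags are self-closed. If Nokogiri is already in the pipeline, serializing with its to_xhtml method instead of to_html may be a more robust way to get closed tags, though I have not verified that against reverse-markdown.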