After spending some hours with the Ruby Debugger I finally learned that I need to clean up some malformed HTML pages before I can feed those to Hpricot. The best solution I found so far is the Tidy Ruby interface.
Tidy works great from the command line, and the Ruby interface works too. However, the Ruby interface requires dl/import, which fails to load in JRuby:
    $ jirb
    irb(main):001:0> require 'rubygems'
    => true
    irb(main):002:0> require 'tidy'
    LoadError: no such file to load -- dl/import
Is this library available for JRuby? A web search revealed that it wasn’t available last year.
Alternatively, can someone suggest other ways to clean up malformed HTML in JRuby?
Update
Following Markus’ suggestion I now use Tidy via popen instead of libtidy. For future reference, here is the code that pipes the document data through tidy. Hopefully, this is robust and portable.
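    def clean(data)
      cleaned = nil
      # Start the tidy command-line tool with a read/write pipe
      tidy = IO.popen('tidy -f "log/tidy.log" --force-output yes -wrap 0 -utf8', 'w+')
      begin
        tidy.write(data)
        tidy.close_write   # signal end of input so tidy starts writing output
        cleaned = tidy.read
        tidy.close_read
      rescue Errno::EPIPE
        $stderr.puts "Running 'tidy' failed: #{$!}"
        tidy.close
      end
      # Fall back to the original data if tidy produced no output
      return cleaned if cleaned and cleaned != ''
      return data
    end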
You could use it from the command line from within JRuby with %x{...} or backticks. You may also want to consider popen (and pipe things through it). Not elegant perhaps, but more likely to get you going with minimal hassle than trying to mess with unsupported libraries.
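For the %x{...} route, here is a minimal sketch of what I have in mind. It assumes the tidy executable is on the PATH; the helper names filter_through and clean_with_backticks are made up for illustration, and the tidy options are the ones from the question, minus the log file. Since %x{...} only captures standard output and has no convenient way to feed standard input, the data goes through a temporary file:

```ruby
require 'tempfile'
require 'shellwords'

# Hypothetical helper: writes +data+ to a temporary file, runs +command+
# (an array of arguments) with the file path appended as the last argument,
# and returns whatever the command printed on standard output.
# Falls back to the original data if the command printed nothing.
def filter_through(command, data)
  Tempfile.create('tidy-input') do |file|
    file.write(data)
    file.flush
    # Shellwords.join escapes each argument for the shell
    output = %x{#{Shellwords.join(command + [file.path])}}
    output.empty? ? data : output
  end
end

# Assumes a `tidy` executable on the PATH; tidy's diagnostics go to
# standard error, which %x{...} does not capture.
def clean_with_backticks(data)
  filter_through(%w[tidy --force-output yes -wrap 0 -utf8], data)
end

# Usage (not run here):
#   html = clean_with_backticks(File.read('page.html'))
```

The popen version avoids the temporary file, so I would only reach for this one if the pipe approach gives you trouble.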