I have a directory of 300+ HTML files that I need to parse data from and place into a new HTML template. This works well, with the exception of pre-rendered HTML entities such as the square root symbol (√) that appear in some of the files. I have read a ton of posts over the last few hours about encoding in Ruby 1.9 and tried things like:
File.read( "_pending/testdir/filename.html", :encoding=>"UTF-8" )
and
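require 'iconv'  # Iconv ships with Ruby 1.9 but must be required explicitly
trans = Iconv.new( 'UTF-8', 'IBM437' )  # arguments are (to, from)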
input_text = File.read( "_pending/testdir/filename.html" )
output_text = trans.iconv( input_text )
puts output_text
None of these worked. After conversion, the square root symbol still appears as √ in the browser and in the raw HTML markup. The one exception is the Iconv attempt, which prints AªAo to the console when the result is passed to puts.
Setup
Windows Server 2008 R2
ruby 1.9.3p194 (2012-04-20) [i386-mingw32]
Calling HTML_FILE.external_encoding.name returns IBM437.
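For anyone who wants to see what that IBM437 label does, here is a minimal sketch. It assumes the files on disk are really UTF-8 (an assumption, though it matches how things turned out), uses a hypothetical temp file in place of one of the real HTML files, and uses String#encode in place of Iconv, since Iconv was removed from later Ruby versions. It shows that the encoding passed to File.read only changes the label on the string, not its bytes, and that transcoding a mislabelled string produces different (double-encoded) text.

```ruby
require "tempfile"

# Hypothetical stand-in for one of the HTML files; the bytes on disk are UTF-8.
sample = Tempfile.new(["sample", ".html"])
sample.binmode
sample.write("<p>\u221A2</p>")   # \u221A is the square root sign
sample.close

# What Ruby assumes for files when no encoding is given (IBM437 in the setup above).
puts Encoding.default_external.name

# Read the same bytes under two different labels.
as_ibm437 = File.read(sample.path, :encoding => "IBM437")
as_utf8   = File.read(sample.path, :encoding => "UTF-8")

# The label differs, the bytes do not.
same_bytes = as_ibm437.bytes.to_a == as_utf8.bytes.to_a

# Transcoding the mislabelled string treats each byte as an IBM437 character
# and re-encodes it, so the result no longer matches the original text.
double_encoded = as_ibm437.encode("UTF-8")

# Relabelling (no byte changes) recovers the original text.
relabelled = as_ibm437.dup.force_encoding("UTF-8")

puts as_ibm437.encoding.name
puts as_utf8.encoding.name
puts same_bytes
puts double_encoded == as_utf8
puts relabelled == as_utf8

sample.unlink
```

If the assumption holds, the Iconv attempt was converting text that was never IBM437 in the first place, which would explain the garbage on the console.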
You have to be kidding me…
The fix was to set a Content-Type meta tag in the head of the HTML template page. I’m guessing the console was outputting valid UTF-8, just not in a format it could display recognizably.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
While I feel pretty stupid right now, I’m sure someone else is going to run into something similar, so if that’s you: I feel your pain. I just hope you didn’t spend the last 6 hours troubleshooting as I have.
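In case it helps, here is a rough sketch of how the pieces might fit together. Everything except the meta tag is made up for illustration: the TEMPLATE constant, the {{content}} placeholder and the render_page helper are hypothetical names, not what my actual script uses. The idea is to treat the parsed fragment as UTF-8, check that it really is valid UTF-8, and drop it into a template that declares the charset.

```ruby
# encoding: utf-8

# Hypothetical template; only the <meta> line is taken from the fix above.
TEMPLATE = <<-HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
{{content}}
</body>
</html>
HTML

# Hypothetical helper: relabel the fragment as UTF-8 (bytes are untouched),
# refuse anything that is not valid UTF-8, then fill in the template.
def render_page(fragment)
  body = fragment.dup.force_encoding("UTF-8")
  raise ArgumentError, "fragment is not valid UTF-8" unless body.valid_encoding?
  # Block form of sub inserts the text literally, with no backslash handling.
  TEMPLATE.sub("{{content}}") { body }
end

# A fragment whose bytes are UTF-8 but which carries the IBM437 label,
# as happens when File.read falls back to the console code page.
mislabelled = "<p>\u221A2</p>".dup.force_encoding("IBM437")
page = render_page(mislabelled)

puts page.encoding.name
puts page.include?("\u221A")
```

When writing the result out, opening the file with an explicit mode such as "w:UTF-8" keeps Ruby from falling back to the default external encoding.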