I’m encountering some weird encoding issues. I need to parse an HTML document from the web, and I’m using the charset from the ‘Content-Type’ metadata to determine the encoding.
One page has been giving me trouble. It is encoded as ‘Shift_JIS’ (Japanese), and the parser result contains some garbled characters.
When I parse the same document as UTF-8, the characters that were garbled before are parsed correctly, but everything else is now garbled.
I’m assuming the document contains text in two different encodings.
Is there any way I could parse this document correctly?
Also, I don’t know how, but all the browsers seem to deal well with the issue and present the page nicely.
Would really appreciate any thoughts on this.
The page that I need to parse: http://ao.recruit.co.jp/form.html
First of all, what the browser sees is:
What is shown in the rendered HTML is not the same because of the CSS text-indent: -9999px and the background image laid over it. But it’s there. Removing them will show the text the browser is seeing.
Out of the box, decoding as Shift_JIS should give you 莨夂、セ讎りヲ?, but if you want the same results as in a browser, you should use a custom CharsetDecoder with IGNORE.
This will give you the same result as with browsers. Of course, it won’t parse the text from the image file.
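A minimal sketch of such a decoder, using only java.nio.charset. The class name LenientDecoder and the helper decodeIgnoringErrors are made up for illustration, and the sample bytes are just the UTF-8 encoding of 会社概要 rather than the real page content. It also assumes the JDK in use ships the Shift_JIS charset, which most full JDK builds do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class LenientDecoder {

    // Hypothetical helper: decodes the bytes with the given charset and
    // silently drops byte sequences that are malformed or unmappable,
    // instead of substituting a replacement character.
    static String decodeIgnoringErrors(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are IGNORE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 bytes of 会社概要, written with escapes so the source
        // file encoding does not matter.
        byte[] utf8Bytes = "\u4F1A\u793E\u6982\u8981".getBytes(StandardCharsets.UTF_8);

        // Assumption: Shift_JIS is available in this runtime.
        Charset shiftJis = Charset.forName("Shift_JIS");

        // Default behaviour: undecodable bytes become a replacement character.
        System.out.println(new String(utf8Bytes, shiftJis));

        // Lenient behaviour: undecodable bytes are dropped.
        System.out.println(decodeIgnoringErrors(utf8Bytes, shiftJis));
    }
}
```

To read from a stream instead of a byte array, the same configured decoder can be passed to new InputStreamReader(inputStream, decoder).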