I’m encountering some weird encoding issues. I need to parse an HTML document from

Asked: June 17, 2026

I’m encountering some weird encoding issues. I need to parse an HTML document from the web, and I’m using the charset in the ‘Content-Type’ metadata to determine the encoding.
One page has been giving me trouble. It is declared as ‘Shift_JIS’ (Japanese), but the parser result contains some garbled characters.

When I parse the same document as UTF-8, the characters that were garbled before are parsed correctly, but everything else is now garbled.

I’m assuming the document contains text in two different encodings.

Is there any way I could parse this document correctly?

Also, I don’t know how, but all the browsers seem to deal well with the issue and present the page nicely.

Would really appreciate any thoughts on this.

The page that I need to parse : http://ao.recruit.co.jp/form.html


1 Answer

Editorial Team · Answer 1 · 2026-06-17T09:28:29+00:00

First of all, what the browser sees is:

莨夂､ｾ讎りｦ

What is shown in the rendered HTML is not the same, because of the CSS text-indent: -9999px and the background image laid over it. But the text is there. Removing those two rules will show the text the browser actually sees.

Out of the box, decoding as Shift_JIS should give you 莨夂､ｾ讎りｦ?, but if you want the same result as in a browser, you should use a custom CharsetDecoder with CodingErrorAction.IGNORE:

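import java.io.BufferedInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import org.apache.commons.io.IOUtils; // Apache Commons IO

URL url = new URL("http://ao.recruit.co.jp/form.html");
BufferedInputStream bis = new BufferedInputStream(url.openStream());
CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder();

// Drop bytes that are not valid Shift_JIS instead of emitting a replacement character.
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);

Reader inputReader = new InputStreamReader(bis, decoder);

String result = IOUtils.toString(inputReader);
System.out.print(result);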

This will give you the same result as the browsers. Of course, it won’t recover the text that is baked into the image file.
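
If you want to see the effect of the two error actions without hitting the network, here is a minimal offline sketch. It rests on an assumption that is not confirmed for that page: that the garbled run is a short piece of UTF-8 text sitting inside an otherwise Shift_JIS document. The class name MixedEncodingDemo, its helper methods, and the sample string are all hypothetical and exist only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class MixedEncodingDemo {

    // Hypothetical sample: four kanji encoded as UTF-8, standing in for
    // a stray UTF-8 run inside a page that declares Shift_JIS.
    static byte[] sample() {
        return "\u4F1A\u793E\u6982\u8981".getBytes(StandardCharsets.UTF_8);
    }

    // Decode the bytes as Shift_JIS, applying the given action to both
    // malformed input and unmappable characters.
    static String decode(byte[] bytes, CodingErrorAction action) {
        CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Only thrown with CodingErrorAction.REPORT; wrapped for convenience.
            throw new IllegalStateException(e);
        }
    }

    static String decodeIgnoring(byte[] bytes) {
        return decode(bytes, CodingErrorAction.IGNORE);
    }

    static String decodeReplacing(byte[] bytes) {
        return decode(bytes, CodingErrorAction.REPLACE);
    }

    public static void main(String[] args) {
        byte[] bytes = sample();
        System.out.println("REPLACE: " + decodeReplacing(bytes));
        System.out.println("IGNORE:  " + decodeIgnoring(bytes));
    }
}
```

With REPLACE, the dangling final byte should come out as U+FFFD (often displayed as ?), while IGNORE should simply drop it. Which of the two you want depends on whether you would rather see where the damage is or get cleaner text.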
