I have the following code, which objective is to detect the encoding/charset of a

Question

0

Asked: June 7, 20262026-06-07T13:17:47+00:00 2026-06-07T13:17:47+00:00

I have the following code, which objective is to detect the encoding/charset of a

0

I have the following code, which objective is to detect the encoding/charset of a given webpage, using regular expressions.

I need to test two of the following regex (regexHTML1 and regexHTML2). In this case, the correct regex is the second one, regexHTML2, which outputs:

Found: <meta id="HtmlHead1_desc" name="description" content="Televisores,TV 3D, TV, vídeo e MP3. Compre online Televisores,TV 3D na Fnac" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta
Found: UTF-8

With this code:

URL url = new URL("http://www.fnac.pt/imagem-e-som/TV-3D/Televisores/s21075?bl=HGAChead");
is = url.openStream();

String regexHTML1 = "<meta.*content=\\\".*;.*charset=(.*)\\\"\\s*/?>";
String regexHTML2 = "<meta.*content=\\\".*;.*charset=(.*)\\\"\\s*/?>\\s*<meta";

//  Scanner s = new Scanner(is);        
//  s.findWithinHorizon(regexHTML1, 0);
//  MatchResult result = s.match();
//  for (int i = 0; i <= result.groupCount(); i++)
//      System.out.println("Found: " + result.group(i));
//  s.close();

Scanner s2 = new Scanner(is);
s2.findWithinHorizon(regexHTML2, 0);
MatchResult result2 = s2.match();
for (int i = 0; i <= result2.groupCount(); i++)
    System.out.println("Found: " + result2.group(i));
s2.close();

However, if I uncomment the commented code block that tests the first regex (regexHTML1), the output is:

Found: <meta id="HtmlHead1_desc" name="description" content="Televisores,TV 3D, TV, vídeo e MP3. Compre online Televisores,TV 3D na Fnac" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta http-equiv="PICS-Label" content="(PICS-1.1 &quot;http://www.rsac.org/ratingsv01.html&quot; l gen true comment &quot;RSACi North America Server&quot; by &quot;webmaster@fnac.com&quot; for &quot;http://www.fnac.com/&quot; on &quot;1997.06.30T14:21-0500&quot; r (n 0 s 0 v 0 l 0))" /><link rel="shortcut icon" href="/favicon.ico" /><link id="HtmlHead1_canonicalLink" rel="canonical" href="http://www.fnac.pt/imagem-e-som/TV-3D/Televisores/s21075" />
Found: UTF-8" /><meta http-equiv="PICS-Label" content="(PICS-1.1 &quot;http://www.rsac.org/ratingsv01.html&quot; l gen true comment &quot;RSACi North America Server&quot; by &quot;webmaster@fnac.com&quot; for &quot;http://www.fnac.com/&quot; on &quot;1997.06.30T14:21-0500&quot; r (n 0 s 0 v 0 l 0))" /><link rel="shortcut icon" href="/favicon.ico" /><link id="HtmlHead1_canonicalLink" rel="canonical" href="http://www.fnac.pt/imagem-e-som/TV-3D/Televisores/s21075

Since regexHTML1 is not appropriate. But when testing regexHTML2 (the correct one) it throws an exception:

java.lang.IllegalStateException: No match result available

How is this possible?
The regexHTML2 only works when I’m not testing regexHTML1…

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-07T13:17:49+00:00

Editorial Team

2026-06-07T13:17:49+00:00Added an answer on June 7, 2026 at 1:17 pm

Input streams are consumed as they are read (i.e., they know their current position). So because you are using the same stream, it gets consumed by the initial scanning operation and nothing is left for the second scanner.

Use two different streams, or download the entire stream into a String or something similar, and match against that.

0

Reply
Share
Share

- Report

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have the following code, which objective is to detect the encoding/charset of a

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply