I have the following code, whose objective is to detect the encoding/charset of a given webpage using regular expressions.
I need to test two regexes (regexHTML1 and regexHTML2). In this case, the correct regex is the second one, regexHTML2, which outputs:
Found: <meta id="HtmlHead1_desc" name="description" content="Televisores,TV 3D, TV, vídeo e MP3. Compre online Televisores,TV 3D na Fnac" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta
Found: UTF-8
With this code:
URL url = new URL("http://www.fnac.pt/imagem-e-som/TV-3D/Televisores/s21075?bl=HGAChead");
is = url.openStream();
String regexHTML1 = "<meta.*content=\\\".*;.*charset=(.*)\\\"\\s*/?>";
String regexHTML2 = "<meta.*content=\\\".*;.*charset=(.*)\\\"\\s*/?>\\s*<meta";
// Scanner s = new Scanner(is);
// s.findWithinHorizon(regexHTML1, 0);
// MatchResult result = s.match();
// for (int i = 0; i <= result.groupCount(); i++)
// System.out.println("Found: " + result.group(i));
// s.close();
Scanner s2 = new Scanner(is);
s2.findWithinHorizon(regexHTML2, 0);
MatchResult result2 = s2.match();
for (int i = 0; i <= result2.groupCount(); i++)
    System.out.println("Found: " + result2.group(i));
s2.close();
However, if I uncomment the commented code block that tests the first regex (regexHTML1), the output is:
Found: <meta id="HtmlHead1_desc" name="description" content="Televisores,TV 3D, TV, vídeo e MP3. Compre online Televisores,TV 3D na Fnac" /><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta http-equiv="PICS-Label" content="(PICS-1.1 "http://www.rsac.org/ratingsv01.html" l gen true comment "RSACi North America Server" by "webmaster@fnac.com" for "http://www.fnac.com/" on "1997.06.30T14:21-0500" r (n 0 s 0 v 0 l 0))" /><link rel="shortcut icon" href="/favicon.ico" /><link id="HtmlHead1_canonicalLink" rel="canonical" href="http://www.fnac.pt/imagem-e-som/TV-3D/Televisores/s21075" />
Found: UTF-8" /><meta http-equiv="PICS-Label" content="(PICS-1.1 "http://www.rsac.org/ratingsv01.html" l gen true comment "RSACi North America Server" by "webmaster@fnac.com" for "http://www.fnac.com/" on "1997.06.30T14:21-0500" r (n 0 s 0 v 0 l 0))" /><link rel="shortcut icon" href="/favicon.ico" /><link id="HtmlHead1_canonicalLink" rel="canonical" href="http://www.fnac.pt/imagem-e-som/TV-3D/Televisores/s21075
So regexHTML1 is clearly not appropriate. But when regexHTML2 (the correct one) is then tested in the same run, it throws an exception:
java.lang.IllegalStateException: No match result available
How is this possible?
The regexHTML2 only works when I’m not testing regexHTML1…
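Input streams are consumed as they are read (i.e., they keep track of their current position and cannot be rewound by default). Because both scanners wrap the same stream, the first scanning operation consumes it and nothing is left for the second scanner. In addition, s.close() closes the underlying stream, so s2 finds no input, findWithinHorizon finds nothing, and match() throws the IllegalStateException.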
Use two different streams, or download the entire stream into a String or something similar, and match against that.
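As a rough sketch of the second suggestion, the stream can be read once into a String and every regex run against that String. The class name CharsetSniffer, the helpers readAll and firstGroup, the sample HTML, and the two simplified patterns are all made up for illustration and are not the regexes from the question. Decoding as ISO-8859-1 is also an assumption, chosen only because it maps every byte to a character, which is enough to locate an ASCII charset declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {

    // Reads everything that is still left in the stream into a String.
    // ISO-8859-1 is an assumption (see the note above the code).
    public static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns capture group 1 of the first match, or null if nothing matches.
    public static String firstGroup(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Stand-in for url.openStream(), so the sketch runs without network access.
        String sample =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />";
        InputStream is =
            new ByteArrayInputStream(sample.getBytes(StandardCharsets.ISO_8859_1));

        String html = readAll(is);      // consumes the stream
        String leftover = readAll(is);  // a second read finds nothing

        // Hypothetical patterns; both are matched against the same String.
        String regexA = "charset=([\\w-]+)";
        String regexB = "<meta[^>]*charset=([^\"'>\\s]+)";

        System.out.println("Found: " + firstGroup(html, regexA));
        System.out.println("Found: " + firstGroup(html, regexB));
        System.out.println("Left in stream: " + leftover.length() + " chars");
    }
}
```

Because the String is not consumed by matching, any number of regexes can be tried against it in the same run. With a real page, the ByteArrayInputStream would be replaced by the stream from url.openStream(), read exactly once.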