I’ve this three text, and one regexp. (OK, it’s HTML, but …please, don’t focus

Asked: June 13, 2026 (2026-06-13T14:12:16+00:00)

I have these three pieces of text and one regexp. (OK, it’s HTML, but please don’t focus on that!)

<h3 class="pubAdTitleBlock "><a href="/it/pubblicazioni/libri/Che-speranza-cè-per-i-morti/1101987030/" title="Che speranza c’è per i morti?">Che speranza c’è per i morti? (volantino N. 16)</a></h3>

<h3 class="pubAdTitleBlock "><a href="/it/pubblicazioni/libri/cosa-insegna-la-bibbia/È-questo-che-Dio-voleva/" title="È questo che Dio voleva?">Cosa insegna realmente la Bibbia?</a></h3>

<h3 class="pubAdTitleBlock">Cantiamo a Geova</h3>

This is the regexp:

regexp = "<h3[^>]*>(<a[^>]*>)?([^<]+)(</a>)?</h3>";

It has three groups:

the opening <a> tag (optional)
the text (a book title, which is what the regexp is meant to extract)
the closing </a> tag (optional)

Problem: the second and third rows are matched, but the first is not. Why?

Matching code:

pattern = Pattern.compile(regexp);
matcher = pattern.matcher(fullString);
idx = 0;
while (matcher.find()) {
  ...
}

matcher.find() simply skips the first row. It is not the first row of the file (it is the 10th); it is just the first row of this example.

Could the literal parentheses be the problem? How can I fix the regexp?

EDIT: I’ve tried

String regexp = "<h3[^>]*>(.+)</h3>";

But this regexp also skips the first row… I really cannot understand why!

EDIT 2:

I have a doubt: could the accented character be the problem?

EDIT 3:

I’m trying to scrape data from this page: http://www.jw.org/it/pubblicazioni/libri/?contentLanguageFilter=it&sortBy=3

I have an input stream, which I convert to a single string using this code:

// copied from http://stackoverflow.com/questions/309424/read-convert-an-inputstream-to-a-string
public static String convertStreamToString(InputStream is) {
    try {
        // "\\A" matches the start of input, so next() returns the whole stream as one token
        return new java.util.Scanner(is, "UTF-8").useDelimiter("\\A").next();
    } catch (java.util.NoSuchElementException e) {
        // an empty stream yields no token at all
        return "";
    }
}

Then I apply the regexp…

1 Answer

Editorial Team · Answer 1 · 2026-06-13T14:12:18+00:00

I’m not sure, but maybe this is what you are looking for:

String data = "<h3 class=\"pubAdTitleBlock \"><a href=\"/it/pubblicazioni/libri/Che-speranza-cè-per-i-morti/1101987030/\" title=\"Che speranza c’è per i morti?\">Che speranza c’è per i morti? (volantino N. 16)</a></h3>"
        + "<h3 class=\"pubAdTitleBlock \"><a href=\"/it/pubblicazioni/libri/cosa-insegna-la-bibbia/È-questo-che-Dio-voleva/\" title=\"È questo che Dio voleva?\">Cosa insegna realmente la Bibbia?</a></h3>"
        + "<h3 class=\"pubAdTitleBlock\">Cantiamo a Geova</h3>";

Pattern pattern = Pattern
        .compile("<h3[^>]*>(?:<a[^>]*>)?([^<]+)(?:</a>)?</h3>");
Matcher matcher = pattern.matcher(data);
while (matcher.find()) 
    System.out.println(matcher.group(1));

Output:

Che speranza c’è per i morti? (volantino N. 16)
Cosa insegna realmente la Bibbia?
Cantiamo a Geova

A short explanation:

Groups written as (?:someregex) are non-capturing, so the regex engine does not number them. Thanks to that, in (?:a)(b)(?:c)(d) group (b) is indexed as 1 and (d) as 2.
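The numbering can be checked with a small sketch. The class and method names (GroupIndexDemo, capturedGroups) are made up for illustration; only the pattern (?:a)(b)(?:c)(d) comes from the explanation above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexDemo {
    // Returns the capturing groups of (?:a)(b)(?:c)(d) for the given input,
    // or null when the whole input does not match the pattern.
    static String[] capturedGroups(String input) {
        Matcher m = Pattern.compile("(?:a)(b)(?:c)(d)").matcher(input);
        if (!m.matches()) {
            return null;
        }
        // groupCount() counts capturing groups only; (?:...) groups are skipped.
        String[] groups = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            groups[i - 1] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = capturedGroups("abcd");
        System.out.println(groups.length); // 2
        System.out.println(groups[0]);     // b  (group 1)
        System.out.println(groups[1]);     // d  (group 2)
    }
}
```

Because only the title is captured, matcher.group(1) always refers to it, whether or not the <a> tag is present.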

Edit1

(I know it’s blasphemy to use regex to parse HTML, but since the OP wants it…)
You forgot to mention that the HTML being parsed contains whitespace such as tabs and newlines inside <h3>. Try it this way:

String data = convertStreamToString(new URL(
        "http://www.jw.org/it/pubblicazioni/libri/?contentLanguageFilter=it&sortBy=3")
        .openStream());

Pattern pattern = Pattern
        .compile("<h3[^>]*>\\s*(?:<a[^>]*>)?([^<]+)(?:</a>)?\\s*</h3>");
Matcher matcher = pattern.matcher(data);
int counter=0;
while (matcher.find())
    System.out.println(++counter +")"+matcher.group(1));

Output:

1)Accostiamoci a Geova
2)Accostiamoci a Geova — caratteri grandi
....
11)Cosa insegna realmente la Bibbia?
12)Cosa insegna realmente la Bibbia? — caratteri grandi
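
The effect of the whitespace can be reproduced without fetching the page. The following is a minimal sketch, assuming the real markup spreads a heading over several lines as described above; the sample string, the class name WhitespaceDemo and the helper titles are hypothetical. STRICT is the pattern with non-capturing groups and no whitespace allowance; TOLERANT allows \s* around the title and keeps </a> optional. The sample also suggests that the accented character is not the cause, since [^<]+ accepts it like any other character. Note too that, by default, . does not match line terminators, which would explain why <h3[^>]*>(.+)</h3> skipped the same row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceDemo {
    // No allowance for whitespace between the tags and the title.
    static final String STRICT = "<h3[^>]*>(?:<a[^>]*>)?([^<]+)(?:</a>)?</h3>";
    // Tolerates whitespace after <h3 ...> and before </h3>.
    static final String TOLERANT = "<h3[^>]*>\\s*(?:<a[^>]*>)?([^<]+)(?:</a>)?\\s*</h3>";

    // Collects group 1 (the title) of every match, trimmed of surrounding whitespace.
    static List<String> titles(String regex, String html) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the first heading is laid out over several lines.
        String html = "<h3 class=\"pubAdTitleBlock \">\n\t<a href=\"/x/\">Che speranza c’è per i morti?</a>\n</h3>"
                + "<h3 class=\"pubAdTitleBlock\">Cantiamo a Geova</h3>";
        System.out.println(titles(STRICT, html));   // [Cantiamo a Geova]
        System.out.println(titles(TOLERANT, html)); // [Che speranza c’è per i morti?, Cantiamo a Geova]
    }
}
```

With STRICT, the newline and tab after the opening <h3> are consumed by [^<]+, which then stops at the <a> tag where </h3> is expected, so the multi-line heading is skipped.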
