I need to scrape some content from a HTTP response with Java. The required

Question

0

Asked: May 26, 20262026-05-26T21:14:20+00:00 2026-05-26T21:14:20+00:00

I need to scrape some content from a HTTP response with Java. The required

0

I need to scrape some content from a HTTP response with Java. The required fields in the response are: foo, bar and bla. My current pattern is very slow. Any ideas how to improve that?

Response:

...
<div class="ui-a">
<div class="ui-b">
    <p><strong>foo</strong></p>
    <p>bar</p>
</div>
<div class="ui-c">
    <p><strong>bla</strong></p>
    <p>...</p>
</div>
</div>

<div class="ui-a">
<div class="ui-b">
    <p><strong>foo1</strong></p>
    <p>bar1</p>
</div>
<div class="ui-c">
    <p><strong>bla1</strong></p>
    <p>...</p>
</div>

Pattern:

.*?<div class="ui-a">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>.*?</div>.*?<div class="ui-c">.*?<strong>(.*?)</strong>.*?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-26T21:14:21+00:00

Since you can’t make use of an HTML parser, try something like this:

import java.util.regex.*;

public class Main {
    public static void main (String[] args) {
        String html =
                "...\n" +
                "<div class=\"ui-a\">\n" +
                "<div class=\"ui-b\">\n" +
                "    <p><strong>foo</strong></p>\n" +
                "    <p>bar</p>\n" +
                "</div>\n" +
                "<div class=\"ui-c\">\n" +
                "    <p><strong>bla</strong></p>\n" +
                "    <p>...</p>\n" +
                "</div>\n" +
                "</div>\n" +
                "\n" +
                "<div class=\"ui-a\">\n" +
                "<div class=\"ui-b\">\n" +
                "    <p><strong>foo1</strong></p>\n" +
                "    <p>bar1</p>\n" +
                "</div>\n" +
                "<div class=\"ui-c\">\n" +
                "    <p><strong>bla1</strong></p>\n" +
                "    <p>...</p>\n" +
                "</div>";

        Pattern p = Pattern.compile(
                "(?sx)                               # enable DOT-ALL and COMMENTS     \n" +
                "<div\\s+class=\"ui-a\">             # match '<div...ui-a...>'         \n" +
                "(?:(?!<strong>).)*+                 # match everything up to <strong> \n" +
                "<strong>([^<>]++)</strong>          # match <strong>...</strong>      \n" +
                "(?:(?!<p>).)*+                      # match up to <p>                 \n" +
                "<p>([^<>]++)</p>                    # match <p>...</p>                \n" +
                "(?:(?!<div\\s+class=\"ui-c\">).)*+  # match up to '<div...ui-a...>'   \n" +
                "<div\\s+class=\"ui-c\">             # match '<div...ui-c...>'         \n" +
                "(?:(?!<strong>).)*+                 # match everything up to <strong> \n" +
                "<strong>([^<>]++)</strong>          # match <strong>...</strong>      \n"
        );

        Matcher m = p.matcher(html);

        while(m.find()) {
            System.out.println("---------------");
            for(int i = 1; i <= m.groupCount(); i++) {
                System.out.printf("group(%d) = %s\n", i, m.group(i));
            }
        }
    }
}

which will print the following to the console:

---------------
group(1) = foo
group(2) = bar
group(3) = bla
---------------
group(1) = foo1
group(2) = bar1
group(3) = bla1

Note my changes:

*+ and ++: http://www.regular-expressions.info/possessive.html
instead of .*?, I used (?:(?!...).)*+. The first, .*? will keep track of all possible matches it makes to be able to back-track at a later stage. The latter, (?:(?!...).)*+, will not keep track of these matches.

That should make it quicker (not sure by how much…).

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I need to scrape some content from a HTTP response with Java. The required

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply