The Archive Base


Editorial Team
Asked: May 25, 2026 (2026-05-25T10:14:36+00:00)


Here is my code for Regex matching which worked for a webpage:

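import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarness {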

    public static void main(String[] args) {

        File aFile = new File("/home/darshan/Desktop/test.txt");
        FileInputStream inFile = null;
        try {
            inFile = new FileInputStream(aFile);
        } catch (FileNotFoundException e) {
            e.printStackTrace(System.err);
            System.exit(1);
        }

        BufferedInputStream in = new BufferedInputStream(inFile);
        DataInputStream data = new DataInputStream(in);
        String string = new String();
        try {
            // Read the file line by line and concatenate the lines
            String line;
            while ((line = data.readLine()) != null) {
                string += line;
            }

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        Pattern pattern = Pattern
                .compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>");
        Matcher matcher = pattern.matcher(string);
        boolean found = false;
        while (matcher.find()) {
            System.out.println("Name: " + matcher.group(1) );
            found = true;
        }
        if(!found){
            System.out.println("Pattern Not found");
        }
    }
}

But the same regex doesn’t work in the crawler code I’m actually testing it for. My crawler code (I’m using WebSphinx) is:

// Our own Crawler class extends the WebSphinx Crawler
public class MyCrawler extends Crawler {

    MyCrawler() {
        super(); // Do what the parent crawler would do
    }

    // We could choose whether to visit a link based on certain circumstances
    // For now we never follow the links found on a page
    public boolean shouldVisit(Link l) {
        // String host = l.getHost();
        return false; // never follow a link
    }

    // What to do when we visit the page
    public void visit(Page page) {
        System.out.println("Visiting: " + page.getTitle());
        String content = page.getContent();

        System.out.println(content);

        Pattern pattern = Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>");
        Matcher matcher = pattern.matcher(content);
        boolean found = false;
        while (matcher.find()) {
            System.out.println("Name: " + matcher.group(1) );
            found = true;
        }
        if(!found){
            System.out.println("Pattern Not found");
        }
    }
}

This is my code for running the crawler:

public class WebSphinxTest {

    public static void main(String[] args) throws MalformedURLException, InterruptedException {

        System.out.println("Testing Websphinx. . .");

        // Make an instance of our own crawler
        Crawler crawler = new MyCrawler();
        // Create a "Link" object and set it as the crawler's root
        Link link = new Link("http://justeat.in/restaurant/spices/5633/indian-tandoor-chinese-and-seafood/sarjapur-road/bangalore");
        crawler.setRoot(link);

        // Start running the crawler!
        System.out.println("Starting crawler. . .");
        crawler.run(); // Blocking function, could implement a thread, etc.

    }

}

A little detail about the crawler code: shouldVisit(Link l) decides whether to visit a link or not, and visit(Page page) decides what to do once we get the page.

In the above example, test.txt and content contain the same string.

1 Answer

    Editorial Team
    Added an answer on May 25, 2026 at 10:14 am

    In your RegexTestHarness you read lines from a file and concatenate them without line breaks before you do your matching (readLine() returns the contents of the line without the line break!).

    The content passed to your MyCrawler class, on the other hand, probably still contains line break characters. Since the regex meta-character . does not match line break characters by default, the pattern fails in MyCrawler.
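The difference can be reproduced in isolation. The sketch below is only an illustration: the class name LineBreakDemo, its helper methods, and the sample markup are made up, and it uses BufferedReader.readLine() rather than the DataInputStream from the question (both return lines without their terminators). It runs the question's pattern, unchanged, against the raw text and against the same text after joining its lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

final class LineBreakDemo {

    // Same pattern as in the question, compiled without any flags.
    static final Pattern TITLE =
            Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>");

    // Joins the lines of the text the way the test harness does:
    // readLine() drops the line terminator, so the result has no line breaks.
    static String joinLines(String text) {
        StringBuilder joined = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                joined.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return joined.toString();
    }

    // True if the pattern is found anywhere in the input.
    static boolean matches(String input) {
        return TITLE.matcher(input).find();
    }

    public static void main(String[] args) {
        // Hypothetical markup with a line break between <div> and <h1>.
        String html = "<div class=\"rest_title\">\n  <h1>Spices</h1>\n</div>";
        System.out.println("raw content:  " + matches(html));            // false
        System.out.println("joined lines: " + matches(joinLines(html))); // true
    }
}
```

With the line break in place the first .*? cannot get from the div to the h1, so find() fails; once the lines are joined, the same pattern matches.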

    To fix this, put (?s) in front of every pattern that contains a . meta-character. So:

    Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>")
    

    would become:

    Pattern.compile("(?s)<div class=\"rest_title\">.*?<h1>(.*?)</h1>")
    

    The DOTALL flag, (?s), causes the . to match any character, including line break characters.
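As a small sketch of the same fix, the flag can also be passed to Pattern.compile as Pattern.DOTALL instead of being embedded in the pattern. The class name DotAllDemo, the names helper, and the sample markup below are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotAllDemo {

    // Inline flag, embedded at the start of the pattern.
    static final Pattern INLINE =
            Pattern.compile("(?s)<div class=\"rest_title\">.*?<h1>(.*?)</h1>");

    // Equivalent: the same flag passed as a compile option.
    static final Pattern FLAGGED =
            Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>", Pattern.DOTALL);

    // Collects group 1 of every match in the content.
    static List<String> names(Pattern pattern, String content) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical markup with a line break between <div> and <h1>.
        String content = "<div class=\"rest_title\">\n  <h1>Spices</h1>\n</div>";
        System.out.println(names(INLINE, content));  // [Spices]
        System.out.println(names(FLAGGED, content)); // [Spices]
    }
}
```

Both forms behave the same; the inline form is convenient when the pattern comes from a string you cannot attach flags to, while the compile flag keeps the pattern text itself unchanged.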
