The Archive Base

Editorial Team
Asked: May 25, 2026 (2026-05-25T10:14:36+00:00)


Here is my code for Regex matching which worked for a webpage:

public class RegexTestHarness {

    public static void main(String[] args) {

        File aFile = new File("/home/darshan/Desktop/test.txt");
        FileInputStream inFile = null;
        try {
            inFile = new FileInputStream(aFile);
        } catch (FileNotFoundException e) {
            e.printStackTrace(System.err);
            System.exit(1);
        }

        BufferedInputStream in = new BufferedInputStream(inFile);
        DataInputStream data = new DataInputStream(in);
        String string = new String();
        try {
            while (data.read() != -1) {
                string += data.readLine();
            }

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        Pattern pattern = Pattern
                .compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>");
        Matcher matcher = pattern.matcher(string);
        boolean found = false;
        while (matcher.find()) {
            System.out.println("Name: " + matcher.group(1) );
            found = true;
        }
        if(!found){
            System.out.println("Pattern Not found");
        }
    }
}

But the same regex doesn’t work in the crawler code I’m testing it for. My crawler code is below (I’m using WebSphinx):

// Our own Crawler class extends the WebSphinx Crawler
public class MyCrawler extends Crawler {

    MyCrawler() {
        super(); // Do what the parent crawler would do
    }

    // We could choose not to visit a link based on certain circumstances
    // Note: this currently returns false for every link
    public boolean shouldVisit(Link l) {
        // String host = l.getHost();
        return false; // false means this link is not visited
    }

    // What to do when we visit the page
    public void visit(Page page) {
        System.out.println("Visiting: " + page.getTitle());
        String content = page.getContent();

        System.out.println(content);

        Pattern pattern = Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>");
        Matcher matcher = pattern.matcher(content);
        boolean found = false;
        while (matcher.find()) {
            System.out.println("Name: " + matcher.group(1) );
            found = true;
        }
        if(!found){
            System.out.println("Pattern Not found");
        }
    }
}

This is my code for running the crawler:

public class WebSphinxTest {

    public static void main(String[] args) throws MalformedURLException, InterruptedException {

        System.out.println("Testing Websphinx. . .");

        // Make an instance of our own crawler
        Crawler crawler = new MyCrawler();
        // Create a "Link" object and set it as the crawler's root
        Link link = new Link("http://justeat.in/restaurant/spices/5633/indian-tandoor-chinese-and-seafood/sarjapur-road/bangalore");
        crawler.setRoot(link);

        // Start running the crawler!
        System.out.println("Starting crawler. . .");
        crawler.run(); // Blocking function, could implement a thread, etc.

    }

}

A little detail about the crawler code: shouldVisit(Link l) decides whether a link is visited, and visit(Page page) decides what to do once we get the page.

In the above example, test.txt and content contain the same string.

1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 10:14 am (2026-05-25T10:14:36+00:00)

    In your RegexTestHarness you read the file line by line and concatenate the lines before matching. Since readLine() returns the contents of a line without its line break, the resulting string contains no line breaks at all.
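A minimal sketch of that effect, assuming a made-up three-line input. It uses BufferedReader.readLine() rather than the deprecated DataInputStream.readLine() from the question; both return each line without its terminator. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadLineDemo {

    // Concatenates all lines of the text the same way the test harness does:
    // each readLine() result is appended without any separator.
    static String joinLines(String text) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line); // the line terminator is already gone here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String joined = joinLines("a\nb\r\nc"); // hypothetical sample input
        System.out.println(joined);                 // prints "abc"
        System.out.println(joined.contains("\n"));  // prints "false"
    }
}
```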

    The content your MyCrawler class receives, on the other hand, probably does contain line break characters. Since the regex metacharacter . does not match line break characters by default, the pattern fails in MyCrawler.
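The difference can be reproduced without a crawler. The sketch below runs the pattern from the question against two made-up HTML snippets (assumptions, not the real page content): one on a single line, one with a line break between the div and the h1. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotLineBreakDemo {

    // The pattern from the question, unchanged.
    static final String REGEX = "<div class=\"rest_title\">.*?<h1>(.*?)</h1>";

    // Returns the first captured name, or null when the pattern does not match.
    static String firstName(String input) {
        Matcher matcher = Pattern.compile(REGEX).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs.
        String oneLine = "<div class=\"rest_title\"><h1>Spices</h1>";
        String multiLine = "<div class=\"rest_title\">\n<h1>Spices</h1>";

        System.out.println(firstName(oneLine));   // prints "Spices"
        System.out.println(firstName(multiLine)); // prints "null": . cannot cross the \n
    }
}
```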

    To fix this, put (?s) in front of every pattern that contains a . metacharacter. So:

    Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>")
    

    would become:

    Pattern.compile("(?s)<div class=\"rest_title\">.*?<h1>(.*?)</h1>")
    

    The DOT-ALL flag, (?s), will cause the . to match any character, including line break chars.
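As a rough check, the sketch below applies the fix to a made-up multi-line snippet (an assumption, not the real page). It also shows Pattern.DOTALL passed as a compile flag, which is the standard java.util.regex equivalent of the inline (?s). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotAllDemo {

    static final String REGEX = "<div class=\"rest_title\">.*?<h1>(.*?)</h1>";

    // Inline flag, as suggested in the answer.
    static final Pattern INLINE = Pattern.compile("(?s)" + REGEX);

    // Same behaviour expressed as a compile-time flag.
    static final Pattern FLAGGED = Pattern.compile(REGEX, Pattern.DOTALL);

    // Returns the first captured name, or null when the pattern does not match.
    static String firstName(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input with a line break between the div and the h1.
        String multiLine = "<div class=\"rest_title\">\n<h1>Spices</h1>";

        System.out.println(firstName(INLINE, multiLine));  // prints "Spices"
        System.out.println(firstName(FLAGGED, multiLine)); // prints "Spices"
    }
}
```

Because both quantifiers are reluctant (.*?), each match still stops at the first <h1> and the first </h1> it can reach, even though . can now cross line breaks.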
