I want to write a program, preferably a servlet, that searches for a particular keyword in a website. The website's URL will be passed as an argument from a URL field on a JSP (view) page, so my controller will connect to that URL and search its content.
Is it possible?
I am new to web crawling. Will web crawling work for this?
Please help me out.
Thanks,
@rs
Yes, it's possible, but a servlet is not what you need for this. You need something that fetches the HTML content from the desired URL, and then you write your own logic to parse the HTML text and extract what you want.
A basic client of this kind is Apache HTTP Client: http://hc.apache.org/httpclient-3.x/. However, it only fetches the HTML; it doesn't execute JavaScript or handle rich media content (such as Flash). This is very similar to how Google's web crawlers work.
A more advanced client is HtmlUnit: http://htmlunit.sourceforge.net/. It executes JavaScript as well.
Also, if you really want to see how Googlebot actually fetches pages, you can use this simulator from Google: http://www.google.com/support/webmasters/bin/answer.py?answer=158587 (you need to log in to Google Webmaster Tools with your Gmail account to use it).
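To make the fetch-then-search idea concrete, here is a minimal sketch. It is only an illustration under a few assumptions: it uses the JDK's built-in java.net.http.HttpClient (Java 11+) instead of Apache HTTP Client so that it needs no extra jars, it strips tags with crude regular expressions rather than a real HTML parser, and the class and method names (KeywordSearch, fetch, extractText, countOccurrences) are made up for this example. Like the basic client above, it does not execute JavaScript.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
class KeywordSearch {

    // Downloads the raw HTML of a page. No JavaScript is executed.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Crude text extraction: drops <script>/<style> blocks, then all remaining
    // tags, then collapses whitespace. A real HTML parser would be more robust.
    static String extractText(String html) {
        return html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                .replaceAll("(?s)<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Counts case-insensitive, non-overlapping occurrences of keyword in text.
    static int countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        int index = haystack.indexOf(needle);
        while (index >= 0) {
            count++;
            index = haystack.indexOf(needle, index + needle.length());
        }
        return count;
    }

    // Usage: java KeywordSearch.java <url> <keyword>
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: KeywordSearch <url> <keyword>");
            return;
        }
        String text = extractText(fetch(args[0]));
        int hits = countOccurrences(text, args[1]);
        System.out.println("Found \"" + args[1] + "\" " + hits + " time(s) at " + args[0]);
    }
}
```

In a web application, the servlet (controller) would read the URL and keyword from the JSP form parameters, call something like fetch, extractText and countOccurrences, and forward the result to the view. If you let users submit arbitrary URLs, validate them first so the server cannot be used to reach internal addresses.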