I searched for this specific question and could not find an existing one. I am writing a Java program that analyses the content of web pages, so I need a regular expression that weeds out all the links and tags (href, img, etc.), so that I can display only the plain text that is written and visible on the page. Thanks a lot.
Hi, I wanted to make this more specific:
URL wordURL = new URL("http://en.wikipedia.org/wiki/Bloom_filter");
URLConnection connection = wordURL.openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line;
String word = "bloom filter";
String regexp2 = word;
Pattern pattern2 = Pattern.compile(regexp2);
String HTML_REGEX = "(<.+?>)+"; // as per your answer (Martijn Courteaux)
while ((line = br.readLine()) != null)
{
    String content;
    if ((content = line.replaceAll(HTML_REGEX, "\n")) != null)
    {
        Matcher matcher2 = pattern2.matcher(line);
        if (matcher2.find())
        {
            System.out.println(line);
        }
    }
}
Unfortunately, it still prints the paragraph (<p>) tags and also <li> tags with some rubbish inside them. I would like to restrict the output to only those lines where “bloom filter” is present. Thanks again.
I know it isn’t a good idea to use a regex on HTML. But if you really want to, this might help:
prints:
As you can see, it will work, but it is definitely not what you want.
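A minimal sketch of this kind of replacement, assuming the pattern is a reluctant `<.+?>` and each tag is replaced with a newline. The class name, method name, and sample string are made up for illustration:

```java
import java.util.regex.Pattern;

class TagStripSketch {
    // Assumed pattern: the shortest possible run of characters between '<' and '>'.
    static final Pattern TAG = Pattern.compile("<.+?>");

    // Replaces every individual tag with a newline.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("\n");
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the real page.
        String html = "<p>A <b>Bloom filter</b> is a data structure.</p>";
        System.out.println(stripTags(html));
    }
}
```

Every tag becomes its own newline, so text with inline markup ends up scattered over several lines, which is probably why the result is not what you want.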
You can reduce the number of newlines by using this regex:
prints:
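The question quotes `(<.+?>)+` as the pattern taken from this answer, so a sketch of the grouped variant might look like the following. The class name, method name, and sample string are assumptions for illustration:

```java
import java.util.regex.Pattern;

class GroupedTagStripSketch {
    // One or more consecutive tags are matched as a single run.
    static final Pattern TAG_RUN = Pattern.compile("(<.+?>)+");

    // Replaces each run of adjacent tags with a single newline.
    static String collapseTags(String html) {
        return TAG_RUN.matcher(html).replaceAll("\n");
    }

    public static void main(String[] args) {
        // Hypothetical sample input with two adjacent tags on each side.
        String html = "<ul><li>Bloom filter</li></ul>";
        System.out.println(collapseTags(html));
    }
}
```

Here `<ul><li>` collapses into one newline instead of two, and likewise `</li></ul>`.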
I tried your code, and indeed it didn’t work. After some editing, this worked:
What you did wrong was:
1. You did not search and print content, but line, which of course contains the tags…
2. You searched for the word “bloom filter” using a regex, which is case sensitive. So, just lowercase the strings and use String.contains(CharSequence target), which tells you if the target string is a part of the whole string.
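The two fixes above could be sketched as follows. This is not the original corrected code, only an illustration under some assumptions: the class and method names are hypothetical, tags are replaced with a space rather than a newline so each hit stays on one line, and the input comes from a `StringReader` instead of a live `URLConnection` so the example runs offline.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class BloomFilterLineSearch {
    static final String HTML_REGEX = "(<.+?>)+";

    // Returns the tag-stripped lines that contain `word`, ignoring case.
    static List<String> matchingLines(BufferedReader reader, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        return reader.lines()
                // Fix 1: work on the stripped content, not on the raw line.
                .map(line -> line.replaceAll(HTML_REGEX, " ").trim())
                // Fix 2: lowercase both sides and use String.contains.
                .filter(content -> content.toLowerCase(Locale.ROOT).contains(needle))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical two-line input standing in for the downloaded page.
        String html = "<li>The Bloom filter page</li>\n<li>Unrelated item</li>";
        BufferedReader reader = new BufferedReader(new StringReader(html));
        for (String hit : matchingLines(reader, "bloom filter")) {
            System.out.println(hit); // prints: The Bloom filter page
        }
    }
}
```

To run it against the real page, pass a `BufferedReader` wrapped around `connection.getInputStream()` as in the question.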