I have the following code:
private String anchorRegex = "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>";
private Pattern anchorPattern = Pattern.compile(anchorRegex, Pattern.CASE_INSENSITIVE);
String content = getContentAsString();
Matcher matcher = anchorPattern.matcher(content);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
The call to getContentAsString() returns the HTML content from a web page. The problem I’m having is that the only thing that gets printed in my System.out is a space. Can anyone see what’s wrong with my regex?
Regex drives me crazy sometimes.
You need to delimit your capturing group from the .*? that follows it. There are probably double quotes around the href value, so use those to mark where the group starts and ends.
Your regex contains ([^\s]*?) immediately followed by .*?.
The ([^\s]*?) says to reluctantly match non-whitespace characters and save them in a group. But how far a reluctant *? extends depends on the next part of the pattern, which here is .*?, and that can match any character. So the group stops at the first possible chance, which is before it has consumed anything, and it is the .*? that matches the rest of the URL.
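
To make the difference concrete, here is a minimal sketch that runs the question's regex next to one possible quote-delimited variant. The class name HrefDemo, the QUOTED pattern, and the sample HTML are all made up for illustration, and the variant assumes every href value is wrapped in double quotes, so it will miss single-quoted or unquoted values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefDemo {
    // The regex from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile(
            "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical fix (assumes double-quoted href values): the group is
    // fenced in by literal quotes, so it must run up to the closing quote.
    static final Pattern QUOTED = Pattern.compile(
            "<\\s*a\\s+[^>]*?href\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Collects group 1 of every match of the given pattern.
    static List<String> hrefs(Pattern pattern, String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample input, standing in for getContentAsString().
        String html = "<a href=\"http://example.com/a\">A</a> "
                + "<A HREF = \"/b\" class=\"x\">B</a>";
        System.out.println(hrefs(ORIGINAL, html)); // [, ]  (two empty strings)
        System.out.println(hrefs(QUOTED, html));   // [http://example.com/a, /b]
    }
}
```

With the original pattern each match succeeds but group 1 is empty, because the reluctant group hands everything over to the following .*?. In the variant the group sits between two literal quotes, so it has nothing to give way to until the closing quote.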