I have a few regular expressions which are run against very long strings. However, the only part of the string which concerns the RE is near the beginning. Most of the REs are similar to:
\\s+?(\\w+?).*
The REs capture a few groups near the start and don’t care what the rest of the string is. For performance reasons, is there a way to keep the RE engine from scanning all the characters consumed by the trailing .* at the end of the pattern?
Note: The application with the REs is written using the java.util.regex classes.
Edit: For example I have the following RE:
.*?id="number"[^>]*?>([^<]+?).*
This RE is run against large HTML files that are stored as StringBuilders. The tag with id="number" is always near the start of the HTML file.
When using the java.util.regex classes, there are a number of ways to match against a given string.
Matcher.matches always matches against the whole input string. Matcher.find looks for something matching your regular expression somewhere within the input string. Finally, Matcher.lookingAt matches your regular expression against the beginning of your input string.
If you are using Matcher.matches, you may require the .* at the end to match the whole string. However, you might be better off using one of the other methods instead, which would allow you to leave off the .* altogether. It sounds like Matcher.lookingAt may be appropriate for your purposes.
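As a rough illustration of the difference, here is a minimal sketch using the two patterns from the question with the trailing .* removed. The class name, method names, and sample inputs are made up for the example. Note one deliberate change: the capture groups use greedy quantifiers (\\w+ and [^<]+) here, because once nothing follows the group a lazy quantifier would stop after a single character. The second pattern also drops the leading .*? and relies on Matcher.find, since the id="number" tag is near the start but not at the very beginning of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixMatchDemo {
    // Hypothetical variants of the question's patterns, without the trailing .*
    private static final Pattern WORD = Pattern.compile("\\s+?(\\w+)");
    private static final Pattern NUMBER =
            Pattern.compile("id=\"number\"[^>]*>([^<]+)");

    /** First word after leading whitespace, or null if the input does not start that way. */
    public static String leadingWord(CharSequence input) {
        Matcher m = WORD.matcher(input);
        // lookingAt() anchors at the start but does not require the whole input to match
        return m.lookingAt() ? m.group(1) : null;
    }

    /** Text content of the first tag with id="number", or null if there is none. */
    public static String numberText(CharSequence html) {
        Matcher m = NUMBER.matcher(html);
        // find() scans forward and stops at the first match
        return m.find() ? m.group(1) : null;
    }

    /** What Matcher.matches() reports for the same pattern without a trailing .* */
    public static boolean wholeInputMatches(CharSequence input) {
        return WORD.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Matcher accepts any CharSequence, so a StringBuilder can be passed directly
        StringBuilder html = new StringBuilder(
                "<html><body><span id=\"number\" class=\"x\">12345</span>");
        for (int i = 0; i < 100_000; i++) {
            html.append("<p>filler</p>");
        }

        System.out.println(numberText(html));                       // 12345
        System.out.println(leadingWord("   hello and much more"));  // hello
        System.out.println(wholeInputMatches("  hello world"));     // false
    }
}
```

With find or lookingAt the engine stops as soon as the pattern is satisfied, so the long tail of the input is never examined. With matches and no trailing .*, the same pattern simply fails on any input that continues past the captured group.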