I’m currently trying to filter a text file that contains words separated by a ‘-’. I want to count the words.
scanner.useDelimiter("[.,:;()?!\' \t\n\r]+");
The problem is simply this: words that contain a ‘-’ get split and counted as two words. So just escaping with \- isn’t the solution of choice.
How can I change the delimiter expression so that words like ‘foo-bar’ stay intact, but a ‘-’ on its own is filtered out and ignored?
Thanks 😉
OK, I’m guessing at your question here: you mean that you have a text file with some ‘real’ prose, i.e. sentences that actually make sense and are separated by punctuation and the like, right?
Example:
So, what you need as a delimiter is something that is either any amount of whitespace and/or punctuation (which you already have covered with the regex you showed), or a hyphen that is surrounded by at least one whitespace character on each side. The regex character for ‘or’ is ‘|’. Many regex implementations, Java’s included, also offer a shortcut for the whitespace character class (spaces, tabs, and newlines): ‘\s’.
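Put together, that could look roughly like the sketch below. It is only one way to write it: the class name `HyphenWordCount`, the method `countWords`, and the sample sentence are made up for illustration, and the punctuation set is simply the one from the question with ‘\s’ standing in for the explicit space, tab, and newline characters.

```java
import java.util.Scanner;

// Hypothetical helper class, not part of the original question or answer.
class HyphenWordCount {
    // Two alternatives joined by '|':
    //   \s+-\s+          a hyphen with at least one whitespace character on each side
    //   [.,:;()?!'\s]+   the punctuation set from the question, with \s for whitespace
    static final String DELIMITER = "\\s+-\\s+|[.,:;()?!'\\s]+";

    // Counts the tokens that remain after splitting on DELIMITER.
    static int countWords(String text) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                scanner.next();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Tokens: foo-bar, is, well, a, word
        System.out.println(countWords("foo-bar is - well - a word.")); // prints 5
    }
}
```

The order of the alternatives matters here. If the punctuation class came first, it would match only the space in front of a lone hyphen, and the ‘-’ would then come back as a token of its own. Note also that this sketch still treats a hyphen that directly follows punctuation (as in ‘foo, - bar’) as a word, so it may need tightening for messier input.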